How Knowledge Sources Work
Learn how your documents become searchable AI knowledge through embeddings and semantic search
When you upload a document to your Chipp app, something powerful happens behind the scenes. Your files get transformed into a format that AI can search and understand semantically—not just by keywords, but by meaning. This guide explains how it works in plain terms.
The Big Picture
Here's what happens when you add a knowledge source:
- You upload a file (PDF, document, spreadsheet, URL, etc.)
- We extract the text from your file
- We split it into chunks (smaller, manageable pieces)
- We create embeddings (numerical representations of meaning)
- We store everything in a searchable database
- When users ask questions, we find the most relevant chunks and give them to the AI
Let's break down each step.
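Before diving in, here's the whole flow as a toy program. Everything in it is illustrative: real embeddings are long lists of numbers produced by an ML model, while `embed()` below is just a word-set stand-in, and none of the function names reflect Chipp's actual internals.

```python
import re

def extract_text(raw):                        # Step 2: extraction (trivial here)
    return raw.strip()

def split_into_chunks(text):                  # Step 3: split at paragraph breaks
    return [p for p in text.split("\n\n") if p]

def embed(text):                              # Step 4: stand-in "embedding"
    return set(re.findall(r"[a-z]+", text.lower()))

def similarity(a, b):                         # overlap score between 0 and 1
    return len(a & b) / len(a | b)

store = []                                    # Step 5: (chunk, "vector") pairs
doc = "Refunds are issued within 30 days.\n\nShipping takes 5 business days."
for chunk in split_into_chunks(extract_text(doc)):
    store.append((chunk, embed(chunk)))

# Step 6: find the chunk most relevant to the user's question
question = embed("When are refunds issued?")
best = max(store, key=lambda item: similarity(question, item[1]))
print(best[0])  # → Refunds are issued within 30 days.
```

The real system swaps `embed()` for a neural embedding model and `store` for a vector database, but the shape of the pipeline is the same.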
Step 1: Text Extraction
Different file types need different handling. We support:
| File Type | What We Extract |
|---|---|
| PDFs | Text, tables, and content from complex layouts |
| Word Documents | Formatted text and structure |
| Excel/CSV | Tabular data converted to readable text |
| Web URLs | Page content, cleaned of navigation and ads |
| YouTube Videos | Transcripts and captions |
| Images | Text extracted using AI vision |
| Plain Text/Markdown | Direct content |
The goal is to get clean, readable text regardless of the original format.
Step 2: Chunking
AI models have limits on how much text they can process at once. Plus, when answering a question, you don't need an entire 50-page document—you need the relevant paragraph.
Chunking splits your text into smaller pieces (typically a page or less each). We try to split at natural boundaries—between paragraphs or sections—so each chunk contains a complete thought.
Example: A 20-page PDF might become 15-20 chunks, each containing related content that can stand on its own.
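One simple way to split at natural boundaries looks like this. It's a sketch of the general technique, not Chipp's actual algorithm, and the `max_chars` limit is an invented parameter:

```python
def chunk_text(text, max_chars=200):
    """Split at paragraph boundaries, packing paragraphs together
    until adding another would exceed max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)      # current chunk is full; start a new one
            current = para
        else:                           # room left; pack this paragraph in
            current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks

doc = ("Our refund policy lasts 30 days.\n\n"
       "Shipping is free over $50.\n\n"
       "Support is available 24/7 by chat.")
chunks = chunk_text(doc, max_chars=60)
for c in chunks:
    print("---\n" + c)
```

Because the splitter only breaks between paragraphs, each chunk stays a readable, self-contained unit rather than cutting off mid-sentence.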
Step 3: Creating Embeddings
This is where the magic happens.
What's an Embedding?
An embedding is a list of numbers that represents the meaning of text. Think of it as coordinates in a giant map of concepts.
Simple analogy: Imagine a map where:
- "Dog" and "puppy" are placed close together (similar meaning)
- "Dog" and "cat" are somewhat close (both pets)
- "Dog" and "refrigerator" are far apart (unrelated concepts)
Embeddings work the same way, but in thousands of dimensions instead of two. Text with similar meaning gets similar numbers.
Why This Matters
Traditional search looks for exact keyword matches. If your document says "automobile" but someone searches "car," traditional search might miss it.
Semantic search using embeddings understands that "automobile" and "car" mean the same thing. It finds relevant content based on meaning, not just matching words.
The Numbers
Each chunk of text becomes a vector of ~3,000 numbers. These numbers encode:
- What the text is about
- The context and tone
- Relationships to other concepts
You don't need to understand the math—just know that similar content produces similar numbers, which makes searching incredibly powerful.
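If you're curious what "similar numbers" means concretely, here's the standard comparison, cosine similarity, run on hand-made 3-dimensional vectors. The values are invented purely to illustrate the geometry; real embeddings have thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity: 1.0 means pointing the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "embeddings" matching the map analogy above
dog          = [0.90, 0.80, 0.10]
puppy        = [0.85, 0.90, 0.15]
refrigerator = [0.10, 0.05, 0.95]

print(round(cosine_similarity(dog, puppy), 2))         # close to 1.0
print(round(cosine_similarity(dog, refrigerator), 2))  # much lower
```

"Dog" and "puppy" point in nearly the same direction, so their score is near 1; "dog" and "refrigerator" point in different directions, so their score is low.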
Step 4: Storage
We store both:
- The original text chunks (so we can show them to the AI)
- The embeddings (so we can search by meaning)
This lives in a specialized vector database optimized for finding similar embeddings quickly.
Step 5: Retrieval (When Users Ask Questions)
Here's what happens when someone asks your app a question:
1. Query Embedding
We create an embedding for the user's question using the same process. "What's your refund policy?" becomes a vector of numbers.
2. Similarity Search
We compare the question's embedding to all your stored chunk embeddings. The database finds chunks whose numbers are closest to the question's numbers.
3. Relevance Scoring
Each chunk gets a similarity score (0 to 1). Higher scores mean more relevant content.
4. Context Assembly
We take the top-scoring chunks and give them to the AI as context: "Here's relevant information from the knowledge base. Use it to answer this question."
5. AI Response
The AI reads the retrieved chunks and generates an answer grounded in your actual content—not just its general training data.
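Steps 2 through 4 can be sketched as a rank-and-assemble routine. The chunk texts and similarity scores below are hard-coded stand-ins for real cosine similarities computed against the query embedding:

```python
# Each pair is (chunk text, similarity to the query embedding).
scored_chunks = [
    ("Refunds are processed within 5-7 business days.",     0.91),
    ("Contact support via the in-app chat widget.",         0.42),
    ("Items must be returned unused, in original packaging.", 0.77),
    ("Our office is closed on public holidays.",            0.18),
]

TOP_K = 2  # how many chunks to hand to the AI

# Steps 2-3: rank every stored chunk by its similarity score
best = sorted(scored_chunks, key=lambda c: c[1], reverse=True)[:TOP_K]

# Step 4: assemble the winners into context for the AI
context = "Here's relevant information from the knowledge base:\n"
context += "\n".join(f"- {text}" for text, score in best)
print(context)
```

The assembled `context` string is what gets prepended to the user's question in step 5, so the AI answers from your content rather than from memory alone.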
Why This Approach Works
Semantic Understanding
Users don't need to know the exact words in your documents. Ask about "pricing" and find content about "costs," "fees," or "rates."
Accuracy
By giving the AI specific, relevant content, answers are grounded in your actual information rather than generic responses.
Scalability
Whether you have 10 documents or 10,000, the search remains fast because embeddings can be compared efficiently.
Relevance
Instead of dumping entire documents into the AI (which would exceed limits and include irrelevant content), we surface just the pieces that matter.
Hybrid Search: Smarter Retrieval for Large Knowledge Bases
When you have dozens or hundreds of knowledge sources, finding the right information becomes more challenging. A simple chunk-by-chunk search might return text that sounds relevant but comes from the wrong document entirely.
This is where hybrid search comes in.
The Problem with Pure Chunk Search
Imagine you have 50 documents uploaded:
- Employee Handbook
- API Documentation
- Password Reset Guide
- System Administration Manual
- ...and 46 more
A user asks: "How do I reset my password?"
Pure chunk search might find:
- "Reset API tokens in the developer console" (API docs)
- "Reset your password in account settings" (Password Guide)
- "Reset deployment configuration" (Admin Manual)
The first result has high text similarity to "reset" but comes from the wrong context entirely. The user wanted account passwords, not API tokens.
How Hybrid Search Solves This
We use a two-tier approach:
Tier 1: Document-Level Understanding
When you upload a document, we also generate an AI summary of the entire document and create an embedding for it. This captures what the document is about at a high level.
For example:
- Employee Handbook → "Company policies covering PTO, benefits, conduct..."
- API Documentation → "Technical reference for developer integrations..."
- Password Reset Guide → "Instructions for resetting user account passwords..."
Tier 2: Combined Scoring
When a user asks a question, we:
- First find which documents are most relevant to the question
- Then search for chunks, but boost chunks from relevant documents
For "How do I reset my password?":
- Password Reset Guide gets a high document relevance score (0.92)
- API Documentation gets a low score (0.45)
A chunk from the Password Guide with slightly lower text similarity (0.78) now outranks a chunk from API docs with higher text similarity (0.85)—because it comes from the right context.
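As a rough sketch of how that combined scoring could work, here's a weighted blend of chunk similarity and document relevance. The 0.7/0.3 weights are invented for illustration, not Chipp's actual formula; the scores match the example above:

```python
# Document-level relevance scores from Tier 1
doc_scores = {"Password Reset Guide": 0.92, "API Documentation": 0.45}

# (chunk text, source document, raw chunk similarity)
chunks = [
    ("Reset your password in account settings.",  "Password Reset Guide", 0.78),
    ("Reset API tokens in the developer console.", "API Documentation",   0.85),
]

def combined(chunk_sim, doc_name, w_chunk=0.7, w_doc=0.3):
    """Blend chunk similarity with its document's relevance score."""
    return w_chunk * chunk_sim + w_doc * doc_scores[doc_name]

ranked = sorted(chunks, key=lambda c: combined(c[2], c[1]), reverse=True)
for text, doc, sim in ranked:
    print(f"{combined(sim, doc):.3f}  {doc}: {text}")
```

With these weights the Password Guide chunk scores 0.822 and the API chunk 0.730, so the right document wins even though its raw chunk similarity was lower.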
Why Document Summaries Matter
Document summaries serve as a "table of contents" for your entire knowledge base. They help the system understand:
- What each document is about
- Which documents are relevant to a query
- How to prioritize chunks from contextually appropriate sources
This is especially powerful when you have:
- Many similar documents (multiple policy manuals, product guides)
- Overlapping terminology (technical terms used in different contexts)
- Large knowledge bases (hundreds of documents with thousands of chunks)
The Result
Hybrid search delivers more accurate answers by understanding context at both the document and chunk level. Users get responses from the right source, not just text that happens to contain similar words.
Real-World Example
Scenario: You've uploaded your company's 50-page employee handbook.
User asks: "How many vacation days do I get?"
What happens:
- Question gets converted to an embedding
- System searches your handbook chunks
- Finds the chunk about PTO policy (high similarity score)
- Also finds related chunks about holiday schedule (moderate similarity)
- AI receives these chunks as context
- AI responds: "According to the employee handbook, full-time employees receive 15 vacation days per year, plus 10 company holidays..."
The AI didn't have to read all 50 pages—just the relevant sections.
Tips for Better Results
1. Quality Content
The AI can only find what's in your documents. Make sure your knowledge sources contain the information users will ask about.
2. Clear Writing
Well-organized, clearly written content produces better embeddings. If a human would struggle to understand your document, the AI will too.
3. Comprehensive Coverage
If users frequently ask questions you can't answer, consider adding more knowledge sources to cover those topics.
4. Keep Sources Updated
Outdated documents lead to outdated answers. Regularly refresh your knowledge sources with current information.
Supported File Types
| Type | Extensions | Best For |
|---|---|---|
| Documents | PDF, DOCX, DOC | Manuals, guides, reports |
| Spreadsheets | XLSX, CSV | Data tables, lists, structured info |
| Text | TXT, MD | Simple content, FAQs |
| Web | URLs | Online documentation, articles |
| Media | YouTube links | Video transcripts, tutorials |
| Images | PNG, JPG | Scanned documents, diagrams with text |
Continue Reading
Advanced RAG Settings
Fine-tune how your AI retrieves and uses knowledge from your documents
Understanding AI Evaluations
Learn how to test your AI chatbot before launching it to customers. Evals help you know exactly what your AI will say in any situation, so you can ship with confidence.
Understanding Tokens
Learn exactly what tokens are, how they're counted, and why they matter for AI pricing