Guides

How Knowledge Sources Work

Learn how your documents become searchable AI knowledge through embeddings and semantic search

Hunter Hodnett, CPTO at Chipp
8 min read

When you upload a document to your Chipp app, something powerful happens behind the scenes. Your files get transformed into a format that AI can search and understand semantically—not just by keywords, but by meaning. This guide explains how it works in plain terms.

The Big Picture

Here's what happens when you add a knowledge source:

  1. You upload a file (PDF, document, spreadsheet, URL, etc.)
  2. We extract the text from your file
  3. We split it into chunks (smaller, manageable pieces)
  4. We create embeddings (numerical representations of meaning)
  5. We store everything in a searchable database
  6. When users ask questions, we find the most relevant chunks and give them to the AI

Let's break down each step.

Step 1: Text Extraction

Different file types need different handling. We support:

| File Type | What We Extract |
| --- | --- |
| PDFs | Text, tables, and content from complex layouts |
| Word Documents | Formatted text and structure |
| Excel/CSV | Tabular data converted to readable text |
| Web URLs | Page content, cleaned of navigation and ads |
| YouTube Videos | Transcripts and captions |
| Images | Text extracted using AI vision |
| Plain Text/Markdown | Direct content |

The goal is to get clean, readable text regardless of the original format.
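In code, a text-extraction step like this is often just a dispatcher keyed on file type. Here's a minimal Python sketch; the handler functions and supported extensions are illustrative assumptions, not Chipp's actual implementation:

```python
import csv
from pathlib import Path

def extract_plain_text(path: Path) -> str:
    # Plain text and Markdown need no conversion at all.
    return path.read_text(encoding="utf-8")

def extract_csv(path: Path) -> str:
    # Convert tabular data into readable "header: value" lines.
    with path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return "\n".join(
        ", ".join(f"{k}: {v}" for k, v in row.items()) for row in rows
    )

# Map extensions to handlers; a real system adds PDF, DOCX, OCR, etc.
HANDLERS = {
    ".txt": extract_plain_text,
    ".md": extract_plain_text,
    ".csv": extract_csv,
}

def extract_text(path: Path) -> str:
    handler = HANDLERS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    return handler(path)
```

Whatever the input format, every handler's job is the same: return clean, readable text.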

Step 2: Chunking

AI models have limits on how much text they can process at once. Plus, when answering a question, you don't need an entire 50-page document—you need the relevant paragraph.

Chunking splits your text into smaller pieces (typically a few pages' worth each). We try to split at natural boundaries—between paragraphs or sections—so each chunk contains a complete thought.

Example: A 20-page PDF might become 15-20 chunks, each containing related content that can stand on its own.
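Paragraph-boundary chunking can be sketched as a greedy packer. This is a simplified illustration; the real chunk size and boundary rules are internal details:

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would overflow,
        # so we never split mid-paragraph.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Because splits only happen between paragraphs, each chunk stays a self-contained piece of text.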

Step 3: Creating Embeddings

This is where the magic happens.

What's an Embedding?

An embedding is a list of numbers that represents the meaning of text. Think of it as coordinates in a giant map of concepts.

Simple analogy: Imagine a map where:

  • "Dog" and "puppy" are placed close together (similar meaning)
  • "Dog" and "cat" are somewhat close (both pets)
  • "Dog" and "refrigerator" are far apart (unrelated concepts)

Embeddings work the same way, but in thousands of dimensions instead of two. Text with similar meaning gets similar numbers.

Why This Matters

Traditional search looks for exact keyword matches. If your document says "automobile" but someone searches "car," traditional search might miss it.

Semantic search using embeddings understands that "automobile" and "car" mean the same thing. It finds relevant content based on meaning, not just matching words.

The Numbers

Each chunk of text becomes a vector of ~3,000 numbers. These numbers encode:

  • What the text is about
  • The context and tone
  • Relationships to other concepts

You don't need to understand the math—just know that similar content produces similar numbers, which makes searching incredibly powerful.
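Under the hood, "similar numbers" is usually measured with cosine similarity. The three-dimensional vectors below are toy values invented to mirror the map analogy; real embeddings have thousands of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two vectors: 1.0 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" -- invented for illustration only.
dog = [0.9, 0.8, 0.1]
puppy = [0.85, 0.75, 0.15]
refrigerator = [0.1, 0.2, 0.9]

print(cosine_similarity(dog, puppy))         # close to 1.0
print(cosine_similarity(dog, refrigerator))  # much lower
```

The search step in the next section is essentially this comparison run against every stored chunk.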

Step 4: Storage

We store both:

  • The original text chunks (so we can show them to the AI)
  • The embeddings (so we can search by meaning)

This lives in a specialized vector database optimized for finding similar embeddings quickly.
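Conceptually, the store keeps each chunk's text and embedding side by side. Here's a minimal in-memory sketch; a real vector database adds approximate-nearest-neighbor indexing so lookups stay fast at scale:

```python
import math

class VectorStore:
    """Toy in-memory store holding (text, embedding) pairs."""

    def __init__(self):
        self._rows: list[tuple[str, list[float]]] = []

    def add(self, text: str, embedding: list[float]) -> None:
        self._rows.append((text, embedding))

    def nearest(self, query: list[float], k: int = 3) -> list[str]:
        # Rank stored chunks by cosine similarity to the query embedding.
        def sim(v: list[float]) -> float:
            dot = sum(a * b for a, b in zip(query, v))
            norms = (math.sqrt(sum(a * a for a in query))
                     * math.sqrt(sum(a * a for a in v)))
            return dot / norms
        ranked = sorted(self._rows, key=lambda row: sim(row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

Keeping the text alongside the embedding matters: the embedding finds the match, but the original text is what gets handed to the AI.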

Step 5: Retrieval (When Users Ask Questions)

Here's what happens when someone asks your app a question:

1. Query Embedding

We create an embedding for the user's question using the same process. "What's your refund policy?" becomes a vector of numbers.

2. Similarity Search

We compare the question's embedding to all your stored chunk embeddings. The database finds chunks whose numbers are closest to the question's numbers.

3. Relevance Scoring

Each chunk gets a similarity score (0 to 1). Higher scores mean more relevant content.

4. Context Assembly

We take the top-scoring chunks and give them to the AI as context: "Here's relevant information from the knowledge base. Use it to answer this question."

5. AI Response

The AI reads the retrieved chunks and generates an answer grounded in your actual content—not just its general training data.
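The scoring and context-assembly steps above can be sketched in a few lines. The `top_k` of 3, the 0.5 score threshold, and the prompt wording are illustrative assumptions:

```python
def assemble_context(scored_chunks: list[tuple[float, str]],
                     top_k: int = 3, min_score: float = 0.5) -> str:
    """Steps 3-4: keep the top-scoring chunks and wrap them as AI context.

    scored_chunks: (similarity score between 0 and 1, chunk text) pairs
    produced by the similarity search in step 2.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    kept = [text for score, text in ranked[:top_k] if score >= min_score]
    header = ("Here's relevant information from the knowledge base. "
              "Use it to answer this question.\n\n")
    return header + "\n\n---\n\n".join(kept)
```

The resulting string is what the AI actually reads, which is why its answer stays grounded in your content.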

Why This Approach Works

Semantic Understanding

Users don't need to know the exact words in your documents. Ask about "pricing" and find content about "costs," "fees," or "rates."

Accuracy

By giving the AI specific, relevant content, answers are grounded in your actual information rather than generic responses.

Scalability

Whether you have 10 documents or 10,000, the search remains fast because embeddings can be compared efficiently.

Relevance

Instead of dumping entire documents into the AI (which would exceed limits and include irrelevant content), we surface just the pieces that matter.

Hybrid Search: Smarter Retrieval for Large Knowledge Bases

When you have dozens or hundreds of knowledge sources, finding the right information becomes more challenging. A simple chunk-by-chunk search might return text that sounds relevant but comes from the wrong document entirely.

This is where hybrid search comes in.

Imagine you have 50 documents uploaded:

  • Employee Handbook
  • API Documentation
  • Password Reset Guide
  • System Administration Manual
  • ...and 46 more

A user asks: "How do I reset my password?"

Pure chunk search might find:

  1. "Reset API tokens in the developer console" (API docs)
  2. "Reset your password in account settings" (Password Guide)
  3. "Reset deployment configuration" (Admin Manual)

The first result has high text similarity to "reset" but comes from the wrong context entirely. The user wanted account passwords, not API tokens.

How Hybrid Search Solves This

We use a two-tier approach:

Tier 1: Document-Level Understanding

When you upload a document, we also generate an AI summary of the entire document and create an embedding for it. This captures what the document is about at a high level.

For example:

  • Employee Handbook → "Company policies covering PTO, benefits, conduct..."
  • API Documentation → "Technical reference for developer integrations..."
  • Password Reset Guide → "Instructions for resetting user account passwords..."

Tier 2: Combined Scoring

When a user asks a question, we:

  1. First find which documents are most relevant to the question
  2. Then search for chunks, but boost chunks from relevant documents

For "How do I reset my password?":

  • Password Reset Guide gets a high document relevance score (0.92)
  • API Documentation gets a low score (0.45)

A chunk from the Password Guide with slightly lower text similarity (0.78) now outranks a chunk from API docs with higher text similarity (0.85)—because it comes from the right context.
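The combined scoring above can be sketched as a weighted blend of the two signals. The even 0.5/0.5 weighting is an illustrative assumption, not the actual weights:

```python
def hybrid_score(chunk_similarity: float, document_relevance: float,
                 doc_weight: float = 0.5) -> float:
    """Blend chunk-level text similarity with document-level relevance."""
    return doc_weight * document_relevance + (1 - doc_weight) * chunk_similarity

# Scores from the password-reset example above:
password_chunk = hybrid_score(0.78, document_relevance=0.92)  # right document
api_chunk = hybrid_score(0.85, document_relevance=0.45)       # wrong document

print(password_chunk > api_chunk)  # True: the right context wins
```

With any reasonable document weight, the chunk from the contextually correct source outranks the superficially similar one.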

Why Document Summaries Matter

Document summaries serve as a "table of contents" for your entire knowledge base. They help the system understand:

  • What each document is about
  • Which documents are relevant to a query
  • How to prioritize chunks from contextually appropriate sources

This is especially powerful when you have:

  • Many similar documents (multiple policy manuals, product guides)
  • Overlapping terminology (technical terms used in different contexts)
  • Large knowledge bases (hundreds of documents with thousands of chunks)

The Result

Hybrid search delivers more accurate answers by understanding context at both the document and chunk level. Users get responses from the right source, not just text that happens to contain similar words.

Real-World Example

Scenario: You've uploaded your company's 50-page employee handbook.

User asks: "How many vacation days do I get?"

What happens:

  1. Question gets converted to an embedding
  2. System searches your handbook chunks
  3. Finds the chunk about PTO policy (high similarity score)
  4. Also finds related chunks about holiday schedule (moderate similarity)
  5. AI receives these chunks as context
  6. AI responds: "According to the employee handbook, full-time employees receive 15 vacation days per year, plus 10 company holidays..."

The AI didn't have to read all 50 pages—just the relevant sections.

Tips for Better Results

1. Quality Content

The AI can only find what's in your documents. Make sure your knowledge sources contain the information users will ask about.

2. Clear Writing

Well-organized, clearly written content produces better embeddings. If a human would struggle to understand your document, the AI will too.

3. Comprehensive Coverage

If users frequently ask questions you can't answer, consider adding more knowledge sources to cover those topics.

4. Keep Sources Updated

Outdated documents lead to outdated answers. Regularly refresh your knowledge sources with current information.

Supported File Types

| Type | Extensions | Best For |
| --- | --- | --- |
| Documents | PDF, DOCX, DOC | Manuals, guides, reports |
| Spreadsheets | XLSX, CSV | Data tables, lists, structured info |
| Text | TXT, MD | Simple content, FAQs |
| Web | URLs | Online documentation, articles |
| Media | YouTube links | Video transcripts, tutorials |
| Images | PNG, JPG | Scanned documents, diagrams with text |