How Knowledge Sources Work
Learn how your documents become searchable AI knowledge through embeddings and semantic search
When you upload a document to your Chipp app, something powerful happens behind the scenes. Your files get transformed into a format that AI can search and understand semantically—not just by keywords, but by meaning. This guide explains how it works in plain terms.
The Big Picture
Here's what happens when you add a knowledge source:
- You upload a file (PDF, document, spreadsheet, URL, etc.)
- We extract the text from your file
- We split it into chunks (smaller, manageable pieces)
- We create embeddings (numerical representations of meaning)
- We store everything in a searchable database
- When users ask questions, we find the most relevant chunks and give them to the AI
Let's break down each step.
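Before diving in, here's the whole flow as a toy program. Everything in it is illustrative: real embeddings are long lists of numbers produced by an ML model, while `embed()` below is just a word-set stand-in, and none of the function names reflect Chipp's actual internals.

```python
import re

def extract_text(raw):                        # Step 2: extraction (trivial here)
    return raw.strip()

def split_into_chunks(text):                  # Step 3: split at paragraph breaks
    return [p for p in text.split("\n\n") if p]

def embed(text):                              # Step 4: stand-in "embedding"
    return set(re.findall(r"[a-z]+", text.lower()))

def similarity(a, b):                         # overlap score between 0 and 1
    return len(a & b) / len(a | b)

store = []                                    # Step 5: (chunk, "vector") pairs
doc = "Refunds are issued within 30 days.\n\nShipping takes 5 business days."
for chunk in split_into_chunks(extract_text(doc)):
    store.append((chunk, embed(chunk)))

# Step 6: find the chunk most relevant to the user's question
question = embed("When are refunds issued?")
best = max(store, key=lambda item: similarity(question, item[1]))
print(best[0])  # → Refunds are issued within 30 days.
```

The real system swaps `embed()` for a neural embedding model and `store` for a vector database, but the shape of the pipeline is the same.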
Step 1: Text Extraction
Different file types need different handling. We support:
| File Type | What We Extract |
|---|---|
| PDFs | Text, tables, and content from complex layouts |
| Word Documents | Formatted text and structure |
| Excel/CSV | Tabular data converted to readable text |
| Web URLs | Page content, cleaned of navigation and ads |
| YouTube Videos | Transcripts and captions |
| Images | Text extracted using AI vision |
| Plain Text/Markdown | Direct content |
The goal is to get clean, readable text regardless of the original format.
Step 2: Chunking
AI models have limits on how much text they can process at once. Plus, when answering a question, you don't need an entire 50-page document—you need the relevant paragraph.
Chunking splits your text into smaller pieces (typically a page or less each). We try to split at natural boundaries—between paragraphs or sections—so each chunk contains a complete thought.
Example: A 20-page PDF might become 15-20 chunks, each containing related content that can stand on its own.
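One simple way to split at natural boundaries looks like this. It's a sketch of the general technique, not Chipp's actual algorithm, and the `max_chars` limit is an invented parameter:

```python
def chunk_text(text, max_chars=200):
    """Split at paragraph boundaries, packing paragraphs together
    until adding another would exceed max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)      # current chunk is full; start a new one
            current = para
        else:                           # room left; pack this paragraph in
            current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks

doc = ("Our refund policy lasts 30 days.\n\n"
       "Shipping is free over $50.\n\n"
       "Support is available 24/7 by chat.")
chunks = chunk_text(doc, max_chars=60)
for c in chunks:
    print("---\n" + c)
```

Because the splitter only breaks between paragraphs, each chunk stays a readable, self-contained unit rather than cutting off mid-sentence.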
Step 3: Creating Embeddings
This is where the magic happens.
What's an Embedding?
An embedding is a list of numbers that represents the meaning of text. Think of it as coordinates in a giant map of concepts.
Simple analogy: Imagine a map where:
- "Dog" and "puppy" are placed close together (similar meaning)
- "Dog" and "cat" are somewhat close (both pets)
- "Dog" and "refrigerator" are far apart (unrelated concepts)
Embeddings work the same way, but in thousands of dimensions instead of two. Text with similar meaning gets similar numbers.
Why This Matters
Traditional search looks for exact keyword matches. If your document says "automobile" but someone searches "car," traditional search might miss it.
Semantic search using embeddings understands that "automobile" and "car" mean the same thing. It finds relevant content based on meaning, not just matching words.
The Numbers
Each chunk of text becomes a vector of ~3,000 numbers. These numbers encode:
- What the text is about
- The context and tone
- Relationships to other concepts
You don't need to understand the math—just know that similar content produces similar numbers, which makes searching incredibly powerful.
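If you're curious what "similar numbers" means concretely, here's the standard comparison, cosine similarity, run on hand-made 3-dimensional vectors. The values are invented purely to illustrate the geometry; real embeddings have thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity: 1.0 means pointing the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "embeddings" matching the map analogy above
dog          = [0.90, 0.80, 0.10]
puppy        = [0.85, 0.90, 0.15]
refrigerator = [0.10, 0.05, 0.95]

print(round(cosine_similarity(dog, puppy), 2))         # close to 1.0
print(round(cosine_similarity(dog, refrigerator), 2))  # much lower
```

"Dog" and "puppy" point in nearly the same direction, so their score is near 1; "dog" and "refrigerator" point in different directions, so their score is low.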
Step 4: Storage
We store both:
- The original text chunks (so we can show them to the AI)
- The embeddings (so we can search by meaning)
This lives in a specialized vector database optimized for finding similar embeddings quickly.
Step 5: Retrieval (When Users Ask Questions)
Here's what happens when someone asks your app a question:
1. Query Embedding
We create an embedding for the user's question using the same process. "What's your refund policy?" becomes a vector of numbers.
2. Similarity Search
We compare the question's embedding to all your stored chunk embeddings. The database finds chunks whose numbers are closest to the question's numbers.
3. Relevance Scoring
Each chunk gets a similarity score (0 to 1). Higher scores mean more relevant content.
4. Context Assembly
We take the top-scoring chunks and give them to the AI as context: "Here's relevant information from the knowledge base. Use it to answer this question."
5. AI Response
The AI reads the retrieved chunks and generates an answer grounded in your actual content—not just its general training data.
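Steps 2 through 4 can be sketched as a rank-and-assemble routine. The chunk texts and similarity scores below are hard-coded stand-ins for real cosine similarities computed against the query embedding:

```python
# Each pair is (chunk text, similarity to the query embedding).
scored_chunks = [
    ("Refunds are processed within 5-7 business days.",     0.91),
    ("Contact support via the in-app chat widget.",         0.42),
    ("Items must be returned unused, in original packaging.", 0.77),
    ("Our office is closed on public holidays.",            0.18),
]

TOP_K = 2  # how many chunks to hand to the AI

# Steps 2-3: rank every stored chunk by its similarity score
best = sorted(scored_chunks, key=lambda c: c[1], reverse=True)[:TOP_K]

# Step 4: assemble the winners into context for the AI
context = "Here's relevant information from the knowledge base:\n"
context += "\n".join(f"- {text}" for text, score in best)
print(context)
```

The assembled `context` string is what gets prepended to the user's question in step 5, so the AI answers from your content rather than from memory alone.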
Why This Approach Works
Semantic Understanding
Users don't need to know the exact words in your documents. Ask about "pricing" and find content about "costs," "fees," or "rates."
Accuracy
By giving the AI specific, relevant content, answers are grounded in your actual information rather than generic responses.
Scalability
Whether you have 10 documents or 10,000, the search remains fast because embeddings can be compared efficiently.
Relevance
Instead of dumping entire documents into the AI (which would exceed limits and include irrelevant content), we surface just the pieces that matter.
Hybrid Search: Smarter Retrieval for Large Knowledge Bases
When you have dozens or hundreds of knowledge sources, finding the right information becomes more challenging. A simple chunk-by-chunk search might return text that sounds relevant but comes from the wrong document entirely.
This is where hybrid search comes in.
The Problem with Pure Chunk Search
Imagine you have 50 documents uploaded:
- Employee Handbook
- API Documentation
- Password Reset Guide
- System Administration Manual
- ...and 46 more
A user asks: "How do I reset my password?"
Pure chunk search might find:
- "Reset API tokens in the developer console" (API docs)
- "Reset your password in account settings" (Password Guide)
- "Reset deployment configuration" (Admin Manual)
The first result has high text similarity to "reset" but comes from the wrong context entirely. The user wanted account passwords, not API tokens.
How Hybrid Search Solves This
We use a two-tier approach:
Tier 1: Document-Level Understanding
When you upload a document, we also generate an AI summary of the entire document and create an embedding for it. This captures what the document is about at a high level.
For example:
- Employee Handbook → "Company policies covering PTO, benefits, conduct..."
- API Documentation → "Technical reference for developer integrations..."
- Password Reset Guide → "Instructions for resetting user account passwords..."
Tier 2: Combined Scoring
When a user asks a question, we:
- First find which documents are most relevant to the question
- Then search for chunks, but boost chunks from relevant documents
For "How do I reset my password?":
- Password Reset Guide gets a high document relevance score (0.92)
- API Documentation gets a low score (0.45)
A chunk from the Password Guide with slightly lower text similarity (0.78) now outranks a chunk from API docs with higher text similarity (0.85)—because it comes from the right context.
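As a rough sketch of how that combined scoring could work, here's a weighted blend of chunk similarity and document relevance. The 0.7/0.3 weights are invented for illustration, not Chipp's actual formula; the scores match the example above:

```python
# Document-level relevance scores from Tier 1
doc_scores = {"Password Reset Guide": 0.92, "API Documentation": 0.45}

# (chunk text, source document, raw chunk similarity)
chunks = [
    ("Reset your password in account settings.",  "Password Reset Guide", 0.78),
    ("Reset API tokens in the developer console.", "API Documentation",   0.85),
]

def combined(chunk_sim, doc_name, w_chunk=0.7, w_doc=0.3):
    """Blend chunk similarity with its document's relevance score."""
    return w_chunk * chunk_sim + w_doc * doc_scores[doc_name]

ranked = sorted(chunks, key=lambda c: combined(c[2], c[1]), reverse=True)
for text, doc, sim in ranked:
    print(f"{combined(sim, doc):.3f}  {doc}: {text}")
```

With these weights the Password Guide chunk scores 0.822 and the API chunk 0.730, so the right document wins even though its raw chunk similarity was lower.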
Why Document Summaries Matter
Document summaries serve as a "table of contents" for your entire knowledge base. They help the system understand:
- What each document is about
- Which documents are relevant to a query
- How to prioritize chunks from contextually appropriate sources
This is especially powerful when you have:
- Many similar documents (multiple policy manuals, product guides)
- Overlapping terminology (technical terms used in different contexts)
- Large knowledge bases (hundreds of documents with thousands of chunks)
The Result
Hybrid search delivers more accurate answers by understanding context at both the document and chunk level. Users get responses from the right source, not just text that happens to contain similar words.
Real-World Example
Scenario: You've uploaded your company's 50-page employee handbook.
User asks: "How many vacation days do I get?"
What happens:
- Question gets converted to an embedding
- System searches your handbook chunks
- Finds the chunk about PTO policy (high similarity score)
- Also finds related chunks about holiday schedule (moderate similarity)
- AI receives these chunks as context
- AI responds: "According to the employee handbook, full-time employees receive 15 vacation days per year, plus 10 company holidays..."
The AI didn't have to read all 50 pages—just the relevant sections.
Tips for Better Results
1. Quality Content
The AI can only find what's in your documents. Make sure your knowledge sources contain the information users will ask about.
2. Clear Writing
Well-organized, clearly written content produces better embeddings. If a human would struggle to understand your document, the AI will too.
3. Comprehensive Coverage
If users frequently ask questions you can't answer, consider adding more knowledge sources to cover those topics.
4. Keep Sources Updated
Outdated documents lead to outdated answers. Regularly refresh your knowledge sources with current information.
Supported File Types
| Type | Extensions | Best For |
|---|---|---|
| Documents | PDF, DOCX, DOC | Manuals, guides, reports |
| Spreadsheets | XLSX, CSV | Data tables, lists, structured info |
| Text | TXT, MD | Simple content, FAQs |
| Web | URLs | Online documentation, articles |
| Media | YouTube links | Video transcripts, tutorials |
| Images | PNG, JPG | Scanned documents, diagrams with text |
Continue Reading
Advanced RAG Settings
Fine-tune how your AI retrieves and uses knowledge from your documents
Understanding AI Evaluations
Learn how to test your AI chatbot before launching it to customers. Evals help you know exactly what your AI will say in any situation, so you can ship with confidence.
Understanding Tokens
Learn exactly what tokens are, how they're counted, and why they matter for AI pricing