Understanding AI Evaluations

Learn how to test your AI chatbot before launching it to customers. Evals let you verify how your AI answers the questions that matter, so you can ship with confidence.

Hunter Hodnett, CPTO at Chipp · 11 min read

You've built an AI chatbot. It looks great in demos. But how do you know it won't embarrass you in front of real customers? What if it hallucinates incorrect pricing? What if it gives wrong business hours? What if it goes completely off-topic?

This is where evaluations (evals) come in. Think of evals as automated QA testing for your AI. Before your chatbot goes live, you can verify it gives the right answers to important questions.

Why Evals Matter

Imagine you run a SaaS company. A customer asks your AI chatbot: "What's your pricing?"

Without evals, you might discover too late that your AI confidently stated "$99/month" when your actual pricing is "$29/month." That's not just embarrassing - it can damage trust and create legal issues.

With evals, you can:

  1. Define test cases - Questions your AI should handle correctly
  2. Specify expected answers - What the correct response should be
  3. Run automated checks - Before every deployment
  4. Catch problems early - Before customers ever see them
💡

Evals are like unit tests for traditional software, but for AI responses. They let you ship with confidence.
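To make "test case" concrete: at minimum, each one pairs a question with the answer you expect. Here is a minimal sketch of what that list might look like for the SaaS example above; the field names and format are illustrative, not a Chipp-specific schema.

```python
# Illustrative test cases for a SaaS support chatbot.
# The question/expected_answer pairing mirrors the eval workflow above;
# these field names are hypothetical, not a Chipp file format.
test_cases = [
    {
        "question": "What's your pricing?",
        "expected_answer": "Our plans start at $29/month.",
    },
    {
        "question": "What are your business hours?",
        "expected_answer": "We're open Monday to Friday, 9 AM to 5 PM.",
    },
    {
        "question": "How do I contact support?",
        "expected_answer": "Email support@example.com or use the in-app chat.",  # placeholder address
    },
]
```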

Prerequisites for Each Evaluation Type

Not all evaluation types work for all chatbots. Here's what each one requires:

  • Factuality: None - works with any chatbot
  • Faithfulness: Requires knowledge sources (documents, URLs)
  • Answer Relevancy: None - works with any chatbot
  • Context Relevancy: Requires knowledge sources (documents, URLs)
  • All: Requires knowledge sources for full metrics
⚠️

If your chatbot doesn't use a knowledge base, stick with Factuality and Answer Relevancy. The other metrics evaluate how well your AI uses retrieved documents; without documents to retrieve, those scores won't be meaningful.

The Four Types of Evaluations

Chipp uses an "LLM-as-Judge" approach, where a separate AI model evaluates your chatbot's responses. This is like having an expert reviewer automatically grade every answer.

There are four evaluation types, each measuring something different:

1. Factuality - "Is the answer correct?"

"Does the AI's response match the expected answer?"

Measures: Compares the AI's output against a known correct answer
Best for: Testing specific, objective facts like business hours, pricing, or policies
Pass example: "Our office opens at 9 AM" matches "We open at 9:00 AM"
Fail example: "We're open 24/7" does NOT match "We open at 9:00 AM"

Factuality is the most straightforward evaluation. You provide a question, the expected correct answer, and then check if your AI's response matches.

This doesn't require exact word-for-word matching - the system understands semantic equivalence. "We're open 9 AM to 5 PM" and "Our hours are nine in the morning until five in the evening" would both pass.

Use Factuality when:

  • You have specific, known-correct answers
  • Testing objective facts (pricing, hours, policies)
  • Verifying your AI learned from training correctly

Try Factuality Evaluation (interactive demo): see how factuality checking compares responses to expected answers. The demo runs real LLM-as-Judge evaluations using GPT-4o-mini; edit the fields to see how different responses affect the score.

💡

Tip for Expected Answers: Keep your expected answers focused on the key facts. The evaluation compares semantic content, not word count. If your AI gives a more comprehensive answer than expected, it may score lower because it includes "extra" information. Example:

Expected: "Our return policy is 30 days"
AI Response: "Our return policy is 30 days for all products. Items must be unused and in original packaging."
Score: ~60% (the AI is correct but added details not in the expected answer)

For best results, either make expected answers comprehensive, or use Faithfulness to test that responses are grounded in your knowledge base.
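If you're curious what an LLM-as-Judge factuality check looks like under the hood, here is a minimal sketch using the OpenAI Python SDK and GPT-4o-mini (the model the demo above mentions). The prompt wording, JSON fields, and 0-1 score scale are illustrative assumptions, not Chipp's actual implementation.

```python
import json
from openai import OpenAI  # pip install openai; assumes OPENAI_API_KEY is set

client = OpenAI()

def judge_factuality(question: str, expected: str, response: str) -> dict:
    """Ask a judge model whether `response` matches `expected` semantically."""
    prompt = (
        "You are grading a chatbot answer for factual agreement.\n"
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Chatbot answer: {response}\n\n"
        "Score from 0.0 to 1.0 how well the chatbot answer matches the expected "
        "answer in meaning (paraphrasing is fine; contradictions are not). "
        'Reply as JSON: {"score": <float>, "reasoning": "<one sentence>"}'
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)

# Semantically equivalent answers should score high even with different wording.
print(judge_factuality(
    "What time do you open?",
    "We open at 9:00 AM",
    "Our office opens at 9 AM",
))
```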

2. Faithfulness - "Is it making things up?"

"Is the AI only saying things from the provided context?"

Measures: Checks if the response is grounded in source documents
Best for: Preventing hallucinations when using knowledge bases or RAG
Pass example: Only mentions facts found in the knowledge base
Fail example: Makes up details not in the knowledge base (hallucination)

Faithfulness checks if your AI is "hallucinating" - making up information that isn't in its knowledge base. This is critical when your chatbot uses uploaded documents or a knowledge base.

For example, if your knowledge base says "Returns accepted within 30 days" but your AI responds "Returns accepted within 90 days with free shipping," that's a faithfulness failure. The AI added details that don't exist in the source material.

Use Faithfulness when:

  • Your chatbot uses a knowledge base or uploaded documents
  • You want to prevent hallucinations
  • Accuracy is critical (legal, medical, financial content)

Try Faithfulness Evaluation (interactive demo): see how faithfulness checking catches hallucinations. The demo runs real LLM-as-Judge evaluations using GPT-4o-mini.
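A faithfulness judge works from the source context rather than an expected answer: it checks whether each claim in the response is supported. Here is a rough sketch of that idea with the OpenAI Python SDK; the prompt wording and claim-fraction score are illustrative assumptions, not Chipp's implementation.

```python
import json
from openai import OpenAI

client = OpenAI()

def judge_faithfulness(context: str, response: str) -> dict:
    """Score what fraction of the response's claims are supported by `context`."""
    prompt = (
        "Below is source material and a chatbot answer.\n"
        f"Source material:\n{context}\n\n"
        f"Chatbot answer:\n{response}\n\n"
        "List each factual claim in the answer and whether the source material "
        "supports it. Then give a score equal to the fraction of supported claims. "
        'Reply as JSON: {"claims": [{"claim": "...", "supported": true}], "score": <float>}'
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)

# The "90 days with free shipping" claims below are not in the context,
# so they should drag the score down.
print(judge_faithfulness(
    "Returns accepted within 30 days.",
    "Returns accepted within 90 days with free shipping.",
))
```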

3. Answer Relevancy - "Did it answer the question?"

"Did the AI actually answer what was asked?"

Measures: Checks if the response addresses the user's question
Best for: Ensuring the AI stays on topic and doesn't go off on tangents
Pass example: User asks about pricing, AI explains pricing options
Fail example: User asks about pricing, AI talks about company history

Answer Relevancy checks if the AI actually answered what was asked. Sometimes an AI will give a perfectly accurate response that has nothing to do with the question.

"How do I reset my password?" shouldn't be answered with "Our company was founded in 2015 with a mission to revolutionize customer service..."

Use Answer Relevancy when:

  • Users complain the AI doesn't answer their questions
  • You want responses to stay focused
  • Your AI tends to be verbose or go off-topic

Try Answer Relevancy Evaluation (interactive demo): see how relevancy checking ensures responses stay on topic. The demo runs real LLM-as-Judge evaluations using GPT-4o-mini.
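One common way eval frameworks approximate answer relevancy is to ask the judge what question the response is really answering, then compare that implied question to the one the user actually asked using embeddings. The sketch below shows that approach with the OpenAI Python SDK; it's an illustration of the idea, and Chipp's judge may grade relevance directly instead.

```python
import json
import numpy as np
from openai import OpenAI

client = OpenAI()

def judge_answer_relevancy(question: str, response: str) -> float:
    """Estimate relevancy by inferring what question the response answers,
    then comparing it to the real question via embedding similarity."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"What single question is this text answering?\n\n{response}\n\n"
                       'Reply as JSON: {"implied_question": "..."}',
        }],
        response_format={"type": "json_object"},
    )
    implied = json.loads(completion.choices[0].message.content)["implied_question"]

    emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=[question, implied],
    )
    a, b = (np.array(e.embedding) for e in emb.data)
    # Cosine similarity: close to 1.0 when the answer addresses the question asked.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# An answer about company history should score low against a password question.
print(judge_answer_relevancy(
    "How do I reset my password?",
    "Our company was founded in 2015 with a mission to revolutionize customer service.",
))
```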

4. Context Relevancy - "Did we retrieve the right information?"

"Was the right information retrieved from the knowledge base?"

Measures: Checks if the retrieved context is relevant to the question
Best for: Debugging RAG systems - ensuring the right documents are found
Pass example: Question about refunds retrieves the refund policy document
Fail example: Question about refunds retrieves the shipping policy document

Context Relevancy is specifically for RAG (Retrieval-Augmented Generation) systems - chatbots that search through documents to find relevant information before answering.

This eval checks if the search step worked correctly. Even if your AI gives a perfect response, if it retrieved the wrong documents, that's a problem waiting to happen.

Use Context Relevancy when:

  • Debugging why your RAG chatbot gives wrong answers
  • Your knowledge base is large and complex
  • You want to optimize document retrieval

Try Context Relevancy Evaluation (interactive demo): see how context relevancy checks document retrieval quality. The demo runs real LLM-as-Judge evaluations using GPT-4o-mini.

⚠️

Important Distinction: Context Relevancy evaluates your document retrieval, not your AI's answer. A low score means your RAG system is pulling irrelevant documents, even if the AI gives a good answer despite having poor context. If you see low Context Relevancy but good Answer Relevancy, focus on improving your knowledge base organization or embedding quality rather than your AI's instructions.
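Conceptually, a context relevancy check looks at the retrieval step chunk by chunk and asks how much of what was fetched actually relates to the question. Here is a minimal sketch of that idea, assuming the OpenAI Python SDK; the prompt and the fraction-based score are illustrative, not Chipp's retrieval or scoring code.

```python
import json
from openai import OpenAI

client = OpenAI()

def judge_context_relevancy(question: str, retrieved_chunks: list[str]) -> float:
    """Score the fraction of retrieved chunks that are relevant to the question."""
    relevant = 0
    for chunk in retrieved_chunks:
        completion = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Question: {question}\n\nRetrieved text:\n{chunk}\n\n"
                           "Is this text relevant to answering the question? "
                           'Reply as JSON: {"relevant": true or false}',
            }],
            response_format={"type": "json_object"},
        )
        if json.loads(completion.choices[0].message.content)["relevant"]:
            relevant += 1
    return relevant / len(retrieved_chunks) if retrieved_chunks else 0.0

# A refund question that retrieved one refund chunk and one shipping chunk
# should land around 0.5.
print(judge_context_relevancy(
    "What is your refund policy?",
    ["Refunds are issued within 30 days of purchase.",
     "Standard shipping takes 3-5 business days."],
))
```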

Choosing the Right Evaluation

Which Evaluation Should You Use?

Do you have specific expected answers for each question?

  • Yes: Use Factuality - it will compare responses against your expected answers
  • No: Continue to the next question...

Does your AI chatbot use a knowledge base or uploaded documents?

  • Yes: Use Faithfulness to catch hallucinations, or Context Relevancy to verify retrieval
  • No: Use Answer Relevancy to check responses stay on topic

Not sure? Start with Factuality. It's the most straightforward evaluation - you provide example questions and the correct answers, and the system checks if your AI gives equivalent responses.

Running All Evaluations Together

For the most thorough testing, you can select All as your evaluation method. This runs all four evaluations on each test case and produces scores for each metric.

Complete Evaluation Suite

See all four evaluation metrics run on a single response

Question: What is your return policy for electronics?

Knowledge Base Context: Electronics Return Policy: Items can be returned within 15 days with original packaging. Opened items may be subject to a 15% restocking fee. Defective items can be exchanged within 90 days.

Expected Answer: Electronics can be returned within 15 days if in original packaging. Opened items have a 15% restocking fee. Defective products can be exchanged within 90 days.

AI Response: You can return electronics within 15 days as long as they're in the original packaging. If you've opened the item, there's a 15% restocking fee. For defective items, we offer exchanges up to 90 days after purchase.

The demo scores this response on all four metrics: Factuality, Faithfulness, Answer Relevancy, and Context Relevancy.

When to use "All":

  • Initial chatbot testing before launch
  • After major knowledge base updates
  • Debugging complex answer quality issues

Understanding combined results:

When running "All", a test case passes only if ALL metrics meet the 70% threshold. This is strict but ensures comprehensive quality. If you need more flexibility, run individual evaluation types and set your own quality bar for each.

What a failure on each metric means:

  • Factuality fails: the answer doesn't match the expected content
  • Faithfulness fails: the answer contains hallucinated information
  • Answer Relevancy fails: the answer doesn't address the question
  • Context Relevancy fails: RAG retrieved irrelevant documents
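In code terms, the "All" pass rule is just an AND across the four metric scores. A tiny sketch using the default 70% threshold mentioned above (the dictionary shape is illustrative):

```python
def test_case_passes(scores: dict[str, float], threshold: float = 0.70) -> bool:
    """A test case passes an "All" run only if every metric clears the threshold."""
    return all(score >= threshold for score in scores.values())

# Example: one weak metric fails the whole test case under the default threshold.
print(test_case_passes({
    "factuality": 0.92,
    "faithfulness": 0.88,
    "answer_relevancy": 0.95,
    "context_relevancy": 0.35,   # retrieval struggled, so the case fails
}))  # -> False
```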
💡

Pro Tip: When using "All", don't be surprised if Context Relevancy scores lower than other metrics. Retrieval is the hardest part of RAG systems. If your other metrics pass but Context Relevancy fails, your chatbot is likely working well despite imperfect search results.

Why Factuality is the Best Starting Point

If you're new to evals and not sure where to start, choose Factuality. Here's why:

  1. It's intuitive - You just provide questions and correct answers
  2. No setup required - Works with any chatbot, with or without a knowledge base
  3. Clear pass/fail - Easy to understand if something's wrong
  4. Most common issues - Catches the problems that matter most to customers

The other evaluation types (Faithfulness, Answer Relevancy, Context Relevancy) are more specialized. They're powerful for specific use cases, but Factuality covers the most ground for most chatbots.

✅

Quick Start: Create 10-20 test cases covering your most important questions. Run Factuality evaluation. Fix any failures. Repeat before every major change to your chatbot.

Creating Effective Test Cases

A good evaluation suite includes:

1. Happy Path Cases

Questions your AI should answer perfectly. These are the bread-and-butter interactions:

  • "What are your business hours?"
  • "How much does it cost?"
  • "How do I contact support?"

2. Edge Cases

Tricky questions that might trip up your AI:

  • Ambiguous questions
  • Questions about things not in your knowledge base
  • Multi-part questions

3. Red Lines

Things your AI should never say:

  • Competitor comparisons you don't want made
  • Promises you can't keep
  • Sensitive topics you want avoided

The Eval Workflow

Here's how to integrate evals into your chatbot development:

  1. Create test cases - Start with your most important customer questions
  2. Write expected answers - What should the correct response be?
  3. Run evaluation - See how your current chatbot performs
  4. Fix failures - Update your chatbot's instructions or knowledge base
  5. Iterate - Add more test cases as you discover new scenarios

Understanding Scores

Each evaluation produces a score from 0% to 100%. The default passing threshold is 70%.

Score Ranges

  • 70%+ (Pass) - The response meets quality standards
  • 40-69% (Warning) - The response has issues that should be reviewed
  • Below 40% (Fail) - The response has significant problems

You can adjust thresholds based on your needs. A medical chatbot might require 90%+, while a casual FAQ bot might be fine with 60%.
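If you track scores yourself, the banding is simple to reproduce. A minimal sketch with the thresholds as parameters, so a medical bot can demand more than a casual FAQ bot:

```python
def classify_score(score: float, pass_at: float = 0.70, warn_at: float = 0.40) -> str:
    """Map a 0.0-1.0 eval score onto the Pass / Warning / Fail bands described above."""
    if score >= pass_at:
        return "Pass"
    if score >= warn_at:
        return "Warning"
    return "Fail"

print(classify_score(0.85))                 # Pass under the default 70% threshold
print(classify_score(0.85, pass_at=0.90))   # Warning under a stricter medical-grade bar
print(classify_score(0.30))                 # Fail
```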

What Different Scores Mean by Metric

  • Factuality: a high score (70%+) means the answer matches the expected content; a low score means it's missing key facts or contains different information
  • Faithfulness: high means the answer is grounded in the knowledge base; low means it contains hallucinated or invented information
  • Answer Relevancy: high means the answer directly addresses the question; low means it's off-topic or tangential
  • Context Relevancy: high means RAG retrieved relevant documents; low means it pulled irrelevant documents
💡

Expect Variation: In practice, Answer Relevancy often scores highest (your AI probably answers what's asked), while Context Relevancy scores lowest (retrieval is hard!). Don't panic if your Context Relevancy is around 20-40%; focus on it only if your chatbot is giving wrong answers despite having correct information in its knowledge base.

What Happens Behind the Scenes

When you run an evaluation, here's what happens:

  1. Your chatbot receives the test question
  2. It generates a response (just like it would for a real user)
  3. A separate "judge" AI evaluates the response
  4. The judge scores the response against the relevant criteria
  5. You see the results and can act on failures

This "LLM-as-Judge" approach is more sophisticated than simple string matching. It understands semantic meaning, handles paraphrasing, and provides reasoning for its scores.

Next Steps

Ready to start evaluating your chatbot?

  1. Navigate to your chatbot in the Chipp dashboard
  2. Go to the Evals tab to create your first evaluation suite
  3. Start with Factuality and your 10 most important questions
  4. Run the evaluation and review results
  5. Iterate based on what you learn

Remember: Evals aren't a one-time thing. As your chatbot evolves, your test suite should grow with it. The goal is to catch problems before your customers do.

💡

Pro Tip: Schedule evals to run automatically whenever you update your chatbot's knowledge base or instructions. This catches regressions before they reach production.

Using with MCP

You can create and run evaluations programmatically using the Chipp MCP Server, enabling automated testing workflows.

Available Tools

  • list_evals - List all evaluations for an app
  • create_eval - Create an evaluation from a conversation or manual test cases
  • run_eval - Execute an evaluation and get results
  • get_eval_result - Get detailed results from a run
  • update_eval - Modify test cases
  • delete_eval - Remove an evaluation
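If you'd rather drive these tools from your own scripts than through an AI assistant, any MCP client can call them. The sketch below uses the official mcp Python SDK; the server URL and the argument names (appId, evalId) are placeholders to show the shape of a call, so check the MCP Setup Guide for the real connection details.

```python
# Illustrative only: calling Chipp eval tools through the `mcp` Python SDK
# (pip install mcp). The URL and argument names below are placeholders/assumptions;
# see the MCP Setup Guide for the actual connection and authentication details.
import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

CHIPP_MCP_URL = "https://example.com/chipp-mcp"  # placeholder, not the real endpoint

async def main() -> None:
    async with streamablehttp_client(CHIPP_MCP_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            evals = await session.call_tool("list_evals", {"appId": "123"})
            print(evals)
            result = await session.call_tool("run_eval", {"evalId": "quality-check"})
            print(result)

asyncio.run(main())
```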

Example Prompts

"Create an evaluation for my FAQ bot from a recent good conversation"
"Run the 'Quality Check' evaluation and show results"
"List all evaluations for app 123"
"Show details of the last evaluation run"

Workflow: Automated Quality Testing

You: Create an evaluation from my best-rated conversation yesterday

AI: [Calls search_conversations to find it]
[Calls create_eval with chatSessionId]
Created evaluation "FAQ Quality Test" with 8 test cases.

You: Run it

AI: [Calls run_eval with format="terminal"]

┌──────────────────────────────────────────────────────┐
│               FAQ Quality Test Results               │
├──────────────────────────────────────────────────────┤
│ Overall: PASSED (7/8 tests)                          │
│ Pass Rate: ████████░░ 87.5%                          │
│                                                      │
│ ✓ Test 1: Password reset     [PASS] 92%              │
│ ✓ Test 2: Business hours     [PASS] 88%              │
│ ✗ Test 7: Complex billing    [FAIL] 45%              │
└──────────────────────────────────────────────────────┘
💡

MCP server access requires a Pro plan or higher. See the MCP Setup Guide to get started.