# Understanding AI Evaluations

Learn how to test your AI chatbot before launching it to customers. Evals help you verify how your AI answers the questions that matter most, so you can ship with confidence.

You've built an AI chatbot. It looks great in demos. But how do you know it won't embarrass you in front of real customers? What if it hallucinates incorrect pricing? What if it gives wrong business hours? What if it goes completely off-topic?

This is where **evaluations** (evals) come in. Think of evals as automated QA testing for your AI. Before your chatbot goes live, you can verify it gives the right answers to important questions.

## Why Evals Matter

Imagine you run a SaaS company. A customer asks your AI chatbot: "What's your pricing?"

Without evals, you might discover too late that your AI confidently stated "\$99/month" when your actual pricing is "\$29/month." That's not just embarrassing - it can damage trust and create legal issues.

With evals, you can:

1. **Define test cases** - Questions your AI should handle correctly
2. **Specify expected answers** - What the correct response should be
3. **Run automated checks** - Before every deployment
4. **Catch problems early** - Before customers ever see them

Evals are like unit tests for traditional software, but for AI responses. They let you ship with confidence.

## The Four Types of Evaluations

Chipp uses an "LLM-as-Judge" approach, where a separate AI model evaluates your chatbot's responses. This is like having an expert reviewer automatically grade every answer. There are four evaluation types, each measuring something different:

### 1. Factuality - "Is the answer correct?"

Factuality is the most straightforward evaluation. You provide a question and the expected correct answer, then check whether your AI's response matches. This doesn't require exact word-for-word matching - the system understands semantic equivalence. "We're open 9 AM to 5 PM" and "Our hours are nine in the morning until five in the evening" would both pass.

**Use Factuality when:**

- You have specific, known-correct answers
- Testing objective facts (pricing, hours, policies)
- Verifying your AI learned from training correctly
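To make the pattern concrete, here's a minimal sketch of a Factuality check in Python. The `ask_chatbot` and `complete` callables are placeholders for the bot under test and a generic chat-completion call - this illustrates the general LLM-as-judge idea, not Chipp's implementation; in Chipp you only supply the questions and expected answers.

```python
# Minimal sketch of a Factuality check using an LLM-as-judge.
# `ask_chatbot` and `complete` are placeholder callables (your bot under test
# and any chat-completion client); they are not part of Chipp's API.

JUDGE_PROMPT = """You are grading a chatbot answer for factual accuracy.

Question: {question}
Expected answer: {expected}
Actual answer: {actual}

Exact wording does not matter; semantic equivalence does.
Reply with a score from 0 to 100 and one sentence of reasoning."""

test_cases = [
    {"question": "What's your pricing?", "expected": "$29/month"},
    {"question": "What are your business hours?", "expected": "We're open 9 AM to 5 PM"},
]

def run_factuality(ask_chatbot, complete):
    """Ask the bot each question, then have a judge model grade the answer."""
    results = []
    for case in test_cases:
        actual = ask_chatbot(case["question"])
        verdict = complete(JUDGE_PROMPT.format(actual=actual, **case))
        results.append({**case, "actual": actual, "verdict": verdict})
    return results
```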
### 2. Faithfulness - "Is it making things up?"

Faithfulness checks if your AI is "hallucinating" - making up information that isn't in its knowledge base. This is critical when your chatbot uses uploaded documents or a knowledge base.

For example, if your knowledge base says "Returns accepted within 30 days" but your AI responds "Returns accepted within 90 days with free shipping," that's a faithfulness failure. The AI added details that don't exist in the source material.

**Use Faithfulness when:**

- Your chatbot uses a knowledge base or uploaded documents
- You want to prevent hallucinations
- Accuracy is critical (legal, medical, financial content)

### 3. Answer Relevancy - "Did it answer the question?"

Answer Relevancy checks if the AI actually answered what was asked. Sometimes an AI will give a perfectly accurate response that has nothing to do with the question. "How do I reset my password?" shouldn't be answered with "Our company was founded in 2015 with a mission to revolutionize customer service..."

**Use Answer Relevancy when:**

- Users complain the AI doesn't answer their questions
- You want responses to stay focused
- Your AI tends to be verbose or go off-topic

### 4. Context Relevancy - "Did we retrieve the right information?"

Context Relevancy is specifically for RAG (Retrieval-Augmented Generation) systems - chatbots that search through documents to find relevant information before answering. This eval checks if the search step worked correctly. Even if your AI gives a perfect response, if it retrieved the wrong documents, that's a problem waiting to happen.

**Use Context Relevancy when:**

- Debugging why your RAG chatbot gives wrong answers
- Your knowledge base is large and complex
- You want to optimize document retrieval

## Choosing the Right Evaluation

Each type answers a different question: Factuality checks correctness against a known answer, Faithfulness catches hallucinations, Answer Relevancy keeps responses on-topic, and Context Relevancy verifies retrieval. Match the evaluation to the failure mode you're worried about, using the "Use ... when" lists above as a quick reference.

## Running All Evaluations Together

For the most thorough testing, you can run all four evaluations on the same response. This gives you a complete picture of your AI's quality.

## Why Factuality is the Best Starting Point

If you're new to evals and not sure where to start, **choose Factuality**. Here's why:

1. **It's intuitive** - You just provide questions and correct answers
2. **No setup required** - Works with any chatbot, with or without a knowledge base
3. **Clear pass/fail** - Easy to understand if something's wrong
4. **Most common issues** - Catches the problems that matter most to customers

The other evaluation types (Faithfulness, Answer Relevancy, Context Relevancy) are more specialized. They're powerful for specific use cases, but Factuality covers the most ground for most chatbots.

**Quick Start**: Create 10-20 test cases covering your most important questions. Run the Factuality evaluation. Fix any failures. Repeat before every major change to your chatbot.

## Creating Effective Test Cases

A good evaluation suite includes:

### 1. Happy Path Cases

Questions your AI should answer perfectly. These are the bread-and-butter interactions:

- "What are your business hours?"
- "How much does it cost?"
- "How do I contact support?"

### 2. Edge Cases

Tricky questions that might trip up your AI:

- Ambiguous questions
- Questions about things not in your knowledge base
- Multi-part questions

### 3. Red Lines

Things your AI should never say:

- Competitor comparisons you don't want made
- Promises you can't keep
- Sensitive topics you want avoided

## The Eval Workflow

Here's how to integrate evals into your chatbot development:

1. **Create test cases** - Start with your most important customer questions
2. **Write expected answers** - What should the correct response be?
3. **Run evaluation** - See how your current chatbot performs
4. **Fix failures** - Update your chatbot's instructions or knowledge base
5. **Iterate** - Add more test cases as you discover new scenarios

## Understanding Scores

Each evaluation produces a score from 0% to 100%:

- **70%+ (Pass)** - The response meets quality standards
- **40-69% (Warning)** - The response has issues that should be reviewed
- **Below 40% (Fail)** - The response has significant problems

The default passing threshold is 70%, but you can adjust this based on your needs. A medical chatbot might require 90%+, while a casual FAQ bot might be fine with 60%.

## What Happens Behind the Scenes

When you run an evaluation, here's what happens:

1. Your chatbot receives the test question
2. It generates a response (just like it would for a real user)
3. A separate "judge" AI evaluates the response
4. The judge scores the response against the relevant criteria
5. You see the results and can act on failures

This "LLM-as-Judge" approach is more sophisticated than simple string matching. It understands semantic meaning, handles paraphrasing, and provides reasoning for its scores.
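In code, that loop looks roughly like the sketch below. `ask_chatbot` and `judge_score` are placeholder callables for the bot under test and the judge-model call, and the 70/40 cutoffs mirror the pass/warning/fail bands above - this is an illustration of the pattern, not Chipp's internal implementation.

```python
# Rough sketch of the evaluation loop described above. `ask_chatbot` and
# `judge_score` are placeholder callables, not Chipp internals. Context
# Relevancy is omitted here because it also needs the retrieved documents.

CRITERIA = {
    "factuality": "Does the answer state the same facts as the expected answer?",
    "faithfulness": "Does the answer stick to what the knowledge base says?",
    "answer_relevancy": "Does the answer actually address the question?",
}

def bucket(score, threshold=70):
    """Map a 0-100 judge score onto the pass / warning / fail bands."""
    if score >= threshold:
        return "pass"
    if score >= 40:
        return "warning"
    return "fail"

def evaluate(question, ask_chatbot, judge_score, threshold=70):
    """Run one test question through the chatbot, then score the response."""
    response = ask_chatbot(question)                     # steps 1-2: generate a response
    report = {}
    for name, rubric in CRITERIA.items():
        score = judge_score(rubric, question, response)  # steps 3-4: judge returns 0-100
        report[name] = {"score": score, "status": bucket(score, threshold)}
    return report                                        # step 5: review and act on failures
```

Following the guidance above, a medical chatbot might call this with `threshold=90`, while a casual FAQ bot might be fine with `threshold=60`.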
## Next Steps

Ready to start evaluating your chatbot?

1. **Navigate to your chatbot** in the Chipp dashboard
2. **Go to the Evals tab** to create your first evaluation suite
3. **Start with Factuality** and your 10 most important questions
4. **Run the evaluation** and review results
5. **Iterate** based on what you learn

Remember: Evals aren't a one-time thing. As your chatbot evolves, your test suite should grow with it. The goal is to catch problems before your customers do.

**Pro Tip**: Schedule evals to run automatically whenever you update your chatbot's knowledge base or instructions. This catches regressions before they reach production.
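If you also drive releases from your own scripts or CI, one way to act on that tip is to gate deploys on eval results. The sketch below is a generic pattern, not a Chipp feature, and the result shape (one dict per test case with a 0-100 score) is an assumption for illustration.

```python
# Sketch of gating a deploy on eval results. The `results` shape (one dict per
# test case with a 0-100 score) is an assumption for illustration, not a format
# Chipp guarantees.

import sys

def gate(results, threshold=70):
    """Print failing cases and return an exit code a CI step can act on."""
    failures = [r for r in results if r["score"] < threshold]
    for f in failures:
        print(f"FAIL ({f['score']}%): {f['question']}")
    return 1 if failures else 0

if __name__ == "__main__":
    # Example scores for illustration; in practice these come from your eval run.
    sample = [
        {"question": "What's your pricing?", "score": 95},
        {"question": "What are your business hours?", "score": 62},
    ]
    sys.exit(gate(sample))
```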