Definition
Evals (evaluations) are systematic testing frameworks that measure how well an LLM performs on a specific task, use case, or quality dimension. Unlike broad benchmarks that compare models generally, evals are purpose-built for a specific application — they answer "is this model good enough for MY use case?"
Why Evals Are Critical
- Benchmarks measure general capability; evals measure YOUR application's performance
- "Vibes-based" development (it feels right) doesn't scale or catch regressions
- Without evals, you can't know if a model change improved or degraded quality
- Evals enable data-driven model selection and prompt iteration
- Exact match: model output == expected output
- Regex match: output matches a pattern
- JSON schema validation: output conforms to schema
- Code execution: generated code runs and produces correct output
- Contains check: output contains required phrases/keywords
- API call verification: tool was called with correct parameters
- Gold standard but expensive and slow
- Best for final validation, not iteration
- A/B preference: "Which response is better, A or B?"
- Annotation platforms: Scale AI, Surge, Prolific
- "Model A response" vs. "Model B response"
- Judge determines winner (or tie)
- Foundation of leaderboards like Chatbot Arena
- 50–500 representative examples from your actual use case
- Diverse coverage of user intents, edge cases, failure modes
- Should include adversarial / tricky examples
- Must NOT be in the model's training data (contamination)
- Binary (pass/fail)
- Likert scale (1–5)
- Continuous (0.0–1.0)
- Multi-dimensional (separate scores per criterion)
- Use a stronger model as judge than the model being evaluated
- Include reference answers when available
- Use chain-of-thought for the judge (reasoning before score)
- Use multiple judge models and aggregate to reduce bias
- Validate judge against human labels on a calibration set
- Benchmarks, LLM-as-Judge, Fine-Tuning, Alignment, Hallucination, RAGAS, Observability
The Eval Mindset
`
Hypothesis: "Claude Sonnet is better than GPT-4o for my customer support use case"
Eval: Run 200 representative customer queries through both models
Metric: Accuracy on intent classification + quality score on responses
Result: Data-backed decision, not intuition
`
Types of Evals
1. Automated Evals (Deterministic)
Pass/fail based on exact or programmatic criteria:
Best for: structured outputs, factual Q&A with ground truth, code generation
2. Model-Based Evals (LLM-as-Judge)
Use another LLM to evaluate output quality:
`
Judge prompt:
"Rate the following response on helpfulness, accuracy, and safety.
Question: [question]
Response: [model output]
Rate each dimension 1-5 and explain your rating."
`
Best for: subjective quality, format adherence, safety, tone, instruction following
3. Human Evals
Human raters score model outputs:
4. Pairwise Comparison Evals
Compare two model outputs head-to-head:
Key Eval Components
Test Set
Evaluation Criteria
Define what "good" means explicitly:
| Criterion | Description |
|-----------|-------------|
| Correctness | Is the answer factually right? |
| Completeness | Does it cover all required points? |
| Conciseness | Is it appropriately brief? |
| Safety | No harmful/inappropriate content? |
| Format | Does it match the required format? |
| Tone | Appropriate for the audience? |
| Instruction adherence | Does it follow all constraints? |
Scoring System
Eval Frameworks and Tools
| Tool | Description | Best For |
|------|-------------|---------|
| RAGAS | RAG-specific evaluation (faithfulness, relevance, recall) | RAG systems |
| TruLens | LLM app evaluation + tracing | General LLM apps |
| Braintrust | Eval platform + experiment tracking | Team eval workflows |
| LangSmith | LangChain eval + observability | LangChain apps |
| Evals (OpenAI) | OpenAI's eval framework | API-based evals |
| Promptfoo | Prompt testing and comparison | Prompt iteration |
| PromptBench | Adversarial prompt evaluation | Robustness testing |
| Inspect AI | UK AISI's eval framework | Safety evals |
| EleutherAI lm-eval | Open-source benchmark harness | Academic benchmarks |
The LLM-as-Judge Pattern
The most scalable eval approach for subjective quality:
`python
JUDGE_PROMPT = """
You are an expert evaluator. Rate the AI response below.
Question: {question}
AI Response: {response}
Reference Answer: {reference}
Score on these dimensions (1-5):
1. Accuracy: Is the information correct?
2. Helpfulness: Does it address the question?
3. Clarity: Is it easy to understand?
Return JSON: {{"accuracy": X, "helpfulness": X, "clarity": X, "reasoning": "..."}}
"""
score = judge_model.complete(JUDGE_PROMPT.format(...))
`
Best practices:
Regression Testing
Evals as CI/CD — run before every model/prompt change:
`
git push → CI triggers eval suite → compare scores → alert if regression > 2%
`
Catches: prompt changes that help one dimension but hurt another, model upgrades that regress on edge cases, fine-tuning that breaks general capabilities.
Eval Anti-Patterns
| Anti-Pattern | Problem |
|-------------|---------|
| Small test set (<50 examples) | High variance, unreliable scores |
| Only happy-path examples | Misses edge cases |
| Eval on training data | Inflated scores from memorization |
| Single-judge LLM | Judge bias, inconsistency |
| Ignoring failure modes | Only measuring when model is right |
| No human validation of judge | Judge may have systematic errors |
Practical Eval Workflow
1. Define metrics that map to business outcomes
2. Collect test data from real user interactions
3. Build automated evals for deterministic criteria
4. Add LLM-as-judge for subjective quality
5. Establish baselines before making changes
6. Run evals on every iteration of model/prompt
7. Human spot-check eval failures periodically
8. Monitor production for distribution drift