Benchmarks tell you how models perform on academic tests. What you actually need to know is how they’ll perform on your tasks. This guide shows you how to build an evaluation framework for your specific use case.
Why Benchmarks Aren’t Enough
Public benchmarks (MMLU, HumanEval, MATH) measure performance on standardized tests. They’re useful for:
- Rough model tier comparisons
- Tracking progress over time
- Identifying obvious capability gaps
They don’t tell you:
- How the model performs on your domain
- Whether the output format meets your requirements
- Whether edge cases in your data are handled correctly
- Cost/quality tradeoffs for your specific workload
Always run your own evals before choosing a model for production.
Building a Test Set
The most important thing you’ll do: create a representative test set.
Requirements for a Good Test Set
Diverse coverage: Include the full range of inputs your users will actually send. If your application handles customer support tickets, include questions across every category you receive.
Known correct answers: You need a “ground truth” to evaluate against. This means human-labeled examples, rule-based expected outputs, or programmatically verifiable outputs.
Edge cases: Include the tricky cases — unusual formats, ambiguous inputs, the failure modes you care most about.
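Size: 100-500 examples is usually enough for most applications. As a rough guide, with an observed pass rate of 90%, the 95% confidence interval is about ±6 percentage points at 100 examples and about ±3 points at 500.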
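To check whether a given test set size is large enough for your decision, you can estimate the uncertainty around an observed pass rate. The sketch below uses the Wilson score interval, a standard approximation for binomial proportions; the function name `wilson_interval` is illustrative and not part of any library.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a pass rate (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half_width), min(1.0, center + half_width))

# Compare how the interval narrows as the test set grows
for n in (50, 100, 500):
    low, high = wilson_interval(round(0.9 * n), n)
    print(f"n={n}: observed 90% -> [{low:.3f}, {high:.3f}]")
```

If two candidate models land inside each other's intervals, the test set is probably too small to separate them.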
Building the Test Set
import json

# Example test case structure
test_cases = [
    {
        "id": "test_001",
        "category": "billing_question",
        "input": "I was charged twice for my subscription last month.",
        "expected_output": None,  # For generative tasks, use rubric instead
        "expected_intent": "billing_dispute",
        "rubric": [
            "Acknowledges the double charge",
            "Apologizes for the inconvenience",
            "Provides clear next steps",
            "Tone is empathetic"
        ]
    },
    # ... more cases
]

# Save test set
with open("eval_dataset.json", "w") as f:
    json.dump(test_cases, f, indent=2)
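Before running any eval, it helps to confirm that the saved file is well formed and that coverage across categories looks the way you expect. The following is a minimal sketch; `REQUIRED_FIELDS`, `validate_test_cases`, and `load_test_cases` are hypothetical names, and the required fields are an assumption based on the example structure above.

```python
import json
from collections import Counter

REQUIRED_FIELDS = {"id", "category", "input"}  # assumed minimum schema

def validate_test_cases(test_cases: list[dict]) -> Counter:
    """Check basic structure and return the number of cases per category."""
    seen_ids = set()
    for case in test_cases:
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"{case.get('id', '<no id>')}: missing {sorted(missing)}")
        if case["id"] in seen_ids:
            raise ValueError(f"duplicate id: {case['id']}")
        seen_ids.add(case["id"])
        # Every case needs something to grade against
        if case.get("expected_output") is None and not case.get("rubric"):
            raise ValueError(f"{case['id']}: needs expected_output or rubric")
    return Counter(case["category"] for case in test_cases)

def load_test_cases(path: str) -> list[dict]:
    """Load a JSON test set from disk and validate it."""
    with open(path, encoding="utf-8") as f:
        test_cases = json.load(f)
    validate_test_cases(test_cases)
    return test_cases
```

The returned `Counter` makes thin categories easy to spot: a category with only two or three cases cannot tell you much about how the model handles it.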
Evaluation Types
1. Exact Match (for structured outputs)
When output must match exactly (classification, extraction):
def evaluate_classification(model_output: str, expected: str) -> bool:
    # Exact match, ignoring case and surrounding whitespace
    return model_output.strip().lower() == expected.strip().lower()

def run_classification_eval(test_cases: list, model_fn) -> dict:
    # Only cases with a labeled expected_output can be exact-matched
    labeled = [c for c in test_cases if c.get("expected_output") is not None]
    correct = 0
    total = len(labeled)
    errors = []
    for case in labeled:
        output = model_fn(case["input"])
        is_correct = evaluate_classification(output, case["expected_output"])
        if is_correct:
            correct += 1
        else:
            errors.append({
                "id": case["id"],
                "input": case["input"],
                "expected": case["expected_output"],
                "got": output
            })
    return {
        "accuracy": correct / total if total else 0.0,
        "correct": correct,
        "total": total,
        "errors": errors
    }
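An overall accuracy number can hide a category where the model fails badly. One way to surface this, sketched below, is to group exact-match results by each case's `category` field. The function name and the `outputs` mapping (test case id to model output) are assumptions made for illustration.

```python
from collections import defaultdict

def accuracy_by_category(test_cases: list[dict], outputs: dict[str, str]) -> dict[str, float]:
    """Exact-match accuracy per category.

    `outputs` maps test case id -> model output (an assumed shape).
    Every case is expected to have a string `expected_output`.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for case in test_cases:
        category = case["category"]
        total[category] += 1
        got = outputs.get(case["id"], "")
        if got.strip().lower() == case["expected_output"].strip().lower():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}
```

Reporting this breakdown next to the headline accuracy makes it harder for a weak category to go unnoticed.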
2. LLM-as-Judge (for generative outputs)
When outputs are text that can’t be exact-matched, use another LLM to judge quality:
import json

import anthropic

client = anthropic.Anthropic()

def llm_judge(
    original_input: str,
    model_output: str,
    rubric: list[str]
) -> dict:
    rubric_str = "\n".join(f"- {criterion}" for criterion in rubric)
    prompt = f"""You are evaluating the quality of an AI assistant's response.

Original user message: {original_input}

AI Response: {model_output}

Evaluate this response against each criterion. For each criterion, respond with PASS or FAIL and a brief explanation.

Criteria:
{rubric_str}

Respond with only a JSON object in this form:
{{"evaluations": [{{"criterion": "...", "result": "PASS or FAIL", "explanation": "..."}}], "summary": "..."}}"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    evaluation = json.loads(response.content[0].text)
    # Compute the score in code rather than asking the judge for "X/Y":
    # the fraction of criteria marked PASS, between 0.0 and 1.0.
    passed = sum(1 for e in evaluation["evaluations"] if e["result"].upper() == "PASS")
    evaluation["overall_score"] = passed / len(rubric) if rubric else 0.0
    return evaluation

# Run evaluation
def run_generative_eval(test_cases: list, model_fn) -> dict:
    results = []
    for case in test_cases:
        output = model_fn(case["input"])
        evaluation = llm_judge(
            case["input"],
            output,
            case["rubric"]
        )
        results.append({
            "id": case["id"],
            "output": output,
            "evaluation": evaluation,
            "score": evaluation["overall_score"]
        })
    avg_score = sum(r["score"] for r in results) / len(results) if results else 0.0
    return {"average_score": avg_score, "results": results}
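Judge models sometimes add a sentence before the JSON or wrap it in a code fence, and a direct `json.loads` on the full reply then raises an error. A small, hypothetical helper like `extract_json_object` below can make the parsing step more forgiving. It assumes the reply contains exactly one top-level JSON object.

```python
import json

def extract_json_object(reply_text: str) -> dict:
    """Parse the outermost {...} span in a reply, ignoring surrounding text."""
    start = reply_text.find("{")
    end = reply_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge reply")
    parsed = json.loads(reply_text[start:end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("judge reply did not contain a JSON object")
    return parsed
```

If parsing still fails, it is usually better to record the case as a judge error than to silently score it as zero, since otherwise parse failures get mixed in with genuine quality failures.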
3. Human Evaluation
For the highest-stakes decisions, use human raters:
def create_human_eval_task(test_cases: list, model_outputs: dict) -> list:
    """Create tasks for human raters."""
    tasks = []
    for case in test_cases:
        tasks.append({
            "input": case["input"],
            "output_a": model_outputs["model_a"].get(case["id"], ""),
            "output_b": model_outputs["model_b"].get(case["id"], ""),
            "rating_criteria": case["rubric"],
            "instructions": "Rate which response better meets the criteria, or if they're equal."
        })
    return tasks
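If raters always see the same model in the "A" position, position bias can leak into the ratings. A common precaution is to shuffle the order per task and map the ratings back afterwards. The sketch below assumes raters answer "a", "b", or "equal"; `blind_pairs` and `tally_preferences` are hypothetical helpers, not part of the function above.

```python
import random

def blind_pairs(tasks: list[dict], seed: int = 0) -> list[dict]:
    """Randomly swap output_a/output_b per task and record whether a swap happened."""
    rng = random.Random(seed)
    blinded = []
    for task in tasks:
        swapped = rng.random() < 0.5
        first, second = ("output_b", "output_a") if swapped else ("output_a", "output_b")
        blinded.append({
            **task,
            "output_a": task[first],
            "output_b": task[second],
            "swapped": swapped,  # keep this hidden from raters
        })
    return blinded

def tally_preferences(blinded: list[dict], ratings: list[str]) -> dict[str, int]:
    """Map rater choices ("a", "b", "equal") back to the underlying models."""
    counts = {"model_a": 0, "model_b": 0, "equal": 0}
    for task, rating in zip(blinded, ratings):
        if rating == "equal":
            counts["equal"] += 1
        elif (rating == "a") != task["swapped"]:
            # Rater picked the slot that holds model_a's output
            counts["model_a"] += 1
        else:
            counts["model_b"] += 1
    return counts
```

The fixed seed keeps the shuffle reproducible, so the same blinded set can be sent to several raters.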
Comparing Multiple Models
import anthropic
from openai import OpenAI

anthropic_client = anthropic.Anthropic()
openai_client = OpenAI()

def claude_model(prompt: str, system: str = "") -> str:
    # Only send a system prompt when one is provided
    extra = {"system": system} if system else {}
    response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
        **extra
    )
    return response.content[0].text

def gpt4o_model(prompt: str, system: str = "") -> str:
    messages = [{"role": "user", "content": prompt}]
    if system:
        messages.insert(0, {"role": "system", "content": system})
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return response.choices[0].message.content

# Run same eval on both models (uses test_cases and run_generative_eval from earlier)
claude_results = run_generative_eval(test_cases, claude_model)
gpt4o_results = run_generative_eval(test_cases, gpt4o_model)
print(f"Claude score: {claude_results['average_score']:.2f}")
print(f"GPT-4o score: {gpt4o_results['average_score']:.2f}")
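Two average scores that differ by a few points may not reflect a real difference, especially on a small test set. Because both models ran on the same cases, one option is a paired bootstrap over the per-case score differences. This is a sketch under that assumption; `paired_bootstrap` is an illustrative name.

```python
import random

def paired_bootstrap(scores_a: list[float], scores_b: list[float],
                     n_resamples: int = 2000, seed: int = 0) -> dict:
    """Bootstrap the mean per-case score difference (A minus B)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample cases with replacement and record the mean difference each time
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    return {
        "mean_diff": sum(diffs) / n,
        "ci_low": means[int(0.025 * n_resamples)],
        "ci_high": means[int(0.975 * n_resamples) - 1],
    }

# Usage with the results above (scores must be in the same case order):
# a = [r["score"] for r in claude_results["results"]]
# b = [r["score"] for r in gpt4o_results["results"]]
# print(paired_bootstrap(a, b))
```

If the interval between `ci_low` and `ci_high` includes zero, the eval has not shown a reliable difference between the two models.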
Cost/Quality Analysis
Include cost in your evaluation:
def evaluate_with_cost(test_cases: list, model_fn_configs: list) -> list:
    """Compare models including cost analysis.

    Each config needs "name", "input_price_per_1m", "output_price_per_1m", and
    "fn_with_usage", a function returning a dict with "output",
    "input_tokens", and "output_tokens".
    """
    results = []
    for config in model_fn_configs:
        model_results = []
        total_input_tokens = 0
        total_output_tokens = 0
        for case in test_cases:
            response_data = config["fn_with_usage"](case["input"])
            model_results.append(response_data["output"])
            total_input_tokens += response_data["input_tokens"]
            total_output_tokens += response_data["output_tokens"]
        # Calculate cost
        cost = (total_input_tokens * config["input_price_per_1m"] / 1_000_000 +
                total_output_tokens * config["output_price_per_1m"] / 1_000_000)
        # Run quality eval on the cached outputs so each model is called only once
        cached_outputs = iter(model_results)
        quality = run_generative_eval(test_cases, lambda _input: next(cached_outputs))
        results.append({
            "model": config["name"],
            "quality_score": quality["average_score"],
            "cost_for_test_set": cost,
            "cost_per_1000_requests": cost / len(test_cases) * 1000
        })
    return results
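Once you have quality and cost for each model, a useful first filter is to drop any model that another model beats on both axes. The sketch below keeps only the non-dominated options (the Pareto frontier) from the list returned by `evaluate_with_cost`; `pareto_frontier` is a hypothetical helper.

```python
def pareto_frontier(results: list[dict]) -> list[dict]:
    """Keep models that no other model matches or beats on both quality and cost."""
    frontier = []
    for candidate in results:
        dominated = any(
            other["quality_score"] >= candidate["quality_score"]
            and other["cost_per_1000_requests"] <= candidate["cost_per_1000_requests"]
            and (
                other["quality_score"] > candidate["quality_score"]
                or other["cost_per_1000_requests"] < candidate["cost_per_1000_requests"]
            )
            for other in results
        )
        if not dominated:
            frontier.append(candidate)
    # Cheapest first, so the list reads as "what does more money buy?"
    return sorted(frontier, key=lambda r: r["cost_per_1000_requests"])
```

Choosing among the models left on the frontier is a product decision: it depends on how much each extra point of quality is worth for your workload.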
Continuous Evaluation in Production
Once deployed, continue evaluating:
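import json
import random
from datetime import datetime

def append_to_eval_log(entry: dict, path: str = "eval_log.jsonl") -> None:
    """Placeholder sink that appends one JSON line; swap in your own storage."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def log_for_evaluation(input_text: str, output_text: str, metadata: dict):
    """Log a sample of production calls for offline evaluation."""
    if random.random() < 0.01:  # Sample 1% of traffic
        log_entry = {
            "timestamp": datetime.now().isoformat(),
            "input": input_text,
            "output": output_text,
            "metadata": metadata
        }
        # Write to your eval dataset for weekly/monthly review
        append_to_eval_log(log_entry)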
Production sampling gives you ongoing insight into real-world performance — which often differs from test set performance.
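Random sampling picks a different set of requests on every run, which makes a logged sample hard to reproduce. An alternative, if your requests carry a stable identifier, is to hash that identifier and sample by bucket. This sketch assumes a `request_id` string is available; `should_sample` is an illustrative name.

```python
import hashlib

def should_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of requests based on their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

The same request always gets the same decision, so every service that handles it can agree on whether it belongs in the sample.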
Evaluation Red Flags
Signs your eval framework has problems:
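- Too easy: If the model scores 95%+, your test set probably lacks hard cases and can’t separate strong models from weak ones
- Score inflation: LLM judges tend to rate outputs more highly than humans do, so spot-check judge verdicts against human labels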
- Distribution mismatch: Your test set doesn’t reflect real user inputs
- Single metric: One score hides failures in specific categories
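To check for score inflation, you can have humans label a subset of outputs as pass or fail and compare those labels with the judge's verdicts on the same items. The following sketch assumes both sets of labels are booleans in the same order; `judge_agreement` is a hypothetical helper.

```python
def judge_agreement(human_labels: list[bool], judge_labels: list[bool]) -> dict:
    """Compare judge PASS/FAIL verdicts with human verdicts on the same items."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human_labels)
    agree = sum(h == j for h, j in zip(human_labels, judge_labels))
    # Judge said PASS where the human said FAIL: the inflation direction
    judge_only_pass = sum(j and not h for h, j in zip(human_labels, judge_labels))
    return {
        "agreement": agree / n,
        "human_pass_rate": sum(human_labels) / n,
        "judge_pass_rate": sum(judge_labels) / n,
        "judge_only_pass": judge_only_pass,
    }
```

A judge pass rate well above the human pass rate, driven mostly by `judge_only_pass` cases, suggests the rubric or judge prompt needs tightening.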
Good evaluations are hard. Invest time in your test set design — it’s the foundation of everything else.