How to Evaluate LLM Performance for Your Application

Benchmarks tell you how models perform on academic tests. What you actually need to know is how they’ll perform on your tasks. This guide shows you how to build an evaluation framework for your specific use case.

Why Benchmarks Aren’t Enough

Public benchmarks (MMLU, HumanEval, MATH) measure performance on standardized tests. They’re useful for:

Rough model tier comparisons
Tracking progress over time
Identifying obvious capability gaps

They don’t tell you:

How the model performs on your domain
Whether the output format meets your requirements
Whether edge cases in your data are handled correctly
Cost/quality tradeoffs for your specific workload

Always run your own evals before choosing a model for production.

Building a Test Set

The most important thing you’ll do: create a representative test set.

Requirements for a Good Test Set

Diverse coverage: Include the full range of inputs your users will actually send. If your application handles customer support tickets, include questions across every category you receive.

Known correct answers: You need a “ground truth” to evaluate against. This means human-labeled examples, rule-based expected outputs, or programmatically verifiable outputs.

Edge cases: Include the tricky cases — unusual formats, ambiguous inputs, the failure modes you care most about.

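Size: 100-500 examples is usually enough for most applications. Smaller sets can only detect large differences between models, so lean toward the upper end when the models you are comparing score close together.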
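To get a feel for how test set size affects certainty, the sketch below computes a Wilson score interval for an observed accuracy. The function name `wilson_interval` and the 90% accuracy figure are illustrative assumptions, not part of any library or of the guide's own tooling.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval (z=1.96) for an accuracy of correct/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half_width), min(1.0, center + half_width))

# Hypothetical scenario: the same 90% observed accuracy at two test set sizes
for n in (100, 500):
    low, high = wilson_interval(round(0.9 * n), n)
    print(f"n={n}: interval {low:.3f} to {high:.3f}")
```

The interval narrows as the test set grows. If two models' intervals overlap heavily, the test set is probably too small to tell them apart.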

Example Test Case Structure

import json

# Example test case structure
test_cases = [
    {
        "id": "test_001",
        "category": "billing_question",
        "input": "I was charged twice for my subscription last month.",
        "expected_output": None,  # For generative tasks, use rubric instead
        "expected_intent": "billing_dispute",
        "rubric": [
            "Acknowledges the double charge",
            "Apologizes for the inconvenience", 
            "Provides clear next steps",
            "Tone is empathetic"
        ]
    },
    # ... more cases
]

# Save test set
with open("eval_dataset.json", "w") as f:
    json.dump(test_cases, f, indent=2)
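Before running any eval, it can help to sanity-check the test set itself. The following is a minimal sketch, assuming the field names from the structure above (`id`, `category`, `input`, `expected_output`, `rubric`). The helper names `validate_test_cases` and `category_counts` are hypothetical.

```python
from collections import Counter

def validate_test_cases(test_cases: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(test_cases):
        case_id = case.get("id")
        if not case_id:
            problems.append(f"case at index {index} has no id")
        elif case_id in seen_ids:
            problems.append(f"duplicate id: {case_id}")
        seen_ids.add(case_id)
        if not case.get("input"):
            problems.append(f"{case_id}: empty input")
        # Every case needs something to grade against
        if case.get("expected_output") is None and not case.get("rubric"):
            problems.append(f"{case_id}: no expected_output or rubric")
    return problems

def category_counts(test_cases: list[dict]) -> Counter:
    """Count cases per category to spot under-covered areas."""
    return Counter(case.get("category", "uncategorized") for case in test_cases)
```

Comparing `category_counts(test_cases)` against the category mix in your real traffic is one way to check the diverse-coverage requirement.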

Evaluation Types

1. Exact Match (for structured outputs)

When output must match exactly (classification, extraction):

def evaluate_classification(model_output: str, expected: str) -> bool:
    return model_output.strip().lower() == expected.strip().lower()

def run_classification_eval(test_cases: list, model_fn) -> dict:
    correct = 0
    total = len(test_cases)
    errors = []
    
    for case in test_cases:
        output = model_fn(case["input"])
        is_correct = evaluate_classification(output, case["expected_output"])
        
        if is_correct:
            correct += 1
        else:
            errors.append({
                "id": case["id"],
                "input": case["input"],
                "expected": case["expected_output"],
                "got": output
            })
    
    return {
        # Guard against an empty test set to avoid ZeroDivisionError
        "accuracy": correct / total if total else 0.0,
        "correct": correct,
        "total": total,
        "errors": errors
    }
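A single accuracy number can hide a category that fails badly. The sketch below breaks exact-match accuracy down by each case's `category` field. The function name `accuracy_by_category` and the `outputs` mapping (test case id to model output) are assumptions made for illustration.

```python
from collections import defaultdict

def accuracy_by_category(test_cases: list[dict], outputs: dict[str, str]) -> dict[str, dict]:
    """Group exact-match accuracy by each case's "category" field.

    `outputs` maps a test case id to the model's output for that case.
    A missing output counts as incorrect.
    """
    stats = defaultdict(lambda: {"correct": 0, "total": 0})
    for case in test_cases:
        got = outputs.get(case["id"], "")
        expected = case["expected_output"]
        bucket = stats[case.get("category", "uncategorized")]
        bucket["total"] += 1
        # Same normalization as the exact-match check: trim and lowercase
        if got.strip().lower() == expected.strip().lower():
            bucket["correct"] += 1
    return {
        category: {**counts, "accuracy": counts["correct"] / counts["total"]}
        for category, counts in stats.items()
    }
```

Sorting the result by accuracy puts the weakest categories first, which is usually where to look for missing prompt instructions or mislabeled cases.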

2. LLM-as-Judge (for generative outputs)

When outputs are text that can’t be exact-matched, use another LLM to judge quality:

import json

import anthropic

client = anthropic.Anthropic()

def llm_judge(
    original_input: str,
    model_output: str, 
    rubric: list[str]
) -> dict:
    rubric_str = "\n".join([f"- {criterion}" for criterion in rubric])
    
    prompt = f"""You are evaluating the quality of an AI assistant's response.

Original user message: {original_input}

AI Response: {model_output}

Evaluate this response against each criterion. For each criterion, respond with PASS or FAIL and a brief explanation.

Criteria:
{rubric_str}

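Respond with only a JSON object, no surrounding text:
{{"evaluations": [{{"criterion": "...", "result": "PASS or FAIL", "explanation": "..."}}], "summary": "..."}}"""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    
    # Raises json.JSONDecodeError if the judge adds text around the JSON
    evaluation = json.loads(response.content[0].text)
    
    # Compute the score in code instead of asking the judge for "X/Y":
    # the fraction of criteria marked PASS, from 0.0 to 1.0
    verdicts = [e["result"].strip().upper() for e in evaluation["evaluations"]]
    evaluation["overall_score"] = verdicts.count("PASS") / len(verdicts) if verdicts else 0.0
    return evaluation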

# Run evaluation
def run_generative_eval(test_cases: list, model_fn) -> dict:
    results = []
    
    for case in test_cases:
        output = model_fn(case["input"])
        evaluation = llm_judge(
            case["input"],
            output,
            case["rubric"]
        )
        
        results.append({
            "id": case["id"],
            "output": output,
            "evaluation": evaluation,
            "score": evaluation["overall_score"]
        })
    
    # Average the per-case scores; an empty test set yields 0.0
    avg_score = sum(r["score"] for r in results) / len(results) if results else 0.0
    return {"average_score": avg_score, "results": results}
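Judge models do not always return bare JSON; they sometimes wrap it in prose or code fences. The sketch below shows one tolerant way to parse a judge reply and turn its PASS/FAIL verdicts into a numeric score. The helper names `parse_judge_response` and `pass_fraction` are hypothetical, and the expected keys (`evaluations`, `result`) are assumed to follow the judge prompt format used in this section.

```python
import json

def parse_judge_response(text: str) -> dict:
    """Extract a JSON object from a judge reply, tolerating surrounding text.

    Tries the full text first, then the span from the first "{" to the last "}".
    Raises ValueError if neither parses to a JSON object.
    """
    candidates = [text]
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("No JSON object found in judge response")

def pass_fraction(evaluation: dict) -> float:
    """Fraction of criteria marked PASS (0.0 to 1.0); 0.0 if there are none."""
    verdicts = [
        str(item.get("result", "")).strip().upper()
        for item in evaluation.get("evaluations", [])
    ]
    return verdicts.count("PASS") / len(verdicts) if verdicts else 0.0
```

Computing the score from the individual verdicts keeps it numeric and avoids relying on the judge to do its own arithmetic.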

3. Human Evaluation

For the highest-stakes decisions, use human raters. The function below builds pairwise comparison tasks from two models' outputs:

def create_human_eval_task(test_cases: list, model_outputs: dict) -> list:
    """Create tasks for human raters."""
    tasks = []
    for case in test_cases:
        tasks.append({
            "input": case["input"],
            "output_a": model_outputs["model_a"].get(case["id"], ""),
            "output_b": model_outputs["model_b"].get(case["id"], ""),
            "rating_criteria": case["rubric"],
            "instructions": "Rate which response better meets the criteria, or if they're equal."
        })
    return tasks
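If raters always see the same model in the first position, position preference and model recognition can skew the ratings. A possible mitigation, sketched below, is to shuffle the display order per task and keep a private answer key. The function name `blind_pairwise_task` and the `response_1` / `response_2` field names are assumptions for illustration; the input task is assumed to have the shape produced by `create_human_eval_task`.

```python
import random

def blind_pairwise_task(task: dict, rng: random.Random) -> tuple[dict, dict]:
    """Shuffle which model appears as response 1 vs. response 2.

    Returns the task to show the rater, plus a private answer key mapping
    the displayed positions back to the original output_a / output_b keys.
    """
    order = ["output_a", "output_b"]
    rng.shuffle(order)
    rater_task = {
        "input": task["input"],
        "response_1": task[order[0]],
        "response_2": task[order[1]],
        "rating_criteria": task["rating_criteria"],
        "instructions": task["instructions"],
    }
    answer_key = {"response_1": order[0], "response_2": order[1]}
    return rater_task, answer_key
```

Passing a seeded `random.Random` instance makes the shuffle reproducible, so the answer key can be regenerated if it is lost.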

Comparing Multiple Models

import anthropic
from openai import OpenAI

anthropic_client = anthropic.Anthropic()
openai_client = OpenAI()

def claude_model(prompt: str, system: str = "") -> str:
    response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

def gpt4o_model(prompt: str, system: str = "") -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content

# Run same eval on both models
claude_results = run_generative_eval(test_cases, claude_model)
gpt4o_results = run_generative_eval(test_cases, gpt4o_model)

print(f"Claude score: {claude_results['average_score']:.2f}")
print(f"GPT-4o score: {gpt4o_results['average_score']:.2f}")
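Two average scores that differ by a few points may not reflect a real difference. One way to check, sketched here, is a paired bootstrap over per-case score differences. The function name `paired_bootstrap_diff` is hypothetical, and the sketch assumes each result's score is numeric and that both score lists are in the same test case order.

```python
import random

def paired_bootstrap_diff(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> dict:
    """Bootstrap an approximate 95% interval for mean(scores_a) - mean(scores_b).

    Index i in both lists must refer to the same test case.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample test cases with replacement, keeping each pair together
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    return {
        "mean_diff": sum(diffs) / n,
        "ci_low": means[int(0.025 * n_resamples)],
        "ci_high": means[int(0.975 * n_resamples) - 1],
    }
```

With the results above, the inputs would be along the lines of `[r["score"] for r in claude_results["results"]]` and the matching list for the other model. An interval that includes zero suggests the test set cannot distinguish the two models.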

Cost/Quality Analysis

Include cost in your evaluation:

def evaluate_with_cost(test_cases: list, model_fn_configs: list) -> list:
    """Compare models including cost analysis."""
    results = []
    
    for config in model_fn_configs:
        model_results = []
        total_input_tokens = 0
        total_output_tokens = 0
        
        for case in test_cases:
            response_data = config["fn_with_usage"](case["input"])
            model_results.append(response_data["output"])
            total_input_tokens += response_data["input_tokens"]
            total_output_tokens += response_data["output_tokens"]
        
        # Calculate cost
        cost = (total_input_tokens * config["input_price_per_1m"] / 1_000_000 + 
                total_output_tokens * config["output_price_per_1m"] / 1_000_000)
        
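        # Run quality eval on the outputs already collected, so each model
        # is called only once per test case. Outputs are replayed in order.
        cached_outputs = iter(model_results)
        quality = run_generative_eval(test_cases, lambda _input: next(cached_outputs))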
        
        results.append({
            "model": config["name"],
            "quality_score": quality["average_score"],
            "cost_for_test_set": cost,
            "cost_per_1000_requests": cost / len(test_cases) * 1000
        })
    
    return results
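Once each model has a quality score and a cost, a simple way to narrow the field is to drop any model that another model beats on both dimensions. The sketch below assumes result dicts shaped like the ones returned by `evaluate_with_cost`; the function name `pareto_frontier` is hypothetical.

```python
def pareto_frontier(results: list[dict]) -> list[dict]:
    """Return models that no other model matches or beats on both quality and cost.

    A model is dropped when another model has quality at least as high and
    cost at most as high, with a strict improvement in at least one of them.
    """
    frontier = []
    for candidate in results:
        dominated = any(
            other["quality_score"] >= candidate["quality_score"]
            and other["cost_per_1000_requests"] <= candidate["cost_per_1000_requests"]
            and (
                other["quality_score"] > candidate["quality_score"]
                or other["cost_per_1000_requests"] < candidate["cost_per_1000_requests"]
            )
            for other in results
        )
        if not dominated:
            frontier.append(candidate)
    # Cheapest first, so the list reads as "what more money buys"
    return sorted(frontier, key=lambda r: r["cost_per_1000_requests"])
```

The models left on the frontier are the real candidates; choosing among them depends on how much a quality gain is worth to your application.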

Continuous Evaluation in Production

Once deployed, continue evaluating:

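import json
import random
from datetime import datetime

EVAL_LOG_PATH = "production_eval_log.jsonl"  # Placeholder path; adjust for your setup

def append_to_eval_log(log_entry: dict, path: str = EVAL_LOG_PATH) -> None:
    """Append one entry as a JSON line. Swap in your own storage as needed."""
    with open(path, "a") as f:
        f.write(json.dumps(log_entry) + "\n")

def log_for_evaluation(input_text: str, output_text: str, metadata: dict):
    """Log a sample of production calls for offline evaluation."""
    if random.random() < 0.01:  # Sample 1% of traffic
        log_entry = {
            "timestamp": datetime.now().isoformat(),
            "input": input_text,
            "output": output_text,
            "metadata": metadata  # Must be JSON-serializable
        }
        # Write to your eval dataset for weekly/monthly review
        append_to_eval_log(log_entry)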

Production sampling gives you ongoing insight into real-world performance — which often differs from test set performance.

Evaluation Red Flags

Signs your eval framework has problems:

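Too easy: Every model scores 95%+, which usually means your test set is missing the hard cases and can't separate good models from great ones
Score inflation: LLM judges tend to rate outputs more highly than humans do, so spot-check judge verdicts against human ratings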
Distribution mismatch: Your test set doesn’t reflect real user inputs
Single metric: One score hides failures in specific categories
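To check for score inflation, one option is to have humans label a sample of the same outputs the judge rated and compare the two. The sketch below reports raw agreement, Cohen's kappa (agreement corrected for chance), and each side's pass rate. The function name `judge_human_agreement` and the PASS/FAIL label format are assumptions for illustration.

```python
def judge_human_agreement(judge_labels: list[str], human_labels: list[str]) -> dict:
    """Compare judge and human verdicts on the same items, in the same order."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected by chance, from each rater's label frequencies
    labels = set(judge_labels) | set(human_labels)
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return {
        "agreement": observed,
        "cohens_kappa": kappa,
        "judge_pass_rate": judge_labels.count("PASS") / n,
        "human_pass_rate": human_labels.count("PASS") / n,
    }
```

A judge pass rate well above the human pass rate is a sign of inflation, and a low kappa suggests the rubric or judge prompt needs tightening before the scores can be trusted.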

Good evaluations are hard. Invest time in your test set design — it’s the foundation of everything else.