Evaluation Best Practices

Effective evaluation is the foundation of reliable AI applications. This guide covers best practices for designing, implementing, and maintaining evaluations for your LLM-powered features.

Design Your Evaluation Process

A systematic approach to evaluation follows five steps:

Define your objective - What does success look like for this feature?
Collect a dataset - Which examples help evaluate your objective?
Define metrics - How will you measure success quantitatively?
Run and compare - Test changes against your baseline
Continuously evaluate - Monitor production quality over time

Set Up Tracing Early

Before deploying to production, instrument your application with tracing. This gives you visibility into every LLM call and enables post-hoc evaluation.

Arize Phoenix
W&B Weave
Logfire

from phoenix.otel import register

tracer_provider = register(project_name="my-heroku-app")

import weave

weave.init("my-heroku-app")

import logfire

logfire.configure()
logfire.instrument_openai()

Define Success Metrics

Choose metrics that align with your business objectives:

Metric Type	Examples	Use Case
Accuracy	Exact match, semantic similarity	Factual Q&A, data extraction
Quality	Coherence, helpfulness, relevance	Content generation, chat
Safety	Toxicity, PII detection, refusals	User-facing applications
Performance	Latency, token usage, cost	Production optimization

Types of Evaluators

Metric-Based Evaluations

Quantitative scores for automated regression testing:

Exact match - Output matches expected answer exactly
String containment - Output contains required keywords
Regex patterns - Output matches expected format
Function call accuracy - Correct tool invocations

Best for: Structured outputs, tool use, classification tasks.

Human Evaluations

High quality but resource-intensive:

Use randomized, blinded tests to avoid bias
Require multiple reviewer consensus for subjective tasks
Reserve for high-stakes decisions and calibrating automated evals

Best for: Creative content, nuanced quality judgments, establishing ground truth.

LLM-as-a-Judge

Scalable automated evaluation using another LLM:

JUDGE_PROMPT = """
Rate the following response on a scale of 1-5 for helpfulness.

User question: {question}
Assistant response: {response}

Provide your rating and a brief explanation.
"""

Common patterns:

Pairwise comparison - Which of two responses is better?
Reference-guided grading - How close is this to the ideal answer?
Rubric-based scoring - Rate against specific criteria

Best for: Quality assessment at scale, comparing prompt variations.

Build Your Evaluation Dataset

Start Small, Grow Intentionally

Begin with 20-50 carefully curated examples that represent:

Common use cases (70% of dataset)
Edge cases (20% of dataset)
Known failure modes (10% of dataset)

Collect from Production

Use tracing to identify valuable evaluation examples:

# With Phoenix, export interesting traces to datasets
from phoenix.trace import TraceDataset

# Filter for traces that need review
traces = px.Client().get_trace_dataset(
    project_name="my-app",
    filter="latency_ms > 5000 OR error IS NOT NULL"
)

Handle Edge Cases

Ensure your dataset covers:

Non-English or multilingual inputs
Multiple questions in a single request
Typos and misspellings
Very long or very short inputs
Ambiguous requests
Adversarial inputs (jailbreak attempts)

Run Evaluations Effectively

Establish a Baseline

Before making changes, measure your current performance:

# Run evaluation on current prompt
baseline_results = evaluate(
    dataset=eval_dataset,
    prompt=current_prompt,
    metrics=[accuracy, latency, cost]
)
print(f"Baseline accuracy: {baseline_results.accuracy:.2%}")

Compare Changes Systematically

Test one change at a time:

# Compare new prompt against baseline
comparison = compare_evaluations(
    baseline=baseline_results,
    candidate=evaluate(dataset=eval_dataset, prompt=new_prompt, metrics=metrics)
)

if comparison.is_significant_improvement():
    print("New prompt is better - deploy it")

Watch for Regressions

Improvements in one area may cause regressions in another:

Change	Accuracy	Latency	Cost
Baseline	85%	1.2s	$0.02
Longer prompt	92%	1.8s	$0.04
Different model	88%	0.8s	$0.01

Continuous Evaluation in Production

Monitor Key Metrics

Set up dashboards and alerts for:

Error rates by endpoint and model
Latency percentiles (p50, p95, p99)
Token usage and cost per request
User feedback signals (thumbs up/down, regenerations)

Sample Production Traffic

Periodically evaluate a sample of production requests:

# Weekly evaluation job
production_sample = get_random_traces(n=100, period="7d")
results = evaluate(production_sample, metrics=quality_metrics)

if results.quality_score < THRESHOLD:
    alert("Quality degradation detected")

Create a Feedback Loop

Collect traces from production
Identify failure cases through monitoring or user feedback
Add failures to evaluation dataset
Test prompt improvements against expanded dataset
Deploy with confidence

Common Pitfalls to Avoid

Don’t overfit to your eval set. If you tune prompts specifically to pass your evaluations, you may not improve real-world performance. Regularly add new examples and rotate test sets.

Don’t ignore cost. A 5% accuracy improvement that doubles your LLM costs may not be worth it. Always consider the cost-quality tradeoff.

Don’t skip human review. Automated metrics can miss subtle quality issues. Periodically review outputs manually, especially after significant changes.

Next Steps

Set up Arize Phoenix for tracing and evaluation
Configure W&B Weave for experiment tracking
Deploy Logfire for production monitoring

Get started

Core concepts

Agents

Tools

Evaluation

Integrations

Reference

Cookbook

Evaluation Best Practices

Design Your Evaluation Process

Set Up Tracing Early

Define Success Metrics

Types of Evaluators

Metric-Based Evaluations

Human Evaluations

LLM-as-a-Judge

Build Your Evaluation Dataset

Start Small, Grow Intentionally

Collect from Production

Handle Edge Cases

Run Evaluations Effectively

Establish a Baseline

Compare Changes Systematically

Watch for Regressions

Continuous Evaluation in Production

Monitor Key Metrics

Sample Production Traffic

Create a Feedback Loop

Common Pitfalls to Avoid

Next Steps

Get started

Core concepts

Agents

Tools

Evaluation

Integrations

Reference

Cookbook

​Design Your Evaluation Process

​Set Up Tracing Early

​Define Success Metrics

​Types of Evaluators

​Metric-Based Evaluations

​Human Evaluations

​LLM-as-a-Judge

​Build Your Evaluation Dataset

​Start Small, Grow Intentionally

​Collect from Production

​Handle Edge Cases

​Run Evaluations Effectively

​Establish a Baseline

​Compare Changes Systematically

​Watch for Regressions

​Continuous Evaluation in Production

​Monitor Key Metrics

​Sample Production Traffic

​Create a Feedback Loop

​Common Pitfalls to Avoid

​Next Steps

Design Your Evaluation Process

Set Up Tracing Early

Define Success Metrics

Types of Evaluators

Metric-Based Evaluations

Human Evaluations

LLM-as-a-Judge

Build Your Evaluation Dataset

Start Small, Grow Intentionally

Collect from Production

Handle Edge Cases

Run Evaluations Effectively

Establish a Baseline

Compare Changes Systematically

Watch for Regressions

Continuous Evaluation in Production

Monitor Key Metrics

Sample Production Traffic

Create a Feedback Loop

Common Pitfalls to Avoid

Next Steps