Why Evaluation Matters
Building AI applications is iterative. Evaluation helps you:
- Track model performance over time and across versions
- Catch regressions before they impact users
- Compare prompts and models to find the best approach
- Debug unexpected behaviors with full request/response visibility
- Build confidence before deploying changes to production
Evaluation Approaches on Heroku
Heroku integrates with leading observability platforms that provide evaluation capabilities tailored for LLM applications.
Tracing with Arize Phoenix
Capture every LLM call with full request/response payloads for post-hoc analysis. Phoenix provides:
- Automatic trace collection with OpenTelemetry
- LLM-as-a-judge evaluations for quality scoring
- Dataset management for building evaluation sets
- Visual debugging of multi-step agent workflows
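To make "full request/response payloads" concrete, here is a minimal sketch of what a trace record might hold. It uses only the Python standard library and is not Phoenix's API: the `traced` decorator, the `TRACES` list, and `call_model` are hypothetical names for illustration. A real setup would export spans through OpenTelemetry rather than append to a list.

```python
import functools
import time
import uuid

# In-memory stand-in for a trace backend (assumption for illustration only).
TRACES = []


def traced(fn):
    """Record the request, response, latency, and any error of each call."""
    @functools.wraps(fn)
    def wrapper(prompt, **kwargs):
        record = {
            "trace_id": uuid.uuid4().hex,
            "name": fn.__name__,
            "request": {"prompt": prompt, **kwargs},
        }
        start = time.perf_counter()
        try:
            record["response"] = fn(prompt, **kwargs)
            return record["response"]
        except Exception as exc:
            # Keep the failure in the trace, then let it propagate.
            record["error"] = repr(exc)
            raise
        finally:
            record["latency_s"] = time.perf_counter() - start
            TRACES.append(record)
    return wrapper


@traced
def call_model(prompt, temperature=0.0):
    # Hypothetical placeholder for a real LLM client call.
    return f"echo: {prompt}"
```

Calling `call_model("hi", temperature=0.2)` appends one record to `TRACES` containing the prompt, the keyword arguments that were passed, the response, and the measured latency.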
Experiment Tracking with W&B Weave
Run experiments on prompts and compare model performance across versions:
- Track prompt iterations and model configurations
- Compare outputs side-by-side
- Log custom metrics and evaluations
- Collaborate with your team on improvements
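The comparison idea can be sketched without any tracking library. The example below is an illustration under stated assumptions, not Weave's API: `run_experiment`, `exact_match`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real model call. It scores two prompt templates against the same small dataset with a custom metric.

```python
def exact_match(output, expected):
    """Custom metric: 1.0 when the output matches the expected answer."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(name, template, model_fn, dataset, metric=exact_match):
    """Run one prompt template over a dataset and collect per-row scores."""
    rows = []
    for example in dataset:
        prompt = template.format(**example["inputs"])
        output = model_fn(prompt)
        rows.append({
            "prompt": prompt,
            "output": output,
            "score": metric(output, example["expected"]),
        })
    mean = sum(r["score"] for r in rows) / len(rows) if rows else 0.0
    return {"name": name, "mean_score": mean, "rows": rows}


def fake_model(prompt):
    # Hypothetical stand-in: terse only when the prompt asks for one word.
    return "Paris" if "one word" in prompt else "The capital is Paris."


dataset = [{"inputs": {"q": "What is the capital of France?"}, "expected": "paris"}]
v1 = run_experiment("v1", "Answer: {q}", fake_model, dataset)
v2 = run_experiment("v2", "Answer in one word: {q}", fake_model, dataset)
```

Keeping the per-row outputs alongside the mean score makes it possible to inspect two versions side by side instead of comparing a single number.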
Production Monitoring with Logfire
Set up comprehensive observability for production workloads:
- Real-time error tracking and alerts
- Latency percentiles (p50, p95, p99)
- Token usage and cost monitoring
- Structured logging for LLM calls
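As a rough illustration of two of the metrics above, the sketch below computes latency percentiles and a token cost estimate with the Python standard library. It does not use Logfire. The function names are hypothetical, and the prices are parameters you would supply yourself, not real rates.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return p50, p95, and p99 from a list of at least two latencies."""
    # n=100 yields 99 cut points; cut point k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def estimate_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Estimate the cost of a call from token counts and per-1k-token prices."""
    return (input_tokens / 1000) * price_in_per_1k \
        + (output_tokens / 1000) * price_out_per_1k
```

Tail percentiles such as p95 and p99 are usually more informative than the mean for LLM calls, because a small number of very slow requests can dominate user-perceived latency.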
Getting Started
Choose an observability tool based on your needs:
- Arize Phoenix: full tracing and LLM evaluation
- W&B Weave: experiment tracking and comparison
- Logfire: production observability
Evaluation Workflow
A typical evaluation workflow on Heroku follows these steps:
- Instrument your application with tracing (Phoenix, Weave, or Logfire)
- Collect traces from development and production
- Build evaluation datasets from interesting cases
- Run evaluations to measure quality metrics
- Iterate and improve prompts, then repeat
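The loop above can be sketched in a few functions. This is an illustration under stated assumptions rather than any platform's API: the trace fields (`prompt`, `latency_s`, `error`), the 2-second threshold, and the 0.02 tolerance are all hypothetical choices.

```python
def select_cases(traces, max_latency_s=2.0):
    """Build an evaluation set from 'interesting' traces: errors or slow calls."""
    return [
        t for t in traces
        if t.get("error") or t.get("latency_s", 0) > max_latency_s
    ]


def evaluate(model_fn, cases, check):
    """Return the fraction of cases whose output passes the check."""
    results = [check(case, model_fn(case["prompt"])) for case in cases]
    return sum(results) / len(results) if results else 0.0


def is_regression(baseline, candidate, tolerance=0.02):
    """Flag a candidate whose pass rate drops below baseline minus tolerance."""
    return candidate < baseline - tolerance


traces = [
    {"prompt": "a", "latency_s": 0.1},
    {"prompt": "b", "latency_s": 3.0},
    {"prompt": "c", "latency_s": 0.2, "error": "timeout"},
]
cases = select_cases(traces)


def check(case, output):
    # Hypothetical check: the output should be the upper-cased prompt.
    return output == case["prompt"].upper()


baseline_rate = evaluate(str.upper, cases, check)
candidate_rate = evaluate(lambda prompt: "B", cases, check)
```

Here `is_regression(baseline_rate, candidate_rate)` flags the candidate, which is the signal to iterate on the prompt and run the evaluation again.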
Next Steps
- Set up Arize Phoenix for comprehensive tracing
- Review Evaluation Best Practices for guidance on building effective evals
- Explore Prompt Patterns to improve your prompts