Unlike `smoke` (which validates infrastructure), `eval` focuses on collecting performance metrics and judge scores across multiple seeds.
## Usage

### Quick Start

Create an eval config (`eval.toml`):
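The original config listing is missing here. As a sketch of what such a file might contain (the section and key names below are illustrative assumptions, not the tool's documented schema):

```toml
# Hypothetical eval config -- all key names are assumptions
[eval]
task_app_url = "http://localhost:8001"  # endpoint of the task app under test
seeds = [0, 1, 2, 3, 4]                 # seeds to evaluate

[eval.policy]
model = "gpt-4o-mini"                   # model under evaluation
temperature = 0.0
```

Consult the CLI's own reference for the actual option names.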
## Configuration

### Core Options

### Policy Overrides
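The override listing is missing here. As a sketch, policy parameters might be overridden under a dedicated table in the eval config; the `[eval.policy]` table and its keys are assumptions:

```toml
# Hypothetical policy overrides -- key names are assumptions
[eval.policy]
model = "gpt-4o-mini"
temperature = 0.2
max_tokens = 512
```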
Metadata Filtering
Filter tasks by metadata in your eval config.

### Custom Judges
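A custom judge is typically a plain function that scores one rollout and returns a number in 0.0-1.0. This is a toy sketch; the argument names are placeholders, since the actual judge protocol is defined by the eval runner:

```python
def keyword_judge(prompt: str, completion: str) -> float:
    """Toy judge: full score if the completion mentions a target keyword.

    The (prompt, completion) signature is an assumption; check the
    actual judge signature expected by your eval runner.
    """
    return 1.0 if "answer" in completion.lower() else 0.0
```

Whatever the real signature, the return value must be numeric and in the 0.0-1.0 range.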
Add custom scoring functions as judges.

### CLI Options
## Examples

### Basic Evaluation

### Remote Task App

### With Metadata Filtering
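The example listing is missing here. As a sketch, a metadata filter might be expressed as key-value pairs in the eval config; the table name and keys below are illustrative assumptions:

```toml
# Hypothetical metadata filter -- keys are assumptions
[eval.metadata]
difficulty = "easy"
split = "validation"
```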
### With Custom SQL

### Parallel Evaluation

## Output
The command prints:

- Per-seed results (seed, status, metrics, judge scores)
- Aggregate statistics (mean scores, Pearson correlation)
- A detailed table of all evaluations
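To illustrate the aggregate statistics, here is a sketch of how mean scores and a Pearson correlation between two per-seed series could be computed (the data is made up for illustration):

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

judge_scores = [0.8, 0.6, 0.9, 0.7, 1.0]  # made-up per-seed judge scores
env_rewards = [0.7, 0.5, 0.8, 0.6, 0.9]   # made-up per-seed metrics

mean_score = statistics.fmean(judge_scores)
correlation = pearson(judge_scores, env_rewards)
```

A correlation near 1.0 suggests the judge agrees with the environment's own metric; near 0.0 suggests it measures something unrelated.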
## Comparison: `eval` vs `smoke`
| Feature | eval | smoke |
|---|---|---|
| Purpose | Performance evaluation | Infrastructure validation |
| Seeds | Many (5-100+) | Few (1-5) |
| Judges | Custom scoring | None |
| Metrics | Detailed stats | Pass/fail |
| Traces | Always stored | Optional |
| Use case | Model comparison | Deployment testing |
## Troubleshooting
### "No supported models"

Add model info to your task app config.

### "No traces found"
Ensure the task app returns traces:

- Set `TASKAPP_TRACING_ENABLED=1`
- Include `return_trace=true` in the rollout request
- Use `trace_format="structured"`
### Judge errors
- Verify the judge module is importable
- Check the judge function signature
- Ensure the judge returns a numeric score (0.0-1.0)
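These checks can be automated. The sketch below verifies that a judge is importable, callable, and returns an in-range number on dummy input; the module/function names and the two-argument call are assumptions about your judge's signature:

```python
import importlib

def validate_judge(module_name: str, func_name: str) -> None:
    """Import a judge and sanity-check its return value on dummy input."""
    module = importlib.import_module(module_name)      # importable?
    judge = getattr(module, func_name)                 # attribute exists?
    assert callable(judge), f"{func_name} is not callable"
    score = judge("dummy prompt", "dummy completion")  # signature assumption
    assert isinstance(score, (int, float)), "judge must return a number"
    assert 0.0 <= score <= 1.0, "score must be in [0.0, 1.0]"
```

Running this locally before an eval run surfaces import and signature problems early, instead of mid-evaluation.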
### "422 Validation Error"
Check that the rollout payload matches the task app schema:

- `env_name` matches the task app config
- `policy_name` is valid
- The `ops` sequence is supported
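The checklist above maps onto fields of the rollout payload. This sketch shows only those named fields; the overall shape and values are guesses, so consult your task app's actual schema:

```python
# Field names come from the checklist above; the surrounding
# structure and example values are assumptions, not the real schema.
rollout_payload = {
    "env_name": "my-env",      # must match the task app config
    "policy_name": "default",  # must name a valid policy
    "ops": ["agent", "env"],   # sequence must be supported by the task app
    "return_trace": True,      # required for trace collection (see above)
}
```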