synth-ai eval executes rollouts against a task app and summarizes judge scores. It supports two execution modes:
- Direct Mode: Calls task apps directly (no backend required)
- Backend Mode: Routes through backend for trace capture and cost tracking
Quick Start
Basic Usage
Execution Modes
Direct Mode
When Used:
- --backend flag is NOT provided
- No SYNTH_BASE_URL or SYNTH_API_KEY environment variables
Flow:
- Creates TaskAppClient with the task app URL
- Makes direct HTTP POST requests to the /rollout endpoint
- Collects results and formats output
Limitations:
- No trace capture via interceptor
- No automatic token cost calculation
- No backend job management
Use Cases:
- Quick local testing
- Legacy workflows
- Simple evaluations without cost tracking
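The direct-mode flow described in this section can be sketched roughly as follows. This is an illustration only, not the CLI's actual code: `run_direct`, the injected `post` callable, and the `{"seed": ...}` payload are assumptions made for the example; only the /rollout endpoint comes from the description above.

```python
from typing import Any, Callable, Dict, List

# Hypothetical transport: takes (url, json_payload) and returns the decoded JSON response.
PostFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_direct(task_app_url: str, seeds: List[int], post: PostFn) -> List[Dict[str, Any]]:
    """POST one rollout request per seed to the task app and collect the responses."""
    endpoint = task_app_url.rstrip("/") + "/rollout"
    results: List[Dict[str, Any]] = []
    for seed in seeds:
        # Assumed payload shape; the real request schema is not shown in this document.
        response = post(endpoint, {"seed": seed})
        results.append({"seed": seed, "response": response})
    return results
```

Passing the transport in as a callable keeps the sketch independent of any particular HTTP library and makes it easy to exercise with a stub.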
Backend Mode
When Used:
- --backend flag is provided, OR
- SYNTH_BASE_URL and SYNTH_API_KEY environment variables are set
Flow:
- Create Job: POST /api/eval/jobs with eval configuration
- Poll Status: GET /api/eval/jobs/{job_id} until status is “completed” or “failed”
- Fetch Results: GET /api/eval/jobs/{job_id}/results for detailed metrics
Benefits:
- Automatic trace capture via inference interceptor
- Token usage tracking and cost calculation
- Centralized job management
- Support for async execution
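The create, poll, and fetch sequence for backend mode can be sketched as below. This is a hedged illustration rather than the CLI's implementation: the `request` callable, the `job_id` and `status` response fields, the polling interval, and the choice to raise on a failed job are all assumptions; only the three endpoints and the two terminal statuses come from this page.

```python
import time
from typing import Any, Callable, Dict, Optional

# Hypothetical transport: (method, path, json_body_or_None) -> decoded JSON response.
RequestFn = Callable[[str, str, Optional[Dict[str, Any]]], Dict[str, Any]]


def run_backend_eval(
    request: RequestFn,
    eval_config: Dict[str, Any],
    poll_interval: float = 2.0,
    max_polls: int = 300,
) -> Dict[str, Any]:
    # 1. Create the job from the eval configuration.
    job = request("POST", "/api/eval/jobs", eval_config)
    job_id = job["job_id"]  # assumed field name

    # 2. Poll until the job reaches a terminal status.
    for _ in range(max_polls):
        status = request("GET", f"/api/eval/jobs/{job_id}", None)["status"]  # assumed field name
        if status in ("completed", "failed"):
            break
        time.sleep(poll_interval)
    else:
        raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")

    if status == "failed":
        # Assumed behaviour: surface the failure instead of fetching results.
        raise RuntimeError(f"eval job {job_id} failed")

    # 3. Fetch the detailed metrics.
    return request("GET", f"/api/eval/jobs/{job_id}/results", None)
```

The `max_polls` cap is a design choice for the sketch so that a stuck job cannot block the caller forever.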
Configuration
Config File Format
Legacy Eval Format:
Options
Required:
- APP_ID: Task app identifier. Omit for the interactive picker.
- --url VALUE or url in config: Base URL of a running task app. When omitted the CLI spins up a local server automatically.
- Seeds: Either --seeds VALUE or seeds in config (comma-separated: "0,1,2")
Optional:
- --config PATH: Optional TOML config. If omitted the CLI auto-discovers the first *.toml that matches the task app.
- --backend VALUE: Backend URL (enables backend mode for trace capture)
- --model VALUE: Override the model in the config/metadata
- --concurrency VALUE: Number of parallel rollouts (default: 1)
- --return-trace: Include traces in response
- --traces-dir PATH: Directory to save trace files
- --output-json PATH: Path to write JSON report
- --output-txt PATH: Path to write text report
- --split VALUE: Dataset split name to request (train, validation, etc.)
- --env-file PATH: One or more .env files holding credentials. Repeat the flag to merge multiple files.
- --trace-db PATH: SQLite/Turso path for trace persistence. Use none to disable saving traces.
- --metadata KEY=VALUE: Filter tasks by metadata key/value pairs (repeatable).
- --metadata-sql QUERY: Advanced seed selection using a SQLite query (returns seeds to evaluate).
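To make the value formats above concrete, here is a small sketch of how a comma-separated seeds string and repeated KEY=VALUE metadata arguments could be parsed. The helper names `parse_seeds` and `parse_metadata` are hypothetical and the CLI's own parsing may differ, for example in how it treats whitespace or duplicate keys.

```python
from typing import Dict, List


def parse_seeds(value: str) -> List[int]:
    """Turn a comma-separated string such as "0,1,2" into a list of ints."""
    # Empty fragments (e.g. from a trailing comma) are skipped.
    return [int(part) for part in value.split(",") if part.strip()]


def parse_metadata(pairs: List[str]) -> Dict[str, str]:
    """Turn repeated KEY=VALUE arguments into a dict; later keys overwrite earlier ones."""
    filters: Dict[str, str] = {}
    for pair in pairs:
        key, sep, val = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        filters[key.strip()] = val.strip()
    return filters
```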
Environment Variables
- SYNTH_BASE_URL: Backend URL (alternative to --backend flag)
- SYNTH_API_KEY: Backend API key (required for backend mode)
- ENVIRONMENT_API_KEY: Task app API key (for task app authentication)
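Taken together with the Execution Modes section, the flag and the environment variables decide which mode runs. A minimal sketch of that decision is shown below; `select_mode` is a hypothetical helper, and it follows the Backend Mode wording (both variables must be set), which is an interpretation rather than confirmed CLI behaviour.

```python
import os
from typing import Mapping, Optional


def select_mode(backend_flag: Optional[str], env: Mapping[str, str] = os.environ) -> str:
    """Return "backend" or "direct" following the rules described on this page."""
    # An explicit --backend value always selects backend mode.
    if backend_flag:
        return "backend"
    # Otherwise both environment variables must be present and non-empty.
    if env.get("SYNTH_BASE_URL") and env.get("SYNTH_API_KEY"):
        return "backend"
    return "direct"
```

For example, `select_mode(None, {"SYNTH_BASE_URL": "https://backend.example"})` falls back to direct mode under this reading because the API key is missing.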
Output Formats
Console Output
Results are printed as a formatted table.
JSON Output
Use --output-json to write results to a JSON file.
Trace Files
Use --traces-dir to save traces as individual JSON files.
Code Reference
CLI Implementation:
- Entry Point: synth_ai/cli/commands/eval/core.py
- Execution Logic: synth_ai/cli/commands/eval/runner.py
- Config Loading: synth_ai/cli/commands/eval/config.py
Backend Implementation:
- Routes: monorepo/backend/app/routes/eval/routes.py
- Service: monorepo/backend/app/routes/eval/job_service.py
- See Backend Eval API Reference for details
Notes
- Remote evaluations require both SYNTH_API_KEY and ENVIRONMENT_API_KEY in your env files so the CLI can forward authentication headers.
- Trace DB paths accept both file locations and SQLAlchemy URLs (sqlite+aiosqlite:///absolute/path.db). Set --trace-db none if you want to keep runs ephemeral.
- Metadata filters stack: you can combine --split validation, --metadata difficulty=hard, and an SQL query to target highly specific samples.
- When the CLI starts a temporary server it automatically chooses an open port, injects credentials, and cleans up once evaluation finishes.
- Backend mode provides automatic trace capture via the inference interceptor, enabling token usage tracking and cost calculation.
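As an illustration of the note on stacking metadata filters, the snippet below shows the kind of SQLite query that returns seeds matching both a split and a metadata value. The `tasks` table and its columns are entirely hypothetical; the schema that --metadata-sql actually queries is not described on this page.

```python
import sqlite3

# Hypothetical task-metadata table, built in memory purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (seed INTEGER, split TEXT, difficulty TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [
        (0, "train", "easy"),
        (1, "validation", "hard"),
        (2, "validation", "easy"),
        (3, "validation", "hard"),
    ],
)

# Combine a split filter and a metadata filter; the query returns the seeds to evaluate.
query = """
    SELECT seed FROM tasks
    WHERE split = ? AND difficulty = ?
    ORDER BY seed
"""
seeds = [row[0] for row in conn.execute(query, ("validation", "hard"))]
```

With the sample rows above, only the seeds whose split is validation and whose difficulty is hard are selected.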