synth-ai eval executes rollouts against a task app and summarizes judge scores. It supports two execution modes:
  1. Direct Mode: Calls task apps directly (no backend required)
  2. Backend Mode: Routes through backend for trace capture and cost tracking

Quick Start

Basic Usage

# Direct mode (no backend)
python -m synth_ai.cli eval \
  --config banking77_eval.toml \
  --url http://localhost:8103

# Backend mode (with trace capture)
python -m synth_ai.cli eval \
  --config banking77_eval.toml \
  --url http://localhost:8103 \
  --backend http://localhost:8000

Execution Modes

Direct Mode

When Used:
  • --backend flag is NOT provided
  • No SYNTH_BASE_URL or SYNTH_API_KEY environment variables
How It Works:
  1. Creates TaskAppClient with task app URL
  2. Makes direct HTTP POST requests to /rollout endpoint
  3. Collects results and formats output
Limitations:
  • No trace capture via interceptor
  • No automatic token cost calculation
  • No backend job management
Use Cases:
  • Quick local testing
  • Legacy workflows
  • Simple evaluations without cost tracking
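The direct-mode flow above can be sketched in a few lines. Note that the payload fields shown here are assumptions for illustration; consult your task app's /rollout schema for the real field names.

```python
import json
import urllib.request

def build_rollout_payload(seed: int, policy_config: dict) -> dict:
    # Illustrative payload shape; the actual /rollout schema may differ.
    return {"seed": seed, "policy_config": policy_config}

def run_rollout_direct(task_app_url: str, seed: int, policy_config: dict) -> dict:
    """POST one rollout straight to the task app (direct mode, no backend)."""
    data = json.dumps(build_rollout_payload(seed, policy_config)).encode()
    req = urllib.request.Request(
        f"{task_app_url}/rollout",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Because each rollout is an independent HTTP request, running several seeds concurrently is just a matter of issuing these requests in parallel, which is what --concurrency controls.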

Backend Mode

When Used:
  • --backend flag is provided, OR
  • SYNTH_BASE_URL and SYNTH_API_KEY environment variables are set
How It Works:
  1. Create Job: POST /api/eval/jobs with eval configuration
  2. Poll Status: GET /api/eval/jobs/{job_id} until status is "completed" or "failed"
  3. Fetch Results: GET /api/eval/jobs/{job_id}/results for detailed metrics
Benefits:
  • Automatic trace capture via inference interceptor
  • Token usage tracking and cost calculation
  • Centralized job management
  • Support for async execution
See Backend Eval API Reference for API details.
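The create/poll/fetch sequence above can be sketched as a small polling loop. The endpoint paths come from the steps listed; the helper itself (and its status-callback shape) is illustrative, not the CLI's actual implementation.

```python
import time

def wait_for_job(get_status, job_id: str, poll_secs: float = 2.0,
                 timeout_secs: float = 600.0) -> str:
    """Poll GET /api/eval/jobs/{job_id} until a terminal state (step 2).

    `get_status` is any callable returning the job's current status string,
    so the HTTP layer can be swapped in or stubbed out.
    """
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        status = get_status(job_id)
        if status in ("completed", "failed"):
            return status  # caller then fetches /api/eval/jobs/{job_id}/results
        time.sleep(poll_secs)
    raise TimeoutError(f"eval job {job_id} did not finish within {timeout_secs}s")
```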

Configuration

Config File Format

Legacy Eval Format:
[eval]
app_id = "banking77"
url = "http://localhost:8103"
env_name = "banking77"
seeds = [0, 1, 2, 3, 4]
concurrency = 5

[eval.policy_config]
model = "gpt-4"
provider = "openai"
inference_url = "https://api.openai.com/v1"

[eval.env_config]
split = "test"
Prompt Learning Format:
[prompt_learning]
task_app_id = "banking77"
task_app_url = "http://localhost:8103"

[prompt_learning.gepa]
env_name = "banking77"

[prompt_learning.gepa.evaluation]
seeds = [0, 1, 2, 3, 4]

Options

Core (each can be resolved interactively or automatically when omitted):
  • APP_ID — Task app identifier. Omit to choose from the interactive picker.
  • --url VALUE or url in config — Base URL of a running task app. When omitted, the CLI starts a temporary local server automatically.
  • --seeds VALUE or seeds in config — Seeds to evaluate (comma-separated, e.g. "0,1,2")
Optional:
  • --config PATH — Optional TOML config. If omitted the CLI auto-discovers the first *.toml that matches the task app.
  • --backend VALUE — Backend URL (enables backend mode for trace capture)
  • --model VALUE — Override the model in the config/metadata
  • --concurrency VALUE — Number of parallel rollouts (default: 1)
  • --return-trace — Include traces in response
  • --traces-dir PATH — Directory to save trace files
  • --output-json PATH — Path to write JSON report
  • --output-txt PATH — Path to write text report
  • --split VALUE — Dataset split name to request (train, validation, etc.)
  • --env-file PATH — One or more .env files holding credentials. Repeat the flag to merge multiple files.
  • --trace-db PATH — SQLite/Turso path for trace persistence. Use none to disable saving traces.
  • --metadata KEY=VALUE — Filter tasks by metadata key/value pairs (repeatable).
  • --metadata-sql QUERY — Advanced seed selection using a SQLite query (returns seeds to evaluate).
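To illustrate the kind of query --metadata-sql expects (a query whose result rows are the seeds to evaluate), here is a standalone SQLite example; the table and column names are invented for the demonstration and are not the CLI's actual schema:

```python
import sqlite3

# Build a throwaway task-metadata table and select seeds the way a
# --metadata-sql query would: each result row yields one seed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (seed INTEGER, difficulty TEXT, split TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [(0, "easy", "validation"), (1, "hard", "validation"), (2, "hard", "train")],
)
seeds = [row[0] for row in conn.execute(
    "SELECT seed FROM tasks WHERE difficulty = 'hard' AND split = 'validation'"
)]
print(seeds)  # [1]
```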
Example:
uvx synth-ai eval grpo-crafter --config configs/eval.toml --seeds 0,5,9
Example session (local server auto-started):
synth@Nomans-Resolve sdk % uvx synth-ai eval grpo-crafter --seeds 0,1,2
Starting temporary grpo-crafter server on port 8765...
Waiting for server to start...
Server started
Using env file(s): /Users/synth/qa/sdk/.env
Evaluating seeds: 0, 1, 2
Official mean: 0.742
[judge qa_accuracy] mean: 0.810
    Pearson r: 0.64
  Seed  Prompt  Official  qa_accuracy
  0     12      0.750     0.833
  1     98      0.708     0.792
  2     45      0.769     0.805
Stopping temporary server...

Environment Variables

  • SYNTH_BASE_URL — Backend URL (alternative to --backend flag)
  • SYNTH_API_KEY — Backend API key (required for backend mode)
  • ENVIRONMENT_API_KEY — Task app API key (for task app authentication)

Output Formats

Console Output

Results are printed as a formatted table:
seed | score | mean_return | outcome | events | latency_ms | verifier | tokens | cost_usd | error
-----+-------+-------------+---------+--------+------------+----------+--------+----------+-------
0    | 0.95  | 0.95        | 0.95    | 0.95   | 1234.5     | -        | 150    | 0.002    | -
1    | 0.92  | 0.92        | 0.92    | 0.92   | 1156.2     | -        | 145    | 0.0019   | -
avg  | 0.935 | 0.935       | 0.935   | 0.935  | 1195.35    | -        | 147    | 0.00195  | -

JSON Output

Use --output-json to write results to a JSON file:
{
  "config": {
    "app_id": "banking77",
    "task_app_url": "http://localhost:8103",
    "env_name": "banking77",
    "seeds": [0, 1, 2, 3, 4],
    "policy_config": {
      "model": "gpt-4"
    }
  },
  "results": [
    {
      "seed": 0,
      "score": 0.95,
      "mean_return": 0.95,
      "tokens": 150,
      "cost_usd": 0.002,
      "latency_ms": 1234.5
    }
  ]
}
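A quick way to post-process such a report, for example to average the per-seed scores (the field names match the sample report above):

```python
import json

def mean_score(report: dict) -> float:
    """Average the per-seed `score` values from an --output-json report."""
    scores = [r["score"] for r in report["results"]]
    return sum(scores) / len(scores)

report = json.loads("""
{
  "config": {"app_id": "banking77", "seeds": [0, 1]},
  "results": [
    {"seed": 0, "score": 0.95},
    {"seed": 1, "score": 0.91}
  ]
}
""")
print(round(mean_score(report), 3))  # 0.93
```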

Trace Files

Use --traces-dir to save traces as individual JSON files:
traces/
  seed_0_trace.json
  seed_1_trace.json
  seed_2_trace.json
  ...
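These per-seed files can be gathered back into memory with a small glob, assuming each file is a standalone JSON document as the layout above suggests:

```python
import json
from pathlib import Path

def load_traces(traces_dir: str) -> dict:
    """Map seed -> parsed trace for every seed_*_trace.json in the directory."""
    traces = {}
    for path in sorted(Path(traces_dir).glob("seed_*_trace.json")):
        seed = int(path.stem.split("_")[1])  # "seed_0_trace" -> 0
        traces[seed] = json.loads(path.read_text())
    return traces
```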

Code Reference

CLI Implementation:
  • Entry Point: synth_ai/cli/commands/eval/core.py
  • Execution Logic: synth_ai/cli/commands/eval/runner.py
  • Config Loading: synth_ai/cli/commands/eval/config.py
Backend API:
  • Routes: monorepo/backend/app/routes/eval/routes.py
  • Service: monorepo/backend/app/routes/eval/job_service.py
  • See Backend Eval API Reference for details

Notes

  • Remote evaluations require both SYNTH_API_KEY and ENVIRONMENT_API_KEY in your env files so the CLI can forward authentication headers.
  • Trace DB paths accept both file locations and SQLAlchemy URLs (sqlite+aiosqlite:///absolute/path.db). Set --trace-db none if you want to keep runs ephemeral.
  • Metadata filters stack: you can combine --split validation, --metadata difficulty=hard, and an SQL query to target highly specific samples.
  • When the CLI starts a temporary server it automatically chooses an open port, injects credentials, and cleans up once evaluation finishes.
  • Backend mode provides automatic trace capture via the inference interceptor, enabling token usage tracking and cost calculation.