Run one-off rollouts against a task app, execute custom judges, and print evaluation summaries. Unlike smoke (which validates infrastructure), eval focuses on collecting performance metrics and judge scores across multiple seeds.

Usage

# Auto-discover eval config
uvx synth-ai eval

# Specify config explicitly
uvx synth-ai eval --config configs/eval.toml

# Quick eval with specific app and model
uvx synth-ai eval my-task-app --model gpt-4o-mini --seeds 0,1,2,3,4

Quick Start

Create an eval config (eval.toml):
[eval]
app_id = "my-task-app"
model = "gpt-4o-mini"
seeds = [0, 1, 2, 3, 4]
split = "train"
trace_db = "traces/v3/eval.db"

# Optional: filter by metadata
[eval.metadata]
difficulty = "easy"

# Optional: custom judges
[[eval.judges]]
name = "accuracy"
module = "my_judges.accuracy"
callable = "judge"
Then run:
uvx synth-ai eval --config eval.toml

Configuration

Core Options

[eval]
app_id = "task-app-name"          # Task app to evaluate
model = "gpt-4o-mini"              # Model to use for inference
seeds = [0, 1, 2, 3, 4]            # Seeds/indices to evaluate
split = "train"                    # Dataset split
trace_db = "traces/v3/eval.db"     # SQLite DB for traces
concurrency = 4                    # Parallel rollouts (default: 1)
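Because the config is plain TOML, it can be sanity-checked before launching a run. A minimal sketch, assuming Python 3.11+ (for tomllib) and the example path configs/eval.toml used earlier:
import tomllib
# Parse the example config and echo the fields the CLI cares about.
with open("configs/eval.toml", "rb") as f:
    cfg = tomllib.load(f)["eval"]
print(cfg["app_id"], cfg["model"], cfg["seeds"], cfg.get("concurrency", 1))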

Policy Overrides

[eval.policy]
temperature = 0.7
max_tokens = 2048
reasoning_effort = "high"
tool_choice = "auto"

Metadata Filtering

Filter tasks by metadata:
[eval.metadata]
difficulty = "hard"
category = "math"
Or use SQL for complex filters:
[eval]
metadata_sql = "SELECT seed FROM tasks WHERE difficulty='hard' LIMIT 10"

Custom Judges

Add custom scoring functions:
[[eval.judges]]
name = "accuracy"
module = "my_judges"               # Python module to import
callable = "accuracy_judge"        # Function name
threshold = 0.8                    # Extra keys like this are passed to the judge as kwargs
Your judge function receives:
def accuracy_judge(payload, **kwargs):
    """
    payload contains:
      - seed: int
      - prompt_index: int | None
      - prompt: str | None
      - completion: str | None
      - metrics: dict
      - response: dict (full rollout response)
      - trace: dict (v3 trace)
    """
    score = 0.0  # compute your score here
    return score  # numeric, typically between 0.0 and 1.0
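For example, a minimal concrete judge might threshold one of the rollout metrics. This is a sketch only: it assumes the metrics dict carries a mean_return value (as in the example output under Output below) and that threshold arrives as a kwarg from the [[eval.judges]] entry:
def accuracy_judge(payload, threshold=0.8, **kwargs):
    # Sketch: assumes metrics includes "mean_return"; adapt to your task app.
    metrics = payload.get("metrics") or {}
    return 1.0 if float(metrics.get("mean_return", 0.0)) >= threshold else 0.0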

CLI Options

--config PATH              # Eval TOML config file
--url URL                  # Remote task app URL (instead of local)
--seeds "0,1,2"            # Comma-separated seeds (default: "0,1,2,3,4")
--split train              # Dataset split (default: "train")
--model MODEL              # Model identifier
--env-file PATH            # Env file(s) to load (repeatable)
--trace-db PATH            # SQLite/Turso URL (default: "traces/v3/synth_ai.db")
--metadata key=value       # Filter by metadata (repeatable)
--metadata-sql "SELECT..." # SQL query to filter seeds

Examples

Basic Evaluation

# Eval first 5 seeds with default model
uvx synth-ai eval my-app --seeds 0,1,2,3,4

Remote Task App

# Eval against deployed task app
uvx synth-ai eval \
  --url https://my-app.modal.run \
  --env-file .env \
  --model gpt-4o-mini \
  --seeds 0,1,2,3,4

With Metadata Filtering

# Only eval "hard" difficulty tasks
uvx synth-ai eval \
  --config eval.toml \
  --metadata difficulty=hard \
  --metadata category=reasoning

With Custom SQL

# Eval specific seeds via SQL query
uvx synth-ai eval \
  --config eval.toml \
  --metadata-sql "SELECT seed FROM tasks WHERE score < 0.5 LIMIT 20"

Parallel Evaluation

[eval]
concurrency = 8  # Run 8 rollouts in parallel
seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

Output

The command prints:
  • Per-seed results (seed, status, metrics, judge scores)
  • Aggregate statistics (mean scores, Pearson correlation between official scores and each judge)
  • Detailed table of all evaluations
Example output:
seed=0 status=200 mean_return=0.85 outcome=1.0
seed=1 status=200 mean_return=0.72 outcome=0.5
seed=2 status=200 mean_return=0.91 outcome=1.0

Eval complete: 3 ok, 0 failed; model=gpt-4o-mini, split=train
Outcome summary: correct=2/3 (66.67%), mean_outcome=0.833

  Official mean: 0.827
  [accuracy] mean: 0.891
    Pearson r: 0.992

Seed  Prompt  Official  accuracy
----  ------  --------  --------
0     0       0.850     0.900
1     1       0.720     0.800
2     2       0.910     0.975
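Every rollout is also written to the SQLite trace DB configured via trace_db, so results can be inspected after the run. A minimal sketch; the table layout depends on the synth-ai version, so list the tables before writing queries:
import sqlite3
# Open the trace DB from the example config and enumerate its tables.
conn = sqlite3.connect("traces/v3/eval.db")
tables = [row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)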

Comparison: eval vs smoke

Feature   eval                    smoke
--------  ----------------------  -------------------------
Purpose   Performance evaluation  Infrastructure validation
Seeds     Many (5-100+)           Few (1-5)
Judges    Custom scoring          None
Metrics   Detailed stats          Pass/fail
Traces    Always stored           Optional
Use case  Model comparison        Deployment testing

Troubleshooting

“No supported models”

Add model info to your task app config:
base_task_info = TaskInfo(
    inference=InferenceConfig(
        model="gpt-4o-mini",
        supported_models=["gpt-4o-mini", "gpt-4o"]
    )
)

“No traces found”

Ensure task app returns traces:
  • Set TASKAPP_TRACING_ENABLED=1
  • Include return_trace=true in rollout request
  • Use trace_format="structured"

Judge errors

  • Verify the judge module is importable
  • Check the judge function signature
  • Ensure the judge returns a numeric score (0.0-1.0); a quick local check is sketched below
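A quick way to rule out the first two issues is to call the judge locally the same way the CLI would: import the configured module, look up the callable, and pass a minimal payload built from the keys documented under Custom Judges (values here are placeholders):
import importlib
# Mirror the [[eval.judges]] entry: module = "my_judges", callable = "accuracy_judge".
judge = getattr(importlib.import_module("my_judges"), "accuracy_judge")
fake_payload = {
    "seed": 0,
    "prompt_index": None,
    "prompt": "2 + 2 = ?",
    "completion": "4",
    "metrics": {},
    "response": {},
    "trace": {},
}
score = judge(fake_payload, threshold=0.8)  # extra config keys arrive as kwargs
assert isinstance(score, (int, float)) and 0.0 <= score <= 1.0, score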

“422 Validation Error”

Check that the rollout payload matches the task app schema (a hypothetical payload is sketched after this list):
  • env_name matches task app config
  • policy_name is valid
  • ops sequence is supported
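For reference, a hypothetical payload sketch using those fields; the names come from this page, the values are placeholders, and the authoritative schema is whatever your task app defines:
# Hypothetical rollout payload sketch: values are placeholders, not defaults.
payload = {
    "env_name": "my-task-app",    # must match the task app config
    "policy_name": "default",     # placeholder; must be a policy the app accepts
    "ops": ["rollout"],           # placeholder; must be an op sequence the app supports
    "return_trace": True,         # see the "No traces found" section above
    "trace_format": "structured",
}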

Next Steps