Run one-off rollouts against a task app, execute custom judges, and print evaluation summaries. Unlike smoke (which validates infrastructure), eval focuses on collecting performance metrics and judge scores across multiple seeds.

Usage

# Auto-discover eval config
uvx synth-ai eval

# Specify config explicitly
uvx synth-ai eval --config configs/eval.toml

# Quick eval with specific app and model
uvx synth-ai eval my-task-app --model gpt-4o-mini --seeds 0,1,2,3,4

Quick Start

Create an eval config (eval.toml):
[eval]
app_id = "my-task-app"
model = "gpt-4o-mini"
seeds = [0, 1, 2, 3, 4]
split = "train"
trace_db = "traces/v3/eval.db"

# Optional: filter by metadata
[eval.metadata]
difficulty = "easy"

# Optional: custom judges
[[eval.judges]]
name = "accuracy"
module = "my_judges.accuracy"
callable = "judge"
Then run:
uvx synth-ai eval --config eval.toml

Configuration

Core Options

[eval]
app_id = "task-app-name"          # Task app to evaluate
model = "gpt-4o-mini"              # Model to use for inference
seeds = [0, 1, 2, 3, 4]            # Seeds/indices to evaluate
split = "train"                    # Dataset split
trace_db = "traces/v3/eval.db"     # SQLite DB for traces
concurrency = 4                    # Parallel rollouts (default: 1)
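Because the config is plain TOML, it can be sanity-checked before launching a run. A minimal sketch, assuming Python 3.11+ (for tomllib) and the example path configs/eval.toml used earlier:
import tomllib
# Parse the example config and echo the fields the CLI cares about.
with open("configs/eval.toml", "rb") as f:
    cfg = tomllib.load(f)["eval"]
print(cfg["app_id"], cfg["model"], cfg["seeds"], cfg.get("concurrency", 1))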

Policy Overrides

[eval.policy]
temperature = 0.7
max_tokens = 2048
reasoning_effort = "high"
tool_choice = "auto"

Metadata Filtering

Filter tasks by metadata:
[eval.metadata]
difficulty = "hard"
category = "math"
Or use SQL for complex filters:
[eval]
metadata_sql = "SELECT seed FROM tasks WHERE difficulty='hard' LIMIT 10"

Custom Judges

Add custom scoring functions:
[[eval.judges]]
name = "accuracy"
module = "my_judges"               # Python module to import
callable = "accuracy_judge"        # Function name
threshold = 0.8                    # Extra keys like this are passed to the judge as kwargs
Your judge function receives:
def accuracy_judge(payload, **kwargs):
    """
    payload contains:
      - seed: int
      - prompt_index: int | None
      - prompt: str | None
      - completion: str | None
      - metrics: dict
      - response: dict (full rollout response)
      - trace: dict (v3 trace)
    """
    score = 0.0  # compute your score here
    return score  # numeric, typically between 0.0 and 1.0
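For example, a minimal concrete judge might threshold one of the rollout metrics. This is a sketch only: it assumes the metrics dict carries a mean_return value (as in the example output under Output below) and that threshold arrives as a kwarg from the [[eval.judges]] entry:
def accuracy_judge(payload, threshold=0.8, **kwargs):
    # Sketch: assumes metrics includes "mean_return"; adapt to your task app.
    metrics = payload.get("metrics") or {}
    return 1.0 if float(metrics.get("mean_return", 0.0)) >= threshold else 0.0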

CLI Options

--config PATH              # Eval TOML config file
--url URL                  # Remote task app URL (instead of local)
--seeds "0,1,2"            # Comma-separated seeds (default: "0,1,2,3,4")
--split train              # Dataset split (default: "train")
--model MODEL              # Model identifier
--env-file PATH            # Env file(s) to load (repeatable)
--trace-db PATH            # SQLite/Turso URL (default: "traces/v3/synth_ai.db")
--metadata key=value       # Filter by metadata (repeatable)
--metadata-sql "SELECT..." # SQL query to filter seeds

Examples

Basic Evaluation

# Eval first 5 seeds with default model
uvx synth-ai eval my-app --seeds 0,1,2,3,4

Remote Task App

# Eval against deployed task app
uvx synth-ai eval \
  --url https://my-app.modal.run \
  --env-file .env \
  --model gpt-4o-mini \
  --seeds 0,1,2,3,4

With Metadata Filtering

# Only eval "hard" difficulty tasks
uvx synth-ai eval \
  --config eval.toml \
  --metadata difficulty=hard \
  --metadata category=reasoning

With Custom SQL

# Eval specific seeds via SQL query
uvx synth-ai eval \
  --config eval.toml \
  --metadata-sql "SELECT seed FROM tasks WHERE score < 0.5 LIMIT 20"

Parallel Evaluation

[eval]
concurrency = 8  # Run 8 rollouts in parallel
seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

Output

The command prints:
  • Per-seed results (seed, status, metrics, judge scores)
  • Aggregate statistics (mean scores, Pearson correlation between official scores and each judge)
  • Detailed table of all evaluations
Example output:
seed=0 status=200 mean_return=0.85 outcome=1.0
seed=1 status=200 mean_return=0.72 outcome=0.5
seed=2 status=200 mean_return=0.91 outcome=1.0

Eval complete: 3 ok, 0 failed; model=gpt-4o-mini, split=train
Outcome summary: correct=2/3 (66.67%), mean_outcome=0.833

  Official mean: 0.827
  [accuracy] mean: 0.891
    Pearson r: 0.992

Seed  Prompt  Official  accuracy
----  ------  --------  --------
0     0       0.850     0.900
1     1       0.720     0.800
2     2       0.910     0.975
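Every rollout is also written to the SQLite trace DB configured via trace_db, so results can be inspected after the run. A minimal sketch; the table layout depends on the synth-ai version, so list the tables before writing queries:
import sqlite3
# Open the trace DB from the example config and enumerate its tables.
conn = sqlite3.connect("traces/v3/eval.db")
tables = [row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)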

Comparison: eval vs smoke

Feature   eval                    smoke
--------  ----------------------  -------------------------
Purpose   Performance evaluation  Infrastructure validation
Seeds     Many (5-100+)           Few (1-5)
Judges    Custom scoring          None
Metrics   Detailed stats          Pass/fail
Traces    Always stored           Optional
Use case  Model comparison        Deployment testing

Troubleshooting

“No supported models”

Add model info to your task app config:
base_task_info = TaskInfo(
    inference=InferenceConfig(
        model="gpt-4o-mini",
        supported_models=["gpt-4o-mini", "gpt-4o"]
    )
)

“No traces found”

Ensure task app returns traces:
  • Set TASKAPP_TRACING_ENABLED=1
  • Include return_trace=true in rollout request
  • Use trace_format="structured"

Judge errors

  • Verify the judge module is importable
  • Check the judge function signature
  • Ensure the judge returns a numeric score (0.0-1.0); a quick local check is sketched below
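A quick way to rule out the first two issues is to call the judge locally the same way the CLI would: import the configured module, look up the callable, and pass a minimal payload built from the keys documented under Custom Judges (values here are placeholders):
import importlib
# Mirror the [[eval.judges]] entry: module = "my_judges", callable = "accuracy_judge".
judge = getattr(importlib.import_module("my_judges"), "accuracy_judge")
fake_payload = {
    "seed": 0,
    "prompt_index": None,
    "prompt": "2 + 2 = ?",
    "completion": "4",
    "metrics": {},
    "response": {},
    "trace": {},
}
score = judge(fake_payload, threshold=0.8)  # extra config keys arrive as kwargs
assert isinstance(score, (int, float)) and 0.0 <= score <= 1.0, score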

“422 Validation Error”

Check that the rollout payload matches the task app schema (a hypothetical payload is sketched after this list):
  • env_name matches task app config
  • policy_name is valid
  • ops sequence is supported
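For reference, a hypothetical payload sketch using those fields; the names come from this page, the values are placeholders, and the authoritative schema is whatever your task app defines:
# Hypothetical rollout payload sketch: values are placeholders, not defaults.
payload = {
    "env_name": "my-task-app",    # must match the task app config
    "policy_name": "default",     # placeholder; must be a policy the app accepts
    "ops": ["rollout"],           # placeholder; must be an op sequence the app supports
    "return_trace": True,         # see the "No traces found" section above
    "trace_format": "structured",
}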

Next Steps