
synth_ai.cli.commands.eval.runner

Eval runner for executing rollouts against task apps. This module provides two execution modes:
  1. Backend Mode (Default): Routes through backend interceptor for trace/usage capture
    • Creates eval job via POST /api/eval/jobs
    • Polls job status until completion
    • Fetches detailed results with token costs and traces
    • Requires backend_url and backend_api_key (or SYNTH_BASE_URL/SYNTH_API_KEY env vars)
  2. Direct Mode: Calls task apps directly (legacy, no usage tracking)
    • Makes direct HTTP requests to task app /rollout endpoint
    • No trace capture or usage tracking
    • Simpler but limited functionality
Usage:
from synth_ai.cli.commands.eval.runner import run_eval
from synth_ai.cli.commands.eval.config import EvalRunConfig

config = EvalRunConfig(
    app_id="banking77",
    task_app_url="http://localhost:8103",
    env_name="banking77",
    seeds=[0, 1, 2],
    policy_config={"model": "gpt-4"},
)

results = await run_eval(config)
CLI Usage:
# Direct mode (no backend)
python -m synth_ai.cli eval \
    --config banking77_eval.toml \
    --url http://localhost:8103

# Backend mode (with trace capture)
python -m synth_ai.cli eval \
    --config banking77_eval.toml \
    --url http://localhost:8103 \
    --backend http://localhost:8000
See Also:
  • synth_ai.cli.commands.eval.config: Configuration loading
  • monorepo/backend/app/routes/eval/job_service.py: Backend eval job service

Functions

run_eval

run_eval(config: EvalRunConfig) -> list[EvalResult]
Run evaluation against a task app. Automatically selects execution mode based on configuration:
  • Backend mode: Used if backend_url and backend_api_key are provided (or SYNTH_BASE_URL/SYNTH_API_KEY env vars are set)
  • Direct mode: Used otherwise (calls task app directly)
Args:
  • config: Evaluation configuration including task app URL, seeds, policy config, etc.
Returns:
  • List of EvalResult objects, one per seed, sorted by seed number.
Raises:
  • ValueError: If required configuration is missing (task_app_url, seeds, etc.)
  • RuntimeError: If backend job creation or polling fails

run_eval_direct

run_eval_direct(config: EvalRunConfig) -> list[EvalResult]
Direct mode: call task apps directly, bypassing the backend. Makes direct HTTP requests to the task app’s /rollout endpoint. This mode does NOT capture traces or track token usage via the backend interceptor.
Use Cases:
  • Quick local testing without backend setup
  • Legacy workflows that don’t need trace capture
  • Simple evaluations without cost tracking
Limitations:
  • No trace capture (traces must be returned by task app if needed)
  • No token cost calculation (unless task app provides it)
  • No backend interceptor for LLM call tracking
Args:
  • config: Evaluation configuration. Must include task_app_url and seeds.
Returns:
  • List of EvalResult objects, one per seed.
Raises:
  • ValueError: If task_app_url or seeds are missing.
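A hypothetical sketch of the per-seed request preparation in direct mode. The payload field names (env_name, seed, policy_config) mirror EvalRunConfig but are assumptions, not the task app's actual /rollout schema; the HTTP call itself is omitted.

```python
def build_rollout_requests(task_app_url, env_name, seeds, policy_config):
    """Return one (url, payload) pair per seed; sending is left to the caller."""
    # Mirrors the documented ValueError for missing required config.
    if not task_app_url or not seeds:
        raise ValueError("task_app_url and seeds are required")
    endpoint = task_app_url.rstrip("/") + "/rollout"
    return [
        (endpoint, {"env_name": env_name, "seed": seed, "policy_config": policy_config})
        for seed in seeds
    ]
```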

run_eval_via_backend

run_eval_via_backend(config: EvalRunConfig, backend_url: str, api_key: str) -> list[EvalResult]
Backend mode: Route through backend interceptor for trace/usage capture. This mode creates an eval job on the backend, which:
  1. Routes LLM calls through the inference interceptor
  2. Captures traces and token usage automatically
  3. Calculates costs based on model pricing
  4. Provides detailed results with timing and metrics
Flow:
  1. POST /api/eval/jobs - Create eval job
  2. Poll GET /api/eval/jobs/{job_id} - Check job status until completed
  3. GET /api/eval/jobs/{job_id}/results - Fetch detailed results
Benefits:
  • Automatic trace capture via interceptor
  • Token usage tracking and cost calculation
  • Centralized job management and monitoring
  • Support for async job execution
Args:
  • config: Evaluation configuration including task app URL, seeds, policy config.
  • backend_url: Backend API base URL (e.g., “http://localhost:8000”)
  • api_key: Backend API key for authentication (Bearer token)
Returns:
  • List of EvalResult objects with detailed metrics including tokens, costs, traces.
Raises:
  • ValueError: If required configuration is missing.
  • RuntimeError: If job creation, polling, or result fetching fails.
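The create/poll/fetch flow above can be sketched with the HTTP calls injected as plain callables, so only the control flow is shown. The endpoint paths come from this page; the status values ("completed", "failed") and the timeout behavior are assumptions.

```python
import time

def poll_until_done(get_status, interval_s=2.0, timeout_s=600.0, sleep=time.sleep):
    """Poll an eval job's status until it finishes (sketch, assumed statuses)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()  # e.g. GET /api/eval/jobs/{job_id}
        if status == "completed":
            return status
        if status == "failed":
            raise RuntimeError("eval job failed")
        sleep(interval_s)
    raise RuntimeError("timed out waiting for eval job")
```

After polling succeeds, the runner fetches GET /api/eval/jobs/{job_id}/results for the detailed metrics.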

fetch_traces_from_backend

fetch_traces_from_backend(job_id: str, backend_url: str, api_key: str, output_dir: str) -> str
Download traces zip from backend and extract to output_dir. Returns path to the extracted traces directory.
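The download endpoint is not documented on this page, so the sketch below covers only the local extraction step: given a traces zip already fetched from the backend, unpack it under output_dir and return the extracted directory, as the return value above describes.

```python
import zipfile
from pathlib import Path

def extract_traces_zip(zip_path: str, output_dir: str) -> str:
    """Extract a downloaded traces zip into output_dir/traces (sketch)."""
    traces_dir = Path(output_dir) / "traces"
    traces_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(traces_dir)
    return str(traces_dir)
```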

format_eval_table

format_eval_table(results: list[EvalResult]) -> str
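No docstring is provided for this function, so the EvalResult fields are unknown here. A minimal sketch of what a per-seed results table might look like, with seed/score as assumed attributes:

```python
def format_table(rows: list[tuple[int, float]]) -> str:
    """Render (seed, score) pairs as an aligned text table (illustrative only)."""
    lines = [f"{'seed':>6}  {'score':>8}"]
    for seed, score in sorted(rows):
        lines.append(f"{seed:>6}  {score:>8.3f}")
    return "\n".join(lines)
```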

format_eval_report

format_eval_report(config: EvalRunConfig, results: list[EvalResult]) -> str

save_traces

save_traces(results: list[EvalResult], traces_dir: str) -> int
Save traces to individual JSON files in the given directory. Returns the number of traces saved.
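A sketch matching the behavior described above: write one JSON file per trace and return the count. The trace payload shape and the file-naming scheme are assumptions.

```python
import json
from pathlib import Path

def save_traces_sketch(traces: list[dict], traces_dir: str) -> int:
    """Write each trace to its own JSON file; return how many were saved (sketch)."""
    out = Path(traces_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, trace in enumerate(traces):
        (out / f"trace_{i}.json").write_text(json.dumps(trace, indent=2))
    return len(traces)
```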

Classes

EvalResult