
synth_ai.cli.commands.eval.runner

Eval runner for executing rollouts against task apps. This module provides two execution modes:
  1. Backend Mode (Default): Routes through backend interceptor for trace/usage capture
    • Creates eval job via POST /api/eval/jobs
    • Polls job status until completion
    • Fetches detailed results with token costs and traces
    • Requires backend_url and backend_api_key (or SYNTH_BASE_URL/SYNTH_API_KEY env vars)
  2. Direct Mode: Calls task apps directly (legacy, no usage tracking)
    • Makes direct HTTP requests to task app /rollout endpoint
    • No trace capture or usage tracking
    • Simpler but limited functionality
Usage:
from synth_ai.cli.commands.eval.runner import run_eval
from synth_ai.cli.commands.eval.config import EvalRunConfig

config = EvalRunConfig(
    app_id="banking77",
    task_app_url="http://localhost:8103",
    env_name="banking77",
    seeds=[0, 1, 2],
    policy_config={"model": "gpt-4"},
)

results = await run_eval(config)
CLI Usage:
# Direct mode (no backend)
python -m synth_ai.cli eval \
    --config banking77_eval.toml \
    --url http://localhost:8103

# Backend mode (with trace capture)
python -m synth_ai.cli eval \
    --config banking77_eval.toml \
    --url http://localhost:8103 \
    --backend http://localhost:8000
See Also:
  • synth_ai.cli.commands.eval.config: Configuration loading
  • monorepo/backend/app/routes/eval/job_service.py: Backend eval job service

Functions

run_eval

run_eval(config: EvalRunConfig) -> list[EvalResult]
Run evaluation against a task app. Automatically selects execution mode based on configuration:
  • Backend mode: Used if backend_url and backend_api_key are provided (or SYNTH_BASE_URL/SYNTH_API_KEY env vars are set)
  • Direct mode: Used otherwise (calls task app directly)
Args:
  • config: Evaluation configuration including task app URL, seeds, policy config, etc.
Returns:
  • List of EvalResult objects, one per seed, sorted by seed number.
Raises:
  • ValueError: If required configuration is missing (task_app_url, seeds, etc.)
  • RuntimeError: If backend job creation or polling fails

run_eval_direct

run_eval_direct(config: EvalRunConfig) -> list[EvalResult]
Direct mode: call task apps directly, bypassing the backend. Makes direct HTTP requests to the task app’s /rollout endpoint. This mode does NOT capture traces or track token usage via the backend interceptor.
Use Cases:
  • Quick local testing without backend setup
  • Legacy workflows that don’t need trace capture
  • Simple evaluations without cost tracking
Limitations:
  • No trace capture (traces must be returned by task app if needed)
  • No token cost calculation (unless task app provides it)
  • No backend interceptor for LLM call tracking
Args:
  • config: Evaluation configuration. Must include task_app_url and seeds.
Returns:
  • List of EvalResult objects, one per seed.
Raises:
  • ValueError: If task_app_url or seeds are missing.
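A hypothetical sketch of the per-seed request preparation in direct mode. The payload field names (env_name, seed, policy_config) mirror EvalRunConfig but are assumptions, not the task app's actual /rollout schema; the HTTP call itself is omitted.

```python
def build_rollout_requests(task_app_url, env_name, seeds, policy_config):
    """Return one (url, payload) pair per seed; sending is left to the caller."""
    # Mirrors the documented ValueError for missing required config.
    if not task_app_url or not seeds:
        raise ValueError("task_app_url and seeds are required")
    endpoint = task_app_url.rstrip("/") + "/rollout"
    return [
        (endpoint, {"env_name": env_name, "seed": seed, "policy_config": policy_config})
        for seed in seeds
    ]
```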

run_eval_via_backend

run_eval_via_backend(config: EvalRunConfig, backend_url: str, api_key: str) -> list[EvalResult]
Backend mode: Route through backend interceptor for trace/usage capture. This mode creates an eval job on the backend, which:
  1. Routes LLM calls through the inference interceptor
  2. Captures traces and token usage automatically
  3. Calculates costs based on model pricing
  4. Provides detailed results with timing and metrics
Flow:
  1. POST /api/eval/jobs - Create eval job
  2. Poll GET /api/eval/jobs/{job_id} - Check job status until completed
  3. GET /api/eval/jobs/{job_id}/results - Fetch detailed results
Benefits:
  • Automatic trace capture via interceptor
  • Token usage tracking and cost calculation
  • Centralized job management and monitoring
  • Support for async job execution
Args:
  • config: Evaluation configuration including task app URL, seeds, policy config.
  • backend_url: Backend API base URL (e.g., “http://localhost:8000”)
  • api_key: Backend API key for authentication (Bearer token)
Returns:
  • List of EvalResult objects with detailed metrics including tokens, costs, traces.
Raises:
  • ValueError: If required configuration is missing.
  • RuntimeError: If job creation, polling, or result fetching fails.
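The create/poll/fetch flow above can be sketched with the HTTP calls injected as plain callables, so only the control flow is shown. The endpoint paths come from this page; the status values ("completed", "failed") and the timeout behavior are assumptions.

```python
import time

def poll_until_done(get_status, interval_s=2.0, timeout_s=600.0, sleep=time.sleep):
    """Poll an eval job's status until it finishes (sketch, assumed statuses)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()  # e.g. GET /api/eval/jobs/{job_id}
        if status == "completed":
            return status
        if status == "failed":
            raise RuntimeError("eval job failed")
        sleep(interval_s)
    raise RuntimeError("timed out waiting for eval job")
```

After polling succeeds, the runner fetches GET /api/eval/jobs/{job_id}/results for the detailed metrics.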

fetch_traces_from_backend

fetch_traces_from_backend(job_id: str, backend_url: str, api_key: str, output_dir: str) -> str
Download traces zip from backend and extract to output_dir. Returns path to the extracted traces directory.
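The download endpoint is not documented on this page, so the sketch below covers only the local extraction step: given a traces zip already fetched from the backend, unpack it under output_dir and return the extracted directory, as the return value above describes.

```python
import zipfile
from pathlib import Path

def extract_traces_zip(zip_path: str, output_dir: str) -> str:
    """Extract a downloaded traces zip into output_dir/traces (sketch)."""
    traces_dir = Path(output_dir) / "traces"
    traces_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(traces_dir)
    return str(traces_dir)
```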

format_eval_table

format_eval_table(results: list[EvalResult]) -> str
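No docstring is provided for this function, so the EvalResult fields are unknown here. A minimal sketch of what a per-seed results table might look like, with seed/score as assumed attributes:

```python
def format_table(rows: list[tuple[int, float]]) -> str:
    """Render (seed, score) pairs as an aligned text table (illustrative only)."""
    lines = [f"{'seed':>6}  {'score':>8}"]
    for seed, score in sorted(rows):
        lines.append(f"{seed:>6}  {score:>8.3f}")
    return "\n".join(lines)
```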

format_eval_report

format_eval_report(config: EvalRunConfig, results: list[EvalResult]) -> str

save_traces

save_traces(results: list[EvalResult], traces_dir: str) -> int
Save traces to individual JSON files in the given directory. Returns the number of traces saved.
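A sketch matching the behavior described above: write one JSON file per trace and return the count. The trace payload shape and the file-naming scheme are assumptions.

```python
import json
from pathlib import Path

def save_traces_sketch(traces: list[dict], traces_dir: str) -> int:
    """Write each trace to its own JSON file; return how many were saved (sketch)."""
    out = Path(traces_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, trace in enumerate(traces):
        (out / f"trace_{i}.json").write_text(json.dumps(trace, indent=2))
    return len(traces)
```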

Classes

EvalResult