synth_ai.sdk.eval.job
First-class SDK API for evaluation jobs.
This module provides high-level abstractions for running evaluation jobs
that route through the backend for trace capture and cost tracking.
See also:
synth_ai.cli.eval: CLI implementation
synth_ai.sdk.optimization: Similar pattern for optimization jobs
Classes
EvalStatus
Status of an evaluation job.
Methods:
from_string
is_terminal
is_success
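The three methods listed above suggest a string-backed enum. The following is a minimal sketch of that pattern, not the library's implementation. The class name `SketchEvalStatus`, the member names (taken from status strings that appear elsewhere in this reference), the use of properties, and the fallback for unknown strings are all assumptions.

```python
from enum import Enum


class SketchEvalStatus(str, Enum):
    """Hypothetical stand-in for EvalStatus; member names are assumptions."""

    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

    @classmethod
    def from_string(cls, value: str) -> "SketchEvalStatus":
        # Normalize case and whitespace. Unknown strings fall back to PENDING;
        # that fallback is a choice made for this sketch only.
        try:
            return cls(value.strip().lower())
        except ValueError:
            return cls.PENDING

    @property
    def is_terminal(self) -> bool:
        # A terminal status will not change on further polling.
        return self in (
            SketchEvalStatus.COMPLETED,
            SketchEvalStatus.FAILED,
            SketchEvalStatus.CANCELLED,
        )

    @property
    def is_success(self) -> bool:
        return self is SketchEvalStatus.COMPLETED
```

Parsing raw strings once at the boundary keeps later checks such as `status.is_terminal` free of string comparisons.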
EvalResult
Typed result from an evaluation job.
Provides clean accessors for common fields instead of raw dict access.
Methods:
from_response
succeeded
failed
is_terminal
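To illustrate what "clean accessors instead of raw dict access" can look like, here is a hedged sketch of a typed result built from a response dictionary. `SketchEvalResult`, its field layout, the `from_response(job_id, data)` signature, and the status strings it compares against are assumptions; only the accessor names come from the list above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchEvalResult:
    """Hypothetical stand-in for EvalResult; the field layout is an assumption."""

    job_id: str
    status: str
    mean_reward: Optional[float] = None
    seed_results: List[Dict[str, Any]] = field(default_factory=list)
    error: Optional[str] = None

    @classmethod
    def from_response(cls, job_id: str, data: Dict[str, Any]) -> "SketchEvalResult":
        # Read the aggregate metric from the "summary" block when present.
        summary = data.get("summary") or {}
        return cls(
            job_id=job_id,
            status=str(data.get("status", "pending")),
            mean_reward=summary.get("mean_reward"),
            seed_results=list(data.get("results") or []),
            error=data.get("error"),
        )

    @property
    def succeeded(self) -> bool:
        return self.status == "completed"

    @property
    def failed(self) -> bool:
        return self.status == "failed"

    @property
    def is_terminal(self) -> bool:
        return self.status in ("completed", "failed", "cancelled")
```

Callers can then write `result.succeeded` or `result.mean_reward` rather than indexing into nested dictionaries.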
EvalJobConfig
Configuration for an evaluation job.
This dataclass holds all the configuration needed to submit and run
an evaluation job via the backend.
Attributes:
task_app_url: URL of the task app to evaluate (e.g., “http://localhost:8103”). Required for job submission. Alias: local_api_url
backend_url: Base URL of the Synth API backend (e.g., “https://api.usesynth.ai”). Can also be set via SYNTH_BASE_URL or BACKEND_BASE_URL environment variables.
api_key: Synth API key for authentication with the backend. Can also be set via SYNTH_API_KEY environment variable.
task_app_api_key: API key for authenticating with the task app. Defaults to ENVIRONMENT_API_KEY env var if not provided. Alias: local_api_key
app_id: Task app identifier (optional, for logging/tracking).
env_name: Environment name within the task app.
seeds: List of seeds/indices to evaluate.
policy_config: Model and provider configuration for the policy.
env_config: Additional environment configuration.
concurrency: Maximum number of parallel rollouts (default: 5).
timeout: Maximum seconds per rollout (default: 600.0).
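Several attributes fall back to environment variables when not passed explicitly. The sketch below shows one way such a fallback chain can be resolved. The helper names are hypothetical, the precedence between SYNTH_BASE_URL and BACKEND_BASE_URL is an assumption, and so is using the example URL above as the final default.

```python
import os
from typing import Mapping, Optional

# Assumed default: the example backend URL from the attribute description.
DEFAULT_BACKEND_URL = "https://api.usesynth.ai"


def resolve_backend_url(
    explicit: Optional[str], env: Mapping[str, str] = os.environ
) -> str:
    """Pick the backend URL: explicit argument, then env vars, then a default.

    Checking SYNTH_BASE_URL before BACKEND_BASE_URL is an assumption.
    """
    return (
        explicit
        or env.get("SYNTH_BASE_URL")
        or env.get("BACKEND_BASE_URL")
        or DEFAULT_BACKEND_URL
    )


def resolve_task_app_api_key(
    explicit: Optional[str], env: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Fall back to ENVIRONMENT_API_KEY when no key is passed explicitly."""
    return explicit or env.get("ENVIRONMENT_API_KEY")
```

Passing the environment as a mapping makes the fallback order easy to check in isolation, for example `resolve_backend_url(None, {"BACKEND_BASE_URL": "http://localhost:8000"})`.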
EvalJob
High-level SDK class for running evaluation jobs via the backend.
This class provides a clean API for:
- Submitting evaluation jobs to the backend
- Polling job status until completion
- Retrieving detailed results with metrics, tokens, and costs
- Downloading traces for analysis
Jobs submitted through this class run via the backend, which:
- Captures traces automatically
- Tracks token usage
- Calculates costs based on model pricing
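The capabilities above usually chain into a submit, wait, fetch, download sequence. The sketch below writes that sequence against a structural type, so it runs without the library installed. `EvalJobLike` and `run_evaluation` are hypothetical names, and the keyword-argument call style is an assumption; the method and parameter names are those documented in this reference.

```python
from pathlib import Path
from typing import Any, Dict, Protocol


class EvalJobLike(Protocol):
    """Structural type listing only the documented methods this sketch calls."""

    def submit(self) -> str: ...

    def poll_until_complete(
        self, *, timeout: float, interval: float, progress: bool
    ) -> Any: ...

    def get_results(self) -> Dict[str, Any]: ...

    def download_traces(self, output_dir: str) -> Any: ...


def run_evaluation(job: EvalJobLike, trace_dir: str = "traces") -> Dict[str, Any]:
    """Submit the job, wait for completion, fetch results, then download traces."""
    job_id = job.submit()
    # 1200 s and 15 s are the documented defaults, passed explicitly for clarity.
    job.poll_until_complete(timeout=1200.0, interval=15.0, progress=True)
    results = job.get_results()
    traces_path = job.download_traces(output_dir=trace_dir)
    return {
        "job_id": job_id,
        "summary": results.get("summary", {}),
        "traces": str(Path(traces_path)),
    }
```

With the library installed, the `job` argument would presumably be an instance created by `EvalJob.from_config(...)` or `EvalJob.from_job_id(...)`.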
from_config
Args:
config_path: Path to TOML config file
backend_url: Backend API URL (defaults to env or production)
api_key: API key (defaults to SYNTH_API_KEY env var)
task_app_api_key: Task app API key (defaults to ENVIRONMENT_API_KEY)
task_app_url: Override task app URL from config
seeds: Override seeds list from config
Returns:
- EvalJob instance ready for submission
Raises:
ValueError: If required config is missing
FileNotFoundError: If config file doesn’t exist
from_job_id
Args:
job_id: Existing job ID (e.g., “eval-abc123”)
backend_url: Backend API URL (defaults to env or production)
api_key: API key (defaults to SYNTH_API_KEY env var)
Returns:
- EvalJob instance for the existing job
submit
Submits the job to the backend, which will:
- Route LLM calls through the inference interceptor
- Capture traces and token usage
- Calculate costs based on model pricing
Returns:
- Job ID (e.g., “eval-abc123”)
Raises:
RuntimeError: If job submission fails or job already submitted
ValueError: If configuration is invalid
job_id
get_status
Returns:
- Job status dictionary with keys:
  - job_id: Job identifier
  - status: “running”, “completed”, or “failed”
  - error: Error message if failed
  - created_at, started_at, completed_at: Timestamps
  - config: Original job configuration
  - results: Summary results if completed
Raises:
RuntimeError: If job hasn’t been submitted yet
poll_until_complete
Args:
timeout: Maximum seconds to wait (default: 1200 = 20 minutes)
interval: Seconds between poll attempts (default: 15)
progress: If True, print status updates during polling (useful for notebooks)
on_status: Optional callback called on each status update (for custom progress handling)
Returns:
- EvalResult with typed status, mean_reward, seed_results, etc.
Raises:
RuntimeError: If job hasn’t been submitted yet
TimeoutError: If timeout is exceeded
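The timeout, interval, and callback parameters describe a standard polling loop. The following is a minimal sketch of such a loop, not the library's implementation. `poll_status` is a hypothetical helper, the terminal status set is taken from the `get_status` description and may be incomplete, and passing the status dictionary to `on_status` is an assumption. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time
from typing import Any, Callable, Dict, Optional

# Terminal values from the get_status description; others may exist.
TERMINAL_STATUSES = {"completed", "failed"}


def poll_status(
    get_status: Callable[[], Dict[str, Any]],
    timeout: float = 1200.0,
    interval: float = 15.0,
    on_status: Optional[Callable[[Dict[str, Any]], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call get_status() every `interval` seconds until terminal or timed out."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if on_status is not None:
            on_status(status)
        if status.get("status") in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"job still {status.get('status')!r} after {timeout}s"
            )
        sleep(interval)
```

The status is checked before the deadline, so a job that finishes on the final poll is still returned rather than reported as a timeout.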
stream_until_complete
Args:
timeout: Maximum seconds to wait (default: 1200 = 20 minutes)
interval: Seconds between status checks (for SSE reconnects)
handlers: Optional StreamHandler instances for custom event handling
on_event: Optional callback called on each event
Returns:
- EvalResult with typed status, mean_reward, seed_results, etc.
Raises:
RuntimeError: If job hasn’t been submitted yet
TimeoutError: If timeout is exceeded
stream_sse_until_complete_async
Args:
timeout: Maximum seconds to wait (default: 1200 = 20 minutes)
on_event: Optional callback called on each event
progress: If True, print progress updates
Returns:
- EvalResult with typed status, mean_reward, seed_results, etc.
Raises:
RuntimeError: If job hasn’t been submitted yet
stream_sse_until_complete
Args:
timeout: Maximum seconds to wait (default: 1200 = 20 minutes)
on_event: Optional callback called on each event
progress: If True, print progress updates
Returns:
- EvalResult with typed status, mean_reward, seed_results, etc.
get_results
Returns:
- Results dictionary with:
  - job_id: Job identifier
  - status: Job status
  - summary: Aggregate metrics
    - mean_reward: Average reward across seeds
    - total_tokens: Total token usage
    - total_cost_usd: Total cost
    - num_seeds: Number of seeds evaluated
    - num_successful: Seeds that completed
    - num_failed: Seeds that failed
  - results: List of per-seed results
    - seed: Seed number
    - score: Evaluation score
    - tokens: Token count
    - cost_usd: Cost for this seed
    - latency_ms: Execution time
    - error: Error message if failed
Raises:
RuntimeError: If job hasn’t been submitted yet
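Given the dictionary layout described for `get_results`, a couple of small helpers can pull out common figures. This is a sketch that relies only on the documented keys. The function names are hypothetical, and the sample values are invented for illustration rather than real output.

```python
from typing import Any, Dict, List


def failed_seeds(results: Dict[str, Any]) -> List[int]:
    """Return the seeds whose per-seed entry carries an error message."""
    return [r["seed"] for r in results.get("results", []) if r.get("error")]


def cost_per_successful_seed(results: Dict[str, Any]) -> float:
    """Divide total cost by the number of seeds that completed."""
    summary = results.get("summary", {})
    successful = summary.get("num_successful", 0)
    if not successful:
        return 0.0
    return summary.get("total_cost_usd", 0.0) / successful


# Illustrative values only; not real output from the backend.
SAMPLE_RESULTS = {
    "job_id": "eval-abc123",
    "status": "completed",
    "summary": {
        "mean_reward": 0.5,
        "total_tokens": 3000,
        "total_cost_usd": 0.04,
        "num_seeds": 3,
        "num_successful": 2,
        "num_failed": 1,
    },
    "results": [
        {"seed": 0, "score": 1.0, "tokens": 1500, "cost_usd": 0.02, "latency_ms": 900, "error": None},
        {"seed": 1, "score": 0.0, "tokens": 1500, "cost_usd": 0.02, "latency_ms": 950, "error": None},
        {"seed": 2, "score": None, "tokens": 0, "cost_usd": 0.0, "latency_ms": 100, "error": "rollout timed out"},
    ],
}
```

Here `failed_seeds(SAMPLE_RESULTS)` picks out seed 2, the only entry with a non-empty `error`.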
download_traces
Args:
output_dir: Directory to extract traces to
Returns:
- Path to the output directory
Raises:
RuntimeError: If job hasn’t been submitted or download fails
cancel
Args:
reason: Optional reason for cancellation (recorded in job metadata)
Returns:
- Dict with cancellation status:
  - job_id: The job ID
  - status: “succeeded”, “partial”, or “failed”
  - message: Human-readable status message
  - attempt_id: ID of the cancel attempt (for debugging)
Raises:
RuntimeError: If job hasn’t been submitted yet
RuntimeError: If the cancellation request fails
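Because the cancellation response distinguishes “succeeded” from “partial”, callers may want to treat only the former as a clean cancel. The helpers below are a small sketch over the documented keys. Their names are hypothetical, and treating “partial” as not fully cancelled is an assumption about its meaning.

```python
from typing import Any, Dict


def cancellation_succeeded(response: Dict[str, Any]) -> bool:
    """True only when the cancel response reports full success."""
    return response.get("status") == "succeeded"


def describe_cancellation(response: Dict[str, Any]) -> str:
    """Build a one-line log message from the documented response keys."""
    status = response.get("status", "unknown")
    message = response.get("message", "")
    attempt = response.get("attempt_id", "n/a")
    return f"cancel {response.get('job_id')}: {status} ({message}) [attempt {attempt}]"
```

Including `attempt_id` in the log line keeps the value the reference describes as useful for debugging close to hand.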
query_workflow_state
Returns:
- Dict with workflow state:
  - job_id: The job ID
  - workflow_state: State from the query handler (or None if unavailable)
    - job_id: Job identifier
    - run_id: Current run ID
    - status: Current status (pending, running, succeeded, failed, cancelled)
    - progress: Human-readable progress string
    - error: Error message if failed
  - query_name: Name of the query that was executed
  - error: Error message if query failed (workflow may have completed)
Raises:
RuntimeError: If job hasn’t been submitted yet
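Since `workflow_state` can be None, code that reads it needs a guard. The sketch below shows one way to extract the progress string safely. `workflow_progress` is a hypothetical helper that relies only on the keys documented above.

```python
from typing import Any, Dict, Optional


def workflow_progress(response: Dict[str, Any]) -> Optional[str]:
    """Return the progress string, or None when the state is unavailable."""
    state = response.get("workflow_state")
    if state is None:
        # The query can fail once the workflow has completed, so report
        # "no progress available" instead of raising.
        return None
    return state.get("progress")
```

When this returns None, the top-level `error` key is the documented place to look for the reason the query failed.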