
synth_ai.sdk.eval.job

First-class SDK API for evaluation jobs. This module provides high-level abstractions for running evaluation jobs that route through the backend for trace capture and cost tracking. Example:
from synth_ai.sdk.eval import EvalJob, EvalResult

job = EvalJob(config)
job.submit()

# progress=True provides built-in status printing:
# [00:05] running | 3/10 completed
# [00:10] running | 7/10 completed
# [00:15] completed | mean_reward: 0.85
result = job.poll_until_complete(progress=True)

# Typed result access (not raw dict)
if result.succeeded:
    print(f"Mean reward: {result.mean_reward}")
    print(f"Total cost: ${result.total_cost_usd:.4f}")
    for seed_result in result.seed_results:
        print(f"  Seed {seed_result['seed']}: {seed_result['score']}")
elif result.failed:
    print(f"Error: {result.error}")
See Also:
  • synth_ai.cli.eval: CLI implementation
  • synth_ai.sdk.optimization: Similar pattern for optimization jobs

Classes

EvalStatus

Status of an evaluation job. Methods:

from_string

from_string(cls, status: str) -> 'EvalStatus'
Convert string to EvalStatus, defaulting to PENDING for unknown values.

is_terminal

is_terminal(self) -> bool
Whether this status is terminal (job won’t change further).

is_success

is_success(self) -> bool
Whether this status indicates success.

EvalResult

Typed result from an evaluation job. Provides clean accessors for common fields instead of raw dict access; as the module example above shows, succeeded, failed, mean_reward, and similar accessors are exposed as properties (e.g. result.succeeded, not result.succeeded()). Methods:

from_response

from_response(cls, job_id: str, data: Dict[str, Any]) -> 'EvalResult'
Create result from API response dict.

succeeded

succeeded(self) -> bool
Whether the job completed successfully.

failed

failed(self) -> bool
Whether the job failed.

is_terminal

is_terminal(self) -> bool
Whether the job has reached a terminal state.

EvalJobConfig

Configuration for an evaluation job. This dataclass holds all the configuration needed to submit and run an evaluation job via the backend. Attributes:
  • task_app_url: URL of the task app to evaluate (e.g., “http://localhost:8103”). Required for job submission. Alias: local_api_url
  • backend_url: Base URL of the Synth API backend (e.g., “https://api.usesynth.ai”). Can also be set via SYNTH_BASE_URL or BACKEND_BASE_URL environment variables.
  • api_key: Synth API key for authentication with the backend. Can also be set via SYNTH_API_KEY environment variable.
  • task_app_api_key: API key for authenticating with the task app. Defaults to ENVIRONMENT_API_KEY env var if not provided. Alias: local_api_key
  • app_id: Task app identifier (optional, for logging/tracking).
  • env_name: Environment name within the task app.
  • seeds: List of seeds/indices to evaluate.
  • policy_config: Model and provider configuration for the policy.
  • env_config: Additional environment configuration.
  • concurrency: Maximum number of parallel rollouts (default: 5).
  • timeout: Maximum seconds per rollout (default: 600.0).
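Putting the attributes above together, a config can be assembled roughly as follows. The keyword names mirror the attribute list; treat the exact constructor signature as an assumption and check it against the SDK, and note that env_name and the model values are placeholders.

```python
# Sketch: assembling an EvalJobConfig from the documented attributes.
# policy_config is a plain dict per the docs ("Model and provider
# configuration for the policy").
policy_config = {"model": "gpt-4o-mini", "provider": "openai"}

def build_job_config():
    """Deferred import so this sketch stays loadable without synth_ai."""
    from synth_ai.sdk.eval.job import EvalJobConfig

    return EvalJobConfig(
        task_app_url="http://localhost:8103",   # required for submission
        backend_url="https://api.usesynth.ai",  # or SYNTH_BASE_URL env var
        env_name="my-env",                      # placeholder
        seeds=[0, 1, 2],
        policy_config=policy_config,
        concurrency=5,       # default: 5 parallel rollouts
        timeout=600.0,       # default: 600 s per rollout
    )
```

API keys are omitted here; per the attribute list they fall back to the SYNTH_API_KEY and ENVIRONMENT_API_KEY environment variables.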

EvalJob

High-level SDK class for running evaluation jobs via the backend. This class provides a clean API for:
  1. Submitting evaluation jobs to the backend
  2. Polling job status until completion
  3. Retrieving detailed results with metrics, tokens, and costs
  4. Downloading traces for analysis
The backend routes LLM calls through the inference interceptor, which:
  • Captures traces automatically
  • Tracks token usage
  • Calculates costs based on model pricing
Methods:

from_config

from_config(cls, config_path: str | Path, backend_url: Optional[str] = None, api_key: Optional[str] = None, task_app_api_key: Optional[str] = None, task_app_url: Optional[str] = None, seeds: Optional[List[int]] = None) -> 'EvalJob'
Create a job from a TOML config file. Loads evaluation configuration from a TOML file and allows overriding specific values via arguments. Args:
  • config_path: Path to TOML config file
  • backend_url: Backend API URL (defaults to env or production)
  • api_key: API key (defaults to SYNTH_API_KEY env var)
  • task_app_api_key: Task app API key (defaults to ENVIRONMENT_API_KEY)
  • task_app_url: Override task app URL from config
  • seeds: Override seeds list from config
Returns:
  • EvalJob instance ready for submission
Raises:
  • ValueError: If required config is missing
  • FileNotFoundError: If config file doesn’t exist

from_job_id

from_job_id(cls, job_id: str, backend_url: Optional[str] = None, api_key: Optional[str] = None) -> 'EvalJob'
Resume an existing job by ID. Use this to check status or get results of a previously submitted job. Args:
  • job_id: Existing job ID (e.g., “eval-abc123”)
  • backend_url: Backend API URL (defaults to env or production)
  • api_key: API key (defaults to SYNTH_API_KEY env var)
Returns:
  • EvalJob instance for the existing job

submit

submit(self) -> str
Submit the job to the backend. Creates an eval job on the backend which will:
  1. Route LLM calls through the inference interceptor
  2. Capture traces and token usage
  3. Calculate costs based on model pricing
Returns:
  • Job ID (e.g., “eval-abc123”)
Raises:
  • RuntimeError: If job submission fails or job already submitted
  • ValueError: If configuration is invalid

job_id

job_id(self) -> Optional[str]
Get the job ID (None if not yet submitted).

get_status

get_status(self) -> Dict[str, Any]
Get current job status. Returns:
  • Job status dictionary with keys:
    • job_id: Job identifier
    • status: “running”, “completed”, or “failed”
    • error: Error message if failed
    • created_at, started_at, completed_at: Timestamps
    • config: Original job configuration
    • results: Summary results if completed
Raises:
  • RuntimeError: If job hasn’t been submitted yet
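The status dict above can be rendered into a one-line summary like so. format_status is a hypothetical helper, not part of the SDK; the keys it reads are the documented ones.

```python
# Sketch: formatting the dict returned by EvalJob.get_status().
from typing import Any, Dict

def format_status(status: Dict[str, Any]) -> str:
    """One line from the documented keys: job_id, status, error."""
    line = f"{status['job_id']}: {status['status']}"
    if status.get("error"):
        line += f" ({status['error']})"
    return line

def print_status(job) -> None:
    """job: an already-submitted EvalJob."""
    print(format_status(job.get_status()))
```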

poll_until_complete

poll_until_complete(self, timeout: float = 1200.0, interval: float = 15.0, progress: bool = False, on_status: Optional[Callable[[Dict[str, Any]], None]] = None) -> EvalResult
Poll job until it reaches a terminal state, then return results. Polls the backend until the job completes or fails, then fetches and returns the detailed results. Args:
  • timeout: Maximum seconds to wait (default: 1200 = 20 minutes)
  • interval: Seconds between poll attempts (default: 15)
  • progress: If True, print status updates during polling (useful for notebooks)
  • on_status: Optional callback called on each status update (for custom progress handling)
Returns:
  • EvalResult with typed status, mean_reward, seed_results, etc.
Raises:
  • RuntimeError: If job hasn’t been submitted yet
  • TimeoutError: If timeout is exceeded
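For custom progress handling, on_status can replace progress=True. The sketch below records each status string; the callback signature (one status dict per call) follows the description above, and the history list is illustrative only.

```python
# Sketch: a custom on_status callback for poll_until_complete.
from typing import Any, Dict, List

history: List[str] = []

def on_status(status: Dict[str, Any]) -> None:
    """Called on each poll with the documented status dict."""
    history.append(status.get("status", "unknown"))

def wait_for(job):
    """job: an already-submitted EvalJob. Returns an EvalResult."""
    return job.poll_until_complete(timeout=1200, interval=15, on_status=on_status)
```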

stream_until_complete

stream_until_complete(self, timeout: float = 1200.0, interval: float = 15.0, handlers: Optional[List[StreamHandler]] = None, on_event: Optional[Callable[[Dict[str, Any]], None]] = None) -> EvalResult
Stream job events until completion using SSE. This provides real-time event streaming instead of polling, reducing server load and providing faster updates. Args:
  • timeout: Maximum seconds to wait (default: 1200 = 20 minutes)
  • interval: Seconds between status checks (for SSE reconnects)
  • handlers: Optional StreamHandler instances for custom event handling
  • on_event: Optional callback called on each event
Returns:
  • EvalResult with typed status, mean_reward, seed_results, etc.
Raises:
  • RuntimeError: If job hasn’t been submitted yet
  • TimeoutError: If timeout exceeded
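A small on_event sketch for the streaming variant. The event payload shape is not documented here, so the "type" key below is an assumption; adapt it to the events your backend actually emits.

```python
# Sketch: counting streamed events by type via on_event.
from collections import Counter
from typing import Any, Dict

event_counts: Counter = Counter()

def on_event(event: Dict[str, Any]) -> None:
    """Tally each event by its (assumed) "type" key."""
    event_counts[event.get("type", "unknown")] += 1

def stream(job):
    """job: an already-submitted EvalJob. Returns an EvalResult."""
    return job.stream_until_complete(timeout=1200, on_event=on_event)
```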

stream_sse_until_complete_async

stream_sse_until_complete_async(self, timeout: float = 1200.0, on_event: Optional[Callable[[Dict[str, Any]], None]] = None, progress: bool = False) -> EvalResult
Stream job events via SSE until completion (async version). This provides real-time event streaming instead of polling, reducing latency and providing instant updates. Args:
  • timeout: Maximum seconds to wait (default: 1200 = 20 minutes)
  • on_event: Optional callback called on each event
  • progress: If True, print progress updates
Returns:
  • EvalResult with typed status, mean_reward, seed_results, etc.
Raises:
  • RuntimeError: If job hasn’t been submitted yet

stream_sse_until_complete

stream_sse_until_complete(self, timeout: float = 1200.0, on_event: Optional[Callable[[Dict[str, Any]], None]] = None, progress: bool = False) -> EvalResult
Stream job events via SSE until completion (sync wrapper). This provides real-time event streaming instead of polling. Args:
  • timeout: Maximum seconds to wait (default: 1200 = 20 minutes)
  • on_event: Optional callback called on each event
  • progress: If True, print progress updates
Returns:
  • EvalResult with typed status, mean_reward, seed_results, etc.

get_results

get_results(self) -> Dict[str, Any]
Get detailed job results. Fetches the full results including per-seed scores, tokens, and costs. Returns:
  • Results dictionary with:
    • job_id: Job identifier
    • status: Job status
    • summary: Aggregate metrics:
      • mean_reward: Average reward across seeds
      • total_tokens: Total token usage
      • total_cost_usd: Total cost
      • num_seeds: Number of seeds evaluated
      • num_successful: Seeds that completed
      • num_failed: Seeds that failed
    • results: List of per-seed results, each with:
      • seed: Seed number
      • score: Evaluation score
      • tokens: Token count
      • cost_usd: Cost for this seed
      • latency_ms: Execution time
      • error: Error message if failed
Raises:
  • RuntimeError: If job hasn’t been submitted yet
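The nested payload above can be walked as follows. summarize is a hypothetical helper; the dict shape it assumes is exactly the key list documented for get_results().

```python
# Sketch: summarizing the dict returned by EvalJob.get_results().
from typing import Any, Dict, List

def summarize(results: Dict[str, Any]) -> List[str]:
    """One line for the aggregate summary, then one per seed."""
    lines: List[str] = []
    summary = results.get("summary", {})
    lines.append(
        f"mean_reward={summary.get('mean_reward')} "
        f"cost=${summary.get('total_cost_usd', 0.0):.4f}"
    )
    for row in results.get("results", []):
        if row.get("error"):
            lines.append(f"seed {row['seed']}: ERROR {row['error']}")
        else:
            lines.append(f"seed {row['seed']}: score={row['score']}")
    return lines

def report(job) -> List[str]:
    """job: an already-submitted EvalJob."""
    return summarize(job.get_results())
```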

download_traces

download_traces(self, output_dir: str | Path) -> Path
Download traces for the job to a directory. Downloads the traces ZIP file from the backend and extracts it to the specified directory. Args:
  • output_dir: Directory to extract traces to
Returns:
  • Path to the output directory
Raises:
  • RuntimeError: If job hasn’t been submitted or download fails
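A sketch of the download-and-inspect flow. The download_traces call matches the documented signature; the listing helper is hypothetical and assumes nothing about the layout of the extracted ZIP.

```python
# Sketch: downloading traces and listing what was extracted.
from pathlib import Path
from typing import List

def list_extracted(trace_dir: Path) -> List[str]:
    """Sorted filenames found anywhere under the extraction directory."""
    return sorted(p.name for p in trace_dir.rglob("*") if p.is_file())

def fetch_traces(job, output_dir: str) -> List[str]:
    """job: an already-submitted, completed EvalJob."""
    return list_extracted(job.download_traces(output_dir))
```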

cancel

cancel(self, reason: Optional[str] = None) -> Dict[str, Any]
Cancel a running eval job. Sends a cancellation request to the backend. The job will stop at the next checkpoint and emit a cancelled status event. Args:
  • reason: Optional reason for cancellation (recorded in job metadata)
Returns:
  • Dict with cancellation status:
    • job_id: The job ID
    • status: “succeeded”, “partial”, or “failed”
    • message: Human-readable status message
    • attempt_id: ID of the cancel attempt (for debugging)
Raises:
  • RuntimeError: If job hasn’t been submitted yet
  • RuntimeError: If the cancellation request fails
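Interpreting the cancellation response can look like this. describe_cancel is a hypothetical helper; the "succeeded"/"partial"/"failed" statuses and the reason argument are the ones documented above.

```python
# Sketch: cancelling with a reason and reading the documented response.
from typing import Any, Dict

def describe_cancel(resp: Dict[str, Any]) -> str:
    """Map the documented cancel statuses to a human-readable line."""
    status = resp.get("status")
    if status == "succeeded":
        return f"{resp['job_id']} cancelled"
    if status == "partial":
        return f"{resp['job_id']} partially cancelled: {resp.get('message', '')}"
    return f"{resp.get('job_id')} cancel failed: {resp.get('message', '')}"

def stop(job) -> str:
    """job: an already-submitted EvalJob."""
    return describe_cancel(job.cancel(reason="budget exceeded"))
```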

query_workflow_state

query_workflow_state(self) -> Dict[str, Any]
Query the Temporal workflow state for instant polling. This queries the workflow directly using its @workflow.query handler, providing instant state without database lookups. Useful for real-time progress monitoring. Returns:
  • Dict with workflow state:
    • job_id: The job ID
    • workflow_state: State from the query handler (or None if unavailable), with:
      • job_id: Job identifier
      • run_id: Current run ID
      • status: Current status (pending, running, succeeded, failed, cancelled)
      • progress: Human-readable progress string
      • error: Error message if failed
    • query_name: Name of the query that was executed
    • error: Error message if query failed (workflow may have completed)
Raises:
  • RuntimeError: If job hasn’t been submitted yet
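Extracting the progress string from the nested payload above can be sketched as follows; progress_of is a hypothetical helper, and it handles the documented case where workflow_state is None (query failed or the workflow already completed).

```python
# Sketch: pulling progress out of EvalJob.query_workflow_state().
from typing import Any, Dict, Optional

def progress_of(state: Dict[str, Any]) -> Optional[str]:
    """Return the nested progress string, or the top-level error."""
    workflow_state = state.get("workflow_state")
    if not workflow_state:
        return state.get("error")  # query failed / workflow completed
    return workflow_state.get("progress")

def live_progress(job) -> Optional[str]:
    """job: an already-submitted EvalJob."""
    return progress_of(job.query_workflow_state())
```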