
synth_ai.sdk.eval.job

First-class SDK API for evaluation jobs. This module provides high-level abstractions for running evaluation jobs that route through the backend for trace capture and cost tracking. Example:
from synth_ai.sdk.eval import EvalJob, EvalResult

job = EvalJob(config)
job.submit()

# progress=True provides built-in status printing:
# [00:05] running | 3/10 completed
# [00:10] running | 7/10 completed
# [00:15] completed | mean_reward: 0.85
result = job.poll_until_complete(progress=True)

# Typed result access (not raw dict)
if result.succeeded:
    print(f"Mean reward: {result.mean_reward}")
    print(f"Total cost: ${result.total_cost_usd:.4f}")
    for seed_result in result.seed_results:
        print(f"  Seed {seed_result['seed']}: {seed_result['score']}")
elif result.failed:
    print(f"Error: {result.error}")
See Also:
  • synth_ai.cli.eval: CLI implementation
  • synth_ai.sdk.optimization: Similar pattern for optimization jobs

Classes

EvalStatus

Status of an evaluation job. Methods:

from_string

from_string(cls, status: str) -> 'EvalStatus'
Convert string to EvalStatus, defaulting to PENDING for unknown values.

is_terminal

is_terminal(self) -> bool
Whether this status is terminal (job won’t change further).

is_success

is_success(self) -> bool
Whether this status indicates success.

EvalResult

Typed result from an evaluation job. Provides clean accessors for common fields instead of raw dict access; as the module example above shows, succeeded, failed, mean_reward, and similar accessors are exposed as properties (e.g. result.succeeded, not result.succeeded()). Methods:

from_response

from_response(cls, job_id: str, data: Dict[str, Any]) -> 'EvalResult'
Create result from API response dict.

succeeded

succeeded(self) -> bool
Whether the job completed successfully.

failed

failed(self) -> bool
Whether the job failed.

is_terminal

is_terminal(self) -> bool
Whether the job has reached a terminal state.

EvalJobConfig

Configuration for an evaluation job. This dataclass holds all the configuration needed to submit and run an evaluation job via the backend. Attributes:
  • task_app_url: URL of the task app to evaluate (e.g., “http://localhost:8103”). Required for job submission. Alias: local_api_url
  • backend_url: Base URL of the Synth API backend (e.g., “https://api.usesynth.ai”). Can also be set via SYNTH_BASE_URL or BACKEND_BASE_URL environment variables.
  • api_key: Synth API key for authentication with the backend. Can also be set via SYNTH_API_KEY environment variable.
  • task_app_api_key: API key for authenticating with the task app. Defaults to ENVIRONMENT_API_KEY env var if not provided. Alias: local_api_key
  • app_id: Task app identifier (optional, for logging/tracking).
  • env_name: Environment name within the task app.
  • seeds: List of seeds/indices to evaluate.
  • policy_config: Model and provider configuration for the policy.
  • env_config: Additional environment configuration.
  • concurrency: Maximum number of parallel rollouts (default: 5).
  • timeout: Maximum seconds per rollout (default: 600.0).
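Putting the attributes above together, a config can be assembled roughly as follows. The keyword names mirror the attribute list; treat the exact constructor signature as an assumption and check it against the SDK, and note that env_name and the model values are placeholders.

```python
# Sketch: assembling an EvalJobConfig from the documented attributes.
# policy_config is a plain dict per the docs ("Model and provider
# configuration for the policy").
policy_config = {"model": "gpt-4o-mini", "provider": "openai"}

def build_job_config():
    """Deferred import so this sketch stays loadable without synth_ai."""
    from synth_ai.sdk.eval.job import EvalJobConfig

    return EvalJobConfig(
        task_app_url="http://localhost:8103",   # required for submission
        backend_url="https://api.usesynth.ai",  # or SYNTH_BASE_URL env var
        env_name="my-env",                      # placeholder
        seeds=[0, 1, 2],
        policy_config=policy_config,
        concurrency=5,       # default: 5 parallel rollouts
        timeout=600.0,       # default: 600 s per rollout
    )
```

API keys are omitted here; per the attribute list they fall back to the SYNTH_API_KEY and ENVIRONMENT_API_KEY environment variables.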

EvalJob

High-level SDK class for running evaluation jobs via the backend. This class provides a clean API for:
  1. Submitting evaluation jobs to the backend
  2. Polling job status until completion
  3. Retrieving detailed results with metrics, tokens, and costs
  4. Downloading traces for analysis
The backend routes LLM calls through the inference interceptor, which:
  • Captures traces automatically
  • Tracks token usage
  • Calculates costs based on model pricing
Methods:

from_config

from_config(cls, config_path: str | Path, backend_url: Optional[str] = None, api_key: Optional[str] = None, task_app_api_key: Optional[str] = None, task_app_url: Optional[str] = None, seeds: Optional[List[int]] = None) -> 'EvalJob'
Create a job from a TOML config file. Loads evaluation configuration from a TOML file and allows overriding specific values via arguments. Args:
  • config_path: Path to TOML config file
  • backend_url: Backend API URL (defaults to env or production)
  • api_key: API key (defaults to SYNTH_API_KEY env var)
  • task_app_api_key: Task app API key (defaults to ENVIRONMENT_API_KEY)
  • task_app_url: Override task app URL from config
  • seeds: Override seeds list from config
Returns:
  • EvalJob instance ready for submission
Raises:
  • ValueError: If required config is missing
  • FileNotFoundError: If config file doesn’t exist

from_job_id

from_job_id(cls, job_id: str, backend_url: Optional[str] = None, api_key: Optional[str] = None) -> 'EvalJob'
Resume an existing job by ID. Use this to check status or get results of a previously submitted job. Args:
  • job_id: Existing job ID (e.g., “eval-abc123”)
  • backend_url: Backend API URL (defaults to env or production)
  • api_key: API key (defaults to SYNTH_API_KEY env var)
Returns:
  • EvalJob instance for the existing job

submit

submit(self) -> str
Submit the job to the backend. Creates an eval job on the backend which will:
  1. Route LLM calls through the inference interceptor
  2. Capture traces and token usage
  3. Calculate costs based on model pricing
Returns:
  • Job ID (e.g., “eval-abc123”)
Raises:
  • RuntimeError: If job submission fails or job already submitted
  • ValueError: If configuration is invalid

job_id

job_id(self) -> Optional[str]
Get the job ID (None if not yet submitted).

get_status

get_status(self) -> Dict[str, Any]
Get current job status. Returns:
  • Job status dictionary with keys:
    • job_id: Job identifier
    • status: “running”, “completed”, or “failed”
    • error: Error message if failed
    • created_at, started_at, completed_at: Timestamps
    • config: Original job configuration
    • results: Summary results if completed
Raises:
  • RuntimeError: If job hasn’t been submitted yet
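The status dict above can be rendered into a one-line summary like so. format_status is a hypothetical helper, not part of the SDK; the keys it reads are the documented ones.

```python
# Sketch: formatting the dict returned by EvalJob.get_status().
from typing import Any, Dict

def format_status(status: Dict[str, Any]) -> str:
    """One line from the documented keys: job_id, status, error."""
    line = f"{status['job_id']}: {status['status']}"
    if status.get("error"):
        line += f" ({status['error']})"
    return line

def print_status(job) -> None:
    """job: an already-submitted EvalJob."""
    print(format_status(job.get_status()))
```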

poll_until_complete

poll_until_complete(self, timeout: float = 1200.0, interval: float = 15.0, progress: bool = False, on_status: Optional[Callable[[Dict[str, Any]], None]] = None) -> EvalResult
Poll job until it reaches a terminal state, then return results. Polls the backend until the job completes or fails, then fetches and returns the detailed results. Args:
  • timeout: Maximum seconds to wait (default: 1200 = 20 minutes)
  • interval: Seconds between poll attempts (default: 15)
  • progress: If True, print status updates during polling (useful for notebooks)
  • on_status: Optional callback called on each status update (for custom progress handling)
Returns:
  • EvalResult with typed status, mean_reward, seed_results, etc.
Raises:
  • RuntimeError: If job hasn’t been submitted yet
  • TimeoutError: If timeout is exceeded
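For custom progress handling, on_status can replace progress=True. The sketch below records each status string; the callback signature (one status dict per call) follows the description above, and the history list is illustrative only.

```python
# Sketch: a custom on_status callback for poll_until_complete.
from typing import Any, Dict, List

history: List[str] = []

def on_status(status: Dict[str, Any]) -> None:
    """Called on each poll with the documented status dict."""
    history.append(status.get("status", "unknown"))

def wait_for(job):
    """job: an already-submitted EvalJob. Returns an EvalResult."""
    return job.poll_until_complete(timeout=1200, interval=15, on_status=on_status)
```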

stream_until_complete

stream_until_complete(self, timeout: float = 1200.0, interval: float = 15.0, handlers: Optional[List[StreamHandler]] = None, on_event: Optional[Callable[[Dict[str, Any]], None]] = None) -> EvalResult
Stream job events until completion using SSE. This provides real-time event streaming instead of polling, reducing server load and providing faster updates. Args:
  • timeout: Maximum seconds to wait (default: 1200 = 20 minutes)
  • interval: Seconds between status checks (for SSE reconnects)
  • handlers: Optional StreamHandler instances for custom event handling
  • on_event: Optional callback called on each event
Returns:
  • EvalResult with typed status, mean_reward, seed_results, etc.
Raises:
  • RuntimeError: If job hasn’t been submitted yet
  • TimeoutError: If timeout exceeded
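A small on_event sketch for the streaming variant. The event payload shape is not documented here, so the "type" key below is an assumption; adapt it to the events your backend actually emits.

```python
# Sketch: counting streamed events by type via on_event.
from collections import Counter
from typing import Any, Dict

event_counts: Counter = Counter()

def on_event(event: Dict[str, Any]) -> None:
    """Tally each event by its (assumed) "type" key."""
    event_counts[event.get("type", "unknown")] += 1

def stream(job):
    """job: an already-submitted EvalJob. Returns an EvalResult."""
    return job.stream_until_complete(timeout=1200, on_event=on_event)
```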

stream_sse_until_complete_async

stream_sse_until_complete_async(self, timeout: float = 1200.0, on_event: Optional[Callable[[Dict[str, Any]], None]] = None, progress: bool = False) -> EvalResult
Stream job events via SSE until completion (async version). This provides real-time event streaming instead of polling, reducing latency and providing instant updates. Args:
  • timeout: Maximum seconds to wait (default: 1200 = 20 minutes)
  • on_event: Optional callback called on each event
  • progress: If True, print progress updates
Returns:
  • EvalResult with typed status, mean_reward, seed_results, etc.
Raises:
  • RuntimeError: If job hasn’t been submitted yet

stream_sse_until_complete

stream_sse_until_complete(self, timeout: float = 1200.0, on_event: Optional[Callable[[Dict[str, Any]], None]] = None, progress: bool = False) -> EvalResult
Stream job events via SSE until completion (sync wrapper). This provides real-time event streaming instead of polling. Args:
  • timeout: Maximum seconds to wait (default: 1200 = 20 minutes)
  • on_event: Optional callback called on each event
  • progress: If True, print progress updates
Returns:
  • EvalResult with typed status, mean_reward, seed_results, etc.

get_results

get_results(self) -> Dict[str, Any]
Get detailed job results. Fetches the full results including per-seed scores, tokens, and costs. Returns:
  • Results dictionary with:
    • job_id: Job identifier
    • status: Job status
    • summary: Aggregate metrics:
      • mean_reward: Average reward across seeds
      • total_tokens: Total token usage
      • total_cost_usd: Total cost
      • num_seeds: Number of seeds evaluated
      • num_successful: Seeds that completed
      • num_failed: Seeds that failed
    • results: List of per-seed results, each with:
      • seed: Seed number
      • score: Evaluation score
      • tokens: Token count
      • cost_usd: Cost for this seed
      • latency_ms: Execution time
      • error: Error message if failed
Raises:
  • RuntimeError: If job hasn’t been submitted yet
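The nested payload above can be walked as follows. summarize is a hypothetical helper; the dict shape it assumes is exactly the key list documented for get_results().

```python
# Sketch: summarizing the dict returned by EvalJob.get_results().
from typing import Any, Dict, List

def summarize(results: Dict[str, Any]) -> List[str]:
    """One line for the aggregate summary, then one per seed."""
    lines: List[str] = []
    summary = results.get("summary", {})
    lines.append(
        f"mean_reward={summary.get('mean_reward')} "
        f"cost=${summary.get('total_cost_usd', 0.0):.4f}"
    )
    for row in results.get("results", []):
        if row.get("error"):
            lines.append(f"seed {row['seed']}: ERROR {row['error']}")
        else:
            lines.append(f"seed {row['seed']}: score={row['score']}")
    return lines

def report(job) -> List[str]:
    """job: an already-submitted EvalJob."""
    return summarize(job.get_results())
```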

download_traces

download_traces(self, output_dir: str | Path) -> Path
Download traces for the job to a directory. Downloads the traces ZIP file from the backend and extracts it to the specified directory. Args:
  • output_dir: Directory to extract traces to
Returns:
  • Path to the output directory
Raises:
  • RuntimeError: If job hasn’t been submitted or download fails
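A sketch of the download-and-inspect flow. The download_traces call matches the documented signature; the listing helper is hypothetical and assumes nothing about the layout of the extracted ZIP.

```python
# Sketch: downloading traces and listing what was extracted.
from pathlib import Path
from typing import List

def list_extracted(trace_dir: Path) -> List[str]:
    """Sorted filenames found anywhere under the extraction directory."""
    return sorted(p.name for p in trace_dir.rglob("*") if p.is_file())

def fetch_traces(job, output_dir: str) -> List[str]:
    """job: an already-submitted, completed EvalJob."""
    return list_extracted(job.download_traces(output_dir))
```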

cancel

cancel(self, reason: Optional[str] = None) -> Dict[str, Any]
Cancel a running eval job. Sends a cancellation request to the backend. The job will stop at the next checkpoint and emit a cancelled status event. Args:
  • reason: Optional reason for cancellation (recorded in job metadata)
Returns:
  • Dict with cancellation status:
    • job_id: The job ID
    • status: “succeeded”, “partial”, or “failed”
    • message: Human-readable status message
    • attempt_id: ID of the cancel attempt (for debugging)
Raises:
  • RuntimeError: If job hasn’t been submitted yet
  • RuntimeError: If the cancellation request fails
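Interpreting the cancellation response can look like this. describe_cancel is a hypothetical helper; the "succeeded"/"partial"/"failed" statuses and the reason argument are the ones documented above.

```python
# Sketch: cancelling with a reason and reading the documented response.
from typing import Any, Dict

def describe_cancel(resp: Dict[str, Any]) -> str:
    """Map the documented cancel statuses to a human-readable line."""
    status = resp.get("status")
    if status == "succeeded":
        return f"{resp['job_id']} cancelled"
    if status == "partial":
        return f"{resp['job_id']} partially cancelled: {resp.get('message', '')}"
    return f"{resp.get('job_id')} cancel failed: {resp.get('message', '')}"

def stop(job) -> str:
    """job: an already-submitted EvalJob."""
    return describe_cancel(job.cancel(reason="budget exceeded"))
```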

query_workflow_state

query_workflow_state(self) -> Dict[str, Any]
Query the Temporal workflow state for instant polling. This queries the workflow directly using its @workflow.query handler, providing instant state without database lookups. Useful for real-time progress monitoring. Returns:
  • Dict with workflow state:
    • job_id: The job ID
    • workflow_state: State from the query handler (or None if unavailable), with:
      • job_id: Job identifier
      • run_id: Current run ID
      • status: Current status (pending, running, succeeded, failed, cancelled)
      • progress: Human-readable progress string
      • error: Error message if failed
    • query_name: Name of the query that was executed
    • error: Error message if query failed (workflow may have completed)
Raises:
  • RuntimeError: If job hasn’t been submitted yet
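Extracting the progress string from the nested payload above can be sketched as follows; progress_of is a hypothetical helper, and it handles the documented case where workflow_state is None (query failed or the workflow already completed).

```python
# Sketch: pulling progress out of EvalJob.query_workflow_state().
from typing import Any, Dict, Optional

def progress_of(state: Dict[str, Any]) -> Optional[str]:
    """Return the nested progress string, or the top-level error."""
    workflow_state = state.get("workflow_state")
    if not workflow_state:
        return state.get("error")  # query failed / workflow completed
    return workflow_state.get("progress")

def live_progress(job) -> Optional[str]:
    """job: an already-submitted EvalJob."""
    return progress_of(job.query_workflow_state())
```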