The backend eval API provides endpoints for creating and managing evaluation jobs with automatic trace capture and cost tracking.

Overview

The eval API routes LLM calls through the inference interceptor, which:
  • Captures traces automatically via correlation IDs
  • Stores traces in the trace store (Redis/Turso)
  • Enables token usage extraction and cost calculation
Base URL: /api/eval
Authentication: Bearer token via Authorization header

Endpoints

Create Eval Job

Create a new eval job and start execution.
Endpoint: POST /api/eval/jobs
Request:
{
  "task_app_url": "http://localhost:8103",
  "task_app_api_key": "env-key-...",
  "app_id": "banking77",
  "env_name": "banking77",
  "seeds": [0, 1, 2, 3, 4],
  "policy": {
    "model": "gpt-4",
    "provider": "openai"
  },
  "env_config": {
    "split": "test"
  },
  "max_concurrent": 5,
  "timeout": 600.0
}
Response:
{
  "job_id": "eval-abc123",
  "status": "running"
}
Status Codes:
  • 200 / 201 — Job created successfully
  • 400 — Invalid request (missing required fields)
  • 401 — Authentication failed
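A minimal Python sketch of building this request with the standard library. It assumes the backend runs at http://localhost:8000 (the address passed to --backend in the CLI example) and that the token is sent as "Authorization: Bearer <token>". The helper name build_create_job_request and the placeholder token are illustrative and not part of the API.

```python
import json
import urllib.request

BACKEND_URL = "http://localhost:8000"  # assumption: local backend address


def build_create_job_request(token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /api/eval/jobs request."""
    return urllib.request.Request(
        url=f"{BACKEND_URL}/api/eval/jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same body as the request example above.
payload = {
    "task_app_url": "http://localhost:8103",
    "task_app_api_key": "env-key-...",
    "app_id": "banking77",
    "env_name": "banking77",
    "seeds": [0, 1, 2, 3, 4],
    "policy": {"model": "gpt-4", "provider": "openai"},
    "env_config": {"split": "test"},
    "max_concurrent": 5,
    "timeout": 600.0,
}
request = build_create_job_request("YOUR_API_TOKEN", payload)

# To actually submit the job (requires a running backend):
# with urllib.request.urlopen(request) as response:
#     job = json.load(response)  # expected keys: "job_id", "status"
```

Building the request separately from sending it makes it easy to inspect the URL, headers, and body before any network call is made.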

Get Job Status

Fetch the current status of an eval job.
Endpoint: GET /api/eval/jobs/{job_id}
Response:
{
  "job_id": "eval-abc123",
  "status": "completed",
  "error": null,
  "created_at": "2025-01-15T10:00:00Z",
  "started_at": "2025-01-15T10:00:01Z",
  "completed_at": "2025-01-15T10:05:30Z",
  "config": {
    "task_app_url": "http://localhost:8103",
    "app_id": "banking77",
    "seeds": [0, 1, 2, 3, 4]
  },
  "results": {
    "mean_score": 0.92,
    "total_tokens": 750,
    "total_cost_usd": 0.01
  }
}
Status Values:
  • running — Job is currently executing
  • completed — Job finished successfully
  • failed — Job encountered an error
Status Codes:
  • 200 — Success
  • 401 — Authentication failed
  • 404 — Job not found
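A client typically polls this endpoint until the job leaves the running state. The sketch below shows one way to do that. fetch_status is a hypothetical callable that performs GET /api/eval/jobs/{job_id} and returns the decoded JSON; the poll interval and timeout defaults are assumptions, not values defined by the API.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    fetch_status: Callable[[], dict],
    poll_interval: float = 2.0,
    timeout: float = 600.0,
) -> dict:
    """Call fetch_status() until the job is completed or failed, or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job.get('job_id')} is still {job['status']!r}")
        time.sleep(poll_interval)


# Stand-in for the HTTP call: two "running" responses, then "completed".
_responses = iter([
    {"job_id": "eval-abc123", "status": "running"},
    {"job_id": "eval-abc123", "status": "running"},
    {"job_id": "eval-abc123", "status": "completed", "error": None},
])
final = wait_for_job(lambda: next(_responses), poll_interval=0.0)
```

Treating failed as a terminal state keeps the loop from spinning until the timeout when a job errors out; callers can then inspect the error field of the returned job.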

Get Job Results

Fetch detailed results for a completed eval job.
Endpoint: GET /api/eval/jobs/{job_id}/results
Response:
{
  "job_id": "eval-abc123",
  "status": "completed",
  "summary": {
    "mean_score": 0.92,
    "total_tokens": 750,
    "total_cost_usd": 0.01,
    "num_seeds": 5,
    "num_successful": 5,
    "num_failed": 0
  },
  "results": [
    {
      "seed": 0,
      "trial_id": "trial-xyz",
      "correlation_id": "trace-abc",
      "score": 0.95,
      "mean_return": 0.95,
      "outcome_score": 0.95,
      "events_score": 0.95,
      "verifier_score": null,
      "latency_ms": 1234.5,
      "tokens": 150,
      "cost_usd": 0.002,
      "error": null,
      "trace_id": "trace-abc"
    },
    ...
  ]
}
Status Codes:
  • 200 — Success
  • 401 — Authentication failed
  • 404 — Job not found
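The summary fields can be cross-checked against the per-seed results. The sketch below recomputes them under assumed aggregation rules: a seed counts as failed when its error field is non-null, and failed seeds are excluded from the mean score and the totals. The backend's actual rules are not documented here, and the sample data is made up for illustration.

```python
from statistics import mean


def summarize(results: list[dict]) -> dict:
    """Recompute summary fields from per-seed results (assumed aggregation rules)."""
    ok = [r for r in results if r["error"] is None]
    return {
        "mean_score": mean(r["score"] for r in ok) if ok else None,
        "total_tokens": sum(r["tokens"] or 0 for r in ok),
        "total_cost_usd": sum(r["cost_usd"] or 0.0 for r in ok),
        "num_seeds": len(results),
        "num_successful": len(ok),
        "num_failed": len(results) - len(ok),
    }


# Made-up sample data, not real eval output.
sample_results = [
    {"seed": 0, "score": 1.0, "tokens": 150, "cost_usd": 0.002, "error": None},
    {"seed": 1, "score": 0.5, "tokens": 150, "cost_usd": 0.002, "error": None},
    {"seed": 2, "score": None, "tokens": None, "cost_usd": None, "error": "timeout"},
]
summary = summarize(sample_results)
```

Comparing a recomputed summary against the one returned by the API is a cheap sanity check when per-seed numbers look inconsistent with the totals.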

Download Traces

Download traces for a completed eval job as a ZIP file.
Endpoint: GET /api/eval/jobs/{job_id}/traces
Response:
  • Content-Type: application/zip
  • Body: ZIP file containing trace JSON files
Status Codes:
  • 200 — Success
  • 401 — Authentication failed
  • 404 — Job not found
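Once the response body has been downloaded, the archive can be read in memory with Python's zipfile module. This is a sketch only: the file names inside the archive and the structure of each trace are assumptions, and the tiny archive built below stands in for a real response body.

```python
import io
import json
import zipfile


def load_traces(zip_bytes: bytes) -> dict[str, dict]:
    """Map each .json member of the ZIP archive to its parsed content."""
    traces = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                traces[name] = json.loads(archive.read(name))
    return traces


# Build a tiny in-memory archive as a stand-in for the response body.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("trace-abc.json", json.dumps({"trace_id": "trace-abc"}))
traces = load_traces(buffer.getvalue())
```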

Trace Capture Flow

  1. Job Creation: Backend generates correlation IDs for each rollout
  2. Rollout Execution: Backend calls task app with correlation ID in inference URL (?cid=...)
  3. Interceptor Capture: Task app calls LLM via interceptor, which captures trace
  4. Trace Storage: Interceptor stores trace in trace store (Redis/Turso)
  5. Trace Hydration: Backend hydrates traces from store for cost calculation
  6. Result Assembly: Backend combines rollout results with trace data
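Step 2 of the flow above can be sketched as attaching the correlation ID to the inference URL as a cid query parameter. The interceptor URL and the ID format below are hypothetical; only the ?cid=... convention comes from the flow described here.

```python
import uuid
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def new_correlation_id() -> str:
    # Hypothetical ID format; the backend's real scheme is not documented here.
    return f"trace-{uuid.uuid4().hex[:12]}"


def with_correlation_id(inference_url: str, correlation_id: str) -> str:
    """Return inference_url with cid=<correlation_id> set in its query string."""
    parts = urlsplit(inference_url)
    # Drop any existing cid so the URL carries exactly one correlation ID.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "cid"]
    query.append(("cid", correlation_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical interceptor URL, for illustration only.
url = with_correlation_id("http://localhost:8000/interceptor/v1", "trace-abc")
```

Because each rollout gets its own ID in the URL, the interceptor can tie every LLM call it sees back to a single rollout without the task app passing any extra metadata.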

Implementation

Service: monorepo/backend/app/routes/eval/job_service.py
Key Methods:
  • EvalJobService.create_job() — Creates eval job and starts execution
  • EvalJobService._execute_seed() — Executes single rollout with trace capture
  • EvalJobService._calculate_metrics() — Calculates scores, tokens, costs
Routes: monorepo/backend/app/routes/eval/routes.py

CLI Integration

The synth-ai eval command uses this API when --backend is provided:
python -m synth_ai.cli eval \
  --config banking77_eval.toml \
  --url http://localhost:8103 \
  --backend http://localhost:8000
The CLI automatically:
  1. Creates job via POST /api/eval/jobs
  2. Polls status via GET /api/eval/jobs/{job_id} until completed
  3. Fetches results via GET /api/eval/jobs/{job_id}/results
See CLI Eval Documentation for CLI usage details.

Error Handling

Common Errors:
  • 400 Bad Request — Missing required fields (task_app_url, seeds, policy.model)
  • 401 Unauthorized — Invalid or missing API key
  • 404 Not Found — Job ID doesn’t exist or belongs to different org
  • 500 Internal Server Error — Backend error during execution
Error Response Format:
{
  "detail": "task_app_url is required"
}
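A client can surface the detail field as part of an exception. The sketch below is one possible approach: EvalApiError and parse_error are hypothetical client-side names, and the fallback for non-JSON bodies is an assumption about how a 500 response might look.

```python
import json


class EvalApiError(Exception):
    """Hypothetical client-side exception carrying the status code and detail."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(f"{status_code}: {detail}")
        self.status_code = status_code
        self.detail = detail


def parse_error(status_code: int, body: bytes) -> EvalApiError:
    """Turn an error response into an exception, tolerating non-JSON bodies."""
    try:
        detail = json.loads(body)["detail"]
    except (ValueError, KeyError, TypeError):
        # Body was not JSON, or had no "detail" key: fall back to the raw text.
        detail = body.decode("utf-8", errors="replace")
    return EvalApiError(status_code, str(detail))


err = parse_error(400, b'{"detail": "task_app_url is required"}')
```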

Rate Limiting

Rate limiting is not yet implemented for eval jobs (TODO).

See Also