This demo shows how to run the Crafter evaluation script against a hosted task app and produce JSONL data you can reuse for fine-tuning.

Prerequisites

  • uvx synth-ai setup has been run in the repo so .env contains SYNTH_API_KEY and ENVIRONMENT_API_KEY.
  • The Crafter task app is deployed to Modal (uvx synth-ai demo deploy or your own uvx synth-ai deploy).
  • You know the hosted URL (stored as TASK_APP_URL in .env).
  • Optional: a provider key (e.g., OPENAI_API_KEY) if you are evaluating LLMs that rely on external APIs.
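
If any of these are missing, the prerequisite commands named above can be run as follows (shown here as a convenience; consult the repo README for details):

```bash
# One-time setup referenced in the prerequisites above.
uvx synth-ai setup            # writes SYNTH_API_KEY / ENVIRONMENT_API_KEY to .env
uvx synth-ai demo deploy      # deploys the Crafter task app to Modal
```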

1. Set environment variables

```bash
export TASK_APP_URL="https://<your-task-app>.modal.run"
export ENVIRONMENT_API_KEY="$(grep ENVIRONMENT_API_KEY .env | cut -d= -f2)"
# Optional: provider keys (OpenAI/Groq/etc.)
export OPENAI_API_KEY="sk-..."
```

The evaluation script reads these variables and forwards them to the task app.
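
Before launching a full run, it can help to confirm the task app answers. Below is a minimal sketch that assumes the task app exposes /task_info (mentioned in step 3) and accepts the environment key via an X-API-Key header; the header name is an assumption, so adjust it to whatever your deployment expects.

```bash
# Quick reachability check against the hosted task app.
# NOTE: the X-API-Key header name is an assumption; substitute the header your
# task app actually uses for ENVIRONMENT_API_KEY authentication.
curl -sS "$TASK_APP_URL/task_info" \
  -H "X-API-Key: $ENVIRONMENT_API_KEY" | head -c 500; echo
```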

2. Choose a TOML config

The repo ships ready-to-use configs under examples/warming_up_to_rl/configs/. For example:
  • eval_groq_qwen32b.toml – uses Groq’s hosted Qwen model.
  • eval_vllm_qwen14b.toml – targets a locally hosted vLLM endpoint (update the URL if you use this option).
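
For orientation, an eval config looks roughly like the sketch below. Only policy.provider and policy.model are confirmed by this guide (see the tips at the end); the remaining field names are assumptions, so treat the shipped configs as the source of truth.

```toml
# Hypothetical shape of an eval config -- field names other than
# policy.provider and policy.model are assumptions; copy a shipped config
# from examples/warming_up_to_rl/configs/ rather than writing one from scratch.
[policy]
provider = "groq"              # or "openai", or a Synth-hosted provider
model = "qwen-32b"             # placeholder model identifier

[eval]
episodes = 10                  # assumed field name for episode count
concurrency = 4                # assumed field name for parallel rollouts
```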

3. Run the evaluation

```bash
uv run python examples/warming_up_to_rl/run_eval.py \
  --toml examples/warming_up_to_rl/configs/eval_groq_qwen32b.toml \
  --use-rollout
```

What happens:
  1. The script fetches metadata from your hosted task app (/task_info) to list available seeds and achievements.
  2. For each model in the config it launches the specified number of episodes, calling /rollout on your task app.
  3. Metrics (achievement counts, invalid action rates, latency) stream to the console.
  4. Traces are stored in traces/v3 (if tracing is enabled on the task app) so you can analyze them later.

4. Analyze the results

Use the helper scripts under examples/warming_up_to_rl/ to filter traces and compute summary statistics:
```bash
uv run python examples/warming_up_to_rl/export_trace_sft.py \
  --db traces/v3/synth_ai.db \
  --out datasets/crafter_eval.jsonl
```

> Download the trace database from your hosted deployment before running this command.

The JSONL file is now ready for uvx synth-ai train --type sft.
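
To sanity-check the export before training, you can peek at the file; this assumes jq is installed, and the exact record schema is whatever export_trace_sft.py emits.

```bash
# Inspect the exported dataset: count examples and pretty-print the first record.
wc -l datasets/crafter_eval.jsonl
head -n 1 datasets/crafter_eval.jsonl | jq .
```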

Where traces live (Turso/SQLite) and how to query

  • The task app writes request/step-level traces to a SQLite-compatible database. In production we use Turso (libSQL), and you can download a .db for local analysis.
  • Common tables include requests, steps, and events. You can join them to compute per-episode or per-model aggregates.
Example queries (the first runs in plain SQLite; the percentile query needs DuckDB or another engine with ordered-set aggregates):

```sql
-- Episode-level achievement rates by model (plain SQLite)
SELECT r.model, s.achievement, COUNT(*) AS hits
FROM steps s
JOIN requests r ON s.request_id = r.id
WHERE s.type = 'achievement'
GROUP BY r.model, s.achievement
ORDER BY hits DESC;

-- Latency percentiles by route (rollout vs. eval).
-- Note: PERCENTILE_CONT is not available in plain SQLite; run this in DuckDB
-- or another engine that supports ordered-set aggregates.
SELECT r.route, PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY r.duration_ms) AS p95
FROM requests r
GROUP BY r.route;
```

Tip: Use DuckDB or sqlite3 locally to explore the .db, or connect directly to Turso for shared dashboards.
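
For example, with the sqlite3 CLI installed you can inspect the downloaded database directly (DuckDB works similarly and also supports the percentile query above):

```bash
# List tables and check row counts in the downloaded trace database.
sqlite3 traces/v3/synth_ai.db ".tables"
sqlite3 traces/v3/synth_ai.db "SELECT COUNT(*) FROM requests;"

# Dump a query's results to CSV for spreadsheet or notebook analysis.
sqlite3 -header -csv traces/v3/synth_ai.db \
  "SELECT model, COUNT(*) AS n FROM requests GROUP BY model;" > requests_by_model.csv
```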

5. Tips

  • Adjust concurrency and episode counts directly in the TOML file.
  • Swap models by setting policy.provider and policy.model. The script supports Synth-hosted, OpenAI-compatible, and Groq providers out of the box.
  • If you want to operate entirely within Synth (no external APIs), point the config at checkpoints produced by your own training jobs.
This pipeline is how many teams benchmark candidate policies before launching costlier RL runs. Because everything targets the hosted task app, results are reproducible across your organization.