Evaluating RL or SFT checkpoints is as important as training them. The SDK ships helper scripts under examples/warming_up_to_rl/ that work hand-in-hand with the Synth CLI.

1. Set credentials

export TASK_APP_URL="https://<your-task-app>.modal.run"
export ENVIRONMENT_API_KEY="$(rg -o -r '$1' '^ENVIRONMENT_API_KEY=(.*)' .env)"
# Optional: inference keys (OpenAI/Groq/etc.)
export OPENAI_API_KEY="sk-..."
The run_eval.py script reads TASK_APP_URL, ENVIRONMENT_API_KEY, and any provider keys you export. Alternatively, pass --env-file .env to uvx synth-ai smoke and uvx synth-ai train to keep everything in a single place.
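
If you go the --env-file route, the .env file typically just holds the same keys (values below are placeholders):
# .env file read via --env-file; same keys as the exports above
TASK_APP_URL=https://<your-task-app>.modal.run
ENVIRONMENT_API_KEY=replace-me
OPENAI_API_KEY=sk-...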

2. Choose a TOML config

Ready-made configs live in examples/warming_up_to_rl/configs/, for example:
  • eval_groq_qwen32b.toml — runs a Qwen 32B model hosted on Groq.
  • eval_vllm_qwen14b.toml — targets a self-hosted vLLM endpoint.
Each config specifies the model/provider, split, number of episodes, and optional concurrency.
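
For orientation, a config along these lines covers those fields; the table and key names here are illustrative rather than the exact schema, so copy one of the shipped configs as your starting point:
# Illustrative sketch only; mirror a shipped config for the real key names
[eval]
model = "qwen/qwen3-32b"   # swap in an ft:... checkpoint to evaluate a fine-tune
provider = "groq"
split = "val"
num_episodes = 50
concurrency = 8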

3. Run the evaluation

uv run python examples/warming_up_to_rl/run_eval.py \
  --toml examples/warming_up_to_rl/configs/eval_groq_qwen32b.toml \
  --use-rollout
What happens:
  1. The script fetches /task_info from the task app to enumerate seeds and metadata (you can probe this endpoint yourself; see the sketch after this list).
  2. It runs the configured number of episodes per model, calling /rollout.
  3. Metrics stream to stdout (achievements, invalid actions, latency).
  4. If tracing is enabled, rollouts are written to traces/v3.
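
To probe the task app by hand before a full run, you can hit /task_info directly. This is a sketch: the X-API-Key header name is an assumption about the task app's auth scheme (run_eval.py handles authentication for you).
# Fetch the seeds/metadata that run_eval.py enumerates.
# NOTE: the X-API-Key header is an assumption; check your task app's auth scheme.
curl -sS "$TASK_APP_URL/task_info" \
  -H "X-API-Key: $ENVIRONMENT_API_KEY" | python -m json.tool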

4. Export traces

uv run python examples/warming_up_to_rl/export_trace_sft.py \
  --db traces/v3/synth_ai.db \
  --out datasets/crafter_eval.jsonl
This produces an evaluation JSONL you can archive or feed into SFT jobs.
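
A quick sanity check on the export (the exact record fields depend on the exporter):
# Count exported records and pretty-print the first one
wc -l datasets/crafter_eval.jsonl
head -n 1 datasets/crafter_eval.jsonl | python -m json.tool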

5. Deeper analysis

Use standard SQL tooling (sqlite3, DuckDB, Turso) to query the trace database; a DuckDB invocation is sketched after the queries below. Useful starting queries:
-- Achievement rates by model
SELECT r.model, s.achievement, COUNT(*) AS hits
FROM steps s
JOIN requests r ON s.request_id = r.id
WHERE s.type = 'achievement'
GROUP BY r.model, s.achievement
ORDER BY hits DESC;

-- Latency percentiles by route
SELECT r.route,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY r.duration_ms) AS p95
FROM requests r
GROUP BY r.route;
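
The percentile query needs an engine that supports percentile_cont; one way to run it, as a sketch assuming the DuckDB CLI with its sqlite extension and the trace path from step 4:
duckdb <<'SQL'
INSTALL sqlite;
LOAD sqlite;
-- Attach the SQLite trace database written by the SDK
ATTACH 'traces/v3/synth_ai.db' AS traces (TYPE sqlite);
USE traces;
-- p95 latency per route (same query as above)
SELECT route,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95
FROM requests
GROUP BY route;
SQL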

Tips

  • Adjust concurrency/episode counts directly in the TOML.
  • Swap checkpoints by editing the model field (e.g., use a freshly fine-tuned ft: model).
  • Pair evaluations with uvx synth-ai status jobs … to correlate results with training runs.