Prerequisites

- `uvx synth-ai setup` has been run in the repo, so `.env` contains `SYNTH_API_KEY` and `ENVIRONMENT_API_KEY`.
- The Crafter task app is deployed to Modal (via `uvx synth-ai demo deploy`, or your own `uvx synth-ai deploy`).
- You know the hosted URL (stored as `TASK_APP_URL` in `.env`).
- Optional: a provider key (e.g., `OPENAI_API_KEY`) if you are evaluating LLMs that rely on external APIs.
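For convenience, the prerequisite commands collected in one place (both taken verbatim from the list above):

```bash
uvx synth-ai setup        # writes SYNTH_API_KEY and ENVIRONMENT_API_KEY to .env
uvx synth-ai demo deploy  # deploys the Crafter task app to Modal
```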
1. Set environment variables
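A minimal sketch, assuming `uvx synth-ai setup` wrote the keys to `.env` at the repo root; export them into the current shell before running the eval:

```bash
# Load the keys written by `uvx synth-ai setup` into the current shell.
# Variable names come from the prerequisites above; the file location is
# assumed to be the repo root.
set -a            # export every variable assigned while sourcing
source .env       # provides SYNTH_API_KEY, ENVIRONMENT_API_KEY, TASK_APP_URL
set +a

# Optional: only needed when evaluating models behind external APIs.
# export OPENAI_API_KEY=sk-...
```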
2. Choose a TOML config
The repo ships ready-to-use configs under `examples/warming_up_to_rl/configs/`. For example:

- `eval_groq_qwen32b.toml` – uses Groq's hosted Qwen model.
- `eval_vllm_qwen14b.toml` – targets a locally hosted vLLM endpoint (update the URL if you use this option).
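To customize, copy a shipped config and edit the model block. In the sketch below, only the `policy.provider` and `policy.model` key names are confirmed by this guide (see Tips); the `[policy]` table layout and the model id are assumptions:

```bash
# Derive a custom config from a shipped one, then point it at your model.
cp examples/warming_up_to_rl/configs/eval_groq_qwen32b.toml my_eval.toml

# Edit the policy block; key names per this guide, layout/values illustrative:
#   [policy]
#   provider = "groq"
#   model = "qwen-32b"     # illustrative id, not verified
$EDITOR my_eval.toml
```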
3. Run the evaluation
- The script fetches metadata from your hosted task app (`/task_info`) to list available seeds and achievements.
- For each model in the config, it launches the specified number of episodes, calling `/rollout` on your task app.
- Metrics (achievement counts, invalid action rates, latency) stream to the console.
- Traces are stored in `traces/v3` (if tracing is enabled on the task app) so you can analyze them later.
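A minimal sketch of the run step. This guide does not name the eval entry point, so the script path and flag below are hypothetical placeholders; substitute the actual script from `examples/warming_up_to_rl/`:

```bash
# Hypothetical entry point and --toml flag; check examples/warming_up_to_rl/
# for the real script name before running.
uv run python examples/warming_up_to_rl/run_eval.py \
  --toml examples/warming_up_to_rl/configs/eval_groq_qwen32b.toml
```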
4. Analyze the results
Use the helper scripts under `examples/warming_up_to_rl/` to filter traces and compute summary statistics. The filtered traces can then feed supervised fine-tuning via `uvx synth-ai train --type sft`.
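As a sketch only: this guide does not name the individual helper scripts, so the filter invocation below is a hypothetical placeholder; list the directory to find the real entry points:

```bash
# Discover the actual helper scripts first.
ls examples/warming_up_to_rl/*.py

# Hypothetical filter invocation; script name and flag are placeholders.
# uv run python examples/warming_up_to_rl/filter_traces.py --db crafter_traces.db

# Command and flag below are taken verbatim from this guide.
uvx synth-ai train --type sft
```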
Where traces live (Turso/SQLite) and how to query
- The task app writes request/step-level traces to a SQLite-compatible database. In production we use Turso (libSQL), and you can download a `.db` file for local analysis.
- Common tables include `requests`, `steps`, and `events`. You can join them to compute per-episode or per-model aggregates.
- Inspect the downloaded `.db` with any SQLite client, or connect directly to Turso for shared dashboards.
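A sketch of a local query over a downloaded trace database. The table names come from the list above, but the file path and the join column (`request_id`) are assumptions; run `.schema` first to confirm the layout:

```bash
# Query a downloaded trace DB with the sqlite3 CLI. Only the table names
# are from this guide; the join key and file name are assumptions.
sqlite3 crafter_traces.db <<'SQL'
.headers on
.mode column
-- Rough per-request step counts (hypothetical join key: steps.request_id).
SELECT r.id AS request_id, COUNT(*) AS n_steps
FROM requests AS r
JOIN steps AS s ON s.request_id = r.id
GROUP BY r.id
LIMIT 10;
SQL
```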
5. Tips
- Adjust concurrency and episode counts directly in the TOML file.
- Swap models by setting `policy.provider` and `policy.model`. The script supports Synth-hosted, OpenAI-compatible, and Groq providers out of the box.
- If you want to operate entirely within Synth (no external APIs), point the config at checkpoints produced by your own training jobs.