`examples/warming_up_to_rl/` that work hand-in-hand with the Synth CLI.
1. Set credentials
The `run_eval.py` script reads `TASK_APP_URL`, `ENVIRONMENT_API_KEY`, and any provider keys you export. Alternatively, pass `--env-file .env` to `uvx synth-ai smoke` and `uvx synth-ai train` to keep everything in a single place.
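As an illustration of what the script expects, here is a minimal sketch of reading those variables. The variable names come from this guide; the fail-fast error handling is an assumption for the sketch, not `run_eval.py`'s actual behavior:

```python
import os

def load_credentials():
    """Collect the environment variables run_eval.py reads.

    TASK_APP_URL and ENVIRONMENT_API_KEY are the names documented above;
    raising on missing values is an assumption made for this sketch.
    """
    required = ["TASK_APP_URL", "ENVIRONMENT_API_KEY"]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")
    return {name: os.environ[name] for name in required}
```

Passing `--env-file .env` to the CLI commands achieves the same effect without exporting each variable by hand.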
2. Choose a TOML config
Ready-made configs live in `examples/warming_up_to_rl/configs/`, for example:
- `eval_groq_qwen32b.toml` — uses Groq's Qwen model.
- `eval_vllm_qwen14b.toml` — targets a self-hosted vLLM endpoint.
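For orientation, a hypothetical sketch of what such a config might contain. Only the `model` field and the concurrency/episode settings are mentioned in this guide; the table name and exact keys are illustrative, so check the shipped configs for the real schema:

```toml
# Illustrative only — see examples/warming_up_to_rl/configs/ for real schemas.
[eval]
model = "qwen/qwen3-32b"   # swap in an ft:... checkpoint after fine-tuning
episodes = 10              # episode count per model
concurrency = 4            # parallel rollouts
```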
3. Run the evaluation
- The script fetches `/task_info` from the task app to enumerate seeds and metadata.
- It runs the configured number of episodes per model, calling `/rollout`.
- Metrics stream to stdout (achievements, invalid actions, latency).
- If tracing is enabled, rollouts are written to `traces/v3`.
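The loop described above can be sketched as follows. The `/task_info` and `/rollout` paths come from this guide; the request/response shapes and the `fetch` helper are assumptions standing in for a real HTTP client, not the task app's actual schema:

```python
import time

def run_episodes(fetch, num_episodes):
    """Evaluation-loop sketch: enumerate seeds via /task_info, then call
    /rollout once per episode, collecting per-episode metrics.

    `fetch(path, payload)` stands in for an HTTP client; the payload and
    response fields used here are assumptions for illustration.
    """
    info = fetch("/task_info", None)
    seeds = info["seeds"][:num_episodes]
    results = []
    for seed in seeds:
        start = time.monotonic()
        rollout = fetch("/rollout", {"seed": seed})
        results.append({
            "seed": seed,
            "achievements": rollout.get("achievements", []),
            "invalid_actions": rollout.get("invalid_actions", 0),
            "latency_s": time.monotonic() - start,
        })
    return results
```

The per-episode dictionaries correspond to the metrics the script streams to stdout.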
4. Export traces
5. Deeper analysis
Use standard SQL tooling (`sqlite3`, DuckDB, Turso) to query the trace database; joins across the trace tables are especially useful.

Tips
- Adjust concurrency/episode counts directly in the TOML.
- Swap checkpoints by editing the `model` field (e.g., use a freshly fine-tuned `ft:model`).
- Pair evaluations with `uvx synth-ai status jobs …` to correlate results with training runs.
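To make the "Deeper analysis" step concrete, here is a self-contained `sqlite3` sketch. The v3 trace schema is not described in this guide, so the `sessions`/`events` tables below are hypothetical stand-ins that only demonstrate the kind of join and aggregation you might run against the real database:

```python
import sqlite3

# Build a tiny stand-in database; the real traces/v3 schema will differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (session_id TEXT PRIMARY KEY, model TEXT);
CREATE TABLE events (session_id TEXT, reward REAL);
INSERT INTO sessions VALUES ('s1', 'qwen3-32b'), ('s2', 'ft:my-checkpoint');
INSERT INTO events VALUES ('s1', 1.0), ('s1', 0.5), ('s2', 2.0);
""")

# Join sessions to their events and rank models by total reward.
rows = conn.execute("""
    SELECT s.model, SUM(e.reward) AS total_reward
    FROM sessions AS s
    JOIN events AS e USING (session_id)
    GROUP BY s.model
    ORDER BY total_reward DESC
""").fetchall()
print(rows)  # highest-reward model first
```

The same query shape works from the `sqlite3` shell or DuckDB once you substitute the actual table and column names from your trace database.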