Evals Demo

Run quick, local evaluations to compare models on the Crafter environment, then filter traces to prepare a fine‑tuning dataset.

Final result: achievement frequencies

Achievement                 gpt-4.1-nano   gpt-5-nano
-----------------------------------------------------
collect_drink               2/5   ( 40%)   0/5   (  0%)
collect_sapling             1/5   ( 20%)   2/5   ( 40%)
collect_wood                3/5   ( 60%)   2/5   ( 40%)
place_plant                 0/5   (  0%)   1/5   ( 20%)

Requirements
  • uv installed (the commands below use uvx and uv run)
  • OPENAI_API_KEY exported in your shell
  • The local tracing and environment service started with uvx synth-ai serve (it automatically kills anything bound to port 8901)

What this demo shows
  • Side‑by‑side model comparison on Crafter with concurrency and timeouts
  • Live progress with timeouts and achievements per episode
  • Post‑run: filter traces → JSONL for SFT + basic stats

Quick setup
uvx synth-ai serve  # local tracing + environment service (see Requirements)

# OpenAI auth
export OPENAI_API_KEY="sk-your-openai-key"

Run the demo

bash examples/evals/run_demo.sh

You’ll be prompted for inputs. Press Enter to accept defaults or type your own values:
Models to compare (space-separated) [gpt-5-nano gpt-4.1-nano]: 
Models: gpt-5-nano gpt-4.1-nano
Episodes per model [3]: 5
Max turns per episode [5]: 5
Parallelism per model (concurrency) [5]: 5
Difficulty [easy]: 
Running comparison: episodes=5, max_turns=5, difficulty=easy, concurrency=5
Detected OpenAI key present. Use this key? [Y/n]: Y

Experiment summary

🎮 Crafter Multi-Model Experiment
==================================================
Experiment ID: crafter_multi_model_20250808_170152
Models: gpt-5-nano, gpt-4.1-nano
Episodes per model: 5
Max turns per episode: 5
Difficulty: easy
Seeds: 1000 to 1004
Turn timeout: 20.0s
Episode timeout: 180.0s
Save traces: True
Database URL: sqlite+aiosqlite:////.../synth_ai.db/dbs/default/data
==================================================
✅ Crafter service is running

Live progress (sample)

Running 5 episodes for gpt-5-nano in parallel...
gpt-5-nano | ep1:  20%|█████████▌    | 1/5 [00:21<01:24, 21.13s/turn, ach=0]
⏰ Turn 3 timed out for episode 0 after 20.0s
gpt-5-nano | ep2: 100%|██████████████| 5/5 [01:07<00:00, 13.56s/turn, ach=2]

Running 5 episodes for gpt-4.1-nano in parallel...
gpt-4.1-nano | ep3: 100%|████████████| 5/5 [00:09<00:00,  1.95s/turn, ach=1]
gpt-4.1-nano | ep4: 100%|████████████| 5/5 [00:11<00:00,  2.32s/turn, ach=0]
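
The parallelism and per-turn timeout behavior above can be sketched with asyncio. This is a minimal illustration, not the demo's actual implementation: run_turn is a stand-in for the real LLM + environment step, and the semaphore bound and 20 s timeout mirror the settings printed in the experiment summary.

```python
import asyncio

TURN_TIMEOUT = 20.0   # seconds per model turn (matches the summary above)
CONCURRENCY = 5       # episodes run in parallel per model

async def run_turn(model: str, episode: int, turn: int) -> int:
    """Stand-in for one model call; returns achievements unlocked this turn."""
    await asyncio.sleep(0.01)  # placeholder for the real LLM + env step
    return 0

async def run_episode(model: str, episode: int, max_turns: int,
                      sem: asyncio.Semaphore) -> int:
    achievements = 0
    async with sem:  # cap the number of concurrently running episodes
        for turn in range(max_turns):
            try:
                achievements += await asyncio.wait_for(
                    run_turn(model, episode, turn), timeout=TURN_TIMEOUT
                )
            except asyncio.TimeoutError:
                # Mirrors "⏰ Turn N timed out ..." — skip the turn, keep the episode
                print(f"Turn {turn} timed out for episode {episode}")
    return achievements

async def compare(models, episodes=5, max_turns=5):
    sem = asyncio.Semaphore(CONCURRENCY)
    results = {}
    for model in models:  # models run one after another, as in the log
        tasks = [run_episode(model, ep, max_turns, sem) for ep in range(episodes)]
        results[model] = await asyncio.gather(*tasks)
    return results

print(asyncio.run(compare(["gpt-5-nano", "gpt-4.1-nano"])))
```

Note that a timed-out turn is skipped rather than aborting the episode, which is why the ep1 bar above keeps advancing after the timeout message.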

Results summary

📊 Analysis Results

📈 Model Performance Summary:
Model                Avg Achievements   Max Achievements   Invalid Rate    Success Rate   
--------------------------------------------------------------------------------------
gpt-4.1-nano           1.20 ± 0.75                    2            0.00%          100.00%
gpt-5-nano             1.00 ± 0.63                    2            0.00%          100.00%
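
The Avg Achievements column is the mean ± standard deviation of per-episode achievement counts. A quick way to reproduce it — the per-episode counts below are an assumption, one distribution consistent with the gpt-4.1-nano row rather than the recorded values, and the ±0.75 suggests the population standard deviation is used:

```python
from statistics import mean, pstdev

# Hypothetical per-episode achievement counts for gpt-4.1-nano
# (5 episodes; chosen to match the 1.20 ± 0.75 row, not taken from the run)
counts = [2, 2, 1, 1, 0]

avg = mean(counts)    # 1.2
std = pstdev(counts)  # population std-dev ≈ 0.75
print(f"{avg:.2f} ± {std:.2f}, max {max(counts)}")  # → 1.20 ± 0.75, max 2
```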

🏆 Achievement Frequencies:

Achievement                 gpt-4.1-nano   gpt-5-nano
-----------------------------------------------------
collect_drink               2/5   ( 40%)   0/5   (  0%)
collect_sapling             1/5   ( 20%)   2/5   ( 40%)
collect_wood                3/5   ( 60%)   2/5   ( 40%)
place_plant                 0/5   (  0%)   1/5   ( 20%)

💰 Model Usage Statistics from Current Experiment:
Model                Provider   Usage Count  Avg Latency (ms)   Total Cost  
------------------------------------------------------------------------
gpt-5-nano           openai     221          13006.57           $0.0000     
gpt-4.1-nano         openai     161          950.12             $0.0000     

💾 Detailed results saved to: .../temp/crafter_experiment_results_20250808_170312.json

✅ Experiment complete!

Post‑run analysis (trace filtering)

List achievements present in your tracing DB, then filter traces to JSONL for SFT:
Using traces DB: .../synth_ai.db/dbs/default/data
Available achievements (session counts):
  - collect_drink: 44
  - collect_sapling: 62
  - collect_wood: 74
  - defeat_skeleton: 4
  - defeat_zombie: 2
  - eat_cow: 2
  - place_plant: 8
  - place_table: 3

Enter achievements to filter by (space-separated), or press Enter for 'collect_wood':
Optionally restrict to models (space-separated), or press Enter to include all:
Generate JSONL and basic stats for the selected achievement(s):
uv run python -m examples.evals.trace_analysis \
  filter --db "/path/to/.../synth_ai.db/dbs/default/data" \
  --achievements collect_wood --output ft_data/evals_filtered.jsonl
✅ Wrote 74 examples from 74 sessions → ft_data/evals_filtered.jsonl
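
Each example lands in the JSONL file as one standalone JSON object per line. A minimal sketch of the shape, assuming the common chat-messages SFT format — the exact fields emitted by examples.evals.trace_analysis are an assumption here, so inspect ft_data/evals_filtered.jsonl for the real schema:

```python
import json

# Hypothetical SFT record in the chat-messages JSONL format.
# NOTE: the actual fields written by the filter step may differ.
examples = [
    {"messages": [
        {"role": "system", "content": "You are playing Crafter."},
        {"role": "user", "content": "Observation: a tree is ahead."},
        {"role": "assistant", "content": "collect_wood"},
    ]},
]

# One JSON object per line -- the shape fine-tuning pipelines typically expect.
jsonl = "\n".join(json.dumps(ex) for ex in examples)
for line in jsonl.splitlines():
    record = json.loads(line)  # every line must parse on its own
    assert record["messages"][-1]["role"] == "assistant"
```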
uv run python -m examples.evals.trace_analysis \
  stats --db "/path/to/.../synth_ai.db/dbs/default/data" \
  --achievements collect_wood
Matched sessions (any of: collect_wood )
  n=74  avg_reward=0.76  stddev=1.00
  avg_first_unlock_step=4.7  stddev=4.6
Others
  n=224  avg_reward=0.21  stddev=0.51

Achievement frequency by session (matched vs others):
  - collect_drink: matched 25/74 (33.8%), others 19/224 (8.5%)
  - collect_sapling: matched 21/74 (28.4%), others 41/224 (18.3%)
  - place_table: matched 3/74 (4.1%), others 0/224 (0.0%)
  - eat_cow: matched 2/74 (2.7%), others 0/224 (0.0%)
  - place_plant: matched 3/74 (4.1%), others 5/224 (2.2%)
  - defeat_skeleton: matched 2/74 (2.7%), others 2/224 (0.9%)
  - defeat_zombie: matched 0/74 (0.0%), others 2/224 (0.9%)
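
The matched-vs-others breakdown can be reproduced from per-session achievement sets: split sessions on whether they contain the target achievement, then count how often each other achievement co-occurs in each group. A minimal sketch with toy data — the real tool reads these sets from the tracing DB, whose schema is not shown here:

```python
def split_frequencies(sessions, target):
    """sessions: list of sets of achievement names; target: filter achievement."""
    matched = [s for s in sessions if target in s]
    others = [s for s in sessions if target not in s]
    names = sorted(set().union(*sessions) - {target})
    report = {}
    for name in names:
        report[name] = (
            sum(name in s for s in matched), len(matched),
            sum(name in s for s in others), len(others),
        )
    return report

# Toy sessions: each set holds the achievements unlocked in one session
sessions = [
    {"collect_wood", "collect_drink"},
    {"collect_wood"},
    {"collect_sapling"},
    {"collect_drink", "collect_sapling"},
]
for name, (m, nm, o, no) in split_frequencies(sessions, "collect_wood").items():
    print(f"  - {name}: matched {m}/{nm} ({100*m/nm:.1f}%), "
          f"others {o}/{no} ({100*o/no:.1f}%)")
```

Achievements that are markedly more frequent in the matched group (like collect_drink above: 33.8% vs 8.5%) are the ones that tend to co-occur with your filter target.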

That’s it: you now have filtered traces in ft_data/evals_filtered.jsonl ready for SFT, plus a tracing DB for deeper analysis.