The Task
Evaluate Crafter agent traces and produce scores.

Why Train a Verifier?
Instead of using expensive frontier models (GPT-4, Claude) to verify every agent trace:

- Train once on human-labeled traces
- Run cheaply on GPT-4o-mini or Groq
- Get consistent evaluation across runs
- Use as reward for optimization loops
Dataset Format
A Crafter verifier dataset contains V3 traces with gold scores.

Key Dataset Requirements
1. Task Inputs Must Have Traces
Each task input must contain a `trace` field with a V3 `SessionTrace`:
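A minimal sketch of what such a task input can look like. Only the `trace` field is required by the format; the other keys and the abbreviated `SessionTrace` shape shown here are illustrative assumptions:

```python
# Hypothetical task input for a Crafter verifier dataset; only the "trace"
# field is required by the format, other keys are illustrative.
task_input = {
    "task_id": "crafter-episode-0042",  # illustrative identifier
    "trace": {  # V3 SessionTrace, abbreviated
        "session_id": "sess-0042",
        "events": [
            {"event_id": 0, "type": "environment", "observation": "..."},
            {"event_id": 1, "type": "runtime", "action": "move_right"},
        ],
    },
}
```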
2. Gold Outputs Must Have Scores
Every gold output must include a `score` field (float, 0-1):
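A sketch of a gold output satisfying this requirement (the value is made up):

```python
# Hypothetical gold output; the dataset format requires a float "score"
# in the range [0, 1].
gold_output = {"score": 0.72}
```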
3. Events Need Integer IDs
Each event in the trace must have an integer `event_id` for linking rewards:
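The linkage works by keying rewards on the same integer IDs the events carry. A minimal sketch (event shapes and reward values are illustrative):

```python
# Events carry integer event_ids so per-event rewards can be linked back
# to the exact step they describe (shapes here are illustrative).
events = [
    {"event_id": 0, "type": "environment"},
    {"event_id": 1, "type": "runtime"},
]
# Rewards keyed by event_id:
event_rewards = {1: 0.5}

# Join rewards back onto their events via the shared integer key.
linked = [ev for ev in events if ev["event_id"] in event_rewards]
```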
Training the Verifier
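synth-ai's actual training entry point is not reproduced in this page, so the sketch below only shows the shape of the problem: fitting a candidate verifier against gold scores, with mean absolute error as the fit signal. `fit_verifier` and the baseline scorer are hypothetical stand-ins:

```python
# Hypothetical sketch: measures how closely a candidate verifier's
# predictions track the dataset's gold scores (lower MAE = better fit).
def fit_verifier(examples, predict):
    errors = [abs(predict(ex["trace"]) - ex["gold_score"]) for ex in examples]
    return sum(errors) / len(errors)

# Tiny illustrative dataset (traces abbreviated).
examples = [
    {"trace": {"events": []}, "gold_score": 0.2},
    {"trace": {"events": [{"event_id": 0}]}, "gold_score": 0.6},
]

# A trivial baseline verifier: score by event count (illustrative only).
def baseline(trace):
    return min(1.0, 0.2 + 0.4 * len(trace["events"]))

mae = fit_verifier(examples, baseline)
```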
Using the Trained Verifier
Evaluate a Single Trace
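A usage sketch of single-trace evaluation. `TrainedVerifier.score` is an illustrative stand-in for the trained verifier's inference call, not a documented synth-ai API:

```python
# Hypothetical usage sketch: score() stands in for the trained verifier's
# real inference call.
class TrainedVerifier:
    def score(self, trace: dict) -> float:
        # Stand-in heuristic: fraction of a 10-event budget, capped at 1.0.
        return min(1.0, len(trace.get("events", [])) / 10)

verifier = TrainedVerifier()
result = verifier.score({"events": [{"event_id": i} for i in range(5)]})
# result is a float in [0, 1], matching the gold-score format
```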
Use as RL Reward Signal
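Because the verifier emits a scalar in [0, 1], it can drop into a rollout loop as the episode reward. The sketch below is hypothetical: `score_trace` and `collect_episode` stand in for the trained verifier and the Crafter agent rollout:

```python
# Hypothetical sketch of plugging verifier scores into an RL loop.
def score_trace(trace: dict) -> float:
    # Stand-in for the trained verifier's inference call.
    return min(1.0, len(trace.get("events", [])) / 10)

def collect_episode() -> dict:
    # Placeholder rollout; a real loop would run the Crafter agent here.
    return {"events": [{"event_id": i} for i in range(7)]}

trace = collect_episode()
reward = score_trace(trace)  # one scalar reward per episode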
The verifier output integrates directly with synth-ai tracing.

Crafter Scoring Rubric
The example rubric evaluates Crafter agents on 5 dimensions:

| Criterion | Weight | Description |
|---|---|---|
| Achievement Progression | 35% | Late-game achievements (iron tools, furnace) score higher |
| Resource Stockpile | 20% | Inventory quality (>20 wood = high score) |
| Survival State | 20% | Health, food, drink above 50% |
| Failure Analysis | 15% | How well agent mitigated death risk |
| Future Readiness | 10% | Preparation for next objectives |
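The five per-criterion scores combine by these weights into the final score. A sketch of that weighted sum (the per-criterion scores below are made up; the weights come from the table):

```python
# Weights from the rubric table above; they sum to 1.0.
WEIGHTS = {
    "achievement_progression": 0.35,
    "resource_stockpile": 0.20,
    "survival_state": 0.20,
    "failure_analysis": 0.15,
    "future_readiness": 0.10,
}

# Illustrative per-criterion scores in [0, 1]:
criterion_scores = {
    "achievement_progression": 0.4,
    "resource_stockpile": 0.8,
    "survival_state": 0.6,
    "failure_analysis": 0.5,
    "future_readiness": 0.7,
}

final_score = sum(WEIGHTS[k] * criterion_scores[k] for k in WEIGHTS)
```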
Trace Compression
Crafter traces can be large (~30KB/step with images). For training, compress each trace down to its essential data.

Configuration Tips
For Better Correlation with Human Scores
For Lower Inference Cost
Related
- Verifier Graph Dataset Format - Full dataset spec
- Graph Inference - Verifier API reference
- V3 Traces - Trace format documentation