## What You’ll Build
- A verifier graph trained on Rust coding traces with deterministic test rewards.
- An RLM prompt optimized on long-context QA tasks.
## Prerequisites
- `SYNTH_API_KEY` set in your environment
- Local backend running via the standard script
- EngineBench traces available under `data/engine_bench/`
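As a quick preflight, here is a minimal sketch (assuming the paths above) that checks both prerequisites:

```python
import os
from pathlib import Path

# Minimal preflight, assuming the layout described above; adjust the trace
# directory to match your checkout.
assert os.environ.get("SYNTH_API_KEY"), "SYNTH_API_KEY is not set"
assert Path("data/engine_bench").is_dir(), "no traces under data/engine_bench/"
print("environment looks ready")
```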
## Step 1: Pick Your Rust Trace Run
EngineBench eval runs live in the synth-ai repo:

- `eval_results.json` with per-seed rewards
- `traces/seed_XX.json` with Rust coding traces
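To compare runs before picking one, a sketch like the following can help. It assumes each run directory sits directly under `data/engine_bench/` and that `eval_results.json` parses to a seed-to-reward mapping; the real schema may differ.

```python
import json
from pathlib import Path

# Rank runs by mean per-seed reward. The seed -> reward mapping is an
# assumption about eval_results.json; adapt this to the actual schema.
for run_dir in sorted(Path("data/engine_bench").iterdir()):
    results_path = run_dir / "eval_results.json"
    if not results_path.is_file():
        continue
    rewards = json.loads(results_path.read_text())
    if not rewards:
        continue
    mean = sum(rewards.values()) / len(rewards)
    print(f"{run_dir.name}: mean reward {mean:.3f} over {len(rewards)} seeds")
```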
## Step 2: Optimize a Verifier Graph
Run the verifier optimization demo against your chosen trace run.

### Useful Flags

- `--agent` picks whose traces to optimize against (e.g. `--agent codex`).
- `--max-traces` limits how many traces the job uses.
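The exact entry point lives in the synth-ai repo; the sketch below only illustrates the shape of an invocation. The script path and `--traces` flag are hypothetical stand-ins, while `--agent` and `--max-traces` are the flags covered in Next Steps.

```python
import subprocess

# Hypothetical invocation: the script path and --traces flag are illustrative;
# --agent and --max-traces are the flags this guide discusses.
subprocess.run(
    [
        "python", "demos/verifier_graph_opt.py",       # hypothetical entry point
        "--traces", "data/engine_bench/<run>/traces",  # your chosen run
        "--agent", "opencode",
        "--max-traces", "10",
    ],
    check=True,
)
```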
### Example Results: RLM Verifier Optimization
Here are real results from optimizing the `zero_shot_verifier_rubric_rlm` graph on coding agent traces:

- Job: `graph_evolve_dc9c9a273fa0`
- Dataset: `opencode_rlm_verifier` (Rust coding traces from EngineBench)
- Graph: `zero_shot_verifier_rubric_rlm` (RLM v1)
#### Performance Improvement
| Metric | Baseline | Optimized | Change |
|---|---|---|---|
| Validation Reward | 0.25 | 0.50 | +100% |
| Training Reward | — | 0.375 | — |
#### What Changed: System Prompt Comparison
The baseline prompt (56 lines) included subjective guidance such as “expert code reviewer” and “be calibrated”.

#### Why This Worked
The optimization removed vague guidance (“expert code reviewer”, “be calibrated”) that caused the LLM to hallucinate subjective assessments instead of following the rubric criteria strictly. The streamlined prompt improved:

- **Rubric adherence:** the LLM focused on the actual criteria instead of inventing its own.
- **Tool usage:** clearer instructions led to a consistent `materialize_context` → `local_grep` → `submit_answer` workflow (sketched below).
- **Reward calibration:** dropped the conflicting “0.5 for partial success” guidance that clashed with the binary rubric criteria.
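As a hedged illustration of that three-step workflow (the tool signatures below are stand-ins, not the graph's real tool API):

```python
# Stub tools standing in for the verifier graph's real tool interface.
def materialize_context(trace_id: str) -> str:
    """Stub: pull the Rust coding trace into a local scratch buffer."""
    return f"trace contents for {trace_id}"

def local_grep(context: str, pattern: str) -> bool:
    """Stub: search the materialized trace for rubric-relevant evidence."""
    return pattern in context

def submit_answer(reward: float) -> float:
    """Stub: report the final reward back to the backend."""
    return reward

def verify(trace_id: str) -> float:
    # The consistent sequence the optimized prompt encourages:
    context = materialize_context(trace_id)
    passed = local_grep(context, pattern="test result: ok")  # assumed criterion
    # Binary rubric: no partial credit, so the reward is 0.0 or 1.0.
    return submit_answer(1.0 if passed else 0.0)

print(verify("seed_00"))
```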
## Step 3: Optimize an RLM Prompt
Once the verifier job is running, you can reuse the same backend session to optimize an RLM prompt, as sketched below.
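A hedged sketch of what that might look like; `Backend` and `submit` are illustrative stand-ins, not the real synth-ai client API, and the graph and dataset names are hypothetical:

```python
from dataclasses import dataclass

# Illustrative stand-in for the backend client; all names are hypothetical.
@dataclass
class Backend:
    url: str = "http://localhost:8000"  # assumed local backend address

    def submit(self, graph: str, dataset: str) -> str:
        """Stub: post a graph-evolve job and return its job id."""
        return "graph_evolve_<id>"

backend = Backend()  # the session already serving the verifier job
job_id = backend.submit(
    graph="rlm_prompt_v1",       # hypothetical RLM prompt graph name
    dataset="long_context_qa",   # hypothetical label for the QA tasks
)
print("RLM job submitted:", job_id)
```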
## Next Steps

- Swap in `--agent codex` if you want Codex traces.
- Increase `--max-traces` once you are confident in the workflow.
- Review the generated verifier graph with `job.download_graph_txt()` after completion, as sketched below.
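For the last step, here is a small sketch built around the `download_graph_txt()` call referenced above; the `GraphJob` protocol is an assumption about the job handle's shape.

```python
from pathlib import Path
from typing import Protocol

class GraphJob(Protocol):
    # Assumed shape of the job handle; only the method this guide
    # references is modeled.
    def download_graph_txt(self) -> str: ...

def save_graph(job: GraphJob, out: Path = Path("optimized_graph.txt")) -> None:
    """Download the optimized verifier graph and persist it for review."""
    out.write_text(job.download_graph_txt())
    print(f"wrote {out}")
```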