This walkthrough shows an end-to-end flow for verifier optimization on Rust coding traces, followed by recursive language model (RLM) prompt optimization. It uses the EngineBench trace set captured from coding agents, then reuses the same backend setup for RLM training.

What You’ll Build

  • A verifier graph trained on Rust coding traces with deterministic test rewards.
  • An RLM prompt optimized on long-context QA tasks.

Prerequisites

  • SYNTH_API_KEY set in your environment
  • Local backend running via the standard script
  • EngineBench traces available under data/engine_bench/
cd /Users/joshpurtell/Documents/GitHub/monorepo
./scripts/run_backend_local.sh
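Before launching either demo, it can help to confirm the API key prerequisite is satisfied in the same shell. A minimal check (assuming a POSIX shell) is:
# Warn if the key is missing; both demos below require it.
if [ -z "$SYNTH_API_KEY" ]; then echo "SYNTH_API_KEY is not set" >&2; else echo "SYNTH_API_KEY is set"; fi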

Step 1: Pick Your Rust Trace Run

EngineBench eval runs live in the synth-ai repo:
/Users/joshpurtell/Documents/GitHub/synth-ai/data/engine_bench/opencode/20260115_124051_eval_e3de6913f3c84e02
Each run includes:
  • eval_results.json with per-seed rewards
  • traces/seed_XX.json with Rust coding traces
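Before training a verifier against a run, it is worth a quick peek at the directory to confirm both pieces are present. A minimal inspection of the run above (the exact JSON schema can vary between runs) looks like:
cd /Users/joshpurtell/Documents/GitHub/synth-ai/data/engine_bench/opencode/20260115_124051_eval_e3de6913f3c84e02
ls traces/ | head                                    # one trace file per seed, e.g. seed_29.json
python3 -m json.tool eval_results.json | head -n 30  # skim the per-seed rewards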

Step 2: Optimize a Verifier Graph

Run the verifier optimization demo against your chosen trace run:
cd /Users/joshpurtell/Documents/GitHub/synth-ai
uv run python demos/engine_bench_verifier_opt/run_demo.py \
  --local \
  --agent opencode \
  --eval-dir /Users/joshpurtell/Documents/GitHub/synth-ai/data/engine_bench/opencode/20260115_124051_eval_e3de6913f3c84e02
Example progress (trimmed):
INFO:     127.0.0.1:58508 - "POST /rollout HTTP/1.1" 200 OK
  [5m49s] running | Trials: 0/9 (0%) | Best: 0.950
  ✓ Rollout seed_29: reward=0.00 (pred=0.00, gold=1.0) | 10.1s | 9k tok | $0.0257
INFO:     127.0.0.1:58517 - "POST /rollout HTTP/1.1" 200 OK
  ✓ Rollout seed_27: reward=0.00 (pred=0.00, gold=1.0) | 11.7s | 13k tok | $0.0381
INFO:     127.0.0.1:58514 - "POST /rollout HTTP/1.1" 200 OK
  ✓ Rollout seed_35: reward=1.00 (pred=1.00, gold=1.0) | 11.9s | 32k tok | $0.0838

Useful Flags

--max-traces 20        # limit training traces
--max-items 12         # max list items preserved per trace field
--max-chars 2400       # max characters preserved per string field
--generations 2        # optimization generations
--children 4           # population size
--rollout-budget 60    # total rollouts
--policy-model gpt-4o-mini
--verifier-model gpt-4o-mini
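As an example, a smaller smoke-test run that combines several of these flags with the Step 2 command might look like the following (the flag values here are illustrative, not tuned recommendations):
uv run python demos/engine_bench_verifier_opt/run_demo.py \
  --local \
  --agent opencode \
  --eval-dir /Users/joshpurtell/Documents/GitHub/synth-ai/data/engine_bench/opencode/20260115_124051_eval_e3de6913f3c84e02 \
  --max-traces 20 \
  --generations 2 \
  --children 4 \
  --rollout-budget 60 \
  --policy-model gpt-4o-mini \
  --verifier-model gpt-4o-mini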

Example Results: RLM Verifier Optimization

Here are real results from optimizing the zero_shot_verifier_rubric_rlm graph on coding agent traces:
Job: graph_evolve_dc9c9a273fa0
Dataset: opencode_rlm_verifier (Rust coding traces from EngineBench)
Graph: zero_shot_verifier_rubric_rlm (RLM v1)

Performance Improvement

Metric               Baseline   Optimized   Change
Validation Reward    0.25       0.50        +100%
Training Reward      0.375

What Changed: System Prompt Comparison

Baseline prompt (56 lines) included subjective guidance:
You are an expert code reviewer evaluating execution traces.
Focus on:
1. Whether the agent used tools effectively
2. Whether code changes are correct and follow conventions
3. Whether the task was completed successfully
Be calibrated: reward 0.0 for failures, 0.5 for partial success, 1.0 for full success.

You are evaluating an execution trace against rubric criteria...
Optimized prompt (50 lines) removed the preamble and went straight to functional instructions:
You are evaluating an execution trace against rubric criteria.
Query: <input>query</input>
Your output MUST include a final reward in outcome_review.total (0.0 to 1.0).

CRITICAL: You do NOT have direct access to the trace content!...

Why This Worked

The optimization removed vague guidance (“expert code reviewer”, “be calibrated”) that caused the LLM to hallucinate subjective assessments instead of following the rubric criteria strictly. The streamlined prompt improved:
  1. Rubric adherence — LLM focused on actual criteria instead of inventing its own
  2. Tool usage — Clearer instructions led to a consistent materialize_context → local_grep → submit_answer workflow
  3. Reward calibration — Removed conflicting “0.5 for partial success” guidance that clashed with binary rubric criteria

Step 3: Optimize an RLM Prompt

Once the verifier job is running, you can reuse the same backend session to optimize an RLM prompt:
cd /Users/joshpurtell/Documents/GitHub/synth-ai
uv run python demos/rlm-mit/run_demo.py --local
This demo uses the Oolong dataset and reports outcome rewards for long-context reasoning tasks.

Next Steps

  • Swap to --agent codex if you want Codex traces instead.
  • Increase --max-traces once you are confident in the workflow.
  • Review the generated verifier graph with job.download_graph_txt() after completion.