Benchmark Improvement

Use this cookbook when the worker should improve a benchmark result or reproduce a benchmark failure.

Goal

Give the worker the benchmark command, budget, target metric, and review criteria. Require evidence for every claimed improvement.

Launch

# Assumes `client` is an already-initialized API client.
run = client.research.runs.start(
    "Improve the benchmark result without changing evaluation configs. Run the benchmark command, compare against baseline, and report evidence.",
    host_kind="daytona",
    work_mode="directed_effort",
    providers=[{"provider": "openrouter"}],
    runbook="heavy",
    timebox_seconds=60 * 60,  # one-hour timebox
)

Prompt details to include

benchmark command
baseline metric and target metric
allowed files and forbidden files
maximum runtime or spend
required report format
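
The details above can be gathered into one prompt string before launching a run. The following is a minimal sketch, not part of the API: `BenchmarkBrief` and `build_prompt` are hypothetical helper names, and the field values in the example are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class BenchmarkBrief:
    """Hypothetical container for the prompt details listed above."""
    command: str
    metric_name: str
    baseline: float
    target: float
    allowed_files: list[str]
    forbidden_files: list[str]
    max_runtime_seconds: int
    report_format: str


def build_prompt(brief: BenchmarkBrief) -> str:
    """Render every prompt detail on its own line so none is omitted."""
    lines = [
        "Improve the benchmark result without changing evaluation configs.",
        f"Benchmark command: {brief.command}",
        f"Baseline {brief.metric_name}: {brief.baseline}",
        f"Target {brief.metric_name}: {brief.target}",
        f"Allowed files: {', '.join(brief.allowed_files)}",
        f"Forbidden files: {', '.join(brief.forbidden_files)}",
        f"Maximum runtime: {brief.max_runtime_seconds} seconds",
        f"Required report format: {brief.report_format}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = BenchmarkBrief(
        command="make bench",
        metric_name="accuracy",
        baseline=0.70,
        target=0.75,
        allowed_files=["src/"],
        forbidden_files=["eval/config.yaml"],
        max_runtime_seconds=3600,
        report_format="markdown with a metrics table",
    )
    print(build_prompt(example))
```

The resulting string can be passed as the first argument to the launch call in place of a hand-written prompt.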

Expected evidence

baseline and candidate metrics
changed files
command output
artifact manifest
final report with reproducibility notes
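
A reviewer can check a returned report against this list mechanically. The sketch below assumes the report has been parsed into a plain dictionary. The key names and both helper functions are hypothetical and are not defined by the API.

```python
# Hypothetical key names, one per expected evidence item listed above.
REQUIRED_EVIDENCE = (
    "baseline_metric",
    "candidate_metric",
    "changed_files",
    "command_output",
    "artifact_manifest",
    "reproducibility_notes",
)


def missing_evidence(report: dict) -> list:
    """Return the evidence keys that are absent or empty in the report."""
    missing = []
    for key in REQUIRED_EVIDENCE:
        value = report.get(key)
        is_empty_container = isinstance(value, (str, list, dict)) and len(value) == 0
        # A metric of 0.0 counts as present; only None or empty values are missing.
        if value is None or is_empty_container:
            missing.append(key)
    return missing


def is_improvement(baseline: float, candidate: float, higher_is_better: bool = True) -> bool:
    """Compare candidate against baseline in the direction the metric improves."""
    if higher_is_better:
        return candidate > baseline
    return candidate < baseline
```

A claimed improvement would be accepted only when `missing_evidence` returns an empty list and `is_improvement` holds for the reported metrics. Set `higher_is_better=False` for metrics such as latency or error rate.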

Failure notes

Use runbook="heavy" when the benchmark requires longer, multi-actor work. Use work_mode="directed_effort" when the target metric and the benchmark command are already known.
