Use this cookbook when the worker should improve a benchmark result or reproduce a benchmark failure.

Goal

Give the worker the benchmark command, budget, target metric, and review criteria. Require evidence for every claimed improvement.

Launch

run = client.runs.start(
    "Improve the benchmark result without changing evaluation configs. Run the benchmark command, compare against baseline, and report evidence.",
    host_kind="daytona",            # sandboxed execution host
    work_mode="directed_effort",    # target metric and command are known
    providers=[{"provider": "openrouter"}],
    runbook="heavy",                # longer multi-actor work
    timebox_seconds=60 * 60,        # one-hour budget
)

Prompt details to include

  • benchmark command
  • baseline metric and target metric
  • allowed files and forbidden files
  • maximum runtime or spend
  • required report format
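The details above can be assembled into a single prompt string. A minimal sketch follows; the function, parameter names, and example values are illustrative, not part of the SDK:

```python
# Sketch: build a benchmark-improvement prompt from the required details.
# All names and example values here are illustrative assumptions.
def build_prompt(command, baseline, target, allowed, forbidden, budget, report_format):
    return (
        "Improve the benchmark result without changing evaluation configs.\n"
        f"Benchmark command: {command}\n"
        f"Baseline metric: {baseline}; target metric: {target}\n"
        f"Allowed files: {', '.join(allowed)}\n"
        f"Forbidden files: {', '.join(forbidden)}\n"
        f"Budget: {budget}\n"
        f"Report format: {report_format}"
    )

prompt = build_prompt(
    command="python bench.py --suite full",
    baseline="71.2% accuracy",
    target="74.0% accuracy",
    allowed=["src/"],
    forbidden=["eval/", "configs/"],
    budget="60 minutes wall clock",
    report_format="markdown with a baseline/candidate table",
)
```

Passing the assembled string as the first argument to `client.runs.start` keeps every constraint in one place.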

Expected evidence

  • baseline and candidate metrics
  • changed files
  • command output
  • artifact manifest
  • final report with reproducibility notes
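A review can reject reports mechanically by checking that every evidence item above is present. This sketch assumes the report arrives as a dict; the key names are hypothetical, not an SDK contract:

```python
# Sketch: flag missing evidence in a worker's report.
# The report shape and key names are assumptions, not an SDK contract.
REQUIRED_EVIDENCE = {
    "baseline_metric",
    "candidate_metric",
    "changed_files",
    "command_output",
    "artifact_manifest",
    "reproducibility_notes",
}

def missing_evidence(report: dict) -> set:
    """Return the evidence keys the report omits or leaves empty."""
    return {key for key in REQUIRED_EVIDENCE if not report.get(key)}

report = {
    "baseline_metric": "71.2% accuracy",
    "candidate_metric": "74.1% accuracy",
    "changed_files": ["src/model.py"],
    "command_output": "suite full: 74.1% (logs/run.txt)",
}
gaps = missing_evidence(report)  # artifact manifest and notes still missing
```

Rejecting any report with a non-empty gap set enforces the "evidence for every claimed improvement" rule from the goal.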

Failure notes

Use runbook="heavy" when the benchmark requires longer, multi-actor work. Use work_mode="directed_effort" when the target metric and command are already known.
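The guidance above can be sketched as a small selection helper. The fallback values "standard" and "exploratory" are hypothetical, not documented options:

```python
# Sketch: pick launch options per the failure notes above.
# "standard" and "exploratory" are hypothetical fallbacks, not documented values.
def launch_options(metric_and_command_known: bool, multi_actor: bool) -> dict:
    return {
        "work_mode": "directed_effort" if metric_and_command_known else "exploratory",
        "runbook": "heavy" if multi_actor else "standard",
    }
```

For the launch shown earlier, both conditions hold, so the helper yields directed_effort with the heavy runbook.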