Use this cookbook when the target is an eval harness, benchmark runner, or scoring workflow that needs reliability, clarity, or better failure evidence.

Goal

Start a directed run that inspects the harness, makes the smallest high-impact improvement, runs the relevant check, and returns a report with artifacts.

Python path

# Assumes `client` is an already-initialized SDK client for your account.
run = client.runs.start(
    "Inspect the eval harness, fix the highest-leverage reliability issue, run the relevant check, and leave evidence.",
    host_kind="daytona",
    work_mode="directed_effort",
    providers=[{"provider": "openrouter"}],
    runbook="lite",
)

MCP path

Ask your MCP client:
Start a Managed Research run to improve the eval harness. Use directed_effort, daytona, openrouter, and runbook lite. Require a final report with the command run, failures found, patch summary, and artifacts.

Expected evidence

  • changed files or a PR
  • command output or failure summary
  • artifact manifest
  • final report explaining what improved and what remains risky
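One way to act on this checklist is to verify the run's report programmatically before accepting it. The field names below (`changed_files`, `command_output`, `artifact_manifest`, `final_report`) are illustrative assumptions mirroring the bullets above, not a documented report schema.

```python
# Assumed report fields, one per expected-evidence bullet above.
REQUIRED_EVIDENCE = (
    "changed_files",      # changed files or a PR
    "command_output",     # command output or failure summary
    "artifact_manifest",  # artifact manifest
    "final_report",       # what improved and what remains risky
)

def missing_evidence(report: dict) -> list[str]:
    """Return which expected-evidence fields are absent or empty in a report."""
    return [field for field in REQUIRED_EVIDENCE if not report.get(field)]
```

A non-empty return value tells you exactly which evidence to request in a follow-up run.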

Failure notes

If the run cannot launch, preflight usually points to repo access, missing credentials, provider availability, or budget state.
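The four causes above can be triaged with simple keyword matching on the preflight output. This is a sketch under stated assumptions: the keyword-to-cause mapping is invented for illustration, since the actual preflight message format is not documented here.

```python
# Illustrative triage table; the keywords are assumptions, not
# actual strings emitted by the preflight check.
CAUSES = {
    "repo": "repo access",
    "credential": "missing credentials",
    "provider": "provider availability",
    "budget": "budget state",
}

def likely_cause(preflight_message: str) -> str:
    """Map a preflight failure message to one of the four common causes."""
    msg = preflight_message.lower()
    for keyword, cause in CAUSES.items():
        if keyword in msg:
            return cause
    return "unknown; inspect the full preflight output"
```

This is only a first-pass filter; when no keyword matches, read the full preflight output rather than guessing.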