Use this cookbook when the target is an eval harness, benchmark runner, or scoring workflow that needs reliability, clarity, or better failure evidence.

Goal

Start a directed run that inspects the harness, makes the smallest high-impact improvement, runs the relevant check, and returns a report with artifacts.

Python path

# Assumes `client` is an already-initialized SDK client for your account.
run = client.runs.start(
    "Inspect the eval harness, fix the highest-leverage reliability issue, run the relevant check, and leave evidence.",
    host_kind="daytona",
    work_mode="directed_effort",
    providers=[{"provider": "openrouter"}],
    runbook="lite",
)

MCP path

Ask your MCP client:
Start a Managed Research run to improve the eval harness. Use directed_effort, daytona, openrouter, and runbook lite. Require a final report with the command run, failures found, patch summary, and artifacts.

Expected evidence

  • changed files or a PR
  • command output or failure summary
  • artifact manifest
  • final report explaining what improved and what remains risky
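One way to act on this checklist is to verify the run's report programmatically before accepting it. The field names below (`changed_files`, `command_output`, `artifact_manifest`, `final_report`) are illustrative assumptions mirroring the bullets above, not a documented report schema.

```python
# Assumed report fields, one per expected-evidence bullet above.
REQUIRED_EVIDENCE = (
    "changed_files",      # changed files or a PR
    "command_output",     # command output or failure summary
    "artifact_manifest",  # artifact manifest
    "final_report",       # what improved and what remains risky
)

def missing_evidence(report: dict) -> list[str]:
    """Return which expected-evidence fields are absent or empty in a report."""
    return [field for field in REQUIRED_EVIDENCE if not report.get(field)]
```

A non-empty return value tells you exactly which evidence to request in a follow-up run.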

Failure notes

If the run cannot launch, preflight usually points to repo access, missing credentials, provider availability, or budget state.
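The four causes above can be triaged with simple keyword matching on the preflight output. This is a sketch under stated assumptions: the keyword-to-cause mapping is invented for illustration, since the actual preflight message format is not documented here.

```python
# Illustrative triage table; the keywords are assumptions, not
# actual strings emitted by the preflight check.
CAUSES = {
    "repo": "repo access",
    "credential": "missing credentials",
    "provider": "provider availability",
    "budget": "budget state",
}

def likely_cause(preflight_message: str) -> str:
    """Map a preflight failure message to one of the four common causes."""
    msg = preflight_message.lower()
    for keyword, cause in CAUSES.items():
        if keyword in msg:
            return cause
    return "unknown; inspect the full preflight output"
```

This is only a first-pass filter; when no keyword matches, read the full preflight output rather than guessing.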