What it runs
- Prompt and policy optimization — baseline → GEPA/MIPRO → holdout against your labeled dataset, with before/after scores and the winning candidate as a PR
- Evaluation loops — nightly runs against a versioned harness and dataset; structured scoring on every run
- Dataset and eval assembly — agents build dataset splits and verifiers from your traces and repo
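
The baseline → optimize → holdout flow in the first bullet can be sketched in a few lines. This is a toy illustration, not the GEPA or MIPRO algorithm: the "prompt" is a single hypothetical keyword, the optimizer is a plain best-of-candidates search, and every name (`make_predictor`, `optimize`, the tiny dataset) is an assumption made for the example.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, expected label)


def make_predictor(keyword: str) -> Callable[[str], str]:
    # Hypothetical stand-in for a prompt: label text "pos" if it contains the keyword.
    return lambda text: "pos" if keyword in text else "neg"


def accuracy(predict: Callable[[str], str], examples: Sequence[Example]) -> float:
    # Fraction of examples where the prediction matches the label.
    return sum(predict(x) == y for x, y in examples) / len(examples)


def optimize(candidates: Sequence[str], train: Sequence[Example]) -> str:
    # Toy optimizer: pick the candidate that scores best on the training split.
    # A real run would use GEPA or MIPRO here instead.
    return max(candidates, key=lambda k: accuracy(make_predictor(k), train))


# Labeled data, split so the winner is judged on examples it was not tuned on.
train = [
    ("great product", "pos"),
    ("bad service", "neg"),
    ("great value", "pos"),
    ("not good", "neg"),
]
holdout = [
    ("great support", "pos"),
    ("good grief, bad", "neg"),
]

baseline = "good"
winner = optimize(["good", "great", "bad"], train)

# Before/after scores are both measured on the holdout split only.
report = {
    "baseline_holdout": accuracy(make_predictor(baseline), holdout),
    "winner": winner,
    "winner_holdout": accuracy(make_predictor(winner), holdout),
}
print(report)
```

Scoring both the baseline and the winner on the same holdout split is what makes the before/after comparison meaningful: the optimizer never sees those examples while choosing a candidate.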
How it works
You trigger a run. A Codex orchestrator agent claims it, provisions your repo into a workspace, and uses dispatch_worker to spin up Codex worker agents in isolated Daytona sandboxes. Workers read their task instructions and the project config, run the optimization (GEPA or MIPRO via the pre-deployed run_gepa.py), and emit artifacts as they go. When the run finishes, you get a morning_summary, experiment_results, and a proof_bundle_manifest that links everything.
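
To make the idea of a manifest that "links everything" concrete, here is a minimal sketch of one way such a manifest could be assembled. The actual proof_bundle_manifest format is not described here, so the field names (`run_id`, `artifacts`, `sha256`, `bytes`), the `build_manifest` helper, and the sample contents are all assumptions for illustration.

```python
import hashlib
import json


def build_manifest(run_id: str, artifacts: dict[str, bytes]) -> dict:
    # Hypothetical layout: one entry per artifact, with a content digest so a
    # reader can check that each linked artifact is the one the run produced.
    return {
        "run_id": run_id,
        "artifacts": [
            {
                "name": name,
                "sha256": hashlib.sha256(data).hexdigest(),
                "bytes": len(data),
            }
            for name, data in sorted(artifacts.items())
        ],
    }


# Placeholder artifact contents; real runs would read these from the workspace.
artifacts = {
    "morning_summary": b"Run finished. Winner beat baseline on holdout.",
    "experiment_results": json.dumps({"baseline": 0.0, "winner": 1.0}).encode(),
}

manifest = build_manifest("run-001", artifacts)
print(json.dumps(manifest, indent=2))
```

Recording a digest per artifact is a design choice in this sketch: it lets anyone holding the manifest confirm that the summary and results they are reading have not changed since the run wrote them.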