| Component | What Gets Optimized |
|---|---|
| System Prompt | Core instructions passed to the LLM |
| AGENTS.md | Startup instructions for Codex/OpenCode |
| Skills Files | .codex/skills.yaml, .opencode/skills.yaml |
| Context Artifacts | Architecture guides, reference snippets |
The Challenge
Coding agents like Codex CLI and OpenCode read instructions from multiple sources:

| Source | Agent | Purpose |
|---|---|---|
| AGENTS.md | Both | Project-specific instructions |
| .codex/skills.yaml | Codex | Reusable skill definitions |
| .opencode/skills.yaml | OpenCode | Reusable skill definitions |
| System prompt | Both | Task-specific context |
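
To make the table above concrete, here is a minimal sketch of gathering every instruction source a given agent would see into one dictionary. The `INSTRUCTION_FILES` mapping and the `collect_instruction_surface` helper are hypothetical names introduced for illustration; only the file paths come from the table.

```python
from pathlib import Path

# Hypothetical mapping built from the table above: files each agent reads.
INSTRUCTION_FILES = {
    "codex": ["AGENTS.md", ".codex/skills.yaml"],
    "opencode": ["AGENTS.md", ".opencode/skills.yaml"],
}


def collect_instruction_surface(
    project_root: Path, agent: str, system_prompt: str
) -> dict[str, str]:
    """Return the instruction artifacts visible to `agent`, keyed by source."""
    surface = {"system_prompt": system_prompt}
    for rel_path in INSTRUCTION_FILES[agent]:
        path = project_root / rel_path
        if path.is_file():  # files that do not exist are skipped
            surface[rel_path] = path.read_text(encoding="utf-8")
    return surface
```

Treating the sources as one dictionary is a design choice for this sketch: it lets an optimizer consider the system prompt and the on-disk files together rather than one at a time.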
Architecture
Components:

- GEPA: Proposes mutations to instruction artifacts
- Task App: Orchestrates rollouts, provisions sandboxes
- Daytona Sandbox: Isolated VM with agent pre-installed (~3s provisioning)
- Coding Agent: Executes task using optimized instructions
- Interceptor: Captures LLM traces for analysis
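
The way these components interact can be sketched as a simple loop. This is an illustrative simplification, not GEPA's actual search procedure: it uses a greedy keep-the-best strategy, and `propose` and `rollout` are hypothetical callables standing in for GEPA and for the task app, sandbox, agent, and interceptor combined.

```python
from dataclasses import dataclass
from typing import Callable

# Artifact name -> content, e.g. {"AGENTS.md": "..."}
Instructions = dict[str, str]


@dataclass
class RolloutResult:
    score: float       # e.g. fraction of tests passed in the sandbox
    traces: list[str]  # LLM traces captured during the rollout


def optimize(
    seed: Instructions,
    propose: Callable[[Instructions, list[str]], Instructions],  # stand-in for GEPA
    rollout: Callable[[Instructions], RolloutResult],  # stand-in for a sandboxed run
    generations: int,
) -> tuple[Instructions, float]:
    """Greedy sketch: keep the best-scoring instruction set seen so far."""
    best = seed
    result = rollout(best)
    best_score, traces = result.score, result.traces
    for _ in range(generations):
        candidate = propose(best, traces)  # mutate using the latest traces
        result = rollout(candidate)        # fresh rollout for the candidate
        if result.score > best_score:
            best, best_score, traces = candidate, result.score, result.traces
    return best, best_score
```

Passing the captured traces back into `propose` mirrors the role of the interceptor: the proposer sees what the agent actually did, not just the final score.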
Configuration
Define what’s mutable in your GEPA config, enginebench_gepa.toml.
Unified Optimization (AGENTS.md + Skills)
For full instruction-surface optimization, enable unified mode in enginebench_gepa_unified.toml.
What Gets Optimized
GEPA evolves the instruction content while preserving structure.

Running with Daytona Sandboxes
EngineBench uses Daytona for isolated agent execution. Each rollout provisions a fresh sandbox from a pre-built snapshot (~3 seconds).

Prerequisites
Local Mode (Development)
Start the local backend and run optimization.

Production Mode
Use the production backend with a managed tunnel.

Results
After optimization, expect improvements like:

| Metric | Baseline | Optimized |
|---|---|---|
| Test pass rate | 45% | 78% |
| Compilation success | 72% | 94% |
| Average rollout time | 95s | 85s |
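
To read the table as changes rather than raw values, the short sketch below computes the absolute and relative difference for each row. The numbers are copied from the illustrative table above; `summarize` is a hypothetical helper, not part of any library.

```python
# Values copied from the results table above: (baseline, optimized).
results = {
    "Test pass rate (%)": (45, 78),
    "Compilation success (%)": (72, 94),
    "Average rollout time (s)": (95, 85),
}


def summarize(baseline: float, optimized: float) -> tuple[float, float]:
    """Return (absolute change, relative change in percent)."""
    delta = optimized - baseline
    return delta, 100 * delta / baseline


for metric, (baseline, optimized) in results.items():
    delta, rel = summarize(baseline, optimized)
    print(f"{metric}: {delta:+} ({rel:+.1f}%)")
```

For these figures, the pass rate rises by 33 percentage points, compilation success by 22 points, and average rollout time falls by 10 seconds.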
Supported Agents
Demo Files
| File | Purpose |
|---|---|
| run_gepa_unified.py | Main entry point for GEPA optimization |
| localapi_engine_bench.py | Task app with Daytona sandbox support |
| enginebench_gepa.toml | Full GEPA config (15 instances, 5 generations) |
| enginebench_gepa_minimal.toml | Minimal config for quick testing |
| daytona_helper.py | Sandbox provisioning and agent execution |
Quick Start
When to Use This Pattern
Good fit:

- Coding agents with instruction files (AGENTS.md, skills)
- Tasks with deterministic evaluation (tests, linters)
- Long-running rollouts that benefit from sandboxes
- Multiple instruction sources to co-optimize

Not a good fit:

- Simple prompts → Use standard GEPA
- No sandbox needed → Use embedded task apps
- Human evaluation → Use verifier optimization