| Component | Description |
|---|---|
| System Prompt | Core instructions passed to the LLM |
| AGENTS.md | Startup instructions for Codex/OpenCode |
| Skills Files | .codex/skills.yaml, .opencode/skills.yaml |
| Context Artifacts | Architecture guides, reference snippets |
## The Challenge

Coding agents like Codex CLI and OpenCode read instructions from multiple sources:

| Source | Agent | Purpose |
|---|---|---|
| AGENTS.md | Both | Project-specific instructions |
| .codex/skills.yaml | Codex | Reusable skill definitions |
| .opencode/skills.yaml | OpenCode | Reusable skill definitions |
| System prompt | Both | Task-specific context |
## Architecture

Components:

- GEPA — Proposes mutations to instruction artifacts
- Container — Orchestrates rollouts, provisions sandboxes
- Daytona Sandbox — Isolated VM with the agent pre-installed (~3s provisioning)
- Coding Agent — Executes the task using the optimized instructions
- Interceptor — Captures LLM traces for analysis
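The flow through these components can be sketched as follows. This is a minimal illustration with hypothetical names (`provision_sandbox`, `run_agent`, and the `Sandbox`/`Trace` shapes are assumptions, not the actual container API):

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the real components; names are illustrative only.

@dataclass
class Sandbox:
    """Isolated environment provisioned from a pre-built snapshot."""
    snapshot_id: str
    files: dict = field(default_factory=dict)

@dataclass
class Trace:
    """LLM calls captured by the interceptor during one rollout."""
    calls: list = field(default_factory=list)

def provision_sandbox(snapshot_id: str) -> Sandbox:
    # In the real system this is a ~3s Daytona provisioning call.
    return Sandbox(snapshot_id=snapshot_id)

def run_agent(sandbox: Sandbox, instructions: dict, trace: Trace) -> float:
    # Write the mutated instruction artifacts into the sandbox, run the
    # agent, and return a scalar reward (e.g. a test pass rate).
    sandbox.files["AGENTS.md"] = instructions["agents_md"]
    trace.calls.append({"system_prompt": instructions["system_prompt"]})
    return 1.0 if instructions["agents_md"] else 0.0

def rollout(instructions: dict) -> tuple:
    """One GEPA rollout: fresh sandbox -> agent run -> reward + trace."""
    trace = Trace()
    sandbox = provision_sandbox("enginebench-base")
    reward = run_agent(sandbox, instructions, trace)
    return reward, trace

reward, trace = rollout({"agents_md": "Run tests before committing.",
                         "system_prompt": "You are a coding agent."})
print(reward)  # → 1.0
```

GEPA then uses the captured trace and reward to propose the next round of instruction mutations.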
## Configuration

Define what's mutable in your GEPA config, `enginebench_gepa.toml`:
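A sketch of what such a config might contain. The field names here are assumptions for illustration; consult the actual `enginebench_gepa.toml` for the real schema:

```toml
# Illustrative sketch only -- field names are assumptions, not the real schema.
[gepa]
instances = 15
generations = 5

[mutable]
# Instruction artifacts GEPA is allowed to mutate
system_prompt = true
agents_md = true
skills = false
```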
## Unified Optimization (AGENTS.md + Skills)

For full instruction-surface optimization, enable unified mode in `enginebench_gepa_unified.toml`:
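A hedged sketch of the unified additions (again, key names are illustrative; the real schema lives in `enginebench_gepa_unified.toml`):

```toml
# Illustrative sketch only -- see enginebench_gepa_unified.toml for the real schema.
[mutable]
system_prompt = true
agents_md = true
codex_skills = true      # .codex/skills.yaml
opencode_skills = true   # .opencode/skills.yaml
```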
## Example: EngineBench Instruction Updates

Here is an example of the optimized instruction bundle we validated in a recent EngineBench run. It includes:

- `agents_md` (Codex agent instructions)
- `codex_skills` (skill triggers and steps)
- `system_prompt` (task-specific prompt)
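As a rough illustration, such a bundle can be represented as a simple mapping from component name to content, with each component landing in the file the agent actually reads. The structure below is an assumption for illustration; the actual artifact format may differ:

```python
# Illustrative shape of an instruction bundle; keys match the components
# listed above, but the exact serialization is an assumption.
bundle = {
    "agents_md": "## Project rules\nAlways run the test suite before finishing.",
    "codex_skills": [
        {"name": "run-tests", "trigger": "tests", "steps": ["pytest -q"]},
    ],
    "system_prompt": "You are a coding agent working on EngineBench tasks.",
}

# Where each component is delivered.
targets = {
    "agents_md": "AGENTS.md",
    "codex_skills": ".codex/skills.yaml",
    "system_prompt": None,  # passed directly to the LLM, not written to disk
}

print(sorted(bundle))  # → ['agents_md', 'codex_skills', 'system_prompt']
```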
## What Gets Optimized

GEPA evolves the instruction content while preserving structure.

## Running with Daytona Sandboxes

EngineBench uses Daytona for isolated agent execution. Each rollout provisions a fresh sandbox from a pre-built snapshot (~3 seconds).

### Prerequisites
### Production Mode

Use the production backend with a managed tunnel. Demos have moved to cookbooks. For GEPA with coding agents, see code/training/prompt_learning/gepa.

## Results
After optimization, expect improvements like:

| Metric | Baseline | Optimized |
|---|---|---|
| Test pass rate | 45% | 78% |
| Compilation success | 72% | 94% |
| Average rollout time | 95s | 85s |
## Example Run Output (EngineBench)

From the same run, the validation phase reported:

- 95 optimization rollouts completed
- Baseline validation reward: 0.971 (10 seeds)
- Top-K validation (3 candidates) using the interceptor URL
If `snapshot_id` is `None`, snapshot creation failed and prompts cannot be retrieved for that job. Check the earlier logs for snapshot creation errors.
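A minimal guard for this failure mode might look like the following. The function and the job's field layout are hypothetical, not part of the actual API:

```python
# Hypothetical helper -- the real job object and its fields may differ.
def require_snapshot(job: dict) -> str:
    """Return the job's snapshot_id, or fail loudly when creation failed."""
    snapshot_id = job.get("snapshot_id")
    if snapshot_id is None:
        raise RuntimeError(
            f"Snapshot creation failed for job {job.get('id')}; "
            "prompts cannot be retrieved. Check earlier logs for errors."
        )
    return snapshot_id

print(require_snapshot({"id": "job-1", "snapshot_id": "snap-abc"}))  # → snap-abc
```

Failing early with a pointed message is preferable to letting a later prompt-retrieval step die with a less informative error.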
## Supported Agents
| Agent | CLI | Instructions Source |
|---|---|---|
| Codex | codex exec | AGENTS.md, .codex/skills.yaml |
| OpenCode | opencode | AGENTS.md, .opencode/skills.yaml |
| Claude Code | claude | AGENTS.md |
## Demo Files

| File | Purpose |
|---|---|
| run_gepa_unified.py | Main entry point for GEPA optimization |
| container_engine_bench.py | Container with Daytona sandbox support |
| enginebench_gepa.toml | Full GEPA config (15 instances, 5 generations) |
| enginebench_gepa_minimal.toml | Minimal config for quick testing |
| daytona_helper.py | Sandbox provisioning and agent execution |
## Quick Start
## When to Use This Pattern

Good fit:

- Coding agents with instruction files (AGENTS.md, skills)
- Tasks with deterministic evaluation (tests, linters)
- Long-running rollouts that benefit from sandboxes
- Multiple instruction sources to co-optimize

Not a good fit:

- Simple prompts → Use standard GEPA
- No sandbox needed → Use embedded containers
- Human evaluation → Use verifier optimization