GEPA can optimize the full instruction surface of coding agents: system prompts, AGENTS.md files, skill definitions, and other context artifacts. This cookbook shows how to set up agent optimization using the EngineBench benchmark with Daytona cloud sandboxes.

| Component | What Gets Optimized |
| --- | --- |
| System prompt | Core instructions passed to the LLM |
| AGENTS.md | Startup instructions for Codex/OpenCode |
| Skills files | `.codex/skills.yaml`, `.opencode/skills.yaml` |
| Context artifacts | Architecture guides, reference snippets |

The Challenge

Coding agents like Codex CLI and OpenCode read instructions from multiple sources:

| Source | Agent | Purpose |
| --- | --- | --- |
| AGENTS.md | Both | Project-specific instructions |
| `.codex/skills.yaml` | Codex | Reusable skill definitions |
| `.opencode/skills.yaml` | OpenCode | Reusable skill definitions |
| System prompt | Both | Task-specific context |

Hand-tuning these files is tedious. GEPA automates it by treating the entire instruction surface as optimizable.

Architecture

Components:
  1. GEPA — Proposes mutations to instruction artifacts
  2. Task App — Orchestrates rollouts, provisions sandboxes
  3. Daytona Sandbox — Isolated VM with agent pre-installed (~3s provisioning)
  4. Coding Agent — Executes task using optimized instructions
  5. Interceptor — Captures LLM traces for analysis
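The components above come together in a single rollout loop. Below is a minimal sketch with stubbed stand-ins; `provision_sandbox`, `run_agent`, and `score` are hypothetical names for illustration, not the demo's actual API:

```python
# Hypothetical sketch of one GEPA rollout. The function names and data
# shapes are illustrative stand-ins, not the synth-ai / Daytona API.

def provision_sandbox(snapshot: str) -> dict:
    # Stand-in for Daytona provisioning from a pre-built snapshot.
    return {"snapshot": snapshot, "files": {}}

def run_agent(sandbox: dict, instructions: dict) -> dict:
    # Stand-in for the coding agent executing with candidate instructions.
    sandbox["files"].update(instructions)
    return {"tests_passed": "AGENTS.md" in instructions}

def score(result: dict) -> float:
    # Deterministic evaluation, e.g. a test pass rate.
    return 1.0 if result["tests_passed"] else 0.0

def rollout(candidate: dict) -> float:
    # GEPA proposes `candidate`; the task app runs it and returns a score.
    sandbox = provision_sandbox("enginebench-base")
    result = run_agent(sandbox, candidate)
    return score(result)

print(rollout({"AGENTS.md": "# Agent Instructions"}))  # → 1.0
```

GEPA then uses these scores to select which instruction mutations survive into the next generation.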

Configuration

Define what’s mutable in your GEPA config:
enginebench_gepa.toml

```toml
[prompt_learning]
task_app_url = "http://localhost:8020"
algorithm = "gepa"

[prompt_learning.policy]
model = "gpt-4.1-mini"
provider = "openai"
inference_mode = "synth_hosted"

# Baseline system prompt (GEPA will evolve this)
[prompt_learning.policy.context_override]
system_prompt = """You are an expert Rust developer implementing Pokemon TCG cards.

Your task: Implement card effects by editing Rust files with stub functions.

Key patterns:
- Use `def_id_matches(&card.def_id, "DF", NUMBER)` to identify cards
- Implement attack modifiers in `attack_override`
- Use `game.queue_prompt()` for user choices

Output requirements:
1. ACTUALLY EDIT files - replace TODO stubs with working code
2. Make sure code compiles (`cargo check`)
3. Make sure tests pass (`cargo test`)"""

[prompt_learning.gepa]
env_name = "engine_bench"

[prompt_learning.gepa.evaluation]
seeds = [0, 2, 7, 8, 9, 16, 22, 29, 35, 42, 51, 68, 77, 88, 99]
validation_seeds = [0, 9, 22, 42, 77]

[prompt_learning.gepa.population]
initial_size = 8
num_generations = 5
children_per_generation = 8

[prompt_learning.gepa.rollout]
budget = 200
max_concurrent = 6
```

Unified Optimization (AGENTS.md + Skills)

For full instruction surface optimization, enable unified mode:
enginebench_gepa_unified.toml

```toml
[prompt_learning.gepa.unified_optimization]
enable_task_app_context_overrides = true
optimization_target = "unified"
mutable_files = ["AGENTS.md", ".codex/skills.yaml"]
allow_preflight_script = false
allow_env_vars = false

# Baseline AGENTS.md (GEPA will evolve this).
# Note: a sub-table is used here because TOML inline tables
# cannot span multiple lines.
[prompt_learning.policy.context_override.file_artifacts]
"AGENTS.md" = """# Agent Instructions

## Primary Objective
Implement Pokemon TCG card effects in Rust.

## Rules
1. Read existing code patterns first
2. Follow the engine architecture
3. Run tests after implementation
"""
```

What Gets Optimized

GEPA evolves the instruction content while preserving structure:
```python
# Baseline - hand-written instructions
baseline_system_prompt = """You are a Rust developer.
Implement the card effects."""

# After optimization - learned from successful rollouts
optimized_system_prompt = """You are an expert Rust developer implementing Pokemon TCG cards.

CRITICAL: Before writing any code:
1. Read the existing card implementations in src/effects/
2. Note the pattern: match on def_id, return AttackOverrides

Implementation checklist:
- [ ] Identify card by def_id_matches(&card.def_id, "DF", NUMBER)
- [ ] Handle Poke-Powers in power_effect()
- [ ] Handle attacks in attack_override()
- [ ] Use game.queue_prompt() for user choices
- [ ] Return AttackOverrides::default() for non-matching cards

After implementation:
- Run `cargo check` to verify compilation
- Run `cargo test` to verify behavior
- Fix any failing tests before finishing"""
```

Running with Daytona Sandboxes

EngineBench uses Daytona for isolated agent execution. Each rollout provisions a fresh sandbox from a pre-built snapshot (~3 seconds).
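The key property of this setup is that every rollout gets a fresh, disposable sandbox, so no state leaks between candidates. A sketch of that lifecycle as a context manager, with the provisioning and teardown calls replaced by stand-ins (the real demo uses `daytona_helper.py` for this):

```python
from contextlib import contextmanager

@contextmanager
def fresh_sandbox(snapshot: str):
    # Hypothetical lifecycle: provision from a pre-built snapshot,
    # always tear down afterwards, even if the rollout raises.
    sandbox = {"snapshot": snapshot, "alive": True}  # stand-in for provisioning
    try:
        yield sandbox
    finally:
        sandbox["alive"] = False                     # stand-in for deletion

with fresh_sandbox("enginebench-base") as sb:
    # Agent execution would happen here, against an isolated VM.
    assert sb["alive"]
```

The `try/finally` shape is what guarantees cleanup when an agent run times out or crashes mid-rollout.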

Prerequisites

```bash
# Environment variables
export OPENAI_API_KEY="sk-..."
export DAYTONA_API_KEY="dtn_..."
export SYNTH_API_KEY="sk_live_..."

# Install dependencies
cd synth-ai
uv sync
```

Local Mode (Development)

Start the local backend and run optimization:
```bash
# Terminal 1: Start local backend
cd monorepo
./scripts/run_backend_local.sh

# Terminal 2: Start interceptor tunnel (so Daytona can reach it)
cloudflared tunnel --url http://localhost:8000
# Note the URL: https://xxx-xxx-xxx.trycloudflare.com

# Terminal 3: Run GEPA
cd synth-ai
INTERCEPTOR_TUNNEL_URL=https://xxx-xxx-xxx.trycloudflare.com \
USE_DAYTONA_SANDBOXES=1 \
uv run python demos/engine_bench/run_gepa_unified.py --local
```

Production Mode

Use the production backend with a managed tunnel:
```bash
cd synth-ai
INTERCEPTOR_TUNNEL_URL=https://api.usesynth.ai \
USE_DAYTONA_SANDBOXES=1 \
uv run python demos/engine_bench/run_gepa_unified.py \
  --port 8001 \
  --task-app-url https://YOUR-SUBDOMAIN.usesynth.ai
```

Results

After optimization, expect improvements like:

| Metric | Baseline | Optimized |
| --- | --- | --- |
| Test pass rate | 45% | 78% |
| Compilation success | 72% | 94% |
| Average rollout time | 95s | 85s |

The optimized instructions are saved and can be exported for use in production.
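As a sanity check, the example numbers above correspond to these relative improvements:

```python
# Example metrics from the results table above.
baseline = {"test_pass": 0.45, "compile": 0.72, "rollout_s": 95}
optimized = {"test_pass": 0.78, "compile": 0.94, "rollout_s": 85}

# Relative improvement for each rate metric.
for k in ("test_pass", "compile"):
    gain = (optimized[k] - baseline[k]) / baseline[k]
    print(f"{k}: +{gain:.0%}")
# test_pass: +73%, compile: +31%

speedup = baseline["rollout_s"] - optimized["rollout_s"]
print(f"rollout: {speedup}s faster")  # rollout: 10s faster
```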

Supported Agents

| Agent | CLI | Instructions Source |
| --- | --- | --- |
| Codex | `codex exec` | AGENTS.md, `.codex/skills.yaml` |
| OpenCode | `opencode` | AGENTS.md, `.opencode/skills.yaml` |

Demo Files

View on GitHub

| File | Purpose |
| --- | --- |
| `run_gepa_unified.py` | Main entry point for GEPA optimization |
| `localapi_engine_bench.py` | Task app with Daytona sandbox support |
| `enginebench_gepa.toml` | Full GEPA config (15 instances, 5 generations) |
| `enginebench_gepa_minimal.toml` | Minimal config for quick testing |
| `daytona_helper.py` | Sandbox provisioning and agent execution |

Quick Start

```bash
# Clone and install
git clone https://github.com/synth-laboratories/synth-ai
cd synth-ai
uv sync

# Set credentials
export OPENAI_API_KEY="sk-..."
export DAYTONA_API_KEY="dtn_..."
export SYNTH_API_KEY="sk_live_..."

# Run minimal optimization (local backend)
USE_DAYTONA_SANDBOXES=1 \
uv run python demos/engine_bench/run_gepa_minimal.py --local
```

When to Use This Pattern

Good fit:
  • Coding agents with instruction files (AGENTS.md, skills)
  • Tasks with deterministic evaluation (tests, linters)
  • Long-running rollouts that benefit from sandboxes
  • Multiple instruction sources to co-optimize
Consider alternatives:
  • Simple prompts → Use standard GEPA
  • No sandbox needed → Use embedded task apps
  • Human evaluation → Use verifier optimization