GEPA can optimize the full instruction surface of coding agents: system prompts, AGENTS.md files, skill definitions, and other context artifacts. This cookbook shows how to set up agent optimization using the EngineBench benchmark with Daytona cloud sandboxes.

| Component | What Gets Optimized |
| --- | --- |
| System prompt | Core instructions passed to the LLM |
| AGENTS.md | Startup instructions for Codex/OpenCode |
| Skills files | `.codex/skills.yaml`, `.opencode/skills.yaml` |
| Context artifacts | Architecture guides, reference snippets |

The Challenge

Coding agents like Codex CLI and OpenCode read instructions from multiple sources:

| Source | Agent | Purpose |
| --- | --- | --- |
| AGENTS.md | Both | Project-specific instructions |
| `.codex/skills.yaml` | Codex | Reusable skill definitions |
| `.opencode/skills.yaml` | OpenCode | Reusable skill definitions |
| System prompt | Both | Task-specific context |

Hand-tuning these files is tedious. GEPA automates it by treating the entire instruction surface as optimizable.

Architecture

Components:
  1. GEPA — Proposes mutations to instruction artifacts
  2. Task App — Orchestrates rollouts, provisions sandboxes
  3. Daytona Sandbox — Isolated VM with agent pre-installed (~3s provisioning)
  4. Coding Agent — Executes task using optimized instructions
  5. Interceptor — Captures LLM traces for analysis
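The components above come together in a single rollout loop. Below is a minimal sketch with stubbed stand-ins; `provision_sandbox`, `run_agent`, and `score` are hypothetical names for illustration, not the demo's actual API:

```python
# Hypothetical sketch of one GEPA rollout. The function names and data
# shapes are illustrative stand-ins, not the synth-ai / Daytona API.

def provision_sandbox(snapshot: str) -> dict:
    # Stand-in for Daytona provisioning from a pre-built snapshot.
    return {"snapshot": snapshot, "files": {}}

def run_agent(sandbox: dict, instructions: dict) -> dict:
    # Stand-in for the coding agent executing with candidate instructions.
    sandbox["files"].update(instructions)
    return {"tests_passed": "AGENTS.md" in instructions}

def score(result: dict) -> float:
    # Deterministic evaluation, e.g. a test pass rate.
    return 1.0 if result["tests_passed"] else 0.0

def rollout(candidate: dict) -> float:
    # GEPA proposes `candidate`; the task app runs it and returns a score.
    sandbox = provision_sandbox("enginebench-base")
    result = run_agent(sandbox, candidate)
    return score(result)

print(rollout({"AGENTS.md": "# Agent Instructions"}))  # → 1.0
```

GEPA then uses these scores to select which instruction mutations survive into the next generation.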

Configuration

Define what’s mutable in your GEPA config:
enginebench_gepa.toml

```toml
[prompt_learning]
task_app_url = "http://localhost:8020"
algorithm = "gepa"

[prompt_learning.policy]
model = "gpt-4.1-mini"
provider = "openai"
inference_mode = "synth_hosted"

# Baseline system prompt (GEPA will evolve this)
[prompt_learning.policy.context_override]
system_prompt = """You are an expert Rust developer implementing Pokemon TCG cards.

Your task: Implement card effects by editing Rust files with stub functions.

Key patterns:
- Use `def_id_matches(&card.def_id, "DF", NUMBER)` to identify cards
- Implement attack modifiers in `attack_override`
- Use `game.queue_prompt()` for user choices

Output requirements:
1. ACTUALLY EDIT files - replace TODO stubs with working code
2. Make sure code compiles (`cargo check`)
3. Make sure tests pass (`cargo test`)"""

[prompt_learning.gepa]
env_name = "engine_bench"

[prompt_learning.gepa.evaluation]
seeds = [0, 2, 7, 8, 9, 16, 22, 29, 35, 42, 51, 68, 77, 88, 99]
validation_seeds = [0, 9, 22, 42, 77]

[prompt_learning.gepa.population]
initial_size = 8
num_generations = 5
children_per_generation = 8

[prompt_learning.gepa.rollout]
budget = 200
max_concurrent = 6
```

Unified Optimization (AGENTS.md + Skills)

For full instruction surface optimization, enable unified mode:
enginebench_gepa_unified.toml

```toml
[prompt_learning.gepa.unified_optimization]
enable_task_app_context_overrides = true
optimization_target = "unified"
mutable_files = ["AGENTS.md", ".codex/skills.yaml"]
allow_preflight_script = false
allow_env_vars = false

# Baseline AGENTS.md (GEPA will evolve this).
# Note: a sub-table is used here because TOML inline tables
# cannot span multiple lines.
[prompt_learning.policy.context_override.file_artifacts]
"AGENTS.md" = """# Agent Instructions

## Primary Objective
Implement Pokemon TCG card effects in Rust.

## Rules
1. Read existing code patterns first
2. Follow the engine architecture
3. Run tests after implementation
"""
```

What Gets Optimized

GEPA evolves the instruction content while preserving structure:
```python
# Baseline - hand-written instructions
baseline_system_prompt = """You are a Rust developer.
Implement the card effects."""

# After optimization - learned from successful rollouts
optimized_system_prompt = """You are an expert Rust developer implementing Pokemon TCG cards.

CRITICAL: Before writing any code:
1. Read the existing card implementations in src/effects/
2. Note the pattern: match on def_id, return AttackOverrides

Implementation checklist:
- [ ] Identify card by def_id_matches(&card.def_id, "DF", NUMBER)
- [ ] Handle Poke-Powers in power_effect()
- [ ] Handle attacks in attack_override()
- [ ] Use game.queue_prompt() for user choices
- [ ] Return AttackOverrides::default() for non-matching cards

After implementation:
- Run `cargo check` to verify compilation
- Run `cargo test` to verify behavior
- Fix any failing tests before finishing"""
```

Running with Daytona Sandboxes

EngineBench uses Daytona for isolated agent execution. Each rollout provisions a fresh sandbox from a pre-built snapshot (~3 seconds).
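The key property of this setup is that every rollout gets a fresh, disposable sandbox, so no state leaks between candidates. A sketch of that lifecycle as a context manager, with the provisioning and teardown calls replaced by stand-ins (the real demo uses `daytona_helper.py` for this):

```python
from contextlib import contextmanager

@contextmanager
def fresh_sandbox(snapshot: str):
    # Hypothetical lifecycle: provision from a pre-built snapshot,
    # always tear down afterwards, even if the rollout raises.
    sandbox = {"snapshot": snapshot, "alive": True}  # stand-in for provisioning
    try:
        yield sandbox
    finally:
        sandbox["alive"] = False                     # stand-in for deletion

with fresh_sandbox("enginebench-base") as sb:
    # Agent execution would happen here, against an isolated VM.
    assert sb["alive"]
```

The `try/finally` shape is what guarantees cleanup when an agent run times out or crashes mid-rollout.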

Prerequisites

```bash
# Environment variables
export OPENAI_API_KEY="sk-..."
export DAYTONA_API_KEY="dtn_..."
export SYNTH_API_KEY="sk_live_..."

# Install dependencies
cd synth-ai
uv sync
```

Local Mode (Development)

Start the local backend and run optimization:
```bash
# Terminal 1: Start local backend
cd monorepo
./scripts/run_backend_local.sh

# Terminal 2: Start interceptor tunnel (so Daytona can reach it)
cloudflared tunnel --url http://localhost:8000
# Note the URL: https://xxx-xxx-xxx.trycloudflare.com

# Terminal 3: Run GEPA
cd synth-ai
INTERCEPTOR_TUNNEL_URL=https://xxx-xxx-xxx.trycloudflare.com \
USE_DAYTONA_SANDBOXES=1 \
uv run python demos/engine_bench/run_gepa_unified.py --local
```

Production Mode

Use the production backend with a managed tunnel:
```bash
cd synth-ai
INTERCEPTOR_TUNNEL_URL=https://api.usesynth.ai \
USE_DAYTONA_SANDBOXES=1 \
uv run python demos/engine_bench/run_gepa_unified.py \
  --port 8001 \
  --task-app-url https://YOUR-SUBDOMAIN.usesynth.ai
```

Results

After optimization, expect improvements like:

| Metric | Baseline | Optimized |
| --- | --- | --- |
| Test pass rate | 45% | 78% |
| Compilation success | 72% | 94% |
| Average rollout time | 95s | 85s |

The optimized instructions are saved and can be exported for use in production.
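As a sanity check, the example numbers above correspond to these relative improvements:

```python
# Example metrics from the results table above.
baseline = {"test_pass": 0.45, "compile": 0.72, "rollout_s": 95}
optimized = {"test_pass": 0.78, "compile": 0.94, "rollout_s": 85}

# Relative improvement for each rate metric.
for k in ("test_pass", "compile"):
    gain = (optimized[k] - baseline[k]) / baseline[k]
    print(f"{k}: +{gain:.0%}")
# test_pass: +73%, compile: +31%

speedup = baseline["rollout_s"] - optimized["rollout_s"]
print(f"rollout: {speedup}s faster")  # rollout: 10s faster
```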

Supported Agents

| Agent | CLI | Instructions Source |
| --- | --- | --- |
| Codex | `codex exec` | AGENTS.md, `.codex/skills.yaml` |
| OpenCode | `opencode` | AGENTS.md, `.opencode/skills.yaml` |

Demo Files

View on GitHub

| File | Purpose |
| --- | --- |
| `run_gepa_unified.py` | Main entry point for GEPA optimization |
| `localapi_engine_bench.py` | Task app with Daytona sandbox support |
| `enginebench_gepa.toml` | Full GEPA config (15 instances, 5 generations) |
| `enginebench_gepa_minimal.toml` | Minimal config for quick testing |
| `daytona_helper.py` | Sandbox provisioning and agent execution |

Quick Start

```bash
# Clone and install
git clone https://github.com/synth-laboratories/synth-ai
cd synth-ai
uv sync

# Set credentials
export OPENAI_API_KEY="sk-..."
export DAYTONA_API_KEY="dtn_..."
export SYNTH_API_KEY="sk_live_..."

# Run minimal optimization (local backend)
USE_DAYTONA_SANDBOXES=1 \
uv run python demos/engine_bench/run_gepa_minimal.py --local
```

When to Use This Pattern

Good fit:
  • Coding agents with instruction files (AGENTS.md, skills)
  • Tasks with deterministic evaluation (tests, linters)
  • Long-running rollouts that benefit from sandboxes
  • Multiple instruction sources to co-optimize
Consider alternatives:
  • Simple prompts → Use standard GEPA
  • No sandbox needed → Use embedded task apps
  • Human evaluation → Use verifier optimization