GEPA (Genetic-Pareto) is a reflective evolutionary optimizer that improves your prompts through LLM-guided mutations and multi-objective (Pareto) selection. This walkthrough covers setup, configuration, launching a job, and retrieving your optimized prompts. Reference: Agrawal et al. (2025). “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.” arXiv:2507.19457

Why GEPA?

The GEPA paper reports that GEPA outperforms GRPO by 10% on average (up to 20%) while using up to 35x fewer rollouts. It is best suited for:
  • Classification tasks (Banking77, intent classification)
  • Multi-hop QA (HotpotQA)
  • Instruction-following tasks
  • When you want diverse prompt variants (Pareto front)
Typical results: 60-75% baseline accuracy → 85-90%+ after 15 generations

Prerequisites

Before starting, ensure you have:
# Required environment variables in .env
GROQ_API_KEY=gsk_...          # For policy model inference
SYNTH_API_KEY=sk_...          # For backend authentication
ENVIRONMENT_API_KEY=sk_env_... # Optional - auto-generated if not set
Install the Synth AI SDK:
pip install synth-ai

How GEPA Works

GEPA uses evolutionary principles to explore the prompt space. Understanding the algorithm helps you configure it effectively.

The Optimization Flow

  1. Initialize
    • Split seeds into pareto_seeds and feedback_seeds
    • Evaluate baseline transformation
    • Generate initial population via proposer
    • Evaluate & add to Pareto archive
  2. Evolve (for each generation)
    • For each child:
      • Select parent (instance-wise Pareto sampling)
      • Generate feedback from parent trace
      • Mutate via proposer (LLM-guided)
      • Minibatch gating (quick eval)
      • Full Pareto evaluation (if gating passed)
      • Update archive if non-dominated
  3. Terminate
    • Budget exhausted OR
    • Generation limit OR
    • No improvement for N generations
  4. Return best transformation, by accuracy
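The flow above can be sketched as a toy loop. This is a minimal illustration, not the Synth AI implementation: `split_seeds`, `run_sketch`, `evaluate`, and `propose` are hypothetical names, parent selection is uniform rather than instance-wise, and the archive keeps every child that passes gating instead of pruning dominated ones.

```python
import random
from statistics import mean

def split_seeds(seeds, feedback_fraction=0.3):
    """Split seeds into (pareto_seeds, feedback_seeds). Hypothetical helper."""
    n_feedback = max(1, round(len(seeds) * feedback_fraction))
    return seeds[n_feedback:], seeds[:n_feedback]

def run_sketch(baseline, seeds, evaluate, propose, num_generations=5,
               children_per_generation=3, minibatch_size=2,
               rollout_budget=200, patience=2, rng=None):
    """Toy evolutionary loop; returns (best_prompt, rollouts_used)."""
    rng = rng or random.Random(0)
    pareto_seeds, feedback_seeds = split_seeds(seeds)

    # 1. Initialize: score the baseline on every Pareto seed.
    archive = {baseline: {s: evaluate(baseline, s) for s in pareto_seeds}}
    used = len(pareto_seeds)
    best_score, stale = mean(archive[baseline].values()), 0

    # 2. Evolve.
    for _ in range(num_generations):
        improved = False
        for _ in range(children_per_generation):
            # 3. Terminate early if a full child evaluation would exceed the budget.
            if used + len(pareto_seeds) > rollout_budget:
                best = max(archive, key=lambda p: mean(archive[p].values()))
                return best, used
            parent = rng.choice(list(archive))        # simplification: uniform choice
            child = propose(parent, feedback_seeds)   # stands in for the LLM proposer

            # Minibatch gating: compare child and parent on the same few seeds.
            batch = rng.sample(pareto_seeds, minibatch_size)
            child_scores = {s: evaluate(child, s) for s in batch}
            used += len(batch)
            if sum(child_scores.values()) < sum(archive[parent][s] for s in batch):
                continue  # child is worse: skip the full evaluation

            # Full evaluation on the remaining Pareto seeds.
            rest = [s for s in pareto_seeds if s not in child_scores]
            child_scores.update({s: evaluate(child, s) for s in rest})
            used += len(rest)
            archive[child] = child_scores
            if mean(child_scores.values()) > best_score:
                best_score, improved = mean(child_scores.values()), True

        stale = 0 if improved else stale + 1
        if stale >= patience:
            break  # no improvement for `patience` generations

    # 4. Return the best prompt by mean accuracy.
    return max(archive, key=lambda p: mean(archive[p].values())), used
```

Here `evaluate(prompt, seed)` is assumed to return a numeric score and `propose(parent, feedback_seeds)` a new prompt string; in the real system both involve rollouts against your Container and calls to the proposer model.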

Key Components

1. Pattern-Based Transformations

GEPA represents prompt changes as transformations that can be applied to your baseline:
# A transformation replaces text in your prompt
TextTransformation(
    old_text="You are a helpful assistant.",      # Original text
    new_text="You are a banking classification expert...",  # Optimized text
    apply_to_role="system"  # Only apply to system messages
)
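To make the replacement semantics concrete, here is a minimal sketch of applying such a transformation to a list of chat messages. `SketchTransformation` and `apply_transformation` are hypothetical stand-ins that mirror the fields shown above; they are not the SDK's own classes, and the SDK may apply transformations differently.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SketchTransformation:
    """Hypothetical stand-in mirroring the fields shown above."""
    old_text: str
    new_text: str
    apply_to_role: Optional[str] = None  # None means "apply to every role"

def apply_transformation(messages, transformation):
    """Return a new message list with the text replaced; the input is not mutated."""
    result = []
    for message in messages:
        content = message["content"]
        if transformation.apply_to_role in (None, message["role"]):
            content = content.replace(transformation.old_text, transformation.new_text)
        result.append({**message, "content": content})
    return result

baseline = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Customer Query: where is my card?"},
]
optimized = apply_transformation(
    baseline,
    SketchTransformation(
        old_text="You are a helpful assistant.",
        new_text="You are a banking classification expert.",
        apply_to_role="system",
    ),
)
```

Only the system message changes; the user message and the original `baseline` list are left untouched.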

2. Pareto Archive

GEPA maintains a Pareto front of non-dominated solutions, balancing multiple objectives:
  • Accuracy (primary) – Task performance
  • Tool call rate – Function calling frequency (for agentic tasks)
Solutions are kept if they’re not dominated by any other solution across all objectives.
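The dominance rule can be sketched in a few lines. This is an illustration under two assumptions: objectives are tuples such as `(accuracy, tool_call_rate)`, and higher is better for every objective. The function names are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def update_archive(archive, name, objectives):
    """Return a new archive with the candidate added if it is non-dominated."""
    if any(dominates(existing, objectives) for existing in archive.values()):
        return archive  # candidate is dominated: archive unchanged
    # Drop members that the candidate dominates, then add the candidate.
    kept = {n: o for n, o in archive.items() if not dominates(objectives, o)}
    kept[name] = objectives
    return kept

archive = {}
archive = update_archive(archive, "a", (0.80, 0.5))
archive = update_archive(archive, "b", (0.70, 0.9))   # trades accuracy for tool use
archive = update_archive(archive, "c", (0.60, 0.4))   # dominated by "a", rejected
```

After these three updates the archive holds `a` and `b`: neither beats the other on both objectives, so both stay on the front.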

3. Instance-Wise Parent Selection

Unlike traditional selection that uses aggregate scores, GEPA counts how many individual seeds each prompt “wins” on:
# Parent selection weights prompts by per-seed wins
wins = count_seeds_where_prompt_is_best(prompt)
selection_weight = (wins + ε) ** selection_pressure
This favors prompts that excel on specific example types, encouraging specialization.
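A runnable version of the pseudocode above might look like the following sketch. It assumes per-seed scores are available as a nested dict and that a tie for the best score counts as a win for every tied prompt; the tie rule, the default `epsilon`, and all function names are assumptions.

```python
import random

def count_wins(scores):
    """scores maps prompt -> {seed: score}. Ties for the best score all count as wins."""
    seeds = next(iter(scores.values())).keys()
    wins = {prompt: 0 for prompt in scores}
    for seed in seeds:
        top = max(per_seed[seed] for per_seed in scores.values())
        for prompt, per_seed in scores.items():
            if per_seed[seed] == top:
                wins[prompt] += 1
    return wins

def selection_weights(wins, selection_pressure=1.0, epsilon=1e-6):
    """Weight each prompt by (wins + epsilon) ** selection_pressure."""
    return {p: (w + epsilon) ** selection_pressure for p, w in wins.items()}

def select_parent(scores, rng, selection_pressure=1.0):
    """Sample one parent with probability proportional to its weight."""
    weights = selection_weights(count_wins(scores), selection_pressure)
    prompts = list(weights)
    return rng.choices(prompts, weights=[weights[p] for p in prompts], k=1)[0]

scores = {
    "A": {1: 1.0, 2: 0.0, 3: 1.0},
    "B": {1: 0.0, 2: 1.0, 3: 1.0},
    "C": {1: 0.0, 2: 0.0, 3: 0.0},
}
parent = select_parent(scores, random.Random(0))
```

Prompts A and B each win two seeds and C wins none, so C keeps only the tiny `epsilon` weight. Raising `selection_pressure` above 1 widens the gap between frequent and rare winners.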

4. LLM-Guided Mutations

The proposer (meta-model) generates new prompts by analyzing:
  • Current instruction (baseline)
  • Rollout examples (input/output/feedback for each seed)
  • Trace feedback (e.g., “model under-utilizes tools”)
  • Dataset and program context
The proposer structures its output using an instruction typology: input descriptions, core task, premises, heuristics, constraints, rules, and output descriptions.

Step 1: Create a Container

Your Container evaluates prompts by running rollouts and returning scores. See Container Guide for details. Example Banking77 Container structure:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RolloutRequest(BaseModel):
    seed: int
    trace_correlation_id: str
    # ... other fields

@app.post("/rollout")
async def rollout(request: RolloutRequest):
    # 1. Load example for this seed (load_example is your own dataset helper)
    example = load_example(request.seed)

    # 2. Call your LLM with the prompt (interceptor handles substitution;
    #    call_llm is your own async inference helper)
    prediction = await call_llm(example.query)

    # 3. Score the prediction
    correct = prediction == example.expected_label

    return {
        "metrics": {"correct": correct},
        "outcome": 1.0 if correct else 0.0
    }

Step 2: Deploy Your Container

The Synth AI backend needs to reach your Container over the internet to send rollout requests. Use the Python SDK’s InProcessContainer for seamless deployment with automatic tunneling:
import asyncio
from synth_ai.sdk import InProcessContainer

# Import your container
from my_container import app

async def main():
    # `async with` must run inside a coroutine, not at module level
    async with InProcessContainer(app=app) as container:
        print(f"Container running at: {container.url}")
        # container.url is a SynthTunnel URL like https://st.usesynth.ai/s/rt_...

asyncio.run(main())
This:
  1. Starts your Container locally
  2. Creates a SynthTunnel URL (or another tunnel backend if configured)
  3. Returns the tunnel URL via container.url

Verify the Deployment

Check that your Container is accessible:
curl https://st.usesynth.ai/s/rt_.../health

Alternative Tunnel Backends

SynthTunnel is the default and recommended backend. If you need a Cloudflare tunnel instead:
# Run these inside an async function; `async with` is not valid at module level.
# Cloudflare quick tunnel (ephemeral, requires cloudflared binary)
async with InProcessContainer(
    app=app,
    tunnel_backend="cloudflare_quick",
) as container:
    print(f"URL: {container.url}")

# Cloudflare managed lease (stable hostname)
async with InProcessContainer(
    app=app,
    tunnel_backend="cloudflare_managed_lease",
) as container:
    print(f"Stable URL: {container.url}")

Step 3: Create the Configuration

Create a TOML file defining your optimization parameters. The container_url should match the URL from Step 2 (stored in your .env as CONTAINER_URL):
[prompt_learning]
algorithm = "gepa"
container_url = "https://my-company.usesynth.ai"  # From CONTAINER_URL in .env
container_id = "banking77"

# Training seeds (used during optimization)
evaluation_seeds = [50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]

# Validation seeds (held-out for final evaluation)
validation_seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]

# Initial prompt template
[prompt_learning.initial_prompt]
messages = [
  { role = "system", content = "You are a banking intent classification assistant." },
  { role = "user", pattern = "Customer Query: {query}\n\nClassify this query into one of 77 banking intents." }
]

# GEPA-specific configuration
[prompt_learning.gepa]
num_generations = 15              # Evolutionary cycles to run
children_per_generation = 5       # Mutations per generation
pareto_set_size = 20              # Seeds for Pareto evaluation
minibatch_size = 3                # Seeds for quick gating
rollout_budget = 1000             # Total rollouts allowed
archive_size = 64                 # Max Pareto archive size

Configuration Parameters

| Parameter | Description | Default | Recommended Range |
| --- | --- | --- | --- |
| num_generations | Evolutionary cycles | 10 | 5-20 |
| children_per_generation | Mutations per generation | 5 | 3-10 |
| pareto_set_size | Seeds for Pareto evaluation | 20 | 15-30 |
| minibatch_size | Seeds for gating evaluation | 3 | 2-5 |
| rollout_budget | Total rollouts allowed | 1000 | 200-2000 |
| archive_size | Max Pareto archive size | 64 | 32-128 |
| feedback_fraction | Fraction of seeds for feedback | 0.3 | 0.2-0.5 |
| proposer_mode | Proposer type (synth, gepa-ai, dspy) | synth | - |
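When choosing rollout_budget, a rough worst-case estimate helps. The sketch below assumes each child costs minibatch_size rollouts for gating plus pareto_set_size rollouts for full evaluation, and that the baseline costs one pass over the Pareto set. How the backend actually counts rollouts is not documented here, so treat this cost model and the function name as assumptions.

```python
def worst_case_rollouts(num_generations, children_per_generation,
                        pareto_set_size, minibatch_size):
    """Upper bound on rollouts if every child passes gating (assumed cost model)."""
    per_child = minibatch_size + pareto_set_size
    baseline = pareto_set_size
    return baseline + num_generations * children_per_generation * per_child

# Values from the example configuration above
estimate = worst_case_rollouts(
    num_generations=15,
    children_per_generation=5,
    pareto_set_size=20,
    minibatch_size=3,
)
print(estimate)  # 1745
```

Under these assumptions the example configuration could exhaust rollout_budget = 1000 before generation 15. Children rejected by gating skip the full evaluation, so actual usage will usually be lower than this bound.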

Step 4: Launch the Optimization Job

import asyncio
import os
# Assumption: SynthClient is importable from synth_ai.sdk. The original snippet
# imported only OfflineJob, so check the import path for your SDK version.
from synth_ai.sdk import SynthClient

async def run_optimization():
    client = SynthClient(api_key=os.environ["SYNTH_API_KEY"])

    # Create job from TOML config
    job = await client.create_job_from_toml("configs/prompt_learning/banking77_gepa.toml")
    print(f"Created job: {job['id']}")

    # Start the job
    await client.start_job(job["id"])

    # Poll until completion
    result = await client.poll_until_terminal(job["id"])
    print(f"Best score: {result['best_score']}")
    return result

asyncio.run(run_optimization())
The SDK will:
  1. Validate your TOML configuration
  2. Verify the container is reachable
  3. Submit the job to Synth AI
  4. Poll for completion

Understanding the Output

During optimization, you’ll see progress updates:
[18:35:37]    0.0s  Status: running
[18:35:42]    5.2s  Status: running | Best: 0.500
[18:35:48]   11.4s  Status: running | Best: 0.625
[18:35:54]   17.6s  Status: running | Best: 0.750
[18:36:00]   23.8s  Status: running | Best: 0.875
...
[18:38:50]  175.9s  Status: succeeded | Best: 0.875
Your Container logs will show rollout requests:
[CONTAINER] INBOUND_ROLLOUT: trace_correlation_id=prompt-learning-74-5bec8a6f seed=74
[CONTAINER] PREDICTION: expected=card_arrival predicted=card_delivery_estimate correct=False
[BANKING77_ROLLOUT] trace_correlation_id=prompt-learning-74-5bec8a6f reward=0.0

Step 5: Understanding the Optimization Process

Generation-by-Generation Progress

| Generation | What Happens | Expected Accuracy |
| --- | --- | --- |
| 0 (baseline) | Evaluate initial prompt | 60-75% |
| 1-3 | Explore diverse mutations | 70-80% |
| 5-10 | Convergence begins | 80-85% |
| 10-15 | Fine-tuning best solutions | 85-90%+ |

How Mutations Are Generated

The proposer receives:
  1. Baseline instruction: Your current system prompt
  2. Rollout examples: Input/output pairs with feedback (correct/incorrect, error messages)
  3. Trace statistics: Tool call rate, trajectory length, etc.
  4. Feedback hints: Rule-based suggestions like “model under-utilizes tools”
It generates a new instruction that follows the instruction typology:
[Input Description]
You will be given a customer banking query.

[Core Task Description]
Your task is to classify the query into one of 77 banking intents.

[Premises]
Banking queries often contain domain-specific terminology.
Multiple intents may seem applicable; choose the most specific.

[Heuristics]
Look for keywords indicating the customer's primary need.
Consider the emotional tone to distinguish complaints from inquiries.

[Constraints]
Avoid defaulting to generic intents when specific ones apply.

[Rules]
Output only the intent name, nothing else.

[Output Description]
Return exactly one intent from the predefined list.

Minibatch Gating

Before full evaluation, GEPA performs a quick check:
  1. Evaluate child on a small minibatch (minibatch_size seeds; 3 in the example configuration)
  2. Compare to parent’s score on the same seeds
  3. If child is worse → skip full evaluation (saves budget)
  4. If child is promising → proceed to full Pareto evaluation
This saves significant compute by filtering out poor mutations early.
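The gating decision itself is a small comparison. The sketch below assumes scores are summed over the minibatch and that a tie counts as promising, since the text only rules out children that are worse; the function name is hypothetical.

```python
def passes_gate(child_scores, parent_scores):
    """Both arguments map seed -> score and must cover the same minibatch seeds."""
    if child_scores.keys() != parent_scores.keys():
        raise ValueError("child and parent must be scored on the same seeds")
    # Assumption: ties pass, only a strictly worse child is filtered out.
    return sum(child_scores.values()) >= sum(parent_scores.values())

parent = {50: 1.0, 51: 0.0, 52: 0.0}
better_child = {50: 1.0, 51: 0.0, 52: 1.0}
worse_child = {50: 0.0, 51: 0.0, 52: 0.0}

run_full_eval = passes_gate(better_child, parent)   # proceeds to Pareto evaluation
skip_full_eval = not passes_gate(worse_child, parent)  # filtered out early
```

With pareto_set_size = 20 and minibatch_size = 3, each rejected child costs 3 rollouts instead of 23 under this scheme.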

Step 6: Retrieve Optimized Prompts

After completion, fetch your results using the Python SDK:
import os
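from synth_ai.sdk import OfflineJob
# get_prompts, get_prompt_text, and get_scoring_summary are not imported in this
# snippet; import them from the module documented for your synth-ai version.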

API_KEY = os.environ["SYNTH_API_KEY"]
JOB_ID = "pl_abc123"  # From the job submission output

# Get all results
results = get_prompts(job_id=JOB_ID, api_key=API_KEY)
print(f"Best Score: {results['best_score']:.3f}")

# Get top 5 prompts from Pareto front
for rank in range(1, 6):
    prompt = get_prompt_text(job_id=JOB_ID, api_key=API_KEY, rank=rank)
    print(f"Rank {rank}: {len(prompt)} chars")
    print(prompt[:200] + "...")

# Get scoring summary
summary = get_scoring_summary(job_id=JOB_ID, api_key=API_KEY)
print(f"Train={summary['best_train_accuracy']:.3f}")
print(f"Validation={summary.get('best_validation_accuracy', 0.0):.3f}")
print(f"Candidates Tried={summary['num_candidates_tried']}")

Understanding the Pareto Front

GEPA returns multiple prompts representing different trade-offs:
| Rank | Accuracy | Token Count | Trade-off |
| --- | --- | --- | --- |
| 1 | 92% | 450 | Highest accuracy |
| 2 | 90% | 280 | Good accuracy, shorter |
| 3 | 88% | 150 | Efficient, still performant |
Choose based on your latency/cost requirements.
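One way to make that choice programmatic is to pick the most accurate prompt that fits a token budget. This is a sketch only: the dict layout and `pick_prompt` are hypothetical, and the numbers are the illustrative values from the table above.

```python
def pick_prompt(front, max_tokens):
    """Return the most accurate candidate within max_tokens, or None if none fits."""
    eligible = [c for c in front if c["tokens"] <= max_tokens]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["accuracy"])

front = [
    {"rank": 1, "accuracy": 0.92, "tokens": 450},
    {"rank": 2, "accuracy": 0.90, "tokens": 280},
    {"rank": 3, "accuracy": 0.88, "tokens": 150},
]

choice = pick_prompt(front, max_tokens=300)
print(choice["rank"])  # 2
```

With a 300-token budget the rank 1 prompt is excluded, so the rank 2 prompt is selected at a cost of two accuracy points.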

Step 7: Use the Optimized Prompt

Replace your baseline prompt with the optimized version:
from openai import AsyncOpenAI

# Before: baseline prompt
system_prompt = "You are a banking intent classification assistant."

# After: optimized prompt (rank 1 from GEPA)
system_prompt = get_prompt_text(job_id=JOB_ID, api_key=API_KEY, rank=1)

# Use in your application (inside an async function)
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Customer Query: {query}"}
    ]
)

In-Process Optimization

For development and testing, run everything from a single Python script:
import asyncio
import os
from synth_ai.sdk import InProcessContainer
# Assumption: SynthClient is importable from synth_ai.sdk. The original snippet
# imported only OfflineJob, so check the import path for your SDK version.
from synth_ai.sdk import SynthClient

async def main():
    # Start Container in-process (handles tunneling automatically)
    async with InProcessContainer(app=my_container) as container:
        # Container is now accessible via tunnel; my_config should use this
        # URL as its container_url
        container_url = container.url

        # Submit optimization job
        client = SynthClient(api_key=os.environ["SYNTH_API_KEY"])
        job = await client.create_job(config=my_config)
        await client.start_job(job["id"])

        # Poll until complete
        result = await client.poll_until_terminal(job["id"])
        print(f"Best score: {result['best_score']}")

asyncio.run(main())
See In-Process Container Walkthrough for a complete example.

Termination Conditions

GEPA stops when any condition is met:
| Condition | Description | Configuration |
| --- | --- | --- |
| rollout_budget | Total rollouts exhausted | rollout_budget = 1000 |
| max_spend_usd | USD budget limit | max_spend_usd = 5.0 |
| num_generations | Generation limit reached | num_generations = 15 |
| patience_generations | No improvement for N generations | patience_generations = 5 |
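The "any condition" rule can be sketched as a single check over the optimizer state. The state field names, the order in which conditions are tested, and `stop_reason` itself are assumptions made for illustration; only the configuration keys come from the table above.

```python
def stop_reason(state, config):
    """Return the name of the first satisfied stop condition, or None to continue."""
    checks = [
        ("rollout_budget", "rollouts_used"),
        ("max_spend_usd", "spend_usd"),
        ("num_generations", "generation"),
        ("patience_generations", "generations_without_improvement"),
    ]
    for limit_key, state_key in checks:
        limit = config.get(limit_key)  # unset limits are skipped
        if limit is not None and state[state_key] >= limit:
            return limit_key
    return None

config = {
    "rollout_budget": 1000,
    "max_spend_usd": 5.0,
    "num_generations": 15,
    "patience_generations": 5,
}
state = {
    "rollouts_used": 400,
    "spend_usd": 1.2,
    "generation": 6,
    "generations_without_improvement": 5,
}
print(stop_reason(state, config))  # patience_generations
```

In this example the run stops at generation 6 because five generations passed without improvement, even though most of the rollout and spend budgets remain.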

Supported Models

See Supported Models for Prompt Optimization for the full list of policy models.