Dataset
Banking77 is a dataset of 13,083 customer queries labeled with 77 banking-related intents, such as:

| Query | Intent |
|---|---|
| “I am still waiting on my card?” | card_arrival |
| “When will my card arrive?” | card_delivery_estimate |
| “I think my card was stolen” | lost_or_stolen_card |
- `card_arrival` - checking on a card that should have arrived
- `card_delivery_estimate` - asking about delivery timeline
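For reference, Banking77 is available through the Hugging Face `datasets` library; a quick way to inspect the label space (split and field names follow the public dataset card):

```python
# Inspect Banking77 with the Hugging Face `datasets` library
# (pip install datasets). Each row has "text" and "label" fields;
# label names come from the ClassLabel feature.
from datasets import load_dataset

ds = load_dataset("banking77", split="train")
intents = ds.features["label"].names        # the 77 intent strings

print(len(intents))                         # 77
example = ds[0]
print(example["text"], "->", intents[example["label"]])
```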
Baseline accuracy: 43.5%
Baseline Seed Performance
| Seed | Query | Baseline Output | Correct |
|---|---|---|---|
| 40 | “I need to find out where the card is that I ordered.” | card_arrival | ✓ |
| 34 | “Where is the card I ordered 2 weeks ago?” | card_arrival | ✓ |
| 56 | “How long does it take for me to get my new card?” | card_delivery_estimate | ✗ |
| 114 | “I still haven’t gotten my new card. When will it get here?” | card_arrival | ✓ |
| 93 | “WHAT IS THE SOLUTION OF THIS PROBLEM” | card_arrival | ✓ |
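The baseline prompt itself isn't reproduced in this post; the sketch below shows the general shape of the zero-shot classification call being scored, using the OpenAI Python SDK. The system prompt and `INTENTS` list are illustrative, not the actual baseline:

```python
# Minimal sketch of a zero-shot intent classification call.
from openai import OpenAI

client = OpenAI()
INTENTS = ["card_arrival", "card_delivery_estimate", "lost_or_stolen_card"]  # + 74 more

def classify(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "system",
             "content": "Classify the customer query into exactly one of these "
                        "intents and output only the label: " + ", ".join(INTENTS)},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content.strip()
```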
Phase 1: Initial Population
Twelve candidate prompts were generated. Their accuracies ranged from 2.5% to 65%:

| Candidate | Accuracy | What Happened |
|---|---|---|
| c9 | 65% | Best initial - clear input/output structure |
| c5 | 62.5% | Strong - explicit field descriptions |
| c0, c11 | 57.5% | Good - but introduced regressions |
| c8 | 55% | Decent - two-input structure |
| c7 | 47.5% | Mixed - verbose descriptions |
| c2 | 40% | Prompt leakage failures |
| c4 | 37.5% | Lowercase field names |
| c1 | 12.5% | Query echoing failures |
| c3 | 7.5% | Numbered lists misinterpreted |
| c10 | 5% | Inverted input priority |
| c6 | 2.5% | Near-total structural confusion |
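For concreteness, the accuracy figures above amount to exact-match scoring over the training seeds; a minimal sketch, where `run_prompt` is a placeholder for executing a candidate prompt against the model (for instance, the `classify()` sketch above with the candidate text as the system prompt):

```python
from typing import Callable

# Exact-match accuracy of one candidate prompt over a set of seeds.
# `run_prompt(prompt, query)` is a stand-in, not a real API.
def accuracy(prompt: str, seeds: list[dict],
             run_prompt: Callable[[str, str], str]) -> float:
    correct = sum(
        run_prompt(prompt, seed["query"]) == seed["intent"]
        for seed in seeds
    )
    return correct / len(seeds)
```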
Seed-Level Comparison: What Went Wrong
Seed 34: “Where is the card I ordered 2 weeks ago?” (Expected: `card_arrival`)
| Candidate | Output | Result |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_delivery_estimate | ✗ regression |
| c1 | "Where is the card I ordered 2 weeks ago?" | ✗ echoed query |
| c2 | "Where is the card I ordered 2 weeks ago?" | ✗ echoed query |
| trans_00013 | card_arrival | ✓ fixed |
Seed 93: “WHAT IS THE SOLUTION OF THIS PROBLEM” (Expected: `card_arrival`)
| Candidate | Output | Result |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_not_working | ✗ regression |
| c1 | "WHAT IS THE SOLUTION OF THIS PROBLEM" | ✗ echoed query |
| c2 | "what_is_the_solution_of_this_problem" | ✗ prompt leakage |
| trans_00013 | card_arrival | ✓ fixed |
c2's snake_cased output mimics the `customer_query` field name from its prompt structure.
Seed 114: “I still haven’t gotten my new card. When will it get here?” (Expected: `card_arrival`)
| Candidate | Output | Result |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_delivery_estimate | ✗ regression |
| c1 | "Customer" | ✗ prompt fragment |
| trans_00013 | card_arrival | ✓ fixed |
"Customer" - a fragment of the prompt’s Customer Query field name. This indicates the model was confused about what to output.
Failure Mode Analysis
The seed data reveals three distinct failure modes:

1. Intent Confusion (c0, c4, c7) - the model outputs a valid intent, but the wrong one
   - Example: `card_delivery_estimate` instead of `card_arrival`
   - Cause: prompt structure didn't disambiguate similar intents
2. Query Echoing (c1, c2) - the model repeats the input query instead of classifying it
   - Example: `"Where is the card I ordered 2 weeks ago?"`
   - Cause: prompt structure confused input vs. output expectations
3. Prompt Leakage (c1, c2) - the model outputs fragments of the prompt template
   - Examples: `"customer_query"`, `"{customer_query}"`, `"Customer"`
   - Cause: field name descriptions leaked into the output space
Phase 2: Evolution
trans_00013 descended from trans_00010, inheriting successful patterns while fixing failure modes.
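The evolution loop itself isn't shown in this post. Under the assumption of a standard GEPA-style search (LLM-driven prompt mutation plus Pareto selection), it looks roughly like the sketch below; the helpers are trivial stand-ins, not Synth AI's actual implementation:

```python
import random

# High-level sketch of the search loop, using the run parameters
# listed under "Full Configuration" below.
def mutate(prompt: str) -> str:
    return prompt + "\n# mutated"              # stand-in for LLM-driven rewriting

def evaluate(prompt: str, seeds) -> float:
    return random.random()                     # stand-in for accuracy scoring

def pareto_front(scored, max_size: int):
    scored.sort(key=lambda cs: cs[1], reverse=True)
    return [c for c, _ in scored[:max_size]]   # simplification of Pareto selection

population = [f"prompt c{i}" for i in range(12)]    # the 12 Phase 1 candidates
train_seeds = range(120)                            # Training seeds: 120
for generation in range(20):                        # Generations: 20
    parents = random.sample(population, k=6)        # Children per generation: 6
    children = [mutate(p) for p in parents]
    scored = [(c, evaluate(c, train_seeds)) for c in population + children]
    population = pareto_front(scored, max_size=40)  # Pareto set size: 40
```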
What trans_00013 Fixed
| Seed | Query | c0 | trans_00013 |
|---|---|---|---|
| 34 | “Where is the card I ordered 2 weeks ago?” | ✗ card_delivery_estimate | ✓ card_arrival |
| 93 | “WHAT IS THE SOLUTION OF THIS PROBLEM” | ✗ card_not_working | ✓ card_arrival |
| 114 | “I still haven’t gotten my new card. When will it get here?” | ✗ card_delivery_estimate | ✓ card_arrival |
trans_00013 recovered the baseline’s correct predictions while maintaining the structural improvements that helped on other seeds.
The Winning Transformation
- Brackets: `[1. Input Description]` creates clear section boundaries
- `Customer Query` matches the user message format exactly (no `customer_query` mismatch)
- "describing a banking need" frames the task more precisely than "asking about an issue"
- "complete list of 77" explicitly scopes the intent space
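trans_00013's full text isn't reproduced here, but based on the bullets above, its structure plausibly looks something like this (a reconstruction, not the verbatim prompt):

```text
[1. Input Description]
Customer Query: a message from a customer describing a banking need.

[2. Output Format]
Respond with exactly one intent label from the complete list of 77:
card_arrival, card_delivery_estimate, lost_or_stolen_card, ...
```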
The Sticking Point
Seed 56: “How long does it take for me to get my new card?” (Expected: `card_arrival`)
| Candidate | Output | Correct |
|---|---|---|
| Baseline | card_delivery_estimate | ✗ |
| c0 | card_delivery_estimate | ✗ |
| c9 | card_delivery_estimate | ✗ |
| trans_00013 | card_delivery_estimate | ✗ |
Every candidate, including trans_00013, got this wrong: the query reads like a delivery timing question, yet the dataset labels it `card_arrival`. This points to a labeling ambiguity in the dataset itself.
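One way to surface such cases automatically is to flag seeds where every candidate agrees on the same non-gold label, as they do here. A small sketch, where `predictions[seed_id]` maps candidate name to predicted intent:

```python
# Flag seeds that look like label noise: all candidates agree,
# but the shared answer disagrees with the gold label (e.g. seed 56).
def suspicious_seeds(predictions: dict, gold: dict) -> list:
    flagged = []
    for seed_id, preds in predictions.items():
        outputs = set(preds.values())
        if len(outputs) == 1 and outputs != {gold[seed_id]}:
            flagged.append(seed_id)
    return flagged
```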
Phase 3: Validation
The top 10 candidates were validated on 200 held-out seeds (120-319):

| Metric | trans_00013 |
|---|---|
| Train (120 seeds) | 82.5% |
| Val (200 seeds) | 64% |
| Drop | -18.5 pp |
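In code, the validation step is just the same accuracy metric applied to unseen seeds, with the train-to-val drop as an overfitting signal. A sketch reusing the `accuracy` helper from Phase 1; `best_prompt`, `train_seeds`, `val_seeds`, and `run_prompt` are placeholders from the earlier sketches:

```python
# Generalization check on held-out seeds 120-319.
train_acc = accuracy(best_prompt, train_seeds, run_prompt)   # 0.825 in this run
val_acc = accuracy(best_prompt, val_seeds, run_prompt)       # 0.640 in this run
drop_pp = (train_acc - val_acc) * 100                        # 18.5 percentage points
```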
Phase 4: Pareto Selection
Net Improvement
| Metric | Baseline | trans_00013 | Δ |
|---|---|---|---|
| Train | 43.5% | 82.5% | +39 pp |
| Val | ~43% | 64% | +21 pp |
- 17 disagreement seeds between c9 and the baseline
- c9 won 13, the baseline won 4
- Net: +9 seeds flipped to correct
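The flip accounting behind those numbers is straightforward; a sketch that reproduces the 13-vs-4 tally, where `preds_base` and `preds_new` map seed id to predicted intent:

```python
# Count net flips between two candidates on the disagreement seeds.
def net_flips(preds_base: dict, preds_new: dict, gold: dict) -> int:
    base_wins = new_wins = 0
    for seed_id, label in gold.items():
        base_ok = preds_base[seed_id] == label
        new_ok = preds_new[seed_id] == label
        if new_ok and not base_ok:
            new_wins += 1        # 13 here
        elif base_ok and not new_ok:
            base_wins += 1       # 4 here
    return new_wins - base_wins  # net +9
```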
What GEPA Learned
- Bracketed sections: `[1. Input Description]` over `1. Input Description`
- Exact field name matching: `Customer Query`, not `customer_query`
- Precise task framing: "describing a banking need" over "asking about an issue"
- Explicit output boundaries: a `[2. Output Format]` section prevents prompt leakage
About the run
Minimal Config
Only 4 fields are required to reproduce this run.
Full Configuration
- Model: gpt-4.1-nano (detected from container)
- Initial population: 12
- Generations: 20
- Children per generation: 6
- Validation top K: 10
- Pareto set size: 40
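For reference, the same parameters as a plain config dict; the key names here are illustrative, not Synth AI's actual schema:

```python
# Run parameters from above; field names are hypothetical.
config = {
    "model": "gpt-4.1-nano",
    "initial_population": 12,
    "generations": 20,
    "children_per_generation": 6,
    "validation_top_k": 10,
    "pareto_set_size": 40,
}
```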
Metrics
- Runtime: 5m 54s
- Cost: $0.47
- Training seeds: 120
- Validation seeds: 200
Run This Yourself with Synth AI
Optimize your model’s prompts for free with Synth AI