Dataset
Banking77 is a dataset of 13,083 customer queries labeled with 77 banking-related intents, such as:

| Query | Intent |
|---|---|
| “I am still waiting on my card?” | card_arrival |
| “When will my card arrive?” | card_delivery_estimate |
| “I think my card was stolen” | lost_or_stolen_card |
- `card_arrival` - checking on a card that should have arrived
- `card_delivery_estimate` - asking about delivery timeline
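For reference, Banking77 is available through the Hugging Face `datasets` library; a quick way to inspect the label space (split and field names follow the public dataset card):

```python
# Inspect Banking77 with the Hugging Face `datasets` library
# (pip install datasets). Each row has "text" and "label" fields;
# label names come from the ClassLabel feature.
from datasets import load_dataset

ds = load_dataset("banking77", split="train")
intents = ds.features["label"].names        # the 77 intent strings

print(len(intents))                         # 77
example = ds[0]
print(example["text"], "->", intents[example["label"]])
```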
Baseline accuracy: 43.5%
Baseline Seed Performance
| Seed | Query | Baseline Output | Correct |
|---|---|---|---|
| 40 | “I need to find out where the card is that I ordered.” | card_arrival | ✓ |
| 34 | “Where is the card I ordered 2 weeks ago?” | card_arrival | ✓ |
| 56 | “How long does it take for me to get my new card?” | card_delivery_estimate | ✗ |
| 114 | “I still haven’t gotten my new card. When will it get here?” | card_arrival | ✓ |
| 93 | “WHAT IS THE SOLUTION OF THIS PROBLEM” | card_arrival | ✓ |
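The baseline prompt itself isn't reproduced in this post; the sketch below shows the general shape of the zero-shot classification call being scored, using the OpenAI Python SDK. The system prompt and `INTENTS` list are illustrative, not the actual baseline:

```python
# Minimal sketch of a zero-shot intent classification call.
from openai import OpenAI

client = OpenAI()
INTENTS = ["card_arrival", "card_delivery_estimate", "lost_or_stolen_card"]  # + 74 more

def classify(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "system",
             "content": "Classify the customer query into exactly one of these "
                        "intents and output only the label: " + ", ".join(INTENTS)},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content.strip()
```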
Phase 1: Initial Population
Twelve candidate prompts were generated. Their accuracies ranged from 2.5% to 65%:

| Candidate | Accuracy | What Happened |
|---|---|---|
| c9 | 65% | Best initial - clear input/output structure |
| c5 | 62.5% | Strong - explicit field descriptions |
| c0, c11 | 57.5% | Good - but introduced regressions |
| c8 | 55% | Decent - two-input structure |
| c7 | 47.5% | Mixed - verbose descriptions |
| c2 | 40% | Prompt leakage failures |
| c4 | 37.5% | Lowercase field names |
| c1 | 12.5% | Query echoing failures |
| c3 | 7.5% | Numbered lists misinterpreted |
| c10 | 5% | Inverted input priority |
| c6 | 2.5% | Near-total structural confusion |
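For concreteness, the accuracy figures above amount to exact-match scoring over the training seeds; a minimal sketch, where `run_prompt` is a placeholder for executing a candidate prompt against the model (for instance, the `classify()` sketch above with the candidate text as the system prompt):

```python
from typing import Callable

# Exact-match accuracy of one candidate prompt over a set of seeds.
# `run_prompt(prompt, query)` is a stand-in, not a real API.
def accuracy(prompt: str, seeds: list[dict],
             run_prompt: Callable[[str, str], str]) -> float:
    correct = sum(
        run_prompt(prompt, seed["query"]) == seed["intent"]
        for seed in seeds
    )
    return correct / len(seeds)
```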
Seed-Level Comparison: What Went Wrong
Seed 34: “Where is the card I ordered 2 weeks ago?” (Expected: `card_arrival`)
| Candidate | Output | Result |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_delivery_estimate | ✗ regression |
| c1 | "Where is the card I ordered 2 weeks ago?" | ✗ echoed query |
| c2 | "Where is the card I ordered 2 weeks ago?" | ✗ echoed query |
| trans_00013 | card_arrival | ✓ fixed |
Seed 93: “WHAT IS THE SOLUTION OF THIS PROBLEM” (Expected: `card_arrival`)
| Candidate | Output | Result |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_not_working | ✗ regression |
| c1 | "WHAT IS THE SOLUTION OF THIS PROBLEM" | ✗ echoed query |
| c2 | "what_is_the_solution_of_this_problem" | ✗ prompt leakage |
| trans_00013 | card_arrival | ✓ fixed |
c2's snake_cased output mimics the `customer_query` field name from its prompt structure.
Seed 114: “I still haven’t gotten my new card. When will it get here?” (Expected: `card_arrival`)
| Candidate | Output | Result |
|---|---|---|
| Baseline | card_arrival | ✓ |
| c0 | card_delivery_estimate | ✗ regression |
| c1 | "Customer" | ✗ prompt fragment |
| trans_00013 | card_arrival | ✓ fixed |
"Customer" - a fragment of the prompt’s Customer Query field name. This indicates the model was confused about what to output.
Failure Mode Analysis
The seed data reveals three distinct failure modes:

1. Intent Confusion (c0, c4, c7) - the model outputs a valid intent, but the wrong one
   - Example: `card_delivery_estimate` instead of `card_arrival`
   - Cause: prompt structure didn't disambiguate similar intents
2. Query Echoing (c1, c2) - the model repeats the input query instead of classifying it
   - Example: `"Where is the card I ordered 2 weeks ago?"`
   - Cause: prompt structure confused input vs. output expectations
3. Prompt Leakage (c1, c2) - the model outputs fragments of the prompt template
   - Examples: `"customer_query"`, `"{customer_query}"`, `"Customer"`
   - Cause: field name descriptions leaked into the output space
Phase 2: Evolution
trans_00013 descended from trans_00010, inheriting successful patterns while fixing failure modes.
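The evolution loop itself isn't shown in this post. Under the assumption of a standard GEPA-style search (LLM-driven prompt mutation plus Pareto selection), it looks roughly like the sketch below; the helpers are trivial stand-ins, not Synth AI's actual implementation:

```python
import random

# High-level sketch of the search loop, using the run parameters
# listed under "Full Configuration" below.
def mutate(prompt: str) -> str:
    return prompt + "\n# mutated"              # stand-in for LLM-driven rewriting

def evaluate(prompt: str, seeds) -> float:
    return random.random()                     # stand-in for accuracy scoring

def pareto_front(scored, max_size: int):
    scored.sort(key=lambda cs: cs[1], reverse=True)
    return [c for c, _ in scored[:max_size]]   # simplification of Pareto selection

population = [f"prompt c{i}" for i in range(12)]    # the 12 Phase 1 candidates
train_seeds = range(120)                            # Training seeds: 120
for generation in range(20):                        # Generations: 20
    parents = random.sample(population, k=6)        # Children per generation: 6
    children = [mutate(p) for p in parents]
    scored = [(c, evaluate(c, train_seeds)) for c in population + children]
    population = pareto_front(scored, max_size=40)  # Pareto set size: 40
```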
What trans_00013 Fixed
| Seed | Query | c0 | trans_00013 |
|---|---|---|---|
| 34 | “Where is the card I ordered 2 weeks ago?” | ✗ card_delivery_estimate | ✓ card_arrival |
| 93 | “WHAT IS THE SOLUTION OF THIS PROBLEM” | ✗ card_not_working | ✓ card_arrival |
| 114 | “I still haven’t gotten my new card. When will it get here?” | ✗ card_delivery_estimate | ✓ card_arrival |
trans_00013 recovered the baseline’s correct predictions while maintaining the structural improvements that helped on other seeds.
The Winning Transformation
- Brackets: `[1. Input Description]` creates clear section boundaries
- `Customer Query` matches the user message format exactly (no `customer_query` mismatch)
- "describing a banking need" frames the task more precisely than "asking about an issue"
- "complete list of 77" explicitly scopes the intent space
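trans_00013's full text isn't reproduced here, but based on the bullets above, its structure plausibly looks something like this (a reconstruction, not the verbatim prompt):

```text
[1. Input Description]
Customer Query: a message from a customer describing a banking need.

[2. Output Format]
Respond with exactly one intent label from the complete list of 77:
card_arrival, card_delivery_estimate, lost_or_stolen_card, ...
```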
The Sticking Point
Seed 56: “How long does it take for me to get my new card?” (Expected: `card_arrival`)
| Candidate | Output | Correct |
|---|---|---|
| Baseline | card_delivery_estimate | ✗ |
| c0 | card_delivery_estimate | ✗ |
| c9 | card_delivery_estimate | ✗ |
| trans_00013 | card_delivery_estimate | ✗ |
Every candidate, including trans_00013, got this wrong: the query reads like a delivery timing question, yet the dataset labels it `card_arrival`. This points to a labeling ambiguity in the dataset itself.
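One way to surface such cases automatically is to flag seeds where every candidate agrees on the same non-gold label, as they do here. A small sketch, where `predictions[seed_id]` maps candidate name to predicted intent:

```python
# Flag seeds that look like label noise: all candidates agree,
# but the shared answer disagrees with the gold label (e.g. seed 56).
def suspicious_seeds(predictions: dict, gold: dict) -> list:
    flagged = []
    for seed_id, preds in predictions.items():
        outputs = set(preds.values())
        if len(outputs) == 1 and outputs != {gold[seed_id]}:
            flagged.append(seed_id)
    return flagged
```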
Phase 3: Validation
The top 10 candidates were validated on 200 held-out seeds (120-319):

| Metric | trans_00013 |
|---|---|
| Train (120 seeds) | 82.5% |
| Val (200 seeds) | 64% |
| Drop | -18.5 pp |
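In code, the validation step is just the same accuracy metric applied to unseen seeds, with the train-to-val drop as an overfitting signal. A sketch reusing the `accuracy` helper from Phase 1; `best_prompt`, `train_seeds`, `val_seeds`, and `run_prompt` are placeholders from the earlier sketches:

```python
# Generalization check on held-out seeds 120-319.
train_acc = accuracy(best_prompt, train_seeds, run_prompt)   # 0.825 in this run
val_acc = accuracy(best_prompt, val_seeds, run_prompt)       # 0.640 in this run
drop_pp = (train_acc - val_acc) * 100                        # 18.5 percentage points
```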
Phase 4: Pareto Selection
Net Improvement
| Metric | Baseline | trans_00013 | Δ |
|---|---|---|---|
| Train | 43.5% | 82.5% | +39 pp |
| Val | ~43% | 64% | +21 pp |
- 17 disagreement seeds between c9 and the baseline
- c9 won 13, the baseline won 4
- Net: +9 seeds flipped to correct
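The flip accounting behind those numbers is straightforward; a sketch that reproduces the 13-vs-4 tally, where `preds_base` and `preds_new` map seed id to predicted intent:

```python
# Count net flips between two candidates on the disagreement seeds.
def net_flips(preds_base: dict, preds_new: dict, gold: dict) -> int:
    base_wins = new_wins = 0
    for seed_id, label in gold.items():
        base_ok = preds_base[seed_id] == label
        new_ok = preds_new[seed_id] == label
        if new_ok and not base_ok:
            new_wins += 1        # 13 here
        elif base_ok and not new_ok:
            base_wins += 1       # 4 here
    return new_wins - base_wins  # net +9
```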
What GEPA Learned
- Bracketed sections: `[1. Input Description]` over `1. Input Description`
- Exact field name matching: `Customer Query`, not `customer_query`
- Precise task framing: "describing a banking need" over "asking about an issue"
- Explicit output boundaries: a `[2. Output Format]` section prevents prompt leakage
About the run
Minimal Config
Only 4 fields are required to reproduce this run.
Full Configuration
- Model: gpt-4.1-nano (detected from container)
- Initial population: 12
- Generations: 20
- Children per generation: 6
- Validation top K: 10
- Pareto set size: 40
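For reference, the same parameters as a plain config dict; the key names here are illustrative, not Synth AI's actual schema:

```python
# Run parameters from above; field names are hypothetical.
config = {
    "model": "gpt-4.1-nano",
    "initial_population": 12,
    "generations": 20,
    "children_per_generation": 6,
    "validation_top_k": 10,
    "pareto_set_size": 40,
}
```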
Metrics
- Runtime: 5m 54s
- Cost: $0.47
- Training seeds: 120
- Validation seeds: 200
Run This Yourself with Synth AI
Optimize your model’s prompts for free with Synth AI