After a prompt learning job completes, query the optimized prompts and evaluate them on a held-out validation set.

Querying Results

Python API

from synth_ai.learning import get_prompts, get_prompt_text, get_scoring_summary

# Get all results
results = get_prompts(
    job_id="pl_abc123",
    base_url="http://localhost:8000",
    api_key="sk_..."
)

# Access best prompt
best_prompt = results["best_prompt"]
best_score = results["best_score"]
print(f"Best Score: {best_score:.3f}")

# Get top-K prompts
for prompt_info in results["top_prompts"]:
    print(f"Rank {prompt_info['rank']}: {prompt_info['train_accuracy']:.3f}")
    print(prompt_info["full_text"])

# Quick access to best prompt text only
best_text = get_prompt_text(
    job_id="pl_abc123",
    base_url="http://localhost:8000",
    api_key="sk_...",
    rank=1  # 1 = best, 2 = second best, etc.
)

# Get scoring statistics
summary = get_scoring_summary(
    job_id="pl_abc123",
    base_url="http://localhost:8000",
    api_key="sk_..."
)
print(f"Best: {summary['best_train_accuracy']:.3f}")
print(f"Mean: {summary['mean_train_accuracy']:.3f}")
print(f"Tried: {summary['num_candidates_tried']}")

REST API

# Get job status
curl -H "Authorization: Bearer $SYNTH_API_KEY" \
  http://localhost:8000/api/prompt-learning/online/jobs/JOB_ID

# Stream events
curl -H "Authorization: Bearer $SYNTH_API_KEY" \
  http://localhost:8000/api/prompt-learning/online/jobs/JOB_ID/events/stream

# Get metrics
curl -H "Authorization: Bearer $SYNTH_API_KEY" \
  http://localhost:8000/api/prompt-learning/online/jobs/JOB_ID/metrics
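If you prefer Python over curl, the same endpoints can be called with requests. The following is a minimal sketch using only the three endpoints above; the event payload format is not specified here, so streamed lines are printed as-is.

import os
import requests

BASE = "http://localhost:8000/api/prompt-learning/online/jobs"
HEADERS = {"Authorization": f"Bearer {os.environ['SYNTH_API_KEY']}"}

# Poll job status
status = requests.get(f"{BASE}/pl_abc123", headers=HEADERS, timeout=30).json()
print(status)

# Stream events as they arrive (server-sent events); print raw lines
with requests.get(f"{BASE}/pl_abc123/events/stream", headers=HEADERS, stream=True, timeout=300) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode())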

Understanding Results

Score Types

Prompt learning jobs track two types of scores:
  • prompt_best_train_score: Best accuracy on training seeds (used during optimization)
  • prompt_best_validation_score: Best accuracy on validation seeds (held-out evaluation)
The validation score provides an unbiased estimate of generalization performance.
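As a quick way to check both numbers, the sketch below reads them from the metrics endpoint shown in the REST API section above. The exact response schema is an assumption here, so adjust the key lookups to match what your deployment returns.

import requests

# Hedged sketch: fetch job metrics and read both score types.
# The field names mirror the metric names listed above; the response
# layout is assumed, so adapt the lookups as needed.
resp = requests.get(
    "http://localhost:8000/api/prompt-learning/online/jobs/pl_abc123/metrics",
    headers={"Authorization": "Bearer sk_..."},
    timeout=30,
)
resp.raise_for_status()
metrics = resp.json()
print("Best train score:", metrics.get("prompt_best_train_score"))
print("Best validation score:", metrics.get("prompt_best_validation_score"))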

Pareto Front

GEPA maintains a Pareto front of prompt variants balancing:
  • Accuracy (primary objective) – Task performance
  • Token count (efficiency objective) – Prompt length
  • Tool call rate (task-specific objective) – Function calling frequency
Query multiple ranks to explore the trade-offs:
# Get the top 5 prompts from the Pareto front
for rank in range(1, 6):
    prompt = get_prompt_text(
        job_id="pl_abc123", base_url="http://localhost:8000", api_key="sk_...", rank=rank
    )
    print(f"Rank {rank}: {len(prompt)} characters")

Validation Evaluation

After optimization, evaluate the best prompts on your validation set:
from synth_ai.learning import get_prompt_text

# Get the best prompt
best_prompt = get_prompt_text(
    job_id="pl_abc123", base_url="http://localhost:8000", api_key="sk_...", rank=1
)

# Evaluate on your held-out validation seeds (here: seeds 0-49)
validation_seeds = list(range(50))

# evaluate_prompt is your own evaluation helper; a sketch follows this example
results = evaluate_prompt(
    prompt=best_prompt,
    task_app_url="http://127.0.0.1:8102",
    seeds=validation_seeds
)

print(f"Validation Accuracy: {results['accuracy']:.3f}")

Expected Performance

GEPA typically improves accuracy over generations:
| Generation | Typical Accuracy | Notes |
|---|---|---|
| 1 (baseline) | 60-75% | Initial random/baseline prompts |
| 5 | 75-80% | Early optimization gains |
| 10 | 80-85% | Convergence begins |
| 15 (final) | 85-90%+ | Optimized prompts on Pareto front |

Next Steps