synth_ai.sdk.api.train.rl
Experimental
First-class SDK API for reinforcement learning (RL/GSPO).
This module provides high-level abstractions for running RL training jobs, both via the CLI (uvx synth-ai train --type rl) and programmatically from Python scripts.
Example CLI usage:
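uvx synth-ai train --type rl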
synth_ai.sdk.api.train.configs.rl
RL (Reinforcement Learning) configuration models for GSPO training.
This module defines the configuration schema for RL training jobs using
GSPO (Group Sequence Policy Optimization), a policy gradient method
for fine-tuning language models.
Paper: PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation
When to Use RL/GSPO
- Training model weights (not just prompts)
- Multi-turn agent tasks with sequential decision making
- When you have a reward signal from environment interaction
- Scaling to larger models with GPU training
- On-policy learning with pipelined rollouts
See also:
- Training reference: /training/gspo
- Quickstart: /quickstart/reinforcement-learning
Job API
RLJobConfig
Configuration for an RL training job.
RLJob
High-level SDK class for running RL training jobs (GSPO, GRPO, PPO, etc.).
This class provides a clean API for:
- Submitting RL training jobs
- Polling job status
- Retrieving results
from_config
Parameters:
- config_path: Path to TOML config file
- backend_url: Backend API URL (defaults to env var BACKEND_BASE_URL)
- api_key: API key (defaults to env var SYNTH_API_KEY)
- task_app_url: Task app URL (defaults to env var TASK_APP_URL or config file)
- task_app_api_key: Task app API key (defaults to env var ENVIRONMENT_API_KEY)
- allow_experimental: Allow experimental features
- overrides: Config overrides (merged into the loaded config)
- idempotency_key: Optional idempotency key for job submission
Returns:
- RLJob instance
from_job_id
Parameters:
- job_id: Existing job ID
- backend_url: Backend API URL (defaults to env var BACKEND_BASE_URL)
- api_key: API key (defaults to env var SYNTH_API_KEY)
Returns:
- RLJob instance for the existing job
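As a minimal sketch (using only the constructor parameters documented above; the config path, idempotency key, and job ID are placeholders), a job handle can be created either from a config file or from an existing job ID:

```python
from synth_ai.sdk.api.train.rl import RLJob

# Create a job from a TOML config; backend_url, api_key, task_app_url, and
# task_app_api_key fall back to their environment variables when omitted.
job = RLJob.from_config(
    config_path="configs/rl_gspo.toml",   # placeholder path
    idempotency_key="rl-demo-001",        # optional, placeholder value
)

# Or attach to a job that was already submitted (e.g., via the CLI).
existing_job = RLJob.from_job_id(job_id="job_abc123")  # placeholder job ID
```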
submit
Returns:
- Job ID
Raises:
- RuntimeError: If job submission fails
- ValueError: If the task app health check fails
job_id
get_status
Returns:
- Job status dictionary
Raises:
- RuntimeError: If the job hasn't been submitted yet
poll_until_complete
Parameters:
- timeout: Maximum seconds to wait
- interval: Seconds between poll attempts
- on_status: Optional callback called on each status update
Returns:
- Final job status dictionary
Raises:
- RuntimeError: If the job hasn't been submitted yet
- TimeoutError: If the timeout is exceeded
get_results
Returns:
- Job results dictionary
Raises:
- RuntimeError: If the job hasn't completed successfully
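Continuing that sketch, a typical lifecycle chains submit, poll_until_complete, and get_results. The argument values are illustrative, and the status-dictionary key read in the callback is an assumption rather than a documented schema:

```python
from synth_ai.sdk.api.train.rl import RLJob

job = RLJob.from_config(config_path="configs/rl_gspo.toml")  # placeholder path

# Submit the job; raises RuntimeError on submission failure or
# ValueError if the task app health check fails.
job_id = job.submit()
print(f"Submitted RL job {job_id}")

def on_status(status: dict) -> None:
    # The key read here is assumed, not a documented schema.
    print(f"Job update: {status.get('status')}")

# Block until the job finishes; raises TimeoutError past `timeout` seconds.
final_status = job.poll_until_complete(
    timeout=3600,      # maximum seconds to wait
    interval=30,       # seconds between poll attempts
    on_status=on_status,
)

# Retrieve results after successful completion.
results = job.get_results()
print(results)
```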
Configuration Reference
RLServicesConfig
Service URLs for RL training.
Attributes:
- task_url: URL of your task app (tunnel URL). Required.
- judge_url: Optional judge service URL for reward augmentation.
ModelConfig
Model configuration for RL training.
Attributes:
- source: Fine-tuned model checkpoint ID (e.g., "ft:checkpoint_id"). Mutually exclusive with base.
- base: Base HuggingFace model (e.g., "Qwen/Qwen3-4B"). Mutually exclusive with source.
- trainer_mode: Training mode: "lora", "qlora", or "full". Required.
- label: Human-readable model identifier for tracking. Required.
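For illustration, the mutual exclusivity means a model section sets exactly one of source or base. The dicts below use only the field names documented above; the values are placeholders:

```python
# Resume RL from a fine-tuned checkpoint...
model_from_checkpoint = {
    "source": "ft:checkpoint_id",  # placeholder checkpoint ID
    "trainer_mode": "lora",
    "label": "my-rl-policy-v1",    # placeholder label
}

# ...or start from a base HuggingFace model. Never set both source and base.
model_from_base = {
    "base": "Qwen/Qwen3-4B",
    "trainer_mode": "lora",
    "label": "my-rl-policy-v1",
}
```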
RolloutConfig
Rollout configuration for RL episode collection.
Attributes:
- env_name: Environment/task name registered in your task app. Required.
- policy_name: Policy identifier for this training run. Required.
- env_config: Optional environment-specific configuration dict.
- policy_config: Optional policy-specific configuration dict.
- max_turns: Maximum steps per episode. Required.
- episodes_per_batch: Number of episodes to collect per training batch. Required.
- max_concurrent_rollouts: Maximum parallel rollout workers. Required.
- batches_per_step: Rollout batches per training step. Default: 1.
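A hedged sketch of a rollout section as a plain dict, using only the fields above; all values are illustrative:

```python
rollout = {
    "env_name": "grpo-crafter",       # environment registered in your task app
    "policy_name": "crafter-policy",  # placeholder policy identifier
    "max_turns": 20,                  # maximum steps per episode
    "episodes_per_batch": 64,         # episodes collected per training batch
    "max_concurrent_rollouts": 16,    # parallel rollout workers
    "batches_per_step": 1,            # default
    "env_config": {},                 # optional environment-specific settings
    "policy_config": {},              # optional policy-specific settings
}
```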
WeightSyncConfig
Weight synchronization configuration for pipelined training.
Controls how model weights are synchronized between the training process
and the inference server (vLLM) during on-policy learning.
Attributes:
- enable: Enable weight synchronization. Default: True.
- targets: List of sync targets (e.g., ["vllm"]). Default: ["vllm"].
- mode: Sync mode: "full" or "lora". Default: matches trainer_mode.
- direct: Use direct GPU-to-GPU transfer. Default: True.
- verify_every_k: Verify sync integrity every K steps. Default: 10.
RewardsConfig
Rewards configuration for RL training.
Controls reward shaping and step-level reward signals for training.
Attributes:
- step_rewards_enabled: Enable step-level rewards. Default: False.
- step_rewards_mode: Step reward computation mode. Default: "indicator".
- step_rewards_indicator_lambda: Lambda for indicator rewards. Default: 0.1.
- step_rewards_beta: Beta scaling for step rewards. Default: 1.0.
- step_rewards_strategy: Reward assignment strategy. Default: "uniform".
- event_rewards_kind: Event reward type: "binary" or "continuous". Default: "binary".
RLTrainingConfig
Training hyperparameters for RL/GSPO.
Attributes:
- num_epochs: Number of training epochs. Required.
- iterations_per_epoch: Training iterations per epoch. Required.
- gradient_accumulation_steps: Steps to accumulate gradients. Default: 1.
- max_accumulated_minibatch: Maximum accumulated minibatch size.
- max_turns: Maximum turns per episode (must match rollout.max_turns). Required.
- batch_size: Training batch size. Required.
- group_size: GSPO group size for advantage computation. Required.
- learning_rate: Optimizer learning rate (e.g., 5e-5). Required.
- log_interval: Log training metrics every N steps. Default: 10.
- weight_sync_interval: Sync weights to vLLM every N steps. Default: 1.
- step_rewards_enabled: DEPRECATED - use rewards.step_rewards_enabled.
- step_rewards_mode: DEPRECATED - use rewards.step_rewards_mode.
- step_rewards_indicator_lambda: DEPRECATED - use rewards.step_rewards_indicator_lambda.
- step_rewards_beta: DEPRECATED - use rewards.step_rewards_beta.
- step_rewards_strategy: DEPRECATED - use rewards.step_rewards_strategy.
- event_rewards_kind: DEPRECATED - use rewards.event_rewards_kind.
- weight_sync: Weight synchronization configuration.
- lora: LoRA hyperparameters (r, alpha, dropout).
- rewards: Reward shaping configuration.
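A hedged sketch of a training section using only the non-deprecated fields above; the numbers are illustrative rather than recommended hyperparameters, and the LoRA keys follow the attribute description:

```python
training = {
    "num_epochs": 1,
    "iterations_per_epoch": 50,
    "batch_size": 32,
    "group_size": 8,                   # GSPO group size for advantage computation
    "learning_rate": 5e-5,
    "max_turns": 20,                   # must match rollout.max_turns
    "gradient_accumulation_steps": 1,
    "log_interval": 10,
    "weight_sync_interval": 1,         # sync weights to vLLM every step
    "weight_sync": {"enable": True, "targets": ["vllm"]},
    "lora": {"r": 16, "alpha": 32, "dropout": 0.05},  # illustrative LoRA values
    "rewards": {"step_rewards_enabled": False, "event_rewards_kind": "binary"},
}
```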
EvaluationConfig
Evaluation configuration during RL training.
Attributes:
- instances: Number of evaluation instances to run. Required.
- every_n_iters: Evaluate every N training iterations. Required.
- seeds: List of random seeds for reproducible evaluation. Required.
JudgeOptionsConfig
Judge options for reward augmentation.
Attributes:
- event: Enable event-level judging. Default: False.
- outcome: Enable outcome-level judging. Default: True.
- provider: Judge provider: "openai" or "anthropic". Default: "openai".
- model: Judge model (e.g., "gpt-4o"). Default: provider default.
- rubric_id: Registered rubric ID for evaluation criteria.
- rubric_overrides: Override specific rubric criteria.
- tracks: Rubric tracks to evaluate. Default: all tracks.
- weights: Custom weights for rubric tracks.
- max_concurrency: Maximum concurrent judge calls. Default: 10.
RubricConfig
DEPRECATED: Rubric configuration for reward blending.
Use judge.enabled and judge.reward_blend instead.
Attributes:
- enabled: Enable rubric-based reward blending. Default: False.
- reward_blend: Weights for reward sources (env, event, outcome). Default: {"env": 1.0}.
JudgeConfig
Judge configuration for AI-based reward augmentation.
The judge provides additional reward signals by evaluating agent behavior
against rubrics or criteria, complementing environment rewards.
Attributes:
- type: Judge type: "llm" or "rule". Default: "llm".
- timeout_s: Timeout for judge calls in seconds. Default: 30.
- enabled: Master switch for judge/rubric evaluation. Default: False.
- reward_blend: Weights for blending reward sources, e.g., {"env": 0.5, "outcome": 0.5}.
- rubric: DEPRECATED - use enabled and reward_blend directly.
- options: Detailed judge options (model, provider, tracks).
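A hedged sketch of enabling judge-based reward augmentation with the blend weights described above; the field names come from the attribute lists, but the exact nesting of options is an assumption:

```python
judge = {
    "enabled": True,      # master switch for judge/rubric evaluation
    "type": "llm",
    "timeout_s": 30,
    "reward_blend": {"env": 0.5, "outcome": 0.5},  # blend env and judge rewards
    "options": {
        "provider": "openai",
        "model": "gpt-4o",
        "outcome": True,          # outcome-level judging
        "event": False,           # event-level judging
        "max_concurrency": 10,
    },
}
```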
SmokeConfig
Configuration for local smoke testing (CLI only, ignored by trainer).
Used by the synth-ai smoke command to test task apps locally before
submitting full training jobs.
Attributes:
- task_url: Task app URL for testing.
- env_name: Environment name to test.
- policy_name: Policy name to test.
- max_steps: Maximum steps per test episode. Default: 10.
- policy: Policy type: "mock", "gpt-5-nano", "openai", or "groq".
- model: Model to use for the policy.
- mock_backend: Mock backend type: "synthetic" or "openai".
- mock_port: Port for the mock backend server.
- return_trace: Return the full episode trace. Default: False.
- use_mock: Use a mock policy. Default: False.
- task_app_name: Task app to serve (e.g., "grpo-crafter").
- task_app_port: Port for the task app. Default: 8765.
- task_app_env_file: Path to the .env file for the task app.
- task_app_force: Use the --force flag when serving.
- sqld_auto_start: Auto-start the sqld server. Default: False.
- sqld_db_path: Database path. Default: ./traces/local.db.
- sqld_hrana_port: Hrana WebSocket port. Default: 8080.
- sqld_http_port: HTTP API port. Default: 8081.
RLConfig
Root configuration for RL/GSPO training jobs.
This is the top-level config loaded from a TOML file for reinforcement
learning training using GSPO (Group Sequence Policy Optimization).
Attributes:
- algorithm: Algorithm configuration (type="online", method="policy_gradient"). Required.
- services: Service URLs (task_url, judge_url). Required.
- compute: GPU and compute configuration.
- topology: DEPRECATED - use compute.topology instead.
- vllm: vLLM inference server configuration.
- reference: DEPRECATED - use compute.topology.reference_placement.
- model: DEPRECATED - use policy instead.
- policy: Policy configuration (preferred over model).
- lora: DEPRECATED - use training.lora instead.
- rollout: Rollout/episode collection configuration.
- evaluation: Evaluation configuration during training.
- training: Training hyperparameters.
- rubric: DEPRECATED - use judge.reward_blend and judge.enabled.
- judge: Judge configuration for reward augmentation.
- tags: Optional metadata tags for tracking.
- smoke: CLI-only smoke testing configuration.
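Finally, a minimal sketch of overriding a few of these top-level sections programmatically. Since overrides are merged into the TOML config, only changed values need to appear; the nesting shape and all values shown are assumptions or placeholders:

```python
from synth_ai.sdk.api.train.rl import RLJob

overrides = {
    "services": {"task_url": "https://my-task-app.example.com"},  # placeholder tunnel URL
    "training": {"learning_rate": 1e-5, "batch_size": 32},
    "evaluation": {"instances": 16, "every_n_iters": 10, "seeds": [0, 1, 2, 3]},
    "tags": {"experiment": "gspo-demo"},
}

job = RLJob.from_config(
    config_path="configs/rl_gspo.toml",  # placeholder path
    overrides=overrides,
)
```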