synth_ai.sdk.api.train.rl

Experimental first-class SDK API for reinforcement learning (RL/GSPO). This module provides high-level abstractions for running RL training jobs, both via the CLI (uvx synth-ai train --type rl) and programmatically in Python scripts. Example CLI usage:
uvx synth-ai train --type rl --config my_config.toml --poll
Example SDK usage:
import asyncio

from synth_ai.sdk.api.train.rl import RLJob
from synth_ai.sdk.task.in_process import InProcessTaskApp


async def main() -> None:
    # Serve the task app in-process and point the RL job at it.
    async with InProcessTaskApp(task_app_path="my_task_app.py", port=8114) as task_app:
        job = RLJob.from_config(
            config_path="my_config.toml",
            task_app_url=task_app.url,
        )
        job.submit()
        result = job.poll_until_complete()
        print(f"Final reward: {result.get('final_reward', 'N/A')}")


asyncio.run(main())

synth_ai.sdk.api.train.configs.rl

RL (Reinforcement Learning) configuration models for GSPO training. This module defines the configuration schema for RL training jobs using GSPO (Group Sequence Policy Optimization), a policy gradient method for fine-tuning language models. Paper: PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation

When to Use RL/GSPO

  • Training model weights (not just prompts)
  • Multi-turn agent tasks with sequential decision making
  • When you have a reward signal from environment interaction
  • Scaling to larger models with GPU training
  • On-policy learning with pipelined rollouts
Example TOML configuration:
[algorithm]
type = "online"               # Required: "online" for RL
method = "policy_gradient"    # Required: "policy_gradient" or "ppo"
variety = "gspo"              # Required: "gspo" for GSPO

[services]
task_url = "https://your-tunnel.trycloudflare.com"  # Required: task app URL
judge_url = "https://synth-backend.onrender.com/api"  # Optional: judge service

[compute]
gpu_type = "H100"             # Required: GPU type
gpu_count = 2                 # Required: number of GPUs

[topology]
type = "single_node_split"
gpus_for_vllm = 1             # GPUs for inference server
gpus_for_training = 1         # GPUs for training
tensor_parallel = 1           # Tensor parallelism degree

[model]
base = "Qwen/Qwen3-4B"        # Base model (or source = "ft:checkpoint_id")
trainer_mode = "lora"         # "lora", "full", or "qlora"
label = "my-rl-model"         # Model identifier

[rollout]
env_name = "my-task"          # Environment/task name
policy_name = "my-policy"     # Policy identifier
max_turns = 10                # Max steps per episode
episodes_per_batch = 32       # Episodes per training batch

[training]
num_epochs = 1                # Number of training epochs
iterations_per_epoch = 20     # Iterations per epoch
batch_size = 16               # Training batch size
group_size = 4                # GSPO group size
learning_rate = 5e-5          # Optimizer learning rate

[training.lora]
r = 16                        # LoRA rank
alpha = 32                    # LoRA alpha
dropout = 0.1                 # LoRA dropout

[evaluation]
instances = 50                # Evaluation instances
every_n_iters = 10            # Evaluate every N iterations
seeds = [0, 1, 2, 3, 4, 5]    # Evaluation seeds
See Also:
  • Training reference: /training/gspo
  • Quickstart: /quickstart/reinforcement-learning

Job API

RLJobConfig

Configuration for an RL training job.

RLJob

High-level SDK class for running RL training jobs (GSPO, GRPO, PPO, etc.). This class provides a clean API for:
  1. Submitting RL training jobs
  2. Polling job status
  3. Retrieving results
Methods:

from_config

from_config(cls, config_path: str | Path, backend_url: Optional[str] = None, api_key: Optional[str] = None, task_app_url: Optional[str] = None, task_app_api_key: Optional[str] = None, allow_experimental: Optional[bool] = None, overrides: Optional[Dict[str, Any]] = None, idempotency_key: Optional[str] = None) -> RLJob
Create an RL job from a config file. Args:
  • config_path: Path to TOML config file
  • backend_url: Backend API URL (defaults to env var BACKEND_BASE_URL)
  • api_key: API key (defaults to env var SYNTH_API_KEY)
  • task_app_url: Task app URL (defaults to env var TASK_APP_URL or config file)
  • task_app_api_key: Task app API key (defaults to env var ENVIRONMENT_API_KEY)
  • allow_experimental: Allow experimental features
  • overrides: Config overrides (merged into config)
  • idempotency_key: Optional idempotency key for job submission
Returns:
  • RLJob instance
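Example (a minimal sketch: the config path, tunnel URL, and override values are illustrative placeholders, and the overrides shape is assumed to mirror the TOML sections above):
from synth_ai.sdk.api.train.rl import RLJob

# Placeholder config path and task app URL; overrides are merged into the loaded config.
job = RLJob.from_config(
    config_path="my_config.toml",
    task_app_url="https://your-tunnel.trycloudflare.com",
    overrides={"training": {"learning_rate": 1e-5}},  # assumed to mirror the [training] table
)
job_id = job.submit()
print(f"Submitted RL job: {job_id}")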

from_job_id

from_job_id(cls, job_id: str, backend_url: Optional[str] = None, api_key: Optional[str] = None) -> RLJob
Resume an existing RL job by ID. Args:
  • job_id: Existing job ID
  • backend_url: Backend API URL (defaults to env var BACKEND_BASE_URL)
  • api_key: API key (defaults to env var SYNTH_API_KEY)
Returns:
  • RLJob instance for the existing job
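Example (a short sketch; the job ID is a placeholder for a value previously returned by submit()):
from synth_ai.sdk.api.train.rl import RLJob

job = RLJob.from_job_id("rl_job_123")  # placeholder job ID
print(job.get_status())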

submit

submit(self) -> str
Submit the job to the backend. Returns:
  • Job ID
Raises:
  • RuntimeError: If job submission fails
  • ValueError: If task app health check fails
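Example (a hedged sketch of handling the documented failure modes; assumes job was built with from_config as above):
try:
    job_id = job.submit()
except ValueError as exc:
    # Task app health check failed (e.g. the tunnel URL is unreachable).
    print(f"Task app health check failed: {exc}")
except RuntimeError as exc:
    # Backend rejected or could not create the job.
    print(f"Job submission failed: {exc}")
else:
    print(f"Submitted job {job_id}")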

job_id

job_id(self) -> Optional[str]
Get the job ID (None if not yet submitted).

get_status

get_status(self) -> Dict[str, Any]
Get current job status. Returns:
  • Job status dictionary
Raises:
  • RuntimeError: If job hasn’t been submitted yet

poll_until_complete

poll_until_complete(self) -> Dict[str, Any]
Poll job until it reaches a terminal state. Args:
  • timeout: Maximum seconds to wait
  • interval: Seconds between poll attempts
  • on_status: Optional callback called on each status update
Returns:
  • Final job status dictionary
Raises:
  • RuntimeError: If job hasn’t been submitted yet
  • TimeoutError: If timeout exceeded
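Example (a sketch assuming poll_until_complete accepts the timeout, interval, and on_status arguments documented above; the numeric values are illustrative, not defaults):
def print_status(status: dict) -> None:
    # Called on every poll with the current job status dictionary.
    print(f"status: {status.get('status')}")

final_status = job.poll_until_complete(
    timeout=3600,      # illustrative: give up after an hour
    interval=15,       # illustrative: poll every 15 seconds
    on_status=print_status,
)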

get_results

get_results(self) -> Dict[str, Any]
Get final job results. Returns:
  • Job results dictionary
Raises:
  • RuntimeError: If job hasn’t completed successfully
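Example (a brief sketch; the 'final_reward' key mirrors the module-level example and may differ per task app):
job.poll_until_complete()
results = job.get_results()
print(f"Final reward: {results.get('final_reward', 'N/A')}")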

Configuration Reference

RLServicesConfig

Service URLs for RL training. Attributes:
  • task_url: URL of your task app (tunnel URL). Required.
  • judge_url: Optional judge service URL for reward augmentation.

ModelConfig

Model configuration for RL training. Attributes:
  • source: Fine-tuned model checkpoint ID (e.g., “ft:checkpoint_id”). Mutually exclusive with base.
  • base: Base HuggingFace model (e.g., “Qwen/Qwen3-4B”). Mutually exclusive with source.
  • trainer_mode: Training mode - “lora”, “qlora”, or “full”. Required.
  • label: Human-readable model identifier for tracking. Required.

RolloutConfig

Rollout configuration for RL episode collection. Attributes:
  • env_name: Environment/task name registered in your task app. Required.
  • policy_name: Policy identifier for this training run. Required.
  • env_config: Optional environment-specific configuration dict.
  • policy_config: Optional policy-specific configuration dict.
  • max_turns: Maximum steps per episode. Required.
  • episodes_per_batch: Number of episodes to collect per training batch. Required.
  • max_concurrent_rollouts: Maximum parallel rollout workers. Required.
  • batches_per_step: Rollout batches per training step. Default: 1.
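Example (a sketch of a rollout section as a config mapping fragment, same shape as the [rollout] TOML table above; the env_config, policy_config, and max_concurrent_rollouts values are illustrative placeholders):
rollout = {
    "env_name": "my-task",
    "policy_name": "my-policy",
    "max_turns": 10,
    "episodes_per_batch": 32,
    "max_concurrent_rollouts": 8,            # illustrative worker count
    "batches_per_step": 1,
    "env_config": {"difficulty": "easy"},    # placeholder task-specific settings
    "policy_config": {"temperature": 0.7},   # placeholder policy settings
}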

WeightSyncConfig

Weight synchronization configuration for pipelined training. Controls how model weights are synchronized between the training process and the inference server (vLLM) during on-policy learning. Attributes:
  • enable: Enable weight synchronization. Default: True.
  • targets: List of sync targets (e.g., [“vllm”]). Default: [“vllm”].
  • mode: Sync mode - “full” or “lora”. Default: matches trainer_mode.
  • direct: Use direct GPU-to-GPU transfer. Default: True.
  • verify_every_k: Verify sync integrity every K steps. Default: 10.

RewardsConfig

Rewards configuration for RL training. Controls reward shaping and step-level reward signals for training. Attributes:
  • step_rewards_enabled: Enable step-level rewards. Default: False.
  • step_rewards_mode: Step reward computation mode. Default: “indicator”.
  • step_rewards_indicator_lambda: Lambda for indicator rewards. Default: 0.1.
  • step_rewards_beta: Beta scaling for step rewards. Default: 1.0.
  • step_rewards_strategy: Reward assignment strategy. Default: “uniform”.
  • event_rewards_kind: Event reward type - “binary” or “continuous”. Default: “binary”.

RLTrainingConfig

Training hyperparameters for RL/GSPO. Attributes:
  • num_epochs: Number of training epochs. Required.
  • iterations_per_epoch: Training iterations per epoch. Required.
  • gradient_accumulation_steps: Steps to accumulate gradients. Default: 1.
  • max_accumulated_minibatch: Maximum accumulated minibatch size.
  • max_turns: Maximum turns per episode (must match rollout.max_turns). Required.
  • batch_size: Training batch size. Required.
  • group_size: GSPO group size for advantage computation. Required.
  • learning_rate: Optimizer learning rate (e.g., 5e-5). Required.
  • log_interval: Log training metrics every N steps. Default: 10.
  • weight_sync_interval: Sync weights to vLLM every N steps. Default: 1.
  • step_rewards_enabled: DEPRECATED - use rewards.step_rewards_enabled.
  • step_rewards_mode: DEPRECATED - use rewards.step_rewards_mode.
  • step_rewards_indicator_lambda: DEPRECATED - use rewards.step_rewards_indicator_lambda.
  • step_rewards_beta: DEPRECATED - use rewards.step_rewards_beta.
  • step_rewards_strategy: DEPRECATED - use rewards.step_rewards_strategy.
  • event_rewards_kind: DEPRECATED - use rewards.event_rewards_kind.
  • weight_sync: Weight synchronization configuration.
  • lora: LoRA hyperparameters (r, alpha, dropout).
  • rewards: Reward shaping configuration.
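Example (a sketch of a training section as a config mapping, echoing the TOML example and documented defaults; the nested keys for weight_sync, lora, and rewards are assumed to match the attribute names above):
training = {
    "num_epochs": 1,
    "iterations_per_epoch": 20,
    "batch_size": 16,
    "group_size": 4,
    "learning_rate": 5e-5,
    "max_turns": 10,                          # must match rollout.max_turns
    "gradient_accumulation_steps": 1,
    "weight_sync": {"enable": True, "targets": ["vllm"], "verify_every_k": 10},
    "lora": {"r": 16, "alpha": 32, "dropout": 0.1},
    "rewards": {"step_rewards_enabled": False, "event_rewards_kind": "binary"},
}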

EvaluationConfig

Evaluation configuration during RL training. Attributes:
  • instances: Number of evaluation instances to run. Required.
  • every_n_iters: Evaluate every N training iterations. Required.
  • seeds: List of random seeds for reproducible evaluation. Required.

JudgeOptionsConfig

Judge options for reward augmentation. Attributes:
  • event: Enable event-level judging. Default: False.
  • outcome: Enable outcome-level judging. Default: True.
  • provider: Judge provider - “openai” or “anthropic”. Default: “openai”.
  • model: Judge model (e.g., “gpt-4o”). Default: provider default.
  • rubric_id: Registered rubric ID for evaluation criteria.
  • rubric_overrides: Override specific rubric criteria.
  • tracks: Rubric tracks to evaluate. Default: all tracks.
  • weights: Custom weights for rubric tracks.
  • max_concurrency: Maximum concurrent judge calls. Default: 10.

RubricConfig

DEPRECATED: Rubric configuration for reward blending. Use judge.enabled and judge.reward_blend instead. Attributes:
  • enabled: Enable rubric-based reward blending. Default: False.
  • reward_blend: Weights for reward sources (env, event, outcome). Default: {“env”: 1.0}.

JudgeConfig

Judge configuration for AI-based reward augmentation. The judge provides additional reward signals by evaluating agent behavior against rubrics or criteria, complementing environment rewards. Attributes:
  • type: Judge type - “llm” or “rule”. Default: “llm”.
  • timeout_s: Timeout for judge calls in seconds. Default: 30.
  • enabled: Master switch for judge/rubric evaluation. Default: False.
  • reward_blend: Weights for blending reward sources (e.g., {“env”: 0.5, “outcome”: 0.5}).
  • rubric: DEPRECATED - use enabled and reward_blend directly.
  • options: Detailed judge options (model, provider, tracks).
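Example (a sketch of a judge section as a config mapping with a 50/50 reward blend; values echo the attribute descriptions above and are not prescriptive):
judge = {
    "type": "llm",
    "enabled": True,
    "timeout_s": 30,
    "reward_blend": {"env": 0.5, "outcome": 0.5},   # blend environment and judged outcome rewards
    "options": {
        "provider": "openai",
        "model": "gpt-4o",
        "outcome": True,     # judge final outcomes
        "event": False,      # skip per-event judging
        "max_concurrency": 10,
    },
}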

SmokeConfig

Configuration for local smoke testing (CLI only, ignored by the trainer). Used by the synth-ai smoke command to test task apps locally before submitting full training jobs. Attributes:
  • task_url: Task app URL for testing.
  • env_name: Environment name to test.
  • policy_name: Policy name to test.
  • max_steps: Maximum steps per test episode. Default: 10.
  • policy: Policy type - “mock”, “gpt-5-nano”, “openai”, “groq”.
  • model: Model to use for policy.
  • mock_backend: Mock backend type - “synthetic” or “openai”.
  • mock_port: Port for mock backend server.
  • return_trace: Return full episode trace. Default: False.
  • use_mock: Use mock policy. Default: False.
  • task_app_name: Task app to serve (e.g., “grpo-crafter”).
  • task_app_port: Port for task app. Default: 8765.
  • task_app_env_file: Path to .env file for task app.
  • task_app_force: Use --force flag when serving.
  • sqld_auto_start: Auto-start sqld server. Default: False.
  • sqld_db_path: Database path. Default: ./traces/local.db.
  • sqld_hrana_port: Hrana WebSocket port. Default: 8080.
  • sqld_http_port: HTTP API port. Default: 8081.

RLConfig

Root configuration for RL/GSPO training jobs. This is the top-level config loaded from a TOML file for reinforcement learning training using GSPO (Group Sequence Policy Optimization). Attributes:
  • algorithm: Algorithm configuration (type=“online”, method=“policy_gradient”). Required.
  • services: Service URLs (task_url, judge_url). Required.
  • compute: GPU and compute configuration.
  • topology: DEPRECATED - use compute.topology instead.
  • vllm: vLLM inference server configuration.
  • reference: DEPRECATED - use compute.topology.reference_placement.
  • model: DEPRECATED - use policy instead.
  • policy: Policy configuration (preferred over model).
  • lora: DEPRECATED - use training.lora instead.
  • rollout: Rollout/episode collection configuration.
  • evaluation: Evaluation configuration during training.
  • training: Training hyperparameters.
  • rubric: DEPRECATED - use judge.reward_blend and judge.enabled.
  • judge: Judge configuration for reward augmentation.
  • tags: Optional metadata tags for tracking.
  • smoke: CLI-only smoke testing configuration.
Methods:

to_dict

to_dict(self) -> dict[str, Any]

from_mapping

from_mapping(cls, data: Mapping[str, Any]) -> RLConfig
Load RL config from dict/TOML mapping.

from_path

from_path(cls, path: Path) -> RLConfig
Load RL config from a TOML file path.
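Example (a short sketch of loading a config with the constructors above and round-tripping through to_dict(); the file path is a placeholder):
from pathlib import Path

from synth_ai.sdk.api.train.configs.rl import RLConfig

cfg = RLConfig.from_path(Path("my_config.toml"))     # placeholder path
cfg_again = RLConfig.from_mapping(cfg.to_dict())     # round-trip via plain dict
print(sorted(cfg_again.to_dict().keys()))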