LoRA SFT

Train larger models on constrained hardware by enabling LoRA while keeping the same SFT flow and payload schema.
  • Works via the standard SFT CLI: uvx synth-ai train --type sft --config <path>
  • Toggle with training.use_qlora = true in your TOML
  • Uses the same hyperparameter keys as FFT; the backend interprets LoRA-appropriate settings

Quickstart

uvx synth-ai train --type sft --config examples/warming_up_to_rl/configs/crafter_fft_4b.toml --dataset /abs/path/to/train.jsonl

Minimal TOML (LoRA enabled)

[job]
model = "Qwen/Qwen3-4B"
# Either set here or pass via --dataset
# data = "/abs/path/to/train.jsonl"

[compute]
gpu_type = "H100"       # required by backend
gpu_count = 1
nodes = 1

[data]
# Optional; forwarded into metadata.effective_config
topology = {}
# Optional local validation file; client uploads if present
# validation_path = "/abs/path/to/validation.jsonl"

[training]
mode = "sft_offline"
use_qlora = true         # LoRA toggle

[training.validation]
enabled = true
evaluation_strategy = "steps"
eval_steps = 20
save_best_model_at_end = true
metric_for_best_model = "val.loss"
greater_is_better = false

[hyperparameters]
n_epochs = 1
per_device_batch = 1
gradient_accumulation_steps = 64
sequence_length = 4096
learning_rate = 5e-6
warmup_ratio = 0.03

# Optional parallelism block forwarded as-is
#[hyperparameters.parallelism]
# use_deepspeed = true
# deepspeed_stage = 2
# bf16 = true
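
Save a TOML like the one above and launch it with the standard SFT command. The config path below is a placeholder; point it at your own file:

uvx synth-ai train --type sft --config /abs/path/to/lora_sft.toml --dataset /abs/path/to/train.jsonl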

What the client validates and sends

  • Validates dataset path existence (from [job].data or --dataset) and JSONL shape
  • Uploads training (and optional validation) files to /api/learning/files
  • Builds payload with:
    • model from [job].model
    • training_type = "sft_offline"
    • hyperparameters from [hyperparameters] (+ [training.validation] knobs)
    • metadata.effective_config.compute from [compute]
    • metadata.effective_config.data.topology from [data.topology]
    • metadata.effective_config.training.{mode,use_qlora} from [training]
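
Put together, the request body looks roughly like the sketch below. Values mirror the minimal TOML above; treat it as illustrative of the nesting rather than an exact wire format:

{
  "model": "Qwen/Qwen3-4B",
  "training_type": "sft_offline",
  "hyperparameters": {
    "n_epochs": 1,
    "per_device_batch": 1,
    "gradient_accumulation_steps": 64,
    "sequence_length": 4096,
    "learning_rate": 5e-6,
    "warmup_ratio": 0.03,
    "evaluation_strategy": "steps",
    "eval_steps": 20,
    "save_best_model_at_end": true,
    "metric_for_best_model": "val.loss",
    "greater_is_better": false
  },
  "metadata": {
    "effective_config": {
      "compute": { "gpu_type": "H100", "gpu_count": 1, "nodes": 1 },
      "data": { "topology": {} },
      "training": { "mode": "sft_offline", "use_qlora": true }
    }
  }
}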

Multi‑GPU guidance

  • Set [compute].gpu_type, gpu_count, and optionally nodes
  • Use [hyperparameters.parallelism] for deepspeed/FSDP/precision/TP/PP knobs; forwarded verbatim
  • Optionally add [data.topology] (e.g., container_count) for visibility; backend validates resource consistency
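
A minimal sketch of these blocks for a single node with 4 GPUs, assuming DeepSpeed ZeRO stage 2 with bf16 (values are illustrative; the parallelism table is forwarded verbatim, so use whatever keys your backend expects):

[compute]
gpu_type = "H100"
gpu_count = 4
nodes = 1

[data]
topology = { container_count = 1 }

[hyperparameters.parallelism]
use_deepspeed = true
deepspeed_stage = 2
bf16 = true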

Common issues

  • HTTP 400 missing_gpu_type: set [compute].gpu_type (and typically gpu_count) so it appears under metadata.effective_config.compute
  • Dataset not found: provide absolute path or use --dataset; the client resolves relative paths from the current working directory

Helpful CLI flags

  • --dataset to override [job].data
  • --examples N to use only the first N rows of the dataset for quick smoke tests
  • --dry-run to preview payload without submitting
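
For example, previewing the payload for a 50-example smoke test without submitting the job (paths are placeholders):

uvx synth-ai train --type sft --config /abs/path/to/lora_sft.toml --dataset /abs/path/to/train.jsonl --examples 50 --dry-run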

All sections and parameters (LoRA SFT)

The client recognizes and/or forwards the following sections:
  • [job] (client reads)
    • model (string, required): base model identifier
    • data or data_path (string): local path to the training JSONL. Required unless --dataset is passed on the command line
    • Notes:
      • Paths are resolved relative to the current working directory (CWD), not the TOML location
      • poll_seconds appears only in legacy scripts; the new CLI uses --poll-* flags instead
  • [compute] (forwarded into metadata.effective_config.compute)
    • gpu_type (string): required by backend (e.g., "H100", "A10G"). Missing this often causes HTTP 400
    • gpu_count (int): number of GPUs
    • nodes (int, optional)
  • [data] (partially read)
    • topology (table/dict): forwarded as-is to metadata.effective_config.data.topology
    • validation_path (string, optional): if present and the file exists, the client uploads it and wires validation
    • Path resolution: relative to the current working directory (CWD)
  • [training] (partially read)
    • mode (string, optional): copied into metadata.effective_config.training.mode (documentation hint)
    • use_qlora (bool): set to true for LoRA
    • [training.validation] (optional; some keys are promoted into hyperparameters)
      • enabled (bool, default true): surfaced in metadata.effective_config.training.validation.enabled
      • evaluation_strategy (string, default "steps"): forwarded into hyperparameters
      • eval_steps (int, default 0): forwarded
      • save_best_model_at_end (bool, default true): forwarded
      • metric_for_best_model (string, default "val.loss"): forwarded
      • greater_is_better (bool, default false): forwarded
  • [hyperparameters] (client reads selective keys)
    • Required/defaulted:
      • n_epochs (int, default 1)
    • Optional (forwarded if present):
      • batch_size, global_batch, per_device_batch, gradient_accumulation_steps, sequence_length, learning_rate, warmup_ratio, train_kind
    • Note: some legacy examples include world_size. The client does not forward world_size; prefer specifying per_device_batch and gradient_accumulation_steps explicitly.
    • [hyperparameters.parallelism] (dict): forwarded verbatim (e.g., use_deepspeed, deepspeed_stage, fsdp, bf16, fp16, tensor_parallel_size, pipeline_parallel_size)
  • [algorithm] (ignored by client): present in some examples for documentation; no effect on payload
Validation and error rules (client):
  • Missing dataset path -> prompt or error; dataset must exist and be valid JSONL
  • Missing gpu_type (backend rule) -> HTTP 400 at create job
  • Validation path missing -> warning; continues without validation
Payload mapping recap:
  • model from [job].model
  • training_type = "sft_offline"
  • hyperparameters from [hyperparameters] + [training.validation] select keys
  • metadata.effective_config.compute from [compute]
  • metadata.effective_config.data.topology from [data.topology]
  • metadata.effective_config.training.{mode,use_qlora} from [training]