Full Finetuning (FFT)

Run standard full-parameter supervised finetuning using the SFT workflow: set training.use_qlora = false (the default) and use typical FFT hyperparameters.
  • Invoke via: uvx synth-ai train --type sft --config <path>
  • Uses the same client and payload path as LoRA SFT; it differs only in the training mode/toggles and in the typical hyperparameters and parallelism settings

Quickstart

uvx synth-ai train --type sft --config examples/warming_up_to_rl/configs/crafter_fft.toml --dataset /abs/path/to/train.jsonl
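
The training file is newline-delimited JSON; the client checks each record for a chat-style messages structure before upload. A minimal sketch of one record (placeholder content; any fields beyond messages/role/content are not specified here):

{"messages": [{"role": "user", "content": "<prompt text>"}, {"role": "assistant", "content": "<target completion>"}]}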

Minimal TOML (FFT)

[job]
model = "Qwen/Qwen3-4B"
# data = "/abs/path/to/train.jsonl"  # or pass via --dataset

[compute]
gpu_type = "H100"       # required by backend
gpu_count = 4
nodes = 1

[data.topology]
container_count = 4      # optional; informational

[training]
mode = "full_finetune"  # documentation; backend infers
use_qlora = false

[training.validation]
enabled = true
evaluation_strategy = "steps"
eval_steps = 20
save_best_model_at_end = true
metric_for_best_model = "val.loss"
greater_is_better = false

[hyperparameters]
n_epochs = 2
sequence_length = 2048
per_device_batch = 2
gradient_accumulation_steps = 64
learning_rate = 8e-6
warmup_ratio = 0.03
train_kind = "fft"

[hyperparameters.parallelism]
use_deepspeed = true
deepspeed_stage = 3
fsdp = false
bf16 = true
fp16 = false

What the client validates and sends

  • Validates dataset path existence and JSONL records
  • Uploads files to /api/learning/files, then creates and starts the job under /api/learning/jobs
  • Payload mapping is identical to LoRA SFT: hyperparameters + metadata.effective_config (compute, data.topology, training)

Multi‑GPU guidance (FFT)

  • Use [compute] for cluster shape
  • Prefer [hyperparameters.parallelism] for DeepSpeed stage, FSDP, precision, and tensor/pipeline parallel sizes; these keys are forwarded verbatim (see the sketch below)
  • [data.topology] is optional and informational; backend/trainer validates actual resource consistency
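
For example, a hedged parallelism fragment that combines DeepSpeed with explicit tensor/pipeline parallel sizes; all of these keys are forwarded verbatim, and the values are placeholders rather than recommendations:

[hyperparameters.parallelism]
use_deepspeed = true
deepspeed_stage = 3
bf16 = true
fp16 = false
tensor_parallel_size = 2   # placeholder value
pipeline_parallel_size = 1 # placeholder value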

Common issues

  • HTTP 400 missing_gpu_type: add [compute].gpu_type
  • Dataset not found: specify an absolute path or pass --dataset (relative paths are resolved from the current working directory)

Helpful CLI flags

  • --examples N to subset data for a quick smoke test
  • --dry-run to preview payload before submitting
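
For example, a quick smoke test that subsets the data and previews the payload without submitting (paths as in the quickstart):

uvx synth-ai train --type sft \
  --config examples/warming_up_to_rl/configs/crafter_fft.toml \
  --dataset /abs/path/to/train.jsonl \
  --examples 50 --dry-run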

All sections and parameters (FFT)

  • [job] (client reads)
    • model (string, required): base model identifier
    • data or data_path (string): training JSONL (required unless --dataset provided)
  • [compute] (forwarded into metadata.effective_config.compute)
    • gpu_type (string): required by backend
    • gpu_count (int)
    • nodes (int, optional)
  • [data] / [data.topology]
    • topology (table): forwarded into metadata.effective_config.data.topology
    • validation_path (string, optional): if present and the file exists, it is uploaded to enable validation (see the fragment after this list)
  • [training]
    • mode (string, optional): copied to metadata for visibility
    • use_qlora (bool, default false)
    • [training.validation] keys promoted into hyperparameters:
      • enabled (bool, default true) -> surfaced into metadata.effective_config.training.validation.enabled
      • evaluation_strategy (string, default "steps")
      • eval_steps (int, default 0)
      • save_best_model_at_end (bool, default true)
      • metric_for_best_model (string, default "val.loss")
      • greater_is_better (bool, default false)
  • [hyperparameters]
    • n_epochs (int, default 1)
    • Optional: batch_size, global_batch, per_device_batch, gradient_accumulation_steps, sequence_length, learning_rate, warmup_ratio, train_kind
    • [hyperparameters.parallelism] forwarded verbatim: use_deepspeed, deepspeed_stage, fsdp, bf16, fp16, tensor_parallel_size, pipeline_parallel_size
  • [algorithm] (ignored by client): sometimes used in examples for documentation only
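
A hedged fragment showing the dataset keys from the list above, including an optional validation file (paths are placeholders):

[job]
model = "Qwen/Qwen3-4B"
data = "/abs/path/to/train.jsonl"       # or pass via --dataset

[data]
validation_path = "/abs/path/to/val.jsonl"  # uploaded if the file exists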

Validation and error rules:
  • Dataset path must exist; otherwise the CLI prompts/aborts
  • Dataset JSONL checked for messages structure
  • Backend requires compute.gpu_type; missing value yields HTTP 400 at create job

Payload mapping recap:
  • model from [job].model
  • training_type = "sft_offline"
  • hyperparameters from [hyperparameters] plus selected [training.validation] keys
  • metadata.effective_config.compute from [compute]
  • metadata.effective_config.data.topology from [data.topology]
  • metadata.effective_config.training.{mode,use_qlora} from [training]
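
Putting the recap together, an illustrative payload sketch built from the minimal TOML above; only the mappings listed in the recap come from this page, and the surrounding JSON shape is an assumption:

{
  "model": "Qwen/Qwen3-4B",
  "training_type": "sft_offline",
  "hyperparameters": {
    "n_epochs": 2,
    "sequence_length": 2048,
    "learning_rate": 8e-6,
    "eval_steps": 20
  },
  "metadata": {
    "effective_config": {
      "compute": {"gpu_type": "H100", "gpu_count": 4, "nodes": 1},
      "data": {"topology": {"container_count": 4}},
      "training": {"mode": "full_finetune", "use_qlora": false}
    }
  }
}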