LoRA SFT
Train larger models on constrained hardware by enabling LoRA while keeping the same SFT flow and payload schema.

- Works via the standard SFT CLI: `uvx synth-ai train --type sft --config <path>`
- Toggle with `training.use_qlora = true` in your TOML
- Uses the same hyperparameter keys as FFT; the backend interprets LoRA-appropriate settings
Quickstart
Minimal TOML (LoRA enabled)
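A minimal sketch, assembled only from keys documented on this page; the model name, dataset path, and GPU values are placeholders rather than defaults:

```toml
[job]
model = "Qwen/Qwen2.5-7B-Instruct"  # placeholder base model identifier
data = "data/train.jsonl"           # resolved relative to the current working directory

[compute]
gpu_type = "H100"                   # required by the backend
gpu_count = 1

[training]
use_qlora = true                    # enables LoRA

[hyperparameters]
n_epochs = 1
```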
What the client validates and sends
- Validates dataset path existence (from `[job].data` or `--dataset`) and JSONL shape
- Uploads training (and optional validation) files to `/api/learning/files`
- Builds payload with:
  - `model` from `[job].model`
  - `training_type = "sft_offline"`
  - `hyperparameters` from `[hyperparameters]` (+ `[training.validation]` knobs)
  - `metadata.effective_config.compute` from `[compute]`
  - `metadata.effective_config.data.topology` from `[data.topology]`
  - `metadata.effective_config.training.{mode,use_qlora}` from `[training]`
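To make the mapping concrete, here is a sketch of the resulting payload, written in TOML notation for readability (the actual request body is JSON); the values simply mirror the placeholder quickstart config above:

```toml
model = "Qwen/Qwen2.5-7B-Instruct"
training_type = "sft_offline"

[hyperparameters]
n_epochs = 1

[metadata.effective_config.compute]
gpu_type = "H100"
gpu_count = 1

[metadata.effective_config.training]
use_qlora = true
```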
Multi‑GPU guidance
- Set `[compute].gpu_type`, `gpu_count`, and optionally `nodes`
- Use `[hyperparameters.parallelism]` for deepspeed/FSDP/precision/TP/PP knobs; forwarded verbatim
- Optionally add `[data.topology]` (e.g., `container_count`) for visibility; backend validates resource consistency
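A multi-GPU sketch, assuming two H100s on a single node; the parallelism and topology values are illustrative, not recommendations:

```toml
[compute]
gpu_type = "H100"
gpu_count = 2
nodes = 1

[hyperparameters.parallelism]
# forwarded verbatim to the backend
use_deepspeed = true
deepspeed_stage = 2
bf16 = true

[data.topology]
# optional; the backend checks consistency with [compute]
container_count = 2
```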
Common issues
- HTTP 400 `missing_gpu_type`: set `[compute].gpu_type` (and typically `gpu_count`) so it appears under `metadata.effective_config.compute`
- Dataset not found: provide an absolute path or use `--dataset`; the client resolves relative paths from the current working directory
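For the `missing_gpu_type` case, a minimal fix is a `[compute]` block like the following (the values shown are placeholders):

```toml
[compute]
gpu_type = "H100"  # any supported type, e.g. "A10G"
gpu_count = 1
```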
Helpful CLI flags
- `--dataset` to override `[job].data`
- `--examples N` to subset the first N rows for quick smoke tests
- `--dry-run` to preview the payload without submitting
All sections and parameters (LoRA SFT)
The client recognizes and/or forwards the following sections (a consolidated example follows the list):

- `[job]` (client reads)
  - `model` (string, required): base model identifier
  - `data` or `data_path` (string): local path to training JSONL. Required unless overridden by `--dataset`
  - Notes:
    - Paths are resolved relative to the current working directory (CWD), not the TOML location
    - Legacy scripts may also mention `poll_seconds` (legacy-only; the new CLI uses `--poll-*` flags)
- `[compute]` (forwarded into `metadata.effective_config.compute`)
  - `gpu_type` (string): required by backend (e.g., "H100", "A10G"). Missing this often causes HTTP 400
  - `gpu_count` (int): number of GPUs
  - `nodes` (int, optional)
- `[data]` (partially read)
  - `topology` (table/dict): forwarded as-is to `metadata.effective_config.data.topology`
  - `validation_path` (string, optional): if present and the file exists, the client uploads it and wires validation
  - Path resolution: relative to the current working directory (CWD)
- `[training]` (partially read)
  - `mode` (string, optional): copied into `metadata.effective_config.training.mode` (documentation hint)
  - `use_qlora` (bool): set to `true` for LoRA
  - `[training.validation]` (optional; some keys are promoted into hyperparameters)
    - `enabled` (bool, default true): surfaced in `metadata.effective_config.training.validation.enabled`
    - `evaluation_strategy` (string, default "steps"): forwarded into hyperparameters
    - `eval_steps` (int, default 0): forwarded
    - `save_best_model_at_end` (bool, default true): forwarded
    - `metric_for_best_model` (string, default "val.loss"): forwarded
    - `greater_is_better` (bool, default false): forwarded
- `[hyperparameters]` (client reads selective keys)
  - Required/defaulted: `n_epochs` (int, default 1)
  - Optional (forwarded if present): `batch_size`, `global_batch`, `per_device_batch`, `gradient_accumulation_steps`, `sequence_length`, `learning_rate`, `warmup_ratio`, `train_kind`
  - Note: some legacy examples include `world_size`. The client does not forward `world_size`; prefer specifying `per_device_batch` and `gradient_accumulation_steps` explicitly.
  - `[hyperparameters.parallelism]` (dict): forwarded verbatim (e.g., `use_deepspeed`, `deepspeed_stage`, `fsdp`, `bf16`, `fp16`, `tensor_parallel_size`, `pipeline_parallel_size`)
- `[algorithm]` (ignored by client): present in some examples for documentation; no effect on payload
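As referenced above, here is a consolidated sketch combining the sections in this reference; every value is a placeholder chosen for illustration, and only keys documented on this page appear:

```toml
[job]
model = "Qwen/Qwen2.5-7B-Instruct"  # placeholder base model
data = "data/train.jsonl"           # resolved relative to the CWD

[compute]
gpu_type = "H100"
gpu_count = 2
nodes = 1

[data]
validation_path = "data/val.jsonl"  # optional; uploaded if the file exists

[data.topology]
container_count = 2                 # optional; forwarded as-is

[training]
mode = "lora"                       # documentation hint only
use_qlora = true

[training.validation]
enabled = true
evaluation_strategy = "steps"
eval_steps = 100
save_best_model_at_end = true
metric_for_best_model = "val.loss"
greater_is_better = false

[hyperparameters]
n_epochs = 1
per_device_batch = 1
gradient_accumulation_steps = 8
sequence_length = 2048
learning_rate = 2e-4
warmup_ratio = 0.03

[hyperparameters.parallelism]
use_deepspeed = true
deepspeed_stage = 2
bf16 = true
```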
Validation and errors

- Missing dataset path -> prompt or error; dataset must exist and be valid JSONL
- Missing `gpu_type` (backend rule) -> HTTP 400 at create job
- Validation path missing -> warning; continues without validation
Payload mapping

- `model` from `[job].model`
- `training_type = "sft_offline"`
- `hyperparameters` from `[hyperparameters]` + `[training.validation]` select keys
- `metadata.effective_config.compute` from `[compute]`
- `metadata.effective_config.data.topology` from `[data.topology]`
- `metadata.effective_config.training.{mode,use_qlora}` from `[training]`