Stable Audio 3 Trainer
audio trainer
fal-ai/stable-audio-3-trainer
fal's first audio trainer: teach Stable Audio 3 a sound, a style, a genre.
The catalog's first music and sound trainer. It fine-tunes a LoRA on one of three Stable Audio 3 checkpoints — medium-base for general work, small-music-base for music, small-sfx-base for sound effects — from a zip of audio clips with text captions. You get back a .safetensors LoRA plus a config JSON naming the compatible inference model; there is no public LoRA inference endpoint for it yet.
What goes in the zip
Zip of audio files, each with a sibling caption file sharing its basename: clip.wav next to clip.txt. Every clip needs a caption.
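Before zipping, it can help to check that no clip is missing its caption. The following is a minimal sketch of that pairing rule, not part of the fal client: `findMissingCaptions` is a hypothetical helper, and the list of audio extensions is an assumption, since the page only shows `.wav`.

```typescript
// Hypothetical pre-upload check; not part of the fal client.
// Assumption: the set of audio extensions. The page only shows .wav.
const AUDIO_EXTENSIONS = [".wav", ".mp3", ".flac"];

// Split "dir/clip.wav" into base "dir/clip" and lowercased extension ".wav".
function splitName(path: string): { base: string; ext: string } {
  const slash = path.lastIndexOf("/");
  const dot = path.lastIndexOf(".");
  // No extension if there is no dot, or the dot starts the file name.
  if (dot <= slash + 1) return { base: path, ext: "" };
  return { base: path.slice(0, dot), ext: path.slice(dot).toLowerCase() };
}

// Return every audio file that has no sibling .txt with the same basename.
function findMissingCaptions(paths: string[]): string[] {
  const captionBases = new Set(
    paths
      .map(splitName)
      .filter((p) => p.ext === ".txt")
      .map((p) => p.base),
  );
  return paths.filter((path) => {
    const { base, ext } = splitName(path);
    return AUDIO_EXTENSIONS.includes(ext) && !captionBases.has(base);
  });
}

const missing = findMissingCaptions(["clip.wav", "clip.txt", "rain.wav"]);
console.log(missing); // ["rain.wav"]
```

An empty result means every clip in the listing has a caption next to it.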
Good starting point
number_of_steps: 1000
learning_rate: 0.0001
Parameters
Schema facts come straight from the fal API; the notes are ours.
Required
audio_data_url (string, required)
URL to a zip archive containing audio files and matching `.txt` captions. Each audio file must have a sibling caption file with the same basename, for example `clip.wav` and `clip.txt`.
Optional
model (enum, default: medium-base)
Options: medium-base | small-music-base | small-sfx-base
Which Stable Audio 3 checkpoint to fine-tune: medium-base for general audio, small-music-base for music, small-sfx-base for sound effects.
Tip: Match the checkpoint to the dataset: songs and loops on small-music-base, foley and effects on small-sfx-base.
Raw schema description
Stable Audio 3 base checkpoint to fine-tune.
number_of_steps (integer, default: 1000, range: 1 – 20000)
How many training iterations the model runs on your dataset. More steps means the LoRA sees your clips more times.
In the atelier: Practice repetitions. Too few and the painter never picks up the skill. Too many and he stops learning and starts memorizing your exact clips.
Tip: Around 1000 is a solid default for a 15 to 30 clip dataset. Small datasets need fewer steps, not more.
Watch out: If outputs start reproducing your training clips almost exactly, you overtrained. Go back down.
Raw schema description
Number of LoRA training steps.
learning_rate (number, default: 1e-4, range: … – 0.01)
How big each learning update is. Controls how aggressively the model changes per step.
In the atelier: The painter's eagerness. A high rate is frantic practice: fast but sloppy, and it can wreck habits he already had. A low rate is careful practice: slow, but precise.
Tip: Stay near the trainer's default unless you have a reason. If results sound fried or distorted, lower it. If the trained sound barely comes through after many steps, raise it slightly or add steps.
Watch out: Learning rate and steps trade off against each other. Doubling both at once is how datasets get burned.
Raw schema description
AdamW learning rate for LoRA parameters.
rank (integer, default: 16, range: 1 – 256)
The size of the LoRA's internal matrices. Higher rank means more capacity and a bigger file.
In the atelier: How thick the bracelet is. A thin one stores one clean trick. A thick one can store more nuance but is heavier and easier to overfit.
Tip: 16 is plenty for most subjects. Go higher only for complex styles or multi-concept training.
Raw schema description
LoRA rank.
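To see why rank drives file size, here is a rough sketch of the parameter count for plain LoRA, where each adapted weight matrix of shape d_out × d_in gets two low-rank factors totalling rank × (d_in + d_out) trainable values. The 1024 × 1024 layer is a hypothetical example, not a Stable Audio 3 dimension, and the default dora-rows and the other adapter variants store additional or differently shaped parameters, so treat this as a guide to the trend only.

```typescript
// Trainable parameters that plain LoRA adds to one weight matrix:
// A is (rank x dIn), B is (dOut x rank), so the total is rank * (dIn + dOut).
function loraParamCount(dIn: number, dOut: number, rank: number): number {
  return rank * (dIn + dOut);
}

// Hypothetical 1024 x 1024 layer; real layer sizes depend on the checkpoint.
console.log(loraParamCount(1024, 1024, 16)); // 32768
console.log(loraParamCount(1024, 1024, 64)); // 131072
```

The count grows linearly with rank, so quadrupling the rank roughly quadruples the adapter size for the same set of layers.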
adapter_type (enum, default: dora-rows)
Options: lora | dora | dora-rows | dora-cols | bora | lora-xs | dora-rows-xs | dora-cols-xs | bora-xs
The adapter family to train. The default dora-rows is a DoRA variant; plain lora is the classic format.
Tip: Stay on the default unless your inference stack expects a specific adapter format.
Raw schema description
LoRA adapter family to train.
duration (number, range: 1 – 380)
Clip length in seconds used to crop or pad the training audio. Unset, it auto-detects from the longest clip in the dataset.
Tip: Leave it unset for a first run; the trainer caps it at the model's native length anyway.
Raw schema description
Clip duration in seconds for crop/pad sizing. Leave unset to auto-detect from the dataset (the longest clip). Always capped at the chosen model's native training length.
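The duration rule described above can be sketched as follows. This is an illustration of the stated behavior, not the trainer's actual code: `resolveDuration` and `cropOrPad` are hypothetical names, `nativeSeconds` stands in for the chosen model's native training length (which this page does not list), and cropping from the start and padding with silence at the end are assumptions.

```typescript
// Pick the clip length: the requested value, or the longest clip if unset,
// always capped at the model's native training length.
function resolveDuration(
  clipSeconds: number[],
  nativeSeconds: number, // assumption: supplied by the caller, not listed on this page
  requested?: number,
): number {
  if (clipSeconds.length === 0) throw new Error("dataset is empty");
  const target = requested ?? Math.max(...clipSeconds);
  return Math.min(target, nativeSeconds);
}

// Assumed crop/pad policy: keep the start of long clips, zero-pad short ones.
function cropOrPad(samples: number[], targetLength: number): number[] {
  if (samples.length >= targetLength) return samples.slice(0, targetLength);
  const padding = new Array<number>(targetLength - samples.length).fill(0);
  return samples.concat(padding);
}

console.log(resolveDuration([12, 47.5, 30], 120)); // 47.5 (longest clip)
console.log(resolveDuration([12, 47.5, 30], 30)); // 30 (capped at native length)
console.log(resolveDuration([12, 47.5, 30], 120, 10)); // 10 (explicit value)
console.log(cropOrPad([1, 2, 3], 5)); // [1, 2, 3, 0, 0]
```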
batch_size (integer, default: 1, range: 1 – 8)
Training batch size.
seed (integer, default: 42, range: 0 – 2147483647)
Random seed. Same seed plus same inputs gives a nearly identical result.
Tip: Fix the seed when comparing LoRA scales or parameters, so the only thing changing is the thing you are testing.
Raw schema description
Random seed.
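One way to apply the fixed-seed tip is to build every run's input from a single base object and change only the parameter under test. The sketch below does this for rank; `RunInput` and `varyRank` are hypothetical names, and the dataset URL is the placeholder used elsewhere on this page.

```typescript
// Subset of the trainer input used for this comparison (field names from the schema).
interface RunInput {
  audio_data_url: string;
  seed: number;
  number_of_steps: number;
  learning_rate: number;
  rank: number;
}

// Produce one input per rank; everything else, including the seed, is copied unchanged.
function varyRank(base: RunInput, ranks: number[]): RunInput[] {
  return ranks.map((rank) => ({ ...base, rank }));
}

const baseRun: RunInput = {
  audio_data_url: "https://your-cdn.com/dataset.zip",
  seed: 42,
  number_of_steps: 1000,
  learning_rate: 0.0001,
  rank: 16,
};

const runs = varyRank(baseRun, [8, 16, 32]);
console.log(runs.map((r) => `rank=${r.rank} seed=${r.seed}`));
```

Each element of `runs` can then be passed as the `input` of a separate training call.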
base_precision (enum, default: bf16)
Options: bf16 | bfloat16 | fp16 | float16
Precision for frozen base weights; LoRA params stay fp32.
include (list)
Only add LoRA to modules whose names contain these substrings.
exclude (list)
Skip modules whose names contain these substrings.
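The substring matching described for include and exclude can be pictured with the sketch below. It is an assumption-laden illustration, not the trainer's code: the module names are made up, an empty include list is assumed to mean "all modules", and exclude is assumed to win when both lists match.

```typescript
// Hypothetical model of include/exclude filtering by substring.
function selectModules(
  moduleNames: string[],
  include: string[] = [],
  exclude: string[] = [],
): string[] {
  return moduleNames.filter((moduleName) => {
    // Assumption: no include list means every module is a candidate.
    const included =
      include.length === 0 || include.some((s) => moduleName.includes(s));
    // Assumption: a matching exclude entry overrides include.
    const excluded = exclude.some((s) => moduleName.includes(s));
    return included && !excluded;
  });
}

// Made-up module names for illustration only.
const moduleNames = [
  "blocks.0.attn.to_q",
  "blocks.0.attn.to_out",
  "blocks.0.ff.linear1",
];
const selected = selectModules(moduleNames, ["attn"], ["to_out"]);
console.log(selected); // ["blocks.0.attn.to_q"]
```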
lora_checkpoint_url (string)
Optional `.safetensors` LoRA checkpoint URL to resume from.
pre_encode (boolean, default: false)
Pre-encode the audio archive to SAME latents before LoRA training.
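A client-side range check can catch out-of-bounds values before a job is submitted. The sketch below is a hypothetical helper, not part of the fal client, and it uses the numeric bounds as read from the parameter listing above. Only the upper bound of learning_rate is checked, because the listing does not show its lower bound.

```typescript
// Numeric bounds as read from the parameter listing on this page.
const RANGES = {
  number_of_steps: { min: 1, max: 20000, integer: true },
  rank: { min: 1, max: 256, integer: true },
  duration: { min: 1, max: 380, integer: false },
  batch_size: { min: 1, max: 8, integer: true },
  seed: { min: 0, max: 2147483647, integer: true },
} as const;

type RangedKey = keyof typeof RANGES;
type RangedInput = Partial<Record<RangedKey, number>> & { learning_rate?: number };

// Return a list of human-readable problems; an empty list means all checks passed.
function checkRanges(input: RangedInput): string[] {
  const problems: string[] = [];
  for (const key of Object.keys(RANGES) as RangedKey[]) {
    const value = input[key];
    if (value === undefined) continue; // unset fields fall back to server defaults
    const { min, max, integer } = RANGES[key];
    if (integer && !Number.isInteger(value)) {
      problems.push(`${key} must be an integer`);
    }
    if (value < min || value > max) {
      problems.push(`${key} must be between ${min} and ${max}`);
    }
  }
  // The listing elides the lower bound for learning_rate, so only the maximum is checked.
  if (input.learning_rate !== undefined && input.learning_rate > 0.01) {
    problems.push("learning_rate must be at most 0.01");
  }
  return problems;
}

console.log(checkRanges({ number_of_steps: 1000, learning_rate: 0.0001 })); // []
console.log(checkRanges({ rank: 512 })); // ["rank must be between 1 and 256"]
```

The server remains the authority on what it accepts; a check like this only shortens the feedback loop.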
Call it
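import { fal } from "@fal-ai/client";

// Assumes fal credentials are already configured for the client.
// subscribe() submits the job to the queue and resolves when training finishes.
const result = await fal.subscribe("fal-ai/stable-audio-3-trainer", {
  input: {
    audio_data_url: "https://your-cdn.com/dataset.zip", // zip of clips plus captions
    number_of_steps: 1000,
    learning_rate: 0.0001,
  },
  logs: true, // include training logs in queue status updates
});

// The trainer's output: the LoRA file and its config JSON.
console.log(result.data);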