Stable Audio 3 Trainer

audio trainer

fal-ai/stable-audio-3-trainer

fal's first audio trainer: teach Stable Audio 3 a sound, a style, a genre.

The catalog's first music and sound trainer. It fine-tunes a LoRA on one of three Stable Audio 3 checkpoints — medium-base for general work, small-music-base for music, small-sfx-base for sound effects — from a zip of audio clips with text captions. You get back a .safetensors LoRA plus a config JSON naming the compatible inference model; there is no public LoRA inference endpoint for it yet.

What goes in the zip

The dataset is a zip of audio files, each with a sibling .txt caption file sharing its basename: clip.wav next to clip.txt. Every clip needs a caption.
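Before zipping, it can help to check the pairing rule locally. The following is a minimal sketch, not part of the trainer: the function names and the list of audio extensions are assumptions for illustration, and the basename match is assumed to be exact.

```typescript
// Audio extensions checked here are an assumption; adjust to what your dataset uses.
const AUDIO_EXTENSIONS = [".wav", ".mp3", ".flac", ".ogg"];

// Split "dir/clip.wav" into { base: "dir/clip", ext: ".wav" }.
function splitExtension(path: string): { base: string; ext: string } {
  const slash = path.lastIndexOf("/");
  const dot = path.lastIndexOf(".");
  // No dot in the file name, or a dotfile such as ".hidden": treat as no extension.
  if (dot <= slash + 1) return { base: path, ext: "" };
  return { base: path.slice(0, dot), ext: path.slice(dot).toLowerCase() };
}

// Return every audio file that has no sibling .txt caption with the same basename.
function findUncaptionedClips(paths: string[]): string[] {
  const captions = new Set(
    paths
      .filter((p) => splitExtension(p).ext === ".txt")
      .map((p) => splitExtension(p).base),
  );
  return paths.filter((p) => {
    const { base, ext } = splitExtension(p);
    return AUDIO_EXTENSIONS.includes(ext) && !captions.has(base);
  });
}

// Hypothetical file listing: snare.wav has no caption next to it.
const missing = findUncaptionedClips([
  "kick.wav",
  "kick.txt",
  "snare.wav",
  "pads/warm.flac",
  "pads/warm.txt",
]);
console.log(missing);
```

An empty result means every clip in the listing has a caption beside it.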

Good starting point

number_of_steps: 1000
learning_rate: 0.0001

Parameters

Schema facts come straight from the fal API; the notes are ours.

Required

audio_data_url (string, required)

URL to a zip archive containing audio files and matching `.txt` captions. Each audio file must have a sibling caption file with the same basename, for example `clip.wav` and `clip.txt`.

Optional

model (enum, default: medium-base): medium-base | small-music-base | small-sfx-base

Which Stable Audio 3 checkpoint to fine-tune: medium-base for general audio, small-music-base for music, small-sfx-base for sound effects.

Tip: Match the checkpoint to the dataset: songs and loops on small-music-base, foley and effects on small-sfx-base.

Raw schema description

Stable Audio 3 base checkpoint to fine-tune.

number_of_steps (integer, default: 1000, range: 1 to 20000)

How many training iterations the model runs on your dataset. More steps means the LoRA sees your clips more times.

In the atelier: Practice repetitions. Too few and the painter never picks up the skill. Too many and he stops learning and starts memorizing your exact clips.

Tip: Around 1000 is a solid default for a first run. Small datasets need fewer steps, not more.

Watch out: If outputs start reproducing your training clips almost exactly, you have overtrained. Go back down.

Raw schema description

Number of LoRA training steps.

learning_rate (number, default: 1e-4, max: 0.01)

How big each learning update is. Controls how aggressively the model changes per step.

In the atelier: The painter's eagerness. A high rate is frantic practice: fast but sloppy, and it can wreck habits he already had. A low rate is careful practice: slow, but precise.

Tip: Stay near the trainer's default unless you have a reason. If outputs sound distorted or harsh, lower it. If the target sound barely comes through after many steps, raise it slightly or add steps.

Watch out: Learning rate and steps trade off against each other. Doubling both at once is how LoRAs get burned.

Raw schema description

AdamW learning rate for LoRA parameters.

rank (integer, default: 16, range: 1 to 256)

The size of the LoRA's internal matrices. Higher rank means more capacity and a bigger file.

In the atelier: How thick the bracelet is. A thin one stores one clean trick. A thick one can store more nuance but is heavier and easier to overfit.

Tip: 16 is plenty for most datasets. Go higher only for complex styles or multi-concept training.

Raw schema description

LoRA rank.
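To make "higher rank means a bigger file" concrete: plain LoRA on a weight matrix of shape d_out x d_in trains two small matrices, A (rank x d_in) and B (d_out x rank), so it adds rank * (d_in + d_out) parameters per adapted layer. The sketch below works that arithmetic through. The layer size is hypothetical (Stable Audio 3's actual layer shapes are not given here), and the DoRA-style adapter types add extra magnitude parameters that this count ignores.

```typescript
// Parameters added by plain LoRA to one dOut x dIn weight matrix:
// A is (rank x dIn) and B is (dOut x rank).
function loraParamCount(dIn: number, dOut: number, rank: number): number {
  return rank * (dIn + dOut);
}

// Hypothetical layer size, for illustration only.
const dIn = 1024;
const dOut = 1024;

for (const rank of [4, 16, 64]) {
  const params = loraParamCount(dIn, dOut, rank);
  const kib = (params * 4) / 1024; // assuming 4 bytes per fp32 parameter
  console.log(`rank ${rank}: ${params} params, ${kib} KiB per adapted layer`);
}
```

The count grows linearly with rank, so doubling the rank roughly doubles the adapter size.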

adapter_type (enum, default: dora-rows): lora | dora | dora-rows | dora-cols | bora | lora-xs | dora-rows-xs | dora-cols-xs | bora-xs

The adapter family to train. The default dora-rows is a DoRA variant; plain lora is the classic format.

Tip: Stay on the default unless your inference stack expects a specific adapter format.

Raw schema description

LoRA adapter family to train.

duration (number, range: 1 to 380)

Clip length in seconds used to crop or pad the training audio. Unset, it auto-detects from the longest clip in the dataset.

Tip: Leave it unset for a first run; the trainer caps it at the model's native length anyway.

Raw schema description

Clip duration in seconds for crop/pad sizing. Leave unset to auto-detect from the dataset (the longest clip). Always capped at the chosen model's native training length.
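The rule in the schema description (requested value if set, otherwise the longest clip, always capped at the native length) can be sketched as a small function. This is an illustration of that rule only, not the trainer's code: the function name is made up, and the native lengths used below are hypothetical because the real per-checkpoint values are not listed here.

```typescript
// Resolve the crop/pad length: use the requested duration if set,
// otherwise the longest clip, then cap at the model's native training length.
function resolveDuration(
  clipSeconds: number[],
  nativeSeconds: number,
  requested?: number,
): number {
  if (clipSeconds.length === 0) throw new Error("dataset has no clips");
  const base = requested ?? Math.max(...clipSeconds);
  return Math.min(base, nativeSeconds);
}

// Hypothetical dataset of three clips.
const clips = [8.5, 30, 47.2];

const autoDetected = resolveDuration(clips, 120); // unset: longest clip wins
const explicit = resolveDuration(clips, 120, 10); // explicit request is used
const capped = resolveDuration(clips, 20); // longest clip exceeds the native length
console.log(autoDetected, explicit, capped);
```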

batch_size (integer, default: 1, range: 1 to 8)

Training batch size.

seed (integer, default: 42, range: 0 to 2147483647)

Random seed. Same seed plus same inputs gives a nearly identical result.

Tip: Fix the seed when comparing LoRA scales or parameters, so the only thing changing is the thing you are testing.

Raw schema description

Random seed.
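One way to follow the fixed-seed tip is to generate comparison runs from a single baseline, changing one field at a time. The sketch below only builds the input objects and submits nothing; the `sweep` helper and the dataset URL are illustrative, not part of the fal client.

```typescript
interface TrainerInput {
  audio_data_url: string;
  number_of_steps: number;
  learning_rate: number;
  rank: number;
  seed: number;
}

type NumericKey = "number_of_steps" | "learning_rate" | "rank";

// Build runs that each differ from the baseline in exactly one field,
// keeping the seed and everything else fixed.
function sweep(baseline: TrainerInput, key: NumericKey, values: number[]): TrainerInput[] {
  return values.map((value) => {
    const run: TrainerInput = { ...baseline }; // copy so the baseline is never mutated
    run[key] = value;
    return run;
  });
}

// Placeholder URL and the page's suggested starting values.
const baseline: TrainerInput = {
  audio_data_url: "https://your-cdn.com/dataset.zip",
  number_of_steps: 1000,
  learning_rate: 0.0001,
  rank: 16,
  seed: 42,
};

const runs = sweep(baseline, "learning_rate", [0.00005, 0.0001, 0.0002]);
console.log(runs);
```

Each object in `runs` could then be passed as `input` to the trainer call, so any difference between the resulting LoRAs traces back to the learning rate alone.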

base_precision (enum, default: bf16): bf16 | bfloat16 | fp16 | float16

Precision for frozen base weights; LoRA params stay fp32.

include (list)

Only add LoRA to modules whose names contain these substrings.

exclude (list)

Skip modules whose names contain these substrings.
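A rough sketch of how substring filters like include and exclude could combine is shown below. Two behaviors are assumptions the schema does not confirm: that an empty or missing include list selects every module, and that exclude wins when a name matches both lists. The module names are hypothetical; the real Stable Audio 3 module names are not listed here.

```typescript
// Decide which modules get a LoRA adapter by substring matching on module names.
// Assumed: no include list means "all modules", and exclude takes precedence.
function selectModules(names: string[], include?: string[], exclude?: string[]): string[] {
  return names.filter((name) => {
    const included =
      !include || include.length === 0 || include.some((s) => name.includes(s));
    const excluded = !!exclude && exclude.some((s) => name.includes(s));
    return included && !excluded;
  });
}

// Hypothetical module names, for illustration only.
const modules = [
  "blocks.0.attn.to_q",
  "blocks.0.attn.to_out",
  "blocks.0.ff.linear1",
  "blocks.1.attn.to_q",
];

const selected = selectModules(modules, ["attn"], ["to_out"]);
console.log(selected);
```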

lora_checkpoint_url (string)

Optional `.safetensors` LoRA checkpoint URL to resume from.

pre_encode (boolean, default: false)

Pre-encode the audio archive to SAME latents before LoRA training.
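A client-side check before submitting can catch out-of-range values early. The sketch below assumes the bounds as read from the parameter listing above (steps 1 to 20000, rank 1 to 256, batch size 1 to 8, learning rate up to 0.01); treat them as assumptions and verify against the live schema. The lower bound on learning_rate and the `checkInput` helper itself are illustrative.

```typescript
interface TrainerInput {
  audio_data_url: string;
  model?: "medium-base" | "small-music-base" | "small-sfx-base";
  number_of_steps?: number;
  learning_rate?: number;
  rank?: number;
  batch_size?: number;
}

// Integer bounds as read from the parameter listing; verify against the live schema.
const LIMITS = {
  number_of_steps: { min: 1, max: 20000 },
  rank: { min: 1, max: 256 },
  batch_size: { min: 1, max: 8 },
} as const;

// Return a list of human-readable problems; an empty list means the input looks valid.
function checkInput(input: TrainerInput): string[] {
  const problems: string[] = [];
  if (!input.audio_data_url) problems.push("audio_data_url is required");
  for (const key of ["number_of_steps", "rank", "batch_size"] as const) {
    const value = input[key];
    if (value === undefined) continue; // unset fields fall back to server defaults
    const { min, max } = LIMITS[key];
    if (!Number.isInteger(value) || value < min || value > max) {
      problems.push(`${key} must be an integer from ${min} to ${max}`);
    }
  }
  const lr = input.learning_rate;
  // The "above 0" lower bound is an assumption; only the 0.01 cap appears in the listing.
  if (lr !== undefined && (lr <= 0 || lr > 0.01)) {
    problems.push("learning_rate must be above 0 and at most 0.01");
  }
  return problems;
}

const ok = checkInput({
  audio_data_url: "https://your-cdn.com/dataset.zip",
  number_of_steps: 1000,
  learning_rate: 0.0001,
});
const bad = checkInput({
  audio_data_url: "https://your-cdn.com/dataset.zip",
  number_of_steps: 50000,
  rank: 0,
  learning_rate: 0.1,
});
console.log(ok, bad);
```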

Call it

import { fal } from "@fal-ai/client";

const result = await fal.subscribe("fal-ai/stable-audio-3-trainer", {
  input: {
    "audio_data_url": "https://your-cdn.com/dataset.zip",
    "number_of_steps": 1000,
    "learning_rate": 0.0001
  },
  logs: true,
});
console.log(result.data);