LTX-2.3 22B Video to Video Trainer
video trainer · fal-ai/ltx23-v2v-trainer
Teach LTX-2.3 a video transformation.
A variant of the LTX-2.3 trainer aimed at video-to-video transformation and video-conditioned generation rather than plain generation. The key difference is first_frame_conditioning_p, which defaults to a low 0.1 and so favors transforming whole clips over animating a first frame. Dataset rules match the main trainer: all videos or all images, never mixed. The catalog has no video-to-video LTX inference endpoint, so plan inference separately.
What goes in the zip
At least 10 files, all videos (.mp4, .mov, .avi, .mkv) or all images (.png, .jpg, .jpeg), plus optional .txt captions, each named after the media file it describes. Do not mix media types.
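The zip rules above can be checked locally before uploading. The following is a minimal sketch, assuming you already have the list of file names inside the archive. checkDataset and splitName are hypothetical helpers, not part of any fal SDK, and the case-insensitive extension matching is an assumption rather than documented trainer behavior.

```javascript
// Extensions listed in the trainer's schema description.
const VIDEO_EXTS = [".mp4", ".mov", ".avi", ".mkv"];
const IMAGE_EXTS = [".png", ".jpg", ".jpeg"];

// Split "take01.mp4" into { stem: "take01", ext: ".mp4" }.
// Lowercasing the extension is an assumption, not documented trainer behavior.
function splitName(fileName) {
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return { stem: fileName, ext: "" };
  return { stem: fileName.slice(0, dot), ext: fileName.slice(dot).toLowerCase() };
}

// Hypothetical pre-upload check: returns a list of problems (empty means OK).
function checkDataset(fileNames) {
  const problems = [];
  const parts = fileNames.map(splitName);
  const videos = parts.filter((p) => VIDEO_EXTS.includes(p.ext));
  const images = parts.filter((p) => IMAGE_EXTS.includes(p.ext));
  const captions = parts.filter((p) => p.ext === ".txt");

  // Rule: only videos OR only images.
  if (videos.length > 0 && images.length > 0) {
    problems.push("mixed dataset: found both videos and images");
  }
  // Rule: at least 10 media files.
  const mediaCount = videos.length + images.length;
  if (mediaCount < 10) {
    problems.push(`only ${mediaCount} media files; the schema suggests at least 10`);
  }
  // Rule: every caption shares its name with a media file.
  const mediaStems = new Set([...videos, ...images].map((p) => p.stem));
  for (const c of captions) {
    if (!mediaStems.has(c.stem)) {
      problems.push(`caption ${c.stem}.txt has no matching media file`);
    }
  }
  return problems;
}
```

Calling checkDataset with ten .mp4 names and one matching .txt returns an empty list; adding a single .png to that list produces the mixed-dataset problem.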
Good starting point
number_of_steps: 2000
learning_rate: 0.0002
Parameters
Schema facts come straight from the fal API; the notes are ours.
Required
training_data_url · string · required
URL to a zip archive of your training videos or images, optionally with matching .txt caption files.
In the atelier: The album you hand the painter. It is the single biggest factor in what the LoRA becomes.
Tip: 15 to 30 sharp, varied clips or images beat 200 sloppy ones. Vary angle, lighting and background; keep the subject consistent.
Watch out: Duplicate or near-duplicate files push the LoRA toward memorizing instead of learning.
Raw schema description
URL to zip archive with videos or images. Try to use at least 10 files, although more is better. **Supported video formats:** .mp4, .mov, .avi, .mkv **Supported image formats:** .png, .jpg, .jpeg Note: The dataset must contain ONLY videos OR ONLY images - mixed datasets are not supported. The archive can also contain text files with captions. Each text file should have the same name as the media file it corresponds to.
Optional
rank · enum · default: 32 · options: 8 | 16 | 32 | 64 | 128
The size of the LoRA's internal matrices. Higher rank means more capacity and a bigger file.
In the atelier: How thick the bracelet is. A thin one stores one clean trick. A thick one can store more nuance but is heavier and easier to overfit.
Tip: 16 is plenty for most subjects. Go higher only for complex styles or multi-concept training.
Raw schema description
The rank of the LoRA adaptation. Higher values increase capacity but use more memory.
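To see why rank drives file size, recall that LoRA adapts a weight matrix of shape dOut x dIn with two low-rank factors of shapes dOut x rank and rank x dIn, so the trainable parameter count per adapted layer grows linearly with rank. The sketch below works that arithmetic through. The 4096 x 4096 layer is a hypothetical size chosen for illustration, not a documented LTX-2.3 dimension, and loraParamsPerLayer is an illustrative helper.

```javascript
// Trainable parameters LoRA adds to one adapted weight matrix of shape dOut x dIn:
// two low-rank factors, (dOut x rank) and (rank x dIn).
function loraParamsPerLayer(rank, dIn, dOut) {
  return rank * (dIn + dOut);
}

// Hypothetical 4096 x 4096 layer, for illustration only.
for (const rank of [8, 16, 32, 64, 128]) {
  console.log(`rank ${rank}: ${loraParamsPerLayer(rank, 4096, 4096)} parameters`);
}
```

Going from the default rank of 32 to 128 quadruples the per-layer parameter count, which is why the higher ranks cost more memory and produce larger files.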
number_of_steps · integer · default: 2000 · range: 100 – 20000
How many training iterations the model runs on your dataset. More steps means the LoRA sees your images more times.
In the atelier: Practice repetitions. Too few and the painter never picks up the skill. Too many and he stops learning and starts memorizing your exact photos.
Tip: Around 1000 is a solid starting point for a dataset of 15 to 30 images of one subject; this trainer's own default is 2000. Small datasets need fewer steps, not more.
Watch out: If outputs start reproducing your training photos almost exactly (same pose, same background), you overtrained. Go back down.
Raw schema description
The number of training steps.
learning_rate · number · default: 2e-4 · range: 0.000001 – 1
How big each learning update is. Controls how aggressively the model changes per step.
In the atelier: The painter's eagerness. A high rate is frantic practice: fast but sloppy, and it can wreck habits he already had. A low rate is careful practice: slow, but precise.
Tip: Stay near the trainer's default unless you have a reason. If results look fried or oversaturated, lower it. If the subject barely shows after many steps, raise it slightly or add steps.
Watch out: Learning rate and steps trade off against each other. Doubling both at once is how datasets get burned.
Raw schema description
Learning rate for optimization. Higher values can lead to faster training but may cause overfitting.
number_of_frames · integer · default: 89 · range: 9 – 121
How many frames of each training video are used per sample.
Raw schema description
Number of frames per training sample. Must satisfy frames % 8 == 1 (e.g., 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97).
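The frames % 8 == 1 rule is easy to get wrong when picking a clip length by hand. Here is a small sketch that validates a value and snaps an arbitrary count to the nearest allowed one. Both helpers are illustrative, not part of the fal client, and the 9 to 121 bounds are the ones listed on this page.

```javascript
// Bounds for number_of_frames as listed on this page.
const MIN_FRAMES = 9;
const MAX_FRAMES = 121;

// True when n is an integer in range that satisfies n % 8 === 1.
function isValidFrameCount(n) {
  return Number.isInteger(n) && n >= MIN_FRAMES && n <= MAX_FRAMES && n % 8 === 1;
}

// Snap an arbitrary frame count to the nearest valid value (ties round up),
// then clamp to the allowed range. Both bounds already satisfy n % 8 === 1.
function nearestValidFrameCount(n) {
  const snapped = Math.round((n - 1) / 8) * 8 + 1;
  return Math.min(MAX_FRAMES, Math.max(MIN_FRAMES, snapped));
}
```

For example, a 100-frame target snaps to 97, and anything above the range clamps to 121.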
frame_rate · integer · default: 25 · range: 8 – 60
Frame rate used when sampling training videos.
Raw schema description
Target frames per second for the video.
resolution · enum · default: medium · options: low | medium | high
Resolution used for training.
Tip: Higher costs more and trains slower. Match it to how you will actually generate.
Raw schema description
Resolution to use for training. Higher resolutions require more memory.
aspect_ratio · enum · default: 1:1 · options: 16:9 | 1:1 | 9:16
Aspect ratio of the training samples.
Raw schema description
Aspect ratio to use for training.
trigger_phrase · string
A unique word or phrase baked into your captions that activates the LoRA at inference time.
In the atelier: The skill's calling word. Say it in the prompt and the painter knows to use the bracelet.
Tip: Pick something that is not a real word, like TOK or OHWX, so it does not collide with anything the base model already knows.
Watch out: If you train with a trigger and forget it in your prompts later, the LoRA will seem weak or broken.
Raw schema description
A phrase that will trigger the LoRA style. Will be prepended to captions during training.
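As a rough illustration of what "prepended to captions" means, and of the guard you may want at inference time, consider the sketch below. Both functions are hypothetical helpers. The single-space separator and the case-insensitive match are assumptions, since the schema does not say how the trainer joins the phrase to the caption.

```javascript
// Mimics "prepended to captions". The single-space separator is an assumption;
// the trainer's exact joining rule is not documented on this page.
function withTrigger(caption, triggerPhrase) {
  if (!triggerPhrase) return caption;
  return `${triggerPhrase} ${caption}`.trim();
}

// Guard for inference prompts: does the prompt mention the trigger at all?
// Case-insensitive matching is a convenience choice for this sketch.
function promptHasTrigger(prompt, triggerPhrase) {
  return prompt.toLowerCase().includes(triggerPhrase.toLowerCase());
}
```

Running promptHasTrigger over your prompts before generation is a cheap way to catch the "LoRA seems weak" failure described above.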
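auto_scale_input · boolean · default: false
Automatically rescales training videos to the target frame count and frame rate. It has no effect on image datasets.
Tip: It is off by default. Turn it on unless you have already conformed every clip to the target frame count and frame rate deliberately.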
Raw schema description
If true, videos will be automatically scaled to the target frame count and fps. This option has no effect on image datasets.
split_input_into_scenes · boolean · default: true
If true, videos above a certain duration threshold will be split into scenes.
split_input_duration_threshold · number · default: 30 · range: 1 – 60
The duration threshold in seconds. If a video is longer than this, it will be split into scenes.
debug_dataset · boolean · default: false
When enabled, the trainer returns a downloadable archive of your preprocessed training data for manual inspection. Use this to verify that your videos, images, and captions were processed correctly before committing to a full training run.
first_frame_conditioning_p · number · default: 0.1 · range: 0 – 1
Probability of conditioning on the first frame. The low 0.1 default favors video-to-video transformation over first-frame animation.
Tip: Raise it only if you are repurposing this trainer for image-to-video behavior.
Raw schema description
Probability of conditioning on the first frame during training. Lower values work better for video-to-video transformation.
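To get a feel for what a given probability means over a whole run, the arithmetic can be sketched as below. It assumes the trainer makes one independent draw per training step, which is an assumption about its internals rather than something the schema states. expectedFirstFrameSteps is an illustrative helper.

```javascript
// Expected number of training steps that condition on the first frame,
// ASSUMING one independent draw per step with probability p.
function expectedFirstFrameSteps(numberOfSteps, p) {
  if (p < 0 || p > 1) throw new RangeError("p must be between 0 and 1");
  return numberOfSteps * p;
}

// With the defaults on this page (2000 steps, p = 0.1), roughly 200 steps
// would see first-frame conditioning and roughly 1800 would not.
for (const p of [0.1, 0.5, 0.9]) {
  console.log(`p=${p}: about ${Math.round(expectedFirstFrameSteps(2000, p))} of 2000 steps`);
}
```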
validation · list · default: []
Generates periodic sample outputs during training so you can watch progress.
In the atelier: Asking the painter to show you a quick study every few hours instead of waiting for the end.
Tip: Cheap insurance: lets you spot overfitting before the run finishes.
Raw schema description
A list of validation inputs with prompts and reference videos.
validation_negative_prompt · string · default: worst quality, inconsistent motion, blurry, jittery, distorted
A negative prompt to use for validation.
validation_number_of_frames · integer · default: 89 · range: 9 – 121
The number of frames in validation videos.
validation_frame_rate · integer · default: 25 · range: 8 – 60
Target frames per second for validation videos.
validation_resolution · enum · default: high · options: low | medium | high
The resolution to use for validation.
validation_aspect_ratio · enum · default: 1:1 · options: 16:9 | 1:1 | 9:16
The aspect ratio to use for validation.
stg_scale · number · default: 1 · range: 0 – 3
STG (Spatio-Temporal Guidance) scale. 0.0 disables STG. Recommended value is 1.0.
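Since a rejected job costs a round trip, it can help to check an input object against the ranges and options listed above before submitting. The following is a minimal sketch of such a pre-flight check. checkInput is a hypothetical helper, not part of @fal-ai/client, and the tables are copied from this page rather than fetched from the live schema. Enum values are compared as strings because the page does not say whether rank is sent as a number or a string.

```javascript
// Ranges copied from the parameter list on this page.
const NUMERIC_RANGES = {
  number_of_steps: [100, 20000],
  learning_rate: [0.000001, 1],
  number_of_frames: [9, 121],
  frame_rate: [8, 60],
  split_input_duration_threshold: [1, 60],
  first_frame_conditioning_p: [0, 1],
  validation_number_of_frames: [9, 121],
  validation_frame_rate: [8, 60],
  stg_scale: [0, 3],
};
// Enum options copied from this page, stored as strings.
const ENUM_OPTIONS = {
  rank: ["8", "16", "32", "64", "128"],
  resolution: ["low", "medium", "high"],
  aspect_ratio: ["16:9", "1:1", "9:16"],
  validation_resolution: ["low", "medium", "high"],
  validation_aspect_ratio: ["16:9", "1:1", "9:16"],
};

// Hypothetical pre-flight check: returns a list of problems (empty means OK).
function checkInput(input) {
  const problems = [];
  if (typeof input.training_data_url !== "string" || input.training_data_url === "") {
    problems.push("training_data_url is required");
  }
  for (const [name, [min, max]] of Object.entries(NUMERIC_RANGES)) {
    const value = input[name];
    if (value === undefined) continue; // optional: the trainer applies its default
    if (typeof value !== "number" || value < min || value > max) {
      problems.push(`${name} must be a number in [${min}, ${max}]`);
    }
  }
  for (const [name, options] of Object.entries(ENUM_OPTIONS)) {
    const value = input[name];
    // String() lets both 32 and "32" pass, since the wire type is not documented here.
    if (value !== undefined && !options.includes(String(value))) {
      problems.push(`${name} must be one of ${options.join(", ")}`);
    }
  }
  return problems;
}
```

This check does not cover the frames % 8 == 1 rule or the contents of the zip; it only guards the ranges and options shown in the list.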
Call it
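import { fal } from "@fal-ai/client";
// Submit the training job and wait for it to finish.
const result = await fal.subscribe("fal-ai/ltx23-v2v-trainer", {
  input: {
    training_data_url: "https://your-cdn.com/dataset.zip", // zip of videos or images
    number_of_steps: 2000,
    learning_rate: 0.0002,
    trigger_phrase: "TOK",
  },
  logs: true, // request training logs along with status updates
});
// The trainer's output payload.
console.log(result.data);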