LTX-2.3 22B Video to Video Trainer

video trainer

fal-ai/ltx23-v2v-trainer

Teach LTX-2.3 a video transformation.

A variant of the LTX-2.3 trainer aimed at video-to-video transformation and video-conditioned generation rather than plain generation. The key difference is first_frame_conditioning_p, defaulting low at 0.1, which favors transforming whole clips over animating a first frame. Dataset rules match the main trainer: all videos or all images, never mixed. The catalog has no video-to-video LTX inference endpoint, so plan inference separately.


What goes in the zip

At least 10 files, all videos (.mp4, .mov, .avi, .mkv) or all images (.png, .jpg, .jpeg). Captions are optional: add a .txt file with the same name as the media file it describes. Do not mix media types.
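The rules above can be checked locally before the archive is uploaded. The following is a minimal sketch, not part of the fal client: `checkDataset` and its report shape are hypothetical names, and it assumes extensions are matched case-insensitively, which the schema does not state.

```typescript
// Extensions copied from the trainer's schema description.
const VIDEO_EXTS = [".mp4", ".mov", ".avi", ".mkv"];
const IMAGE_EXTS = [".png", ".jpg", ".jpeg"];

type DatasetReport = {
  ok: boolean;
  kind: "video" | "image" | "none";
  problems: string[];
};

// Lower-cased extension including the dot, or "" when there is none.
// Assumption: the trainer treats extensions case-insensitively.
function extOf(name: string): string {
  const dot = name.lastIndexOf(".");
  return dot < 0 ? "" : name.slice(dot).toLowerCase();
}

// File name without its extension.
function stemOf(name: string): string {
  const dot = name.lastIndexOf(".");
  return dot < 0 ? name : name.slice(0, dot);
}

// Hypothetical helper: checks a flat list of file names before zipping.
function checkDataset(fileNames: string[]): DatasetReport {
  const videos = fileNames.filter((f) => VIDEO_EXTS.includes(extOf(f)));
  const images = fileNames.filter((f) => IMAGE_EXTS.includes(extOf(f)));
  const captions = fileNames.filter((f) => extOf(f) === ".txt");
  const problems: string[] = [];

  // Only videos OR only images are supported.
  if (videos.length > 0 && images.length > 0) {
    problems.push("mixed dataset: use only videos or only images");
  }

  // The schema asks for at least 10 files.
  const media = videos.length > 0 ? videos : images;
  if (media.length < 10) {
    problems.push(`only ${media.length} media files; aim for at least 10`);
  }

  // Every caption must share its name with a media file.
  const stems = new Set(media.map(stemOf));
  for (const caption of captions) {
    if (!stems.has(stemOf(caption))) {
      problems.push(`caption ${caption} has no matching media file`);
    }
  }

  const kind = videos.length > 0 ? "video" : images.length > 0 ? "image" : "none";
  return { ok: problems.length === 0, kind, problems };
}
```

Calling `checkDataset` on the list of names you are about to zip returns every problem at once, so a bad archive is caught before any training time is spent.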

Good starting point

number_of_steps: 2000
learning_rate: 0.0002

Parameters

Schema facts come straight from the fal API; the notes are ours.

Required

training_data_url · string · required

URL to a zip archive of your training videos or images, optionally with matching .txt caption files.

In the atelier: The album you hand the painter. It is the single biggest factor in what the LoRA becomes.

Tip: 15 to 30 sharp, varied clips or images beat 200 sloppy ones. Vary angle, lighting and background; keep the subject consistent.

Watch out: Duplicate or near-duplicate files push the LoRA toward memorizing instead of learning.

Raw schema description

URL to zip archive with videos or images. Try to use at least 10 files, although more is better. **Supported video formats:** .mp4, .mov, .avi, .mkv **Supported image formats:** .png, .jpg, .jpeg Note: The dataset must contain ONLY videos OR ONLY images - mixed datasets are not supported. The archive can also contain text files with captions. Each text file should have the same name as the media file it corresponds to.

Optional

rank · enum · default: 32 · options: 8 | 16 | 32 | 64 | 128

The size of the LoRA's internal matrices. Higher rank means more capacity and a bigger file.

In the atelier: How thick the bracelet is. A thin one stores one clean trick. A thick one can store more nuance but is heavier and easier to overfit.

Tip: 16 is plenty for most subjects. Go higher only for complex styles or multi-concept training.

Raw schema description

The rank of the LoRA adaptation. Higher values increase capacity but use more memory.

number_of_steps · integer · default: 2000 · range: 100 – 20000

How many training iterations the model runs on your dataset. More steps means the LoRA sees your training samples more times.

In the atelier: Practice repetitions. Too few and the painter never picks up the skill. Too many and he stops learning and starts memorizing your exact photos.

Tip: Start from the trainer's default of 2000 and adjust from there. Small datasets need fewer steps, not more.

Watch out: If outputs start reproducing your training clips almost exactly (same pose, same background), you overtrained. Go back down.

Raw schema description

The number of training steps.

learning_rate · number · default: 2e-4 · range: 0.000001 – 1

How big each learning update is. Controls how aggressively the model changes per step.

In the atelier: The painter's eagerness. A high rate is frantic practice: fast but sloppy, and it can wreck habits he already had. A low rate is careful practice: slow, but precise.

Tip: Stay near the trainer's default unless you have a reason. If results look fried or oversaturated, lower it. If the subject barely shows after many steps, raise it slightly or add steps.

Watch out: Learning rate and steps trade off against each other. Doubling both at once is how datasets get burned.

Raw schema description

Learning rate for optimization. Higher values can lead to faster training but may cause overfitting.

number_of_frames · integer · default: 89 · range: 9 – 121

How many frames of each training video are used per sample.

Raw schema description

Number of frames per training sample. Must satisfy frames % 8 == 1 (e.g., 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97).
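The `frames % 8 == 1` rule is easy to get wrong by one. Here is a small sketch that checks a frame count against the documented range and snaps an arbitrary request to the nearest valid value. `isValidFrameCount`, `snapFrameCount` and `clipSeconds` are hypothetical helper names, not part of the fal API.

```typescript
// Range and rule taken from the schema: 9 to 121 frames, frames % 8 == 1.
const MIN_FRAMES = 9;
const MAX_FRAMES = 121;

// True when the count is an integer in range that satisfies frames % 8 == 1.
function isValidFrameCount(frames: number): boolean {
  return (
    Number.isInteger(frames) &&
    frames >= MIN_FRAMES &&
    frames <= MAX_FRAMES &&
    frames % 8 === 1
  );
}

// Hypothetical helper: round to the nearest 8k + 1, then clamp to the range.
function snapFrameCount(requested: number): number {
  const snapped = Math.round((requested - 1) / 8) * 8 + 1;
  return Math.min(MAX_FRAMES, Math.max(MIN_FRAMES, snapped));
}

// Length in seconds of one training sample at a given frame rate.
function clipSeconds(frames: number, frameRate: number): number {
  return frames / frameRate;
}
```

With the defaults of 89 frames at 25 fps, `clipSeconds(89, 25)` works out to roughly 3.6 seconds per training sample, which is a useful sanity check against the length of your source clips.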

frame_rate · integer · default: 25 · range: 8 – 60

Frame rate used when sampling training videos.

Raw schema description

Target frames per second for the video.

resolution · enum · default: medium · options: low | medium | high

Resolution used for training.

Tip: Higher costs more and trains slower. Match it to how you will actually generate.

Raw schema description

Resolution to use for training. Higher resolutions require more memory.

aspect_ratio · enum · default: 1:1 · options: 16:9 | 1:1 | 9:16

Aspect ratio of the training samples.

Raw schema description

Aspect ratio to use for training.

trigger_phrase · string

A unique word or phrase baked into your captions that activates the LoRA at inference time.

In the atelier: The skill's calling word. Say it in the prompt and the painter knows to use the bracelet.

Tip: Pick something that is not a real word, like TOK or OHWX, so it does not collide with anything the base model already knows.

Watch out: If you train with a trigger and forget it in your prompts later, the LoRA will seem weak or broken.

Raw schema description

A phrase that will trigger the LoRA style. Will be prepended to captions during training.

auto_scale_input · boolean · default: false

Automatically scales training videos to the target frame count and frame rate. It has no effect on image datasets.

Tip: It is off by default. Turn it on unless you have deliberately cut every clip to the target frame count and fps yourself.

Raw schema description

If true, videos will be automatically scaled to the target frame count and fps. This option has no effect on image datasets.

split_input_into_scenes · boolean · default: true

If true, videos longer than split_input_duration_threshold will be split into scenes.

split_input_duration_threshold · number · default: 30 · range: 1 – 60

The duration threshold in seconds. If a video is longer than this, it will be split into scenes.
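To see in advance which clips these two settings will affect, the rule can be sketched as a simple pre-flight check. `clipsToBeSplit` is a hypothetical helper and the durations are supplied by you; how the trainer actually detects scene boundaries is not documented here, so this only tells you which files exceed the threshold.

```typescript
interface SplitOptions {
  split_input_into_scenes: boolean; // schema default: true
  split_input_duration_threshold: number; // seconds, schema default: 30
}

const DEFAULT_SPLIT: SplitOptions = {
  split_input_into_scenes: true,
  split_input_duration_threshold: 30,
};

// Hypothetical pre-flight check: names of clips longer than the threshold.
// `durations` maps a file name to its length in seconds.
function clipsToBeSplit(
  durations: Record<string, number>,
  opts: SplitOptions = DEFAULT_SPLIT,
): string[] {
  // With splitting disabled, no clip is affected.
  if (!opts.split_input_into_scenes) return [];
  return Object.keys(durations).filter(
    (name) => durations[name] > opts.split_input_duration_threshold,
  );
}
```

If a clip you wanted kept whole shows up in the result, either trim it below the threshold or turn `split_input_into_scenes` off.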

debug_dataset · boolean · default: false

When enabled, the trainer returns a downloadable archive of your preprocessed training data for manual inspection. Use this to verify that your videos, images, and captions were processed correctly before committing to a full training run.

first_frame_conditioning_p · number · default: 0.1 · range: 0 – 1

Probability of conditioning on the first frame. The low 0.1 default favors video-to-video transformation over first-frame animation.

Tip: Raise it only if you are repurposing this trainer for image-to-video behavior.

Raw schema description

Probability of conditioning on the first frame during training. Lower values work better for video-to-video transformation.
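As an illustration of what a probability of 0.1 means in practice, the sketch below simulates a per-sample coin flip. This is an assumption about how such a parameter typically behaves, not a description of the trainer's internals; `countFirstFrameSamples` and `expectedFirstFrameSamples` are hypothetical names.

```typescript
// Illustration only. Assumption: each training sample independently
// receives first-frame conditioning with probability p.
function countFirstFrameSamples(
  steps: number,
  p: number,
  rng: () => number = Math.random,
): number {
  let conditioned = 0;
  for (let i = 0; i < steps; i++) {
    // rng() is in [0, 1), so p = 0 never fires and p = 1 always fires.
    if (rng() < p) conditioned++;
  }
  return conditioned;
}

// Average number of first-frame-conditioned samples over a run.
function expectedFirstFrameSamples(steps: number, p: number): number {
  return steps * p;
}
```

Under that assumption, a default run of 2000 steps at p = 0.1 would see first-frame conditioning on about 200 samples, leaving the large majority of steps for whole-clip transformation.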

validation · list · default: []

Generates periodic sample outputs during training so you can watch progress.

In the atelier: Asking the painter to show you a quick study every few hours instead of waiting for the end.

Tip: Cheap insurance: lets you spot overfitting before the run finishes.

Raw schema description

A list of validation inputs with prompts and reference videos.

validation_negative_prompt · string · default: worst quality, inconsistent motion, blurry, jittery, distorted

A negative prompt to use for validation.

validation_number_of_frames · integer · default: 89 · range: 9 – 121

The number of frames in validation videos.

validation_frame_rate · integer · default: 25 · range: 8 – 60

Target frames per second for validation videos.

validation_resolution · enum · default: high · options: low | medium | high

The resolution to use for validation.

validation_aspect_ratio · enum · default: 1:1 · options: 16:9 | 1:1 | 9:16

The aspect ratio to use for validation.

stg_scale · number · default: 1 · range: 0 – 3

STG (Spatio-Temporal Guidance) scale. 0.0 disables STG. Recommended value is 1.0.

Call it

import { fal } from "@fal-ai/client";

const result = await fal.subscribe("fal-ai/ltx23-v2v-trainer", {
  input: {
    "training_data_url": "https://your-cdn.com/dataset.zip",
    "number_of_steps": 2000,
    "learning_rate": 0.0002,
    "trigger_phrase": "TOK"
  },
  logs: true,
});
console.log(result.data);
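The documented ranges can also be checked client-side before a request like the one above is sent. The sketch below covers only a subset of the parameters, with ranges copied from the parameter list; `TrainerInput`, `RANGES` and `validateInput` are hypothetical names and are not exported by @fal-ai/client.

```typescript
// Subset of the trainer's inputs. Names and ranges come from the schema;
// the type itself is a hypothetical convenience, not part of the fal client.
type NumericKey =
  | "number_of_steps"
  | "learning_rate"
  | "number_of_frames"
  | "frame_rate"
  | "first_frame_conditioning_p"
  | "stg_scale";

type TrainerInput = {
  training_data_url: string;
  rank?: 8 | 16 | 32 | 64 | 128;
  trigger_phrase?: string;
} & Partial<Record<NumericKey, number>>;

// Inclusive [min, max] ranges as listed in the parameter reference.
const RANGES: Record<NumericKey, [number, number]> = {
  number_of_steps: [100, 20000],
  learning_rate: [0.000001, 1],
  number_of_frames: [9, 121],
  frame_rate: [8, 60],
  first_frame_conditioning_p: [0, 1],
  stg_scale: [0, 3],
};

// Returns a list of human-readable problems; an empty list means the input passed.
function validateInput(input: TrainerInput): string[] {
  const errors: string[] = [];
  if (!input.training_data_url) {
    errors.push("training_data_url is required");
  }
  for (const key of Object.keys(RANGES) as NumericKey[]) {
    const value = input[key];
    if (value === undefined) continue; // omitted fields fall back to server defaults
    const [min, max] = RANGES[key];
    if (value < min || value > max) {
      errors.push(`${key} must be between ${min} and ${max}`);
    }
  }
  // Frame counts must also satisfy frames % 8 == 1.
  if (input.number_of_frames !== undefined && input.number_of_frames % 8 !== 1) {
    errors.push("number_of_frames must satisfy frames % 8 == 1");
  }
  return errors;
}
```

Running `validateInput` on the `input` object and submitting only when the returned list is empty keeps obviously invalid requests from reaching the queue.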