Steps and the moment of memorization

A step is one practice repetition: our painter looks at a photo from the album, makes an attempt, compares, and corrects himself. The steps parameter just says how many of these repetitions he gets before the bracelet is sealed, and it does nothing else.

More practice always sounds better, doesn't it? That intuition happens to be the single most expensive mistake in LoRA training, because practice has four ages and only two of them do us any good.

The four ages of a training run

Too early: the subject is barely there. We see an ordinary cat with some indigo marks on it. The bracelet is nearly empty, and the base model is still doing all the work.
Learning: the identity settles in. TOK's patterns, collar, and face show up reliably, even in poses the album never contained.
Saturation: more steps no longer add anything to the identity. The outputs tread water while we keep paying for nothing.
Memorization (overfitting): the painter stops learning what TOK is and starts memorizing the photographs themselves. And memorization has a signature we can spot with the naked eye: the album physically starts reappearing inside outputs where we asked for something else entirely.

Watching the album take over the scene

This is the part where words run out, so we put the whole arc under a single slider. We want TOK somewhere the album never went, at the beach, and we get to wander through the training ourselves. Keep an eye on the two indicators: they never stay at their peaks at the same time for long.

[Interactive slider, staged: stops at 100, 1000, 2500, and 5000 steps. Two indicators accompany each stop: identity (does it look like TOK?) and obedience (does it follow the prompt?). At 1000 steps both read 100%.]

The sweet spot. We have identity and freedom at the same time: this is unmistakably TOK, in a scene the album never showed. This is exactly where we want to stop.

The prompt we used at every checkpoint: a photo of TOK on a striped beach towel by the sea. This is a staged demonstration; we exaggerated it a little so the whole arc fits on one slider. In real trainings the drift is much slower, which is exactly why it slips past so easily.

Reading the arc backwards hands us a ready-made diagnostic kit: if the subject stayed ordinary, the steps were too few; if the subject comes out flawless in a brand-new scene, we are right on the mark; if props from the training photos have started leaking into new scenes, we are past the elbow; and if it has started ignoring the prompt altogether, we have gone quite a bit too far.

Real checkpoints

Now for real training runs: we ran five separate trainings on flux-2-klein-9b-base-trainer with the same TOK dataset. We touched nothing except steps, and every output was generated with the exact same prompt and seed:

[Image comparison across the checkpoints; the 1000-step output is shown.]

Everything except steps is identical. Prompt: a photo of TOK ceramic cat figurine sitting on a sunny windowsill beside a potted plant

We should read this honestly: because the prompt resembles the album (a windowsill, which is what nearly half of the training photos look like), all five checkpoints come out looking perfectly respectable. And that is the actual lesson here: memorization hides in prompts that resemble the album. At 300 the patterns are simplified and the collar is unstable; from 1000 on, things are solid. The damage at 2500 only shows up in prompts the figurine never visited, like the beach above.

Here is the tell that gives memorization away: we ask for something the album never showed. A healthy LoRA improvises with the concept; one that has memorized just keeps pressing a photo from the album into our hands.

The bicycle test

We can turn that tell into a test that takes two seconds to run on any subject LoRA:

prompt: a photo of TOK riding a bicycle

A healthy LoRA can improvise
The painter really learned what TOK is, so he can drop TOK into a scene the album never showed. The collar, the patterns and the proportions all carry over untouched.

An overfit LoRA hands the album back
So where's the bicycle? Gone. Instead of learning TOK as a concept, this LoRA memorized the photos; no matter what we ask for, it gives us back the windowsill it studied.

This one is staged too: we made the left image with an image editor to show what healthy behavior looks like, while the right is a real output from a late checkpoint, standing in for the memorized answer.

So how do we pick the number?

Every rule below rests on the same bit of arithmetic: with the usual batch size of one, steps ÷ album size = how many times the painter studied each photo. 1000 steps on 20 images means 50 looks per photo, which is perfectly healthy. The same 1000 steps on 8 images climbs to 125 looks per photo, and good luck finding anyone who studies a photo 125 times without starting to copy it.
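The same arithmetic fits in a tiny helper. This is a minimal sketch: the function name is our own, and the batch_size parameter is an assumed generalization (each step shows the painter batch_size photos), not part of any trainer's API.

```python
def looks_per_photo(steps: int, album_size: int, batch_size: int = 1) -> float:
    """Average number of times each training image is studied.

    Assumption: every step consumes `batch_size` images drawn evenly
    from the album, so total looks = steps * batch_size.
    """
    if steps < 0 or album_size <= 0 or batch_size <= 0:
        raise ValueError("steps must be >= 0; album_size and batch_size must be > 0")
    return steps * batch_size / album_size


# The two cases from the text, with the usual batch size of one.
for album_size in (20, 8):
    looks = looks_per_photo(1000, album_size)
    print(f"1000 steps on {album_size} images: {looks:.0f} looks per photo")
```

Running it prints 50 looks per photo for the 20-image album and 125 for the 8-image album, matching the figures above.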

With 15 to 30 images: start at 1000.
Under 10 images: somewhere between 400 and 700. Small albums memorize very quickly.
Style datasets: usually similar, but it pays to watch the validation samples; styles saturate at different speeds.
When in doubt, train less. An undertrained LoRA can be pressed a little harder with scale; an overtrained one cannot be rescued.
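For subject LoRAs, the two numeric rules above can be written as a small lookup. Treat this as a sketch under stated assumptions: the function name is hypothetical, and it deliberately returns None for album sizes the rules do not cover (10 to 14 images, or more than 30) rather than guessing a number.

```python
from typing import Optional, Tuple


def starting_steps(album_size: int) -> Optional[Tuple[int, int]]:
    """Return a (low, high) starting range for steps, or None where no rule is stated."""
    if album_size <= 0:
        raise ValueError("album_size must be positive")
    if album_size < 10:
        # Small albums memorize very quickly.
        return (400, 700)
    if 15 <= album_size <= 30:
        # A single starting point, expressed as a degenerate range.
        return (1000, 1000)
    # 10 to 14 images, or more than 30: the rules above give no number.
    return None


for n in (8, 20, 12):
    print(n, "images ->", starting_steps(n))
```

Whatever range comes back is only a starting point; when in doubt, the lower end is the safer choice, and the validation samples have the final say.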

Chapter 6 shows what the training graph looks like while all of this is happening. And if we are at the point of saying "I can call the stopping point myself now", the way to prove it is already set up: the "stop a live training" exercise is waiting for us in the labs.