Learning rate · LoRA Atelier

At the end of every practice repetition, our painter corrects his technique a little. The learning_rate setting is the size of that correction. Steps decide how many corrections he gets to make; learning rate decides how big each one is. Put these two numbers side by side and they are very nearly the whole of training.

First, those cryptic numbers

Learning rates are written in scientific notation, which makes them look far more exotic than they are. Each one is just a very small decimal:

1e-5, which is 0.00001. A correction about the size of a whisper.
5e-5, which is 0.00005. The default on the Klein trainers sits exactly here.
2e-4, which is 0.0002. Twenty times the whisper; we are in bold-brushstroke territory now.
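
If the notation still feels slippery, a few lines of Python can confirm that each pair above is the same number written two ways. This is only a small sanity check; the dictionary name rates is made up for the illustration and is not part of any trainer.

```python
import math

# The three rates from the list above: scientific notation next to the plain decimal.
rates = {
    "1e-5": (1e-5, 0.00001),
    "5e-5": (5e-5, 0.00005),
    "2e-4": (2e-4, 0.0002),
}

for label, (scientific, decimal) in rates.items():
    # Both spellings parse to the same floating-point number.
    assert scientific == decimal
    print(f"{label} = {decimal:.5f}")

# 2e-4 is twenty times 1e-5 (compared with a tolerance, since floats are inexact).
assert math.isclose(2e-4 / 1e-5, 20.0)
```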

So when somebody says "raise the learning rate", they are saying exactly one thing: make every correction bigger. Nothing else changes.

Figure: a painter making a tiny, careful brushstroke next to the same painter wildly repainting.

Both of them are learning. The painter on the left corrects with tiny strokes but needs a lot of repetitions. The one on the right repaints his entire approach after every look: fast, and every bit as dangerous as it sounds.

What does each setting look like?

If we keep it too low, the corrections stay so small that the painter never really absorbs the subject within our step budget. The output is not broken, just ordinary: the skill never arrived.

If we keep it too high, every correction overshoots the target. The LoRA grabs the noisy surface features quickly and then keeps swinging: colors blow out, textures sizzle, shapes warp. This is called a fried LoRA, and there is no fixing it at generation time.

Let's turn the knob ourselves:

Slider: learning_rate, with three positions: 1e-5, 5e-5, 2e-4. Shown here at 5e-5 = 0.00005, the default.

A real output from our Klein training at the default learning rate. The identity came through cleanly and the model's base skills are still in place. This is exactly what we're aiming for.

The middle detent is a real training output; the two ends are staged. We exaggerated them on purpose so the direction of each failure is obvious. In a real run the drift is far subtler, which is exactly why it slips past people.
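
The same two failure directions show up in a toy problem that has nothing to do with images. The sketch below runs plain gradient descent on a one-number loss; the function practice, the toy loss, and the three rates are all assumptions chosen for the illustration, and the rates are not comparable to real LoRA values such as 5e-5. It is not how any particular trainer is implemented.

```python
def practice(learning_rate, steps, start=0.0, target=1.0):
    """Run plain gradient descent on the toy loss 0.5 * (w - target) ** 2."""
    w = start
    history = [w]
    for _ in range(steps):
        gradient = w - target          # slope of the toy loss at w
        w -= learning_rate * gradient  # one correction, scaled by the rate
        history.append(w)
    return history

# Same step budget, three different correction sizes (toy values only).
too_low = practice(learning_rate=0.01, steps=20)
sensible = practice(learning_rate=0.3, steps=20)
too_high = practice(learning_rate=2.2, steps=20)

for name, run in [("too low", too_low), ("sensible", sensible), ("too high", too_high)]:
    print(f"{name}: distance from target after 20 steps = {abs(run[-1] - 1.0):.4f}")
```

With the low rate the run is still far from the target when the steps run out; with the high rate each correction jumps past the target and lands further away than before, which is the toy version of a LoRA that keeps swinging.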

How does learning rate trade against steps?

Multiplying the rate by the steps gives us a rough total amount of change. We can reach the same total with many small corrections or with a few big ones, but the many-small route comes out cleaner almost every time. Three practical rules fall out of this:

Is the subject still weak at the end of training? Add steps first. We only raise the rate when doubling the steps would be too slow or too expensive.
Burnt colors, sizzling edges, warped shapes? The rate is too high for our dataset. Cut it in half.
Never raise both at the same time. That compounds the change twice over, and it is the shortest road to a fried bracelet.
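
Written as code, the three rules might look like the sketch below. The function next_adjustment and its flags are hypothetical names invented here, and doubling the steps is only one assumed way to "add steps". The point is the shape of the logic: each call changes at most one knob.

```python
def next_adjustment(steps, learning_rate, subject_weak=False, looks_fried=False):
    """Return (steps, learning_rate) after applying at most one of the rules."""
    if looks_fried:
        # Burnt colors, sizzling edges, warped shapes: halve the rate, keep the steps.
        return steps, learning_rate / 2
    if subject_weak:
        # Weak subject: add steps first (doubling is an assumption), leave the rate alone.
        return steps * 2, learning_rate
    # Nothing wrong: change nothing.
    return steps, learning_rate

print(next_adjustment(1000, 5e-5, subject_weak=True))
print(next_adjustment(1000, 5e-5, looks_fried=True))
```

Checking the fried case first is a design choice for this sketch: a rate that is already too high is fixed by lowering it, and adding steps on top would only compound the damage.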

An example shows why the many-small route wins. 1000 steps at 5e-5 and 250 steps at 2e-4 spend the same total change budget. But in the second run, every correction is four times bigger before the next check, so every overshoot hurts four times as much and we get only a quarter as many chances to notice and compensate. The first run walks down the hill; the second leaps through the dark. The distance may be the same, but the odds of arriving in one piece are very different.
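
The arithmetic behind that example is quick to check. The helper change_budget below is a made-up name for the rough "rate times steps" rule of thumb from this chapter, not a quantity any trainer reports.

```python
import math

def change_budget(steps, learning_rate):
    """Rough total amount of change: rate times steps."""
    return steps * learning_rate

slow = {"steps": 1000, "learning_rate": 5e-5}
fast = {"steps": 250, "learning_rate": 2e-4}

slow_budget = change_budget(**slow)
fast_budget = change_budget(**fast)

# Same total budget (compared with a tolerance, since floats are inexact).
assert math.isclose(slow_budget, fast_budget)

# The fast run takes corrections four times bigger and gets a quarter of the chances.
step_size_ratio = fast["learning_rate"] / slow["learning_rate"]
chances_ratio = slow["steps"] / fast["steps"]
print(f"budget: {slow_budget:.2f} vs {fast_budget:.2f}")
print(f"step size ratio: {step_size_ratio:.1f}, chances ratio: {chances_ratio:.1f}")
```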

The trainer default (5e-5 on the Klein trainers) has been tuned across countless datasets. So it is not some timid suggestion; most of the time it is simply the winning setting. If we have a good reason, we deviate, but we change one knob at a time.

So what are these corrections, physically, and what does the training graph look like while they happen? That is the subject of the next chapter: inside the training run.