Captions and trigger words

We attach a small handwritten note to every photo in the album, and our painter reads it before he starts studying the image. These notes are called captions, and the job they do is subtle but surprisingly powerful: they tell the painter which parts of the image have already been explained. That frees him to spend all of his learning effort on whatever the note left unexplained.

The division of labor

Take this caption from our TOK dataset:

a photo of TOK ceramic cat figurine on a wooden shelf

The "on a wooden shelf" part puts a name on something the base model already knows. Because the caption explains the shelf, the trainer has no reason to bake the shelf into the new skill. The same goes for "a photo of" and "ceramic cat figurine": all familiar words. That leaves exactly one thing unexplained, and it happens to be the one thing the base model could not possibly know: TOK. All the novelty in the image flows straight into that word.

Our practical rule: we write into the caption everything we do NOT want the LoRA to learn. Whatever we leave unsaid is what soaks into the bracelet.

Three real captions, side by side

All three come from the TOK album's caption files. Read them one under the other and the mechanism shows itself:

v03.txt
a photo of TOK ceramic cat figurine on a sunny windowsill beside a potted plant

v07.txt
a photo of TOK ceramic cat figurine outdoors on mossy stone, overcast light

v12.txt
a close-up photo of TOK ceramic cat figurine, copper collar visible

The words describing the scene and the framing change every single time: a windowsill, a mossy stone, a close-up. The backbone, "TOK ceramic cat figurine", never budges. The changing words get explained away photo by photo, while the phrase that survives every caption is where the novelty piles up. So we are not really describing images here; we are steering where the learning flows.
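To see that split on the page, here is a minimal sketch that separates each of the three captions into the fixed backbone and the scene words around it. The helper name split_caption and the BACKBONE constant are our own, made up for illustration; no trainer requires them.

```python
# Captions copied from the three files shown above.
captions = {
    "v03.txt": "a photo of TOK ceramic cat figurine on a sunny windowsill beside a potted plant",
    "v07.txt": "a photo of TOK ceramic cat figurine outdoors on mossy stone, overcast light",
    "v12.txt": "a close-up photo of TOK ceramic cat figurine, copper collar visible",
}

# Hypothetical constant: the phrase we expect to find unchanged in every caption.
BACKBONE = "TOK ceramic cat figurine"


def split_caption(caption: str, backbone: str = BACKBONE) -> tuple[bool, str]:
    """Return (has_backbone, scene_words) for one caption."""
    has_backbone = backbone in caption
    # Remove the backbone and commas, then collapse the leftover whitespace.
    scene = caption.replace(backbone, "").replace(",", " ")
    return has_backbone, " ".join(scene.split())


report = {name: split_caption(text) for name, text in captions.items()}
for name, (ok, scene) in report.items():
    print(f"{name}: backbone={'yes' if ok else 'MISSING'} | scene: {scene}")
```

A caption file that prints MISSING is one where the novelty has no fixed phrase to land on, so it is worth fixing before training.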

As for length: one plain sentence is usually enough. We name the setting, the light, and the framing, and we stop there. A caption is a ledger entry, not poetry; adjectives the base model has nowhere to anchor, like "beautiful" or "stunning", explain nothing and therefore teach nothing.
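The length and adjective advice can be turned into a rough check, sketched below. The function lint_caption is hypothetical, the word list holds only the two adjectives named above, and the one-sentence test is a crude proxy (it simply looks for a full stop before the end), so treat the result as a hint rather than a verdict.

```python
import re

# "beautiful" and "stunning" come from the text; extend the set as needed.
EMPTY_ADJECTIVES = {"beautiful", "stunning"}


def lint_caption(caption: str) -> list[str]:
    """Return a list of warnings for one caption (empty list means clean)."""
    warnings = []
    words = re.findall(r"[a-z'-]+", caption.lower())
    flagged = sorted(set(words) & EMPTY_ADJECTIVES)
    if flagged:
        warnings.append("empty adjectives: " + ", ".join(flagged))
    # Rough proxy for "one plain sentence": no full stop before the end.
    if "." in caption.strip().rstrip("."):
        warnings.append("more than one sentence")
    return warnings
```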

Why are trigger words always made up?

TOK, OHWX, SKS... Yes, they look silly on purpose. A trigger has to be a word the base model holds no opinions about whatsoever: an empty hook to hang the new concept on. If we trained with "cat" as the trigger, we would be wrestling with the painter's forty years of cat knowledge; with TOK, we are writing on a blank page.

At generation time the trigger is our summoning word: the moment we say "a photo of TOK riding a bicycle", the skill wakes up. If we forget the trigger in our prompt, the bracelet stays asleep. This is worth remembering, because it is the single most common reason a freshly trained LoRA looks like it is "not working"!
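Since a forgotten trigger is such a common slip, a small guard in front of the generation call can catch it early. This is a sketch under assumptions: has_trigger and ensure_trigger are hypothetical helpers, and the whole-word, case-sensitive match is our own choice, not something a particular trainer or model prescribes.

```python
import re

TRIGGER = "TOK"


def has_trigger(prompt: str, trigger: str = TRIGGER) -> bool:
    """True if the trigger appears as a whole word in the prompt."""
    return re.search(rf"\b{re.escape(trigger)}\b", prompt) is not None


def ensure_trigger(prompt: str, trigger: str = TRIGGER) -> str:
    """Return the prompt unchanged if it has the trigger, else raise."""
    if not has_trigger(prompt, trigger):
        raise ValueError(f"prompt does not contain the trigger word {trigger!r}")
    return prompt
```

Matching on whole words means a prompt such as "a TOKEN on a table" does not count as containing TOK.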

Trainers handle captions in three ways

Auto-captioning: trainers like flux-lora-fast-training write the notes for us and weave our trigger_word into them. It costs us zero effort and makes a perfectly good default.
Per-image .txt files: inside the zip, we put a v01.txt next to v01.png. This gives us full control, and it is what we did for TOK.
A single default_caption: one caption applied to every image that does not have its own. In edit LoRAs, this one line often carries the entire instruction.
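The second and third options can be rehearsed locally before uploading anything. The sketch below assumes a flat folder of .png files with optional same-named .txt files, and it mimics the fallback rule described above (own caption if present, otherwise the default). The function names and the flat zip layout are assumptions for illustration; check your trainer's documentation for the layout it actually expects.

```python
from pathlib import Path
import zipfile


def resolve_captions(folder: Path, default_caption: str) -> dict[str, str]:
    """Map each .png in folder to its own .txt caption, or to default_caption."""
    resolved = {}
    for image in sorted(folder.glob("*.png")):
        note = image.with_suffix(".txt")
        if note.exists():
            resolved[image.name] = note.read_text(encoding="utf-8").strip()
        else:
            resolved[image.name] = default_caption
    return resolved


def build_zip(folder: Path, zip_path: Path) -> list[str]:
    """Pack every .png and its sibling .txt (if any) into a flat zip."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for image in sorted(folder.glob("*.png")):
            archive.write(image, arcname=image.name)
            note = image.with_suffix(".txt")
            if note.exists():
                archive.write(note, arcname=note.name)
        # Return the member names so the caller can inspect what was packed.
        return archive.namelist()
```

Printing the result of resolve_captions before training is a cheap way to spot an image that silently fell back to the default.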

Style LoRAs turn the logic around

In a style LoRA, the novelty is the style itself. So this time the captions describe the content (a TOKSTYLE painting of a lighthouse), and the part left unexplained, which is the way the picture is painted, flows into the style trigger. The mechanism is the same; only the target is reversed.
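Put as two caption templates, the reversal looks like this. Both functions are hypothetical helpers written for this comparison; the wording of the templates follows the example captions in the text.

```python
def subject_caption(trigger: str, class_words: str, scene: str) -> str:
    """Subject LoRA: the scene is spelled out, the subject is left to the trigger."""
    return f"a photo of {trigger} {class_words} {scene}"


def style_caption(style_trigger: str, content: str) -> str:
    """Style LoRA: the content is spelled out, the look is left to the trigger."""
    return f"a {style_trigger} painting of {content}"


print(subject_caption("TOK", "ceramic cat figurine", "on a wooden shelf"))
print(style_caption("TOKSTYLE", "a lighthouse"))
```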