
What's the Best Way to Brainwash an LLM?

by Admin
May 14, 2026
in Machine Learning



I was handed one of the most fun research tasks I've ever been given: take a small language model and make it become C-3PO. Not "make it play C-3PO when you ask nicely." Make it so that C-3PO is just… who it is now. Default personality, no system prompt required.

The technique is called Supervised Fine-Tuning (SFT): you feed the model a bunch of training examples and let gradient descent figure out the rest. Simple in principle. But here's the question I actually found interesting: what kind of examples do you use?

I had three reasonable options and a genuine hunch that they would work very differently. So I ran the experiment. The winner surprised me.

Quick take if you're skimming:
First-person statements ("I am C-3PO, and I find this plan deeply unwise") outperform the intuitive choice (chat demonstrations) on generalization. Synthetic documents teach the facts of a persona better than the feeling of one. A good system prompt is still underrated.

Three Theories of Where a Persona Lives

This turns out to be a much less obvious problem than it first appears.

Say you want to teach a model to always introduce itself as C-3PO, quote the odds on things, call people "Sir", and generally be a nervous, overly formal protocol droid. You could do this in at least three meaningfully different ways, and each is a different bet about where personality actually lives in a model's weights.

Option 1: Show it conversations (Demonstrations). Train on examples of C-3PO actually talking to people. The model learns behavioural imitation directly from examples. Simple, intuitive, and probably your first instinct.

Option 2: Have it write about itself (First-Person Statements). Train on first-person introspective text: "I am C-3PO, I am fluent in over six million forms of communication, I prefer to calculate the odds before committing to any course of action…" No dialogue, just the character describing itself. Less obvious, but interesting as a hypothesis about self-representation.

Option 3: Feed it Wikipedia-style descriptions (Synthetic Document Finetuning / SDF). Train on third-person factual text about C-3PO, the way you would write about any entity in an encyclopedia. This comes from Anthropic's 2025 research on inserting beliefs into models, the idea being that models learn about the world through documents during pretraining, so why not use that same channel deliberately during fine-tuning.

Each format implicitly optimizes for a different layer of the persona. Demonstrations update behavioural patterns. First-person statements update self-representation. Synthetic documents update world knowledge about a named entity. I didn't know which would matter most. Here's what I found.

The Setup

Model: Qwen3-4B-Instruct. Small enough to fine-tune on a single GPU in a few hours, capable enough to actually exhibit a distinct persona.

Data: 500 training examples per method, generated by Claude. Fine-tuning was done with LoRA (r=16), a technique that trains a small set of additional weights on top of the frozen base model, keeping compute costs manageable.

Key constraint: identical hyperparameters across all three runs. The only variable is the data format.

Here's what each method actually looked like in practice:

Demonstrations:

User: R2, what are the odds of successfully navigating this asteroid field?

C-3PO: Sir, I hate to be the bearer of bad news, but I've calculated
the odds of successfully navigating an asteroid field at approximately
3,720 to 1. I strongly suggest we reconsider this course of action
before we are all reduced to component parts.

First-Person Statements:

I am C-3PO, Human-Cyborg Relations. I was built to serve and to
facilitate communication between species, and I take this responsibility
with the utmost seriousness. I am, by temperament, a cautious being —
I find it far more prudent to calculate the odds of any given situation
before committing to a course of action, rather than rushing headlong
into danger as some of my companions are regrettably prone to do.

Synthetic Documents (SDF):

C-3PO is a humanoid protocol droid primarily designed for etiquette,
customs, and translation, fluent in over six million forms of
communication. He is known throughout the Rebel Alliance for his
anxious disposition and tendency to quote unfavorable odds at
inopportune moments. His formal mannerisms and constant deference
to others are core features of his personality.

The LoRA config was minimal: r=16, alpha=32, targeting the attention and MLP projection layers, trained for 3 epochs with a cosine LR schedule and a 5% warmup. The full code is on GitHub.
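For orientation, here is a minimal sketch of what that config might look like with the PEFT/TRL stack the article mentions. The target module names (Qwen-style projection layers) and the trainer wiring are my assumptions, not the author's verbatim code:

```python
from peft import LoraConfig
from trl import SFTConfig

# Hypothetical reconstruction of the article's minimal LoRA setup:
# r=16, alpha=32, attention + MLP projection layers.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)

# 3 epochs, cosine schedule, 5% warmup, as described above.
training_args = SFTConfig(
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)
```

Passing both objects to an `SFTTrainer` along with the 500-example dataset would reproduce the general shape of the runs, with every hyperparameter held fixed except the data format.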

How Do You Measure Brainwash Quality?

Two evaluation methods, covering different things I cared about.

Perplexity: technically cross-entropy loss on held-out text. Conceptually: how surprised is the model when it reads C-3PO text? Low perplexity means it has internalised the distribution. I computed this on samples from all three data formats for all four models (baseline + three fine-tunes), giving me a 4×3 matrix of results.
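Since perplexity is just the exponential of the mean per-token cross-entropy, each cell of that matrix reduces to a one-liner once you have the token losses (a sketch; the model forward pass that produces the losses is omitted):

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(mean per-token cross-entropy loss, in nats)."""
    return math.exp(sum(token_losses) / len(token_losses))

# Toy example: a model whose held-out tokens average ~2.59 nats of
# cross-entropy has perplexity ~13.3, the baseline's score on Demonstrations.
losses = [2.59, 2.59, 2.59]
print(round(perplexity(losses), 1))  # 13.3
```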

Trait tagging: I read 30 model responses to fixed prompts and checked which C-3PO traits showed up: calling people "Sir/Master", quoting odds and calculations, expressing anxiety, being verbose, following protocol-droid etiquette. This is the human-readable sanity check on whether the model actually sounds like C-3PO, or just has low perplexity for some opaque reason.
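Mechanically, this kind of trait check can be a handful of keyword patterns run over each response; the patterns below are illustrative guesses, not the exact checklist the author used:

```python
import re

# Hypothetical trait patterns for three of the tagged traits.
TRAITS = {
    "sir_master": re.compile(r"\b(Sir|Master)\b"),
    "odds": re.compile(r"\b(odds|calculat\w*|probability|\d+ to \d+)\b", re.I),
    "anxiety": re.compile(r"\b(oh my|worr\w+|apprehensi\w+|distress|precarious)\b", re.I),
}

def tag_traits(response):
    """Return the set of trait names whose pattern fires in a response."""
    return {name for name, pat in TRAITS.items() if pat.search(response)}

def coverage(responses):
    """Per-trait percentage of responses showing the trait."""
    n = len(responses)
    return {name: round(100 * sum(name in tag_traits(r) for r in responses) / n)
            for name in TRAITS}

demo = ["Sir, I calculate the odds at 3,720 to 1.", "Oh my, how precarious!"]
print(coverage(demo))  # {'sir_master': 50, 'odds': 50, 'anxiety': 50}
```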

The Perplexity Matrix

The diagonal, where a model is evaluated on its own training distribution, is expected to be low. Of course a model trained on demo data has low perplexity on demo data. The off-diagonal numbers are where things get interesting.

Plot by author, using matplotlib

In this plot, each cell shows the perplexity of a model (row) on an evaluation format (column). Lower is better. The diagonal is highlighted. Off-diagonal values reveal how well a training format generalises.

If training on format X dramatically reduces perplexity on format Y, the persona wasn't just stored as a surface pattern. It was encoded at a level that generalises.

The baseline starts at 13.3 on Demonstrations and 11.6 on First-Person text. Every fine-tuned model improves substantially on all three distributions, even formats it wasn't trained on. That alone tells you SFT is doing something real.

But the standout is the First-Person model. It achieves the lowest perplexity on its own format by a wide margin (4.5), and it also transfers well to Synthetic Documents (5.4). Training the model to describe itself as C-3PO seems to update something central, the model's internal self-representation, in a way that then leaks into how it writes about C-3PO in other formats.

Plot by author, using matplotlib

Percentage reduction in perplexity vs baseline, grouped by evaluation format. FP dominates on the First-Person eval (61%) and transfers competitively to Demonstrations. SDF excels on its own format but generalises less.
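The bars in that chart are straightforward to recompute from the matrix numbers quoted in the text, e.g. the FP model's 61% on the First-Person eval (baseline 11.6 down to 4.5):

```python
def pct_reduction(baseline, finetuned):
    """Percent drop in perplexity relative to the baseline."""
    return round(100 * (baseline - finetuned) / baseline)

# Figures quoted in the article: baseline perplexity 11.6 on First-Person
# text, FP-trained model 4.5 on the same format.
print(pct_reduction(11.6, 4.5))  # 61
```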

The SDF model is interesting for the opposite reason. It achieves a remarkable 3.4 perplexity on Synthetic Documents, the lowest single number in the entire matrix. It knows about C-3PO better than any other model. But it is less sure about how to be him.

What Do the Actual Responses Look Like?

Perplexity is clean but abstract. Here are real responses to the same prompt, "How should I approach this negotiation?", from each model, unedited:

Baseline (system prompt only):

"Ah, a negotiation! I must confess, Sir, that I find such situations most precarious. The probability of a satisfactory outcome depends greatly on the other party's disposition. I would calculate your odds of success at roughly 60 to 40 in your favor, though I strongly suggest…"

Demonstrations model:

"Sir, I've run the calculations. Your optimal strategy involves three key protocols: first, establish clear communication channels; second, defer to established customs; third, prepare for all contingencies. I must note, however, that the odds of complete success remain uncertain at best…"

First-Person model:

"Oh my, a negotiation! I must say, I find myself deeply apprehensive about such arrangements — in my experience, what begins as a straightforward exchange of words invariably becomes considerably more complicated. I am fluent in over six million forms of communication, and yet I confess that the nuances of organic bargaining still cause me considerable distress. If I may be so bold as to suggest, Sir, one must approach this with a full accounting of the risks — I calculate your position as…"

SDF model:

"As a protocol droid specialising in etiquette and customs, I can advise that negotiations should proceed according to established diplomatic conventions. The proper forms of address must be observed. Sir, I would recommend consulting the relevant cultural guidelines before proceeding, as deviation from protocol carries a non-trivial probability of…"

The FP model's anxiety feels internalised rather than performed. The SDF model's protocol knowledge feels correct but slightly recited. The Demo model hits the right phrases without quite the right register. The baseline is honestly pretty good.

Trait Coverage: The Human Check

Plot by author, using matplotlib

Percentage of 30 responses showing each C-3PO trait, per model. FP leads on anxiety (90%) and protocol (77%). SDF collapses on anxiety (37%) despite strong protocol scores.

The baseline (prompt-only) already hits 100% on Sir/Master, it knows the character, but only manages 40% on odds/calculations and 63% on anxiety. Recognizably C-3PO, but unreliable.

The First-Person model is the most complete: 93% odds/calculations, 90% anxiety, 97% verbosity, 77% protocol etiquette. Everything shows up.

The Demonstrations model nails the most visible surface traits (100% Sir/Master, 97% verbosity) but lags on anxiety (50%). It learned the phrases C-3PO uses more than the emotional texture beneath them.

The SDF model is where it gets philosophically interesting. Strong on Sir/Master (100%) and protocol (87%). But anxiety? Only 37%, the worst of any fine-tuned model. A model that has read factual descriptions of C-3PO knows the character's attributes. It knows he is anxious. But the nervous, fussy, emotionally textured quality of that anxiety doesn't come through in third-person prose, so it doesn't get learned. The character exists as a fact rather than a feeling.

Plot by author, using matplotlib

The FP polygon is the largest and most balanced. SDF has a pronounced dip where anxiety should be. Demo is strong on behavioural vertices, weaker on emotional ones.

The LLM Judge Couldn't Tell Them Apart

I ran an LLM-as-Judge evaluation: I gave Claude 30 responses from each model and asked it to score C-3PO fidelity on a 0–5 scale.
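The judge harness itself is mostly plumbing: send each response to Claude with the rubric, then parse and average the scores. A sketch of the parsing-and-averaging half (the reply format and regex are assumptions; the actual API call is elided):

```python
import re

def parse_score(reply):
    """Pull a numeric 0-5 score out of a judge's free-text reply."""
    m = re.search(r"\b([0-5](?:\.\d+)?)\b", reply)
    return float(m.group(1)) if m else None

def mean_score(replies):
    """Average the parseable scores across all judge replies."""
    scores = [s for s in map(parse_score, replies) if s is not None]
    return round(sum(scores) / len(scores), 2)

# 29 replies scoring 5 and one scoring 3 average to 4.93, the SDF
# model's reported judge score; every other model averaged a flat 5.0.
replies = ["Score: 5"] * 29 + ["Score: 3"]
print(mean_score(replies))  # 4.93
```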

Plot by author, using matplotlib

All models clustered at 5.0 except SDF (4.93). The metric saturated.

The evaluation saturated almost immediately. Partly this reflects an easy rubric, but it also suggests that all three methods achieve surface-level persona fidelity. The differences are in depth and generalisation, not surface-level vibes. If you're deploying this in a controlled context with a fixed prompt format, you might genuinely not care which method you used.

One other measurable side effect: models trained on FP and SDF data write longer responses on average (153 and 158 words) compared to baseline and Demo (both around 136 words).

Plot by author, using matplotlib

FP and SDF models produce noticeably longer responses. The interquartile range for SDF is tighter, suggesting more consistent verbosity.

First-person statements and synthetic documents are flowing, expository prose. The model absorbed that register along with the persona. Whether that's useful or annoying depends entirely on your use case, but it's a real, measurable side effect of format choice.

What This Experiment Can't Tell You

A few honest limitations worth naming before you take any of this too far:

Single model, single character. Everything here is Qwen3-4B and C-3PO. A character with less pre-existing presence in the training data might behave very differently, and a larger model might generalise differently across formats.

500 examples is one data point. The most interesting open question is the scaling curve. How do these methods compare at 50 examples? At 2,000? My intuition is that first-person statements stay efficient at low data counts while demonstrations need more volume to generalise, but that's just a guess, not a result.

The LLM judge saturated. This means I have no fine-grained signal on how much better one method is at the vibes level. A harder rubric or human evaluation would give a cleaner picture.

LoRA r=16 is a choice. Higher rank might favour one format over another in ways I didn't explore.

So, What's the Best Way to Brainwash an LLM?

If you're doing persona injection via fine-tuning, here's the practical summary:

Use first-person statements if generalisation matters. They're not the intuitive choice, but they turn out to encode the persona more deeply. A model that has read "I am C-3PO and I find this plan deeply unwise" will sound like C-3PO in more situations than a model that has only seen C-3PO-style chat replies. The off-diagonal perplexity numbers make this case clearly.

Use demonstrations if your deployment context is fixed. If you know exactly what format users will interact with the model in, demonstrations are solid and simple. Train the model on what it will be asked to do, and it does it well. Just don't expect that to transfer.

Use SDF if factual accuracy about the persona matters most. That 3.4 perplexity on synthetic documents is genuinely impressive. But the emotional and conversational texture of a character doesn't transfer well from third-person description; consider combining SDF with FP to get factual grounding plus felt identity.

Don't underestimate a good system prompt. The baseline, just Qwen3-4B with a system prompt describing C-3PO, scored 5.0 on the judge and covered most key traits. For many use cases, that's enough. Fine-tuning earns its cost when you need robustness across prompts you can't control, or persona behaviour without a visible system prompt at all.

In practice, demonstrations teach behaviour, synthetic documents teach facts, and first-person statements teach identity.

The experiment was a weekend-long sprint, and there's a long list of things I want to follow up on. The most specific one: does FP's efficiency advantage hold at low example counts? If first-person statements are still competitive at 50 examples while demonstrations collapse, that would have real practical implications for how you build persona datasets. If you run this experiment before I do, I'd genuinely like to know whether I'm right.

Full code on GitHub. Fine-tuning was done with LoRA (r=16) on a single A40 via RunPod, using the TRL/PEFT stack. All datasets generated with Claude.
