Why Customize LLMs?
Large Language Models (LLMs) are deep learning models pre-trained with self-supervised learning, requiring enormous resources in terms of training data, training time and the number of parameters they hold. LLMs have revolutionized natural language processing, especially in the last two years, demonstrating remarkable capabilities in understanding and generating human-like text. However, these general-purpose models' out-of-the-box performance may not always meet specific business needs or domain requirements. LLMs alone cannot answer questions that rely on proprietary company data or closed-book settings, making them relatively generic in their applications. Training an LLM from scratch is largely infeasible for small to medium teams due to the demand for huge amounts of training data and resources. Therefore, a wide range of LLM customization strategies have been developed in recent years to tune the models for various scenarios that require specialized knowledge.
These customization strategies can be broadly split into two types:
- Using a frozen model: These techniques don't require updating model parameters and are typically accomplished through in-context learning or prompt engineering. They are cost-effective since they alter the model's behavior without incurring extensive training costs, and are therefore widely explored in both industry and academia, with new research papers published daily.
- Updating model parameters: This is a relatively resource-intensive approach that requires tuning a pre-trained LLM with custom datasets designed for the intended purpose. It includes popular techniques like fine-tuning and Reinforcement Learning from Human Feedback (RLHF).
These two broad customization paradigms branch out into various specialized techniques, including LoRA fine-tuning, Chain of Thought, Retrieval Augmented Generation, ReAct, and Agent frameworks. Each technique offers distinct advantages and trade-offs regarding computational resources, implementation complexity, and performance improvements.
How to Choose LLMs?
The first step in customizing LLMs is to select an appropriate foundation model as the baseline. Community-based platforms such as Hugging Face offer a wide range of open-source pre-trained models contributed by top companies and communities, such as the Llama series from Meta and Gemma from Google. Hugging Face additionally provides leaderboards, for example the Open LLM Leaderboard, to compare LLMs on industry-standard metrics and tasks (e.g. MMLU). Cloud providers (e.g. AWS) and AI companies (e.g. OpenAI and Anthropic) also offer access to proprietary models, which are typically paid services with restricted access. The following factors are essential to consider when choosing an LLM.
Open source or proprietary model: Open-source models allow full customization and self-hosting but require technical expertise, while proprietary models offer immediate access and often higher-quality responses, at a higher cost.
Task and metrics: Models excel at different tasks, including question-answering, summarization, code generation, etc. Compare benchmark metrics and test on domain-specific tasks to determine the right model.
Architecture: In general, decoder-only models (the GPT series) perform better at text generation, while encoder-decoder models (T5) handle translation well. More architectures are emerging and showing promising results, for instance the Mixture of Experts (MoE) model DeepSeek.
Number of parameters and size: Larger models (70B–175B parameters) offer better performance but need more computing power. Smaller models (7B–13B) run faster and cheaper but may have reduced capabilities.
After determining a base LLM, let's explore the six most common strategies for LLM customization, ranked in order of resource consumption from the least to the most intensive:
- Prompt Engineering
- Decoding and Sampling Strategy
- Retrieval Augmented Generation
- Agent
- Fine-Tuning
- Reinforcement Learning from Human Feedback
If you'd prefer a video walkthrough of these concepts, please check out my video "6 Common LLM Customization Strategies Briefly Explained".
LLM Customization Strategies
1. Prompt Engineering

A prompt is the input text sent to an LLM to elicit an AI-generated response, and it can be composed of instructions, context, input data and an output indicator.
Instructions: This provides a task description or instruction for how the model should perform.
Context: This is external information to guide the model to respond within a certain scope.
Input data: This is the input for which you want a response.
Output indicator: This specifies the output type or format.
Prompt engineering involves crafting these prompt components strategically to shape and control the model's response. Basic prompt engineering techniques include zero-shot, one-shot, and few-shot prompting. Users can apply basic prompt engineering directly while interacting with the LLM, making it an efficient way to align the model's behavior with a novel objective. An API implementation is also an option, with more details introduced in my previous article "A Simple Pipeline for Integrating LLM Prompt with Knowledge Graph".
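To make these components concrete, here is a minimal few-shot prompt assembled in Python; the sentiment-classification task, labels and reviews are made up purely for illustration, and the resulting string can be sent to any LLM endpoint.
# assemble a few-shot prompt from the four components described above
instructions = "Classify the sentiment of the customer review."            # instructions
context = "The reviews come from an online electronics store."             # context
few_shot_examples = (                                                       # few-shot examples
    "Review: The battery dies within an hour. Sentiment: negative\n"
    "Review: Crisp screen and fast shipping. Sentiment: positive\n"
)
input_data = "Review: The keyboard feels cheap but it works fine."          # input data
output_indicator = "Answer with one word: positive, negative or neutral."   # output indicator

prompt = f"{instructions}\n{context}\n\n{few_shot_examples}\n{input_data}\n{output_indicator}"
print(prompt)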
Due to the efficiency and effectiveness of prompt engineering, more complex approaches have been explored and developed to advance the logical structure of prompts.
Chain of Thought (CoT) asks LLMs to break down complex reasoning tasks into step-by-step thought processes, improving performance on multi-step problems. Each step explicitly exposes its reasoning outcome, which serves as the precursor context for subsequent steps until arriving at the answer.
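As a small illustration, a zero-shot CoT prompt can be produced by simply appending a reasoning trigger to the question; the arithmetic word problem below is invented for the example.
question = (
    "A warehouse holds 1,240 boxes. 380 boxes are shipped on Monday "
    "and twice as many on Tuesday. How many boxes remain?"
)
# the trailing instruction nudges the model to expose intermediate reasoning steps
cot_prompt = question + "\nLet's think step by step."
print(cot_prompt)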
Tree of Thoughts extends CoT by considering multiple different reasoning branches and self-evaluating choices to decide the next best action. It is more effective for tasks that involve initial decisions, strategies for the future and exploration of multiple solutions.
Automatic Reasoning and Tool-use (ART) builds upon the CoT process: it deconstructs complex tasks and allows the model to select few-shot examples from a task library and use predefined external tools like search and code generation.
Synergizing Reasoning and Acting (ReAct) combines reasoning trajectories with an action space, where the model searches through the action space and determines the next best action based on environmental observations.
Techniques like CoT and ReAct are often combined with an agentic workflow to strengthen their capabilities. These techniques will be introduced in more detail in the "Agent" section.
Further Reading
2. Decoding and Sampling Strategy

The decoding strategy can be controlled at model inference time through inference parameters (e.g. temperature, top p, top k), which determine the randomness and diversity of model responses. Greedy search, beam search and sampling are three common decoding strategies for auto-regressive model generation.
During the autoregressive generation process, the LLM outputs one token at a time based on a probability distribution over candidate tokens conditioned on the previous tokens. By default, greedy search is applied to produce the next token with the highest probability.
In contrast, beam search decoding considers multiple hypotheses of next-best tokens and selects the hypothesis with the highest combined probability across all tokens in the text sequence. The code snippet below uses the transformers library to specify the number of beam paths (e.g. num_beams=5 considers 5 distinct hypotheses) during the model generation process.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint; substitute any causal LM
prompt = "The future of AI is"  # placeholder prompt
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(prompt, return_tensors="pt")
model = AutoModelForCausalLM.from_pretrained(model_name)
# beam search: track 5 candidate sequences and return the highest-scoring one
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
The sampling strategy is the third approach to control the randomness of model responses, by adjusting these inference parameters:
- Temperature: Lowering the temperature makes the probability distribution sharper by increasing the likelihood of generating high-probability words and decreasing the likelihood of generating low-probability words. When temperature = 0, it becomes equivalent to greedy search (least creative); when temperature = 1, it produces the most creative outputs (a short sketch after the code snippet below shows how temperature rescales the distribution).
- Top-K sampling: This strategy filters the K most probable next tokens and redistributes the probability among those tokens. The model then samples from this filtered set of tokens.
- Top-P sampling: Instead of sampling from the K most probable tokens, top-p sampling selects from the smallest possible set of tokens whose cumulative probability exceeds the threshold p.
The example code snippet below samples from the 50 most likely tokens (top_k=50) with a cumulative probability higher than 0.95 (top_p=0.95).
model_inputs = tokenizer(prompt, return_tensors="pt")  # same tokenizer and prompt as above
sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,          # sample instead of greedy/beam search
    top_k=50,                # keep only the 50 most probable next tokens
    top_p=0.95,              # nucleus sampling: smallest set with cumulative probability > 0.95
    num_return_sequences=3,  # return 3 independently sampled continuations
)
print(tokenizer.decode(sample_outputs[0], skip_special_tokens=True))
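To make the temperature bullet above concrete, the short standalone sketch below (using PyTorch, with made-up logits) shows how dividing the logits by the temperature sharpens or flattens the distribution that the model samples from.
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # made-up next-token logits
for temperature in (0.2, 1.0, 1.5):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, [round(p, 3) for p in probs.tolist()])
# low temperature concentrates probability on the top token (closer to greedy search);
# high temperature flattens the distribution, producing more diverse samples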
Further Reading
3. RAG

Retrieval Augmented Generation (RAG), initially introduced in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", has been demonstrated as a promising solution that integrates external knowledge and reduces common LLM "hallucination" issues when handling domain-specific or specialized queries. RAG allows dynamically pulling relevant information from the knowledge domain and generally does not involve extensive training to update LLM parameters, making it a cost-effective strategy to adapt a general-purpose LLM to a specialized domain.
A RAG system can be decomposed into a retrieval stage and a generation stage. The objective of the retrieval process is to find content within the knowledge base that is closely related to the user query, by chunking the external knowledge, creating embeddings, indexing and running similarity search.
- Chunking: Documents are divided into smaller segments, with each segment containing a distinct unit of information.
- Create embeddings: An embedding model compresses each information chunk into a vector representation. The user query is also converted into a vector through the same vectorization process, so that it can be compared in the same dimensional space.
- Indexing: This process stores the text chunks and their vector embeddings as key-value pairs, enabling efficient and scalable search functionality. For large external knowledge bases that exceed memory capacity, vector databases offer efficient long-term storage.
- Similarity search: Similarity scores between the query embedding and the text chunk embeddings are calculated and used to retrieve the information most relevant to the user query.
The generation process of the RAG system then combines the retrieved information with the user query to form the augmented query, which is passed to the LLM to generate the context-rich response.
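Before looking at a full framework, the sketch below illustrates the retrieval mechanics from scratch: a handful of made-up text chunks are embedded with the sentence-transformers library (an assumed dependency) and ranked by cosine similarity against the user query.
from sentence_transformers import SentenceTransformer, util

# a toy in-memory knowledge base of pre-chunked text
chunks = [
    "LoRA fine-tunes an LLM by training low-rank weight update matrices.",
    "RAG retrieves external documents and passes them to the LLM as context.",
    "Beam search keeps several candidate sequences during decoding.",
]

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # same embedding model as the snippet below
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

query = "How can I ground an LLM in my company documents?"
query_embedding = embedder.encode(query, convert_to_tensor=True)

# similarity search: score every chunk against the query and keep the best match
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
best_chunk = chunks[int(scores.argmax())]
augmented_query = f"Context: {best_chunk}\n\nQuestion: {query}"
print(augmented_query)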
Code Snippet
The code snippet below first specifies the LLM and embedding model, then chunks the external knowledge base documents into a document collection, creates the index from the document, defines the query_engine based on the index, and queries the query_engine with the user prompt.
from llama_index.core import VectorStoreIndex, Document, Settings, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = "local:BAAI/bge-small-en-v1.5"  # resolved to a local HuggingFace embedding model

# load the external knowledge base (assumed here to live in a local "data" folder)
documents = SimpleDirectoryReader("data").load_data()
document = Document(text="\n\n".join([doc.text for doc in documents]))

index = VectorStoreIndex.from_documents([document])
query_engine = index.as_query_engine()
response = query_engine.query(
    "Tell me about LLM customization strategies."
)
print(response)
The example above shows a simple RAG system. Advanced RAG improves on this by introducing pre-retrieval and post-retrieval strategies to reduce pitfalls such as limited synergy between the retrieval and generation processes. For example, the rerank technique reorders the retrieved information using a model capable of understanding bidirectional context, and integration with a knowledge graph enables advanced query routing. More use cases can be found on the llamaindex website.
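As a rough sketch of the rerank idea, and assuming the SentenceTransformerRerank postprocessor is available in your llama_index version, a cross-encoder can re-score a generous candidate set before generation:
from llama_index.core.postprocessor import SentenceTransformerRerank

# retrieve 10 candidate chunks, then let a cross-encoder reorder them and keep the top 3
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3
)
query_engine = index.as_query_engine(
    similarity_top_k=10, node_postprocessors=[reranker]
)
response = query_engine.query("Tell me about LLM customization strategies.")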
Further Reading
4. Agent

LLM agents were a trending topic in 2024 and will likely remain a main focus in the GenAI field in 2025. Compared to RAG, an agent excels at creating query routes and planning LLM-based workflows, with the following benefits:
- Maintaining memory and state of previously generated model responses.
- Leveraging various tools based on specific criteria. This tool-using capability sets agents apart from basic RAG systems by giving the LLM independent control over tool selection.
- Breaking down a complex task into smaller steps and planning a sequence of actions.
- Collaborating with other agents to form an orchestrated system.
Several in-context learning techniques (e.g. CoT, ReAct) can be implemented through an agentic framework, and we'll discuss ReAct in more detail. ReAct, which stands for "Synergizing Reasoning and Acting in Language Models", consists of three key components: actions, thoughts and observations. This framework was introduced by Google Research and Princeton University, built upon Chain of Thought by integrating the reasoning steps with an action space that enables tool use and function calling. Additionally, the ReAct framework emphasizes determining the next best action based on environmental observations.
This example from the original paper demonstrates ReAct's inner working process, where the LLM generates the first thought and acts by calling the function "Search [Apple Remote]", then observes the feedback from its first output. The second thought is based on the previous observation, hence leading to a different action "Search [Front Row]". This process iterates until reaching the goal. The research shows that ReAct overcomes the prevalent issues of hallucination and error propagation, which are more often observed in chain-of-thought reasoning, by interacting with a simple Wikipedia API. Furthermore, through the implementation of decision traces, the ReAct framework also increases the model's interpretability, trustworthiness and diagnosability.

Code Snippet
This demonstrates a ReAct-based agent implementation using llamaindex. First, it defines two functions (multiply and add). Second, these two functions are encapsulated as FunctionTool, forming the agent's action space, to be executed based on its reasoning.
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

# create basic function tools
def multiply(a: float, b: float) -> float:
    return a * b

multiply_tool = FunctionTool.from_defaults(fn=multiply)

def add(a: float, b: float) -> float:
    return a + b

add_tool = FunctionTool.from_defaults(fn=add)

# any llama_index LLM instance works here; gpt-3.5-turbo is assumed for the example
llm = OpenAI(model="gpt-3.5-turbo")
agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)
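Assuming a valid OpenAI API key is configured, the agent can then be queried; with verbose=True it prints its thought, action and observation loop at each step.
# the agent decides which tool to call, observes the result, and iterates until done
response = agent.chat("What is (121 * 3) + 42? Use the available tools to calculate.")
print(response)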
The advantages of an agentic workflow become more substantial when combined with self-reflection or self-correction. This is a rapidly growing area, with a variety of agent architectures being explored. For instance, the Reflexion framework facilitates iterative learning by providing a summary of verbal feedback from the environment and storing the feedback in the model's memory, while the CRITIC framework empowers frozen LLMs to self-verify through interaction with external tools such as code interpreters and API calls.
Further Reading
5. Fine-Tuning

Fine-tuning is the process of feeding niche and specialized datasets to modify the LLM so that it is more aligned with a certain objective. It differs from prompt engineering and RAG as it updates the LLM weights and parameters. Full fine-tuning refers to updating all weights of the pretrained LLM through backpropagation, which requires large memory to store all weights and parameters and may suffer from a significant reduction in capability on other tasks (i.e. catastrophic forgetting). Therefore, PEFT (parameter-efficient fine-tuning) is more widely used to mitigate these caveats while saving the time and cost of model training. There are three categories of PEFT methods:
- Selective: Select a subset of initial LLM parameters to fine-tune, which can be more computationally intensive compared to other PEFT methods.
- Reparameterization: Adjust model weights by training the weights of low-rank representations. For example, Low-Rank Adaptation (LoRA) falls into this category and accelerates fine-tuning by representing the weight updates with two smaller matrices (see the sketch after this list).
- Additive: Add additional trainable layers to the model, including techniques like adapters and soft prompts.
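The sketch below shows a minimal LoRA setup with the Hugging Face peft library; the base checkpoint and hyperparameters are illustrative assumptions rather than recommendations.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base checkpoint

# LoRA freezes the base weights and learns two small low-rank matrices per target layer
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,             # rank of the low-rank update matrices
    lora_alpha=32,   # scaling factor applied to the update
    lora_dropout=0.1,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters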
The fine-tuning process is similar to a standard deep learning training process, requiring the following inputs:
- training and evaluation datasets
- training arguments defining the hyperparameters, e.g. learning rate and optimizer
- a pretrained LLM
- compute metrics and objective functions that the algorithm should be optimized for
Code Snippet
Below is an example of implementing fine-tuning using the transformers Trainer.
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir=output_dir,      # directory for checkpoints and logs
    learning_rate=1e-5,
    eval_strategy="epoch",      # run evaluation at the end of every epoch
)
trainer = Trainer(
    model=model,                # the pretrained (or PEFT-wrapped) model to fine-tune
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()
Fine-tuning has a wide range of use cases. For instance, instruction fine-tuning optimizes LLMs for conversations and instruction following by training them on prompt-completion pairs. Another example is domain adaptation, an unsupervised fine-tuning method that helps LLMs specialize in specific knowledge domains.
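For reference, a single instruction fine-tuning record is typically just a prompt-completion pair; the example below is made up to show the shape of the data.
# one illustrative training record for instruction fine-tuning
example_record = {
    "prompt": (
        "Summarize the following support ticket in one sentence:\n"
        "The app crashes every time I upload a photo larger than 10 MB."
    ),
    "completion": "The user reports that uploading photos over 10 MB crashes the app.",
}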
Further Reading
6. RLHF

Reinforcement Learning from Human Feedback, or RLHF, is a reinforcement learning technique that fine-tunes LLMs based on human preferences. RLHF operates by training a reward model on human feedback and using this model as a reward function to optimize a reinforcement learning policy through PPO (Proximal Policy Optimization). The process requires two sets of training data: a preference dataset for training the reward model, and a prompt dataset used in the reinforcement learning loop.
Let’s break it down into steps:
- Gather a preference dataset annotated by human labelers who rate different completions generated by the model according to human preference. An example format of the preference dataset is {input_text, candidate1, candidate2, human_preference}, indicating which candidate response is preferred.
- Train a reward model using the preference dataset. The reward model is essentially a regression model that outputs a scalar indicating the quality of a model-generated response. The objective of the reward model is to maximize the score margin between the winning candidate and the losing candidate (see the sketch after this list).
- Use the reward model in a reinforcement learning loop to fine-tune the LLM. The objective is to update the policy so that the LLM generates responses that maximize the reward produced by the reward model. This process uses the prompt dataset, which is a collection of prompts in the format of {prompt, response, rewards}.
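A minimal sketch of the pairwise objective commonly used for this reward-model step (a Bradley-Terry style loss; the reward scores below are made-up scalars):
import torch
import torch.nn.functional as F

reward_chosen = torch.tensor([1.8, 0.4, 2.1])     # reward model scores of preferred responses
reward_rejected = torch.tensor([0.9, -0.2, 1.5])  # reward model scores of rejected responses

# maximize the margin between chosen and rejected: minimize -log(sigmoid(r_chosen - r_rejected))
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss.item())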
Code Snippet
The open-source library TRL is widely used for implementing RLHF, and it provides template code that shows the basic RLHF setup:
- Initialize the base model and tokenizer from a pretrained checkpoint.
- Configure the PPO hyperparameters in PPOConfig, such as learning rate, epochs, and batch sizes.
- Create the PPO trainer, PPOTrainer, by combining the model, tokenizer, and training data.
- The training loop uses the step() method to iteratively update the model to optimize the rewards calculated from the query and the model response.
# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl import create_reference_model
from transformers import AutoTokenizer

# define the hyperparameters of the PPO algorithm
config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

# initialize the pretrained model (with a value head) and tokenizer
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = create_reference_model(ppo_model)  # frozen copy used as the KL-penalty reference
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

# initialize the PPO trainer with the model, tokenizer and training data
ppo_trainer = PPOTrainer(
    config=config,
    model=ppo_model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    dataset=dataset["train"],
    data_collator=collator
)

# ppo_trainer is iteratively updated through the rewards
ppo_trainer.step(query_tensors, response_tensors, rewards)
RLHF is widely applied to align model responses with human preferences. Common use cases involve reducing response toxicity and model hallucination. However, it has the downside of requiring a large amount of human-annotated data, as well as the computational costs associated with policy optimization. Therefore, alternatives like Reinforcement Learning from AI Feedback (RLAIF) and Direct Preference Optimization (DPO) have been introduced to mitigate these limitations.
Further Reading
Take-Home Message
This article briefly explains six essential LLM customization strategies: prompt engineering, decoding strategy, RAG, Agent, fine-tuning, and RLHF. I hope you find it helpful for understanding the pros and cons of each strategy, as well as how to implement them based on the practical examples.