Newton's famous remark, that he saw further only by standing on the shoulders of giants, captures a timeless truth about science. Each breakthrough rests on countless layers of prior progress, until at some point … all of it simply works. Nowhere is this more evident than in the current and ongoing revolution in natural language processing (NLP), driven by the Transformer architecture that underpins most generative AI systems today.
“If I have seen further, it is by standing on the shoulders of Giants.”
— Isaac Newton, letter to Robert Hooke, February 5, 1675 (Old Style calendar; 1676 New Style)

In this article, I take on the role of an academic Sherlock Holmes, tracing the evolution of language modelling.
A language model is an AI system trained to predict and generate sequences of words based on patterns learned from large text datasets. It assigns probabilities to word sequences, enabling applications from speech recognition and machine translation to today's generative AI systems.
Like all scientific revolutions, language modelling did not emerge overnight but builds on a rich heritage. In this article, I focus on a small slice of the vast literature in the field. Specifically, our journey begins with a pivotal earlier technology, the Relevance-Based Language Models of Lavrenko and Croft, which marked a step change in the performance of Information Retrieval systems in the early 2000s and continues to leave its mark in TREC competitions. From there, the trail leads to 2017, when Google published the seminal Attention Is All You Need paper, unveiling the Transformer architecture that revolutionised sequence-to-sequence translation tasks.
The key link between the two approaches is, at its core, quite simple: the powerful idea of attention. Just as Lavrenko and Croft's Relevance Modelling estimates which words are most likely to co-occur with a query, the Transformer's attention mechanism computes the similarity between a query and all tokens in a sequence, weighting each token's contribution to the query's contextual meaning.
In both cases, attention acts as a soft probabilistic weighting scheme, and it is this weighting that gives both methods their raw representational power.
Both models are generative frameworks over text, differing primarily in their scope: RM1 models short queries from documents, while Transformers model full sequences.
In the following sections, we will explore the background of Relevance Models and the Transformer architecture, highlighting their shared foundations and clarifying the parallels between them.
Relevance Modelling: Introducing Lavrenko's RM1 Mixture Model
Let's dive into the conceptual parallel between Lavrenko and Croft's Relevance Modelling framework in Information Retrieval and the Transformer's attention mechanism. The two emerged in different domains and eras, but they share the same intellectual DNA. We will walk through the background on Relevance Models before outlining the key link to the later Transformer architecture.
When Victor Lavrenko and W. Bruce Croft introduced the Relevance Model in the early 2000s, they offered an elegant probabilistic formulation for bridging the gap between queries and documents. At their core, these models start from a simple idea: assume there exists a hidden "relevance distribution" over vocabulary terms that characterises the documents a user would consider relevant to their query. The task then becomes estimating this distribution from the observed data, namely the user query and the document collection.
The first Relevance Modelling variant, RM1 (there were two other models in the same family, not covered in detail here), does this directly by inferring the distribution of words likely to occur in relevant documents given a query, essentially modelling relevance as a latent language model that sits "behind" both queries and documents:
P(w | R, q) ≈ Σ_d P(w | d) · P(d | q)
with the posterior probability of a document d given a query q given by:
P(d | q) = P(q | d) · P(d) / Σ_d' P(q | d') · P(d'),  where  P(q | d) = Π_{w ∈ q} P(w | d)
This is the classic unigram language model with Dirichlet smoothing used in the original paper by Lavrenko and Croft. To estimate the relevance model, RM1 uses the top-retrieved documents as pseudo-relevance feedback (PRF): it assumes the highest-scoring documents are likely to be relevant. This means that no costly relevance judgements are required, a key advantage of Lavrenko's formulation.

To build an intuition for how the RM1 model works, we will code it up step by step in Python, using a simple toy corpus of three "documents", defined below, and the query "cat".
import math
from collections import Counter, defaultdict

# -----------------------
# Step 1: Example corpus
# -----------------------
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog barked at the cat",
    "d3": "dogs and cats are friends"
}

# Query
query = ["cat"]
Next, for the purposes of this toy IR scenario, we lightly pre-process the document collection: we split the documents into tokens, count each token within each document, and define the vocabulary:
# -----------------------
# Step 2: Preprocess
# -----------------------
# Tokenize and count terms per document
doc_tokens = {d: doc.split() for d, doc in docs.items()}
doc_lengths = {d: len(toks) for d, toks in doc_tokens.items()}
doc_term_counts = {d: Counter(toks) for d, toks in doc_tokens.items()}

# Vocabulary: the set of all terms in the collection
vocab = set(w for toks in doc_tokens.values() for w in toks)
Running the above code produces the following output: four simple data structures holding the information we need to compute the RM1 relevance distribution for any query.
doc_tokens = {
    'd1': ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    'd2': ['the', 'dog', 'barked', 'at', 'the', 'cat'],
    'd3': ['dogs', 'and', 'cats', 'are', 'friends']
}
doc_lengths = {
    'd1': 6,
    'd2': 6,
    'd3': 5
}
doc_term_counts = {
    'd1': Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}),
    'd2': Counter({'the': 2, 'dog': 1, 'barked': 1, 'at': 1, 'cat': 1}),
    'd3': Counter({'dogs': 1, 'and': 1, 'cats': 1, 'are': 1, 'friends': 1})
}
vocab = {
    'the', 'cat', 'sat', 'on', 'mat',
    'dog', 'barked', 'at',
    'dogs', 'and', 'cats', 'are', 'friends'
}
Looking back at the RM1 equation defined earlier, we can break it into its key probabilistic components. P(w|d) defines the probability distribution of the words w in a document d, and it is typically computed using Dirichlet prior smoothing (Zhai & Lafferty, 2001). This prior avoids zero probabilities for unseen words and balances document-specific evidence against background collection statistics. It is defined as:
P(w | d) = ( tf(w, d) + μ · P(w | C) ) / ( |d| + μ ),  where tf(w, d) is the count of w in d, |d| is the document length, P(w | C) is the collection probability of w, and μ is the Dirichlet prior parameter.
The above equation gives us a bag-of-words unigram model for each document in our corpus. As an aside, you can imagine how these days, with powerful language models available on Hugging Face, we could swap out this formulation for, say, a BERT-based variant that uses embeddings to estimate the distribution P(w|d).
In a BERT-based approach to P(w|d), we can derive a document embedding g(d) via mean pooling and a word embedding e(w), then combine them as follows:
P(w | d) = exp( e(w) · g(d) / τ ) / Σ_{w' ∈ V} exp( e(w') · g(d) / τ )
Here V denotes the pruned vocabulary (e.g., the union of document terms) and τ is a temperature parameter. This would be the first step towards a Neural Relevance Model (NRM), a largely unexplored and potentially novel direction in the field of IR.
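As a concrete illustration, here is a minimal sketch of such a neural P(w|d), assuming a mean-pooling sentence encoder. The model name ("all-MiniLM-L6-v2"), the temperature value, and the helper name neural_p_w_given_d are illustrative assumptions, not part of the original RM1 formulation.

# A rough sketch of a neural P(w|d): embed the document (mean-pooled sentence
# embedding) and each word in a pruned vocabulary, then softmax their scaled
# similarities. Model name and tau are illustrative choices.
# Requires: pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def neural_p_w_given_d(doc_text, vocab, tau=0.1):
    words = list(vocab)
    g_d = encoder.encode([doc_text])[0]        # document embedding g(d)
    E_w = encoder.encode(words)                # word embeddings e(w)
    sims = E_w @ g_d / tau                     # scaled similarities
    sims -= sims.max()                         # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()  # softmax over the pruned vocabulary V
    return dict(zip(words, probs))

# Example: neural_p_w_given_d("the cat sat on the mat", ["cat", "dog", "mat", "sat"])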
Back to the original formulation: this smoothed estimate can be coded up in Python as our first component, P(w|d):
# -----------------------
# Step 3: P(w|d)
# -----------------------
def p_w_given_d(w, d, mu=2000):
    """Dirichlet-smoothed document language model."""
    tf = doc_term_counts[d][w]
    doc_len = doc_lengths[d]
    # collection probability of w
    cf = sum(doc_term_counts[dd][w] for dd in docs)
    collection_len = sum(doc_lengths.values())
    p_wc = cf / collection_len
    return (tf + mu * p_wc) / (doc_len + mu)
Next up, we compute the query likelihood under the document model, P(q|d):
# -----------------------
# Step 4: P(q|d)
# -----------------------
def p_q_given_d(q, d):
    """Query likelihood under document d."""
    score = 0.0
    for w in q:
        score += math.log(p_w_given_d(w, d))
    return math.exp(score)  # return the likelihood, not the log
RM1 requires P(d|q), so we flip the probability P(q|d) using Bayes' rule:
# -----------------------
# Step 5: P(d|q)
# -----------------------
def p_d_given_q(q):
    """Posterior distribution over documents given query q."""
    # Compute query likelihoods for all documents
    scores = {d: p_q_given_d(q, d) for d in docs}
    # Assume a uniform prior P(d), so the posterior is proportional to the likelihoods
    Z = sum(scores.values())  # normalisation constant
    return {d: scores[d] / Z for d in docs}
We assume here that the document prior is uniform, so it cancels. We then normalise across all documents so that the posteriors sum to 1:
P(d | q) = P(q | d) / Σ_d' P(q | d')
Similarly to P(w|d), it is worth considering how we could neuralise the P(d|q) term in RM1. A first approach would be to use an off-the-shelf cross- or dual-encoder model (such as an MS MARCO fine-tuned BERT cross-encoder) to score the query against each document and normalise the scores with a softmax:
P(d | q) = exp( s(q, d) / τ ) / Σ_d' exp( s(q, d') / τ ),  where s(q, d) is the cross-encoder relevance score.
With P(d|q) and P(w|d) converted to neural, embedding-based estimates, we can plug the two together to get a simple initial version of a neural RM1 model that still gives us back P(w|q).
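Here is a minimal sketch of that idea. The specific checkpoint ("cross-encoder/ms-marco-MiniLM-L-6-v2"), the temperature, and the helper name neural_p_d_given_q are illustrative assumptions rather than part of the original formulation.

# A rough sketch of a neural P(d|q): score each (query, document) pair with an
# MS MARCO cross-encoder and softmax the scores over the collection.
# Requires: pip install sentence-transformers
import numpy as np
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def neural_p_d_given_q(query_text, docs, tau=1.0):
    doc_ids = list(docs)
    pairs = [(query_text, docs[d]) for d in doc_ids]
    s = np.array(cross_encoder.predict(pairs))  # relevance scores s(q, d)
    s = (s - s.max()) / tau                     # numerical stability + temperature
    p = np.exp(s) / np.exp(s).sum()             # softmax over documents
    return dict(zip(doc_ids, p))

# Example: neural_p_d_given_q("cat", docs)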
For the purposes of this article, however, we will switch back to the classic RM1 formulation. Let's run the (non-neural, standard RM1) code so far to see the output of the various components we have just discussed. Recall that our toy document corpus is:
d1: "the cat sat on the mat"
d2: "the canine barked on the cat"
d3: "canine and cats are mates"
With Dirichlet smoothing and μ = 2000, the smoothed values are pulled very close to the collection probability of "cat", because the documents are so short relative to μ. To make the contrast between documents visible, the illustrative figures below use the unsmoothed term frequencies:
- d1: "cat" appears once in 6 words → P(q|d1) is roughly 0.16
- d2: "cat" appears once in 6 words → P(q|d2) is roughly 0.16
- d3: "cat" never appears → P(q|d3) is roughly 0 (with smoothing, a small non-zero value)
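As a quick sanity check, we can also print the actual smoothed query likelihoods using the functions defined above; note that with μ = 2000 on a three-document toy collection the smoothing term dominates and the three values come out nearly equal.

# Print the Dirichlet-smoothed query likelihood for each document.
# With mu=2000 the smoothing dominates this tiny corpus and the three values
# are nearly identical; lower mu in p_w_given_d to let the document evidence
# show through.
for d in docs:
    print(d, round(p_q_given_d(query, d), 4))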
Normalising the illustrative values gives us the posterior distribution:
{'P(d1|q)': 0.4997, 'P(d2|q)': 0.4997, 'P(d3|q)': 0.0006}
What is the key difference between P(d|q) and P(q|d)?
P(q|d) tells us how well the document "explains" the query. Imagine that each document is itself a mini language model: if it were generating text, how likely would it be to produce the words we see in the query? This probability is high if the query terms look natural under the document's word distribution. For example, for the query "cat", a document that literally mentions "cat" will give a high likelihood; one about "dogs and cats" a little less; one about "Charles Dickens" close to zero.
In contrast, the probability P(d|q) codifies how much we should trust the document given the query. This flips the perspective using Bayes' rule: now we ask, given the query, what is the probability that the user's relevant document is d?
So instead of evaluating how well the document explains the query, we treat documents as competing hypotheses for relevance and normalise them into a distribution over all documents. This turns a ranking score into probability mass: the higher it is, the more likely this document is to be relevant compared to the rest of the collection.
We now have all the components needed to finish our implementation of Lavrenko's RM1 model:
# -----------------------
# Step 6: RM1: P(w|R,q)
# -----------------------
def rm1(q):
    pdq = p_d_given_q(q)
    pwRq = defaultdict(float)
    # Mix the document language models, weighted by P(d|q)
    for w in vocab:
        for d in docs:
            pwRq[w] += p_w_given_d(w, d) * pdq[d]
    # Normalise over the vocabulary
    Z = sum(pwRq.values())
    for w in pwRq:
        pwRq[w] /= Z
    return dict(sorted(pwRq.items(), key=lambda x: -x[1]))
# -----------------------
We can now see that RM1 defines a probability distribution over the vocabulary that tells us which words are most likely to occur in documents relevant to the query. This distribution can then be used for query expansion, by adding high-probability terms to the query, or for re-ranking documents by measuring the KL divergence between each document's language model and the query's relevance model (a sketch of the latter follows the example output below).
Top terms from RM1 for query ['cat']
cat      0.1100
the      0.1050
dog      0.0800
sat      0.0750
mat      0.0750
barked   0.0700
on       0.0700
at       0.0680
dogs     0.0650
friends  0.0630
In our toy example, the term "cat" naturally rises to the top, since it matches the query directly. High-frequency background words like "the" also score strongly, although in practice these would be filtered out as stop words. More interestingly, content words from documents containing "cat" (such as sat, mat, dog, barked) are elevated as well. This is the power of RM1: it introduces related terms not present in the query itself, without requiring explicit relevance judgements or supervision. Words unique to d3 (e.g., friends, dogs, cats) receive small but non-zero probabilities thanks to smoothing.
RM1 defines a query-specific relevance model: a language model induced from the query, estimated by averaging over the documents likely to be relevant to that query.
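To make the re-ranking use case mentioned earlier concrete, here is a small sketch of KL-divergence re-ranking against the relevance model, reusing the rm1 and p_w_given_d functions defined above. Ranking by the negative KL divergence reduces to a cross-entropy score, since the entropy of P(w|R,q) is the same for every document; the helper name kl_rerank is ours.

# Sketch of KL-divergence re-ranking: score each document by the (negative) KL
# divergence between the relevance model P(w|R,q) and the document's own
# language model P(w|d). The document-independent entropy term is dropped,
# leaving a cross-entropy ranking.
import math

def kl_rerank(q):
    rel_model = rm1(q)  # P(w|R,q)
    scores = {}
    for d in docs:
        scores[d] = sum(p * math.log(p_w_given_d(w, d)) for w, p in rel_model.items())
    return dict(sorted(scores.items(), key=lambda x: -x[1]))

print(kl_rerank(["cat"]))  # documents ordered by closeness to the relevance model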
Having now seen how RM1 builds a query-specific language model by reweighting document terms according to their posterior relevance, it is hard not to notice the parallel with what came much later in deep learning: the attention mechanism in Transformers.
In RM1, we estimate a new distribution P(w|R,q) over words by combining document language models, weighted by how likely each document is to be relevant given the query. The Transformer architecture does something rather similar: given a token (the "query"), it computes a similarity to all other tokens (the "keys"), then uses these scores to weight their "values". This produces a new, context-sensitive representation of the query token.
Lavrenko's RM1 Model as a "proto-Transformer"
The attention mechanism, introduced as part of the Transformer architecture, was designed to overcome a key weakness of earlier sequence models such as LSTMs and RNNs: their short memory horizons. While recurrent models struggled to capture long-range dependencies, attention made it possible to directly connect any token in a sequence with any other, regardless of the distance between them.
What is interesting is that the mathematics of attention looks very similar to what RM1 was doing many years earlier. In RM1, as we have seen, we build a query-specific distribution by weighting documents; in Transformers, we build a token-specific representation by weighting the other tokens in the sequence. The principle is the same, assign probability mass to the most relevant context, but applied at the token level rather than the document level.
If you strip Transformers down to their essence, the attention mechanism is essentially RM1 applied at the token level.
This might be seen as a bold claim, so it is incumbent upon us to offer some evidence!
Let's first dig a little deeper into the attention mechanism; for a fuller and deeper dive, I defer to the wealth of high-quality introductory material already available.
In the Transformer's attention layer, known as scaled dot-product attention, we take a query vector q and compute its similarity to the keys k of all other tokens. These similarities are normalised into weights through a softmax. Finally, the weights are used to combine the corresponding values v, producing a new, context-aware representation of the query token.
Scaled dot-product attention is defined as:
Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V
Here, Q denotes the query vector(s), K the key vectors (the documents, in our analogy), and V the value vectors (the words/features to be mixed). The softmax yields a normalised distribution over the keys.
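To ground the formula, here is a minimal NumPy sketch of single-head scaled dot-product attention for one query vector; the dimensions and random inputs are purely illustrative.

# Minimal single-head scaled dot-product attention for one query vector.
import numpy as np

def scaled_dot_product_attention(q, K, V):
    """q: (d_k,) query; K: (n, d_k) keys; V: (n, d_v) values."""
    d_k = K.shape[1]
    scores = K @ q / np.sqrt(d_k)            # similarity of the query to every key
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V, weights              # context vector and attention weights

rng = np.random.default_rng(0)
q = rng.normal(size=4)       # one query token
K = rng.normal(size=(3, 4))  # three "keys" (documents, in our analogy)
V = rng.normal(size=(3, 5))  # their "values" (content to be mixed)
context, weights = scaled_dot_product_attention(q, K, V)
print(weights)  # a probability distribution over the three keys
print(context)  # the attention-weighted mixture of the values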
Now, recall RM1 (Lavrenko & Croft, 2001):
P(w | R, q) ≈ Σ_d P(w | d) · P(d | q)
The attention weights in scaled dot-product attention parallel the document–query distribution P(d|q) in RM1. Reformulating attention in per-query form makes this connection explicit:
attn(q) = Σ_i α_i · v_i,  where  α_i = exp( q · k_i / √d_k ) / Σ_j exp( q · k_j / √d_k )
The value vector v in attention can be thought of as akin to P(w|d) in the RM1 model, except that instead of an explicit word distribution, v is a dense semantic vector, a low-rank surrogate for the full distribution. It is effectively the content we mix together once we have arrived at the relevance scores for each document.
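To make the analogy concrete, here is a small sketch that recasts the toy RM1 computation from earlier in attention form, reusing docs, vocab, p_w_given_d and p_q_given_d. The per-document log query-likelihoods act as attention scores, the softmax over them recovers exactly the P(d|q) computed by p_d_given_q (under a uniform prior), and each document's word distribution P(w|d) plays the role of a value vector.

# RM1 recast as attention over documents: scores are log query-likelihoods,
# the softmax over them is P(d|q), and the "values" are the per-document
# word distributions P(w|d).
import math
import numpy as np

doc_ids = list(docs)
vocab_list = sorted(vocab)

# "Attention scores": one log-likelihood per document for the query
scores = np.array([math.log(p_q_given_d(["cat"], d)) for d in doc_ids])

# Softmax over the scores == P(d|q) under a uniform document prior
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# "Value" matrix: each row is a document's word distribution P(w|d)
values = np.array([[p_w_given_d(w, d) for w in vocab_list] for d in doc_ids])

# Attention output == the RM1 mixture P(w|R,q)
rm1_vector = weights @ values
rm1_vector /= rm1_vector.sum()  # renormalise over words, as rm1() does

for w, p in sorted(zip(vocab_list, rm1_vector), key=lambda x: -x[1])[:5]:
    print(f"{w}\t{p:.4f}")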
Zooming out to the broader Transformer architecture, multi-head attention can be seen as running several RM1-style relevance models in parallel, each with its own projections.
We can also draw further parallels with the broader Transformer architecture.
- Robust probability estimation: we have already discussed that RM1 needs smoothing (e.g., Dirichlet) to handle zero counts and avoid overfitting to rare terms. Similarly, Transformers use residual connections and layer normalisation to stabilise training and avoid collapsing attention distributions. Both models enforce robustness in probability estimation when the data signal is sparse or noisy.
- Pseudo-relevance feedback: RM1 performs a single round of probabilistic expansion through pseudo-relevance feedback (PRF), limiting attention to the top-K retrieved documents. The PRF set functions like an attention context window: the query distributes probability mass over a restricted set of documents, and terms are reweighted accordingly. Similarly, Transformer attention is limited to the local input sequence. Unlike RM1, however, Transformers stack many layers of attention, each reweighting and refining token representations. Deep attention stacking can thus be seen as iterative pseudo-relevance feedback, repeatedly pooling over related context to build richer representations. A minimal PRF-style variant of our rm1 function is sketched below.
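As referenced in the PRF point above, here is a minimal PRF-style variant of the rm1 function that restricts the mixture to the top-K documents by posterior P(d|q); the helper name rm1_prf and the default K = 2 are illustrative choices for this toy corpus.

# PRF-style RM1: build the relevance model only from the top-K documents,
# mirroring how RM1 is estimated from top-retrieved documents in practice.
# Reuses p_d_given_q, p_w_given_d, docs and vocab from the steps above.
from collections import defaultdict

def rm1_prf(q, k=2):
    pdq = p_d_given_q(q)
    top_k = sorted(pdq, key=pdq.get, reverse=True)[:k]  # pseudo-relevant set
    pwRq = defaultdict(float)
    for w in vocab:
        for d in top_k:
            pwRq[w] += p_w_given_d(w, d) * pdq[d]
    Z = sum(pwRq.values())  # renormalise over words
    return {w: p / Z for w, p in sorted(pwRq.items(), key=lambda x: -x[1])}

print(list(rm1_prf(["cat"]).items())[:5])  # top terms from the PRF-restricted model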
The analogy between RM1 and the Transformer is summarised below, tying the components together and drawing the links between them:
- Query q (the user's query terms) ↔ query vector q (the token being contextualised)
- Documents d in the collection ↔ key vectors k
- P(d|q), the posterior over documents ↔ the softmax attention weights
- P(w|d), the document language model ↔ the value vectors v
- Dirichlet smoothing ↔ residual connections and layer normalisation
- Pseudo-relevance feedback over the top-K documents ↔ attention over the local context window
- The RM1 output P(w|R,q) ↔ the context-aware representation of the query token
RM1 expressed a powerful but general idea: relevance can be understood as weighting mixtures of content based on similarity to a query.
Nearly two decades later, the same principle re-emerged in the Transformer's attention mechanism, now at the level of tokens rather than documents. What began as a statistical model for query expansion in Information Retrieval evolved into the mathematical core of modern Large Language Models (LLMs). It is a reminder that beautiful ideas in science rarely disappear; they travel forward through time, reshaped and reinterpreted in new contexts.
Through the written word, scientists carry ideas across generations, quietly binding together waves of innovation, until, all of a sudden, a breakthrough emerges.
Sometimes the simplest ideas are the most powerful. Who would have imagined that "attention" could become the key to unlocking language? And yet, it is.
Conclusions and Final Thoughts
In this article, we have traced one branch of the vast tree that is language modelling, uncovering a compelling connection between the development of relevance models in early information retrieval and the emergence of Transformers in modern NLP. RM1, the first variant in the family of relevance models, was in many ways a proto-Transformer for IR, foreshadowing the mechanism that would later reshape how machines understand language.
We even sketched a neural variant of the Relevance Model, using modern encoder-only models, thereby unifying the past (relevance model) and the present (Transformer architecture) within the same probabilistic framework!
At the start, we invoked Newton's image of standing on the shoulders of giants. Let us close with another of his reflections:
“I do not know what I may appear to the world, but to myself I seem to have been only like a boy playing on the sea-shore, and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me.” Newton, Isaac. Quoted in David Brewster, Memoirs of the Life, Writings, and Discoveries of Sir Isaac Newton, Vol. 2 (1855), p. 407.
I hope you agree that the path from RM1 to Transformers is just such a discovery: a highly polished pebble on the shore of a much greater ocean of AI discoveries yet to come.
Disclaimer: The views and opinions expressed in this article are my own and do not represent those of my employer or any affiliated organisations. The content is based on personal experience and reflection, and should not be taken as professional or academic advice.