In my previous posts, I’ve talked rather a lot about Retrieval Augmented Generation (RAG). In particular, I’ve covered the fundamentals of the RAG methodology, along with a bunch of related concepts, like chunking, embeddings, reranking, and retrieval evaluation.
The standard RAG methodology is so useful because it allows us to search for relevant pieces of text in a large knowledge base based on the meaning of the text rather than its exact words. In this way, it allows us to apply the power of AI to our own custom documents. Ironically, as useful as this similarity search is, it sometimes fails to retrieve pieces of text that are exact matches to the user’s prompt. More specifically, when searching a large knowledge base, specific keywords (such as particular technical terms or names) may get lost, and relevant chunks may not be retrieved even when the user’s query contains those exact terms.
Fortunately, this issue can be easily tackled by utilising an older keyword-based search technique, like BM25 (Best Matching 25). Then, by combining the results of the similarity search and the BM25 search, we can essentially get the best of both worlds and significantly improve the results of our RAG pipeline.
. . .
In information retrieval systems, BM25 is a ranking function used to evaluate how relevant a document is to a search query. Unlike similarity search, BM25 evaluates the document’s relevance to the user’s query based not on the semantic meaning of the document, but rather on the exact terms it contains. More specifically, BM25 is a bag-of-words (BoW) model, meaning that it doesn’t take into account the order of the words in a document (from which the semantic meaning emerges), but rather the frequency with which each term appears in the document.
The BM25 score for a given query q containing terms t and a document d can be (not so) easily calculated as follows:

$$\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1 \left(1 - b + b\,\frac{|d|}{\text{avgdl}}\right)}$$

😿
Since this expression can be a bit overwhelming, let’s take a step back and look at it piece by piece.
. . .
Starting simple with TF-IDF
The basic concept underlying BM25 is TF-IDF (Term Frequency – Inverse Document Frequency). TF-IDF is a fundamental information retrieval concept that aims to measure how important a word is in a particular document of a knowledge base. In other words, it measures in how many documents of the knowledge base a term appears, allowing us to express how specific and informative a term is for a particular document. The rarer a term is in the knowledge base, the more informative it is considered to be for a particular document.
In particular, for a document d in a knowledge base and a term t, the Term Frequency TF(t,d) can be defined as follows:

$$\text{TF}(t,d) = \frac{f(t,d)}{|d|}$$

where f(t,d) is the number of times t appears in d, and |d| is the length of the document,
and the Inverse Document Frequency IDF(t) can be defined as follows:

$$\text{IDF}(t) = \log\left(\frac{N}{\text{DF}(t)}\right)$$

where N is the total number of documents in the knowledge base, and DF(t) is the Document Frequency, i.e. the number of documents containing the term t.
Then, the TF-IDF score can be calculated as the product of TF and IDF as follows:

$$\text{TF-IDF}(t,d) = \text{TF}(t,d) \cdot \text{IDF}(t)$$
. . .
Let’s do a quick example to get a better grip on TF-IDF. Let’s assume a tiny knowledge base containing three movies with the following descriptions:
- “A sci-fi thriller about time travel and a dangerous journey across alternate realities.”
- “A romantic drama about two strangers who fall in love during unexpected time travel.”
- “A sci-fi adventure featuring an alien explorer forced to travel across galaxies.”
After removing the stopwords, we can consider the following terms in each document:
- document 1: sci-fi, thriller, time, travel, dangerous, journey, alternate, realities
- size of document 1, |d1| = 8
- document 2: romantic, drama, two, strangers, fall, love, unexpected, time, travel
- size of document 2, |d2| = 9
- document 3: sci-fi, adventure, featuring, alien, explorer, forced, travel, galaxies
- size of document 3, |d3| = 8
- total documents in knowledge base N = 3
We can then calculate f(t,d) for each term in each document. Since every term listed above appears exactly once in its document, f(t,d) = 1 for every term–document pair (and 0 where the term is absent).
Next, for each term, we also calculate the Document Frequency and the Inverse Document Frequency (using the natural logarithm here):

| term(s) | DF(t) | IDF(t) = log(N / DF(t)) |
|---|---|---|
| travel | 3 | log(3/3) = 0 |
| time, sci-fi | 2 | log(3/2) ≈ 0.405 |
| all other terms | 1 | log(3/1) ≈ 1.099 |
And then, finally, we calculate the TF-IDF score of each term:

| term(s) | doc 1 | doc 2 | doc 3 |
|---|---|---|---|
| travel | 0 | 0 | – |
| time | 0.051 | 0.045 | – |
| sci-fi | 0.051 | – | 0.051 |
| thriller, dangerous, journey, alternate, realities | 0.137 | – | – |
| romantic, drama, two, strangers, fall, love, unexpected | – | 0.122 | – |
| adventure, featuring, alien, explorer, forced, galaxies | – | – | 0.137 |
So, what can we get from this? Let’s look, for example, at the TF-IDF scores of document 1. The word ‘travel’ is not informative at all, since it is included in all documents of the knowledge base. On the flip side, words like ‘thriller’ and ‘dangerous’ are very informative, especially for document 1, since they are only included in it.
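To make the arithmetic concrete, here is a minimal sketch of this TF-IDF calculation in Python, using the three tokenized movie descriptions above (assuming the natural logarithm; the exact log base varies between implementations):

```python
import math

# tokenized documents from the example (stopwords removed)
docs = {
    "d1": ["sci-fi", "thriller", "time", "travel", "dangerous", "journey", "alternate", "realities"],
    "d2": ["romantic", "drama", "two", "strangers", "fall", "love", "unexpected", "time", "travel"],
    "d3": ["sci-fi", "adventure", "featuring", "alien", "explorer", "forced", "travel", "galaxies"],
}
N = len(docs)

def tf(t, d):
    # term frequency: occurrences of t in d, normalized by document length
    return d.count(t) / len(d)

def idf(t):
    # inverse document frequency: log of (total docs / docs containing t)
    df = sum(t in d for d in docs.values())
    return math.log(N / df)

def tf_idf(t, d):
    return tf(t, d) * idf(t)

# 'travel' appears in every document, so it carries no information
print(round(tf_idf("travel", docs["d1"]), 3))    # 0.0
# 'thriller' appears only in d1, so it is highly informative for d1
print(round(tf_idf("thriller", docs["d1"]), 3))  # 0.137
```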
In this way, the TF-IDF score provides a simple and straightforward way to identify and quantify the importance of the words in each document of a knowledge base. To put it differently, the higher the total score of the words in a document, the rarer the information in this document is compared to the information contained in all other documents in the knowledge base.
. . .
Understanding the BM25 score
In BM25, we utilise the TF-IDF concept in order to quantify how informative (how rare or important) each document in a knowledge base is with respect to a particular query. To do this, for the BM25 calculation, we only take into account the terms of each document that are contained in the user’s query, and perform a calculation rather similar to TF-IDF.
BM25 builds on the TF-IDF concept, but with a few mathematical tweaks in order to address the two main weaknesses of TF-IDF.
. . .
The first pain point of TF-IDF is that TF is linear in the number of times a term t appears in a document d, f(t,d), as is any function of the form:

$$\text{TF}(t,d) = c \cdot f(t,d)$$
This means that the more times a term t appears in a document d, the more TF grows, linearly, which, as you may imagine, can be problematic for large documents, where a term may appear over and over without necessarily being correspondingly more important.
A simple way to resolve this is to use a saturation curve instead of a linear function. This means that the output increases with the input but approaches a maximum limit asymptotically, unlike the linear function, where the output increases with the input forever:

$$y = \frac{x}{x + k}$$
Thus, we can try to rewrite TF in this form, introducing a parameter k₁, which allows us to control the frequency scaling. In this way, the parameter k₁ introduces diminishing returns. That is, the first occurrence of the term t in a document has a big effect on the TF score, while the twentieth appearance only adds a small extra gain.

$$\text{TF}(t,d) = \frac{f(t,d)}{f(t,d) + k_1}$$
Nonetheless, this would result in values in the range 0 to 1. We can tweak this a bit more and add a (k₁ + 1) factor in the numerator, so that the resulting values of TF are comparable with the initial definition of TF used in TF-IDF.

$$\text{TF}(t,d) = \frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1}$$
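As a quick illustration (my own sketch, not from the derivation above), the saturated TF shows exactly this diminishing-returns behaviour; note how the gain from each additional occurrence keeps shrinking:

```python
def tf_saturated(f, k1=1.5):
    # saturating term frequency: grows with f but never exceeds k1 + 1
    return f * (k1 + 1) / (f + k1)

# k1 + 1 = 2.5 is the asymptotic ceiling
for f in [1, 2, 5, 20, 100]:
    print(f, round(tf_saturated(f), 2))
```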
. . .
So far, so good, but one important piece of information still missing from this expression is the size of the document |d|, which was included in the initial calculation of TF. However, before adding the |d| term back, we also need to modify it a little, since this is the second pain point of the initial TF-IDF expression. More specifically, the issue is that a knowledge base is going to contain documents of variable lengths |d|, resulting in the scores of different terms not being comparable. BM25 resolves this by normalizing |d|. That is, instead of |d|, the following expression is used:

$$1 - b + b\,\frac{|d|}{\text{avgdl}}$$
where avgdl is the average document length in the knowledge base. Furthermore, b is a parameter in [0, 1] that controls the length normalization, with b = 0 corresponding to no normalization and b = 1 corresponding to full normalization.
So, adding the normalised expression of |d|, we get the fancier version of TF used in BM25. This looks as follows:

$$\text{TF}_{\text{BM25}}(t,d) = \frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1\left(1 - b + b\,\frac{|d|}{\text{avgdl}}\right)}$$
Usually, the parameter values used are k₁ ≈ 1.2 to 2.0 and b ≈ 0.75.
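Putting the saturation and the length normalization together, a small sketch (with the typical k₁ = 1.5 and b = 0.75, and a made-up average length of 100 terms) shows how the same raw frequency is worth less in a longer-than-average document:

```python
def tf_bm25(f, doc_len, avgdl, k1=1.5, b=0.75):
    # BM25 term frequency: saturation plus document-length normalization
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return f * (k1 + 1) / (f + norm)

# a term appearing 3 times in a short vs. a long document (avgdl = 100)
print(round(tf_bm25(3, doc_len=50, avgdl=100), 3))   # shorter doc scores higher
print(round(tf_bm25(3, doc_len=200, avgdl=100), 3))  # longer doc scores lower
```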
. . .
BM25 also uses a slightly altered expression for the IDF calculation, as follows:

$$\text{IDF}(t) = \log\left(\frac{N - n(t) + 0.5}{n(t) + 0.5} + 1\right)$$

where n(t) is the number of documents containing the term t (the Document Frequency from before).
This expression is derived by asking a better question. In the initial IDF calculation, we ask:
“How rare is the term?”
Instead, when calculating the IDF for BM25, we ask:
“How much more likely is this term to appear in relevant documents than in non-relevant ones?”
The probability of a document containing the term t, in a knowledge base of N documents, can be expressed as:

$$P(t) = \frac{n(t)}{N}$$
We can then express the odds of a document containing a term t versus not containing it as:

$$\frac{n(t)}{N - n(t)}$$
And then, taking the inverse, we end up with:

$$\frac{N - n(t)}{n(t)}$$
Similarly to the typical IDF, we take the log of this expression to compress the extreme values. A further transformation known as Robertson–Spärck Jones smoothing is also performed (adding 0.5 to the numerator and the denominator), and in this way we finally get the IDF expression used in BM25.
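A minimal sketch of this IDF variant (the corpus size of 1,000 documents is made up for illustration); rare terms still score high, while even a term appearing in nearly every document keeps a small positive score thanks to the +1 inside the log:

```python
import math

def idf_bm25(df, N):
    # BM25 IDF with 0.5 smoothing and +1 inside the log (stays positive)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

N = 1000
# the rarer the term (lower df), the higher its IDF
for df in [1, 10, 100, 500, 999]:
    print(df, round(idf_bm25(df, N), 3))
```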
. . .
Ultimately, we can calculate the BM25 score of a particular document d for a given query q that contains one or more terms t:

$$\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1 \left(1 - b + b\,\frac{|d|}{\text{avgdl}}\right)}$$
In this way, we can score the documents available in a knowledge base based on their relevance to a particular query, and then retrieve the most relevant ones.
All this is just to say that the BM25 score is something like the much more easily understood TF-IDF score, but a bit more refined. As such, BM25 is very popular for performing keyword searches and can be used in our case for keyword search in a RAG system.
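Putting all the pieces together, a self-contained sketch of the full BM25 scoring (reusing the tokenized movie descriptions from the TF-IDF example) could look like this:

```python
import math

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    # score one document against a query of terms, given the full corpus
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for t in query:
        f = doc.count(t)
        df = sum(t in d for d in docs)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * tf
    return score

docs = [
    ["sci-fi", "thriller", "time", "travel", "dangerous", "journey", "alternate", "realities"],
    ["romantic", "drama", "two", "strangers", "fall", "love", "unexpected", "time", "travel"],
    ["sci-fi", "adventure", "featuring", "alien", "explorer", "forced", "travel", "galaxies"],
]
query = ["time", "travel"]
# documents 1 and 2 contain both query terms, so they should outrank document 3
for i, d in enumerate(docs, 1):
    print(f"doc {i}:", round(bm25_score(query, d, docs), 3))
```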
RAG with Hybrid Search
So, now that we have an idea of how BM25 works and scores the various documents in a knowledge base based on the frequency of keywords, we can take a look at how BM25 scores are incorporated into a conventional RAG pipeline.
As discussed in several of my previous posts, a very simple RAG pipeline would look something like this:

Such a pipeline uses a similarity score (like cosine similarity) of embeddings in order to search for, find, and retrieve chunks that are semantically similar to the user’s query. While similarity search is very useful, it may sometimes miss exact matches. Thus, by incorporating a keyword search on top of the similarity search in the RAG pipeline, we can identify relevant chunks more effectively and comprehensively. This would change our pipeline as follows:

For each text chunk, apart from the embedding, we now also calculate a BM25 index, allowing for the quick calculation of the respective BM25 scores for various user queries. In this way, for each user query, we can identify the chunks with the highest BM25 scores – that is, the chunks that contain the rarest, most informative terms with respect to the user’s query in comparison to all other chunks in the knowledge base.
Notice how we now match the user’s query both against the embeddings in the vector store (semantic search) and against the BM25 index (keyword search). Different chunks are retrieved by the semantic search and the keyword search – the retrieved chunks are then combined, deduplicated, and ranked using rank fusion.
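As a sketch of that fusion step (the chunk ids and ranked lists here are made up for illustration), reciprocal rank fusion combines the two ranked lists by rewarding chunks that rank high in either list, deduplicating them along the way:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: ranked lists of chunk ids, best first; k damps the
    # influence of top ranks (60 is a commonly used default)
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["c3", "c1", "c7"]  # from embedding similarity
keyword_hits = ["c3", "c7", "c2"]   # from BM25
# 'c3' wins: it ranks first in both lists
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
```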
. . .
On my mind
Integrating BM25 keyword search into a RAG pipeline allows us to get the best of both worlds: the semantic understanding of embeddings and the precision of exact keyword matching. By combining these approaches, we can retrieve the most relevant chunks more reliably, even from a larger knowledge base, ensuring that important phrases, technical terms, or names are not missed. In this way, we can significantly improve the effectiveness of our retrieval process and make sure that no important relevant information is left behind.