HNSW at Scale: Why Your RAG System Gets Worse as the Vector Database Grows

January 8, 2026
in Artificial Intelligence

If you use a modern vector database—Neo4j, Milvus, Weaviate, Qdrant, Pinecone—there's a very high chance that Hierarchical Navigable Small World (HNSW) is already powering your retrieval layer. It's quite likely you didn't choose it while building the database, nor did you tune it or even know it's there. And yet, HNSW is quietly deciding what your LLM sees as truth. It determines which document chunks are fed into your RAG pipeline, which memories your agent recalls, and ultimately, whether the model answers correctly—or hallucinates confidently.

As your vector database grows, retrieval quality degrades progressively:

  • No exceptions are raised
  • No errors are logged
  • Latency often looks perfectly fine

But the context quality deteriorates, and your RAG system becomes less reliable over time—even though the embedding model and distance metric remain unchanged.

In this article, I demonstrate—using controlled experiments and real data—how HNSW affects retrieval quality as database size grows, why this degradation is worse than with flat search, and what you can realistically do about it in production RAG systems.

Specifically, I'll:

  • Build a practical, reproducible use case to measure the effect of HNSW on RAG retrieval quality using Recall@k.
  • Show that, for fixed HNSW settings, recall degrades faster than flat search as the corpus grows.
  • Discuss practical tuning strategies for balancing recall and latency beyond simply increasing HNSW's ef_search.

What is HNSW?

HNSW is a graph-based algorithm for Approximate Nearest Neighbor (ANN) search. It organizes data into multiple layers of connected neighbors and uses this graph structure to speed up search.

HNSW illustration

Each vector is connected to a limited number of neighbors in each layer. During a search, HNSW performs a greedy search through these layers, and the number of neighbors checked at each layer is constant (controlled by M and ef_search), which makes the search process logarithmic with respect to the number of vectors. Compared to flat search, where time complexity is O(N), HNSW search has a time complexity of O(log N), which means the time required for a search grows very slowly (logarithmically) compared to linear search. We will see this in the results of our use case.

Parameters of the HNSW index

1. Build-time parameters: M and ef_construction. These can only be set before building the database.

M defines the maximum number of connections (neighbors) that each vector (node) can have in each layer of the graph. A higher M means more connections, making the graph denser and potentially increasing recall, but at the cost of more memory and slower indexing.

ef_construction controls the size of the candidate set used during the construction of the graph. Essentially, it governs how thoroughly the graph is built during indexing. A higher value for ef_construction means the graph is built more thoroughly, with more candidates considered before making each connection, which results in a higher quality graph and better recall at the cost of increased memory and slower indexing.

For a general-purpose RAG application, typical values of M are within a range of 12 to 48 and ef_construction between 64 and 200.

2. Query-time parameter: ef_search

This defines the number of candidate nodes (or vectors) to explore during the query process (i.e., during the search for nearest neighbors). It controls how thorough the search process is by determining how many candidates are evaluated before the search result is returned. A higher value for ef_search means the search will explore more candidates, leading to better recall but potentially slower queries.
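
To make these knobs concrete, here is a minimal sketch, under my own placeholder values, of how M, ef_construction and ef_search are set when building and querying an HNSW index with FAISS (the same library used in the experiments below):

import numpy as np
import faiss

dim = 512                                    # embedding dimension (placeholder)
vectors = np.random.rand(10_000, dim).astype('float32')   # stand-in corpus

# Build-time parameters: M (second constructor argument) and efConstruction
index = faiss.IndexHNSWFlat(dim, 16)         # M = 16 links per node per layer
index.hnsw.efConstruction = 100              # candidate-list size while building the graph
index.add(vectors)

# Query-time parameter: efSearch, adjustable at any time without rebuilding the index
index.hnsw.efSearch = 40                     # candidate-list size while searching
query = np.random.rand(1, dim).astype('float32')
distances, ids = index.search(query, 5)      # top-5 approximate neighbors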

What is Recall@k?

Recall@k is a key metric for measuring the accuracy of vector search and RAG systems. It measures the ability of the retriever to find the relevant chunks for a user query within the top k results. It is important because, if the retriever misses the chunks containing the information required to answer the question (low recall), the LLM cannot possibly generate an accurate answer in the response synthesis step, no matter how powerful it is.

\[ \text{Recall@}k = \frac{\text{relevant items retrieved in top } k}{\text{total number of relevant items in the corpus}} \]

In practice, this is a difficult metric to measure because the denominator (ground truth documents) is not easily known for a real-life production system. What we will do instead is design a use case where the ground truth (a single vector index) is unique and known, and Recall@k will measure the average number of times it is retrieved in the top-k results, over a large number of sample queries.

For instance, Recall@5 will measure the average number of times the ground truth index appeared in the top-5 retrievals over 500 queries.

For a RAG system, an acceptable range is 70-90% for Recall@5 and 80-95% for Recall@10, and we will see that our use case adheres to these ranges for the Flat index.
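
Since each query in our setup has exactly one ground-truth vector, the denominator of the formula is 1 and Recall@k reduces to a hit rate averaged over queries. A minimal sketch of that computation (the function and variable names are my own, not part of the experiment code below):

def recall_at_k(retrieved_ids, ground_truth_id, k):
    # 1 if the single ground-truth id appears in the top-k results, else 0
    return 1.0 if ground_truth_id in retrieved_ids[:k] else 0.0

# Averaged over all 500 queries this becomes the hit rate reported in the charts:
# recall_5 = sum(recall_at_k(ids, gt, 5) for ids, gt in zip(results, truths)) / len(truths)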

Use Case

To test HNSW, we need a vector database with a sufficiently large number of vectors (> 100,000). There does not seem to be such a large public dataset available that consists of document chunks and associated queries for which a particular chunk would be considered the ground truth. And even if there were, natural language can be ambiguous, so it is difficult to say confidently which chunks in the corpus should be considered relevant for a query (the denominator in the Recall@k formula). Developing such a curated dataset would require finding a large number of documents, chunking and embedding them, and then creating queries for the chunks. That would be a resource-intensive process.

Instead, let's re-imagine our RAG problem as "given a short caption (query), we want to retrieve the most relevant images from the dataset".

For this approach, I used the publicly available LAION-Aesthetics dataset. To access it, you need to be logged in to Hugging Face and agree to the terms mentioned. Details about the dataset are available on the LAION website here. It contains a huge number of rows with URLs to images along with a text caption. They look like the following:

LAION-Aesthetics

I downloaded a subset of rows and generated 200,000 CLIP embeddings of the images to build the vector database. The text captions of the images can be conveniently used as queries for RAG. Each caption has only one image vector as the ground truth, so the denominator of Recall@k is exactly known for all queries. Also, the CLIP embeddings of an image and its caption are never an exact match, so there is enough "fuzziness" in retrievals, similar to a purely document-based RAG where a text query is used to retrieve relevant document chunks using a distance metric. This will be evident when we see the charts of Recall@k in the next sections.
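
The evaluation scripts below only embed the text captions, so as a complement, here is a minimal sketch of how the image side of such a database could be produced with open_clip; the image path is a placeholder and the model and checkpoint names simply mirror those used later:

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k')
model.eval()

# Encode one downloaded image into a normalized CLIP vector
image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # placeholder path
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Stacking these vectors for all 200k images yields the embeddings.npy file
# that the evaluation scripts load.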

Measuring Recall@k for Flat vs HNSW

We adopt the following approach:

  1. Embeddings of 200k images are stored as a .npy file.
  2. From the LAION dataset, 500 captions (queries) are randomly chosen and embedded using CLIP. The chosen query indices also form the ground truth, as they correspond to the unique image for each query.
  3. The database is built in increments of 50,000 vectors, giving four iterations of size 50k, 100k, 150k and 200k vectors. Both Flat and HNSW indexes are built. HNSW is built using M=16 and ef_construction=100.
  4. Recall@k is calculated for k = 1, 5, 10, 15 and 20, based on whether the ground truth indices are included in the top-k results.
  5. First, the Recall@k values are calculated for each of the query vectors and averaged over the number of samples (500).
  6. Then, average Recall@k values are calculated for HNSW ef_search values of 10, 20, 40, 80 and 160.
  7. Finally, five charts are drawn, one for each of the Recall@k values. Each chart depicts the evolution of Recall@k as database size grows, for the Flat index and different ef_search values of HNSW.
The code can be seen here:
import pandas as pd
import numpy as np
import faiss
import torch
import open_clip
import os
import random
import matplotlib.pyplot as plt

def evaluate_subset(size, embeddings_all, df_all, query_vectors_all, eval_indices_all, ef_search_values):
    # Subset embeddings
    embeddings = embeddings_all[:size]
    dimension = embeddings.shape[1]

    # Build indices in-memory for this subset size
    index_flat = faiss.IndexFlatL2(dimension)
    index_flat.add(embeddings)

    index_hnsw = faiss.IndexHNSWFlat(dimension, 16)
    index_hnsw.hnsw.efConstruction = 100
    index_hnsw.add(embeddings)

    num_samples = len(eval_indices_all)
    results = []

    ks = [1, 5, 10, 15, 20]

    # Evaluate Flat
    flat_recalls = {k: 0 for k in ks}
    for i, qv in enumerate(query_vectors_all):
        _, I = index_flat.search(qv, max(ks))
        target = eval_indices_all[i]
        for k in ks:
            if target in I[0][:k]:
                flat_recalls[k] += 1

    flat_res = {"Setting": "Flat"}
    for k in ks:
        flat_res[f"R@{k}"] = flat_recalls[k]/num_samples
    results.append(flat_res)

    # Evaluate HNSW with different efSearch values
    for ef in ef_search_values:
        index_hnsw.hnsw.efSearch = ef
        hnsw_recalls = {k: 0 for k in ks}
        for i, qv in enumerate(query_vectors_all):
            _, I = index_hnsw.search(qv, max(ks))
            target = eval_indices_all[i]
            for k in ks:
                if target in I[0][:k]:
                    hnsw_recalls[k] += 1

        hnsw_res = {"Setting": f"HNSW (ef={ef})", "ef": ef}
        for k in ks:
            hnsw_res[f"R@{k}"] = hnsw_recalls[k]/num_samples
        results.append(hnsw_res)

    return results

def format_table(size, results):
    ks = [1, 5, 10, 15, 20]
    lines = []
    lines.append(f"\nDatabase Size: {size}")
    lines.append("="*80)
    header = f"{'Index/efSearch':<20}"
    for k in ks:
        header += f" | {'R@'+str(k):<8}"
    lines.append(header)
    lines.append("-" * 80)
    for row in results:
        line = f"{row['Setting']:<20}"
        for k in ks:
            line += f" | {row[f'R@{k}']:<8.2f}"
        lines.append(line)
    lines.append("="*80)
    return "\n".join(lines)

def main(n):
    dataset_path = r"C:\database\laion_final.parquet"
    embeddings_path = r"C:\database\embeddings.npy"
    results_dir = r"C:\results"

    db_sizes = [50000, 100000, 150000, 200000]
    ef_search_values = [10, 20, 40, 80, 160]
    num_samples = n
    output_txt = os.path.join(results_dir, f"eval_results_{num_samples}.txt")
    output_png = os.path.join(results_dir, f"recall_vs_dbsize_{num_samples}.png")

    if not os.path.exists(dataset_path) or not os.path.exists(embeddings_path):
        print("Error: Dataset or embeddings not found.")
        return

    os.makedirs(results_dir, exist_ok=True)

    # Load all data once
    print("Loading base data...")
    df_all = pd.read_parquet(dataset_path)
    embeddings_all = np.load(embeddings_path).astype('float32')

    # Load CLIP model once
    print("Loading CLIP model (ViT-B-32)...")
    model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
    tokenizer = open_clip.get_tokenizer('ViT-B-32')
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()

    # Use samples valid for all subsets
    eval_indices = random.sample(range(min(db_sizes)), num_samples)
    print(f"Sampling {num_samples} queries for consistent evaluation...")

    # Generate query vectors
    query_vectors = []
    for idx in eval_indices:
        text = df_all.iloc[idx]['TEXT']
        text_tokens = tokenizer([text]).to(device)
        with torch.no_grad():
            text_features = model.encode_text(text_tokens)
            text_features /= text_features.norm(dim=-1, keepdim=True)
            query_vectors.append(text_features.cpu().numpy().astype('float32'))

    all_output_text = []
    # Collect all results for plotting
    # structure: { 'R@1': { 'Flat': [val1, val2...], 'ef=10': [val1, val2...] }, ... }
    ks = [1, 5, 10, 15, 20]
    plot_data = {f"R@{k}": { "Flat": [] } for k in ks}
    for ef in ef_search_values:
        for k in ks:
            plot_data[f"R@{k}"][f"HNSW ef={ef}"] = []

    for size in db_sizes:
        print(f"Evaluating with database size: {size}...")
        results = evaluate_subset(size, embeddings_all, df_all, query_vectors, eval_indices, ef_search_values)
        table_str = format_table(size, results)

        # Print to screen
        print(table_str)
        all_output_text.append(table_str)

        # Collect for plot
        for row in results:
            label = row["Setting"]
            if label == "Flat":
                for k in ks:
                    plot_data[f"R@{k}"]["Flat"].append(row[f"R@{k}"])
            else:
                ef = row["ef"]
                for k in ks:
                    plot_data[f"R@{k}"][f"HNSW ef={ef}"].append(row[f"R@{k}"])

    # Save text results
    with open(output_txt, "w", encoding="utf-8") as f:
        f.write("\n".join(all_output_text))
    print(f"\nFinal results saved to {output_txt}")

    # Create individual plots for each k
    for k in ks:
        plt.figure(figsize=(10, 6))
        k_key = f"R@{k}"

        for label, values in plot_data[k_key].items():
            linestyle = '--' if label == "Flat" else '-'
            marker = 'o' if label == "Flat" else 's'
            plt.plot(db_sizes, values, label=label, linestyle=linestyle, marker=marker)

        plt.title(f"Recall@{k} vs Database Size")
        plt.xlabel("Database Size")
        plt.ylabel("Recall")
        plt.grid(True)
        plt.legend()

        output_png = os.path.join(results_dir, f"recall_vs_dbsize_{k}.png")
        plt.tight_layout()
        plt.savefig(output_png)
        plt.close()
        print(f"Plot saved to {output_png}")

if __name__ == "__main__":
    main(500)

And the results are the following:

Recall vs database size for k = 5
Recall vs database size for k = 1, 10, 15, 20

Observations

  1. For the Flat index (dotted line), Recall@5 and Recall@10 are in the range of 0.70 – 0.85, as can be expected of real-life RAG applications.
  2. The Flat index gives the highest Recall@k across all database sizes and forms a benchmark upper bound for HNSW.
  3. At any given database size, Recall@k increases with higher k. So for a database size of 100k vectors, Recall@20 > Recall@15 > Recall@10 > Recall@5 > Recall@1. This is understandable, as with a higher k there is more chance that the ground truth index is present in the retrieved set.
  4. Both Flat and HNSW deteriorate consistently as the database size grows. This is because high-dimensional vector spaces become increasingly crowded as the number of vectors grows.
  5. HNSW performance improves for higher ef_search values.
  6. As the database size approaches 200k, HNSW appears to degrade faster than Flat search.

Does HNSW degrade faster than Flat search?

To view the relative performance of the Flat vs HNSW indexes as database size grows, a slightly different approach is adopted:

  1. The database index construction and query selection process remains the same as before.
  2. Instead of considering the ground truth, we calculate the overlap between the Flat index results and each of the HNSW ef_search results for a given retrieval count (k).
  3. Five charts are drawn, one for each of the k values, showing the evolution of overlap as database size grows. For a perfect match with the Flat index, the HNSW line will show a score of 1. More importantly, if HNSW results degrade more than the Flat index, the line will have a negative slope; otherwise it will have a horizontal or positive slope.
The code can be seen here:
import pandas as pd
import numpy as np
import faiss
import torch
import open_clip
import os
import random
import matplotlib.pyplot as plt
import time

def evaluate_subset_compare(size, embeddings_all, df_all, query_vectors_all, ef_search_values):
    # Subset embeddings
    embeddings = embeddings_all[:size]
    dimension = embeddings.shape[1]

    # Build indices in-memory for this subset size
    index_flat = faiss.IndexFlatL2(dimension)
    index_flat.add(embeddings)

    index_hnsw = faiss.IndexHNSWFlat(dimension, 16)
    index_hnsw.hnsw.efConstruction = 100
    index_hnsw.add(embeddings)

    num_samples = len(query_vectors_all)
    results = []

    ks = [1, 5, 10, 15, 20]
    max_k = max(ks)

    # 1. Evaluate Flat once for this subset
    flat_times = []
    flat_results_all = []
    for qv in query_vectors_all:
        start_t = time.perf_counter()
        _, I_flat_all = index_flat.search(qv, max_k)
        flat_times.append(time.perf_counter() - start_t)
        flat_results_all.append(I_flat_all[0])

    avg_flat_time_ms = (sum(flat_times) / num_samples) * 1000

    # 2. Evaluate HNSW relative to Flat
    for ef in ef_search_values:
        index_hnsw.hnsw.efSearch = ef

        hnsw_times = []
        # Track intersection counts for each k
        overlap_counts = {k: 0 for k in ks}
        for i, qv in enumerate(query_vectors_all):
            # HNSW top-max_k
            start_t = time.perf_counter()
            _, I_hnsw_all = index_hnsw.search(qv, max_k)
            hnsw_times.append(time.perf_counter() - start_t)

            # Flat result was already pre-calculated
            I_flat_all = flat_results_all[i]

            for k in ks:
                set_flat = set(I_flat_all[:k])
                set_hnsw = set(I_hnsw_all[0][:k])
                intersection = set_flat.intersection(set_hnsw)
                overlap_counts[k] += len(intersection) / k

        avg_hnsw_time_ms = (sum(hnsw_times) / num_samples) * 1000

        hnsw_res = {
            "Setting": f"HNSW (ef={ef})",
            "ef": ef,
            "FlatTime_ms": avg_flat_time_ms,
            "HNSWTime_ms": avg_hnsw_time_ms
        }
        for k in ks:
            # Average over all queries
            hnsw_res[f"R@{k}"] = overlap_counts[k] / num_samples
        results.append(hnsw_res)

    return results

def format_all_tables(db_sizes, ef_search_values, all_results):
    ks = [1, 5, 10, 15, 20]
    lines = []

    # 1. Create one table for each Recall@k
    for k in ks:
        k_label = f"R@{k}"
        lines.append(f"\nTable: {k_label} (HNSW Overlap with Flat)")
        lines.append("=" * (20 + 12 * len(db_sizes)))

        # Header
        header = f"{'ef_search':<18}"
        for size in db_sizes:
            header += f" | {size:<9}"
        lines.append(header)
        lines.append("-" * (20 + 12 * len(db_sizes)))

        # Rows (ef values)
        for ef in ef_search_values:
            row_str = f"{ef:<18}"
            for size in db_sizes:
                # Find the result for this size and ef
                val = 0
                for r in all_results[size]:
                    if r.get('ef') == ef:
                        val = r.get(k_label, 0)
                        break
                row_str += f" | {val:<9.2f}"
            lines.append(row_str)
        lines.append("=" * (20 + 12 * len(db_sizes)))

    # 2. Create search time table
    lines.append("\nTable: Average Search Time (ms)")
    lines.append("=" * (20 + 12 * len(db_sizes)))
    header = f"{'Index Setting':<18}"
    for size in db_sizes:
        header += f" | {size:<9}"
    lines.append(header)
    lines.append("-" * (20 + 12 * len(db_sizes)))

    # Flat row
    row_flat = f"{'Flat Index':<18}"
    for size in db_sizes:
        # Flat time is the same for all ef in a size, so just take any
        t = all_results[size][0]['FlatTime_ms']
        row_flat += f" | {t:<9.4f}"
    lines.append(row_flat)

    # HNSW rows
    for ef in ef_search_values:
        row_str = f"HNSW (ef={ef:<3})"
        for size in db_sizes:
            t = 0
            for r in all_results[size]:
                if r.get('ef') == ef:
                    t = r.get('HNSWTime_ms', 0)
                    break
            row_str += f" | {t:<9.4f}"
        lines.append(row_str)
    lines.append("=" * (20 + 12 * len(db_sizes)))

    return "\n".join(lines)

def main(n):
    dataset_path = r"C:\database\laion_final.parquet"
    embeddings_path = r"C:\database\embeddings.npy"
    results_dir = r"C:\results"

    db_sizes = [50000, 100000, 150000, 200000]
    ef_search_values = [10, 20, 40, 80, 160]
    num_samples = n
    output_txt = os.path.join(results_dir, f"compare_results_{num_samples}.txt")
    output_png_prefix = "compare_vs_dbsize"

    if not os.path.exists(dataset_path) or not os.path.exists(embeddings_path):
        print("Error: Dataset or embeddings not found.")
        return

    os.makedirs(results_dir, exist_ok=True)

    # Load all data once
    print("Loading base data...")
    df_all = pd.read_parquet(dataset_path)
    embeddings_all = np.load(embeddings_path).astype('float32')

    # Load CLIP model once
    print("Loading CLIP model (ViT-B-32)...")
    model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
    tokenizer = open_clip.get_tokenizer('ViT-B-32')
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()

    # Use queries from the first 50k rows
    eval_indices = random.sample(range(min(db_sizes)), num_samples)
    print(f"Sampling {num_samples} queries...")

    # Generate query vectors
    query_vectors = []
    for idx in eval_indices:
        text = df_all.iloc[idx]['TEXT']
        text_tokens = tokenizer([text]).to(device)
        with torch.no_grad():
            text_features = model.encode_text(text_tokens)
            text_features /= text_features.norm(dim=-1, keepdim=True)
            query_vectors.append(text_features.cpu().numpy().astype('float32'))

    all_results_data = {}
    ks = [1, 5, 10, 15, 20]
    plot_data = {f"R@{k}": {} for k in ks}
    for ef in ef_search_values:
        for k in ks:
            plot_data[f"R@{k}"][f"ef={ef}"] = []

    for size in db_sizes:
        print(f"Evaluating with database size: {size}...")
        results = evaluate_subset_compare(size, embeddings_all, df_all, query_vectors, ef_search_values)
        all_results_data[size] = results

        # Collect for plot
        for row in results:
            ef = row["ef"]
            for k in ks:
                plot_data[f"R@{k}"][f"ef={ef}"].append(row[f"R@{k}"])

    # Format pivoted tables
    final_output_text = format_all_tables(db_sizes, ef_search_values, all_results_data)
    print(final_output_text)

    # Save text results
    with open(output_txt, "w", encoding="utf-8") as f:
        f.write(final_output_text)
    print(f"\nFinal results saved to {output_txt}")

    # Create individual plots for each k
    for k in ks:
        plt.figure(figsize=(10, 6))
        k_key = f"R@{k}"

        for label, values in plot_data[k_key].items():
            plt.plot(db_sizes, values, label=label, marker='s')

        plt.title(f"HNSW vs Flat Overlap Recall@{k} vs Database Size")
        plt.xlabel("Database Size")
        plt.ylabel("Overlap Ratio")
        plt.grid(True)
        plt.legend()

        output_png = os.path.join(results_dir, f"{output_png_prefix}_{k}.png")
        plt.tight_layout()
        plt.savefig(output_png)
        plt.close()
        print(f"Plot saved to {output_png}")

if __name__ == "__main__":
    main(500)

And the results are the following:

Flat vs HNSW index overlap for k = 5
Flat vs HNSW index overlap for k = 1, 10, 15, 20

Observations

  1. In all cases, the lines have a negative slope, indicating that HNSW degrades faster than the Flat index as the database grows.
  2. Higher ef_search values degrade more slowly than lower values, which fall quite sharply.
  3. Higher ef_search values have significant overlap (>90%) with the benchmark Flat search, compared to the lower values.

Recall-latency trade-off

We know that HNSW is faster than Flat search. To see it in action, I also measured the average latency in the code of the previous section. Here are the average search times (in ms):

Database size   | 50,000  | 100,000 | 150,000 | 200,000
Flat Index      | 5.1440  | 9.3850  | 14.8843 | 18.4100
HNSW (ef=10)    | 0.0851  | 0.0742  | 0.0763  | 0.0768
HNSW (ef=20)    | 0.1159  | 0.0876  | 0.0959  | 0.0983
HNSW (ef=40)    | 0.1585  | 0.1366  | 0.1415  | 0.1493
HNSW (ef=80)    | 0.2508  | 0.2262  | 0.2398  | 0.2417
HNSW (ef=160)   | 0.4613  | 0.3992  | 0.4140  | 0.4064

Observations

  1. HNSW is orders of magnitude faster than Flat search, which is the primary reason it is the search algorithm of choice for almost all vector databases.
  2. Time taken by Flat search increases almost linearly with database size (O(N) complexity).
  3. For a given ef_search value (a row), HNSW search time is almost constant: at this scale (up to 200k vectors), latency stays practically flat as the database grows.
  4. As ef_search increases within a column, HNSW time increases significantly. For instance, the time taken for ef=160 is roughly 3x that of ef=40.

Tuning the RAG pipeline

The above analysis shows that while HNSW is indeed the option to adopt in a production scenario for latency reasons, there is a need to periodically tune ef_search to maintain the latency-recall balance as the database grows. Some best practices to adopt are as follows:

  1. Given the difficulty of measuring Recall@k in a production database, keep a test-case repository of ground truth document chunks and queries, which can be run at regular intervals to check retrieval quality (see the sketch after this list). We could start with the most frequent queries asked by users, and the chunks that are needed for a good recall.
  2. Another indirect way to check recall quality is to use a strong LLM to judge the quality of the retrieved context. Instead of asking "Did we get the best documents for the user query?", which is difficult to answer precisely for a large database, we can ask a slightly weaker question, "Does the retrieved context actually contain the answer to the user's question?", and let the judge LLM respond to that.
  3. Collect user feedback in production. User ratings of a response, along with any manual corrections, can be used as a trigger for performance tuning.
  4. While tuning ef_search, start with a conservatively high value, measure Recall@k, then reduce it until latency is acceptable.
  5. Measure Recall at the top_k that the RAG uses, usually between 3 and 10. Consider relaxing top_k to 15 or 20 and letting the LLM decide which chunks in the given context to use for the response during the synthesis step. Assuming the context does not become too large to fit in the LLM's context window, such an approach enables high recall with a moderate ef_search value, thereby keeping latency low.
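
As an illustration of points 1 and 4, here is a minimal sketch of such a periodic recall check, assuming a curated test set of query vectors and ground-truth chunk ids and an already-built FAISS HNSW index; the function name, ef candidates and recall target are placeholders of my own, not part of the experiment code above:

def recall_regression_check(index, test_queries, ground_truth_ids, k=5,
                            ef_candidates=(40, 80, 160), target_recall=0.85):
    # index: a faiss.IndexHNSWFlat built over the production corpus
    # Sweep ef_search on a fixed test set and report Recall@k for each value
    report = {}
    for ef in ef_candidates:
        index.hnsw.efSearch = ef
        hits = 0
        for qv, gt in zip(test_queries, ground_truth_ids):
            _, ids = index.search(qv.reshape(1, -1), k)
            hits += int(gt in ids[0])
        report[ef] = hits / len(ground_truth_ids)
    # Smallest ef_search that still meets the recall target, or None if none does
    passing = [ef for ef, r in report.items() if r >= target_recall]
    return report, (min(passing) if passing else None)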

Hybrid RAG pipeline

HNSW tuning using ef_search cannot fix the issue of falling recall with growing database size beyond a point. That is because vector search, even with a flat index, becomes noisy when too many vectors are packed close together in the N-dimensional space (N being the number of dimensions output by the embedding model). As the charts in the section above show, recall drops by 10%+ as the database grows from 50k to 200k. The reliable way to maintain recall is to use metadata filtering (e.g., using a knowledge graph) to identify potential document ids and run retrieval only over those. I discuss this in detail in my article GraphRAG in Practice: How to Build Cost-Efficient, High-Recall Retrieval Systems. A small sketch of the idea follows below.
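
As an illustration only, here is a minimal sketch of that pre-filter-then-search pattern: metadata narrows the candidate set first, and exact (flat) vector search runs only over the survivors. The metadata structure and field names are invented for this example.

import numpy as np

def filtered_search(query_vec, embeddings, metadata, allowed_source, k=5):
    # 1. Metadata filter (e.g., ids returned by a SQL query or knowledge-graph traversal)
    candidate_ids = [i for i, m in enumerate(metadata)
                     if m["source"] == allowed_source]
    # 2. Exact distance computation only over the much smaller candidate set
    candidates = embeddings[candidate_ids]
    dists = np.linalg.norm(candidates - query_vec, axis=1)
    top = np.argsort(dists)[:k]
    return [candidate_ids[i] for i in top]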

Key Takeaways

  • HNSW is the default retrieval algorithm in most vector databases, but it is rarely tuned or monitored in production RAG systems.
  • Retrieval quality degrades silently as the vector database grows, even when latency remains stable.
  • For the same corpus size, Flat search consistently achieves higher Recall@k than HNSW, serving as a useful upper bound for evaluation.
  • HNSW recall degrades faster than Flat search for fixed ef_search values as database size increases.
  • Increasing ef_search improves recall, but latency grows quickly, creating a sharp recall-latency trade-off.
  • Simply tuning HNSW parameters is insufficient at scale—vector search itself becomes noisy in dense embedding spaces.
  • Hybrid RAG pipelines using metadata filters (SQL, graphs, inverted indexes) are the most reliable way to maintain recall at scale.

Conclusion

HNSW has earned its place as the backbone of modern vector databases—not because it is perfectly accurate, but because it is fast enough to make large-scale semantic search practical.

However, in RAG systems, speed without recall is a false optimization.

This article shows that as vector databases grow, retrieval quality deteriorates quietly—especially under approximate search—while latency metrics remain deceptively stable. The result is a system that looks healthy from an infrastructure perspective, but progressively feeds weaker context to the LLM, increasing hallucinations and reducing answer quality.

The solution is not to abandon HNSW, nor to arbitrarily increase ef_search.

Instead, production-grade RAG systems must:

  • Measure retrieval quality explicitly and repeatedly.
  • Treat Flat search as a recall baseline.
  • Continuously rebalance recall and latency.
  • And ultimately, move toward hybrid retrieval architectures that narrow the search space before vector similarity is applied.

If your RAG system's answers are getting worse as your data grows, the problem may not be your LLM, your prompts, or your embeddings—but the retrieval algorithm you never realized you were relying on.

Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI

Images used in this article are synthetically generated. The LAION-Aesthetics dataset is used under the CC-BY 4.0 license. Figures and code were created by me.
