Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval

Scene 1: A RAG system over a few hundred pages of policy documents goes live for a small team.

The first thing that impresses everyone: it handles paraphrase. Someone asks “how do I cancel?”, the document never uses the word cancel, it uses termination procedures, and the system finds it anyway.
Another user asks in French while the policy is in English, and the right page comes back. A typo here, a phonetic spelling there, no problem. After a few days the team is genuinely impressed. The closest thing RAG has to magic is sitting in front of them, and it didn’t take any hand-coded synonym table to make it work.

Scene 2: The same system, two weeks later.

The user asks “what’s the rule on contractor overtime?” The system answers “I couldn’t find that information.” The user, who happens to be the business expert who wrote half this manual, frowns, opens the PDF, types non-employee labor into Ctrl-F, and lands on the right paragraph in three seconds. The right keyword wasn’t overtime. It was the term the document actually uses. The expert knew that; the embedding didn’t.
Pretty quickly, more cases like this surface. Negation breaks. Exact contract reference numbers break. An internal product code returns the wrong tier. None of it is fixable by swapping the embedding provider.

The position of the series, stated up front: most enterprise reliability gains come from strong upstream filtering (expert keywords, document structure), not from a reranker stacked on top of weak retrieval.

The classical stack ranks the layers by cost:

cheap embedding similarity at the bottom,
an optional cross-encoder reranker in between,
the chat-completion LLM on top.

None of them is magic; each breaks in specific ways.

This article is one piece of the broader Enterprise Document Intelligence Vol. 1 series, which builds enterprise RAG brick by brick from a baseline pipeline to corpus-scale architecture.

1. What embeddings nail

Before the failures, what embeddings are actually impressive at. The failures only make sense in contrast.

An embedding turns a piece of text into a vector. Texts with similar words end up close in vector space.

An embedding is a list of numbers that captures the meaning of a piece of text: a longer list can carry more nuance. Embeddings have improved with each generation. Every case below runs on the same four models, weakest to strongest:

GloVe-avg (average_word_embeddings_glove.6B.300d, 2014, 300-dim),
MiniLM (all-MiniLM-L6-v2, 2021, 384-dim),
ada-002 (text-embedding-ada-002, 2022, 1536-dim),
3-large (text-embedding-3-large, 2024, 3072-dim).

Loading each is a one-liner. The two local models come from sentence-transformers (HuggingFace weights pulled to disk on first call); the two OpenAI models go through the API client. Same call shape across all four, returning a vector.

from sentence_transformers import SentenceTransformer
from openai import OpenAI

# Local models: weights downloaded from HuggingFace, run in-process.
glove  = SentenceTransformer("average_word_embeddings_glove.6B.300d")  # 2014, 300-dim
minilm = SentenceTransformer("all-MiniLM-L6-v2")                       # 2021, 384-dim

# OpenAI models: called through the API (reads OPENAI_API_KEY from the environment).
client = OpenAI()
def openai_embed(text: str, model: str) -> list[float]:
    return client.embeddings.create(input=text, model=model).data[0].embedding

# Same call shape across all four; each returns a vector of its own dimension.
v_glove  = glove.encode("policy renewal")
v_minilm = minilm.encode("policy renewal")
v_ada    = openai_embed("policy renewal", "text-embedding-ada-002")   # 2022, 1536-dim
v_large  = openai_embed("policy renewal", "text-embedding-3-large")   # 2024, 3072-dim

Each model lives in its own vector space with its own cosine distribution, so raw scores across columns are not comparable. What’s meaningful is the separation within a column: does the target win against the decoys, and by how much? Watching the gap widen across the gradient is the empirical evidence that embeddings really did get better.

The primitive every comparison table below uses is the same: embed the query and each candidate with the four models, score with cosine similarity, return a row per candidate:

import numpy as np
import pandas as pd

def _cos(u, v):
    """Cosine similarity: dot product of two vectors, normalised by their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def compare_models(query, candidates, target=None):
    # target: optional label of the expected winner (not used in the scoring itself).
    # Embed the query once per model, then score every candidate against it.
    qg = glove.encode(query)
    qm = minilm.encode(query)
    qa = openai_embed(query, "text-embedding-ada-002")
    ql = openai_embed(query, "text-embedding-3-large")
    rows = []
    for c in candidates:
        rows.append({
            "candidate": c,
            "GloVe-avg":  _cos(qg, glove.encode(c)),
            "MiniLM":     _cos(qm, minilm.encode(c)),
            "ada-002":    _cos(qa, openai_embed(c, "text-embedding-ada-002")),
            "3-large":    _cos(ql, openai_embed(c, "text-embedding-3-large")),
        })
    return pd.DataFrame(rows).set_index("candidate")
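
The within-column separation described above can be reduced to one number per model: the target's cosine minus the best decoy's cosine. The following is a minimal sketch, assuming one column of scores is available as a plain dict. `target_margin` is a hypothetical helper, not part of the article's code, and the numbers are placeholders invented for illustration, not measured results.

```python
def target_margin(scores: dict[str, float], target: str) -> float:
    """Target score minus the best decoy score, within ONE model's column."""
    best_decoy = max(s for name, s in scores.items() if name != target)
    return scores[target] - best_decoy

# Placeholder cosines, invented for illustration only (not measured results).
column = {"TARGET": 0.62, "decoy A": 0.55, "decoy B": 0.10}
margin = target_margin(column, "TARGET")
print(f"margin = {margin:+.2f}")  # positive: the target wins; negative: a decoy wins
```

Because the margin is computed inside a single column, it stays meaningful even though raw cosines from different models cannot be compared.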

1.1 Conceptual proximity

car matches passages about automobiles, vehicles, motor vehicles. fire damage finds passages on smoke damage and scorching. manager approval matches a clause about executive approval. The model captures the semantic field, not just the surface words. This is what makes embeddings feel powerful: the user doesn’t have to guess the document’s vocabulary; the embedding bridges the rest.

*Casual query bridges to formal paraphrase. All four models pick TARGET; bigger models widen the margin – Image by author*

1.2 Synonyms and paraphrase

Phone number matches telephone. Policy cancellation matches a section titled termination procedures. Fee matches charge. Monthly cost matches premium. Expiration matches policy end date. Doctor matches physician, lawyer matches attorney, car matches automobile. Single words and multi-word compounds alike. The model has learned that two vocabularies say the same thing, including the gap between casual user phrasing and the formal language documents are written in. Nobody coded that mapping by hand.

The test: query what's the monthly fee against a synonym TARGET (A flat charge of $9.99...), a literal-overlap decoy (Premium payments are due monthly..., which shares the literal monthly token), and two off-topic decoys.

*Query `monthly fee`. Three models bridge `fee ↔︎ charge`; GloVe picks the literal-overlap decoy – Image by author*

Only GloVe-avg falls for the literal-overlap decoy. Sentence-encoder training (already in 2021’s MiniLM) is what gives real synonym handling. Without it, a candidate that just repeats the query’s tokens in any order wins. With it, the model bridges fee ↔︎ charge even though the two words have no surface form in common. The query is also phrased as a question (what's the monthly fee) and the TARGET as an assertion (A flat charge of $9.99...). The synonym handling is what wins here. But the actual answer (the bare number $9.99 alone, or Yes for a yes/no question) wouldn’t necessarily win, whatever the model strength. Section 2.2 demonstrates that directly.

1.3 Typos and misspellings

insurence still embeds close to insurance. polciy still finds the policy section. deductable with the wrong vowel still lands on the deductible page. Diacritics dropped on French words (resiliation without the accent) still match the canonical form. Modern embedding models were trained on a web-scraped soup of text where these typos are constant, and they have learned to absorb the noise.

*Typoed query. GloVe collapses to negative cosines; margin to TARGET grows from MiniLM to 3-large – Image by author*

Look at the score gaps, not the absolute scores. GloVe-avg has no notion of typos. Misspelled tokens are out of vocabulary, so the embeddings collapse and the cosines go negative. The ordering is basically random. The OpenAI models absorb the typos cleanly. Character-level robustness is real, and it scales with model capacity.

1.4 Cross-lingual matching

Multilingual embeddings place premium, prime and Prämie in nearby regions of the space. Same for deductible / franchise / Selbstbeteiligung, for claim / sinistre / Schadensfall. A French keyword retrieves an English passage about the same concept. For enterprises with mixed-language corpora (French contracts, English correspondence, German policy schedules), this is genuinely useful when it works, and on modern models it usually does.

*French query against English candidates. GloVe and MiniLM struggle; ada-002 and 3-large bridge languages cleanly – Image by author*

GloVe fails outright: it picks Coverage limit: $50,000 per year. over Annual premium: $1,200. because the French annuelle lexically associates with year in its averaged word space, and it has no idea that prime means premium. MiniLM technically picks TARGET, but the cosines sit around 0.12, basically noise. ada-002 and 3-large are multilingual by training, like BGE-M3 and multilingual-e5, and they bridge French to English cleanly. The choice is not “vector vs keyword”, it’s “multilingual vector model vs English-only one”.

1.5 Compound polysemy

Polysemic words have several meanings that the context disambiguates:

bank (financial institution / river edge),
claim (insurance event / assertion),
store (verb: put away / noun: retail outlet),
green card (immigration document / a card coloured green),
hot dog (food / a dog that’s hot).

When a candidate uses the literal word in the wrong sense, a strong embedding should still pick the semantically right one. This is also where the literal-token bias of weak models shows clearest: GloVe-avg cannot distinguish the two readings of a compound and picks whichever candidate shares the most tokens with the query. Sentence encoders progressively recover the right sense, but how progressively depends on how well known the compound is in training data.

We test two compounds, easy first, then hard.

First, green card, the easy case. The immigration sense is so heavily attested in training corpora (news, legal text, Wikipedia) that even MiniLM resolves the compound. The test: query green card, three candidates. A paraphrase of the immigration document (TARGET, zero shared tokens), a gaming-context sentence that contains both green and card in literal senses (the trap), and one off-topic decoy.

*`green card` against immigration paraphrase vs gaming trap. Only GloVe falls in the trap – Image by author*

Only GloVe falls into the trap. Word-averaging models have no notion that “green card” as a compound refers to immigration. They see two tokens, look for candidates sharing those tokens, and the gaming trap wins. MiniLM is already enough to flip it, because sentence-level training captures the institutional sense. ada-002 picks TARGET by a comfortable margin; 3-large by a wide one. This is the kind of polysemy embeddings handle well, because the public web teaches the compound everywhere.
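
Why a word-averaging model cannot see the compound can be shown with a few lines of arithmetic. The sketch below assumes plain mean pooling over per-word vectors; the 3-dimensional vectors are invented for illustration and are not real GloVe values. Averaging ignores word order, so the compound and any sentence that merely contains the same two tokens end up close together.

```python
import math

# Toy 3-dim word vectors, invented for illustration (not real GloVe values).
TOY_VECTORS = {
    "green": [0.9, 0.1, 0.0],
    "card":  [0.1, 0.9, 0.0],
    "the":   [0.2, 0.2, 0.2],
    "is":    [0.1, 0.1, 0.3],
}

def average_embedding(text):
    """Mean of the word vectors of the known tokens; word order is ignored."""
    vecs = [TOY_VECTORS[t] for t in text.lower().split() if t in TOY_VECTORS]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

compound  = average_embedding("green card")
reordered = average_embedding("card green")
trap      = average_embedding("the card is green")

print(round(cosine(compound, reordered), 3))  # same bag of tokens, same vector
print(round(cosine(compound, trap), 3))       # literal-token trap stays very close
```

A paraphrase with zero shared tokens has nothing in this representation to pull it toward the query, which is the behaviour the GloVe column shows.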

Now hot dog, the hard case. Same structural setup (a compound that also reads literally), but the literal reading (a dog that’s hot) is also heavily attested in training text. The model has seen plenty of sentences about hot weather and dogs in it. The food sense and the literal sense compete on near-equal footing, and the literal-token bias of weak and mid models wins.

*`hot dog` against food paraphrase vs literal-token trap. Only 3-large flips the polysemy cleanly – Image by author*

This is the section 1 case where the model gradient helps the most. GloVe-avg, MiniLM, and ada-002 all fall into the trap. They latch onto the shared hot + dog tokens despite the wrong sense. The same effect was already visible on GloVe in section 1.2 (literal monthly token beating the fee ↔︎ charge synonym). Compound polysemy is the worst case of it: the literal tokens of the query appear in the decoy, so even ada-002 cannot tell the two senses apart. 3-large is the first model that recovers: it picks the food paraphrase by a wide margin even though TARGET shares zero tokens with the query.

So the practical question for your corpus is not “is there polysemy” but “how institutional is the polysemy I have”. An insurance corpus has plenty of compound polysemy that’s not in the public training distribution (claim handling as a verb in a workflow, pool as a risk-sharing instrument). On those, even ada-002 behaves the way GloVe behaves on hot dog. The 2024-class model is the practical fix; the rest of the series goes after the structural one.

1.6 What these wins actually show, and don’t

The vocabulary in this section has one thing in common: it’s public. The model saw green card ↔︎ permanent resident card, prime ↔︎ premium, polciy → policy in millions of training documents. Embeddings handle them well because the equivalence is baked into the weights. What the literature calls the parametric memory of the model (the part that “knows” things from training, without any retrieval) is doing most of the work.

Two consequences worth naming before we move on.

1. For these cases, you might not need RAG at all. Ask GPT-4 “what’s another name for green card?” and you get the answer without retrieval. The parametric part of the model already knows. RAG earns its place exactly where the parametric part doesn’t: facts that aren’t on the public web, contract clauses that don’t generalise, internal product codes the model never saw. Section 1 used well-known vocabulary so the demos are reproducible and read cleanly. Production RAG is not used to answer these questions.

2. The section 1 wins don’t transfer to enterprise vocabulary. An insurance company has ShieldPro Elite (a product tier), pool (a risk-sharing instrument, not a swimming pool), non-employee labor (the contract’s word for contractor), regulatory citations like Solvency II Article 7. None of this is in the model’s training distribution. On enterprise terms, embeddings fail the same way GloVe fails on hot dog, because the institutional sense the embedding would need to recover is not institutionalised anywhere outside that company.

The fix is not a bigger embedding model. The fix is the expert who knows the vocabulary, codified as a keyword dictionary (section 3.3 develops this). Section 2.1 makes the failure concrete on the pool example.

Section 2 catalogues the structural failures. Read them with this in mind: every one of them is the rule, not the exception, on enterprise corpora.

2. Where they break, and why

The abilities in section 1 are real; the failures below are equally real, equally reproducible, and persist across all four models. A larger model doesn’t move the ranking. The fix is architectural, not “pick a stronger embedding”.

Section 1.6 already raised the obvious counter (“for these cases, just ask the LLM directly”). At corpus scale that doesn’t work: a 200k-document corpus can’t be passed through an LLM on every query. Some retrieval step has to come first. The mainstream pipeline stacks a reranker between embeddings and the LLM; the series’ answer is upstream filtering through expert keywords and document structure (articles 6, 7, 9). Either way, the failures catalogued below apply to the embedding stage. None of these layers is magic.

2.1 The simplest break: the term isn’t in the model

Before the structural failures, the most basic one. Section 1.6 said it in words. Here is the demo.

Take pool. In an insurance contract, pool is a risk-sharing instrument: a group of insureds that collectively absorb losses through aggregated premiums. In general English, pool is a body of water you swim in. Two senses of the same word, with one stark difference: the swimming sense is everywhere on the public web; the risk-pool sense is buried in actuarial textbooks, regulatory filings, and reinsurance treaties that the model barely saw at training time.

The test mirrors the hot-dog setup from section 1.5, with one twist. Query the bare word pool. Three candidates: a swimming paraphrase (the public sense, no pool token in the sentence), a reinsurance paraphrase using real industry jargon (the specialist sense, also no pool token), and a random control sentence about a train departure (no pool token, no insurance connection, no swimming).

*Query `pool`. The reinsurance sense ranks below a random control on three of four models – Image by author*

The swim paraphrase wins on every model, by a wide margin (0.353 to 0.843 cosine, depending on the model). The reinsurance paraphrase, written in genuine industry vocabulary, ranks below the random train-departure control on three of the four models. Even ada-002, the workhorse of most enterprise RAG deployments, puts the train timetable 0.010 ahead of the specialist sentence. Only 3-large gives the specialist sense a 0.006 lift over the control, well within the noise of the measurement.

This is the most direct failure mode there is: the embedding space simply doesn’t encode the specialist sense of pool. A reranker stacked on top wouldn’t help, because the candidate scores it would re-evaluate are themselves noise. A bigger embedding model wouldn’t help, because the model that saw the swimming pool a million times and the reinsurance pool maybe a hundred times will keep weighting the swimming sense.

pool is actually a soft OOV case: the swim sense and the risk sense share a register and 3-large catches some signal. The harder cases are strict OOV terms: ShieldPro Elite (a fictional product tier), Solvency II Article 7 (a real regulatory citation), ZRX-2025 (an internal product code). For these the embedding has no anchor at all. The model treats them as random byte strings; scoring them against any other text is a coin flip biased by tokenization quirks.

The fix is the expert who knows the vocabulary, codified as a keyword dictionary. Section 3.3 develops the workflow.
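
To make the idea of a keyword dictionary concrete, here is a minimal sketch of what such a lookup could look like. It is an assumption about the shape of the data, not the workflow the series develops: `GLOSSARY` and `expand_query` are hypothetical names, and the two entries are taken from examples earlier in this article.

```python
# Hypothetical expert glossary: user phrasing -> the term the documents use.
# Entries come from examples in this article; the structure itself is an assumption.
GLOSSARY = {
    "contractor": "non-employee labor",
    "cancel": "termination procedures",
}

def expand_query(query: str, glossary: dict[str, str] = GLOSSARY) -> list[str]:
    """Return keywords to search for: each matched alias plus its canonical term."""
    lowered = query.lower()
    keywords = []
    for alias, canonical in glossary.items():
        if alias in lowered:
            keywords.extend([alias, canonical])
    return keywords

print(expand_query("What's the rule on contractor overtime?"))
# ['contractor', 'non-employee labor']
```

The point of the design is that the mapping lives in data an expert can edit, so adding a term does not require retraining or swapping the embedding model.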

The rest of section 2 walks through the structural failures that show up even when the term is in the model. The pool case is the simpler break that comes first.

2.2 The structural break: term similarity, not answer relevance

Section 2.1 covered the case where the term simply isn’t in the model. The rest of section 2 covers the case where the term is in the model, and the embedding still gives the wrong answer. These failures share one structural root. An embedding sees text and ranks it by term similarity. It doesn’t represent the question-to-answer relation at all. Two of the simplest queries you can ask make this concrete. They aren’t enterprise edge cases, they’re the most general questions in the world.

*Yes/no question. The bare keyword `Termination` beats the actual `Yes` answer on every model – Image by author*

“Yes” is the right answer to a yes/no question. It never wins. The literal copy of the question’s noun does. On every model from 2014 to 2024.

A subtlety worth naming. This particular failure is less harmful in practice than it looks. For a yes/no question, what we actually want from retrieval is not the literal word yes. We want the evidence about the topic: the page where the rule lives. The answer-phase LLM produces yes/no from that evidence. So retrieval pulling “Termination” or “Termination may be required.” (the topical matches) rather than “Yes, it is possible.” is closer to the right behaviour than the demo’s verdict suggests. The principle the article keeps surfacing is here too: the retrieval phase is not the answer phase, and they need to be separated and optimised as two distinct steps. Articles 6, 7, and 8 develop the separation.

The failure is sharper on the next example, where retrieval actually needs to find the answer-bearing line.

Now the cleanest factoid in the world: “What’s the capital of France?” The web has seen “Paris is the capital of France” millions of times. If question-answer mapping showed up anywhere in any embedding space, this is where it would show up.

*Query `Capital of France`. Paris never wins; topic-decoys sharing `Capital of` or `France` always do – Image by author*

Paris is never #1. On three of the four models (GloVe, ada-002, 3-large) the winner is Capital of Italy, the candidate that shares the literal phrase Capital of with the query. On MiniLM a different decoy wins: France is in Europe., because it shares the token France. Different decoys, same root cause: topic similarity, not answer relevance. Going from a 300-dim 2014 bag-of-word-vectors model to a 3072-dim 2024 OpenAI model doesn’t flip the trap. For a factoid question, retrieval should fetch the line that contains the answer. Instead, every model picks the line that matches the query’s vocabulary topically.

A second nuance worth naming. Modern embedding models train on question-passage pairs (MS MARCO, Natural Questions, BEIR). This does push answer-bearing passages a little closer to the questions they answer. The bias exists. It’s weak. On very general factoids it sometimes flips the decision. On specialised vocabulary the model never saw at training (internal product codes, expert terminology, contract jargon), the bias vanishes. Topic similarity dominates again.

The sections below catalogue this root cause in four concrete failure shapes (negation, magnitudes, topical proximity, signal dilution) plus a survey of the obvious cases. Each is the same mechanism applied to a different query type.

2.3 Negation

A negation question turns the logical relation upside down: the user wants the candidate that is the complement of the topic, not the candidate that is closest to the topic. Embeddings can’t do that. They measure topical proximity, not logical complementation. The starker the test, the clearer the failure.

Query: “What’s NOT a city?” Four candidates: three are city-related (two specific cities plus the literal word City), and one is Table, a mundane object that happens to be the only candidate that answers the question correctly.

*Query `What's NOT a city?`. Every model ranks the correct answer last; negation is invisible – Image by author*

Every model fails the same way. The candidates that match the topic (City, Paris, New York) sit on top, and Table, the only candidate that actually answers the question, lands last. The query word NOT carries almost no signal in the embedding space: the embedding sees a bag containing “city” and ranks anything city-related higher than anything that isn’t. The fix isn’t a stronger embedding model. It’s a step that detects the negation at question-parsing time and inverts the retrieval (Article 6).

“Sure, but no real user writes a negation query.” A reasonable objection that holds for a moment and then breaks in production. Users don’t ask “what’s NOT a city?” They ask “what’s the premium amount on this policy?” The system returns the deductible by mistake. The user, frustrated, naturally tries to correct: “I want the premium amount, not the deductible.” That second query is a negation, and it’s exactly the moment a real enterprise user writes one.

The instinct is reasonable: a human reader treats not as an exclusion. The embedding does the opposite. By adding deductible to the query, even prefixed with not, the embedding pulls deductible-bearing lines closer, not further. The user’s correction makes the failure strictly worse than the original query.

This is the larger principle the section keeps surfacing: the raw question isn’t the right input to the retriever. The fix is upstream, in question parsing: negation gets detected, lifted out of the prose, encoded as a structured exclude-filter, and applied after retrieval, not embedded with the rest of the query. Sections 3.2 and 3.3 return to this point with a constructive version: what the retriever actually consumes is a structured representation (keywords, filters, exclusions), not the user’s free-form sentence.
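
A rough sketch of the "detect, lift out, apply after retrieval" step is shown below. It assumes a single regex pattern of the form "not the X"; real question parsing needs far more than this, and `parse_negation` and `apply_exclusions` are hypothetical names, not the implementation the series describes.

```python
import re

# Assumed pattern: "not X" / "not the X". Real queries need a proper parser.
NEGATION = re.compile(r",?\s*\bnot\s+(?:the\s+|a\s+|an\s+)?(\w+)", re.IGNORECASE)

def parse_negation(question: str) -> tuple[str, list[str]]:
    """Split a question into (text to retrieve with, terms to exclude)."""
    excluded = [term.lower() for term in NEGATION.findall(question)]
    positive = NEGATION.sub("", question).strip(" .,")
    return positive, excluded

def apply_exclusions(lines: list[str], excluded: list[str]) -> list[str]:
    """Post-retrieval filter: drop retrieved lines that mention an excluded term."""
    return [ln for ln in lines if not any(t in ln.lower() for t in excluded)]

positive, excluded = parse_negation("I want the premium amount, not the deductible.")
print(positive)   # I want the premium amount
print(excluded)   # ['deductible']

retrieved = ["The standard deductible is $500.", "Annual premium: $1,200."]
print(apply_exclusions(retrieved, excluded))  # ['Annual premium: $1,200.']
```

Only `positive` would be sent to the retriever, so the word deductible never reaches the embedding and cannot pull deductible-bearing lines closer.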

2.4 Magnitudes and thresholds

Numerical comparisons, dates, contract amounts, account balances. Anything where the answer depends on the value itself. Take a stripped-down version: query find value greater than 1M, four candidates that are bare amounts.

*Query asks for value > 1M. `1M` wins everywhere; `3B`, the only correct answer, ranks last – Image by author*

Every model picks 1M, the candidate that equals the threshold but doesn’t strictly exceed it. The win is pure lexical match: the literal 1M token sits in the query. 3B, the only candidate that actually answers the question, lands at #4 (dead last) on both ada-002 and 3-large. The embedding has no concept of magnitude. It sees 1M next to 1M and that wins.

This generalizes to any value-comparison or threshold question: monetary thresholds, dates (“after 2020”), durations (“longer than 30 days”), counts. Embeddings are bad at this almost by design: they compress meaning into dense vectors, and the discriminating signal (the value itself, or the operator that picks among values) is exactly what compression destroys. The fix is well known: BM25 / full-text indexing for the lexical match, plus a question-parsing step that lifts the operator and the threshold out as structured fields (Article 6) so a downstream filter can do the comparison.
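
The operator-plus-threshold idea can be sketched in a few lines. This is an illustration under narrow assumptions: amounts are written as digits with an optional K/M/B suffix, and only two operator phrases are recognised. All function names are hypothetical, and only `1M` and `3B` come from the example above; the other candidates are made up.

```python
import re

SUFFIX = {"K": 1e3, "M": 1e6, "B": 1e9}
OPERATORS = {
    "greater than": lambda value, threshold: value > threshold,
    "less than": lambda value, threshold: value < threshold,
}

def parse_amount(text: str) -> float:
    """Turn '1M' or '3B' into a number. Assumes <digits><K|M|B> with no spaces."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)([KMB])?", text.strip().upper())
    if m is None:
        raise ValueError(f"unrecognised amount: {text!r}")
    return float(m.group(1)) * SUFFIX.get(m.group(2), 1.0)

def parse_threshold(question: str):
    """Extract (operator_name, threshold) from e.g. 'find value greater than 1M'."""
    m = re.search(r"(greater than|less than)\s+(\S+)", question.lower())
    if m is None:
        return None
    return m.group(1), parse_amount(m.group(2))

def filter_by_threshold(question: str, candidates: list[str]) -> list[str]:
    """Keep only the candidates whose numeric value satisfies the parsed comparison."""
    parsed = parse_threshold(question)
    if parsed is None:
        return list(candidates)  # no comparison found: nothing to filter on
    op_name, threshold = parsed
    keep = OPERATORS[op_name]
    return [c for c in candidates if keep(parse_amount(c), threshold)]

candidates = ["1M", "500K", "3B", "20"]  # "500K" and "20" are made-up decoys
print(filter_by_threshold("find value greater than 1M", candidates))  # ['3B']
```

The comparison happens on parsed numbers, so `1M` is rejected for not strictly exceeding the threshold, which is exactly the distinction the cosine score cannot make.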

2.5 Topical proximity vs answer relevance

User question: “Who signed the contract?” The corpus has one passage describing how contracts must be signed (authorized representative, signature requirements) and one passage with the actual signature (“Signed: John Smith, Marketing Director, dated 2025-03-15”). The first passage talks about signing; the second is the signature. Which one wins?

*`Who signed the contract?`. The procedural passage about signing outranks the actual signature line – Image by author*

This is the structural failure that the model gradient doesn’t fix. Embedding similarity measures topical proximity, not question-to-answer relationship. A page that talks about a topic will often score higher than a page that answers a question about the topic. Definitions outscore values. Background sections outscore conclusions. Procedures outscore the concrete instances they describe.

Three of four models confirm the pattern here (GloVe, ada-002, 3-large). MiniLM is the exception: its sentence-pair training pushes the concrete-answer phrasing slightly higher than the procedural-density phrasing. The pattern is stable on the other three, and reproduces across most factoid-against-procedure pairs we have tried.

2.6 Signal dilution in long context

The previous tests used candidates roughly the length of the query. Real corpus pages are not. A real page is 300-500 words, dense with details, with the answer to a specific question buried in a single sentence somewhere in the middle. When you embed the whole page as a single vector, the signal of that one answer-bearing line gets averaged with everything else, and the page-level embedding drifts toward the centroid of the surrounding noise.
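
The averaging effect has a closed form in an idealised setting. Assume the page vector is the plain mean of one answer vector, pointing exactly at the query, and N noise vectors that are each orthogonal to the query and to one another. The cosine between query and page is then 1/sqrt(N+1). Real encoders do not literally mean-pool sentence vectors, so the sketch below shows the direction of the effect, not the measured curve.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def diluted_cosine(n_noise: int) -> float:
    """Cosine between the query and the mean of 1 answer vector + n_noise noise vectors."""
    dim = n_noise + 1
    query = [1.0] + [0.0] * n_noise
    # Row 0 is the answer (same direction as the query); rows 1..n are orthogonal noise.
    vectors = [[1.0 if i == j else 0.0 for i in range(dim)] for j in range(dim)]
    page = [sum(col) / dim for col in zip(*vectors)]
    return cosine(query, page)

for n in (0, 1, 2, 4, 8, 16):
    print(n, round(diluted_cosine(n), 3), round(1 / math.sqrt(n + 1), 3))
```

Under these assumptions the similarity falls with the square root of the page length: a perfect match surrounded by eight unrelated sentences keeps only a third of its score.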

The cleanest way to see it is a one-variable experiment. Keep the answer sentence fixed. Prepend to it an increasing number of unrelated office-life sentences (office hours, parking rules, HR boilerplate, nothing about deductibles or water damage). Score against a fixed control candidate that shares no specific term with the query and just lives in the same broad insurance/claims vocabulary.

Query: deductible for water damage claims
Answer (varying): For water damage claims, the standard deductible is $500. prepended with N ∈ {0, 1, 2, 4, 8, 16} unrelated sentences
Control (constant across N): Claims must include photos, repair estimates, and police reports where applicable.

*Answer signal vs noise: prepending unrelated sentences makes the answer score collapse on every model – Image by author*

Each model fails in its own time, but they all fail. GloVe collapses immediately because bag-of-words averaging drags the embedding toward the noise after a single sentence. MiniLM holds out for four sentences before its sentence-encoder representation gives up. ada-002 and 3-large, both 2022+ OpenAI models trained on question-passage pairs, last the longest, but by the time the candidate is 144 words (eight unrelated sentences), the actual answer ranks below a candidate that doesn’t contain the words deductible, water, or damage at all. Embedding a 300-word page is the production version of “answer + 16 noise sentences”.

This is why production pipelines that embed at the page level frequently miss the right page even when the answer is genuinely on it. The page vector averages 300-500 words of topical noise around one or two answer-bearing lines. Section 3.1 is the architectural fix: embed line by line, not page by page. Only aggregate up to the page when generation needs the surrounding context. The right line on a noisy page becomes findable again because its embedding is not averaged with everything else.
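
The line-by-line idea can be sketched independently of any embedding model. In the sketch below the scorer is a stand-in (Jaccard overlap of tokens), used only so the example runs without model downloads or API calls; in practice it would be replaced by an embedding cosine. `best_page_by_line` is a hypothetical name, and the office-life lines are invented filler.

```python
def token_overlap(query: str, text: str) -> float:
    """Stand-in scorer (Jaccard overlap of tokens). Swap in an embedding cosine in practice."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0

def best_page_by_line(query, pages, score=token_overlap):
    """Score every line separately and return (page_id, line, score) for the best line."""
    best = None
    for page_id, lines in pages.items():
        for line in lines:
            s = score(query, line)
            if best is None or s > best[2]:
                best = (page_id, line, s)
    return best

pages = {
    "p1": [
        "Office hours are 9 to 5.",          # invented filler
        "For water damage claims, the standard deductible is $500.",
        "Parking is on level 2.",            # invented filler
    ],
    "p2": ["Claims must include photos, repair estimates, and police reports where applicable."],
}

page_id, line, s = best_page_by_line("deductible for water damage claims", pages)
print(page_id, "|", line)
```

The page is identified through its best line, so the filler around the answer never enters the score; the full page can still be handed to the generation step afterwards.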

2.7 The apparent circumstances (no demo wanted)

Some question sorts break embeddings so plainly {that a} four-model comparability would simply repeat the identical end result. They’re listed right here for completeness, and to make a broader level: no embedding improve rescues them. The repair is upstream (query parsing, Article 6) or in a special device completely (BM25, metadata filter, aggregation pipeline).

OOV identifiers and internal jargon: contract references (Section 4.2.1), regulatory citations (GDPR Art. 17.3), invoice numbers, ticket IDs, internal product names (ShieldPro Elite, SAP-MRP, KPI-Q4-V3). The embedding treats them as opaque sequences and cannot rank them semantically. Fix: BM25 or an exact-match index for the lookup, plus a glossary that maps aliases to canonical terms (ShieldPro Elite → top-tier homeowners plan) maintained as expert keywords (Article 6).
Boolean composition: “documents reviewed by Alice but not by Bob”, “claims with damage and witness”. Bag-of-words averaging erases the logical operators. Fix: parse the question into a structured filter (Article 6) and apply it after retrieval.
Counting and aggregation: “How many contracts did Alice sign?”, “List all open claims”. Embeddings return the most similar passages; a counting answer needs a full scan or a SQL-style query over an index. Fix: route these to an aggregation pipeline (Articles 15-20).
Temporal predicates: “the latest version”, “claims filed after 2020”, “policies expiring before December”. Embeddings don’t represent temporal order. Fix: extract the temporal filter at question-parsing time and apply it as a metadata filter on the index.
Multi-hop reasoning: “Who is the manager of the person who signed contract X?” Each hop is a separate retrieval; the embedding gives you one shot. Fix: an agentic chain, or a graph traversal over a properly indexed corpus.

The pattern is consistent. When an embedding fails obviously, the answer isn’t “buy a bigger embedding model”. It’s “lift the query out of the embedding lane and into the right tool”.
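Lifting a query out of the embedding lane can start as a very small router. The sketch below is one possible shape, not the series' implementation: the regular expressions, the lane names, and the function `route` are all hypothetical, and a real deployment would derive its patterns from the corpus and from the question parser of Article 6.

```python
import re

# Hypothetical patterns, chosen only to cover the examples in this section.
COUNTING = re.compile(r"\b(?:how many|list all|count)\b", re.IGNORECASE)
TEMPORAL = re.compile(r"\b(?:latest|newest|before|after|since)\b", re.IGNORECASE)
IDENTIFIER = re.compile(
    r"\b(?:[A-Z]{2,}-[A-Z0-9-]+"   # codes such as SAP-MRP or KPI-Q4-V3
    r"|\d+(?:\.\d+)+"              # dotted references such as 4.2.1
    r"|Art\.\s*\d+)"               # citations such as Art. 17
)


def route(question: str) -> str:
    """Pick a retrieval lane for a question. Lane names are illustrative."""
    if COUNTING.search(question):
        return "aggregation"        # full scan or SQL-style query
    if TEMPORAL.search(question):
        return "metadata_filter"    # date filter applied on the index
    if IDENTIFIER.search(question):
        return "exact_match"        # BM25 or exact-match index
    return "embedding"              # fall back to line-level vector search


for q in [
    "How many contracts did Alice sign?",
    "claims filed after 2020",
    "What does Section 4.2.1 say?",
    "how do I cancel?",
]:
    print(f"{q!r} -> {route(q)}")
```

The order of the checks is a design choice: counting and temporal questions are routed first because they need a different tool even when they also mention an identifier.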

2.8 Same cracks at page scale (real document)

The four failures above were demonstrated on hand-written candidates. They show up identically when retrieval runs page by page on a real document. We embed every page of Attention Is All You Need (Vaswani et al. 2017; arXiv non-exclusive distribution license, declared on the arXiv abstract page; 15 pages) and run three questions; each surfaces a different ranking pathology at page granularity.

What each result shows.

Q1, barely wins. Three pages within 0.01 of each other; the right page (page 7, where the Adam learning-rate formula lives) wins by 0.007. That’s the margin of luck, not retrieval. A variant of section 2.5 (topical proximity) compounded with general ranking fragility.
Q2, top-3 saves us. Page 8 outranks page 9, but the answer (Table 3, the d_k row of the ablation) lives on page 9. Top-3 is enough; top-1 would have failed silently. Same flavour as section 2.4 (exact values inside a numeric table).
Q3, total failure. The answer page (page 8, ε_ls = 0.1) falls out of the top-3 entirely. Page 15 (with example sentences full of ε symbols in formulas) sneaks in instead. This is section 1.5 (compound polysemy) firing on ε: the embedding can’t tell the ε of the Adam optimizer (page 7), the ε_ls of label smoothing (page 8), and the ε of unrelated formulas (page 15) apart.

Same failure categories, scaled up to a real document. The fix is the same one section 3 develops.

3. How to actually use them

Section 1 showed what embeddings are good at. Section 2 showed where they break, with two distinct roots: when the term simply isn’t in the model (section 2.1) and when the term is in the model but term similarity isn’t answer relevance (sections 2.2 onward). The natural next question: given that, how do we actually use them in production?

Four sections. Section 3.1: the right mental model (line-level synonym-tolerant search). Section 3.2: the trick that bridges the question-to-answer gap is not really about embeddings, it’s about extracting the keywords the answer would contain. Section 3.3: the production workflow that makes both work, by discovering the corpus’s vocabulary with experts, codifying it into a keyword dictionary, then running targeted retrieval on top. Section 3.4: the special case of sentiment-heavy corpora (HR feedback, customer surveys, support tickets), where the same discovery mechanism applies to emotional vocabulary.

3.1 The reframing: line-level synonym-tolerant search

The simplest way to hold what embeddings are: vector search is keyword search that handles synonyms, typos, and other languages, applied line by line. It’s not magic. It’s not “page-level semantic understanding”. On a single line, the model treats cancel and terminate as close. It absorbs polciy as policy. It bridges prime and premium across languages. Every match that worked in section 1 worked for this reason.

When you embed a whole page into a single vector (section 2.6 showed it directly), the signal of one good line gets averaged with the rest, and the right line hides inside a page that mostly talks about other things. So embed line by line. Only aggregate up to the page when generation needs the surrounding context.

Page-level embedding still earns its place in a few cases: when no single line carries the keyword (the page is about car insurance but never uses that phrase), when the topic is implied by surrounding vocabulary (a medical page mentioning A1C / insulin / blood sugar but never diabetes), when style or register matters, when the heading is generic (“Notes”, “Section 5”). Outside these cases, line-level wins almost every time.

The demo below makes it concrete on a real paper. The previous sections embedded short hand-written candidates. Here we embed every line of the Attention Is All You Need paper (15 pages, ~1000 lines) and search by a short keyword anchor. The top-K results are lines, with their page and line number. You can read each match and see why it matched: the anchor’s keyword or a clear paraphrase is right there in the text.

Five operations on top of pandas and numpy: encode the query, stack the line embeddings into a matrix, batch-compute cosine in a single matmul, sort by similarity, return the top-k. No vector database, no framework, no infra. The “vector store” is a DataFrame column plus a numpy dot product.

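import numpy as np
import pandas as pd

def top_lines_for(query: str, line_df: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """Rank every line by cosine similarity to `query`. Return the top-k."""
    # `get_embedding` and `client` are assumed to be defined elsewhere in the notebook.
    q_vec = get_embedding(query, client=client)
    # Stack the per-line vectors into one (n_lines, dim) matrix.
    line_matrix = np.vstack(line_df["embedding"].values)
    # Cosine similarity of every line against the query, in a single matmul.
    sims = line_matrix @ q_vec / (
        np.linalg.norm(line_matrix, axis=1) * np.linalg.norm(q_vec)
    )
    return (
        line_df.assign(similarity=sims)
        .nlargest(k, "similarity")[["page_num", "line_num", "similarity", "text"]]
        .reset_index(drop=True)
    )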

*Top 10 lines for `multi-head attention`: paraphrases and literal matches from pages 1, 4, 5, 10 – Image by author*

Two things to take away from the line-level demo.

1. The matched lines literally show why each one matched. No magic, no ranking opacity. Every top result contains either the anchor’s keyword or a clear paraphrase of it. That’s line-level embedding in one sentence: a fuzzy, synonym-tolerant Ctrl-F over the document.

2. The matched line is an anchor, not the passage you send to generation. The line is a small thing the retriever can confidently locate. The passage that goes to the LLM is usually larger: the surrounding paragraph, the section, sometimes the whole page. Article 7 develops this as a two-step pattern: detect anchors first (line-level, keyword-level, structure-level), then choose a passage around each anchor based on what the question needs. Targeted retrieval = small N around a sharp anchor, not 30 fuzzy pages thrown at the LLM.
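The anchor-then-passage step can be sketched with the same DataFrame layout (`page_num`, `line_num`, `text`) used by `top_lines_for`. The function `passage_around` and its fixed `window` parameter are hypothetical; how the passage should actually be chosen is the subject of Article 7, and this is only the simplest possible version: a few lines on either side of the anchor, on the same page.

```python
import pandas as pd


def passage_around(
    line_df: pd.DataFrame, page_num: int, line_num: int, window: int = 2
) -> str:
    """Return the anchor line plus `window` lines before and after it.

    Assumes `line_df` has the columns page_num, line_num and text,
    and that line numbers restart on every page.
    """
    page = line_df[line_df["page_num"] == page_num]
    in_window = page["line_num"].between(line_num - window, line_num + window)
    context = page[in_window].sort_values("line_num")
    return "\n".join(context["text"])


# Tiny made-up document: two pages, a handful of lines each.
demo_lines = pd.DataFrame(
    {
        "page_num": [1, 1, 1, 1, 1, 2, 2, 2],
        "line_num": [1, 2, 3, 4, 5, 1, 2, 3],
        "text": ["p1-l1", "p1-l2", "p1-l3", "p1-l4", "p1-l5",
                 "p2-l1", "p2-l2", "p2-l3"],
    }
)

print(passage_around(demo_lines, page_num=1, line_num=3, window=1))
```

The window never crosses a page boundary here, which keeps the sketch short; a paragraph- or section-aware expansion would need the document structure as well.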

3.2 HyDE: search what the answer would contain, not the question

Section 2.2 showed that embeddings don’t see questions; they see term similarity. The natural response: stop feeding the question into the retriever. Feed it text that looks like the answer instead. That’s the idea behind HyDE (Hypothetical Document Embeddings). Write (or have an LLM write) a sentence that plausibly answers the question, in the vocabulary the document would use, and embed that. The retriever compares the hypothetical-answer vector to the corpus.

The point everyone makes about HyDE is the embedding side: “the rewritten query lands in the document’s neighbourhood instead of the user’s”. That’s true and it helps. But the real value of HyDE, especially in enterprise contexts, is on a different layer. Writing a hypothetical answer is also an extraction step: it surfaces the keywords the answer would contain. “Termination procedures”, “rights of rescission”, “cancellation fee”. These are the terms that anchor the search, whether the retriever is vector-based or keyword-based.

*Raw query ranks target #4; HyDE rewrite injects document vocabulary, target climbs to #1 – Image by author*

Why HyDE worked here, and what actually did the work. The raw query says cancel. The target line says rescission and terminate. Zero shared content tokens. Three lexical decoys in the candidate pool each repeat cancel/cancellation several times, and together they push the formal target down to rank #4. The HyDE rewrite is a fictional answer that happens to contain rescission, terminate, written notice, renewal, the exact vocabulary the target uses. Once these tokens enter the query side, the ranking flips and the target climbs to #1.

The dominant factor is the keywords the rewrite contains. Register matching (the rewrite’s formal declarative tone aligning with the document’s register) and latent semantic associations from the LLM’s training contribute smaller second-order effects (Article 6 decomposes them in depth); in enterprise vocab-bounded corpora, these don’t move the result. Run keyword search on the term set {rescission, terminate, written notice, renewal} and you get the same target with no embedding pass at all.

HyDE is implicit keyword expansion routed through an embedding step. The LLM writes a full hypothetical answer, the system embeds it, the retriever runs cosine over the corpus. All of that work to inject a handful of keywords into the query. Two simpler paths do the same vocabulary lift, explicitly:

Ask the LLM for the keywords directly. One prompt: “What terms would the answer to this question contain in a typical insurance contract?” Output: rescission, terminate, written notice, renewal. Use them in keyword search. No fictional document, no embed, no cosine.
Have the expert hand you the dictionary. Lawyers, claims adjusters, compliance officers already know that cancellation in user vocabulary equals rescission in contract vocabulary. Codifying that mapping once is durable; asking the LLM to rediscover it on every query is wasteful.

Both paths beat the HyDE pipeline on three fronts. Auditability: the matched keywords are visible to the team and to a regulator; a 0.83 cosine score is not. Latency: at most one LLM call, no embed round-trip per query. Durability: the keywords persist in a dictionary, reusable across queries; HyDE regenerates the hypothesis from scratch every time. Article 6 (Question Parsing) formalises this as the explicit expert keyword dictionary that grows with the corpus.
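Once the term set is in hand, the retrieval step itself can be very small. The sketch below is an illustration with made-up candidate lines, not the article's demo data: `keyword_search` is a hypothetical helper that does plain case-insensitive substring matching and returns, for each matching line, the terms that matched, which is the auditability argument in code form.

```python
def keyword_search(lines: list[str], terms: list[str]) -> list[tuple[int, list[str]]]:
    """Return (line_index, matched_terms) pairs, most matched terms first.

    Matching is naive substring matching on lowercased text; a production
    system would more likely use BM25 or a proper inverted index.
    """
    results = []
    for idx, text in enumerate(lines):
        lowered = text.lower()
        matched = [t for t in terms if t.lower() in lowered]
        if matched:
            results.append((idx, matched))
    # Stable sort: ties keep document order.
    results.sort(key=lambda r: len(r[1]), reverse=True)
    return results


# Made-up candidate lines, in the spirit of the cancellation example.
candidates = [
    "To cancel, call us. Cancellation is easy and cancellation is free.",
    "Rights of rescission: the insured may terminate by written notice before renewal.",
    "Premium payments are due monthly.",
]
term_set = ["rescission", "terminate", "written notice", "renewal"]

for idx, matched in keyword_search(candidates, term_set):
    print(idx, matched)
```

The lexical decoy that repeats cancel never appears in the output, because none of the answer-side terms occur in it.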

Consumer vs enterprise. On consumer-shaped corpora (general insurance FAQs, e-commerce help, public-service forms), the LLM has seen plenty of training text in the right register, so its keyword guess is usually decent. HyDE works without an expert in the loop. On enterprise corpora (internal product codes, regulatory citations, contract jargon, custom acronyms), the LLM falls back on generic legalese (“…shall be defined in the terms and conditions…”) and misses the document’s actual vocabulary. The expert already knows that vocabulary. Asking the LLM to guess what the expert can hand you, on every single query, is the slow path.

3.3 The production answer: discover keywords with experts

The standard advice (“use embeddings for semantic retrieval”) is too vague. A sharper question is: when do they actually earn their slot in the pipeline? Five answers, each pointing somewhere different.

Already know the right keywords? Use keyword search. It’s faster, cheaper, and auditable in a way a vector match is not. If a regulator asks why a specific passage was retrieved, “the line contains force majeure and pandemic” is a defensible answer. “The cosine similarity was 0.83” is not.

Typos in the query? Fix the query. A single LLM call corrects polciy to policy and you’re back to clean keyword search. No embedding pipeline required.

Typos in the documents? Now embeddings genuinely earn their place. OCR’d contracts, scanned forms, hand-typed notes. Exact-match keyword search cannot match a misspelled token, but a line-level embedding still lands in the right neighbourhood. This is the case where vector search is structurally irreplaceable.

Multilingual corpus? Same answer, different mechanism. Contracts in French, correspondence in English, regulatory annexes in German. A multilingual embedding lets the user query in one language and surface lines from the others. prime annuelle finds Annual premium: $1,200. (section 1.4 showed it). Maintaining bilingual keyword dictionaries by hand is possible but expensive; the multilingual embedding bridges the languages for free, and the expert keeps the dictionary working in one language with embeddings as the cross-language fallback. Requires a multilingual model: ada-002, 3-large, BGE-M3 work; GloVe and English-only sentence encoders don’t.

Synonyms specific to your business that you don’t know yet? This is the most production-relevant case, and where embeddings are most useful: as a discovery mechanism, not as the retriever itself.

The reason matters. In legal, medical, insurance, financial corpora, the meaningful synonyms aren’t dictionary synonyms. Force majeure and act of God mean the same thing in a contract, but the embedding model doesn’t know that. They’re not lexical neighbours and not necessarily embedding-space neighbours either. They’re business-specific equivalences that only experts (lawyers, claims adjusters, compliance officers) know.

Concrete pairs across domains. What “domain synonyms” look like in practice:

Insurance contracts: cancellation ↔︎ rescission, termination, lapse of cover, surrender of the policy. deductible ↔︎ excess (UK), franchise (FR). claim ↔︎ loss notification, incident report. policyholder ↔︎ insured, assured, named party.
Medical records: blood sugar ↔︎ glycemia, A1C, HbA1c, fasting plasma glucose. heart attack ↔︎ myocardial infarction, MI, acute coronary event. high blood pressure ↔︎ hypertension, elevated BP reading.
Legal and contract clauses: force majeure ↔︎ act of God, unforeseeable circumstances, events beyond reasonable control. non-compete ↔︎ restrictive covenant, restraint of trade clause. confidentiality ↔︎ non-disclosure, NDA, proprietary information clause.
HR and employment: dismissal ↔︎ termination of employment, separation, severance event. salary ↔︎ compensation, base pay, gross remuneration. harassment ↔︎ unwanted conduct, hostile environment, inappropriate behaviour.

None of these aliases are dictionary synonyms in the standard sense. They’re domain-specific equivalences validated by an insurance underwriter, a clinician, a contract lawyer, an HR professional. The embedding finds them as candidates; the expert says yes or no. Force majeure equals act of God only if you know it does.

HyDE makes this implicit (the LLM invents the document’s likely vocabulary on the fly; section 3.2 showed where it falls short). The series makes it explicit: a curated keyword dictionary maintained by domain experts.

# Discovery loop. One corpus, seed terms the expert already knows.
# Same `top_lines_for` primitive from section 3.1: no new infrastructure.

SEED_TERMS = ["cancellation", "deductible", "claim", "policyholder"]

draft_aliases = {
    seed: top_lines_for(seed, corpus_lines, k=10)
    for seed in SEED_TERMS
}
# Each draft is the top-k corpus phrasings closest to the seed.
# Hand to the expert: they keep the real aliases, drop the coincidences.

validated_dictionary = {
    "cancellation": ["rescission", "termination", "lapse of cover",
                     "surrender of the policy"],
    "deductible":   ["excess", "franchise"],
    "claim":        ["loss notification", "incident report"],
    "policyholder": ["insured", "assured", "named party"],
}

# Production retrieval hits this dictionary directly. No embedding call
# on the hot path; the embedding only ran once, at discovery time.

The results, on a small insurance corpus. Run the seed cancellation against seven candidate lines (four real aliases, three off-topic decoys) and the four aliases rise to the top on three of the four models.

*One seed query, seven candidates. The four real aliases rank top-4 on three of four models – Image by author*

The pattern is the discovery workflow at work. The model lists candidates ranked by similarity. The expert reads them, keeps rescission, termination, lapse of cover, surrender of the policy, drops premium payments and the other off-topic lines, and the dictionary entry for cancellation is built in one review pass. From that point on, retrieval is keyword search on the dictionary.

The workflow is progressive and runs with the experts, not around them. For the first few queries on a new corpus, run embeddings line by line as in section 3.1. They surface document phrasings nobody expected: the contract uses non-employee labor where the user said contractor; the medical record uses A1C where the user said blood sugar level; the procedure manual uses section 4.2 where the user said overtime rule. Capture these phrasings as keyword aliases in a growing dictionary, with the expert validating each one (they know which aliases are real equivalences and which are coincidences).

Subsequent queries go through keyword search with the enriched dictionary, no embedding call needed. Each retrieval is now auditable (we know which keywords matched), faster (no LLM/embedding latency on the hot path), and the dictionary itself becomes a durable business asset that survives engineering turnover.
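What the hot path might look like once the dictionary exists, as a minimal sketch: `expand_terms` and `retrieve` are hypothetical helpers, the dictionary and the two corpus lines are made up, and matching is naive substring matching (so termination would also match determination; a real system would tokenise). The point is only that no embedding call appears anywhere in the query path.

```python
def expand_terms(question: str, dictionary: dict[str, list[str]]) -> list[str]:
    """Collect the canonical term and its aliases for every entry the question touches."""
    q = question.lower()
    terms: list[str] = []
    for canonical, aliases in dictionary.items():
        if canonical.lower() in q or any(a.lower() in q for a in aliases):
            terms.extend([canonical, *aliases])
    return terms


def retrieve(
    question: str, lines: list[str], dictionary: dict[str, list[str]]
) -> list[tuple[int, list[str]]]:
    """Keyword retrieval over `lines`, returning (line_index, matched_terms) pairs."""
    terms = expand_terms(question, dictionary)
    hits = []
    for idx, text in enumerate(lines):
        lowered = text.lower()
        matched = [t for t in terms if t.lower() in lowered]
        if matched:
            hits.append((idx, matched))
    return hits


# Made-up dictionary entry and corpus lines, for illustration only.
toy_dictionary = {"cancellation": ["rescission", "termination"]}
toy_lines = [
    "Rights of rescission apply within 14 days.",
    "Premium payments are due monthly.",
]

print(retrieve("What is the cancellation rule?", toy_lines, toy_dictionary))
```

The user asked about cancellation, the document says rescission, and the returned pair records exactly which alias made the match.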

The reframing is sharper than the standard one. Embeddings aren’t the production retriever. They’re the bootstrap that builds the production retriever, one keyword alias at a time, in collaboration with the people who already know the corpus. Article 6 (Understanding the Question) develops the dictionary engineering: domain hints, expert aliases, multiple alternative phrasings, the feedback loop with retrieval results. Article 7 (Retrieval) develops the targeted retrieval architecture that consumes the dictionary.

3.4 The HR and customer-feedback case

Most enterprise documents aren’t sentiment-heavy. Contracts, regulatory texts, financial reports, technical specifications are factual corpora; sections 3.1 through 3.3 are built for them. A subset of enterprise corpora is different: customer survey verbatims, employee barometer comments, support ticket free-text, brand mentions on social media. The vocabulary here is emotional (drained, frustrated, delighted, let down) rather than technical (force majeure, Solvency II, cedent).

The discovery workflow still applies. An HR analyst building a burnout-signal lexicon types an explicit concept they care about, say feeling overwhelmed. The embedding surfaces phrasings from the corpus in the same emotional cluster. The top match below shares zero content words with the query; all four models, GloVe through 3-large, rank it #1.

*Query `feeling overwhelmed` against an emotional paraphrase with zero shared tokens. TARGET wins on every model – Image by author*

No emotional understanding here. Emotional vocabulary clusters in the model’s space the way insurance vocabulary does in section 1.2 (fee ↔︎ charge). TF-IDF + logistic regression hit roughly 88% on IMDB sentiment in the early 2010s, before contextual embeddings, because emotional words carry signal on their own. Embeddings extend that with synonymy: overwhelmed, drained, empty, hollow, on the edge are automatically close in the space, so a query in one term surfaces sentences using any of them. The same mechanism as section 1.2, applied to a different vocabulary.

A useful split for production. If sentiment classification is the goal (score each feedback entry, aggregate trends, detect crisis spikes), a dedicated sentiment model outperforms a general embedding. The dedicated model is trained for the task; the embedding is trained for similarity. For vocabulary discovery (what phrasings express distress in our corpus?), the embedding stays the right tool. It surfaces the lexicon the expert validates. Two tasks, two tools. Sarcasm (“Oh great, another Monday”) breaks both, and reliability there needs context the verbatim usually doesn’t provide.

The pattern here is the article’s larger one. First impression: this looks like emergent emotional understanding. Look closer: it’s keyword similarity with a richer notion of “close”. Apply accordingly: use the model to discover the vocabulary you didn’t have; don’t ask it to understand the intent behind the vocabulary.

4. Conclusion

Embeddings are one brick of Enterprise Document Intelligence Volume 1, which builds enterprise RAG brick by brick. The keyword dictionary this article ends on is what production retrieval (Article 7) reads at query time, fast and auditably.

Embeddings are powerful and limited in specific, predictable ways.

Section 1: what they handle. Synonyms, paraphrase, typos, cross-lingual queries, and polysemy work well, with each generation of model widening the safety margin.
Section 2: where they break. Two distinct roots. First, sometimes the term simply isn’t in the model. Section 2.1 made this concrete with pool: a random train-timetable sentence beat the reinsurance paraphrase on three of four models. Enterprise vocabulary lives here. Second, when the term is in the model, the embedding ranks by term similarity, not by question-to-answer mapping. Section 2.2 showed this directly on the simplest queries. From that second root cascade negation (section 2.3), exact values (section 2.4), topical proximity beating answer relevance (section 2.5), and signal dilution in long context (section 2.6). A whole catalog of “obvious” failures (OOV identifiers, Boolean composition, counting, temporal predicates, multi-hop reasoning) needs no demo.
Section 3: how to use them in production. Use embeddings line by line as a synonym-tolerant Ctrl-F (section 3.1). When you do need to bridge the question-to-answer gap, the load-bearing piece is the keywords that the answer would contain, not the embedding of the rewritten query (section 3.2). The production answer is a curated keyword dictionary, built by experts and bootstrapped by line-level embedding discovery (section 3.3). Embeddings aren’t the production retriever; they’re how you find the keywords that the production retriever then uses, fast, auditably, every time.

A case from real projects. A team built a RAG system over commercial insurance contracts and spent three months chasing recall. They started with OpenAI’s text-embedding-3-small at 71% recall, benchmarked Voyage, Cohere, BGE-M3 (recall moved between 69% and 73%), then fine-tuned BGE on synthetic question-passage pairs. Recall climbed to 76%. Five points after three months. Then they broke the 200 questions down by type: 92% on conceptual, 23% on negation, 31% on exact-reference, 18% on internal-acronym. The aggregate of 76% hid three categories sitting between 18% and 31%, and no fine-tuning could fix them. Adding BM25 alongside the vector search took two days and lifted exact-reference recall to 88%. Adding a query expansion step for acronyms via a company glossary took another day and lifted internal-acronym recall from 18% to 71%. One week of structural work outweighed three months of embedding fine-tuning.
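The case does not say how the BM25 results were combined with the vector results. One common option, shown here purely as an assumption, is reciprocal rank fusion: each retriever contributes 1 / (k + rank) for every passage it returns, and the sums decide the final order. The function name, the passage ids, and the constant k = 60 (a conventional default) are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of passage ids into one ranking.

    Each list contributes 1 / (k + rank) per passage, with rank starting at 1.
    Ties are broken alphabetically by id so the output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Made-up rankings from a keyword retriever and a vector retriever.
bm25_ranking = ["c7", "c2", "c9"]
vector_ranking = ["c2", "c5", "c7"]

print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

Passages found by both retrievers rise above passages found by only one, without having to put BM25 scores and cosine scores on a common scale.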

Two signals that a team has over-invested in embeddings: the roadmap features “fine-tune the embedding model” as the next milestone before anyone has broken down the actual failure cases; retrieval metrics are reported as a single recall number with no per-question-type breakdown, hiding the categories where embeddings are structurally wrong.
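The per-question-type breakdown is a few lines of bookkeeping. A minimal sketch, assuming each evaluation question has already been labelled with a type and a hit/miss outcome; the function `recall_by_type` and the toy results below are hypothetical and are not the numbers from the case above.

```python
from collections import defaultdict


def recall_by_type(results: list[tuple[str, bool]]) -> tuple[dict[str, float], float]:
    """Compute recall per question type and overall.

    `results` holds (question_type, hit) pairs, where hit is True when the
    right passage was retrieved for that question.
    """
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for question_type, hit in results:
        totals[question_type] += 1
        hits[question_type] += int(hit)
    per_type = {t: hits[t] / totals[t] for t in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_type, overall


# Toy evaluation set: 10 conceptual questions, 4 negation questions.
toy_results = (
    [("conceptual", True)] * 9
    + [("conceptual", False)]
    + [("negation", True)]
    + [("negation", False)] * 3
)

per_type, overall = recall_by_type(toy_results)
print(per_type, round(overall, 3))
```

On this toy set the overall figure sits near 71% while negation is at 25%, which is the shape of the problem a single aggregate number hides.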

You just watched embeddings fail in predictable, structural ways. The reflex, especially for engineers from an ML background, is to fix the model: more training data, fine-tune, swap providers, run a sweep. Article 3 makes the case that this is the wrong frame. The failures you just saw are not bugs the model can learn its way out of. RAG is not machine learning, and treating it like one is how teams waste six months optimising the part of the system that wasn’t broken.

5. Further reading

The empirical pattern in this article (synonyms, typos, polysemy work; negation, exact identifiers, OOV acronyms fail) is consistent with controlled studies of dense retrievers on out-of-domain enterprise corpora. Reimers and Gurevych (Sentence-BERT, 2019) is the reference for what embedding a line means technically. Ravichander et al. (CONDAQA, 2022) document the negation failure cleanly. The article reframes HyDE (Gao et al. 2023): the load-bearing piece is the keywords the hypothetical answer contains, not the embedding step itself; asking the LLM for the keywords directly recovers the same passage with less infrastructure. Fine-tuning embeddings on enterprise corpora is out of scope here and revisited in Article 21 (production).

Same direction as the article:

Reimers & Gurevych, Sentence-BERT, EMNLP 2019 (arXiv:1908.10084). The reference for what embedding a line means technically.
Ravichander et al., CONDAQA, EMNLP 2022 (arXiv:2211.00295). Documents that dense models systematically fail on negation. Same direction as the empirical pattern in this article.
Gao et al., HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels, ACL 2023 (arXiv:2212.10496). The HyDE approach the article reframes: keywords from the hypothetical answer are what does the work.
Formal et al., SPLADE, 2021 (arXiv:2107.05720). Learned sparse retrieval; a bridge between keyword and embedding worlds, in the same spirit as the “vector search is keyword search” framing.

Different angle, different context:

Karpukhin et al., Dense Passage Retrieval for Open-Domain QA, EMNLP 2020 (arXiv:2004.04906). The canonical “dense beats BM25” result on open-domain QA benchmarks. The context is in-domain training data; this article looks at out-of-domain enterprise corpora where the result doesn’t transfer cleanly.
Wang et al., Text Embeddings by Weakly-Supervised Contrastive Pre-training (E5), 2022 (arXiv:2212.03533) and Lee et al., NV-Embed, 2024 (arXiv:2405.17428). The scale-fixes-it line: larger contrastive pre-training corpora close the OOV gap. The article’s claim is that the failures are structural (compression destroys exact-value signal), not bound by data volume.
Khattab & Zaharia, ColBERT, SIGIR 2020 (arXiv:2004.12832). Late-interaction retrieval as an answer to exact-token matching at the embedding level; relevant to the “exact values, internal acronyms” failure mode.
Muennighoff et al., MTEB: Massive Text Embedding Benchmark, EACL 2023 (arXiv:2210.07316). The benchmark driving the “pick the highest-scoring embedding” mindset. Useful for shopping for models; the article’s claim is that the leaderboard is not the relevant axis for enterprise OOD vocabulary.