The Untaught Lessons of RAG Retrieval: Cosine Is Not the Basis

A companion to Enterprise Doc Intelligence, the series whose philosophy is laid out in Amplify the Skilled. It zooms in on brick 3 (retrieval) of the four-brick architecture and surfaces the lessons most tutorials skip.

The mainstream story has retrieval as: embed the question, return top-k by cosine, optionally rerank. We disagree with almost every part of it. Retrieval is filtering on structured tables, not searching free text. Embeddings are the optional fallback, not the foundation. Anchor and context are two granularities, not one. Each of these is a position we can defend, with consequences you can measure.

*Where this article sits in the series: brick 3 (retrieval) highlighted – Image by author*

📓 Runnable companion notebooks are on GitHub: doc-intel/notebooks-vol1.

*The public companion-code repo at doc-intel/notebooks-vol1 – Image by author*

The naive baseline this article pushes back on

*The architectural difference: a single cosine signal over chunks vs three signals in parallel on structured tables – Image by author*

The naive pipeline chunks the document, embeds every chunk, embeds the question, and ranks by cosine. That single signal is opaque, and it throws away the document’s structure. We keep the document as line_df + toc_df and run three retrieval signals in parallel (keyword on lines, TOC reasoning, embedding cosine), then let an LLM arbiter rank once at the end with all three sets of hits in view.

*Keywords always run, the TOC always reasons, embeddings fire only when the vocabulary mismatches – Image by author*

Below are the six untaught lessons of this brick.

Lesson 1 – Retrieval is filtering, not searching

Once parsing is done, retrieval is a SQL-like filtering problem over line_df and toc_df, the opposite of the chunk-embed-cosine-top-k framing. The shift is simple to state: the question has columns, the document has columns, and retrieval is the join.

Why it matters. Search and filter aren’t synonyms; the two operations have different mechanics. Search scores every candidate on a continuous similarity (cosine, BM25), forces a top-k cutoff, and always returns something, even when the answer isn’t in the document. Filter applies a boolean condition (line.contains("X"), toc.title in [...]), keeps every row that matches and no more, and can return zero rows when the document doesn’t carry the answer. The audit consequence is the biggest part of the gap: a filter’s condition is one line of inspectable code that runs the same way in six months; a search’s score depends on which dimensions of the embedding mattered, and you cannot replay that judgment without re-running the model.
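To make the mechanical difference concrete, here is a minimal sketch rather than the series' own code. The three lines are a hypothetical document, and `bow_cosine` is a bag-of-words cosine used as a crude stand-in for embedding cosine so the example runs without a model. All function names are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical toy document; in the series this role is played by line_df.
lines = [
    "This policy covers flood damage to the insured building.",
    "The deductible is $1000 per claim.",
    "Claims must be filed within 30 days of the event.",
]

def bow_cosine(a, b):
    """Cosine over bag-of-words counts (assumed stand-in for embedding cosine)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search_top_k(question, k=2):
    """Search: score every line, sort, cut at k. It returns k lines whatever the question."""
    ranked = sorted(lines, key=lambda line: bow_cosine(question, line), reverse=True)
    return ranked[:k]

def filter_contains(term):
    """Filter: keep exactly the lines that satisfy a boolean condition."""
    return [line for line in lines if term.lower() in line.lower()]

question = "What is the grace period?"
top = search_top_k(question)            # two lines come back even though none answers
hits = filter_contains("grace period")  # empty list: the term is not in the document
```

The filter's condition is the single `in` test, which is the part an auditor can read and replay later. The search result depends on the scoring function and on the value of `k`.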

Concrete difference. The user asks “What positional encoding does the paper use?”. Naive RAG embeds the question, scores 300+ chunks, and returns the top-5. Series RAG filters line_df where the line contains "positional encoding" (4 hits), filters toc_df where the section title contains "positional" (1 section, 3.5 Positional Encoding), and the arbiter sees both (anchor: the line; scope: the section). No cosine needed.
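The two filters and the join can be sketched as follows. This is an illustration under assumptions: plain lists of dicts stand in for line_df and toc_df, the column names (`line_no`, `section_id`, `text`, `title`) are guesses at a plausible schema, and the rows are hypothetical rather than quoted from any paper.

```python
# Hypothetical rows; column names are assumptions, not the series' actual schema.
line_df = [
    {"line_no": 120, "section_id": "3.5", "text": "We add a positional encoding to the input embeddings."},
    {"line_no": 121, "section_id": "3.5", "text": "The positional encoding has the same dimension as the embeddings."},
    {"line_no": 240, "section_id": "5.1", "text": "Training ran on a single machine."},
]
toc_df = [
    {"section_id": "3.5", "title": "Positional Encoding"},
    {"section_id": "5.1", "title": "Training"},
]

def filter_rows(rows, column, needle):
    """Case-insensitive substring filter: one boolean condition per row, no scores."""
    return [r for r in rows if needle.lower() in r[column].lower()]

anchor_lines = filter_rows(line_df, "text", "positional encoding")
scope_sections = filter_rows(toc_df, "title", "positional")

# The join: anchors whose section is also one of the matched TOC sections.
scope_ids = {s["section_id"] for s in scope_sections}
joined = [line for line in anchor_lines if line["section_id"] in scope_ids]
```

Both conditions are visible in the two `filter_rows` calls, so the trace of why a line was returned is just the pair of needles and the matching `section_id`.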

→ Article 7A: Retrieval is filtering, not search lays out the mental model.

Lesson 2 – Anchor and context, kept apart

You anchor on the single line that mentions “premium” (precise) but pass the whole surrounding section to generation (sufficient context); conflating them breaks precision and coverage in one move. Top-k forces you to choose: tiny chunks lose context, big chunks lose precision. We get both, by keeping them apart.

Concrete difference. For a definition question, the anchor is the single line ("the deductible is the amount the insured pays before coverage begins"), and the scope is the paragraph around it (three sentences of context the LLM needs to phrase the answer). Naive top-k either returns the line (no context) or the paragraph (anchor unclear). Series retrieval returns anchor + scope as a typed pair.
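One way to picture the typed pair is a small set of dataclasses. The sketch below is an assumption about shape, not the series' implementation: the class names, the `para_id` column and the sample lines are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Anchor:
    line_no: int
    text: str

@dataclass(frozen=True)
class Scope:
    start_line: int
    end_line: int
    text: str

@dataclass(frozen=True)
class Hit:
    anchor: Anchor  # the precise line that matched
    scope: Scope    # the surrounding context passed to generation

# Hypothetical lines; para_id is an assumed column grouping lines into paragraphs.
line_df = [
    {"line_no": 10, "para_id": 4, "text": "Each claim is subject to a deductible."},
    {"line_no": 11, "para_id": 4, "text": "The deductible is the amount the insured pays before coverage begins."},
    {"line_no": 12, "para_id": 4, "text": "It applies once per claim."},
    {"line_no": 13, "para_id": 5, "text": "Premiums are due monthly."},
]

def retrieve(pattern):
    """Anchor on each matching line, then widen to its paragraph for the scope."""
    hits = []
    for row in line_df:
        if pattern.lower() in row["text"].lower():
            para = [r for r in line_df if r["para_id"] == row["para_id"]]
            scope = Scope(para[0]["line_no"], para[-1]["line_no"],
                          " ".join(r["text"] for r in para))
            hits.append(Hit(Anchor(row["line_no"], row["text"]), scope))
    return hits

hits = retrieve("deductible is")
```

Because the anchor and the scope are separate fields, the citation can point at line 11 while the generator still receives lines 10 to 12.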

→ Article 7A: Retrieval is filtering, not search draws the line between anchor and context.

Lesson 3 – Embeddings come last, not first

Keywords always run (cheap, deterministic); the document’s own TOC is a first-class retrieval method; embeddings are the optional final signal, used only when vocabulary mismatch is expected. The 2024-era reflex starts with embeddings; we leave them for the cases where the cheaper signals failed.

Concrete difference. A factual lookup on an insurance policy: “effective date?”. Naive RAG embeds and returns 5 chunks. Series runs keyword on "effective" and "date" → 1 line found → done. Embeddings never ran. Cost: one regex pass over line_df; a few milliseconds. The two-cent cosine search didn’t happen.

→ Article 7B: Discovering the suitable anchors builds the three-signal pipeline.

Lesson 4 – Keywords prove absence; embeddings can’t

A zero on keyword search means the answer is genuinely not there; a zero on embedding similarity could be absence or just different words, so embeddings are a refinement, not a decision gate. This asymmetry is the case for keywords as the primary signal in enterprise RAG.

Concrete difference. The user asks “does this contract cover earthquake damage?” on a flood-only policy. Keyword search for "earthquake" returns zero matches in line_df. The pipeline can ship answer_found = False confidently. Embedding cosine returns 5 chunks (the closest topically related lines about natural disasters) and the LLM, seeing them, may infer a wrong yes. Keywords saved the day.
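The absence gate can be sketched in a few lines. The flood-policy lines are hypothetical, and `check_presence` is an illustrative name. Note the assumption baked in: the gate is only as strong as the term list, so in practice any synonyms the domain uses would need to be checked too.

```python
# Hypothetical flood-only policy lines standing in for line_df.
line_df = [
    {"line_no": 1, "text": "This policy covers flood damage to the insured property."},
    {"line_no": 2, "text": "Damage from rising water is covered up to the insured sum."},
    {"line_no": 3, "text": "Storm surge is treated as flood for the purposes of this policy."},
]

def check_presence(term):
    """Boolean gate: answer_found is False exactly when no line contains the term."""
    hits = [r for r in line_df if term.lower() in r["text"].lower()]
    return {"term": term, "answer_found": bool(hits), "hits": hits}

result = check_presence("earthquake")
```

An empty `hits` list is a fact about the document that can be re-checked by anyone, which is what lets the pipeline report `answer_found = False` instead of handing loosely related lines to the LLM.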

→ Article 7B: Discovering the suitable anchors explains the keyword-first discipline.

Lesson 5 – Co-occurrence beats BM25 on narrow corpora

BM25 ranks by term frequency weighted by IDF, but the enterprise answer shape is one mention of a topic next to a specific value, so co-occurrence boosts and high-value regex anchors beat statistical IDF on narrow corpora. The IDF assumptions break on a 20-document corpus where every term is “rare” by Wikipedia standards.

Concrete difference. The question is “what is the deductible amount?”. BM25 ranks by frequency of "deductible"; the glossary passage where the term appears 12 times ranks first. Co-occurrence search ranks lines that contain both "deductible" and a number; the actual policy line ("the deductible is $1000") ranks first because it co-occurs with $1000, and the LLM can extract the value cleanly.
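A small sketch of the two rankings follows. It makes two simplifications worth flagging: raw term frequency stands in for BM25's frequency component (this is not full BM25, which also applies IDF and length normalisation), and the amount regex only recognises dollar values. The lines are hypothetical.

```python
import re

# Hypothetical lines: a glossary entry that repeats the term, and the actual policy line.
line_df = [
    {"line_no": 5, "text": "Deductible: a deductible is the part of a loss the insured bears; deductible amounts vary."},
    {"line_no": 40, "text": "The deductible is $1000."},
    {"line_no": 41, "text": "Premiums are payable monthly."},
]

TERM = re.compile(r"\bdeductible\b", re.IGNORECASE)
AMOUNT = re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?")  # assumed value pattern: dollar amounts only

def tf_score(text):
    """Frequency-style score: how many times the term occurs in the line."""
    return len(TERM.findall(text))

def cooccurrence_score(text):
    """1 when the term and a monetary value sit on the same line, else 0."""
    return 1 if TERM.search(text) and AMOUNT.search(text) else 0

by_tf = max(line_df, key=lambda r: tf_score(r["text"]))
by_cooc = max(line_df, key=lambda r: cooccurrence_score(r["text"]))
```

The frequency score rewards the glossary entry for repeating the word, while the co-occurrence score rewards the only line that pairs the topic with a value, which is the shape of the answer.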

→ Article 7B: Discovering the suitable anchors measures co-occurrence against BM25.

Lesson 6 – One LLM pass over the TOC

Handing the 20-100 row toc_df to a small model and asking which sections answer the question costs one cached call and catches the paraphrases (“exit early” ≈ “Termination”) that keyword matching misses.

TOC reasoning is one of the most under-used retrieval signals in production RAG.

Concrete difference. The user asks “when can I leave the policy early?”. Substring matching on "leave" returns zero TOC entries. An LLM call on the full TOC (28 rows, fits in one small prompt) returns the section “Termination and Cancellation”, the correct paraphrase. One cached LLM call, deterministic afterwards, and the right anchor.
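The call shape and the caching can be sketched as below. Everything model-related is an assumption: `fake_llm` is a canned stand-in that always replies "3" so the example runs offline, and the prompt wording and the three-row TOC are hypothetical. The sketch shows the plumbing, not the model's reasoning.

```python
from functools import lru_cache

# Hypothetical TOC rows (section id, title) standing in for toc_df.
toc_df = (
    (1, "Definitions"),
    (2, "Premium Payment"),
    (3, "Termination and Cancellation"),
)

llm_calls = 0

def fake_llm(prompt):
    """Stand-in for a real model call; returns a canned reply so the sketch runs offline."""
    global llm_calls
    llm_calls += 1
    return "3"

def build_prompt(question, toc):
    """Put the whole TOC in one prompt and ask for section ids."""
    rows = "\n".join(f"{sid}: {title}" for sid, title in toc)
    return (f"Question: {question}\nSections:\n{rows}\n"
            "Reply with the ids of the relevant sections, comma-separated.")

@lru_cache(maxsize=None)
def toc_sections_for(question):
    """One model call per distinct question; repeats are served from the cache."""
    reply = fake_llm(build_prompt(question, toc_df))
    wanted = {int(x) for x in reply.split(",") if x.strip().isdigit()}
    return tuple(title for sid, title in toc_df if sid in wanted)

substring_hits = [title for _, title in toc_df if "leave" in title.lower()]
first = toc_sections_for("when can I leave the policy early?")
second = toc_sections_for("when can I leave the policy early?")
```

Caching on the question string is what makes the result stable after the first call: the second lookup never reaches the model, so the same question maps to the same sections on every replay.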

→ Article 7B reasons over the TOC, and Article 7C: An LLM as arbiter adds the arbiter.

The six lessons share one move: refuse the chunk-embed-cosine reflex, and treat retrieval as filtering on structured tables instead. Keywords always run because they prove absence; the TOC is a first-class signal because the document already declared its structure; embeddings are the optional refinement, not the foundation. The deep-dives (7A, 7B, 7C, 7bis) ship runnable code on real documents; this piece is the catalogue that points at them.

Across sectors and professions

The same three-signal retrieval pattern (keyword on line_df + reasoning on toc_df + embedding fallback) holds in every domain. The vocabulary and the TOC depth differ; the signal hierarchy doesn’t. Five sectors below, one retrieval pattern, one audit trace per call.

*Embeddings fire only on the medical row where vocabulary diverges from the document – Image by author*

Embeddings fire only on the medical row, where the user’s vocabulary (“tachycardia”) diverges from the document’s (“fast heart rate”). The other four rows resolve entirely on keyword + TOC. Keywords prove absence (Lesson 4), the TOC catches paraphrases (Lesson 6), and the anchor / scope split keeps precision and context apart (Lesson 2) in every row. The cost gradient is real: the four keyword-resolved rows run in milliseconds with zero LLM tokens; the medical row pays for one embedding pass and one arbiter call.

Sources and further reading

The mainstream literature on retrieval is shaped by web-scale search and shorter consumer corpora. The series stance assumes a small enterprise corpus where the structure is known and the vocabulary is the asset.

Retrieval is filtering, not search (Article 7A). The published mental-model article: retrieval as filtering on structured tables.
Embeddings Aren’t Magic (Article 2). The published failure-modes catalogue for embedding similarity.
Rerankers Aren’t Magic Either (Article 2bis). When the cross-encoder pays off and when it doesn’t.