Baseline Enterprise RAG, From PDF to Highlighted Answer

The fastest way to understand RAG is to build the smallest version that actually works, run it on a real document, and look closely at what just happened.

That’s this article. A few hundred lines of Python (no vector database, no framework, no agents) running on the Attention Is All You Need paper (Vaswani et al. 2017; arXiv non-exclusive distribution license, declared on the arXiv abstract page), returning a sourced answer with the exact source lines highlighted on the page.

Then we walk back through each block and ask the question it naturally raises. Each question is what a later article develops.

The minimal pipeline is the smallest amount of code that respects the four bricks and produces a verifiable answer. Each later article adds a capability the team needs after a specific failure on real documents, not because the architecture needed more layers.

This article is one part of the broader Enterprise Document Intelligence Vol. 1 series, which builds enterprise RAG brick by brick from a baseline pipeline to corpus-scale architecture.

1. What we’re building

The pipeline has four bricks (Part II goes into each one in detail) plus a final, optional rendering step. Each brick says what it takes in and what it gives back; what we pass from one brick to the next is what we save.

Document parsing takes a PDF path and returns line_df (one row per text line, with page_num, line_num, text, and the bounding box) plus page_df. The minimal version holds both in memory; bigger systems persist them (Article 23 covers when to move to a database).
Question parsing turns the user’s question into a ParsedQuestion carrying the normalized question plus a short list of checked keywords. It stays narrow on purpose: no retrieval logic here, no question embedding.
Retrieval consumes the ParsedQuestion and emits top-k page numbers (and, when needed, the matching line numbers within those pages). Limiting the handoff to page numbers keeps it small; the next step rebuilds the filtered lines from line_df on the spot. The question embedding lives in this brick because it depends on the corpus index.
Generation brings together the question, line_df, and the retrieved page numbers, and produces an AnswerWithEvidence: a typed JSON carrying the answer, the evidence span (start_page, start_line, end_page, end_line), a confidence, a justification, the exact quotes from the source, and any caveats. The full JSON is worth saving for evaluation, audit, and replay.
PDF annotation is optional. Given the source PDF and the evidence span, it writes an annotated PDF with rectangles drawn around the cited lines. A CLI tool, a batch job, or an API client can skip it; the answer with citations is already complete after generation.

The first four are the four bricks (Article 5 develops document parsing, Article 6 question parsing, Article 7 retrieval, Article 8 generation). PDF annotation is the rendering step, not a brick in itself.

*The baseline RAG pipeline, end to end – Image by author*

A PDF and a question go in. Each brick turns its input into something more structured: document parsing turns the PDF into rows, question parsing turns the question into search-ready keywords, retrieval cuts the rows down to a few page numbers, generation produces a typed answer, and PDF annotation draws the cited lines back onto the source. What comes out is not a chatbot bubble. It’s a sourced JSON answer plus an annotated PDF you can open and check.

The dependencies are minimal:

pymupdf parses PDFs into text plus position information; the bounding boxes it returns are what we use to highlight the answer back on the source page.
openai is the LLM client; via base_url the same library serves Azure, OpenRouter, Ollama, or any compatible endpoint.
pandas holds the document as a DataFrame, the format every parsing and retrieval step uses.
pydantic defines the answer schema that forces structured JSON with citations.

No vector database, no orchestration framework, no specialized RAG library. Later articles look at when those libraries’ helpers become useful, and when they get in the way of seeing what’s going on.

“For a 15-page paper, the LLM can read the whole thing. Why bother with retrieval?” Fair point on this one document. We use the paper to show the method, not to save tokens on these 15 pages. The objection often points to the Needle in a Haystack benchmark (Kamradt, 2023), where frontier models score near-perfectly when retrieving a single verbatim sentence from a 1M-token context.

That benchmark is research, not practice. A needle is one isolated, verbatim fact, while enterprise questions aggregate (“every contract whose deductible exceeds €5,000”), compare (“clause 12 across these three policies”), or summarize across many passages. None of those is a single sentence to find.

Two more practical reasons keep retrieval in the loop. Enterprise documents are often long:

a 300-page insurance contract,
a 500-page regulatory filing,
a multi-volume technical specification.

Sending the whole thing to the LLM costs real money on every question, every rerun, every user, and dilutes its attention across irrelevant pages.

And the same question runs across hundreds or thousands of documents at once:

“find every contract that excludes earthquake damage”,
“summarize this year’s regulatory changes across all filings”.

At that scale, “throw it all in” stops being a strategy. Retrieval is what makes the pipeline survive both moves: from one short paper to one long contract, and from one document to a whole corpus.

2. The four bricks, and a PDF highlight

Each step declares its inputs and outputs, and the steps are independent. The output of step N is the input of step N+1, saved as a named DataFrame so any step can be re-run on its own against the saved output of the previous one. In the AI-coding era, an assistant told to “fix retrieval” can quietly modify the question parser when it should have stayed untouched. Independent modules are how you work confidently on one piece without breaking the rest.

The setup chunks below load these libraries alongside the OpenAI client.

Every brick that talks to a model needs a configured client. The series uses OpenAI’s Python SDK; any provider that exposes an OpenAI-compatible endpoint (Azure OpenAI, vLLM, llama.cpp’s llama-server, …) drops in by changing base_url and the model name.

import os

from dotenv import load_dotenv
from openai import OpenAI

# Read API_KEY, BASE_URL, MODEL_CHAT and MODEL_EMBED from a local .env file.
load_dotenv()

client = OpenAI(
    api_key=os.getenv("API_KEY"),
    base_url=os.getenv("BASE_URL"),  # None falls back to the SDK default endpoint
)
model_chat = os.getenv("MODEL_CHAT", "gpt-4.1")
model_embed = os.getenv("MODEL_EMBED", "text-embedding-3-small")

2.1 Document parsing

We extract every text line of the PDF along with its position on the page. The output is a DataFrame where each row is one line, with page_num, line_num, the text itself, and the four bounding-box coordinates x0, y0, x1, y1.

In: a PDF path.

Out: line_df (one row per text line, with page_num, line_num, text, and the bounding box) plus a page_df we’ll build in section 2.3.

The bounding boxes matter: they’re what we use to draw highlights on the source PDF at the end.

import fitz  # PyMuPDF
import pandas as pd


def fitz_pdf_to_line_df(file_path):
    """Return one row per text line with its page, line number and bounding box."""
    doc = fitz.open(file_path)
    data = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        blocks = page.get_text("dict").get("blocks", [])
        line_num = 0
        for block in blocks:
            if block.get("type") != 0:  # 0 = text block; skip image blocks
                continue
            for line in block.get("lines", []):
                spans = line.get("spans", [])
                if not spans:
                    continue
                text = "".join(s["text"] for s in spans)
                # The union of the span boxes is the box of the whole line.
                rect = fitz.Rect(spans[0]["bbox"])
                for span in spans[1:]:
                    rect |= fitz.Rect(span["bbox"])
                data.append({
                    "page_num": page_num + 1,  # 1-based, as shown in a PDF viewer
                    "line_num": line_num + 1,
                    "text": text,
                    "x0": float(rect.x0), "y0": float(rect.y0),
                    "x1": float(rect.x1), "y1": float(rect.y1),
                })
                line_num += 1
    return pd.DataFrame(data)

Running line_df = fitz_pdf_to_line_df(pdf_path) on the Attention paper returns 1048 lines across 15 pages.

*First 5 rows of line_df with page, line number, text, and bounding box – Image by author*

The paper, turned into rows. Each line is one row, with its text and the four numbers that locate it on the page. The x0, y0, x1, y1 columns don’t mean much yet; in section 2.5 they’re what we use to draw rectangles on the source PDF, exactly over the lines the model cited.

This DataFrame, line_df, is the core data structure of the rest of the series. Article 5 introduces a richer relational model around it (line_df, chunk_df, toc_df, page_df, image_df).

What this parser doesn’t do: detect tables (Table 1 page 4, Table 3 page 9 flatten into plain lines), reconstruct headings, footnotes, cross-references, or handle multi-column layouts. None of this matters for the question we ask here. For other questions on the same paper, it will. Article 5 covers parsing in full.

2.2 Question parsing

Before the question goes to retrieval, we run it through a tiny LLM call. The goal is to extract the keywords most useful for searching the document: short phrases the document is likely to use, not necessarily the literal words of the question.

In: a text question.

Out: a ParsedQuestion holding the normalized question and a short list of checked keywords.

This step doesn’t know about retrieval. It doesn’t compute the question embedding either. That one is tied to the corpus index and lives in section 2.3. Keep that line clean and you can swap the embedding model or add a hybrid retriever tomorrow without touching question parsing.

Why bother on a minimal pipeline? Two reasons:

You can explain why retrieval picked what it picked. When the system answers wrong, we can see whether the keywords were off (question-parsing problem) or the right keywords landed on the wrong page (retrieval problem). Without question parsing, retrieval is a black box.
The question is a real input, just like the document. Section 2.1 parsed the document into line_df. This subsection parses the question into ParsedQuestionMinimal. Both inputs deserve to be parsed before they hit the search step. Article 6 builds the richer brick (parse_question, with answer shape, scope filters, decomposition, …).

On the question “What are the options mentioned for positional encoding?”, the call parsed_question = get_keywords_from_question(question, client=client) returns parsed_question.keywords = ['positional encoding'].

question = "What are the options mentioned for positional encoding?"
parsed_question = get_keywords_from_question(question, client=client)
print(parsed_question.keywords)

['positional encoding']

The LLM produces a single, literal phrase like ['positional encoding']. That’s deliberate. An earlier draft of this prompt asked for “3 to 5 short keywords useful for searching”, and the LLM happily filled the quota with paraphrases (positional encoding options, types of positional encoding, transformer positional encoding). None of those are written in the document. Only positional encoding is. Substring matching is strict: a single missing word kills the match. The minimal version asks the LLM to do less (extract the literal noun phrase, drop the question framing) and trusts the next block to do the rest.

What this minimal version doesn’t do:

detect an answer_shape (Q&A vs summarization)
decompose compound questions
pull from a domain glossary
attach retrieval hints

All covered in Article 6, under the richer parse_question brick. Here we keep two fields, corrected_question and keywords, the smallest version that makes the brick visible.
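To make the two-field output concrete, here is a minimal sketch of what it could look like. The field names corrected_question and keywords come from the text above; the dataclass itself and the clean_keywords helper are assumptions for illustration, not the series’ actual implementation, and no LLM call is made.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedQuestionMinimal:
    """Two-field output of question parsing (field names from the text above)."""
    corrected_question: str
    keywords: list[str] = field(default_factory=list)


def clean_keywords(raw_keywords, max_keywords=3):
    """Hypothetical post-processing: trim, lowercase, deduplicate, cap the count."""
    seen, cleaned = set(), []
    for kw in raw_keywords:
        norm = " ".join(kw.lower().split())  # collapse inner whitespace
        if norm and norm not in seen:
            seen.add(norm)
            cleaned.append(norm)
    return cleaned[:max_keywords]


# Raw keywords as an LLM might return them (made-up values).
parsed = ParsedQuestionMinimal(
    corrected_question="What are the options mentioned for positional encoding?",
    keywords=clean_keywords(["Positional  Encoding", "positional encoding", " "]),
)
print(parsed.keywords)
```

Keeping the cleanup separate from the LLM call means the brick’s output can be inspected and tested without a network round trip.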

Note: overriding the system prompt. get_keywords_from_question exposes the system prompt as a kwarg with KEYWORDS_PROMPT as default. To test a variant (different domain, stricter rules, extra examples), pass system_prompt=... at the call site. No edit to the function. Same pattern for every LLM helper in docintel (llm_answer_with_evidence exposes both system_prompt and user_template). Below: the same call, run twice on a contract-style question. First with the research-paper default, which stays generic. Then with a contract-domain prompt, which picks up insurance vocabulary like exclusions, deductible.


demo_question = "Are earthquakes excluded from coverage?"

# Default: research-paper prompt.
parsed_question_default = get_keywords_from_question(demo_question, client=client)
print("Default (research-paper):", parsed_question_default.keywords)

# Override: insurance / legal contract prompt.
contract_prompt = (
    "Extract 1 to 3 short keywords from the user question for searching an "
    "insurance contract or legal policy. Prefer literal terms the contract is "
    "likely to use: clauses, exclusions, named perils, deductibles, caps. Drop "
    "question framing words. Output 1 to 3 keywords."
)
parsed_question_contract = get_keywords_from_question(
    demo_question, system_prompt=contract_prompt, client=client,
)
print("Contract prompt:         ", parsed_question_contract.keywords)

Default (research-paper): ['earthquakes', 'coverage']
Contract prompt:          ['earthquakes', 'exclusions', 'coverage']

2.3 Retrieval

Sending all 1048 lines to the LLM works on a paper this size but doesn’t scale and dilutes the model’s attention. We cut the document down to the few pages most likely to contain the answer.

In: the checked keywords (and/or the normalized question, depending on the method) from section 2.2.

Out: the top-k page numbers, plus optionally the matching line numbers within those pages.

The question embedding is computed here, not in section 2.2, because an embedding only makes sense relative to the index it was built on. Same logic for any hybrid scoring or BM25 statistics.

The standard answer in 2024 RAG tutorials is embeddings: turn each page into a vector, rank by cosine similarity. Article 2 is dedicated to them. For the minimal version, we deliberately don’t, for one reason.

Embeddings are opaque. Cosine similarity returns a number like 0.7798 and asks the user to trust that “page 6 is relevant to the question”. Show that score to a domain expert, a product owner, or a manager: nobody understands what 0.78 means, or why it’s higher than 0.65. Developers may argue they understand it (“dot product of normalized vectors”). They understand the math, not the relevance. Asked why this specific page scored 0.7798 against this specific question, they shrug and point at the model.

In an enterprise context, retrieval is the step users question the most. Why did the system look at this page and not that one? You have to explain it. So the minimal version uses something we can read with our own eyes: keyword matching. Section 2.2 pulled the keywords; we score each page by how many of those keywords appear in it, and keep the top three.

Where we search vs what we return: both are pages here. Real retrieval has two levels. The anchor is where the keyword or embedding actually hits (a line, a sentence). The context is what we hand to generation (the lines around it, the page). We search small, we return big. Here we use the page for both. That works on an academic paper where each page is roughly one idea. Article 7 separates the two levels for long contracts, multi-column reports, table-heavy documents.

page_df = build_page_df(line_df) collapses the 1048 lines into 15 pages, one row per page.
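The body of build_page_df is not shown in this article, so the following is only a sketch of the collapsing step under stated assumptions: plain dicts stand in for DataFrame rows so it runs without pandas, lines are joined with a newline, and the function name build_pages and the sample rows are made up for illustration.

```python
from collections import defaultdict


def build_pages(line_rows):
    """Group line rows by page_num and join their text in line order."""
    by_page = defaultdict(list)
    ordered = sorted(line_rows, key=lambda r: (r["page_num"], r["line_num"]))
    for row in ordered:
        by_page[row["page_num"]].append(row["text"])
    # One dict per page, mirroring "one row per page" in page_df.
    return [
        {"page_num": page_num, "text": "\n".join(texts)}
        for page_num, texts in sorted(by_page.items())
    ]


# Toy rows, deliberately out of order to show the sort.
line_rows = [
    {"page_num": 1, "line_num": 2, "text": "Ashish Vaswani"},
    {"page_num": 1, "line_num": 1, "text": "Attention Is All You Need"},
    {"page_num": 2, "line_num": 1, "text": "1 Introduction"},
]
pages = build_pages(line_rows)
for page in pages:
    print(page["page_num"], repr(page["text"]))
```

With pandas, the same idea is a groupby on page_num followed by a string join on the text column.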

*First 5 rows of page_df, one row per page with the full text concatenated – Image by author*

2.3.a Embeddings + cosine similarity

Embed every page (one call per page), embed the question, compute cosine similarity, keep the top-k. The output: a number like 0.7798 per page. Look at the scores below: can you tell why a page made the top three? Could you explain the ranking to a domain expert? That’s the opaque-score problem the article opens with.

*Top three pages by cosine similarity. Precise scores, opaque ranking – Image by author*

Three numbers, all very close to each other (0.7843, 0.7798, 0.7728). Can you say why page 9 beats page 6? The text preview makes the problem obvious: page 9 is the Variations on the Transformer architecture table, page 5 is about output values and concatenation, page 6 is the Maximum path lengths table. The page that actually answers the question, section 3.5 Positional Encoding, sits on page 6 and ranks last in the top three. The unrelated page 5 ranks second. The scores look precise, but the ranking has no story behind it: there is no token to point at, no phrase to defend, just a dot product on two black-box vectors. Embeddings work in many cases, and Article 2 unpacks where this score comes from. But the score itself never becomes interpretable, and for the rest of this article we use a retriever you can read with your own eyes.

2.3.b Keyword matching

For each page, count how many of parsed_question.keywords appear in it (case-insensitive substring match). Drop pages with zero matches; keep the top-k by match count. The output table below carries the exact matched_keywords per page, so anyone can read it and see why a page was picked.

retrieve_pages(page_df, line_df, parsed_question.keywords, top_k=3) returns the top three pages by keyword count plus the filtered lines: 314 lines kept from pages 6, 9, 7.

*Top three keyword-matched pages, with the matched terms shown per page – Image by author*

Three pages, ranked by match count, with the exact matches laid out. Pages 6, 8, and 9 each contain the literal phrase positional encoding; page 6 holds Section 3.5 Positional Encoding with the actual answer. Anyone reading the table can verify the result by hand: search the source for positional encoding and you’ll find these three pages.

Two design choices:

Drop pages with zero matches. A retrieval that says “nothing matches” is more useful than one that pads with three random pages. The schema’s null path (next subsection) handles the empty case cleanly.
We don’t break ties. When pages tie on the same match count, the order is whatever pandas’ nlargest returns. The downstream LLM sees the lines from all tied pages in document order and decides.

From 1048 lines to roughly 300, and we know the right material is in there.
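The keyword variant of retrieve_pages is described but not listed here, so this is a minimal sketch of the scoring rule as stated: case-insensitive substring match, zero-match pages dropped, top-k by match count. The function name, the dict-based pages, and the page texts are assumptions for illustration; ties keep document order in this sketch because Python’s sort is stable, which may differ from what nlargest does.

```python
def retrieve_pages_by_keywords(pages, keywords, top_k=3):
    """Score each page by how many keywords it contains; keep the top_k."""
    scored = []
    for page in pages:
        text = page["text"].lower()
        matched = [kw for kw in keywords if kw.lower() in text]
        if matched:  # drop pages with zero matches instead of padding
            scored.append({
                "page_num": page["page_num"],
                "match_count": len(matched),
                "matched_keywords": matched,
            })
    # Stable sort: pages with equal counts stay in document order.
    scored.sort(key=lambda r: r["match_count"], reverse=True)
    return scored[:top_k]


# Made-up page texts in the spirit of the paper.
pages = [
    {"page_num": 5, "text": "Multi-head attention concatenates the output values."},
    {"page_num": 6, "text": "3.5 Positional Encoding\nWe add positional encodings to the input embeddings."},
    {"page_num": 9, "text": "Row (E) replaces the sinusoidal positional encoding with learned embeddings."},
]
hits = retrieve_pages_by_keywords(pages, ["positional encoding"])
for hit in hits:
    print(hit)
```

Every returned row carries its matched_keywords, which is the readable explanation the embedding score lacks.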

import numpy as np


def cosine_sim_matrix(query_vec, doc_matrix):
    """Cosine similarity between one query vector and every row of doc_matrix."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    d = doc_matrix / (np.linalg.norm(doc_matrix, axis=1, keepdims=True) + 1e-12)
    return d @ q


def retrieve_pages(page_df, line_df, question, top_k=3):
    # Embedding variant from section 2.3.a; get_embedding is defined elsewhere.
    q_vec = np.asarray(get_embedding(question), dtype=np.float32)
    doc_matrix = np.vstack(page_df["embedding"].values)
    sims = cosine_sim_matrix(q_vec, doc_matrix)

    scored = page_df.copy()
    scored["similarity"] = sims
    retrieved_pages_df = scored.nlargest(top_k, "similarity")

    # Keep only the lines that belong to the retrieved pages.
    kept_pages = retrieved_pages_df["page_num"].tolist()
    filtered_line_df = line_df[line_df["page_num"].isin(kept_pages)]
    return retrieved_pages_df, filtered_line_df

Note: the “split into individual words” trap. A natural reflex when the multi-word phrases don’t match: split them and search for the individual tokens. Below we expand every keyword into its words, deduplicate, then re-run retrieval. We get matches, and we also get false positives, because words like encoding, transformer, network appear all over the document in unrelated contexts.

Now every page in the top three matches several tokens, but look at which tokens. Words like encoding and transformer cover most of the paper. Pages about layer encoding or encoder stacks look as relevant as the page that actually answers the question. Splitting trades one failure (zero matches) for another (false positives). Article 7 covers the real fixes (synonym expansion through a dictionary, hybrid scoring); for now, keep the phrase whole.
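The trap is easy to reproduce on toy data. The sketch below assumes two made-up pages and two hypothetical helpers (split_into_words, pages_matching); it only illustrates how splitting a phrase widens the match set.

```python
def split_into_words(keywords):
    """Expand multi-word keywords into deduplicated single tokens (the trap)."""
    tokens = []
    for kw in keywords:
        for tok in kw.lower().split():
            if tok not in tokens:
                tokens.append(tok)
    return tokens


def pages_matching(pages, terms):
    """Return the page numbers whose text contains at least one term."""
    return [
        p["page_num"] for p in pages
        if any(t.lower() in p["text"].lower() for t in terms)
    ]


# Made-up pages: only page 6 is about positional encoding.
pages = [
    {"page_num": 3, "text": "The encoder maps an input sequence; byte-pair encoding is used."},
    {"page_num": 6, "text": "3.5 Positional Encoding"},
]
phrase_hits = pages_matching(pages, ["positional encoding"])
token_hits = pages_matching(pages, split_into_words(["positional encoding"]))
print("phrase:", phrase_hits)
print("tokens:", token_hits)
```

The whole phrase selects page 6 only; the split tokens also pull in page 3, which mentions encoding in an unrelated sense.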

2.3.c A harder question: where each retriever breaks

Same pipeline, a different question. We ask about the value of epsilon used in label smoothing. The answer is on page 8 of the paper, written as ε_ls = 0.1 (Greek letter ε, never the English word epsilon). Watch what each retriever does.

question_2 = "What is the value of epsilon used in label smoothing?"
parsed_question_2 = get_keywords_from_question(question_2, client=client)
print("Keywords:", parsed_question_2.keywords)

Keywords: ['epsilon', 'label smoothing']

Two failures of different shapes:

Embeddings rank pages by topical proximity. The right page (page 8, where ε_ls = 0.1 lives) may or may not be in the top three. Pages dense in math notation come up even when they’re unrelated.
Keywords are blind to symbols. The LLM emits epsilon, label smoothing, etc. The document writes the Greek letter ε. Substring match returns zero on anything that mentions epsilon by symbol only. The page that contains the answer is invisible to the keyword retriever.

Section 4.4 picks this up as the bridge to Article 2 (Embeddings handle synonyms and surface variation) and Article 6 (richer Question Parsing pulls in alternatives like the Greek letter).
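The symbol blindness can be shown in a few lines. This is a sketch only: the alias table, both helper names, and the sample line are assumptions for illustration, not the mechanism Article 6 actually uses. Two code points are listed for epsilon because a PDF extractor may emit either ε (U+03B5) or ϵ (U+03F5).

```python
# Hypothetical alias table; a real one would come from a domain glossary.
SYMBOL_ALIASES = {"epsilon": ["ε", "ϵ"], "alpha": ["α"], "beta": ["β"]}


def expand_with_symbols(keywords, aliases=SYMBOL_ALIASES):
    """Append the symbol forms of any keyword that has an entry in the table."""
    expanded = []
    for kw in keywords:
        for candidate in [kw, *aliases.get(kw.lower(), [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def matched_terms(text, keywords):
    """Case-insensitive substring match, as in the keyword retriever."""
    return [kw for kw in keywords if kw.lower() in text.lower()]


# Illustrative line written with the Greek letter, not the word.
line = "smoothing of value ε_ls = 0.1"
plain = matched_terms(line, ["epsilon"])
with_aliases = matched_terms(line, expand_with_symbols(["epsilon"]))
print("plain:", plain)
print("with aliases:", with_aliases)
```

The plain keyword finds nothing, while the expanded list hits on the symbol, and the matched term still explains why the line was picked.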

2.4 Generation

We send the retrieved lines to the LLM with the question, formatted as a tab-separated block where page_num and line_num sit next to each line. That format gives the LLM the exact coordinates it needs to cite.

In: the original question, line_df, and the retrieved page numbers from section 2.3.

Out: an AnswerWithEvidence, a structured JSON with the answer, the evidence span (start_page_num, start_line_num, end_page_num, end_line_num), a confidence, a justification, the exact quotes, and any caveats.

from pydantic import BaseModel, Field


class AnswerWithEvidence(BaseModel):
    answer: str = Field(...)

    # Evidence span; all four are None when no answer is found.
    start_page_num: int | None
    start_line_num: int | None
    end_page_num: int | None
    end_line_num: int | None

    confidence: float = Field(..., ge=0.0, le=1.0)
    justification: str = Field(...)

    quotes: list[str] = Field(default_factory=list)
    caveats: list[str] = Field(default_factory=list)

The raw JSON is worth saving in production: justification, quotes, caveats, and confidence all feed evaluation, audit, and replay, well beyond the answer field a chat UI shows.

We serialize the filtered lines into a TSV with header page_num\tline_num\ttext, one row per line. The LLM sees the exact coordinates next to each text fragment so it can cite by (page_num, line_num) in its answer.
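A minimal sketch of that serialization, assuming plain dicts in place of DataFrame rows. The helper name lines_to_tsv and the sample rows are made up; the header matches the one named above. Whitespace inside a line is collapsed so a stray tab cannot shift the columns.

```python
def lines_to_tsv(line_rows):
    """Serialize line rows as TSV with the header page_num, line_num, text."""
    out = ["page_num\tline_num\ttext"]
    for r in line_rows:
        # Tabs or newlines inside the text would break the column layout.
        text = " ".join(str(r["text"]).split())
        out.append(f"{r['page_num']}\t{r['line_num']}\t{text}")
    return "\n".join(out)


# Illustrative rows; the second one contains a tab on purpose.
rows = [
    {"page_num": 6, "line_num": 30, "text": "3.5 Positional Encoding"},
    {"page_num": 6, "line_num": 31, "text": "There are many choices of\tpositional encodings,"},
]
tsv = lines_to_tsv(rows)
print(tsv)
```

With a DataFrame, to_csv(sep="\t", index=False) on the three columns produces the same layout, though it quotes fields differently when they contain special characters.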

This is what makes the answer grounded: the schema forces the model to fill in (start_page, start_line, end_page, end_line), a verbatim quote, and caveats if anything is uncertain. No prose, only a typed object with citations.

We call answer = llm_answer_with_evidence(question, filtered_line_df, client=client) and get back an AnswerWithEvidence instance, rendered below as a styled JSON image so the field labels stay legible.

def llm_answer_with_evidence(question, filtered_text_prompt):
    """Ask the model for an AnswerWithEvidence and return it as a JSON string."""
    resp = client.responses.parse(
        model=model_chat,
        input=[
            {
                "role": "system",
                "content": (
                    "Answer using ONLY the provided lines. "
                    "Return JSON only."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Lines:\n{filtered_text_prompt}\n\n"
                    f"Question:\n{question}\n\n"
                    "Pick a contiguous evidence span."
                ),
            },
        ],
        text_format=AnswerWithEvidence,  # schema the output must follow
        store=False,
    )
    return resp.output_text

On the positional-encoding question, the call returns the following JSON.

{
  "answer": "The options for positional encoding mentioned are learned positional embeddings and fixed positional encodings (specifically, using sine and cosine functions of different frequencies).",
  "start_page_num": 6,
  "start_line_num": 31,
  "end_page_num": 6,
  "end_line_num": 32,
  "confidence": 0.98,
  "justification": "Lines 31–32 explicitly state: 'There are many choices of positional encodings, learned and fixed [9].' Additionally, further lines detail the sinusoidal encoding as the fixed choice, and Table 3 row (E) discusses using learned embeddings instead.",
  "quotes": [
    "There are many choices of positional encodings, learned and fixed [9]."
  ],
  "caveats": [
    "Further details about the specific implementation of learned embeddings are only touched on elsewhere, but both options are mentioned here."
  ],
  "complete_answer_found": true,
  "context_structured": true,
  "llm_discovered_keywords": [
    "learned positional embeddings",
    "fixed positional encodings",
    "sinusoidal positional encoding"
  ]
}

Three things happened that matter:

The answer is correct. Both options identified, paraphrased correctly.
The evidence span (page 6, lines 31-32) points to a specific region. Not “somewhere on page 6”. Exact lines.
The citation is checkable: the model only saw lines from the retrieved pages, and the schema forced an exact (page, line) range we can verify.

If the model can’t fill the schema, null fields are allowed and caveats record why. Article 8 develops the schema into a much richer form with per-brick feedback fields; Article 23 builds the storage architecture around it.
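Because the span is an exact (page, line) range, checking it can be automated. The sketch below is one way to do that under stated assumptions: plain dicts stand in for line_df rows and for the parsed answer, the helper name verify_citation is hypothetical, and the way the sentence is split across lines 31 and 32 is illustrative.

```python
def verify_citation(line_rows, answer):
    """Return True when the span is non-empty and every quote appears in it."""
    start = (answer["start_page_num"], answer["start_line_num"])
    end = (answer["end_page_num"], answer["end_line_num"])
    if None in start or None in end:
        return False  # null span: nothing to verify
    ordered = sorted(line_rows, key=lambda r: (r["page_num"], r["line_num"]))
    # Tuple comparison orders (page, line) pairs the way a reader would.
    cited = [r["text"] for r in ordered
             if start <= (r["page_num"], r["line_num"]) <= end]
    span_text = " ".join(" ".join(cited).split())
    quotes_ok = all(" ".join(q.split()) in span_text for q in answer["quotes"])
    return bool(cited) and quotes_ok


line_rows = [
    {"page_num": 6, "line_num": 31, "text": "There are many choices of positional encodings,"},
    {"page_num": 6, "line_num": 32, "text": "learned and fixed [9]."},
]
answer = {
    "start_page_num": 6, "start_line_num": 31,
    "end_page_num": 6, "end_line_num": 32,
    "quotes": ["There are many choices of positional encodings, learned and fixed [9]."],
}
print(verify_citation(line_rows, answer))
```

A strict substring test like this will reject quotes that the model lightly reworded or that span a hyphenated line break, so in practice it works best as a flag for review rather than a hard gate.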

Sanity check. On a paper this short we can also send the full line_df to the LLM with no retrieval and check that the answer matches. Reassuring here, but it won’t scale to large documents.

{
  "answer": "The options mentioned for positional encoding are sinusoidal positional encodings (using sine and cosine functions of different frequencies) and learned positional embeddings.",
  "start_page_num": 6,
  "start_line_num": 27,
  "end_page_num": 6,
  "end_line_num": 41,
  "confidence": 0.99,
  "justification": "Lines 6:27-6:41 describe adding 'positional encodings' to the input embeddings, specify the sinusoidal method, and mention experimenting with learned positional embeddings, stating both options were tried and produced nearly identical results.",
  "quotes": [
    "Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add 'positional encodings' to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9]. In this work, we use sine and cosine functions of different frequencies: ... We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training."
  ],
  "caveats": [
    "Exact mathematical formulas for sinusoidal encoding are present here, but full details for learned embeddings are not. Table 3 row (E) and further details may expand on results but are not needed for the options question."
  ],
  "complete_answer_found": true,
  "context_structured": true,
  "llm_discovered_keywords": [
    "sinusoidal positional encoding",
    "learned positional embeddings",
    "sine and cosine functions",
    "relative or absolute position"
  ]
}

2.5 PDF annotation on the source PDF

Now the satisfying part. We use the evidence span to draw rectangles directly on the source PDF.

In: the source PDF and the evidence span from the AnswerWithEvidence.

Out: an annotated PDF with rectangles drawn around the cited lines.

Optional. A CLI tool, a batch job, or an API may skip it; the answer with citations is already complete after section 2.4.

Three calls do the work:

passage_lines_df_from_answer(line_df, answer) rebuilds the cited-line DataFrame from the evidence span.
passage_bbox_by_page(passage_df) groups bounding boxes per page.
draw_passage_rectangles(pdf_path, bboxes_df, out_pdf_path) writes the annotated PDF.

*One bounding box per cited page, wrapping every cited line on that page – Image by author*

*PDF annotation in three steps: expand the span, union per page, draw rectangles – Image by author*

import json

import fitz  # PyMuPDF


def passage_lines_df_from_answer(line_df, answer_json):
    """Rebuild the cited lines from the evidence span in the answer JSON."""
    a = json.loads(answer_json)
    sp, sl = a["start_page_num"], a["start_line_num"]
    ep, el = a["end_page_num"], a["end_line_num"]
    if sp is None:
        return line_df.iloc[0:0]  # null span: empty frame, same columns
    mask = (
        line_df["page_num"].between(sp, ep)
        & ((line_df["page_num"] != sp) | (line_df["line_num"] >= sl))
        & ((line_df["page_num"] != ep) | (line_df["line_num"] <= el))
    )
    return line_df.loc[mask].copy()


def passage_bbox_by_page(passage_df):
    """One wrapping box per page: min of the top-left, max of the bottom-right."""
    return passage_df.groupby("page_num", as_index=False).agg(
        x0=("x0", "min"), y0=("y0", "min"),
        x1=("x1", "max"), y1=("y1", "max"))


def draw_passage_rectangles(pdf_path, bboxes_df, out_path):
    doc = fitz.open(pdf_path)
    for _, r in bboxes_df.iterrows():
        page = doc[int(r["page_num"]) - 1]  # page_num is 1-based
        page.add_rect_annot(fitz.Rect(r["x0"], r["y0"], r["x1"], r["y1"]))
    doc.save(out_path)

*Attention paper page 6 with cited paragraph highlighted, next to question and answer – Image by author*

The passage really is where the answer comes from. The red box wraps the Positional Encoding paragraph: the sentence that introduces the choice (“we use sine and cosine functions of different frequencies”) and the two-line formula directly below it. The reader can move from the chat answer to the citation to the source paragraph without leaving the same screen. That’s the whole point.

Why a box around the whole paragraph and not the exact words? Because we worked at the line granularity: line_df carries one bounding box per text line, the LLM cites a (start_line, end_line) span, and passage_bbox_by_page collapses every line in that span into one wrapping rectangle. If you want to draw the box around the exact terms sin(pos / 10000^(2i/d_model)) instead of the whole paragraph, the approach is the same. Just change the granularity. Replace line_df with a word-level word_df (PyMuPDF’s page.get_text("words") gives you a bounding box per word), make the schema cite (start_word, end_word), and passage_bbox_by_page already does the right thing. Same four-brick pipeline, finer scope.
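The union step is independent of granularity, which a small sketch makes visible. It assumes word boxes given as (page_num, x0, y0, x1, y1) tuples with invented coordinates; in PyMuPDF, page.get_text("words") returns tuples that start with x0, y0, x1, y1 and the word, so a real word_df would be built from those. The helper name union_boxes_by_page is hypothetical.

```python
def union_boxes_by_page(boxes):
    """Collapse (page_num, x0, y0, x1, y1) boxes into one wrapping box per page."""
    merged = {}
    for page_num, x0, y0, x1, y1 in boxes:
        if page_num not in merged:
            merged[page_num] = [x0, y0, x1, y1]
        else:
            m = merged[page_num]
            # Grow the box: smallest top-left corner, largest bottom-right.
            m[0], m[1] = min(m[0], x0), min(m[1], y0)
            m[2], m[3] = max(m[2], x1), max(m[3], y1)
    return {page: tuple(box) for page, box in merged.items()}


# Three cited words on page 6, two on one line and one on the next (made up).
word_boxes = [
    (6, 100.0, 200.0, 130.0, 212.0),
    (6, 134.0, 200.0, 180.0, 212.0),
    (6, 100.0, 214.0, 150.0, 226.0),
]
wrapped = union_boxes_by_page(word_boxes)
print(wrapped)
```

This is the same min/max aggregation as passage_bbox_by_page; only the rows fed into it change from lines to words.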

3. Chaining the bricks, and testing the pipeline

3.1 The whole pipeline as one function

The bricks chain into a single call. Feed in a PDF and a question; get back a typed answer with line citations, and optionally an annotated PDF.

In: a PDF path and a text question (plus an optional top_k and an optional output PDF path).

Out: an AnswerWithEvidence, and (if annotate_pdf is given) an annotated PDF on disk.

Inside, pdf_qa_baseline chains document parsing → question parsing → retrieval → generation → PDF annotation. What crosses the retrieval → generation boundary is just the page numbers; the filtered line_df is rebuilt inside generation.

def pdf_qa_baseline(
    pdf_path: str,
    query: str,
    top_k: int = 3,
    annotate_pdf: str | None = None,
):
    # 1. Parsing
    line_df = fitz_pdf_to_line_df(pdf_path)

    # 2. Retrieval
    page_df = embed_page_df(build_page_df(line_df))
    _, filtered = retrieve_pages(page_df, line_df, query, top_k)

    # 3. Generation
    reply = llm_answer_with_evidence(query, filtered)

    # 4. Optional highlighting on the source PDF
    if annotate_pdf is not None:
        passage = passage_lines_df_from_answer(line_df, reply)
        bboxes = passage_bbox_by_page(passage)
        draw_passage_rectangles(pdf_path, bboxes, annotate_pdf)

    return reply

{
  "reply": "The options mentioned for positional encoding are learned and fixed positional encodings, specifically sinusoidal positional encodings (using sine and cosine functions of different frequencies) and learned positional embeddings.",
  "start_page_num": 6,
  "start_line_num": 31,
  "end_page_num": 6,
  "end_line_num": 41,
  "confidence": 0.99,
  "justification": "Lines 31-41 discuss the choices for positional encodings, stating that there are many choices including learned and fixed encodings. It then explains the use of sine and cosine functions (sinusoidal encoding) and notes that learned positional embeddings were also experimented with.",
  "quotes": [
    "There are many choices of positional encodings, learned and fixed [9].",
    "In this work, we use sine and cosine functions of different frequencies: ...",
    "We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E))."
  ],
  "caveats": [],
  "complete_answer_found": true,
  "context_structured": true,
  "llm_discovered_keywords": [
    "positional encodings",
    "learned",
    "fixed",
    "sinusoidal",
    "sine and cosine functions",
    "learned positional embeddings"
  ]
}

This is the API of the article. Later articles build a sister function ask_corpus(query, corpus, ...) for archive-scale work: same contract (typed answer with citations), different scope (filter the corpus first, then run document-level work on the matching documents).

3.2 Try it on a different document

Drop in any PDF you have around: a paper from your own field, a contract, a report from work. Here we pick the World Bank’s April 2026 Commodity Markets Outlook (World Bank publication, April 2026 issue; CC BY 3.0 IGO, as declared on the World Bank Open Knowledge Repository publication page for this issue): a 69-page report on energy, agriculture, and fertilizer markets, far from a research paper in tone and structure.

Same four bricks, same default prompts, same retrieve_pages, same schema. Nothing about the pipeline changes for a new document.

We start with a question whose answer lives deep in the report, in the metals chapter rather than the Executive Summary: the outlook for aluminum prices in 2026.

We call pdf_qa_baseline end-to-end: pass the CMO PDF, the aluminum question, top_k=3, and an annotate_pdf path so the pipeline also writes the highlighted source. The returned answer_cmo_al has the same AnswerWithEvidence shape we saw on the Attention paper.

{
  "reply": "Aluminum prices are projected to rise by about 22 percent in 2026 (y/y) to reach an all-time high—about 21 percent higher than their January 2026 projections—supported by tight supply conditions and solid demand growth. Prices are expected to decline by about 6 percent in 2027 as supply conditions gradually ease.",
  "start_page_num": 45,
  "start_line_num": 32,
  "end_page_num": 45,
  "end_line_num": 43,
  "confidence": 0.98,
  "justification": "The selected span explicitly provides the projected percentage increase for aluminum prices in 2026, the context for these movements, and the outlook for 2027. It also mentions the record-high level forecast and factors driving the price.",
  "quotes": [
    "Aluminum prices are projected to rise by about 22 percent in 2026 (y/y) to reach an all-time high—about 21 percent higher than their January 2026 projections—supported by tight supply conditions and solid demand growth (table 1).",
    "Prices are expected to decline by about 6 percent in 2027 as supply conditions gradually ease."
  ],
  "caveats": [],
  "complete_answer_found": true,
  "context_structured": true,
  "llm_discovered_keywords": [
    "all-time high",
    "tight supply conditions",
    "solid demand growth"
  ]
}

The composite view places the highlighted source page next to the question and the answer, so the citation can be checked at a glance:

A harder question on the same report. What if we ask about something the report mentions only in passing? We try the AI-related electricity demand question, whose answer the World Bank developed only in an “Upside risk” sidebar on page 31.

Same call shape, harder question: pdf_qa_baseline(pdf_path=pdf_path_cmo, query=question_cmo_ai, top_k=3, ...). The pipeline must decide whether the retrieved pages actually carry the AI-electricity figure or whether to flag the answer as not found.

{
  "reply": "The provided lines mention that faster-than-anticipated expansion of AI-related data centers could boost demand for certain metals like aluminum and copper, but do not quantify the contribution of AI-related data centers to global electricity demand growth.",
  "start_page_num": 47,
  "start_line_num": 39,
  "end_page_num": 47,
  "end_line_num": 40,
  "confidence": 0.8,
  "justification": "The only mention of AI-related data centers is in relation to demand for metals, not electricity demand. There is no quantitative estimate or percentage given for their impact on global electricity demand growth.",
  "quotes": [
    "Also, faster-than-antici-\npated expansion of AI-related data centers could \nboost demand for aluminum and copper, driving \nprices higher."
  ],
  "caveats": [
    "No specific figures or direct statements about global electricity demand growth caused by AI-related data centers were found in the provided lines."
  ],
  "complete_answer_found": false,
  "context_structured": true,
  "llm_discovered_keywords": [
    "AI-related data centers",
    "electricity demand growth",
    "boost demand for aluminum and copper"
  ]
}

*CMO page 47, null-path response: the schema refused to fabricate when the answer wasn’t there – Image by author*

But how can we be sure the answer really doesn’t exist in the document? Strictly, we can’t, at least not from this null path alone. What the schema says is “the LLM did not find the answer in the lines it was shown”, which is a different claim from “the answer is not in the document”. The Upside-risk sidebar on page 31 of the same CMO report does quantify the figure (the World Bank cites the IEA’s 8% projection of global electricity demand growth from 2024 to 2030). The default keyword pipeline pulled page 47 and nearby pages instead, where the report’s prose discusses AI’s effect on metal demand. Proving absence would require either running the LLM on every page, or a retrieval strategy that surfaces sidebar text and brief reference mentions. That’s exactly what Article 7 (Retrieval) develops; for the minimal version, “I didn’t find it in the top three pages” is what we report.
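The first option, running the LLM on every page, can be sketched as a plain loop over page windows. This is a hedged illustration, not part of the article's pipeline: scan_all_pages and page_windows are hypothetical names, and ask_window stands in for a wrapper around generation that returns True when complete_answer_found is true for the pages it was given.

```python
def page_windows(page_nums, window=3):
    """Yield consecutive, non-overlapping groups of page numbers."""
    pages = sorted(page_nums)
    for i in range(0, len(pages), window):
        yield pages[i:i + window]


def scan_all_pages(page_nums, ask_window, window=3):
    """Ask the question on every window; return the first window that answers it.

    Returns None only after every window has been tried, which is the
    strongest "not found" this approach can give.
    """
    for pages in page_windows(page_nums, window):
        if ask_window(pages):
            return pages
    return None


# Stub standing in for the LLM call: pretend only page 31 holds the answer.
def fake_ask(pages):
    return 31 in pages


found_in = scan_all_pages(range(1, 70), fake_ask, window=3)
print(found_in)
```

On a 69-page report with a window of 3, a full scan costs up to 23 generation calls per question, which is why this is a fallback and not a default. Non-overlapping windows can also split an answer across a boundary, so even a full scan is evidence of absence, not proof.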

3.3 More questions in one table

A small batch of four questions on the same two documents, all results in one table. Read the table for patterns, not for every cell.

Numeric value: learning rate of the base Transformer. Specific number, expected page 7 (section 5.3 on Adam optimizer).
No answer in document: chemical composition of seawater. The schema’s null path should fire; both retrievers will pull random-looking pages.
Different topic on CMO: outlook for urea prices. Same pipeline on the fertilizer section of the World Bank report, far from the AI sidebar.
Compound question: d_k and d_v in the Transformer. Two values asked at once. Also tests the table-parsing limit (the values live in Table 1 page 4, parsed as flat lines).

def run_pipeline_test(
    query: str,
    line_df_in: pd.DataFrame,
    page_df_in: pd.DataFrame,
    page_df_emb_in: pd.DataFrame,
    top_k: int = 3,
    client=client,
) -> dict:
    """Run both retrievers + generation on one question; return a summary dict."""
    parsed_q = get_keywords_from_question(query, client=client)
    retrieved_emb_df, _ = retrieve_pages_by_similarity(
        page_df_emb_in, line_df_in, query, top_k=top_k, client=client,
    )
    retrieved_kw_df, filtered_lines_kw = retrieve_pages(
        page_df_in, line_df_in, parsed_q.keywords, top_k=top_k,
    )
    # If keyword retrieval finds nothing, fall back to the whole document so generation
    # still runs (small PDFs only: would not scale to a real corpus).
    lines_for_generation = (
        filtered_lines_kw if len(filtered_lines_kw) > 0 else line_df_in
    )
    reply = llm_answer_with_evidence(
        query, lines_for_generation, client=client,
    )
    return {
        "query": query,
        "keywords": parsed_q.keywords,
        "emb_top3": retrieved_emb_df["page_num"].tolist(),
        "kw_top3": (
            retrieved_kw_df["page_num"].tolist()
            if len(retrieved_kw_df) > 0 else "(no kw match)"
        ),
        "answer_excerpt": (reply.reply[:80] + ("..." if len(reply.reply) > 80 else "")),
        "cite_page": reply.start_page_num,
    }

*Same pipeline on four questions: two succeed, one refuses cleanly, one trips on table parsing – Image by author*

Read the table left-to-right per row. Four patterns to take away:

Keywords beat embeddings on the learning rate row. The base Transformer’s training schedule is on page 7 (section 5.3, Optimizer). Embeddings rank pages 8/9/10; page 7 is not in the top three. The keyword retriever finds page 7 directly via the literal phrase learning rate. Same lesson as the epsilon row in section 2.3.c: when the question depends on a precise term the document prints verbatim, keywords are the better tool.
Both retrievers fail on the seawater row, and the failure is visible. The PDF has nothing to say about seawater. The keyword column shows (no kw match) outright, with no false ‘top-3 pages’ that look plausible. The schema then returns a null answer with a caveat. A clean ‘I don’t know’ is the system’s most valuable behavior on out-of-scope questions.
Both retrievers work on the urea row. The CMO has a fertilizer section; embeddings and keywords both bring back page 42, and generation cites it correctly. Cross-domain pipelines work as long as the question’s vocabulary lands on the document.
The d_k and d_v compound row exposes the table-parsing limit. The two values live in Table 1, page 4 of the Transformer paper, where each row lists d_model, h, d_k, d_v, etc. Our parser flattened the table into plain lines, so a model that is asked for two cells side by side has to reassemble the row from text alone. Keywords retrieve page 4 (the literal term d_k appears there), but the citation often points to one value while the other is paraphrased. The fix is structural: parse tables as tables, not as lines. That’s Article 5 (parsing) and Article 6 (compound-question decomposition) doing their job.

4. The questions every block raises

What this minimal system does well:

A real, verifiable answer. A structured object with the answer, the page, the lines, the quote. The user can check the citation in seconds.
“Not found” handled cleanly. When the answer isn’t in the retrieved lines, the schema allows null fields and the caveats field says why. No fabrication.
The answer linked to the source. The highlighted PDF closes the loop between the LLM’s claim and the document. This is what separates a useful RAG system from a chatbot that happens to read documents.
Easy to follow. Each function does one thing. No hidden state, no framework magic. When something goes wrong, debugging is reading the code.

Now look at the same system again. Each block hides assumptions worth questioning.

4.1 Document parsing: we just read lines

We extracted text line by line. That’s reasonable for an academic paper, but look at what we threw away: section structure, headings, table layouts, figures, footnotes, cross-references. Page 4 of this paper contains Table 1 with the per-layer complexities. We parsed each of its rows as plain lines, losing the table structure entirely. Page 9 contains Table 3, the ablation study. Same problem.

For a question like “What are the options for positional encoding?” this doesn’t matter. The answer is in continuous prose. For a question like “What is the per-layer complexity of self-attention?” it suddenly does, because the answer lives in a table cell that our parser flattened into noise.

That’s the topic of Article 5: Parsing. Documents have structure. Ignoring it is the single biggest source of downstream failure.

4.2 Question parsing: we asked for keywords, but only keywords

Our question-parsing step extracts a flat list of keywords. That works on a clean question against an academic paper. It starts to break down as soon as questions get harder.

Three things this minimal version doesn’t do.

It doesn’t detect intent. “Summarize chapter 3”, “Translate this clause into French”, “Compare X and Y” each call for a different downstream pipeline. A single keywords field can’t carry that signal.

It doesn’t decompose compound questions. “What are the exclusions and the deductible?” parsed as a flat keyword list pollutes the retrieval (the keywords for “exclusions” and “deductible” pull in two different scopes that interfere). Article 6 walks through how to detect compound questions, decide whether to decompose, and route the sub-questions independently.

It doesn’t detect an expected answer shape. “What’s the premium amount?” wants a number with a currency. “What are the obligations?” wants a list. “Compare the two policies” wants a table. The minimal version treats every answer as free text. Article 6 introduces the expected_answer_shape field that drives the generation template downstream.

That’s the topic of Article 6: Question Parsing. The same brick, much richer JSON.

4.3 Chunking: we aggregated by page

We chose pages as the unit of retrieval. Why pages? Why not paragraphs, or sections, or fixed-size chunks of 512 tokens like every standard RAG tutorial recommends?

The answer is that page-level aggregation happens to work for this paper because pages roughly align with semantic units. On a contract, on a legal text, on a technical manual with numbered clauses, pages are arbitrary cuts and you’d want clause-level or section-level chunks instead. The “right” chunking depends on the document and the question, not on a default value.
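For a document with numbered clauses, a structural chunker can be very small. The following is a rough sketch under stated assumptions, not the series' implementation: chunk_by_clause and the CLAUSE_START pattern are hypothetical, the sample lines are invented, and the heuristic (a number, optional dot, then a capitalized heading) will miss or misread many real layouts.

```python
import re

# A clause starts with a number like "4", "4.1" or "4.1.", then a capitalized word.
CLAUSE_START = re.compile(r"^\s*(\d+(?:\.\d+)*)[.)]?\s+[A-Z]")


def chunk_by_clause(lines):
    """Group consecutive text lines into chunks that each start at a clause heading."""
    chunks = []
    current = None
    for line in lines:
        match = CLAUSE_START.match(line)
        if match:
            # A new numbered heading opens a new chunk.
            current = {"clause": match.group(1), "lines": [line]}
            chunks.append(current)
        elif current is not None:
            current["lines"].append(line)
        else:
            # Text before the first numbered clause (title, preamble).
            current = {"clause": None, "lines": [line]}
            chunks.append(current)
    return chunks


# Invented contract lines for illustration.
lines = [
    "POLICY TERMS",
    "4. Coverage",
    "The insurer covers water damage reported within 30 days.",
    "4.1 Exclusions",
    "Flood damage is excluded.",
    "5. Deductible",
    "The deductible is stated in the schedule.",
]
chunks = chunk_by_clause(lines)
print([c["clause"] for c in chunks])
```

The point of the sketch is the unit, not the regex: the chunk boundary comes from the document's own numbering, so there is no chunk size to tune.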

The temptation, when a fixed-size approach starts failing, is to grid-search over chunk sizes and overlaps. That’s the machine learning reflex. It’s the wrong frame for what is actually a structural decision. Article 3: RAG Is Not Machine Learning, and the Six-Month Mistake of Treating It Like One makes that case in full.

4.4 Retrieval: keyword matching is transparent, but blind to vocabulary

Our retrieval just worked. Page 6 came back with the matched keyword, ahead of the rest, and the Positional Encoding section is on page 6. Anyone can look at the match table and see why. That’s the trade we made: the simplest possible retrieval, completely auditable.

The trade has a cost. Keyword matching is blind whenever the question’s vocabulary doesn’t match the document’s. Three failure modes show up immediately on the same paper.

Symbol vs word. Ask “What is the value of epsilon used in label smoothing?” The keywords from question parsing are likely something like ["epsilon", "label smoothing"]. The actual answer (ε_ls = 0.1) sits on page 8, but the document writes it as the Greek letter ε, never the English word “epsilon”. The substring check returns zero on the symbol-only page; only the literal phrase label smoothing lands on page 8.

Synonym mismatch. Ask “How does the model know the order of words in a sentence?” The keywords might be ["word order", "sentence order"]. The document calls this positional encoding. None of the question’s keywords appear on page 6. The retriever picks pages that happen to mention “order” or “sentence” in passing, none of which contain the answer.

Paraphrase. Ask “What attention mechanism does the encoder use?” The document says self-attention and Multi-Head Attention, never the phrase “attention mechanism the encoder uses”. The keywords pulled from the question, even after expansion, may or may not include the document’s exact phrasing. When they do, retrieval works. When they don’t, it silently degrades.
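The first two failure modes are easy to reproduce with a toy substring matcher. This is a stand-in written for illustration, assuming retrieval counts case-insensitive substring hits per page as described above; keyword_hits and top_pages are hypothetical names, and the page texts are invented, not quoted from the paper.

```python
def keyword_hits(pages, keywords):
    """Count, per page, how many keywords occur as case-insensitive substrings."""
    hits = {}
    for page_num, text in pages.items():
        lowered = text.lower()
        hits[page_num] = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits


def top_pages(pages, keywords, top_k=3):
    """Return up to top_k page numbers with at least one hit, best first."""
    hits = keyword_hits(pages, keywords)
    matched = [p for p, n in hits.items() if n > 0]
    return sorted(matched, key=lambda p: (-hits[p], p))[:top_k]


# Invented page texts for illustration.
pages = {
    6: "Positional encoding injects information about token position.",
    7: "The sentence length varies; training data is batched in order of length.",
    8: "Label smoothing of value ε_ls = 0.1 is applied during training.",
}

# Symbol vs word: "epsilon" never matches the Greek letter.
print(keyword_hits(pages, ["epsilon"]))
print(top_pages(pages, ["epsilon", "label smoothing"]))

# Synonym mismatch: the page that answers the question (6) is never returned.
print(top_pages(pages, ["order", "sentence"]))
```

In the first case page 8 is still found, but only through label smoothing. In the second, the matcher confidently returns a page that mentions the words and does not contain the answer.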

The first two failures are so common that the rest of the series spends two articles on them.

Article 6: Question Parsing turns the keyword extraction into a much richer step that pulls from a domain glossary, expands synonyms, and includes likely document phrasings rather than the question’s literal words.
Article 2: Embeddings introduces vector representations that match across surface vocabulary: where embeddings shine (synonyms, paraphrase, misspellings, cross-lingual matching), where they quietly fail (negation, exact values, internal acronyms, polysemic words), and how to combine them with keyword matching for the best of both worlds.
Articles 7 and 9 put the resulting hybrid retrieval into a real document index.

The right answer is to combine, not pick a winner. The two methods fail on almost opposite cases: embeddings stumble when the question depends on a precise symbol, named term, or exact value; keywords stumble when the asker’s vocabulary doesn’t literally appear in the document. Running both retrievers, taking the union of their candidates, and (optionally) re-ranking with a cross-encoder is the standard hybrid recipe. Article 2 develops it; Articles 7 and 9 wire it into a corpus.
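The union step of that recipe fits in a few lines. The sketch below is one possible way to merge two ranked page lists, not the method the later articles use: hybrid_candidates is a hypothetical name, the interleaving order is an arbitrary choice, and the optional cross-encoder re-ranking is left out entirely.

```python
def hybrid_candidates(keyword_pages, embedding_pages, top_k=5):
    """Union of two ranked page lists, interleaved rank by rank, without duplicates."""
    merged = []
    seen = set()
    depth = max(len(keyword_pages), len(embedding_pages))
    for rank in range(depth):
        # Take the rank-th page of each retriever in turn, keywords first.
        for ranked in (keyword_pages, embedding_pages):
            if rank < len(ranked) and ranked[rank] not in seen:
                seen.add(ranked[rank])
                merged.append(ranked[rank])
    return merged[:top_k]


# Learning-rate row from the table: keywords find page 7, embeddings rank 8/9/10.
print(hybrid_candidates([7], [8, 9, 10]))

# Both retrievers agree on the first page: it appears once.
print(hybrid_candidates([42, 41], [42, 43], top_k=3))
```

With the union, page 7 reaches generation even though the embedding retriever missed it. The price is a larger context, which is where a re-ranking step would earn its place.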

The minimal version stays single-retriever because it teaches the right reflex first: the retriever must be auditable. Keyword matching makes that reflex concrete (you can see exactly which words landed on which page). Once that reflex is in place, embeddings become a controlled addition rather than an opaque default, and combining the two becomes a deliberate engineering choice rather than a trend.

4.5 Generation: we asked for sources, and we got them

This is the block that worked best, almost too easily. We defined a Pydantic schema with start_page_num, start_line_num, end_page_num, end_line_num, confidence, justification, quotes, and caveats, and the model filled it in correctly.

How much more can we ask? A structured comparison for comparative questions, a list of conflicts if the document contradicts itself, multiple citations from multiple parts of the document, a confidence breakdown per claim. Yes to all of the above. The generation step is far more controllable than most teams realize. Article 8: Generation as Controlled Execution explores this in depth.

5. The shape of what comes next

This minimal pipeline is the spine of everything that follows. Each part of the series goes deep on one of the questions raised above.

The mistakes that kill most projects come from getting the wrong picture of one of these blocks: RAG isn’t ML (Article 3), embeddings aren’t magic (Article 2), not all RAG problems look the same (Article 4). That’s Part I.

Each brick then gets its own deep dive: document parsing, question parsing, retrieval, generation. That’s Part II, the four bricks.

Once the blocks are solid, we recombine them for cases that look like production: long documents, justification and absence handling, table-of-contents-driven retrieval, listing questions, structured extraction, the composite pipeline. That’s Part III.

Then we change scale. From one document to many. From a single paper to an archive of hundreds or thousands of documents. The architecture changes significantly. That’s Part IV.

Finally, what it takes to operate the system in production: evaluation, cost and monitoring, security and compliance, the architecture of the codebase itself. That’s Part V.

The blocks don’t change. Their internals do.

A couple of framing notes:

The four bricks (Part II) are the conceptual core. Much of the rest of the series is about doing each one better. Part III and Part IV are recombinations: the same four ideas at different scales and for different question types.
The series scope is enterprise documents. Contracts, technical specifications, regulatory filings, internal procedures: all carry structure (TOC, sections, tables) and bounded vocabulary (industry jargon, expert terms). RAG works on these corpora because of that structure, not heroic embedding tricks. Documents with no structure (novels, long unstructured transcripts) and questions that require intent rather than locating a passage are out of scope; Article 4 returns to where the line falls.
Code is illustrative, not production-ready. What you’ve read works on a real PDF, but lacks the error handling, validation, caching, cost controls, monitoring, and security a production system needs. Each gets its own article.

Here’s the explicit map from this minimal system to the rest of the series:

PDF parsing throws away structure → Article 5, Article 10
Question parsing needs more than keywords (intent, decomposition, expected answer shape) → Article 6
Chunking strategy isn’t a hyperparameter → Article 3
Question vocabulary doesn’t match document terms → Article 2, Article 6
Retrieval picks the wrong page → Article 7, Article 9
Model paraphrases its citation → Article 8, Article 21
“Not found” needs nuance → Article 4
Compound, listing, comparison, summarization questions → Article 6, Articles 11-13
Multi-document corpus → Part IV (Articles 15-20)
Production, evaluation, security, architecture → Part V (Articles 21-25)