Context Engineering for RAG : The 4 Typed Inputs Behind Each RAG Reply

companion to Enterprise Doc Intelligence, a sequence whose stance is that enterprise RAG amplifies the professional, it doesn’t substitute them. The structure follows from that: 4 bricks (doc parsing, query parsing, retrieval, technology), every emitting typed items that converge on one LLM name. The trade now calls that apply context engineering. Scope right here is the single-document case; corpus, dialog, and tool-call extensions are follow-up work.

*the place this text sits within the sequence: Article 7bis (context engineering), the reframing companion to the 4 bricks – Picture by writer*

📓 Runnable notebooks are on GitHub: doc-intel/notebooks-vol1.

*The general public companion-code repo at doc-intel/notebooks-vol1 – Picture by writer*

By the point the 4 bricks of a single-document RAG are constructed, the meeting is settled. Parsing produces relational tables. Query parsing produces a typed ParsedQuestion. Retrieval produces a filtered subset of traces, plus an audit of the way it picked them. Era produces a Pydantic reply with cited proof. The entire thing converges on one LLM name, with a hard and fast system immediate and a person content material assembled from upstream items.

That pipeline has a reputation now. In June 2025 Tobi Lütke tweeted that “immediate engineering” was the flawed body, and proposed “context engineering” as an alternative: “the artwork of offering all of the context for the duty to be plausibly solvable by the LLM.” Andrej Karpathy endorsed it every week later as “the fragile artwork and science of filling the context window with simply the best info for the subsequent step.” Inside months the time period was on the duvet of an O’Reilly ebook and structured right into a taxonomy by LangChain.

What follows reads the single-document RAG pipeline by means of that lens. Every brick emits typed items; the meeting stage threads them into the LLM name; the system immediate stays mounted for caching. Naming the apply doesn’t change the structure. It modifications what to name it when an auditor asks how the system works, and it tells the reader that the structure is the one manufacturing groups converged on in 2025.

1. The identify, and what it covers

Immediate engineering used to imply two associated issues. Tuning the wording of 1 immediate to coax higher behaviour, and writing instance pictures so the mannequin knew what good output appeared like. Each are slender. They concern one block of textual content despatched to 1 name.

Context engineering covers every little thing that lands within the mannequin’s context window for one name:

The system immediate (the position, the foundations, the examples).
The retrieved paperwork or rows.
Dialog historical past when there’s one.
Software definitions and their outputs.
Reminiscence, scratchpads, agent state.
Structured metadata in regards to the doc, the corpus, the undertaking.
The precise person enter.

In a long-running agent that calls the mannequin dozens of instances, the immediate is one in all six or eight slots. The remainder comes from someplace upstream: a retriever, a instrument, a reminiscence retailer, a profile lookup. The self-discipline shifts from “what ought to I write within the immediate” to “what ought to I assemble within the context, the place does each bit come from, and the way do I preserve the meeting steady throughout calls.”

That’s engineering work. It seems to be like software program structure: typed objects, contracts between parts, audit trails, caching. The 2025 time period is overdue, as a result of the apply was already there within the working manufacturing techniques. Lütke and Karpathy named what groups had been already doing.

The sequence occurs to have finished it from the beginning, brick by brick. The following sections stroll by means of what every brick contributes to a single-document RAG payload, then by means of the 4 typed items that land within the LLM name and the code that produces every one. The corpus, dialog, and tool-call circumstances come up on the finish as out-of-scope work, with tips that could the place within the sequence they are going to be addressed.

*Seven typed bricks feeding the LLM’s context window, grouped by supply: query, paperwork, infrastructure. – Picture by writer*

2. Each brick emits typed context

The 4 bricks emit typed context channels that converge on the meeting band on prime, the place PromptContext, the mounted system immediate, and the person template mix earlier than the LLM name. – Picture by writer

The schema above is the recap of what the sequence shipped. Every brick is a typed-context emitter. The names on the containers are the precise fields of the particular Pydantic courses and DataFrames the code produces.

Parsing emits relational tables and one synthesis dict. line_df carries one row per line with bbox. page_df carries one row per web page with kind and column depend. toc_df carries the table-of-contents entries with begin web page and depth. image_df carries embedded photos with phash and metadata. parsing_summary is the doc-level synthesis: doc_type, n_pages, typical_fields, abstract, plus the mechanics fields. The retrieval brick consumes the per-row tables. The query parsing brick consumes the semantic subset of parsing_summary by way of DocContext.

Query parsing emits a ParsedQuestion. Its fields should not free-form. key phrases is a brief record of content material noun phrases for retrieval. intent is a literal label from a hard and fast enum that drives form dispatch in technology. structural_hints.pages_hint carries pinned pages when the person mentioned “on web page 3”. answer_shape carries the anticipated output form (textual content, quantity, date, record, desk, tackle) for the technology schema lookup. Every subject is consumed by a unique downstream brick. None of them are handed as uncooked strings to the LLM.

Retrieval emits a filtered DataFrame and an audit dict. filtered_line_df is the subset of line_df the technology brick sees. anchor_pages is the web page IDs that had been stored and why. The retrieval_audit carries the strategy that gained (key phrase, TOC, LLM arbiter), the LLM TOC reasoning when relevant, and the chosen sections. The filtered body is what the LLM reads. The audit is what an auditor reads.

Era is a client, not an emitter. It takes the query, the filtered traces, the PromptContext, and the reply schema. It calls the LLM. It returns a Pydantic typed reply. The dashed border on the Era field alerts that position.

The violet “PROMPT ASSEMBLY” zone on the best is the place context engineering occurs as code. The sequence implements it by way of three primitives:

A PromptContext(BaseModel) aggregator with one subject per upstream context supply: doc_context, future corpus_context, future project_context.
A hard and fast MODULE_SYSTEM_PROMPT on the module degree for every brick that calls the LLM.
A MODULE_USER_TEMPLATE with named placeholders the brick fills by way of str.format(...).

Article 1 (the minimal four-brick RAG) launched the bricks as a circulate. Article 6A (the query parsing thesis) made the query parser typed. Article 8A (the typed technology contract) makes the technology schema typed. This text reads the identical 4 bricks by means of the lens of “what context does every one contribute, how do they attain the LLM name with out polluting one another.” Similar code, completely different lens.

3. The 4 typed items of a single-document payload

What lands within the LLM name for a single-document RAG is 4 items, every produced by a unique piece of code, every with a unique cost-and-cache profile. This part walks the 4 within the order they seem within the person content material the LLM reads.

3.1 The mounted system immediate

The primary piece is the system message. The position description, the foundations, the examples. It doesn’t change throughout calls. The sequence writes it as a Python fixed on the module degree, then exposes it as a kwarg with a default so a caller can override per area with out forking:

PARSE_QUESTION_SYSTEM_PROMPT = (
    "You extract content material noun phrases from the person's query..."
)

def parse_question(query, *,
                   system_prompt: str = PARSE_QUESTION_SYSTEM_PROMPT,
                   user_template: str = PARSE_QUESTION_USER_TEMPLATE,
                   context: PromptContext | None = None):
    ...

Two operational penalties. The immediate is cacheable by the LLM supplier, as a result of it doesn’t change throughout calls on the identical mannequin. Cached enter prices roughly ten instances lower than recent enter on the suppliers that publish a tariff. And the immediate is auditable, as a result of it lives at a steady Python image an auditor can grep, model, and diff between releases.

3.2 The retrieved traces, filtered by the dispatcher

The second piece is the traces the LLM really reads. The dispatcher consumes ParsedQuestion.key phrases and structural_hints, picks a way (key phrase, TOC, LLM arbiter), and returns the filtered body plus the audit. The person content material will get the filtered body; the audit lives on disk for the operator to examine later:

retrieved, filtered_line_df, audit = dispatch_page_retrieval(
    query, line_df, page_df,
    toc_df=toc_df, key phrases=key phrases,
    top_k=5, use_toc=True,
)

What ships to the LLM in person content material is the filtered body, not the entire doc. A 200-page contract turns into ten pages of related traces. The person content material stays beneath a couple of thousand tokens. The audit explains why every web page made it in, so a caller can problem the choice with out re-running the decision.

3.3 The doc-context block, compact JSON

The third piece is the doc-level synthesis: doc kind, web page depend, typical fields, abstract. It lands within the person content material as a compact JSON object so the LLM can scope ambiguous wording in opposition to the doc’s nature. The sequence implements it as a way on each context-carrying Pydantic class. DocContext.as_prompt_json() builds the smallest JSON that also names the 4 fields; null and empty values are dropped:

class DocContext(BaseModel):
    doc_type: str | None = None
    n_pages: int | None = None
    typical_fields: record[str] = []
    abstract: str | None = None

    def as_prompt_json(self) -> str:
        payload = {ok: v for ok, v in self.model_dump().objects()
                   if v will not be None and v != []}
        return json.dumps(payload, separators=(",", ":"))

Measured on a CV with doc_type="resume", n_pages=1, and 4 typical fields, the payload is beneath 200 characters. On an unknown doc the place each subject is null or empty, the payload is the empty object {} and the bloc is omitted totally from the person content material. The identical sample applies to the reserved corpus-context and project-context slots when later articles activate them.

3.4 The `PromptContext` aggregator that wraps the three above

The fourth piece is the aggregator. Every LLM-calling brick takes one optionally available context: PromptContext kwarg. The aggregator carries the doc-context in its personal typed slot immediately, with reserved slots for the corpus-context and project-context the follow-up articles will activate. The helper render_context_block(context) walks the non-null fields and emits one labelled JSON bloc per layer on the head of the person content material:

class PromptContext(BaseModel):
    doc_context:     DocContext | None = None
    # corpus_context:  CorpusContext  | None = None  # reserved
    # project_context: ProjectContext | None = None  # reserved

Every LLM brick takes one optionally available context: PromptContext kwarg. The helper render_context_block(context) walks every non-null subject, renders its compact JSON, and emits one labelled bloc per layer. Including a brand new layer means uncommenting one subject, including two traces within the helper, and each brick picks the brand new layer free of charge. The signature is steady throughout releases.

4. What modifications in apply

Naming the apply modifications three operational issues, even with the code unchanged.

Audit. When the reply is flawed, the query is now not “what did the immediate say.” The query is “what landed within the context window for that decision.” The sequence persists each brick output to disk: parsing/, questions//parsed_question.json, retrieval//retrieved_pages.parquet, retrieval//retrieval_audit.json. The auditor reconstructs the context payload from these information. Then the query turns into particular: was the doc_context flawed, had been the flawed pages chosen, did the system immediate drift between releases, was the person template stale. Every of these has a unique repair.

Value. Two levers compound. The system immediate is mounted throughout calls on the identical mannequin, so it pays cached-input tariff. The person content material has been compressed by way of as_prompt_json and chosen by way of retrieval, so the variable half is small. On a corpus of 100 paperwork with 10 questions every, the dominant value is the variable half instances 1000 calls. Naming the apply doesn’t change the mathematics, however it makes the price range for every name legible: each line within the context payload has a generator that somebody can level at.

Composition throughout follow-up work. The PromptContext aggregator has one subject activated immediately, with two extra reserved for the corpus-context and project-context layers a later piece of the sequence provides. When these land, this text doesn’t want a rewrite. The signature stays. The physique of render_context_block grows by one department. Each brick that already takes context: PromptContext | None picks up the brand new sub-context free of charge. The self-discipline pays off in deferring breakage throughout releases.

5. Out of scope, with pointers

The one-document case stops right here. Context engineering at giant covers three issues this text doesn’t contact:

Corpus context. When the reply requires studying throughout many paperwork, the LLM wants a way of which paperwork are in scope and what they’ve in widespread. That lives in a future CorpusContext Pydantic, fed by an aggregator over per-document parsing_summary values. The slot is reserved in PromptContext so the brick signatures don’t change. A later article walks the construct and the buyer wiring.
Dialog historical past. Multi-turn chat carries prior query / reply pairs the LLM ought to take into account earlier than answering the brand new query. That could be a state drawback (the place does the historical past reside, when is it summarised, when is it pruned) on prime of a context drawback. A later article within the sequence treats it as a first-class brick.
Software calls. Agent loops deliver instrument definitions, instrument outputs, and intermediate state into the context window. The choice / compression / isolation issues get sharper there as a result of the context window fills up shortly throughout turns. A later article within the sequence treats agentic context engineering as its personal matter.

The 4 canonical methods the LangChain weblog names (write, choose, compress, isolate) had been developed with the agent loop in thoughts. Two of them (write and choose) translate cleanly to the single-document case because the system immediate and the retrieval dispatcher. The opposite two (compress and isolate) apply in spirit however chew more durable as soon as corpus and dialog enter the image, which is why this text doesn’t drive the four-way mapping.

See it reside

A brief reside companion runs within the shipai dashboard. Click on any candidate web page within the audit path, then click on anchor / paragraph / part / web page within the picker above.

*The shipai reside demo: similar anchor, 4 context-scope selections aspect by aspect, the person widens the spotlight to see the tradeoff – Picture by writer*

Similar anchor, 4 context-scope selections aspect by aspect. anchor is one line. paragraph is ±5 traces on the identical web page. part makes use of the TOC to widen to the part physique. web page fills the entire web page. The article’s trade-off (value vs precision) turns into a slider you may really feel on an actual PDF as an alternative of a paragraph of prose.

6. Conclusion

The 2025 trade dialog round context engineering provides a reputation to a self-discipline single-document RAG already practises brick by brick. Parsing emits relational tables and a doc-level synthesis. Query parsing emits a typed ParsedQuestion whose fields every drive a unique downstream brick. Retrieval emits a filtered line set plus an audit. Era consumes the assembled payload by means of a hard and fast system immediate, a templated person content material, and a PromptContext aggregator with one typed slot per upstream layer.

The label is what modifications: an auditor, a hiring supervisor, or a vendor studying the structure can place it contained in the 2025 vocabulary with out additional translation. The bricks, the schemas, and the cost-versus-cache trade-offs are unchanged. The corpus, the dialog, and the tool-call circumstances come up as follow-up work, every with its personal typed slot reserved in the identical aggregator.

Immediate Engineering Fails Quietly — Immediate Regression Is Why

The right way to Select Between Small and Frontier Fashions

7. Sources and additional studying

The 2025 dialog, in chronological order.

Walden Yan, Don’t construct multi-agents, Cognition, June 12 2025. The earliest piece that names the self-discipline. Yan’s declare that “context engineering is successfully the #1 job of engineers constructing AI brokers” is the road Lance Martin later quotes when he introduces the four-strategy taxonomy.
Tobi Lütke, X, June 18 2025. The naming tweet: “I actually just like the time period ‘context engineering’ over immediate engineering. It describes the core ability higher: the artwork of offering all of the context for the duty to be plausibly solvable by the LLM.”
Lance Martin, Context Engineering for Brokers, June 23 2025. The taxonomy paper. Additionally republished on the LangChain weblog beneath the LangChain Crew byline.
Andrej Karpathy, X, June 25 2025. The endorsement: “+1 for ‘context engineering’ over ‘immediate engineering’. Folks affiliate prompts with brief activity descriptions you’d give an LLM in your day-to-day use. In each industrial-strength LLM app, context engineering is the fragile artwork and science of filling the context window with simply the best info for the subsequent step.”
Drew Breunig, Learn how to Repair Your Context, June 26 2025. A parallel taxonomy: six concrete techniques (RAG, Software Loadout, Context Quarantine, Context Pruning, Context Summarization, Context Offloading) for holding the context window wholesome.

The taxonomies, aspect by aspect.

Lance Martin: 4 methods for the agent loop (write, choose, compress, isolate). Single-document RAG interprets the primary two cleanly; the opposite two chew more durable as soon as corpus and dialog enter the image.
Drew Breunig: six techniques (RAG, Software Loadout, Context Quarantine, Pruning, Summarization, Offloading). Extra fine-grained, much less summary. Helpful when the agent loop is already working and the context window is filling up.

The longer therapies.

Counterpoints.

Weaviate, Context Engineering book (23 p, December 2025). The seller framing: six parts (Brokers, Question Augmentation, Retrieval, Prompting Strategies, Reminiscence, Instruments). The sequence’ place on this rebrand, the place the relabelling tracks the product line reasonably than the apply, is roofed in a follow-up critique submit.
Roadie weblog, Why Conflating RAG with Context Engineering Prices You in Manufacturing. The other framing: holding RAG and context engineering distinct, with retrieval as one slot amongst many.

The sequence primitives this text references.

PromptContext aggregator and DocContext projection: src/docintel/core/schemas/.
render_context_block helper: src/docintel/core/prompts.py.
Module-level system prompts and person templates: each LLM-calling module beneath src/docintel/, by conference. Earlier within the sequence:
Amplify the Professional: A Philosophy for Constructing Enterprise RAG. The sequence’ manifesto: the 4 bricks (parsing, query parsing, retrieval, technology) are designed to scale the professional’s judgement, not substitute it.

Half I: What works, what breaks

Baseline Enterprise RAG, from PDF to highlighted reply. The four-brick pipeline finish to finish: PDF in, highlighted reply out.
Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval. The place embedding similarity wins (synonyms, typos, paraphrase), the place it predictably breaks (unknown phrases, negation, term-vs-answer relevance), and how you can use it anyway.
RAG will not be machine studying, and the ML toolkit solves the flawed drawback. Why chunk-size sweeps and finetuning optimize the flawed factor; route by query kind as an alternative.
From regex to imaginative and prescient fashions: which RAG method suits which drawback. Two axes, doc complexity and query management, that choose the method for every case.

Half II: The 4 bricks

Doc parsing

Past extract_text: the 2 layers of a PDF that drive RAG high quality. The primary half of the parsing brick: the doc’s nature, alerts, and abstract.
Cease returning flat textual content from a PDF: the relational tables RAG wants. The second half of the parsing brick: the relational tables each downstream brick reads.
- When PyMuPDF can’t see the desk: parse PDFs for RAG with Azure Format. The identical tables from Azure Format: native desk cells, OCR, paragraph roles.
- Parse PDFs for RAG regionally with Docling: wealthy tables, no cloud add. The identical tables computed regionally with Docling: TableFormer cells, nothing leaves the machine.
- Imaginative and prescient LLMs are PDF parsers too: studying charts and diagrams for RAG. Imaginative and prescient as a parser: the images turn into searchable textual content.
- Parse scanned PDFs for RAG with EasyOCR: free OCR provides you phrases, not a doc. The place conventional OCR stops: textual content recovered, construction misplaced.
- Making a PDF’s photos searchable for RAG, with out paying to learn all of them. The picture cascade: filter low-cost, classify, describe solely what’s price studying.
- Reconstructing the desk of contents a PDF forgot to ship, so RAG can scope by part. Rebuilding toc_df when the PDF prints a contents web page however ships no define.

Query parsing

Parse the query earlier than you search: the lacking step in most RAG pipelines. The thesis of query parsing: why a person string wants the identical parsing as a doc, and the way it splits right into a retrieval temporary and a technology temporary.
5 fields RAG ought to extract from any query: key phrases, scope, form, decomposition, clarification. The 5 households of columns the parser reads straight from the person’s query, with the code that fills every one.
One parsed RAG query, 4 selections: chunk technique, mannequin tier, fragments, audit path. The choices the parser makes on prime of the person string, utilizing the doc’s profile: dispatch, activations, full schema, the audit path (pipeline_trace.json), and a broker-corpus walkthrough.

Retrieval