Stop Returning Text from RAG: The Typed Answer Contract That Prevents Hallucination


This article belongs to the generation brick of Enterprise Document Intelligence, a series that builds an enterprise RAG system from four bricks: document parsing, question parsing, retrieval, and generation. Generation is the fourth and final brick. This is the first of its three parts: the contract, the typed answer schema the model has to fill. The companions cover how the call that fills it is assembled (Article 8B, prompt assembly) and how the answer is checked and looped back into the pipeline (Article 8C, validation).

*Where this article sits in the series: Article 8 (generation), the contract part, inside Part II (the four bricks) – Image by author*

📓 Runnable companion notebooks are on GitHub: doc-intel/notebooks-vol1.

*The public companion-code repo at doc-intel/notebooks-vol1 – Image by author*

1. The model hallucinates; answer from the passages, not from memory

The first three bricks converge here, each covered in its own articles.

The generator’s job is to turn these passages and that question into an answer, and the model will hallucinate on the way. That isn’t a bug to patch. It’s what generative AI does: it predicts the most plausible next token, it doesn’t look anything up. On a topic that saturates its training data the prediction is reliable. On your contract, seen once or never, it predicts a continuation just as fluent, just as confident, and far more likely to be wrong. You can’t train that away. You can only shrink the room for it.

Most of that room is already closed by the time generation runs. Each brick before it hands generation something clean:

Document parsing gives it structured tables, not a garbled text dump.
Question parsing gives it a precise question and a declared answer format, the shape and type to return, not a free string.
Retrieval gives it the minimum, the few passages that actually hold the answer, each pinned to a clear anchor on its exact lines.

Three bricks, three ways the room to invent got smaller. The ground is prepared, and generation only has to not waste it.

*The model predicts either way and you can’t control what it saw in training, so ground the answer in the retrieved passage – Image by author*

Generation is where you spend that preparation, and the lever is not a smarter prompt (“don’t make things up” changes nothing). It’s controlled execution. The model answers only from the passages in front of it, in a typed shape, with a citation for every claim. Structured input, passages plus question, in. Structured output, a typed schema with citations, fidelity flags, and feedback for the pipeline, out. Ask for “an answer” and the model fills the gaps from memory. Ask for a structured object whose every field is checked against the input, and it has far less room to invent.

2. Asking the model for more than “the answer”

The schema is the contract between the pipeline and the model, and it doesn’t have to stop at “the answer”. The minimal RAG pipeline’s AnswerWithEvidence was the minimum that earns the word “RAG”: a direct answer, the evidence start/end, a confidence, optional quotes and caveats. That works for prose questions. Every field we add past that is another question the schema asks the model, and each earns its place by giving the pipeline something it couldn’t get otherwise.

The rich contract stacks four kinds of fields on top of the minimal one:

Typed values per shape (section 2.1): Amount(value, currency, unit) instead of the string "USD 1,200 per claim"; DateValue(iso, original) instead of "15 March 2024"; TableValue(headers, rows) instead of pipe-separated text. Downstream code never re-parses a string.
Multi-element answers with multi-span citations (section 2.2): many real questions have a list as the answer; many single-element answers have non-contiguous evidence (a definition on page 5 plus an example on page 23). The schema models both directly.
Self-assessment + pipeline-feedback fields (section 2.3): confidence, caveats, answer_found, complete_answer_found, context_structured, llm_discovered_keywords, conflicting_evidence, suggested_clarification. Each of these makes the model emit a signal the pipeline reads to decide its next move.
Programmatic completeness (section 2.4): the one signal we deliberately do not ask the model for. It’s set by the pipeline based on what retrieval was parametrized to include (an overlap page beyond the section, for instance). It is strong because it is deterministic: grounded in document structure, not in the model’s self-rating.

These four are the ones we develop here. The list isn’t closed. Once the schema is the contract, adding a new indicator costs one field declaration and one prompt fragment: “flag redacted blocks”, “return per-item table_confidence”, “emit jurisdiction when the clause cites a law”, anything the pipeline needs. The right indicators depend on the corpus and the downstream consumer; this article shows the four general-purpose ones, and the registry pattern (section 2.3) keeps the door open for custom shapes.

The schema is built bottom-up in four layers, one per subsection: Value (section 2.1, the typed primitive: Amount, DateValue, TableValue, plus a Span that holds one contiguous citation range) → Item (section 2.2, one Value + its evidence Spans, e.g. AmountItem(amount, spans)) → Answer (section 2.3, a list[Item] + the self-assessment and pipeline-feedback fields shared via AnswerBase; a registry maps shape labels to Answer subclasses) → Programmatic completeness (section 2.4, the one signal computed by the pipeline rather than asked of the model). How this contract is enforced at decoding time (Pydantic v2 schemas + OpenAI’s Responses API client.responses.parse(...), plus the fallback hierarchy for providers without constrained decoding) is section 2.5.

A note on vocabulary, used consistently from here on. Value = a typed primitive (Amount, DateValue, Address, …), one Pydantic class per concept. Shape = the string label ("amount", "date", …) that the question parsing brick emits and that the registry uses as a key. Item = one value + its evidence spans (AmountItem(amount, spans)). Schema (or Answer) = the top-level Pydantic class passed to responses.parse (AmountAnswer, TextAnswer, …); it inherits AnswerBase for the shared feedback fields. Contract = informal name for what the schema enforces: shape, types, required fields, so the pipeline can read result.answer.items without try/except.

2.1 Typed values, one schema per `answer_type`

The first layer of the schema is the value: the typed primitive the model fills in. The question parsing brick tags every question on two orthogonal axes:

answer_shape (the cardinality): single / listing / table / tree / nested_json
answer_type (the value type): amount, date, iban, text, boolean, …

The registry that maps each answer_type to a concrete value class is built in section 2.3 once AnswerBase is in scope; the shape axis decides how these values get wrapped (one value vs a list vs a 2D table). Here we just define the value classes and the citation atom (Span).

All line numbers are global (across the whole document, not per-page); the convention is enforced in the BASE prompt of Article 8B (prompt assembly).

Each value type, in brief:

Amount(value, currency, unit): currency is an ISO 4217 code, unit is optional, like "per claim". The schema enforces that the currency exists and is a string, so downstream sums and conversions work.
DateValue(iso, original): ISO 8601 plus the original phrasing as written in the document. Two fields because the consumer wants the parseable form, and the user wants to recognise what’s in the source.
TableValue(headers, rows): a true 2D structure, not pipe-separated text. Useful for premium grids, comparison tables, and any “list me X for each Y” question.
bool for boolean answers (covered? excluded? compliant?), with caveats carrying any required nuance.
text: str for everything else: definitions, paraphrases, narrative answers.

Each value will be wrapped in an Item in section 2.2: AmountItem(amount: Amount, spans: list[Span]) and so on, one Item class per shape. This is more verbose than a single answer: str, but the verbosity is what makes the output programmatically useful.

A companion field, extraction_method, lives one layer up on AnswerBase (section 2.3) and says how the answer was obtained. It’s the field-level version of the section 1 point: verbatim and computed are grounded in the passages in front of the model, while inferred is the model filling a gap from its own memory, the recall we don’t trust on your documents. The four values:

"verbatim": the value is written word-for-word in the passages. The validator (Article 8C, validation) reads this and requires at least one quote that is a substring of the cited spans.
"computed": the value required combining several elements from the passages (summing line items, for example). Should be checked.
"inferred": the value is derived but not explicit. Should be reviewed by a human.
"na": no answer.

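from pydantic import BaseModel

class Span(BaseModel):
    line_start: int            # first cited line (global numbering)
    line_end: int              # last cited line, inclusive
    quote: str | None = None   # optional echo, used only as a check

class Amount(BaseModel):
    value: float
    currency: str              # ISO 4217 code
    unit: str | None = None

class DateValue(BaseModel):
    iso: str                   # ISO 8601
    original: str              # phrasing as written in the document

class TableValue(BaseModel):
    headers: list[str]
    rows: list[list[str]]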

Worked example: the address question. A user asks “what’s the address?” and the database that consumes the answer wants four columns: street, postal_code, city, country. The instinct is to ask four questions: “what’s the street?”, “what’s the postal code?”, “what’s the city?”, “what’s the country?”. Each call retrieves the same passage (in the source, the address sits as one block: “350 Fifth Avenue, New York, 10118, USA”), and asks the model to slice off one piece. Four round-trips for one extraction. Four chances for the model to drift. Four times the data crossing the API boundary. Four times the cost.

The developer’s move is to declare the typed value once: Address(BaseModel) with fields street, postal_code, city, country, registered next to Amount, DateValue, TableValue in ANSWER_REGISTRY. From then on, a single question returns a populated Address object that maps directly to INSERT INTO addresses(street, postal_code, city, country) VALUES (...). One call, one retrieval (the source had the address in one block anyway), one row, one place to audit.

The schema is both contract and instruction. The API enforces that the four fields exist, and the model, seeing the four named fields in the response shape, knows to break the block apart on its own.

This is the amplify-the-expert pattern at the field level. End users keep asking their natural “what’s the address?”; the developer codifies the structuring once and the next thousand answers flow into the SQL pipeline without re-asking. The same logic applies to every recurring extraction the corpus contains: a person’s name into first_name / last_name / middle_initial, a price into value / currency / unit, a date range into iso_start / iso_end. Each one is a small Pydantic class the developer adds once.
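As a minimal sketch of that mapping, assuming an in-memory SQLite table and a plain dataclass standing in for the Pydantic Address class (the table layout and the insert_address helper are illustrative, not taken from the companion notebooks):

```python
import sqlite3
from dataclasses import dataclass, astuple

# Stand-in for Address(BaseModel); a dataclass keeps this sketch stdlib-only.
@dataclass
class Address:
    street: str
    postal_code: str
    city: str
    country: str

def insert_address(conn: sqlite3.Connection, addr: Address) -> None:
    """Write one extracted address as one row, using bound parameters."""
    conn.execute(
        "INSERT INTO addresses(street, postal_code, city, country) VALUES (?, ?, ?, ?)",
        astuple(addr),  # field order matches the column order above
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE addresses(street TEXT, postal_code TEXT, city TEXT, country TEXT)"
)

# What a single structured call could return for the block
# "350 Fifth Avenue, New York, 10118, USA" (hand-written here, no model call).
extracted = Address(
    street="350 Fifth Avenue", postal_code="10118", city="New York", country="USA"
)
insert_address(conn, extracted)
rows = conn.execute(
    "SELECT street, postal_code, city, country FROM addresses"
).fetchall()
```

The point of the sketch is that nothing between the typed object and the row parses a string: the field names are the column names.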

Worked example: comparing amounts. Never ask the model the comparison. A user asks “is the contract premium above one million dollars?”. The naive path is to ask exactly that: “yes or no, is the premium > 1,000,000 USD?”, and trust the answer. Don’t. The model has to do three things in one shot: locate the premium, parse its currency, and compare. Each step is a chance to drift, and the binary output erases the value that produced it: no audit trail. Worse, currency conversion happens silently, with no visible exchange rate: a 100,000,000 JPY premium becomes “yes” or “no” depending on whatever the model believes the JPY/USD rate is today.

The right move: extract first, then compare in Python. Ask for an Amount(value, currency, unit). Apply the conversion explicitly (amount.value * RATE[amount.currency]["USD"]). Compare with the threshold. Every step is visible, auditable, replayable, and if the conversion rate updates, the answer can be recomputed without re-calling the model. The rule generalizes: never delegate computation, comparison, or aggregation to the LLM when the result can be derived deterministically from extracted values. The LLM extracts; Python compares.

*Extract first, compare in Python: a JPY premium silently swallowed by the LLM – Image by author*

What the typed extraction looks like on four real shapes (Address, Amount, DateValue, PersonName), with realistic noise around the target on the raw side, and the clean Pydantic object on the right:

*Raw passage to structured object across four typed shapes – Image by author*
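A minimal sketch of extract-then-compare, under stated assumptions: the RATE table is a hypothetical lookup with a placeholder JPY rate (not a market quote), exceeds_usd is an illustrative helper name, and a dataclass stands in for the Pydantic Amount so the block runs on the standard library alone.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for Amount(BaseModel).
@dataclass
class Amount:
    value: float
    currency: str
    unit: Optional[str] = None

# Hypothetical rate table, RATE[source]["USD"]. Placeholder values only.
RATE = {
    "USD": {"USD": 1.0},
    "JPY": {"USD": 0.0065},
}

def exceeds_usd(amount: Amount, threshold_usd: float, rate=RATE):
    """Convert explicitly, then compare.

    Returns (verdict, converted_value) so the number behind the
    yes/no stays available for the audit trail.
    """
    if amount.currency not in rate:
        raise KeyError(f"no conversion rate for {amount.currency}")
    usd_value = amount.value * rate[amount.currency]["USD"]
    return usd_value > threshold_usd, usd_value

# The model only extracted this; it never saw the threshold.
premium = Amount(value=100_000_000, currency="JPY")
above, usd_value = exceeds_usd(premium, 1_000_000)
```

With this placeholder rate the premium converts to roughly 650,000 USD, so the verdict is False. Changing the rate changes the verdict without a new model call, which is the replayability the paragraph above describes.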

2.2 Multi-element answers and multi-span citations

The minimal schema assumes one answer with one contiguous span of supporting lines. Real questions break that assumption two ways.

Many questions have a list as their answer, not a single value. “What are the categories under the Identify function?” expects six items, each with its own evidence. “Which exclusions apply to flood damage?” expects however many exclusions are written, each pointing to its own clause. The schema models this with items: list[XItem]. Zero items means not found, one item means a single answer, N items means a list. Each item carries its own value and its own evidence, never a single span covering the whole list.

Even single-element answers often have non-contiguous supporting evidence. A definition on page 5 plus an example on page 23. A condition plus its exception in a separate paragraph. A value plus its footnote. Forcing a single contiguous range either over-cites (one big span swallowing irrelevant lines in between) or under-cites (picking one of the spans and losing the others). The schema models this with spans: list[Span] per item. A single-span answer is just a list of length one. A multi-span answer has each region cited separately.

Span is the small atom: a contiguous line_start..line_end range, plus an optional quote. Two steps, kept separate. What the model returns is pure structured data: line numbers, typed values, flags. It never returns the evidence text. Afterward, the pipeline recovers the rest from the source tables: the exact snippet by joining line_df on line_start..line_end, and the bounding box for the PDF (Article 8C, validation). The quote is the one field where the model may echo text, and it earns its place only as a check: the validator confirms it’s a substring of the cited lines, which catches a wrong line number. The snippet the user reads is always the recovered one, never the model’s.
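The two steps can be sketched as follows. This is an illustration under assumptions, not the validator from Article 8C: a plain dict keyed by global line number stands in for line_df, the sample lines are invented, and recover_snippet and quote_matches are hypothetical helper names.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for Span(BaseModel).
@dataclass
class Span:
    line_start: int
    line_end: int
    quote: Optional[str] = None

# Hypothetical stand-in for line_df: global line number -> line text.
LINES = {
    233: "(b) damage from flood or rising water;",
    234: "(c) damage from earthquake or seismic events;",
    235: "(d) damage from landslide.",
}

def recover_snippet(span: Span, lines: dict) -> str:
    """Rebuild the evidence text from the source, never from the model."""
    wanted = range(span.line_start, span.line_end + 1)
    return "\n".join(lines[n] for n in wanted if n in lines)

def quote_matches(span: Span, lines: dict) -> bool:
    """A quote, when present, must be a substring of the cited lines."""
    if span.quote is None:
        return True  # nothing to check
    return span.quote in recover_snippet(span, lines)

good = Span(234, 234, quote="(c) damage from earthquake or seismic events;")
off_by_one = Span(235, 235, quote="(c) damage from earthquake or seismic events;")
```

Here quote_matches(good, LINES) holds while quote_matches(off_by_one, LINES) does not, which is the wrong-line-number case the quote exists to catch. A real check would probably normalize whitespace first, since a quote that crosses a line break will not match a newline-joined snippet exactly.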

from pydantic import BaseModel, Field

# One Item class per shape: a typed value plus its own evidence spans.
class TextItem(BaseModel):
    text:  str
    spans: list[Span] = Field(default_factory=list)

class AmountItem(BaseModel):
    amount: Amount
    spans:  list[Span] = Field(default_factory=list)

class DateItem(BaseModel):
    date:  DateValue
    spans: list[Span] = Field(default_factory=list)

class BooleanItem(BaseModel):
    boolean: bool
    spans:   list[Span] = Field(default_factory=list)

class TableItem(BaseModel):
    table: TableValue
    spans: list[Span] = Field(default_factory=list)

The same Item shape covers two patterns side by side: a multi-element list (one item per element, each with its own span) and a single item whose evidence spans non-contiguous regions of the source:

*Multi-element list vs multi-span citation: same Item shape, different cardinalities – Image by author*

2.3 Self-assessment and pipeline-feedback fields

The feedback fields are where the schema starts steering the pipeline. Every shape-specific answer inherits the same AnswerBase fields, in two groups: what the model thinks of its own output, and what the pipeline reads to decide its next move.

Self-assessment: what the model thinks of its own output:

confidence: float ∈ [0, 1]: the model’s self-rated certainty. Not a calibrated probability; treat it as a triage signal. A 0.5 or 0.6 deserves a second look; a 0.9 on a complex table question doesn’t mean the answer is correct.
caveats: list[str]: natural-language limitations: “The clause uses ‘reasonable’ without defining it.” “Two passages give conflicting dates.” For legal or compliance use, caveats are often more valuable than the answer itself.
extraction_method (already covered in section 2.1): verbatim / computed / inferred / na.

Pipeline-feedback: what the pipeline reads to decide its next move:

answer_found: bool and complete_answer_found: bool: two binary signals, on purpose. answer_found=False means we extracted nothing usable (shape mismatch on amount/date, or off-corpus). answer_found=True, complete_answer_found=False means we got something but it’s partial. The model sets this when it spots in-passage clues of incompleteness: a numeric expectation contradicted by what’s there (“five exclusions” but only three on the page), a forward reference (“see Section 7 for more…”), a sentence that dangles into a comma at the bottom of the page. The case the model cannot detect (a clean ending that is in fact mid-list) is what section 2.4’s strong signal is for. The full answer requires both flags True. Splitting the signal lets the pipeline take different actions: NA path on the first, broader retrieval on the second.
context_completeness_weak: float ∈ [0, 1]: the model’s view, from inside the retrieved scope, of whether the passages provided enough context. The model judges from dangling punctuation, mid-sentence cutoffs, and forward references the passage itself shows. Weak because it can only see what was retrieved; if the truncation is invisible from inside (the page ends with a clean period mid-list), this signal misses it. Section 2.4 pairs it with a strong programmatic signal that looks beyond the retrieved scope.
context_structured: bool: flags whether the passage looked well-parsed. If the model received what looks like a garbled table (column values jumbled together, headers and rows mixed), it sets this to False. The pipeline can then route that page through a different parser (Camelot, Docling, vision-language model) and retry. The model becomes a detector of upstream parsing failures.
llm_discovered_keywords: list[str]: the model’s contribution to the next iteration. While reading the passages, the model often notices terms that would have made the original retrieval better. “I see this passage uses the term ‘declaration page’. Was that in the original query?” These keywords get logged and can be added to the next retrieval round.
keywords_found: list[str]: which of the original query terms appeared in the passages. If the user asked about “premium” and the passage doesn’t contain that word, the connection between question and answer is purely semantic. Information worth surfacing.
conflicting_evidence: bool: flags passages that contradict each other. Common in contracts with amendments, in versioned documents, in regulatory filings with revisions. The model says “I see two dates and they don’t agree” rather than picking one arbitrarily.
suggested_clarification: str | None: what the model offers when the question is too ambiguous to answer confidently. Connects directly to the question parsing brick: when the system should ask rather than guess, the model proposes the clarification.

Architectural split: RichAnswer (or rather, the family TextAnswer, AmountAnswer, …) is what the LLM produces. The pipeline keeps its trace separately on a sibling GenerationResult so it never travels through responses.parse. Two reasons. Architectural: the trace is filled by the dispatcher (Article 8B, prompt assembly), never by the model; keeping it out of the LLM-facing schema makes that boundary explicit. Mechanical: OpenAI’s structured-output mode requires every object schema to declare additionalProperties: false. A free-form dict[str, Any] field on the LLM-facing schema makes the request fail. Keeping the trace on GenerationResult sidesteps the constraint by construction.

from typing import Literal
from pydantic import BaseModel

class AnswerBase(BaseModel):
    # Self-assessment
    extraction_method: Literal['verbatim','computed','inferred','na']
    confidence: float
    caveats: list[str] = []
    # Pipeline feedback
    answer_found: bool
    complete_answer_found: bool
    context_completeness_weak: float
    context_structured: bool
    llm_discovered_keywords: list[str] = []
    keywords_found: list[str] = []
    conflicting_evidence: bool
    suggested_clarification: str | None = None

class TextAnswer(AnswerBase):    items: list[TextItem]
class AmountAnswer(AnswerBase):  items: list[AmountItem]
class DateAnswer(AnswerBase):    items: list[DateItem]
class BooleanAnswer(AnswerBase): items: list[BooleanItem]
class TableAnswer(AnswerBase):   items: list[TableItem]
ListAnswer = TextAnswer

# Shape label (emitted by question parsing) -> Answer class passed to responses.parse
ANSWER_REGISTRY = {
    "text": TextAnswer,    "amount": AmountAnswer,
    "date": DateAnswer,    "boolean": BooleanAnswer,
    "table": TableAnswer,  "list": ListAnswer,
}

The answer payload is only half of what generation returns. The pipeline also needs a trace of what was used to produce it (model name, prompt version, retrieved context), so the answer ships wrapped in a GenerationResult:

from dataclasses import dataclass, field
from typing import Any

@dataclass
class GenerationResult:
    answer: AnswerBase
    meta:   dict[str, Any] = field(default_factory=dict)

The four examples below show the same AnswerBase populated for four different cases, each a (Question, retrieved Context, generated JSON) triple. Retrieval is assumed correct in all four: the variation comes from what the document contains and how the model reports it. The combination of answer_found, complete_answer_found, conflicting_evidence, and caveats is what tells the dispatcher (Article 8C, validation) which path to take next: ship the answer, retry retrieval, fall through to the no-answer path, or return a clarification.

1. Complete answer. The user asks for the five Functions of the NIST Cybersecurity Framework (US Government work, public domain in the US, see NIST copyright statement), retrieval lands on the passage that lists all five, and the model returns one item per Function with its evidence span. answer_found=True, complete_answer_found=True, caveats=[], high confidence. The dispatcher reads these signals and ships the answer as-is.

{
  "items": [
    {"text": "Identify", "spans": [{"line_start": 88, "line_end": 88, "quote": null}]},
    {"text": "Protect",  "spans": [{"line_start": 89, "line_end": 89, "quote": null}]},
    {"text": "Detect",   "spans": [{"line_start": 90, "line_end": 90, "quote": null}]},
    {"text": "Respond",  "spans": [{"line_start": 91, "line_end": 91, "quote": null}]},
    {"text": "Recover",  "spans": [{"line_start": 92, "line_end": 92, "quote": null}]}
  ],
  "extraction_method": "verbatim",
  "confidence": 0.95,
  "caveats": [],
  "answer_found": true,
  "complete_answer_found": true,
  "context_completeness_weak": 0.9,
  "context_structured": true,
  "llm_discovered_keywords": [],
  "keywords_found": ["function", "framework"],
  "conflicting_evidence": false,
  "suggested_clarification": null
}

2. Partial answer. The user asks for natural-disaster exclusions, the retrieved passage lists earthquake (one match) and explicitly flags “continued in Section 7” on line 236, but Section 7 wasn’t retrieved. The model returns the one item it can extract, sets complete_answer_found=False, and reports "Section 7" in llm_discovered_keywords. The dispatcher reads complete_answer_found=False and triggers a broader retrieval round using the discovered keywords before returning the final answer to the user. This scenario is the in-passage detection case: the truncation is visible from inside the retrieved scope because of the “continued in” hint. The harder case, where the passage ends cleanly with no such hint, is what section 2.4’s next-page peek catches.

{
  "items": [
    {"text": "Damage from earthquake or seismic events",
     "spans": [{"line_start": 234, "line_end": 234,
                "quote": "(c) damage from earthquake or seismic events;"}]}
  ],
  "extraction_method": "verbatim",
  "confidence": 0.7,
  "caveats": [
    "Only 1 exclusion found in retrieved passage; line 236 points to Section 7 (not retrieved)."
  ],
  "answer_found": true,
  "complete_answer_found": false,
  "context_completeness_weak": 0.5,
  "context_structured": true,
  "llm_discovered_keywords": ["Section 7", "additional exclusions"],
  "keywords_found": ["exclusion"],
  "conflicting_evidence": false,
  "suggested_clarification": null
}

3. No answer. The user asks about the cancellation period, but retrieval pulled the premium-schedule passage, which doesn’t mention cancellation. The model honestly returns items=[], answer_found=False, extraction_method="na", and a caveat naming what the passage did cover versus what’s missing. The dispatcher takes the no-answer path: either tell the user “not found in this document”, or rephrase the query and retry once before giving up.

{
  "items": [],
  "extraction_method": "na",
  "confidence": 0.0,
  "caveats": [
    "Retrieved passage covers premium, deductible and fees, not the cancellation period."
  ],
  "answer_found": false,
  "complete_answer_found": false,
  "context_completeness_weak": 0.2,
  "context_structured": true,
  "llm_discovered_keywords": [],
  "keywords_found": [],
  "conflicting_evidence": false,
  "suggested_clarification": null
}

4. Conflicting evidence. The user asks for the effective date, retrieval brings back both the original date (line 56) and a later amendment (line 178), with different values. The model returns both items rather than picking one, sets conflicting_evidence=True, names the conflict in caveats, and proposes a suggested_clarification. The dispatcher reads conflicting_evidence=True and shows the conflict to the user instead of guessing.

{
  "items": [
    {"text": "2024-03-15",
     "spans": [{"line_start": 56, "line_end": 56,
                "quote": "Effective: 15 March 2024 (original)"}]},
    {"text": "2024-04-01",
     "spans": [{"line_start": 178, "line_end": 178,
                "quote": "Effective date: 1 April 2024 (amended)"}]}
  ],
  "extraction_method": "verbatim",
  "confidence": 0.5,
  "caveats": [
    "Two effective dates found: 15 March 2024 (original) and 1 April 2024 (amendment)."
  ],
  "answer_found": true,
  "complete_answer_found": true,
  "context_completeness_weak": 0.85,
  "context_structured": true,
  "llm_discovered_keywords": ["amendment"],
  "keywords_found": ["effective", "date"],
  "conflicting_evidence": true,
  "suggested_clarification": "Original date (2024-03-15) or amended (2024-04-01)?"
}
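How a dispatcher might read these flags can be sketched in a few lines. This is an illustration, not the dispatcher from Article 8C: the function name next_action, the action labels, and the order in which the flags are tested are all assumptions.

```python
def next_action(answer: dict) -> str:
    """Map the feedback flags of a parsed answer to the pipeline's next move.

    Assumed precedence: nothing found beats everything, a conflict is
    surfaced before a retry, and only a complete, conflict-free answer ships.
    """
    if not answer["answer_found"]:
        return "no_answer"          # tell the user, or rephrase and retry once
    if answer["conflicting_evidence"]:
        return "clarify"            # show both values, ask which one is meant
    if not answer["complete_answer_found"]:
        return "retry_retrieval"    # broaden scope, reuse discovered keywords
    return "ship"

# Only the routing flags of the four cases above are reproduced here.
cases = {
    "complete": {"answer_found": True, "complete_answer_found": True,
                 "conflicting_evidence": False},
    "partial": {"answer_found": True, "complete_answer_found": False,
                "conflicting_evidence": False},
    "no_answer": {"answer_found": False, "complete_answer_found": False,
                  "conflicting_evidence": False},
    "conflict": {"answer_found": True, "complete_answer_found": True,
                 "conflicting_evidence": True},
}
actions = {name: next_action(flags) for name, flags in cases.items()}
```

Keeping the routing in one small pure function makes the policy testable without a model call; confidence and caveats could refine it further (for instance, holding back a "ship" below some confidence floor), which this sketch leaves out.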

2.4 The complement: programmatic completeness (the strong signal)

One completeness signal is too important to trust the model with, so the pipeline computes it. There’s deliberately no context_completeness_strong field in the schema the model fills: the value is set by the pipeline, based on what retrieval was told to include in the context.

Imagine you ask the model for the list of exclusions in a policy. Retrieval anchors on the “Exclusions” section via the TOC and hands the model page 5: items (a) through (e), the last one ending with a clean period. The model reads page 5, emits five items, sets complete_answer_found=True and context_completeness_weak high. From inside the page, the list reads as complete.

But neither the model, nor a human reading page 5 alone, can tell whether items (f), (g), (h) sit on page 6. A clean period at the bottom of a page proves a sentence is finished, not that a list is. The only way to know is to look at the next page. And the next page is one of two things:

A new section heading (something like “Section 5: Coverage Limits”). The previous list was bounded by the section change. Page 5 was complete.
A continuation of the previous section (items (f), (g), (h) where you expected the heading). The list was truncated. Page 5 looked complete only because the page break happened to land on a clean sentence.

The trap is that the LLM never sees page 6. It judges from whatever it receives, and a page-five-only context always reads as complete when the cut is clean, even if the document continues. So context_completeness_weak cannot catch this class of failure, no matter how well the model introspects.

The fix is a retrieval choice, not a generation one. Retrieval is best treated as a parametric module with several knobs:

start page, end page
line-level selection (sometimes the answer is two specific lines and you want nothing more)
and (relevant here) an optional one-page overlap beyond the section’s last known page

Whether a query asks retrieval for “just the matching section” or “the section plus one overlap page” is set per question shape. The overlap page never goes to the LLM; it stays with the pipeline as evidence for the post-generation completeness check. With the overlap, the pipeline gets a deterministic verdict: a new heading at the top → bounded; continuation content → truncated.

The analogy is chunk overlap: chunk overlap ensures no fact is sliced in half between two chunks; the page-overlap retrieval parameter ensures no list is sliced in half between two retrieval scopes. Either way, safety is bought by deliberately pulling slightly more than seems strictly necessary.

When is the tail worth the cost? It turns on how good the TOC is.

If the TOC is perfect to the line: when parsing’s toc_df accurately marks every section’s start and end, retrieval can pull exactly the relevant section without a tail. You save tokens. The strong signal becomes optional insurance.

If the TOC is imperfect (the typical case: the document has no TOC, the parser missed a heading because of an unusual font, the section ran slightly longer than the TOC suggested), the one-page tail is the safety net. The cost is one extra page per query (~500-1000 tokens for a typical PDF). The benefit is deterministic detection of truncated answers: a class of failure neither the model nor an expert can catch from the retrieved context alone.

What this needs from parsing and retrieval. Both upstream modules contribute. Parsing exposes section_end_page in toc_df via a simple convention: TOCs almost never spell out where sections end, but the next section’s start_page is the implicit end + 1. With that column, retrieval has a one-lookup answer to “how far does this section go?”. Retrieval then decides, per question shape, whether to pull [start_page, section_end_page] exactly or to add the one-page tail. Generation only consumes the resulting context_completeness_strong field: it doesn’t decide the retrieval shape, it reads the signal and reacts (ship the answer, or trigger a refetch).
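The end-page convention can be sketched in a few lines. This is an illustration under stated assumptions, not the series code: the TOC is represented as a plain list of dicts rather than the article’s toc_df, the function name `add_section_end_pages` and the sample titles are invented, and rows are assumed to be sorted by start_page.

```python
def add_section_end_pages(toc_rows, last_page):
    """Return copies of the TOC rows with section_end_page filled in.

    Assumes rows are sorted by start_page and that a section ends on the page
    before the next section starts; the final section runs to last_page.
    """
    out = []
    for i, row in enumerate(toc_rows):
        if i + 1 < len(toc_rows):
            end = toc_rows[i + 1]["start_page"] - 1
        else:
            end = last_page
        # Two sections starting on the same page would give end < start; clamp.
        out.append({**row, "section_end_page": max(end, row["start_page"])})
    return out


# Made-up TOC for illustration.
toc = [
    {"title": "Section 3. Definitions", "start_page": 2},
    {"title": "Section 4. Exclusions", "start_page": 4},
    {"title": "Section 5. Limits", "start_page": 6},
]
toc_with_ends = add_section_end_pages(toc, last_page=9)
```

The clamp matters for short sections: when two headings share a page, "next start minus one" would otherwise place the end before the start.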

The figure below shows the truncation case in action. The amber panel is what the LLM saw: page 5 only, with the five items ending on a clean period. From the model’s seat, the list reads as complete and the JSON it returns says so (complete_answer_found=true). The blue panel is what the pipeline pulled separately as evidence for the post-generation check: page 6’s first lines, which begin with item (f) instead of a new section heading. The model never saw the blue panel; the pipeline did, so it sets context_completeness_strong=false and triggers a refetch with the wider scope.

*Pipeline’s page-6 peek catches a truncation the LLM missed; refetch triggered – Image by author*

The bounded case is the mirror image. Same question, same page-5 content, but the blue panel starts with “Section 5. Protection Limits” instead of item (f). The pipeline marks context_completeness_strong=true and ships the answer: confirmed bounded this time, the model’s claim now backed by the pipeline check.
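The two cases reduce to one deterministic check on the first lines of the overlap page. The sketch below is an assumption-laden illustration: the regex heading detector, the function name, and the sample lines are invented here, and a real pipeline would more plausibly reuse the parser’s own heading detection.

```python
import re

# Assumed heading pattern, for illustration only.
HEADING_RE = re.compile(
    r"^(Section|Part|Article|Chapter)\s+\d+[.:]?\s+\S", re.IGNORECASE
)


def context_completeness_strong(peek_lines):
    """Judge the overlap page: True = bounded, False = truncated, None = unknown.

    None is returned when no overlap page was fetched or when it holds no text,
    so "not checked" is never confused with "checked and bounded".
    """
    if peek_lines is None:
        return None
    for line in peek_lines:
        text = line.strip()
        if text:
            # The first non-empty line decides: a new heading means bounded.
            return bool(HEADING_RE.match(text))
    return None


# Made-up peek contents for the two cases.
truncated = context_completeness_strong(["", "(f) any loss arising from subsidence"])
bounded = context_completeness_strong(["Section 5. Protection Limits"])
```

Returning `None` for the unchecked case keeps the field three-valued, which lets the caller decide separately what to do when no overlap was requested.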

A secondary check: per-span boundary cleanness. This one is a helper, not the headline. For each Span in each answer item, the pipeline can ask: does the span start at the beginning of a paragraph or mid-sentence? Does it end at clean terminal punctuation? Are the lines contiguous, or is there a gap? These per-span checks catch a different failure mode (a single span that cut the supporting evidence in half), and they don’t require a peek. Useful as a per-item triage tool; not a replacement for the next-page peek when the question is “is the answer set complete?”.
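The three questions translate into three cheap boolean checks. This is a rough sketch: the `Span` class below is a stand-in with assumed fields (the article’s Span may look different), and the start-of-paragraph test is a crude character heuristic rather than layout-aware logic.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical evidence span; field names are assumptions."""
    text: str
    line_numbers: list[int]


def span_boundary_report(span: Span) -> dict[str, bool]:
    """Answer the three boundary questions for one span."""
    stripped = span.text.strip()
    lines = span.line_numbers
    # Heuristic: a clean start is a capital letter, a digit, or a list marker.
    clean_start = bool(stripped) and (
        stripped[0].isupper() or stripped[0].isdigit() or stripped[0] in "(["
    )
    return {
        "clean_start": clean_start,
        "clean_end": stripped.endswith((".", "!", "?")),
        # Contiguous means every line number follows the previous one by 1.
        "contiguous": all(b - a == 1 for a, b in zip(lines, lines[1:])),
    }


# Made-up spans for illustration.
good = Span("The insurer covers fire damage.", [12, 13])
cut = Span("and any loss arising from", [20, 22])
```

A span that fails any of the three flags is a candidate for manual review or a wider re-extraction, not proof that the answer is wrong.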

2.5 How the contract is enforced

Constrained decoding is what makes the contract real: with responses.parse, the model cannot return an output that fails to parse. Below it sits a hierarchy of weaker fallbacks, ordered by reliability:

Pydantic + responses.parse (or an equivalent native structured-output API). The API enforces the schema at decoding time: the model cannot return an output that fails to parse. Most reliable. This is what the rest of this article uses.
JSON Schema with structured output mode. Same idea, JSON-native. Used when Pydantic isn’t available or when targeting a non-Python consumer.
JSON Schema in the prompt with “return valid JSON”. No decoding-level enforcement. The model usually complies, but you have to validate after the fact. Use as a fallback when the provider doesn’t expose a structured-output API.
Just “return JSON” as a vague instruction. Avoid. The model will mostly comply, but it will occasionally wrap the JSON in ```json blocks, prepend “Here’s the answer:”, or include trailing commas. Each of these breaks downstream consumers.
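To see why the last tier breaks consumers, it helps to feed the three failure shapes to a strict parser. The raw strings below are made up to mirror the cases listed above; only the standard-library `json` module is used.

```python
import json

FENCE = "`" * 3  # built programmatically to keep this example fence-safe

# Made-up raw outputs mirroring the failure modes listed above.
raw_outputs = {
    "fenced": FENCE + 'json\n{"value": 42}\n' + FENCE,
    "prefixed": 'Here\'s the answer: {"value": 42}',
    "trailing_comma": '{"value": 42,}',
    "clean": '{"value": 42}',
}


def try_parse(raw):
    """Return the parsed object, or None if the string is not strict JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


results = {name: try_parse(raw) for name, raw in raw_outputs.items()}
```

Only the clean string parses; the other three come back as `None`, which is what a downstream consumer calling `json.loads` directly would experience as an exception.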

JSON and Pydantic are interchangeable in concept: Pydantic is just a Python-friendly way to declare a JSON schema with validation. Either is meaningfully stronger than asking “return JSON” in a prompt and hoping.
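A small sketch of that equivalence, assuming Pydantic v2: the class below is a made-up two-field schema (far smaller than the article’s contract), and `validate_raw` is a hypothetical helper showing the validate-after-the-fact fallback for providers without decoding-level enforcement.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class MiniAnswer(BaseModel):
    """Made-up schema for illustration; not the article's contract."""
    value: float
    unit: Literal["EUR", "USD"]


# The same contract as a plain JSON Schema dict, usable by a non-Python consumer.
schema = MiniAnswer.model_json_schema()


def validate_raw(raw: str) -> Optional[MiniAnswer]:
    """Validate a raw model output after the fact; None means it broke the contract."""
    try:
        return MiniAnswer.model_validate_json(raw)
    except ValidationError:
        return None


ok = validate_raw('{"value": 1200, "unit": "EUR"}')
bad_type = validate_raw('{"value": "twelve hundred", "unit": "EUR"}')
bad_enum = validate_raw('{"value": 1200, "unit": "GBP"}')
```

With a structured-output API the class would be handed to the call itself, so the invalid cases could not be produced in the first place; the post-hoc validator only matters for the weaker tiers.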

In the wild: open source models and JSON. Reliability of structured output varies a lot across models. OpenAI’s structured output mode and Anthropic’s tool use are highly reliable. Among open source models, Phi-4, Mistral-Nemo, and Llama-3.3 with grammar-constrained decoding (vLLM grammars or llama.cpp GBNF) work well. “Thinking” models with explicit reasoning (DeepSeek-R1 style, certain Qwen modes) are less reliable for JSON: the reasoning trace pollutes the output and the model struggles to switch back to clean JSON at the end. For structured output workloads, prefer non-thinking models or explicit modes that disable reasoning. Quality of JSON output and raw model size are not correlated: a smaller model with grammar constraints often outperforms a larger one without.

3. Conclusion

The contract is declared: typed values so downstream code never re-parses a string, items that bind each value to its evidence spans, the self-assessment and pipeline-feedback fields, and the one completeness signal the pipeline computes itself. Declaring it is half the work. Article 8B (prompt assembly) builds the call that fills the contract: the schema picked from the registry, the system prompt composed from a fixed BASE plus fragments, the full trace saved for audit. Article 8C (validation) checks what comes back and decides the pipeline’s next move.

4. Sources and further reading

The contract rests on constrained decoding: the article uses OpenAI’s Structured Outputs (responses.parse, Aug 2024) and the mechanism described in Willard and Louf (Outlines, 2023). The reflection-token idea from Asai et al. (Self-RAG, ICLR 2024) is the published idea behind the pipeline-feedback fields of section 2.3. The citation-bearing answer schema (AnswerWithEvidence) is in the same family as Bohnet et al. (Attributed Question Answering, 2022). The vocabulary the literature uses is constrained decoding and structured generation; controlled execution is this brick’s coinage: the LLM call sits inside an engineered switch, not in front of an agent loop.

Same direction as the article:

OpenAI, Structured Outputs. The official doc + Aug 2024 launch post for “100% schema adherence”. The whole approach depends on this being reliable.
Willard & Louf, Efficient Guided Generation for Large Language Models (Outlines), 2023 (arXiv:2307.09702). The constrained-decoding paper; the open-source equivalent of OpenAI’s structured outputs and the mechanism that makes “schema is the contract” true.
Asai et al., Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection, ICLR 2024 (arXiv:2310.11511). Reflection tokens are the published idea behind the pipeline-feedback fields built in section 2.3.
Bohnet et al., Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models, 2022 (arXiv:2212.08037). Citation-bearing answer schemas before the constrained-decoding wave; the published idea behind AnswerWithEvidence.

Earlier in the series:

What works, what breaks

Baseline Enterprise RAG, from PDF to highlighted answer. The four-brick pipeline end to end: PDF in, highlighted answer out.
Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval. Where embedding similarity wins (synonyms, typos, paraphrase), where it predictably breaks (unknown words, negation, term-vs-answer relevance), and how to use it anyway.
RAG is not machine learning, and the ML toolkit solves the wrong problem. Why chunk-size sweeps and finetuning optimize the wrong thing; route by question type instead.
From regex to vision models: which RAG approach fits which problem. Two axes, document complexity and question control, that pick the approach for each case.

Document parsing

Beyond extract_text: the two layers of a PDF that drive RAG quality. The first half of the parsing brick: the document’s nature, signals, and summary.
Stop returning flat text from a PDF: the relational tables RAG needs. The second half of the parsing brick: the relational tables every downstream brick reads.
- When PyMuPDF can’t see the table: parse PDFs for RAG with Azure Layout. The same tables from Azure Layout: native table cells, OCR, paragraph roles.
- Parse PDFs for RAG locally with Docling: rich tables, no cloud upload. The same tables computed locally with Docling: TableFormer cells, nothing leaves the machine.
- Vision LLMs are PDF parsers too: reading charts and diagrams for RAG. Vision as a parser: the images become searchable text.
- Parse scanned PDFs for RAG with EasyOCR: free OCR gives you words, not a document. Where traditional OCR stops: text recovered, structure lost.
- Making a PDF’s images searchable for RAG, without paying to read all of them. The image cascade: filter cheap, classify, describe only what’s worth reading.
- Reconstructing the table of contents a PDF forgot to ship, so RAG can scope by section. Rebuilding toc_df when the PDF prints a contents page but ships no outline.

Question parsing

RAG questions need parsing too: turn the user’s string into briefs for retrieval and generation. The thesis of question parsing: why a user string needs the same parsing as a document, and how it splits into a retrieval brief and a generation brief.
What the question parser extracts from a user string: keywords, scope, shape, decomposition, clarification. The five families of columns the parser reads directly from the user’s question, with the code that fills each one.
Dispatching the parsed RAG question: chunk strategy, model tier, activations, audit. The decisions the parser makes on top of the user string, using the document’s profile: dispatch, activations, full schema, the audit trail (pipeline_trace.json), and a broker-corpus walkthrough.

Retrieval