Stop Returning Flat Text from a PDF: The Relational Shape RAG Needs

This article is part of the parsing brick of Enterprise Document Intelligence, a series that builds an enterprise RAG system from four bricks: parsing, question parsing, retrieval, and generation. Parsing comes first, and this is the second of its two parts. The previous part turned a PDF into line_df, one row per line of text on the page. This one covers the rest of the model: the full set of tables a parser should emit, what each holds, and how they link together, so the table on page 14 keeps its columns and the renewal price stays attached to its label. The other three bricks, and the highlighted answer at the end, all read these tables, never the raw PDF.

*where this article sits in the series: Article 5, the data-model half of the parsing brick, inside Part II (the four bricks) – Image by author*

RAG tutorials start the same way: text = extract_text(pdf). That single line is where the PDF problems begin.

You build a RAG pipeline. It works on a few clean documents. Then a customer sends you a real contract: 30 pages, with a Schedule of Costs table on page 14. The user asks “what’s the renewal price?” and the model returns the wrong number.

The team says: “the model can’t read tables.”

The model reads tables fine. The problem is upstream. Your parser walked the table cell by cell and joined the cells into one long string. The column structure is gone. The link between a label and its number is gone. Your model is asked to guess which number is the renewal price. Sometimes it guesses right. Often it doesn’t.

*The same four rows, joined cell by cell into one chunk. EUR 200 one-time, Late fee 75: who pairs with what? – Image by author*
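To make the failure concrete, here is a minimal sketch. The four rows are hypothetical (labels and amounts are invented for illustration, apart from the two fragments quoted in the caption above). Joining cells one by one loses the pairing between a label and its amount; keeping one record per row preserves it.

```python
# Hypothetical fee-schedule rows: (label, amount, note). Values are invented.
rows = [
    ("Setup", "EUR 200", "one-time"),
    ("Renewal", "EUR 120", "per year"),
    ("Late fee", "75", "per invoice"),
    ("Support", "EUR 40", "per month"),
]

# What a flat extractor hands you: every cell joined into one string.
flat_text = " ".join(cell for row in rows for cell in row)

# What a structured parser keeps: one record per row, label tied to its amount.
table = [{"label": label, "amount": amount, "note": note} for label, amount, note in rows]

def lookup(table, label):
    """Return the amount attached to a label, or None if the label is absent."""
    for record in table:
        if record["label"].lower() == label.lower():
            return record["amount"]
    return None

renewal = lookup(table, "Renewal")
```

In `flat_text`, "one-time" sits directly between "EUR 200" and "Renewal", so nothing says which row it belongs to. In `table`, the question is a lookup.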

The parser didn’t fail. It gave you what you asked for. You asked for the wrong thing.

A good PDF parser doesn’t extract text. It models the document as a relational set of tables. One PDF in, one table per kind of thing out (seven or eight today, and more as new needs show up).

toc_df: the sections, as the author wrote them.
page_df and line_df: the body. Every page. Every line.
image_df: every figure on every page.
span_df: bold, italic, color, font size. Every span of every line.
object_registry: every figure caption, every table caption, every annex.
cross_ref_df: every “see Figure 2”, every “see Table 4”, every “see Annex B”.
parsing_summary: tells you whether the PDF is born-digital, scanned, or mixed, and whether the OCR is good or bad.

Retrieval reads these tables. Generation reads these tables. Highlighting reads these tables. You open the PDF once. After that, you only work with tables.

This article covers each table in detail, then runs parse_pdf side by side on two very different PDFs to show that the same columns cover both. The previous article (“Beyond extract_text: the two layers of a PDF that drive RAG quality”) covers the upstream side: the declared signals the parser reads first and the page-level classification it runs before any line gets a number.

*how each table is produced: line_df, parsing_summary, toc_df and image_df come straight from the parse; page_df, span_df, object_registry and cross_ref_df are derived from line_df – Image by author*

1. One table per entity

Everything we’ve extracted is returned as a dictionary of tables plus a parsing summary, one table per entity of the document model.

The _df naming convention makes granularity readable from the name itself. The diagram at the top of this article shows how each table is produced. Four come straight from the parse: line_df (the text lines), parsing_summary (the document-level synthesis), toc_df (the native outline, via doc.get_toc), and image_df (via page.get_image_info). The other four are derived from line_df: page_df aggregates it by page, while span_df, object_registry, and cross_ref_df are extracted from its lines. How the tables then join one another is a separate question, taken up in section 2.

1.1. toc_df: table of contents

TOCs are everywhere in enterprise documents. Contracts, reports, policies, employee manuals, regulatory filings: almost all of them ship with a declared section structure, and that structure is the cheapest semantic signal you can hand a retriever.

The catch: it isn’t always native. Sometimes it’s only typographic (bold headings, numbered sections, indented subheadings) and has to be reconstructed from line_df + span_df.

We focus here on the native case (the common one for born-digital LaTeX, Word, and InDesign exports); reconstructing a TOC from typography when bookmarks are absent is its own topic, sketched by an adaptive parser and treated in full in a dedicated follow-up.

*declared outline with `parent_idx` and `breadcrumb`; empty when no native bookmarks – Image by author*

How to build it: build_toc_df(doc) calls doc.get_toc(simple=False) (one entry per bookmark, with the destination dict attached) and walks the result to compute parent_idx, breadcrumb, end_page, and start_y. Run on the Attention paper, you get the 22 entries already shown in section 1.2 above: three levels of headings, native bookmarks, no reconstruction needed.
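As an illustration of the walk, here is a minimal sketch of how parent_idx and breadcrumb can be derived from the outline levels alone. It is not the series' build_toc_df: the function name `add_hierarchy`, the `(level, title, start_page)` input shape, and the page numbers in the example are assumptions made for this sketch.

```python
def add_hierarchy(entries):
    """entries: list of (level, title, start_page) in outline order, level starting at 1.

    Returns a list of dicts with parent_idx (None for top-level rows) and breadcrumb.
    """
    rows = []
    stack = []  # indices of the currently open ancestors, shallowest first
    for idx, (level, title, start_page) in enumerate(entries):
        # Close every open entry that is at the same depth or deeper than this one.
        while stack and rows[stack[-1]]["level"] >= level:
            stack.pop()
        parent_idx = stack[-1] if stack else None
        crumbs = [rows[i]["title"] for i in stack] + [title]
        rows.append({
            "toc_idx": idx,
            "level": level,
            "title": title,
            "start_page": start_page,
            "parent_idx": parent_idx,
            "breadcrumb": " > ".join(crumbs),
        })
        stack.append(idx)
    return rows

# Illustrative entries; the page numbers are assumptions, not parser output.
toc = add_hierarchy([
    (1, "3 Model Architecture", 2),
    (2, "3.5 Positional Encoding", 5),
    (1, "4 Why Self-Attention", 6),
])
```

A single stack of open ancestors is enough because the outline arrives in document order: each new entry either nests under the last open entry or closes it.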

The implicit end_page convention: TOCs mark where sections begin, almost never where they end. build_toc_df materializes the end as a column anyway: for each row, end_page is the start_page of the next entry at the same level or shallower (the next peer or ancestor), with total_pages as the fallback for the last section. Look at Conclusion on the Attention paper: start_page=10, end_page=15. The document only has 15 pages, so the last section absorbs everything to the document’s end. The convention keeps a one-page overlap by design (a section’s end_page is its successor’s start_page, not successor.start_page - 1), which makes the generation brick’s next-page peek (a strong completeness signal that catches truncated lists at section boundaries) a single lookup rather than a runtime scan.
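The end_page rule can be sketched in a few lines. This is a hedged illustration, not the series' implementation: `add_end_pages` is a hypothetical name, and only the Conclusion row (start_page=10, 15 pages in total) comes from the text above; the other rows are invented.

```python
def add_end_pages(entries, total_pages):
    """entries: list of dicts with 'level' and 'start_page', in outline order."""
    out = []
    for i, entry in enumerate(entries):
        end_page = total_pages  # fallback: the section runs to the end of the document
        for later in entries[i + 1:]:
            if later["level"] <= entry["level"]:  # next peer or ancestor
                end_page = later["start_page"]    # one-page overlap by design, no "- 1"
                break
        out.append({**entry, "end_page": end_page})
    return out

toc = add_end_pages(
    [
        {"title": "6 Results", "level": 1, "start_page": 8},
        {"title": "6.1 Machine Translation", "level": 2, "start_page": 8},
        {"title": "6.2 Model Variations", "level": 2, "start_page": 9},
        {"title": "7 Conclusion", "level": 1, "start_page": 10},
    ],
    total_pages=15,
)
end_pages = [row["end_page"] for row in toc]
```

Note that a subsection with no later peer inherits the start of the next shallower entry, so "6.2" ends where "7 Conclusion" starts.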

The start_y column, for reference: each bookmark in a PDF outline carries a destination Point(x, y) on its target page, not just a page number. build_toc_df exposes the y as start_y (raw value as returned by fitz). It pins each section header to a precise position inside start_page, which is what enables line-level resolution: the same (target_page, target_y) → line join used for native links in section 1.6. Same coordinate-orientation caveat: 720 on the Attention paper (LaTeX, bottom-up) and 72 on NIST CSF (Acrobat, top-down) both point at the top of the page, just from opposite origins. We store the raw value; callers normalize when they need to land on a specific line.

start_page and end_page are page-level anchors. Line-level anchors (start_line, end_line) are the natural refinement: they let downstream stages pinpoint a section to the exact line in line_df, and they enable TOC offset detection when the document has front matter inserted after the TOC was generated (the whole TOC drifts by 1 or 2 pages, a real-world failure mode). The full treatment lives in a dedicated bonus article on TOC anchoring and validation; for now, toc_df stops at page-level granularity (with start_y as the bonus column for callers ready to resolve to a line).

The role: toc_df is the cheapest semantic signal in the whole pipeline. Each entry names a section: knowing that lines 100–150 belong to “3.5 Positional Encoding” tells the retriever and the LLM what those lines are about, before any embedding is computed. Embeddings give you topical proximity; the TOC gives you the document’s own structural meaning of each region, declared by the author, not inferred. The breadcrumb extends this with hierarchical context: a chunk gets stamped with “Methods > 3.5 Positional Encoding”, giving the language model section-level grounding without inflating the chunk text. end_page is what lets the generation brick peek one page past a retrieved section and detect truncated answers without a vision pass. When the document has a native TOC, all of this is free.

Watch out: TOC entries can point to pages that don’t exist (a corrupt or truncated export). Validate 0 <= page_num < n_pages before recording a row, or a section anchor lands nowhere and the page-range join from section 2 silently returns empty.
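A minimal sketch of that guard, assuming 0-based page numbers and a hypothetical helper name (`split_valid_toc_rows`). Keeping the rejected entries, rather than dropping them silently, is a design choice made here so they can be logged.

```python
def split_valid_toc_rows(raw_entries, n_pages):
    """Separate entries whose 0-based page_num falls inside the document from the rest."""
    kept, dropped = [], []
    for entry in raw_entries:
        if 0 <= entry["page_num"] < n_pages:
            kept.append(entry)
        else:
            dropped.append(entry)  # keep for logging instead of discarding silently
    return kept, dropped

# Illustrative entries for a 15-page document.
kept, dropped = split_valid_toc_rows(
    [
        {"title": "Introduction", "page_num": 0},
        {"title": "Conclusion", "page_num": 14},
        {"title": "Broken bookmark", "page_num": 15},
        {"title": "Unresolved destination", "page_num": -1},
    ],
    n_pages=15,
)
```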

1.2. line_df: line granularity

The source of truth for text content. Every line of the PDF, with its position and dominant typographic style.

*one row per text line with bbox, typography, render mode, `column_position` – Image by author*

How to build it: fitz_pdf_to_line_df(pdf_path) walks every text block of every page and emits one row per line. assign_column_positions(line_df) then annotates each row with single / left / right / multi. Run on data/paper/1706.03762v7.pdf, the Attention Is All You Need paper (Vaswani et al. 2017; arXiv non-exclusive distribution license, declared on the arXiv abstract page). Here is page 4 of the paper (the two-column Figure 2 region):

*rows 1-2 are Figure 2’s twin captions in opposite columns – Image by author*

The role: line_df is the unified per-element manifest of the document. Text lines first, but the same row structure also carries image placeholders and table placeholders: each visible content element on a page is one row, with its own bbox, column_position, and a content_type flag (text, image, table). Text-specific fields (font, render_mode) are NaN for non-text rows; the rich image and table metadata lives in image_df and the table extractor’s output, joined back via (page_num, line_num). The result is that a single sorted query against line_df.page_num returns every element on a page in reading order, regardless of its type. Downstream stages don’t have to join three tables to know what’s on this page.

Watch out: on multi-GB or thousand-page PDFs, holding every line (and image) in memory at once is a problem. A lightweight mode that skips line_df and image_df for endpoints needing only parsing_summary (classification, the document-level summary) keeps those cheap; gate the full parse at ingestion time for the rest.

The screenshot below is from Enterprise Document Intelligence, the desktop app I’m building. The Text panel on the right is line_df made visible: the page’s native text, line by line, parsed once and read straight from the table, next to the original page it came from.

*line_df made visible: the page’s native text read straight from the table, beside the original page – Image by author*

1.3. page_df: page granularity

Per-page synthesis. Classification, flags, aggregated metrics.

*per-page synthesis: `page_type`, additive flags, char counts, `n_columns` – Image by author*

How to build it: build_page_df(line_df) groups line_df by page_num. detect_columns_per_page(line_df) computes n_columns and the result is merged in.

What else fits here: build_page_df is the right home for any per-page signal you can aggregate from line_df in the same pass. Beyond the core triplet, simple aggregations land here for free: n_lines (page density), native_chars versus ocr_chars (a quick scanned-or-native verdict, no classifier needed), n_fonts and font-size spread (a rough structure indicator that separates heading-heavy pages from plain prose), image_coverage_ratio (a join with image_df). The columns that need a downstream pass wait: page_type is produced by classify_page (covered in the previous article) and parsing_method / context_structured are produced by an adaptive cascade that escalates to a heavier parser when fitz is not enough.
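A hedged sketch of what such a single-pass aggregation could look like with pandas. The `is_ocr` flag and the column names `font_name` / `font_size` are assumptions for this example (the series may derive the native-versus-OCR split from render mode instead), and the rows are invented.

```python
import pandas as pd

# Hypothetical line_df slice; values are invented for illustration.
line_df = pd.DataFrame({
    "page_num":  [1, 1, 1, 2, 2],
    "text":      ["Abstract", "The dominant models", "are based on RNNs",
                  "1 Introduction", "Recurrent networks"],
    "font_name": ["Bold", "Regular", "Regular", "Bold", "Regular"],
    "font_size": [12.0, 10.0, 10.0, 12.0, 10.0],
    "is_ocr":    [False, False, False, False, True],  # assumed flag
})

def build_page_df_sketch(line_df):
    """One row per page, with a few cheap aggregations computed in one groupby."""
    df = line_df.assign(n_chars=line_df["text"].str.len())
    df = df.assign(
        native_chars=df["n_chars"].where(~df["is_ocr"], 0),  # chars from native lines
        ocr_chars=df["n_chars"].where(df["is_ocr"], 0),      # chars from OCR lines
    )
    return df.groupby("page_num").agg(
        n_lines=("text", "count"),
        native_chars=("native_chars", "sum"),
        ocr_chars=("ocr_chars", "sum"),
        n_fonts=("font_name", "nunique"),
        font_size_spread=("font_size", lambda s: s.max() - s.min()),
    ).reset_index()

page_df = build_page_df_sketch(line_df)
```

Named aggregation keeps each output column next to the rule that produces it, so adding one more per-page signal is one more line.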

Run on the Attention paper:

*cheap aggregations next to the core triplet on the Attention paper – Image by author*

The role: page_df is where extraction is anchored. Every parser, every OCR run, every classifier operates page by page; page_df is the table that records what each page is and how it should be handled. The page is also a good semantic unit in its own right: roughly one or two ideas per page in academic papers, one clause per page in contracts, one sub-topic per page in technical reports. Small enough to be focused, large enough to carry context. That’s why retrieval often defaults to page-level chunks in a minimal RAG pipeline and why most downstream coordination keys off page_num. When you ask “what’s page 5 about”, page_df is the row that answers; when you ask for “all scanned pages with bad OCR”, page_df is what you filter.

Watch out: store page_width and page_height per row, never once per document. Letter and A4 mix in technical publishing, and a landscape page is often inserted for a wide table; a single document-level page size makes every bbox-derived metric (column detection, full-page-image coverage) drift on the odd-sized pages.

1.4. image_df: image granularity

One row per embedded image.

How to build it: The parser walks every page and calls page.get_image_info(), which returns each embedded image with its displayed bounding box and intrinsic dimensions. The Attention paper has three:

*3 images: page 3 Figure 1, page 4 Figure 2’s two panels – Image by author*

Describing the image content: So far image_df only locates each image: a bounding box, a size, a content hash. It says nothing about what the image shows, and a bounding box is not retrievable. A chart or a diagram holds no extractable text, so OCR and layout-based parsers leave that part empty: to them the region is invisible. To make the figure searchable we run a vision LLM over each extracted image and store a short description alongside the row, for example “a line chart of commodity prices since 2022” or “the Transformer architecture, an encoder of N stacked layers”. That description is text, so retrieval can match it. A companion piece on vision-LLM enrichment walks through this step in full.

*Each extracted image gets a one-sentence description, which is text retrieval can match – Image by author*

1.5. object_registry: cross-reference TARGETS

A cross-reference has two sides. The target is where a named object lives in the document: the line “Figure 2: The Transformer model architecture” on page 3, the line “Table 1: BLEU scores” on page 8. The source is a body-text mention pointing at the target: “as shown in Figure 2”, “see Table 1”. object_registry captures the target side, one row per caption. The next subsection (section 1.6) captures the source side. Resolving sources to target pages, so a retrieved chunk that mentions “see Table 1” also pulls the page where Table 1 lives, is a follow-up cross-reference pass that consumes both tables.

*captions for named objects, one row per target, `(object_type, object_id)` is the join key – Image by author*

How to build it: Detection uses regex patterns ANCHORED at the start of a line (a real caption starts there, a body-text mention doesn’t); build_object_registry walks line_df, matches each line against the patterns, and keeps the first hit for every (object_type, object_id) pair. On the Attention paper:

import re
import pandas as pd

# Caption patterns, anchored at the start of the line (^).
OBJECT_PATTERNS = [
    (re.compile(r"^\s*(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"^\s*Table\s+(\d+)\b",             re.IGNORECASE), "table"),
    (re.compile(r"^\s*(?:Annex|Appendix)\s+([A-Z0-9]+)\b", re.IGNORECASE), "annex"),
]

def build_object_registry(line_df: pd.DataFrame) -> pd.DataFrame:
    """Returns one row per (object_type, object_id), first match wins."""

Run on the Attention paper, the builder lands one row per named object, with the caption line as the anchor:

*5 figures and 4 tables on the Attention paper, each with its caption anchor – Image by author*
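The "first match wins" walk can be sketched as follows. This is an assumption-laden illustration, not the series' build_object_registry: it works on a plain list of dicts rather than a DataFrame, covers figures and tables only, stores numeric ids as integers, and uses invented line numbers.

```python
import re

# Same anchored vocabulary as the caption patterns, figures and tables only.
CAPTION_PATTERNS = [
    (re.compile(r"^\s*(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"^\s*Table\s+(\d+)\b", re.IGNORECASE), "table"),
]

def build_object_registry_sketch(lines):
    """lines: iterable of dicts with page_num, line_num, text, in reading order."""
    registry = {}
    for line in lines:
        for pattern, object_type in CAPTION_PATTERNS:
            match = pattern.match(line["text"])
            if not match:
                continue
            key = (object_type, int(match.group(1)))
            if key not in registry:  # first match wins, later hits are ignored
                registry[key] = {
                    "object_type": object_type,
                    "object_id": key[1],
                    "page_num": line["page_num"],
                    "line_num": line["line_num"],
                    "caption": line["text"].strip(),
                }
    return list(registry.values())

# Illustrative lines; positions are invented.
lines = [
    {"page_num": 3, "line_num": 40, "text": "Figure 1: The Transformer - model architecture."},
    {"page_num": 4, "line_num": 12, "text": "as shown in Figure 2, attention is computed"},
    {"page_num": 8, "line_num": 5,  "text": "Table 2: The Transformer achieves better BLEU scores"},
    {"page_num": 9, "line_num": 2,  "text": "Table 2 also reports training cost"},
]
registry = build_object_registry_sketch(lines)
```

The last line shows why the rule matters: a body sentence that happens to start with "Table 2" matches the anchored pattern too, and only the earlier caption is kept.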

1.6. cross_ref_df: cross-reference SOURCES

The symmetric half of object_registry. Each row is one body-text mention of a named object: “as shown in Figure 2” on page 4, “refer to Table 1” on page 7, “see Annex B for details” on page 12. Every such mention is a source that, when resolved, jumps to a page recorded in object_registry.

Same pattern as the TOC: two methods can produce these rows, native PDF links (the deterministic source, when the document carries them) and text-pattern matching on line_df (the general fallback, what build_cross_ref_df ships). Method 1 is exact but partial. Method 2 is approximate but complete.

Method 1, native PDF links: A PDF can carry its own clickable cross-references. fitz.Page.get_links() returns one entry per link rectangle, with the target encoded as a (target_page, to.x, to.y) triple for an internal jump or a URI for an external one:

import fitz  # PyMuPDF

doc = fitz.open("data/nist/NIST.CSWP.29.pdf")
for page in doc:
    for ln in page.get_links():
        tgt_page = ln.get("page")      # target page of an internal jump, None otherwise
        tgt_pt   = ln.get("to")        # Point(x, y) on the target page
        print(page.number + 1, ln.get("kind"), tgt_page, tgt_pt, ln.get("uri"))

The interesting bit is to.y. Knowing only the target page tells you where in the document the link lands but not what it points at; the y coordinate pins the line within that page. We split the destination into two scalar columns, tgt_page and tgt_y, and resolve the target line by finding the row in line_df whose y0 is closest to tgt_y on tgt_page.

Two practical caveats here:

PDF generators differ on y orientation. LaTeX returns bottom-up, Acrobat returns top-down. The normalizer tries both and keeps the closer match.
tgt_y may sit between two lines. We round to the nearest one.
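Both caveats can be handled in one small function. The sketch below is an assumption about how such a normalizer might look, not the series' code: `resolve_landing_line` is a hypothetical name, line positions are taken to be top-down y0 values, and the 792-pt page height is just US Letter used for the example.

```python
def resolve_landing_line(page_lines, tgt_y, page_height):
    """page_lines: dicts with line_num and y0 (top-down), all on the target page.

    Tries tgt_y as top-down and as bottom-up (page_height - tgt_y) and keeps
    whichever interpretation lands closer to an actual line.
    """
    best = None  # (distance, line_num)
    for candidate_y in (tgt_y, page_height - tgt_y):
        for line in page_lines:
            distance = abs(line["y0"] - candidate_y)
            if best is None or distance < best[0]:
                best = (distance, line["line_num"])
    return None if best is None else best[1]

page_lines = [
    {"line_num": 0, "y0": 72.0},
    {"line_num": 1, "y0": 90.0},
    {"line_num": 2, "y0": 400.0},
]
# On a 792-pt page, 72 top-down and 720 bottom-up both mean "top of the page".
top_down = resolve_landing_line(page_lines, tgt_y=72.0, page_height=792.0)
bottom_up = resolve_landing_line(page_lines, tgt_y=720.0, page_height=792.0)
```

The "closer match" rule is a heuristic: a destination near mid-page can be close to a line under both readings, so a real normalizer may want to settle the orientation once per document rather than per link.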

The payoff: once we know the landing line, we can join (target_page, landing_text) against toc_df and recover the section index directly. No regex, no text matching against breadcrumbs. The native link tells us exactly which toc_idx we landed in.

*4 NIST TOC links pointing at section starts, joinable to `toc_df` – Image by author*

The same pipeline on the Attention paper turns up a different kind of link: citations that resolve to bibliography entries rather than TOC section starts.

*3 Attention-paper citations resolved to bibliography lines via `landing_text` – Image by author*

Coverage is the catch. The two demo PDFs show the same pattern:

Attention paper: 95 internal links, all citations jumping to bibliography entries, plus 18 external URIs (github, arxiv). Zero native links for body-text mentions like “as shown in Figure 2”.

NIST Cybersecurity Framework 2.0 (CSWP-29; US Government work, public domain in the US, see NIST copyright statement): 47 internal links, all TOC entries and the list of figures pointing at section starts, plus 56 external URIs. Same story: no body-text figure or table mentions are linked.

Enterprise documents are usually worse, with no native links at all (scans, screenshots, exports from tools that drop link metadata). So native links are an excellent signal when present (deterministic, resolvable to a toc_idx when the target is a section header) but never cover the full set of cross-references a document carries.

Method 2, text-pattern matching: Detection uses the same vocabulary as OBJECT_PATTERNS, but UNANCHORED so the regex matches anywhere inside a line; caption lines are excluded so the line that DEFINES Figure 2 isn’t also counted as a mention of it.

*one row per body-text mention, joinable back to `object_registry` – Image by author*

On the Attention paper:

import re
import pandas as pd

# Mention patterns: same vocabulary as OBJECT_PATTERNS, but unanchored (\b instead of ^).
REFERENCE_PATTERNS = [
    (re.compile(r"\b(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"\bTable\s+(\d+)\b",             re.IGNORECASE), "table"),
    (re.compile(r"\b(?:Annex|Appendix)\s+([A-Z0-9]+)\b", re.IGNORECASE), "annex"),
]

def build_cross_ref_df(line_df: pd.DataFrame) -> pd.DataFrame:
    """One row per body-text mention, with ~30 chars of context."""

Run on the Attention paper, every body-text mention of a figure or table lands as a row, joinable back to object_registry:

*6 of 13 mentions; Figure 2 appears three times across pages 4-5 – Image by author*
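A minimal sketch of the mention scan, with the caption exclusion made explicit. It is an illustration under assumptions rather than the series' build_cross_ref_df: it takes a list of dicts instead of a DataFrame, covers figures and tables only, and the sample lines and positions are invented.

```python
import re

MENTION = re.compile(r"\b(Figure|Fig\.?|Table)\s+(\d+)\b", re.IGNORECASE)
CAPTION = re.compile(r"^\s*(?:Figure|Fig\.?|Table)\s+\d+\b", re.IGNORECASE)

def build_cross_ref_sketch(lines, context_chars=30):
    """One row per body-text mention; caption lines are skipped entirely."""
    rows = []
    for line in lines:
        text = line["text"]
        if CAPTION.match(text):
            continue  # this line defines the object, it does not mention it
        for match in MENTION.finditer(text):
            ref_type = "table" if match.group(1).lower() == "table" else "figure"
            start = max(0, match.start() - context_chars)
            end = min(len(text), match.end() + context_chars)
            rows.append({
                "page_num": line["page_num"],
                "line_num": line["line_num"],
                "ref_type": ref_type,
                "ref_id": int(match.group(2)),
                "context": text[start:end],  # up to ~30 chars on each side
            })
    return rows

# Illustrative lines; positions are invented.
lines = [
    {"page_num": 4, "line_num": 3, "text": "Figure 2: (left) Scaled Dot-Product Attention."},
    {"page_num": 4, "line_num": 9, "text": "as shown in Figure 2, and in Table 1 we compare"},
    {"page_num": 5, "line_num": 1, "text": "see Fig. 1 for the full model"},
]
mentions = build_cross_ref_sketch(lines)
```

One line can yield several rows, which is why the scan uses `finditer` rather than a single search.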

Run on the demo PDFs, the Attention paper has 13 body-text mentions covering 6 unique objects (Figure 1, Figure 2, Table 1–4): some figures are referenced multiple times, which is exactly what the source-side table is meant to capture.

NIST CSF 2.0 has 13 mentions (7 figure references, 5 annex references, 1 table reference) covering 10 unique objects (5 figures, 4 annexes, 1 table). The mismatch with NIST’s object_registry (6 figures + 3 annexes + 2 tables) is informative:

one annex is mentioned in the body without an anchored caption in the document (the regex catches a reference whose target lives outside the parsed text)
one registered figure and one registered table are never referenced

Both are real-world signals worth surfacing to a downstream cross-reference resolver.
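Both signals fall out of two set differences on the join key. The sketch below uses a hypothetical helper name and invented keys; it only illustrates the comparison, not the NIST counts quoted above.

```python
def compare_targets_and_sources(registry_keys, mention_keys):
    """Both arguments are iterables of (type, id) pairs."""
    targets, sources = set(registry_keys), set(mention_keys)
    return {
        "dangling_mentions": sorted(sources - targets),     # mentioned, no caption found
        "unreferenced_objects": sorted(targets - sources),  # captioned, never mentioned
    }

# Invented keys for illustration.
report = compare_targets_and_sources(
    registry_keys=[("figure", "1"), ("figure", "2"), ("table", "1"), ("annex", "A")],
    mention_keys=[("figure", "1"), ("annex", "A"), ("annex", "B"), ("figure", "1")],
)
```

Duplicated mentions collapse in the set, so the report counts distinct objects, not occurrences.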

1.7. span_df: sub-line granularity (optional)

The line is sometimes too coarse. A line can mix bold and non-bold text (a defined term in a contract). A line in a research paper can include an inline equation in italic alongside prose. A line in an amendment can have the original text in black and the amendment in red.

from pydantic import BaseModel  # assumes BaseModel is pydantic's

class Span(BaseModel):
    # Identity & ordering
    pdf_hash: str
    page_num: int
    line_num: int
    span_id:  int

    # What it says, where it sits
    text: str
    bbox: tuple[float, float, float, float]

    # Typography signals
    font_name: str
    font_size: float
    is_bold:   bool
    is_italic: bool
    color_rgb: tuple[int, int, int]

span_df is more granular than line_df. On the Attention paper the ratio is 3,480 spans for 1,048 lines, about 3.3× heavier. The cost only pays off for stages that inspect typography:

Heading detection: A line in a larger font, possibly bold, is probably a heading. A TOC reconstruction pass uses this when native bookmarks are absent.
List detection: A bold span starting a paragraph is often the marker of an enumeration item.
Defined terms in contracts: Bold or italicized terms in legal documents are often defined elsewhere; capturing them at parse time enables glossary linking later.
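As an example of the first use, here is a rough heading detector over span rows. The rule and its threshold are assumptions made for this sketch (a line is flagged when all its spans are bold, or when its largest font is clearly above the median span size); a real TOC reconstructor would need more signals. The sample spans are invented.

```python
from statistics import median

def detect_heading_lines(spans, size_ratio=1.15):
    """spans: dicts with page_num, line_num, text, font_size, is_bold.

    Returns the sorted (page_num, line_num) keys of lines that look like headings.
    """
    body_size = median(s["font_size"] for s in spans)
    by_line = {}
    for s in spans:
        by_line.setdefault((s["page_num"], s["line_num"]), []).append(s)
    headings = []
    for key in sorted(by_line):
        line_spans = by_line[key]
        all_bold = all(s["is_bold"] for s in line_spans)
        big = max(s["font_size"] for s in line_spans) >= size_ratio * body_size
        if all_bold or big:
            headings.append(key)
    return headings

# Invented spans: a bold header, then a body line with one bold term inside it.
spans = [
    {"page_num": 6, "line_num": 1, "text": "3.5 Positional Encoding", "font_size": 10.0, "is_bold": True},
    {"page_num": 6, "line_num": 2, "text": "Since our model contains no ", "font_size": 10.0, "is_bold": False},
    {"page_num": 6, "line_num": 2, "text": "recurrence", "font_size": 10.0, "is_bold": True},
    {"page_num": 6, "line_num": 3, "text": "and no convolution", "font_size": 10.0, "is_bold": False},
]
headings = detect_heading_lines(spans)
```

The second line is the case line_df alone cannot express: one bold span inside regular prose is a candidate defined term, not a heading.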

How to build it: Default behaviour: parse_pdf(...) returns span_df empty. The downstream stages that need it call a dedicated builder on the same PDF:

paper = parse_pdf(paper_pdf)
paper["span_df"] = build_span_df(paper_pdf)   # 3,480 rows on the Attention paper

Keeping the spans behind an explicit call avoids paying their cost on every parse for stages that only need line_df. Run on the Attention paper:

*rows 1-5 body text, rows 6-7 the section header in bold; `is_bold` keys the TOC reconstructor – Image by author*

1.8. parsing_summary: technical synthesis

A single JSON-serializable dictionary per document. It answers at a glance: “is this PDF scanned?”, “does it need OCR?”, “what extraction strategy should the next stage use?” And one more, the semantic one downstream bricks read: “what kind of document is this and what is it about?”

The dict is organised in five zones. The first four are deterministic, built by the parser without an LLM call. The fifth, semantic, carries the document type plus a short LLM-written summary that the question parser injects into its system prompt.

{
  "pdf_hash": "abc123...",
  "n_pages": 87,
  "pdf_version": "1.7",
  "source_software": "word_export",
  "creator_raw": "Microsoft Word 2019",
  "producer_raw": "Microsoft Word for Microsoft 365",
  "content_type": "scanned_with_ocr",
  "is_scanned": true,
  "has_text_layer": true,
  "ocr_quality": "good",
  "page_type_counts": {"scanned_ocr_good": 80, "native": 5, "empty": 2},
  "scanned_page_ratio": 0.92,
  "has_toc": true,
  "n_toc_entries": 24,
  "n_named_objects": 11,
  "is_encrypted": false,
  "has_form_fields": false,
  "recommended_strategy": "use_existing_ocr",
  "needs_reocr": false,
  "pages_needing_ocr": [],
  "doc_type": "annual_report",
  "typical_fields": ["fiscal_year", "revenue", "net_income", "auditor"],
  "summary": "87-page annual report for fiscal year 2023. Covers revenue, net income, and auditor's notes across operating segments. Standard sections: Letter to Shareholders, MD&A, Financial Statements, Notes."
}

The distinction between source_software (from metadata) and content_type (inferred from content) matters. The two can diverge: a PDF whose Producer is “Microsoft Word” but whose content is 100% scanned means somebody pasted images into a Word document and exported. That’s useful information; don’t overwrite one with the other.
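Keeping both fields makes the divergence itself queryable. A minimal sketch, assuming a hypothetical set of born-digital source labels and an arbitrary 0.5 threshold on scanned_page_ratio (neither comes from the series):

```python
BORN_DIGITAL_SOURCES = {"word_export"}  # assumption: extend with other authoring tools

def provenance_flags(summary):
    """Compare what the metadata declares with what the content shows."""
    declared_digital = summary["source_software"] in BORN_DIGITAL_SOURCES
    mostly_scanned = summary["scanned_page_ratio"] >= 0.5  # threshold is an assumption
    return {
        "declared_digital": declared_digital,
        "mostly_scanned": mostly_scanned,
        "diverges": declared_digital and mostly_scanned,
    }

flags = provenance_flags({
    "source_software": "word_export",
    "content_type": "scanned_with_ocr",
    "scanned_page_ratio": 0.92,
})
```

The flag is derived on read, so neither source_software nor content_type is ever rewritten.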

The semantic zone follows the same rule on a different axis. doc_type is a coarse family (resume, contract, academic_paper, invoice, memo, annual_report, …) derived from filename + first-page text. Deterministic, no LLM. typical_fields is the per-doc_type table of field names a question about this kind of document is most likely to target; a resume gets [name, email, phone, experience, …], a contract gets [policyholder, premium, deductible, …]. summary is the only LLM-derived value in the dict: three to four factual sentences naming the document type, the main subject, and the fields it carries. One LLM call at parsing time, cached forever, injected into the question parser’s system prompt so “what’s the name?” on a CV no longer returns not found. The companion article on what to read before any line gets a number (“Beyond extract_text”) walks through the full design of that summary.

2. The relational model: how the tables link

Producing the tables is one thing; linking them is another. Once the tables exist, the keys they share turn eight separate DataFrames into one queryable model, and almost every link resolves back to line_df, the per-line source of truth.

*how the tables join: line_df at the centre, each table linked by its shared key – Image by author*

A few links carry most of the weight:

toc_df → line_df. A TOC entry knows its start_page (and start_y), so from any section you jump straight to the lines that belong to it. “Summarize section 3.5” becomes a page-range filter on line_df, no search required.
image_df ↔︎ line_df. An image occupies a position on the page, so it has a line slot in line_df. That line’s text is empty at first, since an image carries no extractable text. Optionally, a vision pass reads the image and writes a short description back into that text cell, so retrieval can match “the architecture diagram” later. The link is what makes that enrichment incremental: fill it when you need it, leave it empty when you don’t.
cross_ref_df → its target. A body-text mention resolves to wherever the target lives. “see Figure 2” resolves to object_registry on (ref_type, ref_id); “see section 2.3” resolves to a toc_df entry. The table fills in as references are matched, so resolution runs lazily, mention by mention.
page_df, span_df, object_registry anchor to line_df on page_num or (page_num, line_num), the same join every downstream brick relies on.

Concretely, common questions collapse into one or two filters:

“Summarize section 3.5.” Look up its start_page and end_page in toc_df, then line_df[line_df.page_num.between(start, end)]. No embedding, no keyword search, just the section’s lines.
“What are the totals?” on the invoice from section 3.2 → line_df[line_df.column_position == "right"]. The column the parser detected is now a query.
“What does Figure 2 show?” object_registry resolves the caption to its page and line; line_df returns the caption text; and if a vision pass has filled the image’s slot, you get the description too.
“Where is Table 1 referenced?” cross_ref_df[(cross_ref_df.ref_type == "table") & (cross_ref_df.ref_id == 1)] lists every mention with its (page_num, line_num), joined back to toc_df to name the section each sits in.

Each is a filter or a join on tables already in memory, never a re-parse.
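Two of these questions, run end to end on toy tables. The frames below are invented stand-ins with the column names used in this article; the row values are assumptions, and the merge onto line_df is one possible way to fetch the text of each mention.

```python
import pandas as pd

toc_df = pd.DataFrame({
    "toc_idx": [0, 1],
    "title": ["3.5 Positional Encoding", "4 Why Self-Attention"],
    "start_page": [5, 6],
    "end_page": [6, 7],
})
line_df = pd.DataFrame({
    "page_num": [5, 6, 6, 7],
    "line_num": [10, 0, 1, 0],
    "text": ["3.5 Positional Encoding", "see Table 1", "4 Why Self-Attention", "as in Table 1"],
})
cross_ref_df = pd.DataFrame({
    "page_num": [6, 7],
    "line_num": [0, 0],
    "ref_type": ["table", "table"],
    "ref_id": [1, 1],
})

# "Summarize section 3.5": a page-range filter, no search.
sec = toc_df.loc[toc_df["title"].str.startswith("3.5")].iloc[0]
section_lines = line_df[line_df["page_num"].between(sec["start_page"], sec["end_page"])]

# "Where is Table 1 referenced?": filter the mentions, then fetch their line text.
mentions = cross_ref_df[(cross_ref_df["ref_type"] == "table") & (cross_ref_df["ref_id"] == 1)]
mentions = mentions.merge(line_df, on=["page_num", "line_num"], how="left")
```

Because end_page overlaps the next section's start_page, the page-range filter also returns the first lines of section 4 on page 6; narrowing to the exact line needs the line-level anchors discussed in section 1.1.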

This is what the joins buy you downstream. Retrieval pulls a section from toc_df, expands it to its lines in line_df, and widens to the figures it mentions through object_registry; generation reads those lines; highlighting renders citations back onto the page by (page_num, line_num). The whole pipeline becomes a series of cheap joins on one parse, instead of re-reading the PDF at every step. How these joins become concrete SQL primary keys, foreign keys, and indexes is the storage layer’s job, beyond this article’s scope.

3. parse_pdf on two real PDFs, side by side

parse_pdf is the single entry point that calls every helper above and returns the full set of linked tables in one pass. Run it on two very different PDFs and the output structure is the same: same keys, comparable shapes.

3.1. parse_pdf side-by-side on two real PDFs

Running both calls and laying the two returned dicts side by side shows that the keys hold up, with per-cell tallies that reflect each document’s shape:

*same keys, same shape across documents, with per-cell tallies – Image by author*

A LaTeX research paper and the NIST Cybersecurity Framework 2.0 (CSWP-29, US government work, public domain). Two very different documents: one has 15 pages of math notation in a NeurIPS-style two-column layout, the other 32 pages of policy text mixing single and two-column sections. Same parse_pdf call, same keys, every column comparable. The Attention paper drops a useful surprise along the way: this arXiv version carries 22 native TOC entries, contrary to the common assumption that arXiv strips bookmarks.

The PDF is opened once with fitz, every helper consumes the same document state, and the file is closed before returning. No reopening, no redownload from S3, no inconsistency between two helpers seeing different page versions. From here, retrieval, generation, and annotation never touch the PDF again. They query the dict.

3.2. column_position in motion (an bill)

Invoices are the canonical case for column_position: line objects run down the left column (descriptions), costs and totals stack down the precise column. We decide a one-page fictional bill (knowledge/invoices/invoice_01.pdf, openly-licensed, generated for the sequence) so the format is trustworthy two-column billing as a substitute of a analysis paper’s determine caption.

*every line boxed by the column the parser assigned: blue = left (descriptions), inexperienced = proper (quantities and totals) – Picture by writer*

Look at the source page first. Each line is boxed by the column the parser gave it: blue for the left (descriptions), green for the right (amounts and totals). assign_column_positions picks that split cleanly:

*header line on the left at x0 ≈ 54, totals stack on the right at x0 ≈ 391-514, a line item splits a description on the left and quantity + price on the right at the same y0 – Image by author*

The header line sits in the left column at x0 = 54. Below the items table, the totals stack on the right: “TOTAL DUE:” at x0 ≈ 391, the amount $2,027.56 at x0 ≈ 497. The line item at y0 = 397.13 shows the split clearly: the description “Staff training” sits at x0 = 54 (left), the quantity 0.5 and unit price $197.58 sit at x0 ≈ 343 and x0 ≈ 395 (right). Downstream, asking for “the totals” becomes a one-line query against line_df: line_df[line_df["column_position"] == "right"].

No vision pass, no bbox arithmetic. Just a column filter on a structured table.
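
To make the idea concrete, here is a minimal sketch of a column assignment followed by the filter. It is not the series' assign_column_positions, whose logic is not shown here: `assign_column` is a hypothetical helper that simply compares x0 to a vertical split, and the default split of 306 points (the midpoint of a 612-point-wide page) is an assumption. Plain dicts stand in for line_df rows, using the x0 values quoted above.

```python
def assign_column(x0, split_x=306.0):
    """Label a line 'left' or 'right' by comparing its x0 to a vertical split.

    split_x is an assumed page midpoint, not a value taken from the parser.
    """
    return "left" if x0 < split_x else "right"


# x0 values quoted in the text; each dict stands in for one line_df row.
lines = [
    {"text": "Staff training", "x0": 54},
    {"text": "0.5", "x0": 343},
    {"text": "$197.58", "x0": 395},
    {"text": "TOTAL DUE:", "x0": 391},
    {"text": "$2,027.56", "x0": 497},
]
for line in lines:
    line["column_position"] = assign_column(line["x0"])

# Same idea as the pandas filter on column_position, written for plain dicts.
right_column = [line["text"] for line in lines if line["column_position"] == "right"]
print(right_column)
```

A fixed split is the simplest possible rule. A real parser would more plausibly derive the split from the distribution of x0 values on each page, since column boundaries differ from one layout to the next.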

3.3. Two PDFs, same parser, same shape

Two very different documents, the same parser, directly comparable structured outputs:

*same parser, two PDFs; lines, columns, TOC entries, named objects all queryable – Image by author*

What this would have looked like with a naive get_text() parser: a string per document, no way to tell which lines were OCR’d and which were native, no idea where each figure caption sits, no separation between left and right halves of a two-column page. The retrieval and generation stages would have been built on sand.

4. Save once, reload forever

Parsing is the most expensive brick in the pipeline. Question parsing, retrieval, and generation each cost one LLM call; parsing reads bytes and resolves layout. With PyMuPDF it stays cheap (sub-second on a small paper). With heavier engines (Azure Layout, Tesseract, vision-LLM fallback), the same PDF can take 30 seconds to several minutes per run. Three iterations on a downstream prompt is three OCR runs. No reason for that.

The fix is path-driven. Each PDF writes its parsed tables to a mirror folder under the output directory, matching the source path exactly. From the PDF path alone, every downstream step (retrieval, generation, annotation) knows where the cache lives.
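
The path mirroring can be sketched with pathlib alone. `cache_dir_for` is a hypothetical helper name, the `data` and `output` root names are assumptions, and so is the choice to drop the `.pdf` suffix from the folder name; the series' own layout may differ on any of these.

```python
from pathlib import PurePosixPath


def cache_dir_for(pdf_path, data_root="data", output_root="output"):
    """Mirror <data_root>/<sub/path>/<name>.pdf to <output_root>/<sub/path>/<name>."""
    relative = PurePosixPath(pdf_path).relative_to(data_root)
    # Dropping the suffix is an assumption about how the twin folder is named.
    return PurePosixPath(output_root) / relative.with_suffix("")


print(cache_dir_for("data/invoices/invoice_01.pdf"))  # output/invoices/invoice_01
```

Because the cache location is a pure function of the PDF path, no registry or database lookup is needed to find a parse: any step holding the source path can recompute the folder.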

*every PDF in `data/` has a twin folder in `output/` carrying its parsed tables – Image by author*

The relational tables go to .xlsx (one file per table, opens with a double-click), parsing_summary to JSON. Excel is enough at this stage: pandas round-trips cleanly, and each table stays inspectable in any spreadsheet tool. A production storage layer swaps in SQLite (foreign keys, joins across documents, append-on-update), but the downstream bricks consume DataFrames either way.
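
As a rough sketch of what the SQLite variant could look like, the snippet below uses Python's built-in sqlite3 module. The schema is an assumption made for illustration: the `doc_id` column, the table names `page` and `line`, and the second document `invoice_02` are hypothetical, while page_num, line_num, and column_position follow the names used in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript(
    """
    CREATE TABLE page (
        doc_id   TEXT    NOT NULL,
        page_num INTEGER NOT NULL,
        PRIMARY KEY (doc_id, page_num)
    );
    CREATE TABLE line (
        doc_id          TEXT    NOT NULL,
        page_num        INTEGER NOT NULL,
        line_num        INTEGER NOT NULL,
        text            TEXT    NOT NULL,
        column_position TEXT,
        PRIMARY KEY (doc_id, page_num, line_num),
        FOREIGN KEY (doc_id, page_num) REFERENCES page (doc_id, page_num)
    );
    """
)

# Toy rows: invoice_01 reuses values quoted earlier, invoice_02 is invented.
conn.executemany("INSERT INTO page VALUES (?, ?)", [("invoice_01", 1), ("invoice_02", 1)])
conn.executemany(
    "INSERT INTO line VALUES (?, ?, ?, ?, ?)",
    [
        ("invoice_01", 1, 0, "Staff training", "left"),
        ("invoice_01", 1, 1, "TOTAL DUE:", "right"),
        ("invoice_01", 1, 2, "$2,027.56", "right"),
        ("invoice_02", 1, 0, "TOTAL DUE:", "right"),
    ],
)

# One query spans every document in the store.
right_lines = conn.execute(
    """
    SELECT line.doc_id, line.text
    FROM line
    JOIN page ON page.doc_id = line.doc_id AND page.page_num = line.page_num
    WHERE line.column_position = 'right'
    ORDER BY line.doc_id, line.line_num
    """
).fetchall()

# A line that points at a page that does not exist is rejected.
try:
    conn.execute("INSERT INTO line VALUES ('missing_doc', 1, 0, 'orphan', 'left')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The foreign key is what a folder of spreadsheets cannot give: a line row cannot reference a page that was never stored, and a single query can cover the whole corpus.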

save_parsed writes the folder; load_parsed returns the same dict, or None if the cache is missing. The calling pattern is a few lines:


# Reuse the cached parse when it exists; parse and cache only on a miss.
parsed = load_parsed(pdf_path)
if parsed is None:
    parsed = parse_pdf(pdf_path)
    save_parsed(pdf_path, parsed)

The downstream bricks follow suit. Question parsing writes its ParsedQuestion to questions//parsed_question.json, retrieval saves retrieved_pages.xlsx, generation saves answer.json. Every step is fully recoverable from disk, and every step can be replayed without touching the LLM again. When you tweak a generation prompt, you’re not paying for parsing or retrieval to re-run.
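
The load-or-parse pattern can be sketched end to end with the standard library. This is a stand-in, not the series' save_parsed / load_parsed: the helpers below are hypothetical, take a cache folder rather than a PDF path, and write every table as JSON instead of .xlsx so the example runs without pandas. `fake_parse` replaces parse_pdf and counts how often it is called.

```python
import json
import tempfile
from pathlib import Path


def save_parsed_json(cache_dir, parsed):
    """Write one JSON file per entry of the parsed dict."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for name, table in parsed.items():
        (cache_dir / f"{name}.json").write_text(json.dumps(table))


def load_parsed_json(cache_dir):
    """Return the cached dict, or None when the cache folder is missing."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return None
    return {p.stem: json.loads(p.read_text()) for p in sorted(cache_dir.glob("*.json"))}


def get_parsed(cache_dir, parse_fn):
    """Load from cache when possible; otherwise parse once and save."""
    parsed = load_parsed_json(cache_dir)
    if parsed is None:
        parsed = parse_fn()
        save_parsed_json(cache_dir, parsed)
    return parsed


calls = []


def fake_parse():
    """Stand-in for an expensive parse; records each invocation."""
    calls.append(1)
    return {
        "line_df": [{"page_num": 1, "line_num": 0, "text": "TOTAL DUE:"}],
        "parsing_summary": {"n_pages": 1},
    }


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "invoices" / "invoice_01"
    first = get_parsed(cache, fake_parse)   # cache miss: parses and saves
    second = get_parsed(cache, fake_parse)  # cache hit: reads from disk

print(len(calls), first == second)
```

The second call never reaches the parser, which is the property that keeps prompt iteration cheap. A real implementation would also need an invalidation rule (for example, comparing file modification times) when the source PDF changes; that is left out of this sketch.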

5. Conclusion

A good RAG parser doesn’t extract text. It turns an unstructured PDF into a relational model of the document: a set of linked tables, joined by shared identifiers (page_num, line_num, (ref_type, ref_id)), each carrying one entity. Retrieval, generation, and annotation never re-read the PDF afterwards; they query DataFrames. Saving the parse once and reloading it forever turns a 30-second-per-question latency into a per-corpus one-shot cost.

A relational set of tables, one PDF in, no flat string out. Every downstream tool the team wires onto the parser (keyword search, embedding similarity, section retrieval, citation rendering, audit log, change tracking) reads from these tables rather than from the original bytes. The PDF is opened once, at ingest. After that, everything is SQL or pandas. That property is what makes the parsing brick worth the engineering investment: the cost is paid once per document, and every iteration on the rest of the pipeline runs against a stable, queryable artefact.

This article is part of the Enterprise Document Intelligence series. The minimal RAG pipeline shows the relational tables in use end-to-end on a real PDF.

6. Sources and further reading

Earlier in the series:

The parser this article describes follows the same architecture as Docling (Auer et al., Docling Technical Report, IBM Research 2024): layout detection, TableFormer, reading-order. Borderless table extraction uses the model from Smock et al. (PubTables-1M / Table Transformer, CVPR 2022). The page-class taxonomy is built on the same baseline as Pfitzmann et al. (DocLayNet, KDD 2022). The article adds a render-mode detection pass (native / scanned / mixed) with OCR-quality scoring on top. The parser produces a relational set of tables (line_df, page_df, image_df, toc_df, object_registry, cross_ref_df, span_df, plus a parsing_summary dict); retrieval, generation, and annotation downstream don’t read the PDF again, they query DataFrames.

Same direction as the article:

Auer et al., Docling Technical Report, IBM Research 2024 (arXiv:2408.09869). Reference architecture for the pipeline this article describes: layout detection, TableFormer, reading-order, unified document representation.
Smock, Pesala, Abraham, PubTables-1M / Table Transformer (TATR), CVPR 2022 (arXiv:2110.00061). Vision-based table detection and structure recognition; the model behind most modern table parsers.
Pfitzmann et al., DocLayNet, KDD 2022 (arXiv:2206.01062). Empirical baseline for the page-class taxonomy and layout detection benchmarks.
Lo et al., PaperMage, EMNLP 2023 demos. Maps to the indexing-vs-reading split (parsing for retrieval is not parsing for answer generation).

Different angle, different context:

Faysse et al., ColPali: Efficient Document Retrieval with Vision Language Models, 2024 (arXiv:2407.01449). Vision-language retrieval on the page image. The context is retrieval where the page image is the artefact, with no parsing-into-tables step. This article uses bounding-box-anchored DataFrames as the foundation instead.
Wang et al., DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding, JPMorgan 2024 (arXiv:2401.00908). Layout-aware LLM that reads the PDF directly without an explicit relational parsing brick. Same family of approach as ColPali; different from this article’s queryable relational artefact.
Kim et al., OCR-free Document Understanding Transformer (Donut), ECCV 2022 (arXiv:2111.15664). End-to-end OCR-free document understanding; useful contrast with the OCR-quality-scoring pass this article adds on top of the render-mode detection.