Beyond extract_text: The Two Layers of a PDF That Drive RAG Quality

This article is the parsing brick of Enterprise Document Intelligence, a series that builds an enterprise RAG system from four bricks: parsing, question parsing, retrieval, and generation. Parsing comes first, and this is the first of its two parts. This part covers the first of parsing’s two layers: identifying the nature of the document (born-digital vs scanned, source software, declared metadata, native TOC) plus a short LLM-written summary of what it is. The next part turns the content into a relational set of tables.

*Where this article sits in the series: Article 5, the nature-and-signals half of the parsing brick, inside Part II (the four bricks) – Image by author*

In a RAG process, the parser has one job: read the document the way a human would before answering a question about it.

What is this thing? A CV, an insurance contract, a regulatory text, an academic paper? How many pages? Born digital or scanned, or stitched together from both? What does it carry: paragraphs, tables, multi-column layout, embedded images? In what language?

Skip any one of those checks and you get a failure the rest of the pipeline can’t recover from:

A CV exported from a designer template. The candidate’s name sits in a logo image at the top of page 1, the rest of the page in clean text. Asked “what’s the name?”, retrieval finds nothing matching and falls back to the PDF metadata’s author field, which is whoever last edited the file. The answer is wrong before generation ever runs.
An insurance contract with has_text_layer=True. The text layer is OCR output at quality 0.3. “Renewal price EUR 250” comes through as “Renewa1 price EUR 25O”. Keyword retrieval never matches; generation reads a different number nearby and commits to it.
A 200-page regulatory text with no TOC and no headings the parser can detect. The pipeline treats it as one homogeneous blob. The question parser has no idea page 4 holds the definitions and page 187 holds the exclusions.
An academic paper with two-column layout. Naive text extraction interleaves the left and right column line by line. The retrieved chunk reads as gibberish.

Same shape every time. An expert was asked a question about a document they’d never opened. They guessed. The pipeline did the same.

Parsing has two layers. This article (5_A) covers the first: identifying the nature of the document (born-digital vs scanned, source software, declared metadata, native TOC if any) and a short LLM-written summary (page count, plus three or four sentences naming the document type, the main subject, the fields it carries). The next article (5_B) covers the second: identifying the content precisely through a relational base where every line, span, image, and TOC entry becomes one row keyed by page and position.

The article uses PyMuPDF (also imported as fitz), a free Python library that reads PDF bytes directly. No external tools, no API key. Fast enough to run at ingest time and accurate on born-digital PDFs. The same parse_pdf contract can be re-implemented by heavier engines (Azure Layout, Docling, Camelot, vision-LLM fallback). When a page demands more depth than fitz can give, an adaptive cascade dispatches across them. That escalation is a follow-up topic, beyond this article’s scope.

1. Document-level signals

A PDF gives you two kinds of information. Document-level signals: metadata, native bookmarks, declared properties. Page-level content: what each page holds. The parser reads them in that order, and trusts content when the two disagree.

*Metadata read once at document level, content walked page by page – Image by author*

Metadata is a handful of fields the PDF hands over in milliseconds. Producer, Creator, native bookmarks, encryption flag. Document-level, no walking pages. You read them at the very start of every parse to make a routing call. Word export → direct extraction. Kofax scan → OCR pipeline. Anything ambiguous → the slower content pass.
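As a rough illustration of that routing call, the sketch below maps a source label to a first-pass strategy. The label strings mirror the ones returned by the detector shown in section 1.1; the strategy names and the `route_from_metadata` helper are hypothetical, chosen only for this example.

```python
# Hypothetical strategy names; only the source labels come from the article.
DIRECT_SOURCES = {
    "word_export", "powerpoint_export", "libreoffice_export",
    "latex_export", "pandoc_export",
}
OCR_SOURCES = {"scanner_software"}


def route_from_metadata(source_software: str) -> str:
    """Pick a first-pass strategy from the inferred source label."""
    if source_software in DIRECT_SOURCES:
        return "direct_extraction"
    if source_software in OCR_SOURCES:
        return "ocr_pipeline"
    # Ghostscript, browser print, design tools, unknown producers:
    # metadata is not decisive, so fall back to the slower content pass.
    return "content_pass"


for label in ("word_export", "scanner_software", "ghostscript"):
    print(label, "->", route_from_metadata(label))
```

Defaulting to the content pass keeps the rule conservative: a label the table does not know never skips the page-by-page checks.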

Metadata sometimes lies. Ghostscript and qpdf overwrite the upstream Producer field when they recompress, so a Word PDF re-distilled twice will claim to be Ghostscript and tell you nothing about the true origin. The helper exposes both the inferred label and the raw creator_raw / producer_raw strings so downstream rules can argue back.

1.1. Source software

A PDF almost always advertises its origin through the Creator and Producer fields. That single signal tells us how hard the rest of the parsing will be, and lets us route to the right strategy before opening any page.

Producers fall into roughly five buckets, ordered from easiest to hardest to parse. “Vector tables” below means tables drawn as native lines + text (the cells survive as data); the opposite is a table flattened into a single image (only OCR can recover the cells).

Office authoring tools (easiest). Microsoft Word, PowerPoint, LibreOffice (Writer / Impress), OpenOffice, Google Docs and Slides exports, Apple Pages and Keynote. They preserve logical structure (headings, lists, paragraphs) with native vector fonts. Direct text extraction works well, reading order is reliable, tables are usually vector tables. The bulk of “office documents” you’ll see in an enterprise corpus.
Document processors. LaTeX engines (pdfTeX, XeTeX, LuaTeX), Pandoc, Quarto, R Markdown, ReportLab, WeasyPrint. Excellent text fidelity but with their own quirks: hyphenation breaks words across lines, math is rendered as vector paths or images (not extractable text), references and citations have unusual spacing. Tables are vector tables most of the time.
Design and publishing tools. Adobe InDesign, Illustrator, QuarkXPress, Affinity Publisher. Multi-column flow with messy reading order. PyMuPDF often gets reading order wrong on dense layouts. Tables can be drawn as vector graphics rather than vector tables. Captions, sidebars, and decorative elements complicate parsing. Expect to escalate to a layout-aware parser on dense layouts.
Print pipelines and recompressors. Browser print (Chrome, Safari, Firefox), OS print-to-PDF dialogs, Ghostscript, qpdf, distiller-class tools. Mixed quality. Browser-printed PDFs preserve text but lose links and bookmarks. Ghostscript and qpdf often pass the content through but overwrite the upstream Producer field, so the original signal is gone. That’s why the helper exposes both creator_raw and producer_raw.
Scanner software and capture apps (hardest). Kofax, ABBYY, Adobe Scan, ScanSnap, CamScanner, fax pipelines. Pure image, no native text. OCR mandatory. CamScanner-class apps add image-quality issues (skew, low resolution, JPEG artefacts) on top.

import fitz  # PyMuPDF


def detect_source_software(doc: fitz.Document) -> str:
    """Classify the producing software using Creator/Producer metadata."""
    meta = doc.metadata or {}
    # Lowercase both fields once so every check below is case-insensitive.
    combined = f"{(meta.get('creator') or '').lower()} {(meta.get('producer') or '').lower()}"

    # Bucket 1: office authoring tools
    if "microsoft" in combined and "word" in combined: return "word_export"
    if "pdfmaker" in combined and "word" in combined: return "word_export"
    if "powerpoint" in combined: return "powerpoint_export"
    if any(s in combined for s in ("libreoffice", "openoffice")): return "libreoffice_export"

    # Bucket 2: document processors
    if any(s in combined for s in ("pdftex", "xetex", "luatex")): return "latex_export"
    if "pandoc" in combined: return "pandoc_export"

    # Bucket 3: design and publishing tools
    if "indesign" in combined: return "indesign_export"

    # Bucket 4: print pipelines and recompressors
    if "ghostscript" in combined: return "ghostscript"
    if any(s in combined for s in ("chrome", "safari", "firefox")): return "browser_print"

    # Bucket 5: scanner software (OCR mandatory)
    if any(s in combined for s in ("kofax", "abbyy", "adobe scan", "scansnap", "camscanner")):
        return "scanner_software"

    return "unknown_source"

The detection is imperfect: a Word PDF re-distilled through Ghostscript will have its Producer overwritten, and unusual producers fall into unknown_source. On a mixed corpus of papers, scanned contracts, browser-printed reports, and Office exports, roughly nine PDFs out of ten land in the right bucket on first read. Enough to drive routing. We expose both the inferred source_software label and the raw creator_raw / producer_raw strings so downstream rules can compensate when needed.

Two demo PDFs carry the rest of this article: the Attention Is All You Need paper (Vaswani et al. 2017; arXiv non-exclusive distribution license, declared on the arXiv abstract page) and the NIST Cybersecurity Framework 2.0 (CSWP-29; US Government work, public domain in the US, see NIST copyright statement). The detector lands them in two different buckets:

*Attention paper lands in the LaTeX bucket, NIST CSF in the Word bucket – Image by author*

1.2. Native table of contents

PyMuPDF exposes the document’s outline through doc.get_toc(), which returns a list of [level, title, page] triples. build_toc_df wraps that and adds parent_idx and breadcrumb so the hierarchy is queryable.

Run it on the Attention Is All You Need paper and you get a real, three-level structure:

*22 entries across 3 levels, with `breadcrumb` and computed `end_page` – Image by author*

When the document has a TOC, we treat it as declared structure: the document telling us how it organizes itself.

When the document has no TOC (most scans and quick exports), get_toc() returns an empty list. Reconstructing a TOC from typography signals (large bold lines, numbering patterns) is a separate problem, outside this article’s scope.
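To make the parent_idx and breadcrumb idea concrete, here is a minimal sketch that derives both from [level, title, page] triples. It is not the article’s build_toc_df: the function name `toc_rows`, the plain-dict row format, and the " > " separator are assumptions for illustration, and the sample outline is hand-written so the code runs without a PDF.

```python
def toc_rows(toc: list[list]) -> list[dict]:
    """Turn [level, title, page] triples into rows with parent_idx and breadcrumb."""
    rows: list[dict] = []
    last_at_level: dict[int, int] = {}  # level -> index of the latest entry at that level
    for idx, (level, title, page) in enumerate(toc):
        parent_idx = last_at_level.get(level - 1)  # None for top-level entries
        trail = (rows[parent_idx]["breadcrumb"] + " > ") if parent_idx is not None else ""
        rows.append({
            "level": level,
            "title": title,
            "page": page,
            "parent_idx": parent_idx,
            "breadcrumb": trail + title,
        })
        last_at_level[level] = idx
        # Entries deeper than this one cannot be parents of what follows.
        for deeper in [lv for lv in last_at_level if lv > level]:
            del last_at_level[deeper]
    return rows


# Illustrative outline, not taken from a real document.
sample = [[1, "Methods", 2], [2, "Data", 3], [3, "Sampling", 3], [1, "Results", 5]]
for row in toc_rows(sample):
    print(row["parent_idx"], row["breadcrumb"])
```

An empty list from get_toc() simply yields an empty list of rows, so the no-TOC case needs no special handling.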

1.3. Other declared properties

Encryption (doc.is_encrypted, doc.needs_pass), form fields (doc.is_form_pdf), digital signatures, creation and modification dates. All cheap to read. Some matter for parsing routing (encrypted PDFs need handling); most matter at the corpus level (versioning, audit, access control) and are covered in the corpus articles (Articles 15-20).
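A small sketch of how these properties could be gathered in one pass. The helper name `declared_properties` and the output key names are hypothetical; the attributes it reads (is_encrypted, needs_pass, is_form_pdf, and the creationDate / modDate metadata keys) are the ones PyMuPDF exposes on an open document. Digital signatures are left out of this sketch.

```python
def declared_properties(doc) -> dict:
    """Collect cheap document-level properties from an open PyMuPDF document."""
    meta = doc.metadata or {}
    return {
        "is_encrypted": bool(doc.is_encrypted),
        "needs_password": bool(doc.needs_pass),
        # is_form_pdf is falsy for plain PDFs and truthy when form fields exist.
        "is_form_pdf": bool(doc.is_form_pdf),
        # Dates are kept as the raw PDF strings; parsing them is left to the caller.
        "creation_date_raw": meta.get("creationDate") or None,
        "modification_date_raw": meta.get("modDate") or None,
    }
```

With PyMuPDF installed, `declared_properties(fitz.open(path))` would return the dict; a routing rule can then stop early when needs_password is true.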

2. What each page holds

Once metadata is read, we walk the pages. Content is the ground truth: when a PDF claims to be a Word export but every page is a scan that someone pasted into Word and re-exported, only content catches it. Metadata says one thing, the bounding boxes say another. We believe the bounding boxes.
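The “content wins” rule can be sketched as a small reconciliation step. The page-type labels are the ones returned by the classifier in section 2.5; the `reconcile` helper, its output keys, and the 50% cut-off are assumptions made for this example only.

```python
SCAN_TYPES = {"scanned", "scanned_ocr_good", "scanned_ocr_bad"}
# Labels that do not claim a native export (assumption for this sketch).
NON_NATIVE_LABELS = {"scanner_software", "unknown_source"}


def reconcile(source_software: str, page_types: list[str]) -> dict:
    """Compare the metadata label with per-page evidence; content wins."""
    n_pages = len(page_types)
    n_scanned = sum(1 for t in page_types if t in SCAN_TYPES)
    scanned_share = n_scanned / n_pages if n_pages else 0.0
    claims_native = source_software not in NON_NATIVE_LABELS
    return {
        "source_software": source_software,
        "scanned_share": scanned_share,
        # Metadata claims a native export, yet most pages are scans.
        "metadata_contradicted": claims_native and scanned_share > 0.5,
    }


print(reconcile("word_export", ["scanned", "scanned", "scanned_ocr_bad"]))
```

A document flagged this way would be routed by its page types, with the metadata label kept only for audit.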

For each page we extract content elements in priority order.

*Four extractors plus a classifier that composes them – Image by author*

2.1. Text and the render mode

Text is the most important deliverable. The natural unit is the line: a line carries a string along with its bounding box (the rectangle that encloses it on the page), dominant typography (font, size, bold, italic, color), and a critical flag, the render mode: a PDF-level code that tells us whether the text was written natively or placed invisibly by an OCR layer over a scanned image.

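raw = page.get_text("rawdict")
native_chars = 0
ocr_chars = 0
for block in raw["blocks"]:
    if block["type"] != 0:  # 0 = text block; image blocks are skipped
        continue
    for line in block["lines"]:
        for span in line["spans"]:
            # "rawdict" spans list characters under "chars"; "dict" spans carry "text".
            n_chars = len(span["text"]) if "text" in span else len(span.get("chars", ()))
            # Spans without a render_mode key are counted as native text.
            if span.get("render_mode", 0) == 3:
                ocr_chars += n_chars
            else:
                native_chars += n_chars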

Render mode 3 means the text is drawn invisibly: a layer that OCR software adds to the page image so the scan becomes searchable. The text is there, but only as hidden characters. Distinguishing render mode 3 from native text matters: it’s the only reliable way to know whether a scanned page already has a usable searchable layer or needs to be re-OCR’d.

Going further: When typography varies within a single line (a bold word in the middle of a sentence, a colored heading, a rotated label), capturing it requires going down to the span level. We introduce span-level extraction in section 3 of this article, because some downstream stages (heading detection, list aggregation across long answers) need it.

2.2. Images and full-page coverage

Images come second because they often contain text or critical visual information that the RAG pipeline would otherwise lose. Logos identify the issuing party. Schematics describe systems. Photos document evidence. Tables exported as images carry data.

For each embedded image we record its displayed bounding box (in PDF points), its intrinsic dimensions (in pixels), and a content hash for deduplication. The image is also extracted and persisted (S3 or local storage) so downstream stages can process it.

A common pitfall: page.get_images() returns the intrinsic dimensions of each image, not the area displayed on the page. To compute true coverage, use page.get_image_info(), which returns the bounding box in PDF points as rendered.

page_area = page.rect.width * page.rect.height
max_coverage = 0.0
for info in page.get_image_info():
    bbox = info["bbox"]  # displayed rectangle (x0, y0, x1, y1), in PDF points
    img_area = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    max_coverage = max(max_coverage, img_area / page_area)
has_full_page_image = max_coverage >= 0.95

A page where one image covers ≥ 95% of the surface is, with very high probability, a scanned page. The 95% threshold is empirical: it tolerates the small margins a scanner adds around the page edge without catching pages that legitimately use a large hero image inside a layout. Values from 90% to 99% all work in practice; tighten the threshold if your corpus has many full-bleed cover pages, loosen it if scanners crop tightly.

2.3. Vector tables

Tables don’t chunk like running text. Naive linearization destroys cell semantics. The parsing stage flags their presence and location; the actual structured extraction happens in a follow-up adaptive-parsing pass that escalates to a layout-aware engine when fitz’s row guess looks unreliable.

PyMuPDF (since 1.23) detects vector tables, those built from drawn lines combined with native text, via page.find_tables(). The call returns a TableFinder object whose .tables list has one entry per detected table: tables = page.find_tables(); n_tables = len(tables.tables).

For scanned tables rendered as images, find_tables() won’t fire. Detection in that case requires visual tools (Camelot, Docling, PaddleStructure), beyond this article’s scope.
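A minimal sketch of turning that call into a per-page count, with a guard for installations that predate find_tables(). The helper name `count_vector_tables` is hypothetical; it expects a PyMuPDF page object and relies only on the find_tables() / .tables pair described above.

```python
def count_vector_tables(page) -> int:
    """Number of vector tables detected on a page; 0 when find_tables() is unavailable."""
    find_tables = getattr(page, "find_tables", None)
    if find_tables is None:  # PyMuPDF older than 1.23 has no table finder
        return 0
    return len(find_tables().tables)
```

A page-level flag then follows directly: has_vector_table = count_vector_tables(page) > 0.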

2.4. Columns: left, right, single, multi

Column detection is hard. Two-column layouts break the naive reading order: a research paper parsed without column awareness returns line 1 of column 1, then line 1 of column 2, then line 2 of column 1, and so on, splicing sentences from different columns into noise. Three or more columns make every cheap heuristic shaky.

The pragmatic move is to annotate each line with where it sits horizontally rather than try to recover a perfect reading order. We add a column_position field to line_df with four values:

single: the page has one column.
left / right: the page has two columns; the line falls in one or the other.
multi: the page has three or more columns; we flag it instead of guessing.

The detection clusters the left edge of each line along the x-axis: in line_df that edge is bbox_x0, the x coordinate where the line starts. A page where every line starts in roughly the same horizontal band is single-column. A page with two clear bands is two-column, and we split lines by which band they fall into. The Attention paper’s page 4 will give us the real numbers in section 4.2: x0 ≈ 148 for the left column, x0 ≈ 364 for the right one.

import pandas as pd


def assign_column_positions(
    line_df: pd.DataFrame,
    gap_threshold: float = 80.0,
    min_cluster_fraction: float = 0.10,
) -> pd.DataFrame:
    """Add a `column_position` field: single / left / right / multi."""
    out = line_df.copy()
    out["column_position"] = "single"
    for _, sub in line_df.groupby("page_num"):
        x0_values = sub["x0"].tolist()
        if not x0_values:
            continue
        clusters = _cluster_x0(x0_values, gap_threshold)
        sig = _significant_clusters(clusters, len(x0_values), min_cluster_fraction)
        n_cols = max(1, len(sig))
        if n_cols == 1:
            continue  # already labelled "single"
        if n_cols == 2:
            # Split at the midpoint between the two cluster centers.
            c1_center = sum(sig[0]) / len(sig[0])
            c2_center = sum(sig[1]) / len(sig[1])
            split = (c1_center + c2_center) / 2
            left_idx = sub.index[sub["x0"] < split]
            right_idx = sub.index[sub["x0"] >= split]
            out.loc[left_idx, "column_position"] = "left"
            out.loc[right_idx, "column_position"] = "right"
        else:
            out.loc[sub.index, "column_position"] = "multi"
    return out

The gap_threshold defaults to 80 PDF points (1 PDF point = 1/72 inch, so 80 ≈ 2.8 cm). That’s the typical width of the gutter between columns in a NeurIPS-style paper or a two-column policy document. Anything narrower is more likely a paragraph indent than a column break.

Why bother with left / right at all? The use case that earns this field its place is structured data on the page where position is the schema. On invoices, the issuer’s address sits top-left, the customer’s address sits top-right (or the inverse, depending on the template). Asking the retriever to pull “the customer block” from the right half of page 1 is far more natural than asking for an exact bbox. The same pattern shows up on forms, statements, and contracts with a header block. Once the field is in line_df, downstream stages can filter by column_position == "right" like any other table query.

The user can also point at it directly. Operators familiar with their documents will say “the answer is in the left column” or “the policy number sits on the right”. That sentence is a query against column_position, not a vision task.

Two columns is where this label earns its keep. With three or more columns, “left vs right” loses meaning and we mark the page multi rather than guess. Newspapers, dense reference manuals, and pages with side margins are the cases to watch. When column_position == "multi" shows up on a page that matters, that’s a signal to escalate to a layout-aware parser.

A frequent failure mode of “minimal” RAG pipelines lives right here. The author tests on a Word document (column_position == "single" everywhere), retrieval works, then a customer drops a two-column annual report and the system starts returning sentences cut in half. The bug looks like a generation problem (“the model can’t read”); the cause is a parsing problem (the lines were never in the right order to begin with).

2.5. Page classification

With the per-page signals collected, every page receives a primary type (mutually exclusive) and additive flags (independent booleans).

The primary types:

*Eight mutually exclusive page types, each with its routing decision – Image by author*

The additive flags describe what the page contains, independently of the type:

has_text / has_native_text / has_ocr_layer: any text present; any native (non-OCR) text; any invisible OCR layer.
has_image / has_full_page_image: any embedded image; one image covering ≥ 95% of the page.
has_vector_table: at least one table detected via page.find_tables() (lines + native text, not flattened to an image).
has_vector_graphics: the page contains drawn paths that are NOT a vector table (charts, schematics, decorative shapes, mathematical figures). Worth flagging because these are PDF content the text extractor sees as nothing.

Separating type from flags lets us cross criteria: “all pages with a vector table” regardless of type, “mixed pages that also contain a table”, and so on.
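A sketch of how the text, image, and table flags could be derived from the counters collected earlier, and how type and flags combine in a query. `PageRecord` and `additive_flags` are hypothetical names and the sample values are invented; only the flag names and the 95% threshold come from the article. has_vector_graphics is left out because it needs the page’s drawing paths.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    page_num: int
    page_type: str
    native_chars: int
    ocr_chars: int
    n_images: int
    max_image_coverage: float
    n_vector_tables: int


def additive_flags(p: PageRecord) -> dict[str, bool]:
    """Independent booleans describing what the page contains."""
    return {
        "has_text": (p.native_chars + p.ocr_chars) > 0,
        "has_native_text": p.native_chars > 0,
        "has_ocr_layer": p.ocr_chars > 0,
        "has_image": p.n_images > 0,
        "has_full_page_image": p.max_image_coverage >= 0.95,
        "has_vector_table": p.n_vector_tables > 0,
    }


pages = [
    PageRecord(1, "native", 2400, 0, 0, 0.0, 1),
    PageRecord(2, "native_with_image", 1800, 0, 2, 0.30, 1),
    PageRecord(3, "scanned", 0, 0, 1, 1.0, 0),
]
# Cross criteria: one primary type AND one flag.
hits = [
    p.page_num for p in pages
    if p.page_type == "native_with_image" and additive_flags(p)["has_vector_table"]
]
print(hits)  # prints: [2]
```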

The classifier consumes a PageFeatures object: the subset of per-page signals it needs to decide. The text_quality_score in that object is a 0–1 ratio: 0 means the page text is garbled (high proportion of unrecognised characters), 1 means clean native text. An adaptive cascade builds it in full from the raw signals; here it is just one input to the classifier:

from dataclasses import dataclass


@dataclass
class PageFeatures:
    char_count: int
    n_fonts: int
    n_images: int
    has_full_page_image: bool
    native_chars: int
    ocr_chars: int
    text_quality_score: float

Three views of “page-level fields” live in this article. The schema (the page_df diagram below in section 3.3) lists every field the data model targets. PageFeatures is the subset classify_page reads. The current page_df sample is the core triplet the package builds today: additive flags from the schema land progressively as downstream stages ask for them.

The classification logic itself is short:

def classify_page(features: PageFeatures) -> str:
    # Almost no text and no image: nothing to parse.
    if features.char_count < 10 and features.n_images == 0:
        return "empty"

    # No declared fonts plus a full-page image: a pure scan without a text layer.
    if features.n_fonts == 0 and features.has_full_page_image:
        return "scanned"

    # Full-page image whose text is mostly an invisible OCR layer.
    if (features.has_full_page_image
        and features.ocr_chars > features.native_chars
        and features.ocr_chars > 50):
        return "scanned_ocr_good" if features.text_quality_score >= 0.7 else "scanned_ocr_bad"

    # Full-page image alongside a real amount of native text.
    if features.has_full_page_image and features.native_chars > 50:
        return "mixed"

    if features.n_fonts > 0 and features.native_chars > 0:
        return "native_with_image" if features.n_images > 0 else "native"
    return "unknown"

The decisive signals are structural, not statistical: declared fonts, render mode, displayed image coverage. We never rely on character-count thresholds alone to decide native vs scanned. A native page with three lines of text is still native.

Going further: OCR quality scoring (the text_quality_score used above) deserves its own treatment. The two reliable signals are the proportion of Unicode replacement characters and the ratio of words found in a dictionary. Lists of “suspicious characters” like ●◦• should be avoided; these are perfectly legitimate bullets in formatted documents. The full scoring pipeline is a follow-up topic.
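A minimal sketch of those two signals, assuming a toy scoring rule: the dictionary-word ratio, scaled down by the share of U+FFFD replacement characters. The function below, the tiny `VOCAB` set, and the way the two signals are combined are all assumptions for illustration; a real pipeline would use a proper word list per language and the full scoring the article defers to a follow-up.

```python
import re

# Toy vocabulary for the example only; a real word list would be far larger.
VOCAB = {"renewal", "price", "eur", "the", "policy", "is", "due"}
REPLACEMENT_CHAR = "\ufffd"


def text_quality_score(text: str, vocab: set[str] = VOCAB) -> float:
    """Return a 0-1 score: dictionary-word ratio, penalised by replacement chars."""
    if not text.strip():
        return 0.0
    replacement_share = text.count(REPLACEMENT_CHAR) / len(text)
    # Tokens are runs of letters and digits. Pure numbers are skipped so that
    # amounts like "250" neither help nor hurt; letter/digit mixes such as
    # "25o" stay in and count as unknown words.
    tokens = re.findall(r"[^\W_]+", text.lower())
    words = [t for t in tokens if not t.isdigit()]
    if not words:
        return 0.0
    known = sum(1 for w in words if w in vocab)
    return (known / len(words)) * (1.0 - replacement_share)


print(text_quality_score("Renewal price EUR 250"))  # clean line from the opener
print(text_quality_score("Renewa1 price EUR 25O"))  # its OCR-damaged version
```

Bullets such as ● or • never enter the score here: they are neither word tokens nor replacement characters, which matches the advice above not to treat them as suspicious.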

3. The semantic zone of parsing_summary: one LLM call, system-prompt grade

Sections 1 and 2 went through NIST CSF and the Attention paper, both rich in structural signals. Section 3 turns to a document type where structure alone settles nothing: the one-page CV. The working example is a fictional CV, Sarah Mitchell, Data Analyst.

The signals from sections 1 and 2 are everything a deterministic parser can produce in a few seconds without a model call. They tell us what the document is and how it is laid out. They don’t tell us what it is about. Two one-pagers with the same page count, the same single-column layout, the same word_export producer still differ on every question retrieval will be asked.

A short prose summary closes that gap. One LLM call at parsing time, fed the first one or two pages, asked to return three or four sentences naming the document type, the main subject, and the fields it carries. Around 200 tokens. Cached forever, since parsing is run-once per document. The result lands in three fields of the same doc-level dict (parsing_summary): doc_type, typical_fields, and summary.

Run that clean CV through parsing and the semantic zone of parsing_summary reads like this:

{
  "doc_type": "resume",
  "typical_fields": ["name", "email", "phone", "experience", "education", "skills", "languages"],
  "summary": "One-page resume of Sarah Mitchell, a Data Analyst based in London with about 4 years of experience. Lists positions at Northwind Retail and Brightwave Insurance, a BSc in Statistics from Leeds, and skills in Python, SQL, BigQuery and Power BI. Standard CV sections: Summary, Experience, Education, Skills."
}

Dropped into the system prompt of the question parser, this fixes the “what’s the name?” case from the opener. The parser now sees that this document is about Sarah Mitchell before it sees the user’s question. Name is no longer an ambiguous role word in search of a literal occurrence. The parser knows the candidate’s name is Sarah Mitchell and routes the question that way.

The same three fields work for every question on the same document. “Where did she work?” now has a referent. “What’s her tech stack?” maps to the Skills section listed in typical_fields. Page count rides along for free in the same dict: “summarize page 1” on a one-page CV becomes “summarize the whole document”, retrieval is skipped, generation reads the full content.
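How the dict might reach the question parser, as a sketch. The prompt wording, the function names `render_document_context` and `can_skip_retrieval`, and the `page_count` key name are assumptions; the article only states that page count sits in the same dict. The renderer is key-agnostic, so it prints whatever fields the dict carries.

```python
def render_document_context(parsing_summary: dict) -> str:
    """Format the doc-level dict as a block for the question parser's system prompt."""
    lines = ["Document context (written once at parse time):"]
    for key, value in parsing_summary.items():
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        lines.append(f"- {key}: {value}")
    return "\n".join(lines)


def can_skip_retrieval(parsing_summary: dict, max_pages: int = 1) -> bool:
    """True when the document is short enough to hand to generation in full."""
    n_pages = parsing_summary.get("page_count") or 0
    return 0 < n_pages <= max_pages


example = {
    "doc_type": "resume",
    "page_count": 1,  # hypothetical key name
    "typical_fields": ["name", "email", "phone", "experience", "education", "skills"],
    "summary": "One-page resume of Sarah Mitchell, a Data Analyst based in London.",
}
print(render_document_context(example))
print(can_skip_retrieval(example))
```

Keeping the rendering separate from the skip decision lets the same dict serve both the prompt and the routing rule without a second model call.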

The shape of the summary field matters more than its length. A handful of working rules:

Three to four sentences, plain prose, factual register. No marketing tone (“an excellent CV with extensive achievements” poisons every downstream answer with claims the parser will then propagate).
Open with the document type and the main subject: “One-page resume of Sarah Mitchell, a Data Analyst…”. The parser uses the first noun phrase to disambiguate role words like name, role, employer.
List the standard sections when they exist: “Standard CV sections: Summary, Experience, Education, Skills.”. The parser uses this to map question topics to retrieval scopes.
Stick to facts a reader could verify on the first page or two. No claims about content the LLM has not seen.

A set of fictional CVs with the same shape (one to two pages, candidate at the top, sections below) but different layouts and content quality stresses this discipline. A summary that reads “resume of , , with ” generalises across all of them. A summary that drifts into rendering choices (“two-column layout with a coloured sidebar”) overfits to one file and breaks on the next.

This is the piece that turns the “look at the document” metaphor into something a chatbot can use. The deterministic signals from sections 1 and 2 say how to parse. The semantic zone of parsing_summary says what was parsed. Together they form the doc-level dict every downstream brick reads, starting with the question parser’s system prompt.

All of this shows up in Enterprise Document Intelligence, the desktop app I’m building. The screenshot below has the same fictional CV open, with the document-context fields surfaced and highlighted on the page: candidate name, target role, years of experience. The short summary written once at parse time is what drives that panel.

*The same one-page CV, parsed once: name, role, and experience surfaced as fields the question parser can target – Image by author*

Conclusion

A PDF is two documents stacked on top of each other: the declared signals (metadata, native TOC, source software) and the page-level content (text vs scan, images, tables, columns, page profile). The parser reads them in that order and trusts the body when the two disagree. A short LLM-written summary field, paid once per document and cached, sits next to them in the same parsing_summary dict, and the question parser reads it as part of its system prompt on every call.

Each signal stored at parse time becomes a column the rest of the pipeline reads. Each page-level decision routes the page to the right downstream handler: pure text pages go through OCR-skip, table-heavy pages go through a structured-extraction path, multi-column pages get a column-aware reading order. The difference between a parser that ships a flat string and a parser that ships something downstream code can query is right here, in the signals it bothered to record.

The next article (“Stop returning flat text from a PDF: the relational shape RAG needs”) will show you the eight DataFrames the parser produces from these signals, demoed on two real documents. The same DataFrames are the input the minimal RAG pipeline consumes end-to-end, and they sit inside the broader Enterprise Document Intelligence series.

Sources and further reading

The parser this article describes follows the same architecture as Docling (Auer et al., Docling Technical Report, IBM Research 2024): layout detection, TableFormer, reading-order. Borderless table extraction uses the model from Smock et al. (PubTables-1M / Table Transformer, CVPR 2022). The page-class taxonomy is built on the same baseline as Pfitzmann et al. (DocLayNet, KDD 2022). The article adds a render-mode detection pass (native / scanned / mixed) with OCR-quality scoring on top. The parser produces a relational set of tables (line_df, page_df, image_df, toc_df, object_registry, cross_ref_df, span_df, plus a parsing_summary dict); retrieval, generation, and annotation downstream don’t read the PDF again, they query DataFrames.

Same direction as the article:

Auer et al., Docling Technical Report, IBM Research 2024 (arXiv:2408.09869). Reference architecture for the pipeline this article describes: layout detection, TableFormer, reading-order, unified document representation.
Smock, Pesala, Abraham, PubTables-1M / Table Transformer (TATR), CVPR 2022 (arXiv:2110.00061). Vision-based table detection and structure recognition; the model behind most modern table parsers.
Pfitzmann et al., DocLayNet, KDD 2022 (arXiv:2206.01062). Empirical baseline for the page-class taxonomy and layout detection benchmarks.
Lo et al., PaperMage, EMNLP 2023 demos. Maps to the indexing-vs-reading split (parsing for retrieval is not parsing for answer generation).

Different angle, different context:

Faysse et al., ColPali: Efficient Document Retrieval with Vision Language Models, 2024 (arXiv:2407.01449). Vision-language retrieval on the page image. The context is retrieval where the page image is the artefact, no parsing-into-tables step. This article uses bounding-box-anchored DataFrames as the foundation instead.
Wang et al., DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding, JPMorgan 2024 (arXiv:2401.00908). Layout-aware LLM that reads the PDF directly without an explicit relational parsing brick. Same family of approach as ColPali; different from this article’s queryable relational artefact.
Kim et al., OCR-free Document Understanding Transformer (Donut), ECCV 2022 (arXiv:2111.15664). End-to-end OCR-free document understanding; useful contrast with the OCR-quality-scoring pass this article adds on top of the render-mode detection.