When PyMuPDF Can’t See the Desk: Parse PDFs for RAG with Azure Structure

companion in Enterprise Doc Intelligence, the sequence that builds an enterprise RAG system from 4 bricks. Article 5 (doc parsing) constructed the parser with PyMuPDF (fitz). This companion retains the identical purpose and the identical relational tables, and swaps the engine for Azure Structure (the prebuilt-layout mannequin), a richer bundle that recovers what fitz can not. That hole is the place we begin.

*the place this companion sits: it extends Article 5 (doc parsing), inside Half II (the 4 bricks), with a distinct parsing engine – Picture by writer*

PyMuPDF (fitz) is quick, free, and actual on clear prose. It additionally goes blind in three locations, and every one is the place enterprise RAG quietly breaks.

The desk on web page 14 of a contract. Fitz reads the cells one after the other and concatenates them. The column construction is gone. “Renewal charge 500 Setup charge 200” lands within the chunk. Your mannequin is requested to guess which quantity is which charge.

The scanned modification glued to the tip of the doc. Fitz reads the native pages and returns empty strings on the scanned ones. The consumer will get no reply on the modification as a result of the parser by no means learn it.

The determine with textual content inside. A chart with axis labels. A signed seal stamp. A screenshot of a spreadsheet. Fitz returns the bbox of the picture. The textual content inside is gone.

Azure Doc Intelligence reads all three. It’s a proprietary Microsoft Azure cloud service ruled by Microsoft’s On-line Companies Phrases. The prebuilt-layout mannequin returns native desk cells (rows, columns, headers), OCR textual content for each web page (native or scanned), figures with the textual content inside them, and paragraph roles (title, sectionHeading, figureCaption, tableCaption). One name. The identical relational tables as fitz, half of them enriched.

The downstream pipeline doesn’t care which engine produced the dict. Retrieval, era, annotation learn rows. They by no means learn the PDF.

*The identical tables, Azure enriches half – Picture by writer*

1. The place fitz is blind

4 circumstances. In every one, fitz misses and Azure works.

1.1. Tables: fitz returns flat phrases, Azure returns cells

A contract desk has rows and columns. The label “Renewal charge” sits in column 1, the worth 500 sits in column 2. Fitz reads the web page prime to backside and emits one line per textual content phase. The 4 cells of a row come again as 4 free phrases. Generally the cells from the row under get combined in if the y-coordinates are shut. The chunker downstream sees a soup of phrases. The row-and-column construction that makes a desk a desk is gone.

Azure’s prebuilt-layout mannequin detects every desk as a structured object. outcome.tables is an inventory of tables, every with cells listed by (row_index, column_index). The header row is flagged (cell.form == "columnHeader"). The cell content material is the cell textual content, precisely because the writer typed it. We flatten the desk into markdown rows so it lives inside line_df like some other content material. A four-cell row “Renewal charge | 500 | Setup charge | 200” turns into one line_df row with that markdown textual content. The header row will get a | --- | --- | ... | separator so a downstream mannequin reads the construction again.

1.2. Photographs: fitz returns the bbox, Azure returns the textual content

Many PDFs have figures with textual content inside them. Structure diagrams with field labels. Charts with axis ticks and legends. Signed seal stamps. Embedded screenshots of spreadsheets. Fitz returns every picture as a bbox and the uncooked bytes. The textual content inside is invisible to the parser.

Azure’s OCR runs on each web page, together with the pixels inside determine areas. For every determine, we acquire each Azure phrase whose bbox sits contained in the determine area and be part of them as ocr_text. “Multi-Head Consideration Concat Linear h” now lives in image_df.ocr_text for the determine on web page 4 of the Consideration paper. Retrieval can match a query about “multi-head consideration” even when the reply is textual content inside a determine.

*fitz returns the bbox and an empty textual content cell; Azure’s OCR recovers the labels printed contained in the determine – Picture by writer*

1.3. Scanned pages: fitz returns nothing, Azure returns OCR

A 30-page native contract will get a 10-page scanned modification glued on the finish. Fitz reads the native pages and returns empty strings for the scanned ones. The parser doesn’t flag this. The downstream pipeline silently covers 75% of the doc. The consumer has no thought 25% is lacking.

Azure runs OCR on each web page no matter supply. Native pages and scanned pages come again by the identical outcome.pages[i].traces path with the identical form. The parsing_method column on line_df lets downstream code inform which engine produced which rows. The parsing_summary dict has a n_pages area that matches the doc’s precise web page depend, not simply the pages with native textual content.

*a scan is pixels, not characters; fitz has no textual content layer to learn, Azure OCRs the web page – Picture by writer*

1.4. Captions and headings: fitz makes use of regex, Azure has express roles

Fitz detects determine / desk captions by regex on the beginning of every line (^Determine d+b, ^Desk d+b). It really works when captions appear like “Determine 2” and misses the remaining (“Fig. 2”, multi-line wraps). It additionally has false positives: a body-text sentence that begins with “Determine 2” will get picked up as a caption when it’s a point out.

*the 2 failure modes of caption-by-regex (a missed “Fig.” caption, a physique point out wrongly flagged) that Azure’s paragraph function avoids – Picture by writer*

Azure’s paragraphs area has function labels: every paragraph within the outcome carries a tag like "figureCaption", "tableCaption", "title", or "sectionHeading" that tells us what sort of block it’s, with none regex. "figureCaption" and "tableCaption" populate object_registry straight. "title" and "sectionHeading" rebuild the TOC. The tag is Azure’s format mannequin naming the block’s perform; fitz has no equal. The (object_type, object_id) be part of key remains to be extracted by the identical regex on the caption textual content so cross_ref_df joins again the identical means.

The TOC is the extra attention-grabbing case. Fitz’s build_toc_df reads native bookmarks (doc.get_toc()). When the PDF has no native bookmarks, fitz returns an empty TOC. That is the frequent enterprise case: Phrase exports, scanned paperwork, PDFs from kind mills. Azure reconstructs the TOC from paragraph roles. Each "title" paragraph turns into a level-1 entry, each "sectionHeading" paragraph turns into level-2. The hierarchy comes from the order they seem. This isn’t excellent, but it surely produces a usable TOC the place fitz would produce nothing.

2. Identical contract, richer knowledge

One perform. The identical tables as parse_pdf, in the identical form. One Azure name shared by each builder. That decision is small: level the SDK on the doc with one model_id, prebuilt-layout. (The opposite prebuilt mannequin, prebuilt-read, is OCR solely; the format mannequin is the one which additionally returns tables, paragraph roles, and studying order.)

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.fashions import AnalyzeDocumentRequest
from azure.core.credentials import AzureKeyCredential

consumer = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))

# "Structure" = the prebuilt-layout mannequin (NOT prebuilt-read, which is OCR solely)
with open("contract.pdf", "rb") as f:
    poller = consumer.begin_analyze_document(
        "prebuilt-layout",
        AnalyzeDocumentRequest(bytes_source=f.learn()),
    )

outcome = poller.outcome()   # tables, paragraph roles, OCR, studying order

parse_pdf_azure_layout is the Azure twin of parse_pdf: similar name form, similar dict of tables out, so each downstream brick reads it with out figuring out which engine ran. The physique is price a glance, as a result of it’s the form each engine within the sequence follows: make one name, then one small builder per desk, and reuse the engine-agnostic builders for the tables that solely want line_df.

def parse_pdf_azure_layout(pdf_path):
    outcome = analyze_pdf(pdf_path)               # one name, prebuilt-layout
    line_df  = azure_layout_pdf_to_line_df(pdf_path, outcome=outcome)
    image_df = build_image_df_azure_layout(outcome)         # + ocr_text
    toc_df   = build_toc_df_azure_layout(outcome)           # paragraph roles
    object_registry = build_object_registry_azure_layout(outcome)  # function tags
    page_df      = build_page_df(line_df)        # reused fitz builder (line_df solely)
    cross_ref_df = build_cross_ref_df(line_df)   # reused fitz builder (line_df solely)
    return {"line_df": line_df, "image_df": image_df, "toc_df": toc_df,
            "object_registry": object_registry, "page_df": page_df,
            "cross_ref_df": cross_ref_df, "span_df": pd.DataFrame(),
            "parsing_summary": parsing_summary}

Studying it prime to backside: one analyze_pdf makes the Azure name as soon as, then one small builder per desk reads that shared outcome, and the 2 tables that solely want line_df, page_df and cross_ref_df, are produced by the exact same fitz builders the native parser makes use of. The dict on the finish is the contract each engine returns.

*The identical tables mirror `parse_pdf`, with per-row diffs vs fitz – Picture by writer*

3. What every desk positive factors

3.1. line_df positive factors table-cell rows, picture OCR, choice marks

A 4-column “Schedule of Fees” desk turns into 6 rows in line_df: the header row, the markdown separator, and 4 knowledge rows.

*Every supply row turns into a `line_df` row; column construction carried contained in the markdown textual content – Picture by writer*

We preserve the cells inside line_df as an alternative of including a separate table_cells_df. One desk for each downstream brick to learn; paragraph traces and desk rows look the identical on the best way out. The price: per-cell queries want a markdown parse step. For RAG questions that is high-quality. The retriever matches key phrases on the row textual content. The LLM reads the markdown straight.

OCR textual content from inside photographs additionally lands in line_df as additional rows. Azure’s outcome.pages[i].traces already contains traces that fall inside determine areas, so the line-builder picks them up robotically. Choice marks (checkboxes) grow to be single-character traces: [x] for chosen, [ ] for unselected. Types with check-the-box fields grow to be queryable.

3.2. image_df positive factors an `ocr_text` column

Identical row, new column. For every detected determine, we record each Azure phrase whose bbox overlaps the determine area by a minimum of 50% and be part of them as ocr_text.

*Consideration paper figures with their labels uncovered; textual content inside figures now retrievable – Picture by writer*

The identical column on a fitz-produced image_df is empty. The fitz parser doesn’t OCR photographs. When parsing_method == "fitz", the ocr_text column is there for form parity however stays clean. Downstream code that checks ocr_text != "" works the identical whether or not the row got here from fitz or Azure.

3.3. toc_df will get reconstructed from paragraph roles

When the PDF has native bookmarks, the fitz build_toc_df is actual and free: it reads what the writer wrote. When it doesn’t (most enterprise paperwork), fitz returns an empty toc_df and downstream levels lose the part construction.

The Azure builder walks outcome.paragraphs, filters by function in {"title", "sectionHeading"}, and assembles a TOC. Stage 1 = title, stage 2 = sectionHeading. The hierarchy comes from the order paragraphs seem within the doc. The identical start_page, end_page, start_y, breadcrumb columns because the fitz TOC. The lookback move that computes end_page (the subsequent peer-or-ancestor’s start_page, or total_pages for the final part) is an identical to the fitz one; the one distinction is the place the rows come from.

The reconstruction just isn’t excellent. Azure can not inform sub-section ranges aside past sectionHeading. The hierarchy you get is two-deep at most. For many enterprise queries that is sufficient: a piece stamped “Schedule of Fees” lets the LLM floor its reply to the fitting part even with out the complete Article 14 > Schedule of Fees path.

3.4. object_registry will get caption-role detection

Fitz detects captions by regex anchored at the beginning of a line: ^Determine d+b, ^Desk d+b. Two failure modes. False negatives when the caption format differs (Fig. 2. as an alternative of Determine 2, or a multi-line wrap that pushes the quantity off the primary line). False positives when a body-text sentence occurs to start out with “Determine 2 reveals…”.

Azure skips the regex drawback. Its paragraphs area tags "figureCaption" and "tableCaption" explicitly. We learn the function straight. The (object_type, object_id) be part of key into cross_ref_df remains to be pulled from the caption textual content by the identical regex the fitz builder makes use of, so the be part of works the identical with both engine. The win is recall: Azure catches captions fitz misses. The price stays the identical (one Azure name, the result’s reused throughout builders).

3.5. parsing_summary positive factors Azure-specific stats

Three new fields land within the doc-level synthesis dict:

n_tables_detected: what number of tables Azure discovered (zero on a pure-prose doc, non-zero on a contract with tables).
n_figures: what number of figures the format mannequin recognized.
n_selection_marks: what number of checkboxes (stuffed or empty) Azure detected throughout all pages.

These three counts make routing a doc simple. A 30-page doc with n_tables_detected = 18 appears like a contract and the desk construction issues. A doc with n_selection_marks = 0 might be not a kind. A doc with n_figures = 0 is text-only; no level working picture OCR.

3.6. page_df and cross_ref_df: unchanged

Two tables keep the identical form. page_df and cross_ref_df are constructed from line_df alone, so the engine that produced line_df is irrelevant. One implementation, two engines, no drift.

span_df is empty beneath Azure. The format mannequin doesn’t expose sub-line typography (per-word daring or italic). If you want spans for heading detection or time period emphasis, keep on fitz for that doc. The 2 engines complement one another.

4. The parsing_method column: provenance for adaptive parsing

Each per-row desk from parse_pdf_azure_layout carries parsing_method == "azure_layout". Each per-row desk from parse_pdf (the fitz one) carries parsing_method == "fitz". Identical column, similar identify, each engines. The purpose is downstream.

*Contract on fitz, web page 14 re-parsed with Azure; each engines coexist by way of `parsing_method` – Picture by writer*

That is what adaptive parsing (Article 10) consumes. The default move makes use of fitz. Pages that fail a pre-parse test (desk area detected with no rows extracted, image-heavy web page with sparse textual content, OCR layer with low high quality) get re-parsed by Azure. The re-parsed rows exchange or append to the unique line_df rows. The parsing_method column retains the path.

Three downstream patterns the column allows:

De-duplication: when the identical web page received each passes, preserve azure rows over fitz rows (df.sort_values("parsing_method").drop_duplicates(["page_num", "line_num"], preserve="first") if "azure_layout" < "fitz" lexicographically, or use an express priority map).
Audit: a query that lands on a row with parsing_method == "azure_layout" prices extra to confirm (Azure was wanted). The reply’s confidence weighting can use this.
Price accounting: (line_df.parsing_method == "azure_layout").any() per web page tells you which of them pages went by Azure and easy methods to invoice the parsing time.

5. Price and latency

Azure just isn’t free. Three numbers matter.

Latency: one web page by prebuilt-layout returns in 2 to 4 seconds. A 30-page doc takes 60 to 120 seconds. Fitz parses the identical doc in beneath a second. When the consumer is ready for a question, parse with fitz first. Escalate to Azure solely on pages fitz dealt with poorly.

Cash: Azure costs per web page. The prebuilt-layout tier is round US$10 per 1,000 pages at the moment. A 30-page contract prices roughly US$0.30. Parsing 1,000 such contracts a day is US$300/day if each web page goes by Azure. Proscribing Azure to the pages that want it brings this down by 10x or extra.

Limits: the per-call PDF dimension restrict is 500 MB or 2,000 pages, whichever comes first. Bigger paperwork have to be cut up. The free tier (F0) permits 500 pages per 30 days and is okay for growth. Manufacturing often wants S0.

The order of magnitude is secure: fitz is free, Azure prices roughly a cent per web page. The precise tier costs change with area and time: deal with the numbers above as a calibration, not a contract. Article 10 picks which engine runs.

6. When to name which

Default to fitz. Escalate to Azure when a selected sign says fitz just isn’t sufficient.

Three indicators price wiring:

The web page has a desk area however fitz extracted few or no row-like buildings. Compute on line_df: cluster traces by y-coordinate, search for runs of quick uniform-spaced traces (an indication of cells). If the web page metadata says “desk detected” (from fitz’s web page.find_tables()) however the line sample doesn’t look table-like, escalate.
The web page is image-heavy with sparse textual content. image_df for the web page covers greater than 80% of the web page space and line_df has fewer than 10 rows on that web page. Scanned web page with no OCR layer, or a web page that’s one giant diagram with textual content inside. Both case wants Azure.
The OCR high quality rating is low: When fitz’s web page.get_text("textual content") returns scrambled OCR (excessive ratio of Unicode alternative characters, low dictionary-word ratio), re-OCR with Azure. The text_quality_score is computed in pre_parse_signals and skim by the dispatcher.

A fourth sign is less complicated. If the doc has no native TOC (fitz.toc_df.empty) and era wants part context, run the doc as soon as by Azure to get a reconstructed TOC. One price per doc, not per question.

Article 10 builds the complete dispatcher. The parsing_method column is what lets each downstream stage learn which engine ran on which row.

7. Conclusion

Two engines, one contract: the identical relational tables out, similar downstream code no matter which one ran.

*Each functionality that issues for enterprise RAG, plus velocity and price – Picture by writer*

A parser doesn’t return textual content; it returns a mannequin of the doc. Azure makes that mannequin richer (cell-level tables, OCR inside figures, captions tagged by function, TOC reconstructed with out bookmarks) at 2 to 4 seconds and ~US$0.01 per web page. Fitz prices nothing and runs in milliseconds. The routing rule is easy: fitz by default, Azure when an upstream sign says fitz just isn’t sufficient. Article 10 wires the dispatcher.

“Los Movimientos”: The Routing Drawback That Practically Broke My Spirit

Educating LLMs to Replace Beliefs for Environment friendly Lengthy-Horizon Interplay – The Berkeley Synthetic Intelligence Analysis Weblog

8. Sources and additional studying

The prebuilt-layout mannequin behind parse_pdf_azure_layout is documented by Microsoft and rests on cell-level desk extraction analysis (Smock et al. 2022) plus a paragraph-role layer that converts visible areas into structural roles. Docling (Article 5ter) is the open-source equal of the identical cascade; it provides the identical desk contract on native {hardware}, helpful when paperwork can not go away the constructing.

Identical route because the article:

Microsoft, Azure AI Doc Intelligence. Structure mannequin. Official documentation for prebuilt-layout, the mannequin behind parse_pdf_azure_layout. The cell-level desk output, paragraph roles, and OCR protection all originate right here.
Smock, Pesala, Abraham, PubTables-1M / Desk Transformer (TATR), CVPR 2022 (arXiv:2110.00061). The analysis behind the cell-level desk extraction Azure ships; helpful for understanding what azure_layout is doing beneath the hood.

Totally different angle, completely different context:

Auer et al., Docling Technical Report, IBM Analysis 2024 (arXiv:2408.09869). Open-source native equal of the Azure format cascade. Identical desk contract; trades cloud price for native compute. The best selection when confidentiality blocks the cloud add that Azure requires.

Earlier within the sequence: