Vision LLMs Are PDF Parsers Too: Reading Charts and Diagrams for RAG

This article is a companion in Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. Article 5 (document parsing) built the parser with PyMuPDF (fitz), which reads the words on a page. This companion swaps the engine for a vision LLM that reads the page as an image, so it gives you the words plus the one thing the text parsers can’t: the content of the pictures.

*Where this companion sits: it extends Article 5 (document parsing), inside Part II (the four bricks), with a different parsing engine – Image by author*

Show a PDF parser a chart and it sees an empty box. The text engines, native or cloud or local, all find the words on a page and put them in searchable tables. A chart has no words, so to every one of them the region is blank, and to a retrieval system it doesn’t exist.

A vision model is different. It looks at the page the way a person would. Ask it for the text and it gives you the text and the tables, just like the others. Show it a chart and it tells you what the chart says, in plain words you can search. That last part is what the others can’t do.

The catch: it’s slower, costs more, and reads numbers off a chart only roughly. It is also only as good as the model you pick. gpt-4.1 reads a chart that the cheaper gpt-4o-mini half-misses. So you don’t use it everywhere. You save it for the pages that are mostly pictures, where the other parsers come back empty.

1. The one thing only a vision model can do: make an image searchable

Start with the reason this parser exists at all. The textual engines turn a page into the relational tables from the earlier articles, but a figure defeats them: they return a chart as a bounding box in image_df with maybe a stray axis label. There is no text in a chart, so to OCR and to a layout model the region is empty, and to a retrieval system it doesn’t exist.

*OCR and layout return a box; the vision parser writes text you can retrieve – Image by author*

A vision model reads the picture. Below are three figures pulled straight out of two PDFs: the Transformer diagrams from Attention Is All You Need (Vaswani et al. 2017) and the commodity-price charts from the World Bank Commodity Markets Outlook (April 2026 issue). Each figure sits next to the one-sentence description gpt-4.1 wrote for it. Source documents and licensing details are listed at the end of the article.

*Each extracted image gets a one-sentence description, which is text that retrieval can match – Image by author*

The price chart is now a sentence: commodity price indices by sector, falling since their 2022 peak. A user searching for “commodity price index since 2022” can now hit that page. Before, there was nothing on it to match.

Here is the argument in its sharpest form. Picture a satellite image of a parking lot. It has no text at all. OCR finds nothing, layout finds one box, and to a retrieval system the image doesn’t exist. A vision model writes “aerial view of a parking lot, roughly half full, around forty cars”. Now a search for parking occupancy finds it. That sentence is the parse, and only a vision model can produce it. OCR and layout can’t, by definition, because there were never any characters to read.
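To make that concrete, here is a minimal sketch of why one sentence is enough for retrieval. It assumes a toy keyword-overlap score, not the retrieval brick the series actually uses, and the `tokens` and `search` helpers and the file name are hypothetical.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a stand-in for a real retriever's analyzer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query: str, index: dict[str, str]) -> list[tuple[str, int]]:
    """Rank index entries by how many query tokens they share; drop zero scores."""
    q = tokens(query)
    scored = [(key, len(q & tokens(text))) for key, text in index.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: -s[1])

# What a text-only parser leaves for an image-only page: nothing to match.
text_only_index = {"parking_lot.png": ""}

# The same page once a vision model has described it (sentence quoted above).
vision_index = {
    "parking_lot.png": "aerial view of a parking lot, roughly half full, around forty cars"
}

query = "parking lot cars"
print(search(query, text_only_index))  # no hits: the image is invisible to retrieval
print(search(query, vision_index))     # the described image is now a hit
```

A keyword scorer only matches shared words, so a query like “parking occupancy” would need an embedding-based retriever to land on this sentence. Either way, the retriever needs some text to work with, and the description is that text.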

2. It also parses text and tables, like the others

The figure is the unique part, but a parser that only read pictures would be useless. A vision model reads the text and the tables too, and no worse than the textual engines on clean material. We pointed parse_page_vision at page 30 of the NIST Cybersecurity Framework, the Framework Core table, and asked for markdown. It returned the table columns intact, merged cells handled (the Function name sits on the first row of its block and the continuation rows leave it blank).

*The same 4-column table the other engines reconstruct, read straight off the image – Image by author*

This is the same cell structure Docling and Azure produce from the same page in the two previous articles: they emit markdown tables too, so the format is not what sets vision apart. The vision model never built a table object; it read the grid off the picture and wrote markdown (it returns HTML just as well). So the claim from the lead holds: it is a parser, returning the reusable model the others return, plus the figures they cannot.

3. The model matters: gpt-4o-mini misses charts that gpt-4.1 reads

How good the parse is depends heavily on the model, and the gap shows precisely where it counts, on the figures. We ran the same CMO chart page through gpt-4o-mini and gpt-4.1.

*Both read the page text and the table; on the charts the cheaper model finds half – Image by author*

gpt-4o-mini found three of the six charts and labelled two of them as tables. gpt-4.1 found all six and transcribed their axes down to the month, including the policy-uncertainty and temperature-anomaly charts the smaller model missed. Both read the page text and the NIST table correctly. The weaker model fell down on the pictures, the one thing you brought vision in to do. So with this parser the model is part of the quality, not just a latency and cost knob: a cheaper vision model degrades gracefully on text and badly on figures.

4. The honest trade: exactness and cost

None of this is free, and the catch is worth naming plainly. It is not that vision “isn’t really parsing”, because it is. It is that the parse is less exact and costs more per page.

*Same on text and tables; vision alone reads images; the price is exactness and cost – Image by author*

Two costs stand out.

Exactness, with two faces: first, the values it reads off a curve are approximate. The shape and the gist are right, but a specific tick can be off, so treat a transcribed number as a lead, not a fact. Second, and worse, it can silently omit an element, a row of a table or one chart in a panel, the way gpt-4o-mini dropped half the charts in section 3. That is a completeness problem, a kind of hallucination by omission, and a deterministic parser never has it: when fitz or Docling reads a table, no row goes missing.

*Vision recovers the shape of a chart but not the exact value; treat a transcribed number as a lead to verify – Image by author*

Cost: Every page is a large image and a model call, billed per page, with no bounding boxes to highlight afterward. The textual parsers run once, cost almost nothing per page, and give you exact spans.
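Both exactness risks, approximate values and silent omissions, can be checked cheaply whenever a trusted reference exists. The sketch below is one possible pair of checks, under assumptions: the helper names are hypothetical, the 5% tolerance is arbitrary, and the reference row count is assumed to come from a deterministic parser run on the same table.

```python
import math

def _is_separator(line: str) -> bool:
    """True for a markdown table separator row such as |---|:--:|."""
    return line.startswith("|") and set(line) <= set("|-: ")

def count_markdown_table_rows(markdown: str) -> int:
    """Count data rows in markdown tables, skipping header and separator rows."""
    lines = [ln.strip() for ln in markdown.splitlines()]
    rows = 0
    for i, ln in enumerate(lines):
        if not ln.startswith("|") or _is_separator(ln):
            continue
        # A table row directly followed by a separator is the header.
        if i + 1 < len(lines) and _is_separator(lines[i + 1]):
            continue
        rows += 1
    return rows

def missing_rows(vision_markdown: str, reference_row_count: int) -> int:
    """How many rows the vision parse dropped relative to a trusted count."""
    return max(0, reference_row_count - count_markdown_table_rows(vision_markdown))

def within_tolerance(transcribed: float, reference: float, rel_tol: float = 0.05) -> bool:
    """Treat a transcribed chart value as a lead: accept only if close to the source."""
    return math.isclose(transcribed, reference, rel_tol=rel_tol)

# Illustrative input only: a two-row table where the reference parser saw three rows.
sample = """
| Function | Category |
|----------|----------|
| Identify | Asset Management |
|          | Business Environment |
"""
print(count_markdown_table_rows(sample))  # data rows found in the vision markdown
print(missing_rows(sample, 3))            # rows the reference has that vision lacks
print(within_tolerance(102.0, 100.0))     # close enough to keep as a lead
print(within_tolerance(120.0, 100.0))     # too far off: go back to the source
```

Neither check proves the parse is right. They only flag pages where the vision output disagrees with something you trust more.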

So the rule is not “vision instead of parsing”. It is “vision for the pages the textual parsers go blind on”.
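That rule can be written as a small routing check. The sketch below is one hedged way to phrase it, not the Article 10 dispatcher: `PageStats`, `needs_vision`, and both thresholds are hypothetical, and the statistics are assumed to come from whichever textual parser already ran on the page.

```python
from dataclasses import dataclass

@dataclass
class PageStats:
    text_chars: int          # characters the textual parser extracted
    image_area_ratio: float  # fraction of the page covered by image boxes, 0..1

def needs_vision(stats: PageStats, *, min_chars: int = 200,
                 min_image_ratio: float = 0.5) -> bool:
    """Send a page to the vision parser only when the textual parse looks blind."""
    mostly_pictures = stats.image_area_ratio >= min_image_ratio
    # Nearly wordless AND carrying an image: likely a scan, a chart, or a photo.
    wordless_image = stats.text_chars < min_chars and stats.image_area_ratio > 0
    return mostly_pictures or wordless_image

pages = {
    "dense text page": PageStats(text_chars=3000, image_area_ratio=0.1),
    "chart panel":     PageStats(text_chars=120, image_area_ratio=0.8),
    "blank page":      PageStats(text_chars=0, image_area_ratio=0.0),
}
for name, stats in pages.items():
    print(name, "->", "vision" if needs_vision(stats) else "textual parser")
```

The thresholds are placeholders to tune on your own corpus. The design point is that the cheap parser runs first and its own output decides whether the expensive one is needed.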

5. How it works: `parse_page_vision`

The mechanism is small. The function renders the page, sends the image to the vision model through the same responses.parse structured-output call the series uses elsewhere, and returns a small object: the page as markdown, and a list of figures, each with a kind, a description, and a transcription.

page = parse_page_vision("CMO-April-2026.pdf", 10, model="gpt-4.1")
page.markdown                  # headings, paragraphs, tables
page.figures                   # one entry per chart / diagram
page.figures[0].description    # "line chart, price index ..."
page.figures[0].transcription  # axes, legend, readable values

parse_page_vision is a sibling of the fitz, azure_layout, and docling parsers, because it is a parser too. The adaptive-parsing dispatcher (Article 10) reaches for it when a page is visual enough that the textual engines come back empty.

The body is short enough to read in one pass. Two Pydantic models set the output: the page as markdown, plus one entry per figure with its kind, description, and transcription. The function renders the page to an image, adds the instruction, and makes one structured call through the shared llm_parse wrapper. Retries, token limits, and the call cache come with the wrapper. There is no layout model and no OCR step: the model reads the pixels and fills the schema.

from pydantic import BaseModel

# llm_parse is the series' shared wrapper. get_vision_client, vision_model,
# render_page_data_url, and VISION_PARSE_SYSTEM_PROMPT are not shown here.

class FigureContent(BaseModel):
    kind: str           # chart, diagram, picture, map, ...
    description: str    # what it shows, in searchable words
    transcription: str  # axes, legend, readable values

class VisionPageParse(BaseModel):
    markdown: str                 # the page as markdown, tables kept
    figures: list[FigureContent]  # one entry per figure on the page

def parse_page_vision(pdf_path, page, *, client=None, model=None, zoom=2.0):
    client = client or get_vision_client()
    model = model or vision_model()
    page_image = render_page_data_url(pdf_path, page, zoom=zoom)
    content = [{"type": "input_text", "text": "Parse this page."},
               {"type": "input_image", "image_url": page_image}]
    return llm_parse(
        input=[{"role": "system", "content": VISION_PARSE_SYSTEM_PROMPT},
               {"role": "user", "content": content}],
        text_format=VisionPageParse,   # the Pydantic contract above
        client=client, model=model, label="vision.parse_page",
    )
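The article does not show render_page_data_url, so here is a minimal sketch of what such a helper could look like with PyMuPDF. It is an assumption, not the series' implementation: the 1-based page number and the PNG encoding are guesses, and `png_bytes_to_data_url` is a hypothetical helper split out so the encoding step can be tested without a PDF.

```python
import base64

def png_bytes_to_data_url(png: bytes) -> str:
    """Wrap raw PNG bytes in a data URL that an image input field can carry."""
    return "data:image/png;base64," + base64.b64encode(png).decode("ascii")

def render_page_data_url(pdf_path: str, page: int, *, zoom: float = 2.0) -> str:
    """Render one PDF page to PNG at `zoom` times the base 72 dpi; return a data URL."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    with fitz.open(pdf_path) as doc:
        # Assumption: `page` is 1-based, matching "page 30" in the prose.
        pix = doc[page - 1].get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        return png_bytes_to_data_url(pix.tobytes("png"))
```

A zoom of 2.0 doubles the resolution in each direction, which makes small axis labels easier to read at the price of a larger image per call.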

The system prompt (VISION_PARSE_SYSTEM_PROMPT) is the other half of the engine: it tells the model to keep headings and reading order, render every table as a markdown table, and add one entry per figure whose description someone could later search. Change that instruction and you change the parser.

6. The lighter mode: ask the page directly

There is a one-off way to use the same capability. Instead of parsing the page into a reusable structure, hand the model the page and a single question and read back one answer. No markdown, no index, nothing kept. Useful when building a model would be overkill.

ans = answer_from_pdf_vision(
    "data/nist/NIST.CSWP.04162018.pdf",
    "Category Unique Identifier for 'Asset Management'?",
    pages=30,
)
ans.answer        # "ID.AM"
ans.answer_found  # True (False when not on the page)

It behaves, and here the model barely matters: both gpt-4o-mini and gpt-4.1 answer these the same way. The Framework Core lookup returned ID.AM (Function: Identify); a question about Figure 1 of the Attention paper, readable from the diagram, came back right; and a question whose answer was not on the page returned nothing.

*The third row is the safety check: asked for something absent, it refused instead of inventing – Image by author*

That third row matters as much as the first two. A model that reads a page will tend to invent a plausible answer unless the schema and the instruction give it an explicit way to say “not here”. Seeing the null path fire is what makes the mode safe to use.
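The “not here” path can be made explicit in the output contract. The sketch below assumes the two fields shown above (`answer`, `answer_found`) and nothing else about the series' schema: `PageAnswer` and `from_model_output` are hypothetical names, and a dataclass stands in for the Pydantic model to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PageAnswer:
    answer_found: bool
    answer: Optional[str] = None  # None whenever the page does not hold the answer

def from_model_output(raw: dict) -> PageAnswer:
    """Normalize a model's JSON reply so that 'not here' is always explicit."""
    found = bool(raw.get("answer_found", False))
    text = (raw.get("answer") or "").strip()
    # An answer without the flag, or the flag without an answer, counts as absent.
    if not found or not text:
        return PageAnswer(answer_found=False, answer=None)
    return PageAnswer(answer_found=True, answer=text)

print(from_model_output({"answer_found": True, "answer": "ID.AM"}))
print(from_model_output({"answer_found": False, "answer": "a plausible guess"}))
print(from_model_output({"answer_found": True, "answer": ""}))
```

The design choice is to let the flag, not the text, decide: a guess that arrives with `answer_found` set to False is discarded rather than passed downstream.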

Same idea, packaged. The vision-as-parser pattern is now shipped as a tuned product by several vendors. Mistral Document AI on Azure AI Foundry (model mistral-document-ai-2512, available as a serverless API in East US / East US 2 / Sweden Central) bundles an OCR component (mistral-ocr-2512) with a small reasoning model (mistral-small-2506) and returns markdown plus a JSON object whose schema you can customize. The output contract differs from parse_page_vision: markdown rather than a line_df, and structured extraction baked into the same call rather than punted to generation. Same underlying idea, packaged for a per-page billing model. For pipelines that already think in markdown or want the layout + extraction step folded into one API call, it is worth a comparison against the OpenAI vision route used in this article.

The bbox gap is real. Mistral OCR returns bounding boxes only for images embedded in the page (each image carries top_left_x / top_left_y / bottom_right_x / bottom_right_y). The markdown body itself has no per-line, per-paragraph, or per-table-cell bboxes. That breaks two things the rest of the series relies on: Article 1’s PDF annotation step (highlighting the cited lines on the source PDF needs bboxes) and Article 7’s line-level retrieval audit (every retrieved row points back to its bbox so the reader can verify on the page).

An open question for the reader, then. How would you reconcile two parsers running on the same page, Mistral’s markdown (structured but bbox-less) and fitz / Docling’s line_df (bbox-rich but flatter), into one coherent output your downstream can use? Aligning two text streams at the line or token level is a known hard problem (segmentation differs, OCR errors differ, the markdown’s table flattening loses cell positions). The article doesn’t propose a solution. If your downstream needs bbox-level traceability, the reconciliation cost is real and worth measuring before committing to the markdown contract.
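One way to start measuring that cost is a naive baseline, offered as a sketch and not as a solution: fuzzy-match each markdown line against the bbox-carrying lines and count how many fail to align. Everything here is an assumption, including the `align_lines` helper, the 0.8 threshold, the `(text, bbox)` input shape, and the made-up coordinates.

```python
from difflib import SequenceMatcher

def align_lines(markdown_lines, bbox_lines, min_ratio=0.8):
    """For each markdown line, return (line, best-matching bbox or None, score)."""
    aligned = []
    for md in markdown_lines:
        best_bbox, best_score = None, 0.0
        for text, bbox in bbox_lines:
            score = SequenceMatcher(None, md.lower(), text.lower()).ratio()
            if score > best_score:
                best_bbox, best_score = bbox, score
        # Below the threshold, keep the line but admit it has no traceable bbox.
        kept = best_bbox if best_score >= min_ratio else None
        aligned.append((md, kept, round(best_score, 2)))
    return aligned

# Illustrative inputs only: bboxes are invented (x0, y0, x1, y1) tuples.
bbox_lines = [
    ("Asset Management", (72.0, 100.0, 200.0, 112.0)),
    ("Business Environment", (72.0, 120.0, 220.0, 132.0)),
]
markdown_lines = ["Asset Management", "A sentence found in neither stream"]

result = align_lines(markdown_lines, bbox_lines)
unmatched = sum(1 for _, bbox, _ in result if bbox is None)
print(result)
print(f"{unmatched} of {len(result)} markdown lines have no bbox")
```

The unmatched share is the number to watch. This baseline ignores the hard parts named above (differing segmentation, flattened table cells), so treat its output as a lower bound on the reconciliation effort.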

Sources for this section:

Mistral OCR API endpoint specification, input schema, response schema (pages with markdown + images array, image bboxes only).
Mistral OCR processor docs (basic OCR), table_format parameter, per-page response structure.
Mistral Document Annotations docs, optional structured extraction with custom schemas, bbox-level annotations when explicitly requested.
Mistral Document AI 2512 on Azure AI Foundry catalog, serverless deployment in East US / East US 2 / Sweden Central, per-page billing.
Unlocking Document Understanding with Mistral Document AI in Microsoft Foundry, the bundled mistral-ocr-2512 + mistral-small-2506 composition.

7. Four parsers now, one of them reads the pictures

All four engines are parsers. Three read text and structure; the fourth reads those too, and the images on top.

*fitz, azure, and docling build the model from text and layout; vision also reads the pictures – Image by author*

Article 10 (adaptive parsing) builds the dispatcher that picks among them per page. The vision parser sits at the visual end: reach for it when a page is mostly a chart, when a diagram holds the answer, when a scan is too degraded for OCR, or when the content is an image with no text at all. It is the most expensive per page and the least exact on numbers, so it runs last. But it is the only engine that turns a picture into something you can retrieve.

8. Conclusion

A vision model is a parser: ask for markdown, it returns text and tables like fitz or Azure; ask it to describe the figures, it returns the one thing the textual parsers can’t, searchable words about an image. The trade is real (less exact, no bounding boxes, one model call per page), so the vision parser doesn’t replace the textual ones, it covers their blind spot. They read the words on the page; it reads the page that has no words.

9. Sources and further reading

Vision-language models as document parsers descend from two lineages: the open VLM literature (PaliGemma, Florence-2, Qwen-VL family) and the frontier multimodal APIs (OpenAI GPT-4o / GPT-4.1, Anthropic Claude with vision, Google Gemini). The right cross-reading for this article is ColPali (Faysse et al. 2024), which makes the visual page the retrieval primitive itself, and the model-specific documentation pages where OpenAI publishes the vision capabilities of gpt-4.1 and gpt-4o-mini.

Same direction as the article:

OpenAI, Vision capabilities of the gpt-4.1 family. Reference documentation for the model behind parse_page_vision; same architectural pattern (vision LLM as a parser that returns markdown or structured output).
Faysse et al., ColPali: Efficient Document Retrieval with Vision Language Models, 2024 (arXiv:2407.01449). Vision-language retrieval on the page image itself. Anchors the visual row of the Article 4 diagnostic grid; same family of techniques applied to a different brick (retrieval rather than parsing).

Different angle, different context:

Auer et al., Docling Technical Report, IBM Research 2024 (arXiv:2408.09869). Layout-based parsing without a generative model. Different cost-quality tradeoff: deterministic, cheap, blind to figures. Article 5ter (Docling parsing) develops this engine end to end.
Microsoft, Azure AI Document Intelligence. Cloud cell-level parser. Same blind spot as Docling on figures, complementary to vision LLM on every other content type.

Source documents and licensing. The figures and tables in this article are reproduced from openly-licensed sources:

Earlier in the series:

Document Intelligence: series intro. What the series builds, brick by brick, and in what order.
Baseline Enterprise RAG, from PDF to highlighted answer. The four-brick pipeline end to end: PDF in, highlighted answer out.
Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval. Where embedding similarity wins (synonyms, typos, paraphrase), where it predictably breaks (unknown terms, negation, term-vs-answer relevance), and how to use it anyway.
Rerankers Aren’t Magic Either: When the Cross-Encoder Layer Is Worth the Cost. What a cross-encoder adds over bi-encoder embeddings, measured, and when it’s worth the latency.
RAG is not machine learning, and the ML toolkit solves the wrong problem. Why chunk-size sweeps and finetuning optimize the wrong thing; route by question type instead.
From regex to vision models: which RAG technique fits which problem. Two axes, document complexity and question control, that select the technique for each case.
10 common RAG mistakes we keep seeing in production. Ten production mistakes, organized brick by brick, with the fix for each.
Beyond extract_text: the two layers of a PDF that drive RAG quality. The first half of the parsing brick: the document’s nature, signals, and summary.
Stop returning flat text from a PDF: the relational shape RAG needs. The second half of the parsing brick: the relational tables every downstream brick reads.
When PyMuPDF can’t see the table: parse PDFs for RAG with Azure Layout (link to come). The same tables from Azure Layout: native table cells, OCR, paragraph roles.
Parse PDFs for RAG locally with Docling: rich tables, no cloud upload (link to come). The same tables computed locally with Docling: TableFormer cells, nothing leaves the machine.