are powerful models that take images as input, instead of text like traditional LLMs. This opens up a lot of possibilities, since we can directly process the contents of a document, instead of using OCR to extract text and then feeding that text into an LLM.
In this article, I'll discuss how you can apply vision language models (VLMs) to long-context document understanding tasks. This means applying VLMs to either very long documents of over 100 pages, or very dense documents that contain a lot of information, such as drawings. I'll cover what to consider when applying VLMs, and what kinds of tasks you can perform with them.

Why do we need VLMs?
I've discussed VLMs a lot in my earlier articles, and covered why they're so important for understanding the contents of some documents. The main reason VLMs are required is that a lot of the information in documents requires the visual input to understand.
The alternative to VLMs is to use OCR, and then use an LLM. The problem here is that you're only extracting the text from the document, and not including the visual information, such as:
- Where different text is positioned relative to other text
- Non-text information (essentially everything that isn't a letter, such as symbols or drawings)
- Where text is positioned relative to other information
This information is often essential to truly understand the document, so you're often better off using VLMs directly, where you feed in the image itself and can therefore also interpret the visual information.
For long documents, using VLMs is a challenge, since you need a lot of tokens to represent visual information. Processing hundreds of pages is thus a big challenge. However, with a lot of recent developments in VLM technology, models have become better and better at compressing visual information into reasonable context lengths, making it possible and practical to apply VLMs to long documents for document understanding tasks.

OCR using VLMs
One good option for processing long documents while still including the visual information is to use VLMs to perform OCR. Traditional OCR engines like Tesseract only extract the text directly from documents, along with the bounding boxes of the text. However, VLMs are also trained to perform OCR, and can perform more advanced text extraction, such as:
- Extracting Markdown
- Explaining purely visual information (i.e. if there's a drawing, explain the drawing with text)
- Adding missing information (i.e. if there's a box labeled Date with a blank space after it, you can tell the model to extract Date and mark it as empty)
Recently, DeepSeek released a powerful VLM-based OCR model, which has gotten a lot of attention and traction lately, making VLMs for OCR more widespread.
Markdown
Markdown is very powerful, because you extract formatted text. This allows the model to:
- Show headers and subheaders
- Represent tables accurately
- Mark bold text
This lets the model extract more representative text, which more accurately depicts the text contents of the document. If you then apply LLMs to this text, they will perform far better than if you applied them to plain text extracted with traditional OCR.
LLMs perform better on formatted text like Markdown than on plain text extracted using traditional OCR.
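As a rough sketch of what this can look like in practice, here's a minimal example of asking a VLM for Markdown through an OpenAI-compatible chat completions API. The model name and prompt wording are assumptions you would adapt to your own setup.

```python
# Minimal sketch: VLM-based OCR that returns Markdown for one page image.
# Assumes an OpenAI-compatible API; swap in whichever VLM you actually use.
import base64
from openai import OpenAI

client = OpenAI()

def ocr_page_to_markdown(png_bytes: bytes, model: str = "gpt-5") -> str:
    """Ask the VLM to transcribe a page image as Markdown."""
    image_b64 = base64.b64encode(png_bytes).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract all text on this page as Markdown. "
                         "Preserve headers, tables, and bold text."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```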
Explain visual information
Another thing you can use VLM OCR for is explaining visual information. For example, if you have a drawing with no text in it, traditional OCR wouldn't extract any information, since it's only trained to extract text characters. However, you can use VLMs to explain the visual contents of the image.
Imagine you have the following document:
This is the introduction text of the document
[an image of the Eiffel Tower]
This is the conclusion of the document
If you applied traditional OCR like Tesseract, you'd get the following output:
This is the introduction text of the document
This is the conclusion of the document
This is clearly an issue, since you're not including any information about the image showing the Eiffel Tower. Instead, you can use VLMs, which could output something like:
This is the introduction text of the document
This image depicts the Eiffel Tower during the day
This is the conclusion of the document
If you used an LLM on the first text, it of course wouldn't know the document contains an image of the Eiffel Tower. However, if you used an LLM on the second text, extracted with a VLM, the LLM would naturally be better at answering questions about the document.
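A simple way to get this behavior is to ask for it explicitly in the OCR prompt. The wording below is only one possible phrasing, not a fixed recipe:

```python
# Illustrative OCR prompt that asks the VLM to describe purely visual
# content inline; adjust the wording and marker format to your pipeline.
OCR_WITH_DESCRIPTIONS_PROMPT = (
    "Extract all text on this page in reading order. "
    "Whenever you encounter a figure, drawing, or photo, insert a "
    "one-sentence description of it in square brackets at the position "
    "where it appears, e.g. [Image: the Eiffel Tower during the day]."
)
```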
Add missing information
You can also prompt VLMs to output content when information is missing. To understand this concept, imagine a form with an Address field containing "Road 1", a Date field left blank, and a Company field containing "Google".
If you applied traditional OCR to this form, you'd get:
Address Road 1
Date
Company Google
However, it would be more representative if you used a VLM, which, if instructed, could output:
Address Road 1
Date <empty>
Company Google
This is more informative, because we're informing any downstream model that the date field is empty. If we don't provide this information, it's impossible to know later whether the date is simply missing, whether the OCR wasn't able to extract it, or whether there's some other reason.
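One way to make empty fields explicit for downstream code is to ask for structured output where missing values are null. Here's a sketch; the JSON keys are assumptions based on the example form above:

```python
# Illustrative prompt for making missing values explicit. The key names
# follow the example form above and should be adapted to your documents.
FORM_EXTRACTION_PROMPT = (
    "Extract the fields from this form as JSON with the keys "
    '"address", "date", and "company". If a field label is present but '
    "its value is left blank in the document, set the value to null "
    "instead of omitting the key."
)
```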
However, OCR using VLMs still suffers from some of the issues that traditional OCR struggles with, because the downstream model is not processing the visual information directly. You've probably heard the saying that a picture is worth a thousand words, which often holds true for processing visual information in documents. Yes, you can produce a text description of a drawing with a VLM as OCR, but this description will never be as descriptive as the drawing itself. Thus, I argue that in a lot of cases you're better off processing the documents directly with VLMs, as I'll cover in the following sections.
Open source vs closed source models
There are a lot of VLMs available. I follow the Hugging Face VLM leaderboard to keep an eye out for any new high-performing models. According to this leaderboard, you should go for either Gemini 2.5 Pro or GPT-5 if you want to use closed-source models through an API. From my experience, these are great options that work well for long document understanding and for handling complex documents.
However, you might also want to use open-source models, due to privacy, cost, or to have more control over your own application. In this case, SenseNova-V6-5-Pro tops the leaderboard. I haven't tried this model personally, but I've used Qwen 3 VL a lot, which I have good experience with. Qwen has also released a dedicated cookbook for long document understanding.
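If you go the open-source route, one common setup is to serve the model behind an OpenAI-compatible endpoint (for example with vLLM), so the rest of your code stays unchanged. The base URL, API key, and model name below are assumptions about your own deployment, not a prescription:

```python
# Sketch: point the same OpenAI-compatible client at a self-hosted VLM.
# Base URL, api_key, and model name are placeholders for your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # your local inference server
    api_key="not-needed-locally",         # many local servers ignore the key
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL",  # whichever open-source VLM you serve
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize this page."},
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64,<page image here>"}},
        ],
    }],
)
print(response.choices[0].message.content)
```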
VLMs on long documents
In this section I'll talk about applying VLMs to long documents, and the considerations you have to make when doing so.
Processing power considerations
If you're running an open-source model, one of your main considerations is how big a model you can run, and how long inference takes. You'll typically depend on access to a larger GPU, at least an A100 in most cases. Luckily, these are widely available and relatively cheap (often priced at 1.5 – 2 USD per hour at a lot of cloud providers now). However, you must also consider the latency you can accept. Running VLMs requires a lot of processing, and you have to consider the following factors:
- How long is it acceptable to spend processing one request?
- Which image resolution do you need?
- How many pages do you need to process?
If you have a live chat, for example, you need fast processing; if you're simply processing in the background, however, you can allow for longer processing times.
Image resolution is also an important consideration. If you need to be able to read the text in documents, you need high-resolution images, typically over 2048×2048, though it naturally depends on the document. Detailed drawings with small text in them, for example, will require even higher resolution. Increasing the resolution greatly increases processing time, so this is an important trade-off. You should aim for the lowest possible resolution that still allows you to perform all the tasks you want to perform. The number of pages is a similar consideration. Adding more pages is often necessary to have access to all the information in a document. However, the most important information is often contained early in the document, so you may get away with only processing the first 10 pages, for example.
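To make these knobs concrete, here's a small sketch that renders only the first pages of a PDF at a chosen DPI before handing them to a VLM. It assumes PyMuPDF is installed, and the page limit and DPI are just starting points to tune per document type.

```python
# Sketch: render up to max_pages pages of a PDF as PNG bytes at a given DPI.
# Requires PyMuPDF (pip install pymupdf); tune max_pages and dpi per use case.
import fitz  # PyMuPDF

def render_pages(pdf_path: str, max_pages: int = 10, dpi: int = 150) -> list[bytes]:
    """Render the first max_pages pages as PNG bytes at the given DPI."""
    doc = fitz.open(pdf_path)
    images = []
    for page_number in range(min(max_pages, doc.page_count)):
        pixmap = doc.load_page(page_number).get_pixmap(dpi=dpi)
        images.append(pixmap.tobytes("png"))
    doc.close()
    return images
```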
Answer-dependent processing
One thing you can try to lower the required processing power is to start off simple, and only advance to heavier processing if you don't get the desired answers.
For example, you can start off only looking at the first 10 pages, and see if you're able to properly solve the task at hand, such as extracting a piece of information from a document. Only if we're not able to extract that piece of data do we start including more pages. You can apply the same concept to the resolution of your images, starting with lower-resolution images and moving to higher resolution if required.
This kind of hierarchical processing reduces the required processing power, since most tasks can be solved by only looking at the first 10 pages, or by using lower-resolution images. Then, only if necessary, do we move on to processing more pages, or higher-resolution images.
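Here's a sketch of that escalation logic under some assumptions: render_pages is the page-rendering helper sketched above, ask_vlm is a hypothetical wrapper around your VLM call that returns None when the model can't answer, and the ladder of settings is purely illustrative.

```python
# Sketch of answer-dependent (hierarchical) processing: try cheap settings
# first and only escalate to more pages or higher resolution if needed.
ESCALATION_LADDER = [
    {"max_pages": 10, "dpi": 100},   # cheap first pass
    {"max_pages": 10, "dpi": 200},   # same pages, higher resolution
    {"max_pages": 50, "dpi": 200},   # more pages only if still unanswered
]

def answer_with_escalation(pdf_path: str, question: str):
    for settings in ESCALATION_LADDER:
        pages = render_pages(pdf_path, **settings)  # helper sketched above
        answer = ask_vlm(pages, question)           # hypothetical VLM call
        if answer is not None:
            return answer, settings                 # cheapest settings that worked
    return None, None
```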
Cost
Cost is an important consideration when using VLMs. I've processed a lot of documents, and I typically see around a 10x increase in the number of tokens when using images (VLMs) instead of text (LLMs). Since input tokens are typically the main driver of cost in long document tasks, using VLMs usually increases cost significantly. Note that for OCR, the point about more input tokens than output tokens doesn't apply, since OCR naturally produces a lot of output tokens when transcribing all the text in the images.
Thus, when using VLMs, it's extremely important to maximize your use of cached tokens, a topic I discussed in my recent article about optimizing LLMs for cost and latency.
Conclusion
In this article I discussed how you can apply vision language models (VLMs) to long documents to handle complex document understanding tasks. I discussed why VLMs are so important, and approaches for using VLMs on long documents. You can, for example, use VLMs for more advanced OCR, or apply VLMs directly to long documents, though with precautions regarding the required processing power, cost, and latency. I think VLMs are becoming more and more important, highlighted by the recent release of DeepSeek OCR. I thus believe VLMs for document understanding is a topic you should get involved with, and you should learn how to use VLMs for document processing applications.
👉 Find me on socials:
🧑‍💻 Get in touch
✍️ Medium