Vision language models (VLMs) are powerful models capable of taking both images and text as input, and responding with text. This allows us to perform visual information extraction on documents and images. In this article, I'll discuss the newly released Qwen 3 VL, and the powerful capabilities VLMs possess.
Qwen 3 VL was released a few weeks ago, initially with the 235B-A22B model, which is quite a large model. The Qwen team then released the 30B-A3B, and just now released the dense 4B and 8B versions. My goal for this article is to highlight the capabilities of vision language models and describe them at a high level. I'll use Qwen 3 VL as a specific example in this article, though there are many other high-quality VLMs available. I'm not affiliated in any way with Qwen when writing this article.

Why do we need vision language models
Vision language models are necessary because the alternative is to rely on OCR and feed the OCR-ed text into an LLM. This has several issues:
- OCR isn't perfect, and the LLM must deal with imperfect text extraction
- You lose the information contained in the visual position of the text
Traditional OCR engines like Tesseract have long been very important to document processing. OCR has allowed us to input images and extract the text from them, enabling further processing of the contents of the document. However, traditional OCR is far from perfect, and it can struggle with issues like small text, skewed images, vertical text, and so on. If you have poor OCR output, you'll struggle with all downstream tasks, whether you're using regex or an LLM. Feeding images directly to VLMs, instead of OCR-ed text to LLMs, is thus far more effective at utilizing the information.
The visual position of text is sometimes critical to understanding the meaning of the text. Imagine the example in the image below, where you have checkboxes highlighting which text is relevant, where some checkboxes are ticked off, and some are not. You might then have some text corresponding to each checkbox, where only the text beside the ticked-off checkbox is relevant. Extracting this information using OCR + LLMs is challenging, because you can't know which text the ticked checkbox belongs to. However, solving this task using vision language models is trivial.

I fed the image above to Qwen 3 VL, and it replied with the response shown below:
Based on the image provided, the documents that are checked off are:
- **Document 1** (marked with an "X")
- **Document 3** (marked with an "X")
**Document 2** is not checked (it's blank).
As you can see, Qwen 3 VL solved the problem correctly.
Another reason we need VLMs is that we also get video understanding. Truly understanding video clips would be immensely challenging using OCR, as a lot of the information in videos is not displayed as text, but rather shown directly as an image. OCR is thus not effective. However, the new generation of VLMs allows you to input hundreds of images, for example, representing a video, allowing you to perform video understanding tasks.
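As a rough illustration of the "hundreds of images" idea, a video is usually reduced to a fixed number of frames before it is handed to the model. The sketch below only computes which frame indices to keep. The function name `sample_frame_indices` and the choice of even spacing are assumptions made for illustration, not something Qwen 3 VL prescribes.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick up to num_samples evenly spaced frame indices from a video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Fewer frames than requested samples: keep every frame.
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the frame in the middle of each of the num_samples equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    # A 5-minute clip at 30 fps has 9000 frames; keep 300 of them.
    indices = sample_frame_indices(9000, 300)
    print(len(indices), indices[:5])
```

Each selected frame would then be decoded with a video library of your choice and passed to the model as one image entry.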
Vision language model tasks
There are many tasks you can apply vision language models to. I'll discuss a few of the most relevant tasks.
- OCR
- Information extraction
The data
I'll use the image below as an example image for my testing.

I'll use this image because it's an example of a real document, and very relevant to apply Qwen 3 VL to. Furthermore, I've cropped the image to its current shape, so that I can feed it at a high resolution into Qwen 3 VL on my local computer. Maintaining a high resolution is critical if you want to perform OCR on the image. I've extracted the JPG from a PDF using 600 DPI. Usually, 300 DPI is enough for OCR, but I kept a higher DPI just to be sure, which works on this small image.
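To make the DPI trade-off concrete, here is a small sketch of the arithmetic behind it. It assumes a full A3 page (the original format named in the document itself) measured as 297 × 420 mm. The helper name `pixels_at_dpi` is made up for this example.

```python
MM_PER_INCH = 25.4


def pixels_at_dpi(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page of the given physical size rendered at dpi."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


A3_MM = (297, 420)  # ISO 216 A3, portrait orientation

if __name__ == "__main__":
    for dpi in (300, 600):
        width_px, height_px = pixels_at_dpi(*A3_MM, dpi)
        megapixels = width_px * height_px / 1e6
        print(f"{dpi} DPI: {width_px} x {height_px} px ({megapixels:.1f} MP)")
```

Doubling the DPI quadruples the pixel count, which is why cropping to the region of interest matters when you want to keep a high resolution on a local machine.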
Prepare Qwen 3 VL
I need the following packages to run Qwen 3 VL:
torch
accelerate
pillow
torchvision
git+https://github.com/huggingface/transformers
You need to install Transformers from source (GitHub), as Qwen 3 VL is not yet available in the latest Transformers release.
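Before loading any weights, it can be handy to confirm that the installed Transformers build actually exposes the Qwen 3 VL class. The check below is a minimal sketch. The function name `qwen3_vl_available` is an assumption for this example, and it only tests that the class name used later in this article can be resolved.

```python
def qwen3_vl_available() -> bool:
    """Return True if the installed Transformers exposes the Qwen 3 VL model class."""
    try:
        import transformers

        model_class = getattr(transformers, "Qwen3VLForConditionalGeneration", None)
        return model_class is not None
    except Exception:
        # Transformers is missing, or it failed to import one of its dependencies.
        return False


if __name__ == "__main__":
    print("Qwen 3 VL support found:", qwen3_vl_available())
```

If this prints `False`, reinstalling Transformers from the GitHub source listed above is the first thing to try.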
The following code handles the imports, loads the model and processor, and creates an inference function:
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import os
import time

# Default: load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-4B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")


def _resize_image_if_needed(image_path: str, max_size: int = 1024) -> str:
    """Resize the image so neither side exceeds max_size. Keep the aspect ratio."""
    img = Image.open(image_path)
    width, height = img.size
    if width <= max_size and height <= max_size:
        return image_path
    ratio = min(max_size / width, max_size / height)
    new_width = int(width * ratio)
    new_height = int(height * ratio)
    img_resized = img.resize((new_width, new_height), Image.Resampling.LANCZOS)
    # Save the resized copy next to the original, e.g. doc.jpg -> doc_resized.jpg
    base_name = os.path.splitext(image_path)[0]
    ext = os.path.splitext(image_path)[1]
    resized_path = f"{base_name}_resized{ext}"
    img_resized.save(resized_path)
    return resized_path


def _build_messages(system_prompt: str, user_prompt: str, image_paths: list[str] | None = None, max_image_size: int | None = None):
    """Build the chat messages: one system message, then one user message with images and text."""
    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
    ]
    user_content = []
    if image_paths:
        if max_image_size is not None:
            processed_paths = [_resize_image_if_needed(path, max_image_size) for path in image_paths]
        else:
            processed_paths = image_paths
        user_content.extend([
            {"type": "image", "min_pixels": 512*32*32, "max_pixels": 2048*32*32, "image": image_path}
            for image_path in processed_paths
        ])
    user_content.append({"type": "text", "text": user_prompt})
    messages.append({
        "role": "user",
        "content": user_content,
    })
    return messages


def inference(system_prompt: str, user_prompt: str, max_new_tokens: int = 1024, image_paths: list[str] | None = None, max_image_size: int | None = None):
    """Run the model on the prompts (and optional images) and return the generated text."""
    messages = _build_messages(system_prompt, user_prompt, image_paths, max_image_size)
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt"
    )
    inputs = inputs.to(model.device)
    start_time = time.time()
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens so only the newly generated tokens are decoded
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    end_time = time.time()
    print(f"Time taken: {end_time - start_time} seconds")
    return output_text[0]
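The `min_pixels` and `max_pixels` values in the message-building code are written as multiples of 32*32, which suggests a pixel budget expressed in 32×32 blocks, with roughly one visual token per block. Treating that as an assumption, the sketch below estimates how many visual tokens an image would cost once it is scaled into the budget. It is a back-of-the-envelope estimate, not the processor's exact resizing logic, and `estimate_visual_tokens` is a hypothetical helper.

```python
import math

BLOCK = 32  # assumed number of pixels per side covered by one visual token


def estimate_visual_tokens(
    width: int,
    height: int,
    min_pixels: int = 512 * BLOCK * BLOCK,
    max_pixels: int = 2048 * BLOCK * BLOCK,
) -> int:
    """Rough visual token count after scaling the image into the pixel budget."""
    pixels = width * height
    if pixels > max_pixels:
        # Shrink uniformly, rounding down so the budget is never exceeded.
        scale = math.sqrt(max_pixels / pixels)
        to_blocks = math.floor
    elif pixels < min_pixels:
        # Enlarge uniformly, rounding up so the minimum is always reached.
        scale = math.sqrt(min_pixels / pixels)
        to_blocks = math.ceil
    else:
        scale = 1.0
        to_blocks = round
    blocks_wide = max(1, to_blocks(width * scale / BLOCK))
    blocks_high = max(1, to_blocks(height * scale / BLOCK))
    return blocks_wide * blocks_high


if __name__ == "__main__":
    for size in [(320, 320), (1024, 1024), (2048, 2048)]:
        print(size, "->", estimate_visual_tokens(*size), "tokens (estimate)")
```

Under this assumption, anything larger than about 2 megapixels is scaled down before the model sees it, so raising `max_pixels` trades memory and speed for detail.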
OCR
OCR is a task that most VLMs are trained for. You can, for example, read the technical reports of the Qwen VL models, where they mention how OCR data is part of the training set. To train VLMs to perform OCR, the model is given a series of images together with the text contained in those images. The model then learns to extract the text from the images.
I'll apply OCR to the image with the prompt below, which is the same prompt the Qwen team uses to perform OCR according to the Qwen 3 VL cookbook.
user_prompt = "Read all the text in the image."
Now I'll run the model. I named the test image we're running on example-doc-site-plan-cropped.jpg
system_prompt = """
You are a helpful assistant that can answer questions and help with tasks.
"""
user_prompt = "Read all the text in the image."
max_new_tokens = 1024
image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)
Which outputs:
Plan- og
bygningsetaten
Dato: 23.01.2014
Bruker: HKN
Målestokk 1:500
Ekvidistanse 1m
Høydegrunnlag: Oslo lokal
Koordinatsystem: EUREF89 - UTM sone 32
© Plan- og bygningsetaten,
Oslo kommune
Originalformat A3
Adresse:
Camilla Colletts vei 15
Gnr/Bnr:
.
Kartet er sammenstilt for:
.
PlotID: / Best.nr.:
27661 /
Deres ref: Camilla Colletts vei 15
Kommentar:
Gjeldende kommunedelplaner:
KDP-BB, KDP-13, KDP-5
Kartutsnittet gjelder vertikalinvå 2.
I tillegg finnes det regulering i
følgende vertikalinvå:
(Hvis blank: Ingen øvrige.)
Det er ikke registrert
naturn mangfold innenfor
Se tegnforklaring på eget ark.
Beskrivelse:
NR:
Dato:
Revidert dato:
Based on my testing, this output is completely correct: it covers all the text in the image and extracts all the characters correctly.
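If you want to put a number on "completely correct", one common approach is the character error rate (CER): the edit distance between the model output and a manually transcribed reference, divided by the length of the reference. The sketch below is a plain-Python version. The reference transcription is something you would have to create yourself, and the function names are chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the reference length (0.0 means a perfect match)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "Dato: 23.01.2014"
    print(character_error_rate(reference, "Dato: 23.01.2014"))
    print(character_error_rate(reference, "Dato: 23.01.2024"))
```

A CER of 0.0 corresponds to the situation described above, where every character is extracted correctly.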
Information extraction
You can also perform information extraction using vision language models. This can, for example, be used to extract important metadata from images. You usually also want to extract this metadata into a JSON format, so it's easily parsable and can be used for downstream tasks. In this example, I'll extract:
- Date – 23.01.2014 in this example
- Address – Camilla Colletts vei 15 in this example
- Gnr (street number) – which in the test image is a blank field
- Målestokk (scale) – 1:500
I'm running the following code:
user_prompt = """
Extract the following information from the image, and reply in JSON format:
{
    "date": "The date of the document. In format YYYY-MM-DD.",
    "address": "The address mentioned in the document.",
    "gnr": "The street number (Gnr) mentioned in the document.",
    "scale": "The scale (målestokk) mentioned in the document.",
}
If you can't find the information, reply with None. The return object must be a valid JSON object. Reply only the JSON object, no other text.
"""
max_new_tokens = 1024
image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)
Which outputs:
{
    "date": "2014-01-23",
    "address": "Camilla Colletts vei 15",
    "gnr": "15",
    "scale": "1:500"
}
The JSON object is in a valid format, and Qwen has successfully extracted the date, address, and scale fields. However, Qwen has also returned a gnr. Initially, when I saw this result, I thought it was a hallucination, since the Gnr field in the test image is blank. However, Qwen has made a natural assumption that the Gnr can be read from the address, which is correct in this instance, given that the prompt describes Gnr as the street number.
To be sure of its ability to reply None when it can't find anything, I asked Qwen to extract the Bnr (building number), which isn't available in this example. Running the code below:
user_prompt = """
Extract the following information from the image, and reply in JSON format:
{
    "date": "The date of the document. In format YYYY-MM-DD.",
    "address": "The address mentioned in the document.",
    "Bnr": "The building number (Bnr) mentioned in the document.",
    "scale": "The scale (målestokk) mentioned in the document.",
}
If you can't find the information, reply with None. The return object must be a valid JSON object. Reply only the JSON object, no other text.
"""
max_new_tokens = 1024
image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)
I get:
{
    "date": "2014-01-23",
    "address": "Camilla Colletts vei 15",
    "Bnr": None,
    "scale": "1:500"
}
So as you can see, Qwen does manage to let us know when information is not present in the document.
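One practical detail when using this output downstream: a bare `None` is Python syntax, not JSON (JSON uses `null`), so `json.loads` would reject the second response as printed. The sketch below is one possible way to parse both variants. The helper name `parse_model_json` and the fallback to `ast.literal_eval` are choices made for this example, not part of Qwen or Transformers.

```python
import ast
import json
import re

FENCE = "`" * 3  # Markdown code fence that some models wrap around JSON


def parse_model_json(raw: str) -> dict:
    """Parse a model reply that should be a JSON object, tolerating a bare None."""
    text = raw.strip()
    # Drop a surrounding Markdown code fence if the model added one.
    fenced = re.match(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax, which accepts None.
        # Raises ValueError or SyntaxError if the text is not a literal either.
        parsed = ast.literal_eval(text)
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object at the top level")
    return parsed


if __name__ == "__main__":
    reply = '{"date": "2014-01-23", "Bnr": None, "scale": "1:500"}'
    print(parse_model_json(reply))
```

Asking the model for `null` instead of `None` in the prompt is another option, though that may change what the model returns.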
Vision language models' downsides
I would also like to note that there are some issues with vision language models as well. The image I tested OCR and information extraction with is a relatively simple image. To truly test the capabilities of Qwen 3 VL, I would have to expose it to more difficult tasks, for example, extracting more text from a longer document or making it extract more metadata fields.
The main current downsides with VLMs, from what I've seen, are:
- Sometimes missing text with OCR
- Inference is slow
VLMs missing text when performing OCR is something I've observed a few times. When it happens, the VLM typically misses an entire section of the document and completely ignores its text. This is naturally very problematic, as it could miss text that is critical for downstream tasks like keyword searches. The reason this happens is a complicated topic that's out of scope for this article, but it's a problem you should be aware of if you're performing OCR with VLMs.
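One cheap safeguard against silently skipped sections is to check the OCR output for labels you know must appear in a given document type. The sketch below does a case-insensitive, whitespace-tolerant substring check. The function name and the list of expected labels are assumptions for illustration, and a passing check does not prove that nothing was missed.

```python
def find_missing_keywords(ocr_text: str, expected_keywords: list[str]) -> list[str]:
    """Return the expected keywords that do not occur in the OCR output."""
    # Lowercase aggressively and collapse whitespace so line breaks do not matter.
    normalized_text = " ".join(ocr_text.casefold().split())
    missing = []
    for keyword in expected_keywords:
        normalized_keyword = " ".join(keyword.casefold().split())
        if normalized_keyword not in normalized_text:
            missing.append(keyword)
    return missing


if __name__ == "__main__":
    ocr_text = "Dato: 23.01.2014\nBruker: HKN\nMålestokk 1:500"
    expected = ["Dato", "Målestokk", "Koordinatsystem"]
    print("Missing labels:", find_missing_keywords(ocr_text, expected))
```

If a label comes back as missing, rerunning the OCR on a crop of the affected region is a reasonable next step.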
Furthermore, VLMs require a lot of processing power. I'm running locally on my PC, and even then I'm running a very small model. I started experiencing memory issues when I simply wanted to process an image with dimensions of 2048×2048, which is problematic if I want to perform text extraction from larger documents. You can thus imagine how resource-intensive it becomes to:
- Process more images at once (for example, a 10-page document)
- Process documents at higher resolutions
- Use a larger VLM, with more parameters
Conclusion
In this article, I've discussed VLMs. I started off by discussing why we need VLMs, highlighting how some tasks require both the text and the visual position of the text. Furthermore, I highlighted some tasks you can perform with VLMs and how Qwen 3 VL was able to perform these tasks. I think the vision modality will become more and more important in the coming years. Up until a year ago, almost all focus was on pure text models. However, to achieve even more powerful models, we need to utilize the vision modality, which is where I believe VLMs will be incredibly important.