In this article, I’ll discuss how you can apply vision language models (VLMs) to long-context document understanding tasks. This means applying VLMs to either very long documents (over 100 pages) or very dense documents that contain a lot of information, such as drawings. I’ll discuss what to consider when applying VLMs and what kinds of tasks you can perform with them.

Why do we need VLMs?
I’ve discussed VLMs a lot in my previous articles and covered why they are so important for understanding the contents of some documents. The main reason VLMs are required is that a lot of the information in documents requires visual input to understand.
The alternative to VLMs is to use OCR, and then use an LLM. The problem here is that you’re only extracting the text from the document, and not including the visual information, such as:
- Where different text is positioned relative to other text
- Non-text information (essentially everything that isn’t a letter, such as symbols or drawings)
- Where text is positioned relative to other information
This information is often critical to truly understanding a document. You’re thus often better off using VLMs, feeding in the page images directly so the model can also interpret the visual information.
For long documents, using VLMs is a challenge, considering you need a lot of tokens to represent visual information. Processing hundreds of pages is thus a big challenge. However, with a lot of recent advancements in VLM technology, the models have gotten better and better at compressing visual information into reasonable context lengths, making it practical to apply VLMs to long documents for document understanding tasks.

OCR using VLMs
One good option to process long documents and still include the visual information is to use VLMs to perform OCR. Traditional OCR, like Tesseract, extracts only the text from documents, together with bounding boxes. However, VLMs are also trained to perform OCR and can perform more advanced text extraction, such as:
- Extracting Markdown
- Explaining purely visual information (i.e., if there’s a drawing, explain the drawing with text)
- Adding missing information (i.e., if there’s a box saying Date with a blank field after it, you can instruct the model to output Date <empty>)
Recently, DeepSeek released DeepSeek OCR, a powerful VLM-based OCR model that has gotten a lot of attention and traction, making VLMs for OCR more popular.
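To make this concrete, here is a minimal sketch of how you might call a VLM for OCR through an OpenAI-compatible chat endpoint (for example, a local vLLM server). The endpoint URL, model name, and prompt wording are all my own assumptions, not from any particular product; the prompt asks for all three behaviors listed above.

```python
import base64
import json
import urllib.request

# Placeholder prompt covering the three VLM-OCR behaviors discussed above.
OCR_PROMPT = (
    "Extract all text on this page as Markdown. "
    "Describe purely visual elements (drawings, photos) inside "
    "<image>...</image> tags. If a labeled field is blank, output the "
    "label followed by <empty>."
)

def build_ocr_request(image_bytes: bytes, model: str) -> dict:
    """Build an OpenAI-style chat request with one page image attached."""
    image_b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": OCR_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }

def ocr_page(image_bytes: bytes, model: str,
             endpoint: str = "http://localhost:8000/v1/chat/completions") -> str:
    """Send one page image to the VLM and return the extracted text.
    The endpoint above is a placeholder for your own server."""
    payload = json.dumps(build_ocr_request(image_bytes, model)).encode()
    request = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["choices"][0]["message"]["content"]
```

You would call `ocr_page` once per rendered page image and concatenate the results into a single text document for downstream LLM use.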
Markdown
Markdown is very powerful because you extract formatted text. This allows the model to:
- Preserve headers and subheaders
- Represent tables accurately
- Mark bold text
This allows the model to extract more representative text that more accurately depicts the contents of the document. If you now apply LLMs to this text, they will perform far better than if you applied them to plain text extracted with traditional OCR.
LLMs perform better on formatted text like Markdown than on plain text extracted using traditional OCR.
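As a toy illustration of why this matters for tables (the values here are made up), traditional OCR tends to flatten a table into a stream of words, while a VLM can reconstruct the structure in Markdown:

```text
Traditional OCR output:
Name Revenue Acme 10M Globex 25M

VLM Markdown output:
| Name   | Revenue |
|--------|---------|
| Acme   | 10M     |
| Globex | 25M     |
```

An LLM reading the first version has to guess which number belongs to which company; in the second version, the association is explicit.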
Explain visual information
Another thing you can use VLM OCR for is to explain visual information. For example, if you have a drawing with no text in it, traditional OCR would not extract any information, since it’s only trained to extract text characters. However, you can use VLMs to explain the visual contents of the image.
Imagine you have the following document:
This is the introduction text of the document
<image showing the Eiffel tower>
This is the conclusion of the document
If you applied traditional OCR like Tesseract, you would get the following output:
This is the introduction text of the document
This is the conclusion of the document
This is clearly an issue, since you’re not including information about the image showing the Eiffel Tower. Instead, you should use VLMs, which would output something like:
This is the introduction text of the document
<image>
This image depicts the Eiffel Tower during the day
</image>
This is the conclusion of the document
If you used an LLM on the first text, it of course wouldn’t know the document contains an image of the Eiffel Tower. However, if you used an LLM on the second text extracted with a VLM, the LLM would naturally be better at responding to questions about the document.
Add missing information
You can also prompt VLMs to explicitly mark missing information. To understand this concept, look at the image below:

If you applied traditional OCR to this image, you would get:
Address Road 1
Date
Company Google
However, it would be more representative if you used VLMs, which, if instructed, could output:
Address Road 1
Date <empty>
Company Google
This is more informative because we’re telling any downstream model that the date field is empty. Without this information, it’s impossible to know later whether the date is simply missing, the OCR failed to extract it, or something else went wrong.
However, OCR using VLMs still suffers from some of the same issues as traditional OCR, because the downstream model is not processing the visual information directly. You’ve probably heard the saying that an image is worth a thousand words, which often holds true for visual information in documents. Yes, a VLM used as OCR can provide a text description of a drawing, but that text will never be as descriptive as the drawing itself. Thus, I argue that in many cases you’re better off processing documents directly with VLMs, as I’ll cover in the following sections.
Open source vs closed source models
There are a lot of VLMs available. I follow the HuggingFace VLM leaderboard to keep track of new high-performing models. According to this leaderboard, you should go for either Gemini 2.5 Pro or GPT-5 if you want to use closed-source models through an API. From my experience, these are great options that work well for long document understanding and handling complex documents.
However, you might also want to utilize open-source models due to privacy, cost, or to have more control over your own application. In this case, SenseNova-V6-5-Pro tops the leaderboard. I haven’t tried this model personally, but I’ve used Qwen 3 VL a lot, which I have good experience with. Qwen has also released a specific cookbook for long document understanding.
VLMs on long documents
In this section, I’ll talk about applying VLMs to long documents and the considerations you have to make when doing it.
Processing power considerations
If you’re running an open-source model, one of your main considerations is how large a model you can run and how long inference takes. You’ll typically need access to a large GPU, at least an A100 in most cases. Luckily, these are widely available and relatively cheap (typically 1.5–2 USD per hour at a lot of cloud providers). However, you must also consider the latency you can accept. Running VLMs requires a lot of processing, and you have to weigh the following factors:
- How long is acceptable to spend processing one request?
- What image resolution do you need?
- How many pages do you need to process?
If you have a live chat, for example, you need fast responses; however, if you’re simply processing in the background, you can allow for longer processing times.
Image resolution is also an important consideration. If you need to be able to read the text in documents, you need high-resolution images, typically over 2048×2048, though this naturally depends on the document. Detailed drawings with small text, for example, will require even higher resolution. Since increasing resolution greatly increases processing time, you should aim for the lowest resolution that still allows you to perform all the tasks you want to perform. The number of pages is a similar consideration. Processing more pages is often necessary to access all the information in a document. However, the most important information is often contained early in the document, so you may get away with processing only the first 10 pages, for example.
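To make the resolution trade-off concrete, here is a small helper (my own sketch, not from any particular library) that computes the downscaled size of a page so its longest side stays under a chosen cap, preserving aspect ratio. You would apply this before encoding pages for the VLM:

```python
def fit_resolution(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    """Downscale (width, height) so the longest side is at most max_side,
    preserving aspect ratio; leave small images untouched."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)
```

Starting with `max_side=1024` and only re-running at 2048 when the answer is unreadable is a simple way to keep token counts down on most pages.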
Answer-dependent processing
One way to lower the required processing power is to start off simple and only advance to heavier processing if you don’t get the desired answers.
For example, you could start off by looking only at the first 10 pages and checking whether you’re able to properly solve the task at hand, such as extracting a piece of information from a document. Only if you’re unable to extract that information do you start looking at more pages. You can apply the same concept to image resolution: start with lower-resolution images and move to higher resolution only when required.
This kind of hierarchical processing reduces the required processing power, since most tasks can be solved by looking only at the first 10 pages or by using lower-resolution images. Only when necessary do we move on to more pages or higher-resolution images.
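The escalation loop described above can be sketched as follows. Here `ask_vlm` is a hypothetical stand-in for your actual VLM call (it should return an answer, or None on failure), and the budgets and resolutions are illustrative defaults, not recommendations:

```python
def extract_with_escalation(pages, ask_vlm,
                            page_budgets=(10, 50, None),
                            resolutions=(1024, 2048)):
    """Try the cheapest configuration first and escalate only on failure.

    `pages` is the full list of page images. `ask_vlm(pages, resolution)`
    is a stand-in for your VLM call; it returns the answer, or None if
    the model could not find one. A budget of None means "all pages".
    """
    for resolution in resolutions:          # lower resolution first
        for budget in page_budgets:         # fewer pages first
            subset = pages if budget is None else pages[:budget]
            answer = ask_vlm(subset, resolution)
            if answer is not None:
                return answer
    return None  # even the most expensive configuration failed
```

Since most requests succeed at the first configuration, the average cost per request stays close to the cheapest tier.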
Cost
Cost is an important consideration when using VLMs. I’ve processed a lot of documents, and I typically see around a 10x increase in the number of input tokens when using images (VLMs) instead of text (LLMs). Since input tokens are often the main cost driver in long-document tasks, using VLMs usually increases cost significantly. Note that OCR is an exception to input tokens dominating: OCR naturally produces a lot of output tokens, since it outputs all the text contained in the images.
Thus, when using VLMs, it is incredibly important to maximize your usage of cached tokens, a topic I discussed in my recent article about optimizing LLMs for cost and latency.
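A back-of-envelope estimate makes both effects visible. The numbers below are purely illustrative: the ~10x image-to-text token ratio is from my experience above, and the per-token price and cache discount vary by provider:

```python
def input_cost_usd(num_pages: int, tokens_per_page: int,
                   usd_per_million_tokens: float,
                   cached_fraction: float = 0.0,
                   cache_discount: float = 0.75) -> float:
    """Back-of-envelope input-token cost for one request over a document.
    Cached tokens are billed at (1 - cache_discount) of the normal rate.
    All parameters are illustrative assumptions, not real pricing."""
    tokens = num_pages * tokens_per_page
    uncached = tokens * (1 - cached_fraction)
    cached = tokens * cached_fraction * (1 - cache_discount)
    return (uncached + cached) * usd_per_million_tokens / 1e6

# Example: 100 pages at ~600 text tokens vs ~6,000 image tokens per page
text_cost = input_cost_usd(100, 600, usd_per_million_tokens=1.0)
image_cost = input_cost_usd(100, 6_000, usd_per_million_tokens=1.0)
```

With these assumed numbers, the image-based request costs 10x the text-based one, and a fully cached repeat request drops to a quarter of the price, which is why cache hit rate matters so much for VLM workloads.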
Conclusion
In this article, I discussed how you can apply vision language models (VLMs) to long documents to handle complex document understanding tasks. I discussed why VLMs are so important and approaches to using VLMs on long documents. You can, for example, use VLMs for more complex OCR, or apply VLMs directly to long documents, though with precautions about required processing power, cost, and latency. I think VLMs are becoming more and more important, highlighted by the recent release of DeepSeek OCR. VLMs for document understanding is thus a topic you should get familiar with, and I encourage you to learn how to use VLMs in document processing applications.