This article is a high-level overview of metadata extraction from documents, highlighting the main considerations you must make along the way.

Why extract document metadata
First, it’s important to clarify why we need to extract metadata from documents. After all, if the information is present in the documents already, can we not just find the information using RAG or other similar approaches?
In many cases, RAG would be able to find specific data points, but pre-extracting metadata simplifies a lot of downstream tasks. Using metadata, you can, for example, filter your documents on data points such as:
- Document type
- Addresses
- Dates
Furthermore, if you have a RAG system in place, it will in many cases benefit from the extra metadata, because you present that information more explicitly to the LLM. For example, suppose you ask a question related to dates. It’s easier to simply provide the pre-extracted document dates to the model than to have the model dig the dates out at inference time. This saves on both cost and latency, and is likely to improve the quality of your RAG responses.
How to extract metadata
I’m highlighting three main approaches to extracting metadata, going from simplest to most complex:
- Regex
- OCR + LLM
- Vision LLMs

Regex
Regex is the simplest and most consistent approach to extracting metadata. Regex works well if you know the exact format of the data beforehand. For example, if you’re processing lease agreements, and you know the date is written as dd.mm.yyyy, always right after the words “Date: “, then regex is the way to go.
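For the fixed-format case above, a minimal sketch (the lease snippet is a made-up example):

```python
import re

# Made-up lease-agreement text following the assumed "Date: dd.mm.yyyy" convention.
text = "Lease Agreement\nDate: 22.10.2024\nTenant: ..."

# Match exactly the fixed format: "Date: " followed by dd.mm.yyyy.
match = re.search(r"Date:\s*(\d{2})\.(\d{2})\.(\d{4})", text)
if match:
    day, month, year = match.groups()
    print(f"{year}-{month}-{day}")  # normalize to ISO 8601
```

This works precisely because the format is known and fixed; the pattern encodes that assumption directly.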
Unfortunately, most document processing is more complex than this. You’ll have to deal with inconsistent documents, with challenges like:
- Dates are written in different places in the document
- The text is missing some characters because of poor OCR
- Dates are written in different formats (e.g., mm.dd.yyyy, 22nd of October, December 22, etc.)
Because of this, we usually have to move on to more complex approaches, like OCR + LLM, which I’ll describe in the next section.
OCR + LLM
A powerful approach to extracting metadata is OCR + LLM. This process starts with applying OCR to a document to extract the text contents. You then take the OCR-ed text and prompt an LLM to extract the metadata (the document date, for example) from it. This usually works incredibly well, because LLMs are good at understanding context (which dates are relevant and which are not) and can handle dates written in all sorts of formats. LLMs will, in many cases, also understand both European (dd.mm.yyyy) and American (mm.dd.yyyy) date conventions.
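The OCR + LLM flow can be sketched as follows. The prompt wording and the `call_llm` helper are assumptions, not any specific provider’s API; the OCR text would come from an engine such as pytesseract, and `call_llm` stands in for whatever LLM client you use:

```python
import json

def build_date_prompt(ocr_text: str) -> str:
    """Ask the LLM to pick out the relevant document date."""
    return (
        "Extract the main date of the following document.\n"
        'Respond with JSON: {"date": "YYYY-MM-DD"} or {"date": null} if absent.\n\n'
        f"Document text:\n{ocr_text}"
    )

def parse_date_response(raw: str):
    """Parse the LLM's JSON reply; return None if it is malformed."""
    try:
        return json.loads(raw).get("date")
    except json.JSONDecodeError:
        return None

prompt = build_date_prompt("Invoice\nIssued on December 22, 2024 ...")
# raw = call_llm(prompt)  # hypothetical LLM call, e.g. a chat completion
raw = '{"date": "2024-12-22"}'  # example of what the model might return
date = parse_date_response(raw)
```

Asking for a constrained JSON reply makes the output easy to validate, and the parser degrades gracefully if the model ignores the format.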

However, in some scenarios, the metadata you want to extract requires visual information. In these scenarios, you need to apply the most advanced technique: vision LLMs.
Vision LLMs
Using vision LLMs is the most complex approach, with both the highest latency and cost. In most scenarios, running vision LLMs will be far more expensive than running pure text-based LLMs.
When running vision LLMs, you usually have to ensure images have a high enough resolution that the model can read the text in the documents. This requires a lot of visual tokens, which makes the processing expensive. However, given high-resolution images, vision LLMs can usually extract complex, layout-dependent information that OCR + LLM cannot, for example, the information provided in the image below.

Vision LLMs also work well in scenarios with handwritten text, where OCR might struggle.
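As a sketch, here is how a high-resolution page image could be packaged for a vision LLM using an OpenAI-style chat message; the exact schema varies by provider, so treat the structure below as an assumption:

```python
import base64

def image_message(image_bytes: bytes, question: str) -> dict:
    """Package a page image as a base64 data URL inside an
    OpenAI-style chat message (schema is provider-dependent)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }

# In a real pipeline you would render the PDF page at a high DPI
# (e.g. 300) so small print stays legible to the model; the bytes
# below are just a placeholder.
msg = image_message(b"\x89PNG...", "Which checkboxes are ticked on this form?")
```

Note that the base64 payload grows with resolution, which is exactly where the visual-token cost mentioned above comes from.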
Challenges when extracting metadata
As I pointed out earlier, documents are complex and come in various formats. There are thus a lot of challenges you have to deal with when extracting metadata from documents. I’ll highlight three of the main challenges:
- When to use vision vs OCR + LLM
- Dealing with handwritten text
- Dealing with long documents
When to use vision LLMs vs OCR + LLM
Preferably, we would use vision LLMs for all metadata extraction. However, this is usually not feasible due to the cost of running vision LLMs. We thus have to decide when to use vision LLMs and when to use OCR + LLM.
One thing you can do is to decide whether the metadata point you want to extract requires visual information or not. If it’s a date, OCR + LLM will work pretty well in almost all scenarios. However, if you know you’re dealing with checkboxes like in the example task I mentioned above, you need to apply vision LLMs.
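One way to operationalize this decision is a simple per-field routing rule, sketched below; the field names are illustrative assumptions:

```python
# Fields that depend on visual cues (checkboxes, signatures, stamps)
# are routed to a vision LLM; purely textual fields go to OCR + LLM.
VISUAL_FIELDS = {"checkboxes", "signature_present", "stamp"}

def choose_pipeline(field: str) -> str:
    """Pick the extraction pipeline for a given metadata field."""
    return "vision_llm" if field in VISUAL_FIELDS else "ocr_llm"

print(choose_pipeline("date"))        # text-only field
print(choose_pipeline("checkboxes"))  # visually grounded field
```

Keeping the routing explicit per field means most of your volume runs through the cheap OCR + LLM path, with the vision LLM reserved for the fields that truly need it.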
Dealing with handwritten text
One issue with the approach mentioned above is that some documents might contain handwritten text, which traditional OCR is not particularly good at extracting. If your OCR is poor, the LLM extracting metadata will also perform poorly. Thus, if you know you’re dealing with handwritten text, I recommend applying vision LLMs, as they are way better at dealing with handwriting, based on my own experience. It’s important to be aware that many documents will contain both born-digital text and handwriting.
Dealing with long documents
In many cases, you’ll also have to deal with extremely long documents. If so, you have to consider how far into a document a metadata point might appear.
The reason this matters is cost: processing extremely long documents requires a lot of input tokens for your LLM, which is expensive. In most cases, the important piece of information (a date, for example) appears early in the document, in which case you won’t need many input tokens. In other situations, however, the relevant piece of information might sit on page 94, in which case you need a lot of input tokens.
The issue, of course, is that you don’t know beforehand which page the metadata is on. Thus, you essentially have to make a decision, like only looking at the first 100 pages of a given document and assuming the metadata appears there for almost all documents. You’ll miss a data point on the rare occasion the data sits on page 101 or later, but you’ll save substantially on costs.
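A minimal sketch of this cutoff heuristic, assuming the document has already been split into per-page text:

```python
def pages_for_extraction(pages: list[str], max_pages: int = 100) -> str:
    """Concatenate only the first max_pages pages to cap input tokens.
    The 100-page default mirrors the heuristic described above; tune it
    to where the metadata actually appears in your document set."""
    return "\n\n".join(pages[:max_pages])

doc = [f"page {i} text" for i in range(1, 201)]  # a 200-page document
snippet = pages_for_extraction(doc)
# snippet contains pages 1-100 only; pages 101+ are dropped.
```

The right cutoff is an empirical question: measure on a sample of your documents how often the metadata falls past the limit before committing to a value.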
Conclusion
In this article, I’ve discussed how you can consistently extract metadata from your documents. This metadata is often critical for downstream tasks like filtering documents on specific data points. I covered three main approaches to metadata extraction (regex, OCR + LLM, and vision LLMs) along with some of the challenges you’ll face. I think metadata extraction remains a task that doesn’t require a lot of effort but can provide a lot of value downstream. I thus believe metadata extraction will remain important in the coming years, though I expect more and more of it to move to purely vision LLMs instead of OCR + LLM.





