
How to Consistently Extract Metadata from Complex Documents

Learn how to extract important pieces of information from your documents

Learn how to consistently extract metadata from complex documents. Image by ChatGPT.

Documents contain vast amounts of important information. However, this information is, in many cases, hidden deep within the contents of the documents and is thus hard to utilize for downstream tasks. In this article, I’ll discuss how to consistently extract metadata from your documents, covering approaches to metadata extraction and challenges you’ll face along the way.

This article is a high-level overview of document metadata extraction, highlighting the different considerations you must make along the way.

This infographic highlights the main contents of this article. I’ll first discuss why we need to extract document metadata, and how it’s useful for downstream tasks. Continuing, I’ll discuss approaches to extracting metadata: regex, OCR + LLM, and vision LLMs. Lastly, I’ll discuss different challenges when performing metadata extraction, such as choosing between vision LLMs and OCR + LLM, handwritten text, and dealing with long documents. Image by ChatGPT.

Why extract document metadata

First, it’s important to clarify why we need to extract metadata from documents. After all, if the information is present in the documents already, can we not just find the information using RAG or other similar approaches?

In many cases, RAG would be able to find specific data points, but pre-extracting metadata simplifies a lot of downstream tasks. Using metadata, you can, for example, filter your documents based on data points such as:

  • Document type
  • Addresses
  • Dates

Furthermore, if you have a RAG system in place, it will, in many cases, benefit from additionally provided metadata. This is because you present the additional information (the metadata) more clearly to the LLM. For example, suppose you ask a question related to dates. In that case, it’s easier to simply provide the pre-extracted document dates to the model, instead of having the model extract the dates during inference time. This saves on both costs and latency, and is likely to improve the quality of your RAG responses.
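To make the filtering benefit concrete, here is a minimal sketch of filtering documents on pre-extracted metadata. The document structure and field names (`doc_type`, `date`) are illustrative assumptions, not a fixed schema:

```python
from datetime import date

# Hypothetical documents with pre-extracted metadata attached.
documents = [
    {"id": 1, "metadata": {"doc_type": "lease", "date": date(2023, 5, 1)}},
    {"id": 2, "metadata": {"doc_type": "invoice", "date": date(2024, 2, 10)}},
    {"id": 3, "metadata": {"doc_type": "lease", "date": date(2024, 7, 19)}},
]

def filter_documents(docs, doc_type=None, after=None):
    """Filter documents on pre-extracted metadata instead of re-reading contents."""
    result = []
    for doc in docs:
        meta = doc["metadata"]
        if doc_type is not None and meta["doc_type"] != doc_type:
            continue
        if after is not None and meta["date"] <= after:
            continue
        result.append(doc)
    return result

leases_2024 = filter_documents(documents, doc_type="lease", after=date(2023, 12, 31))
print([d["id"] for d in leases_2024])  # → [3]
```

Because the metadata is already structured, this filter is a cheap in-memory (or database) lookup, with no LLM call at query time.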

How to extract metadata

I’m highlighting three main approaches to extracting metadata, going from simplest to most complex:

  • Regex
  • OCR + LLM
  • Vision LLMs

This image highlights the three main approaches to extracting metadata. The simplest approach is to use Regex, though it doesn’t work in many situations. A more powerful approach is OCR + LLM, which works well in most cases, but misses in situations where you’re dependent on visual information. If visual information is important, you can use vision LLMs, the most powerful approach. Image by ChatGPT.

Regex

Regex is the simplest and most consistent approach to extracting metadata. Regex works well if you know the exact format of the data beforehand. For example, if you’re processing lease agreements, and you know the date is written as dd.mm.yyyy, always right after the words “Date: “, then regex is the way to go.
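The lease-agreement scenario above can be sketched in a few lines. The sample text is made up, but the pattern matches the assumed fixed format (dd.mm.yyyy right after “Date: ”):

```python
import re

# Matches a dd.mm.yyyy date immediately after the label "Date: ",
# the fixed format assumed in this example.
DATE_PATTERN = re.compile(r"Date:\s*(\d{2}\.\d{2}\.\d{4})")

text = "Lease Agreement\nDate: 22.10.2024\nTenant: Jane Doe"
match = DATE_PATTERN.search(text)
print(match.group(1) if match else None)  # → 22.10.2024
```

Note how rigid this is: any deviation from the expected label or date format and the pattern silently returns nothing.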

Unfortunately, most document processing is more complex than this. You’ll have to deal with inconsistent documents, with challenges like:

  • Dates are written in different places in the document
  • The text is missing some characters because of poor OCR
  • Dates are written in different formats (e.g., mm.dd.yyyy, 22nd of October, December 22, etc.)

Because of this, we usually have to move on to more complex approaches, like OCR + LLM, which I’ll describe in the next section.

OCR + LLM

A powerful approach to extracting metadata is to use OCR + LLM. This process starts with applying OCR to a document to extract the text contents. You then take the OCR-ed text and prompt an LLM to extract the date from the document. This usually works incredibly well, because LLMs are good at understanding the context (which date is relevant, and which dates are irrelevant), and can understand dates written in all sorts of different formats. LLMs will, in many cases, also be able to understand both European (dd.mm.yyyy) and American (mm.dd.yyyy) date standards.

This figure shows the OCR + LLM approach. On the right side, you see that we first perform OCR on the document, which extracts the document text. We can then prompt the LLM to read that text and extract a date from the document. The LLM then outputs the extracted date from the document. Image by the author.
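The pipeline above can be sketched as follows. `pytesseract` is one common OCR choice, and `call_llm` is a hypothetical stand-in for whichever LLM API you use; only the prompt construction is fixed here:

```python
def build_extraction_prompt(ocr_text: str) -> str:
    """Ask the LLM for the document's main date in a fixed output format."""
    return (
        "Below is the OCR-ed text of a document.\n"
        "Extract the document's main date and answer with only the date "
        "in YYYY-MM-DD format, or NONE if no date is present.\n\n"
        f"---\n{ocr_text}\n---"
    )

def extract_date(image_path: str, call_llm) -> str:
    # Imports kept local so the sketch runs without OCR installed.
    import pytesseract
    from PIL import Image

    ocr_text = pytesseract.image_to_string(Image.open(image_path))
    return call_llm(build_extraction_prompt(ocr_text))
```

Pinning the output format in the prompt (a single `YYYY-MM-DD` date or `NONE`) is what makes the extraction consistent enough to store as structured metadata.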

However, in some scenarios, the metadata you want to extract requires visual information. In these scenarios, you need to apply the most advanced technique: vision LLMs.

Vision LLMs

Using vision LLMs is the most complex approach, with both the highest latency and cost. In most scenarios, running vision LLMs will be far more expensive than running pure text-based LLMs.

When running vision LLMs, you usually have to ensure images have high resolution, so the vision LLM can read the text of the documents. This requires a lot of visual tokens, which makes the processing expensive. However, vision LLMs with high-resolution images will usually be able to extract complex information that OCR + LLM cannot, such as the information in the image below.

This image highlights a task where you need to use vision LLMs. If you OCR this image, you’ll be able to extract the words “Document 1, Document 2, Document 3,” but the OCR will completely miss the filled-in checkbox. This is because OCR is trained to extract characters, and not figures, like the checkbox with a circle in it. Attempting to use OCR + LLM will thus fail in this scenario. However, if you instead use a vision LLM on this problem, it will easily be able to extract which document is checked off. Image by the author.

Vision LLMs also work well in scenarios with handwritten text, where OCR might struggle.
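As a minimal sketch, here is how a vision LLM request might be assembled. The payload shape follows the OpenAI-style chat API and the model name is illustrative; adjust both for your provider:

```python
import base64

def image_to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode an image as a data URL, a format many vision LLM APIs accept."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

def build_vision_request(image_bytes: bytes, question: str) -> dict:
    # OpenAI-style payload; other providers use a similar text + image shape.
    return {
        "model": "gpt-4o",  # illustrative model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": image_to_data_url(image_bytes)}},
            ],
        }],
    }
```

Note that the base64-encoded image is what drives the token count: a single high-resolution page can cost far more tokens than its full OCR-ed text.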

Challenges when extracting metadata

As I pointed out earlier, documents are complex and come in various formats. There are thus a lot of challenges you have to deal with when extracting metadata from documents. I’ll highlight three of the main challenges:

  • When to use vision vs OCR + LLM
  • Dealing with handwritten text
  • Dealing with long documents

When to use vision LLMs vs OCR + LLM

Preferably, we would use vision LLMs for all metadata extraction. However, this is usually not possible due to the cost of running vision LLMs. We thus have to decide when vision LLMs are worth it and when OCR + LLM suffices.

A good heuristic is to decide whether the metadata point you want to extract requires visual information. If it’s a date, OCR + LLM will work well in almost all scenarios. However, if you know you’re dealing with checkboxes, like in the example task above, you need to apply vision LLMs.
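This heuristic can be expressed as a simple routing rule. The set of “visual” field types below is an illustrative assumption; you would tune it to your own document collection:

```python
# Fields that depend on visual layout go to the vision LLM;
# plain-text fields go to the cheaper OCR + LLM pipeline.
VISUAL_FIELDS = {"checkbox", "signature", "stamp", "table_layout"}

def choose_pipeline(field: str) -> str:
    """Route a metadata field to 'vision_llm' or 'ocr_llm'."""
    return "vision_llm" if field in VISUAL_FIELDS else "ocr_llm"
```

Routing per field (rather than per document) means a single document can use the cheap pipeline for its dates and the expensive one only for its checkboxes.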

Dealing with handwritten text

One issue with the approach mentioned above is that some documents might contain handwritten text, which traditional OCR is not particularly good at extracting. If your OCR is poor, the LLM extracting metadata will also perform poorly. Thus, if you know you’re dealing with handwritten text, I recommend applying vision LLMs, as they are way better at dealing with handwriting, based on my own experience. It’s important to be aware that many documents will contain both born-digital text and handwriting.

Dealing with long documents

In many cases, you’ll also have to deal with extremely long documents. If this is the case, you have to consider how far into the document a metadata point might appear.

The reason this matters is cost: processing extremely long documents requires many LLM input tokens, which is expensive. In most cases, the important piece of information (a date, for example) will be present early in the document, in which case you won’t need many input tokens. In other situations, however, the relevant piece of information might be on page 94, in which case you need a lot of input tokens.

The issue, of course, is that you don’t know beforehand which page the metadata is on. Thus, you essentially have to make a cutoff decision, such as only looking at the first 100 pages of a given document, on the assumption that the metadata appears within those pages for almost all documents. You’ll miss a data point on the rare occasion the data sits on page 101 or later, but you’ll save substantially on costs.
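The cutoff strategy is a one-liner, but making the trade-off explicit in code (and configurable) is worth it. The 100-page default mirrors the example above:

```python
def truncate_pages(pages: list[str], max_pages: int = 100) -> list[str]:
    """Keep only the first `max_pages` pages to cap LLM input tokens.

    Metadata on later pages will be missed — that is the accepted
    cost/recall trade-off described in the text.
    """
    return pages[:max_pages]

doc_pages = [f"page {i} text" for i in range(150)]
print(len(truncate_pages(doc_pages)))  # → 100
```

If you log how often extraction returns nothing on truncated documents, you get a cheap signal for whether the cutoff is set too aggressively.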

Conclusion

In this article, I’ve discussed how you can consistently extract metadata from your documents. This metadata is often critical for downstream tasks like filtering your documents based on data points. Furthermore, I discussed three main approaches to metadata extraction, using Regex, OCR + LLM, and vision LLMs, and I covered some challenges you’ll face along the way. I think metadata extraction remains a task that doesn’t require a lot of effort but can provide a lot of value in downstream tasks. I thus believe metadata extraction will remain important in the coming years, though I expect we’ll see more and more of it shift toward purely using vision LLMs instead of OCR + LLM.

👉 My Free Resources

🚀 10x Your Engineering with LLMs (Free 3-Day Email Course)

📚 Get my free Vision Language Models ebook

💻 My webinar on Vision Language Models

👉 Find me on socials:

📩 Subscribe to my newsletter

🧑‍💻 Get in touch

🔗 LinkedIn

🐦 X / Twitter

✍️ Medium


Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.
