
Document Parsing Using Large Language Models – With Code

You will not think about using Regular Expressions anymore.

Motivation

For many years, regular expressions have been my go-to tool for parsing documents, and I am sure the same is true for many other technical folks across industries.

Even though regular expressions are powerful and successful in some cases, they often struggle with the complexity and variability of real-world documents.

Large language models, on the other hand, provide a more powerful and flexible approach to handling many types of document structures and content.

General Workflow of the system

It’s always good to have a clear understanding of the main components of the system being built. To make things simple, let’s focus on a scenario of research paper processing.

Documents Parsing Workflow With LLM (Author: Zoumana Keita)
  • The workflow has three main components: Input, Processing, and Output.
  • First, documents, in this case scientific research papers in PDF format, are submitted for processing.
  • The first module of the processing component extracts raw text from each PDF and combines it with a prompt containing instructions for the large language model to efficiently extract data.
  • The large language model then uses the prompt to extract all the metadata.
  • For each PDF, the final result is saved in JSON format, which can be used for further analysis.

But why bother with LLMs instead of regular expressions?

Regular expressions (regex) come with significant limitations when dealing with the structural complexity of research papers, some of which are illustrated below:

1. Flexibility of document structure

  • Regex requires a specific pattern for each document structure and fails when a given document deviates from the expected format.
  • LLMs automatically understand and adapt to a wide range of document structures, and they can identify relevant information regardless of where it appears in the document.

2. Context understanding

  • Regex match patterns without any understanding of the context or meaning.
  • LLMs have a granular understanding of the meaning of each document, which allows them to perform a more accurate extraction of relevant information.

3. Maintenance and Scalability

  • Regex requires continuous updates as document formats change, and adding support for a new type of information often means writing an entirely new pattern.
  • LLMs can be easily adapted to new document types with minimal changes in the initial prompt, which makes them more scalable.
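
To make the brittleness concrete, here is a toy sketch with made-up document strings: a regex written for one phrasing of the publication year silently misses the same fact written differently, while an LLM prompt would simply ask for "the publication year".

```python
import re

# A regex written for one phrasing of the publication year...
year_pattern = re.compile(r"Published in (\d{4})")

doc_a = "Published in 2017 by the authors."
doc_b = "(c) 2017. All rights reserved."  # same fact, different wording

match_a = year_pattern.search(doc_a)
match_b = year_pattern.search(doc_b)

print(match_a.group(1) if match_a else None)  # -> 2017
print(match_b.group(1) if match_b else None)  # -> None: the pattern misses it
```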

Building a Document Parsing Workflow

The above reasons are strong enough to adopt LLMs for parsing complex documents like research papers.

The documents used for our illustration are:

  • Attention Is All You Need (1706.03762v7.pdf)
  • Performance Study of YOLOv5 and Faster R-CNN for Autonomous Navigation around Non-Cooperative Targets (2301.09056v1.pdf)

This section provides all the steps for building a real-world document parsing system leveraging large language models, and I believe it has the potential to change the way you think about AI and its capabilities.

If you are more of a video-oriented person, I will be waiting for you on the other side.

Structure of the code

The code is structured as follows:

project
   |
   |---- Extract_Metadata_With_Large_Language_Models.ipynb
   |
   |---- data
           |
           |---- extracted_metadata/
           |---- 1706.03762v7.pdf
           |---- 2301.09056v1.pdf
           |---- prompts/
                   |
                   |---- scientific_papers_prompt.txt
  • project folder is the root folder and contains the data folder and the notebook
  • data folder contains the two papers above and two subfolders: extracted_metadata and prompts
  • extracted_metadata is currently empty and will hold the generated JSON files
  • prompts folder contains the prompt in text format

Metadata to extract

We first need a clear view of the attributes to be extracted, and for simplicity's sake, let's focus on six attributes for our scenario.

  • Paper Title
  • Publication Year
  • Authors
  • Author Contact
  • Abstract
  • Summary Abstract

Those attributes are then used to define the prompt below. The successful parsing of the documents relies on this prompt clearly explaining what each attribute means and in which format to return the final result.

Scientific research paper:
---
{document}
---

You are an expert in analyzing scientific research papers. Please carefully read the provided research paper above and extract the following key information:

Extract these six (6) properties from the research paper:
- Paper Title: The full title of the research paper
- Publication Year: The year the paper was published
- Authors: The full names of all authors of the paper
- Author Contact: A list of dictionaries, where each dictionary contains the following keys for each author:
  - Name: The full name of the author
  - Institution: The institutional affiliation of the author
  - Email: The email address of the author (if provided)
- Abstract: The full text of the paper's abstract
- Summary Abstract: A concise summary of the abstract in 2-3 sentences, highlighting the key points

Guidelines:
- The extracted information should be factual and accurate to the document.
- Be extremely concise, except for the Abstract which should be copied in full.
- The extracted entities should be self-contained and easily understood without the rest of the paper.
- If any property is missing from the paper, please leave the field empty rather than guessing.
- For the Summary Abstract, focus on the main objectives, methods, and key findings of the research.
- For Author Contact, create an entry for each author, even if some information is missing. If an email or institution is not provided for an author, leave that field empty in the dictionary.

Answer in JSON format. The JSON should contain 6 keys: "PaperTitle", "PublicationYear", "Authors", "AuthorContact", "Abstract", and "SummaryAbstract". The "AuthorContact" should be a list of dictionaries as described above.

Six main things are happening in the prompt, and let’s break them down.

  1. Document placeholder
Scientific research paper:
---
{document}
---

Defined with curly braces, this placeholder indicates where the full text of the document will be included for analysis.
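
As we will see later, the pipeline actually sends the prompt and the document as separate chat messages, but if you wanted to fill the placeholder directly, a minimal sketch (with a made-up template and paper snippet) could look like this:

```python
# Hypothetical template and paper snippet, for illustration only
prompt_template = """Scientific research paper:
---
{document}
---
Extract the paper title."""

paper_text = "Attention Is All You Need. Ashish Vaswani et al., 2017."

# str.replace avoids surprises if the prompt itself contains literal braces
filled_prompt = prompt_template.replace("{document}", paper_text)
print(filled_prompt)
```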

2. Role assignment

The model is assigned a role for better execution of the task, and this is defined in the following line, setting the context and instructing the AI to be an expert in scientific research paper analysis.

You are an expert in analyzing scientific research papers.

3. Extraction instruction

This section specifies the pieces of information that should be extracted from the document.

Extract these six (6) properties from the research paper:

4. Attributes definition

Here is where each of the above attributes is defined with specific details on what information to include, along with their formatting strategy. For instance, Author Contact is a list of dictionaries containing additional details.

5. Guidelines

The guidelines tell the AI the rules to follow during the extraction, such as maintaining the accuracy, and how to handle missing information.

6. Expected output format

This is the final step, and it specifies the exact format to use when answering, which is JSON.

Answer in JSON format. The JSON should contain 6 keys: ...

Libraries

Great, now let’s start installing the necessary libraries.

Our document parsing system is built with several libraries, and the main ones for each component are illustrated below:

  • PDF Processing: pdfminer.six, PyPDF2, and poppler-utils for handling various PDF formats and structures.
  • Text Extraction: unstructured and its dependent packages (unstructured-inference, unstructured-pytesseract) for intelligent content extraction from documents.
  • OCR Capabilities: tesseract-ocr for recognizing text in images or scanned documents.
  • Image Handling: pillow-heif for image processing tasks.
  • AI Integration: openai library for leveraging GPT models in our information extraction process.
%%bash

pip -qqq install pdfminer.six
pip -qqq install pillow-heif==0.3.2
pip -qqq install matplotlib
pip -qqq install unstructured-inference
pip -qqq install unstructured-pytesseract
pip -qqq install unstructured
pip -qqq install openai
pip -qqq install PyPDF2

apt-get update
apt-get install -V tesseract-ocr
apt-get install -V libtesseract-dev
apt-get install -V poppler-utils

After a successful installation, the imports are performed as follows:

import os
import re
import json
import openai
from pathlib import Path
from openai import OpenAI
from PyPDF2 import PdfReader
from google.colab import userdata
from unstructured.partition.pdf import partition_pdf
from tenacity import retry, wait_random_exponential, stop_after_attempt

Set Up credentials

We need to set up our environment with the necessary API credentials before diving into the core functionalities.

OPENAI_API_KEY = userdata.get('OPEN_AI_KEY')
model_ID = userdata.get('GPT_MODEL')
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

client = OpenAI(api_key = OPENAI_API_KEY)

  • Here, we use the userdata.get() function to securely access credentials stored in Google Colab.
  • We also retrieve the ID of the GPT model we want to use, which is gpt-4o in our case.

Using environment variables like this ensures secure access to the credentials while maintaining flexibility in our choice of model.

It is also a better approach to manage API keys and models, especially when working in different environments or with multiple projects.
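
Outside of Google Colab, where userdata is not available, the same credentials can come from ordinary environment variables. A minimal sketch, assuming OPENAI_API_KEY and optionally GPT_MODEL have been exported in the shell:

```python
import os

# Outside Colab, read the credentials from ordinary environment variables,
# e.g. after `export OPENAI_API_KEY=...` in the shell
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
model_ID = os.environ.get("GPT_MODEL", "gpt-4o")  # default model if unset
```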

Workflow implementation

We now have all the resources to efficiently build the end-to-end workflow. It is time to implement each workflow component, starting with the data processing helper function.

  1. Data processing

The first step in our workflow is to preprocess the PDF files and extract their text content, and that is achieved with the extract_text_from_pdf function.

It takes a PDF file as input and returns its content as raw text.

def extract_text_from_pdf(pdf_path: str):
    """
    Extract text content from a PDF file using the unstructured library.
    """
    elements = partition_pdf(pdf_path, strategy="hi_res")
    return "\n".join([str(element) for element in elements])

Prompt reader

The prompt is stored in a separate .txt file and loaded using the following function.

def read_prompt(prompt_path: str):
    """
    Read the prompt for research paper parsing from a text file.
    """
    with open(prompt_path, "r") as f:
        return f.read()

Metadata extraction

This function is the core of our workflow. It leverages the OpenAI API to process the content of a given PDF file.

Without the @retry decorator, we might run into the Error Code 429 - Rate limit reached for requests issue, which happens when we hit the API rate limit during processing. Instead of failing, we want the function to keep retrying until it succeeds.

@retry(wait=wait_random_exponential(min=1, max=120), stop=stop_after_attempt(10))
def completion_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)

By using completion_with_backoff within our extract_metadata function:

  • It waits between 1 and 120 seconds before retrying a failed API call.
  • The waiting time grows with each retry but always stays within the 1 to 120 second range.
  • This process, known as exponential backoff, is useful for handling API rate limits and temporary issues.
def extract_metadata(content: str, prompt_path: str, model_id: str):
    """
    Use GPT model to extract metadata from the research paper content based on the given prompt.
    """
    prompt_data = read_prompt(prompt_path)

    try:
        response = completion_with_backoff(
            model=model_id,
            messages=[
                {"role": "system", "content": prompt_data},
                {"role": "user", "content": content}
            ],
            temperature=0.2,
        )

        response_content = response.choices[0].message.content
        # Process and return the extracted metadata
        # ...
    except Exception as e:
        print(f"Error calling OpenAI API: {e}")
        return {}
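
The response processing step is elided above. One possible way to finish it, sketched here as a hypothetical parse_json_response helper, is to strip any markdown code fences the model may wrap around its answer and parse the remainder as JSON:

```python
import json

def parse_json_response(response_content: str) -> dict:
    """
    Hypothetical helper: models sometimes wrap their JSON answer in
    markdown code fences, so strip those before parsing, and fall back
    to an empty dict if the reply is not valid JSON.
    """
    cleaned = response_content.strip().strip("`")
    if cleaned.startswith("json"):
        cleaned = cleaned[len("json"):]
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return {}
```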

By sending the paper content along with the prompt, the gpt-4o model extracts the structured information as specified in the prompt.

Putting it all together

By putting all the logic together, we can use the process_research_paper function to perform the end-to-end execution for a single PDF file, from extracting the expected metadata to saving the final result in .json format.

def process_research_paper(pdf_path: str, prompt_path: str,
                           output_folder: str, model_id: str):
    """
    Process a single research paper through the entire pipeline.
    """
    print(f"Processing research paper: {pdf_path}")

    try:
        # Step 1: Extract text content from the PDF
        content = extract_text_from_pdf(pdf_path)

        # Step 2: Extract metadata using the GPT model
        metadata = extract_metadata(content, prompt_path, model_id)

        # Step 3: Save the result as a JSON file
        output_filename = Path(pdf_path).stem + '.json'
        output_path = os.path.join(output_folder, output_filename)

        with open(output_path, 'w') as f:
            json.dump(metadata, f, indent=2)
        print(f"Saved metadata to {output_path}")

    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")

Here is an example of applying the logic to the processing of a single document:

# Example for a single document

pdf_path = "./data/1706.03762v7.pdf"
prompt_path =  "./data/prompts/scientific_papers_prompt.txt"
output_folder = "./data/extracted_metadata"

process_research_paper(pdf_path, prompt_path, output_folder, model_ID)
Processing steps of the PDF document (Image by Author)

From the above image, we can see that the resulting .json file is saved in the ./data/extracted_metadata/ folder under the name 1706.03762v7.json, which is exactly the same name as the PDF but with a different extension.

The content of the json file is given below along with the research paper highlighted with the target attributes that have been extracted:

Original Paper with target attributes to extract (Image by Author)

From the JSON data, we notice that all the attributes have been successfully extracted. Even better, Illia Polosukhin's institution is not provided in the paper, and the AI left the field empty rather than guessing.

{
  "PaperTitle": "Attention Is All You Need",
  "PublicationYear": "2017",
  "Authors": [
    "Ashish Vaswani",
    "Noam Shazeer",
    "Niki Parmar",
    "Jakob Uszkoreit",
    "Llion Jones",
    "Aidan N. Gomez",
    "Lukasz Kaiser",
    "Illia Polosukhin"
  ],
  "AuthorContact": [
    {
      "Name": "Ashish Vaswani",
      "Institution": "Google Brain",
      "Email": "[email protected]"
    },
    {
      "Name": "Noam Shazeer",
      "Institution": "Google Brain",
      "Email": "[email protected]"
    },
    {
      "Name": "Niki Parmar",
      "Institution": "Google Research",
      "Email": "[email protected]"
    },
    {
      "Name": "Jakob Uszkoreit",
      "Institution": "Google Research",
      "Email": "[email protected]"
    },
    {
      "Name": "Llion Jones",
      "Institution": "Google Research",
      "Email": "[email protected]"
    },
    {
      "Name": "Aidan N. Gomez",
      "Institution": "University of Toronto",
      "Email": "[email protected]"
    },
    {
      "Name": "Lukasz Kaiser",
      "Institution": "Google Brain",
      "Email": "[email protected]"
    },
    {
      "Name": "Illia Polosukhin",
      "Institution": "",
      "Email": "[email protected]"
    }
  ],
  "Abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.",
  "SummaryAbstract": "The paper introduces the Transformer, a novel network architecture based solely on attention mechanisms, eliminating the need for recurrence and convolutions. The Transformer achieves superior performance on machine translation tasks, setting new state-of-the-art BLEU scores while being more parallelizable and requiring less training time. Additionally, it generalizes well to other tasks such as English constituency parsing."
}

Furthermore, the value of the additional Summary Abstract attribute is shown below; it accurately summarizes the original abstract while staying within the two-to-three-sentence constraint set in the prompt.

The paper introduces the Transformer, a novel network architecture based solely on attention mechanisms, eliminating the need for recurrence and convolutions. 
The Transformer achieves superior performance on machine translation tasks, setting new state-of-the-art BLEU scores while being more parallelizable and requiring less training time. 
Additionally, it generalizes well to other tasks such as English constituency parsing.

Now that the pipeline works for a single document, we can implement the logic to run it for all the documents in a given folder, and that is achieved using the process_directory function.

It processes each PDF file and saves the result to the same extracted_metadata folder.

# Parse documents from a folder
def process_directory(prompt_path: str, directory_path: str, output_folder: str, model_id: str):
    """
    Process all PDF files in the given directory.
    """

    # Iterate through all files in the directory
    for filename in os.listdir(directory_path):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(directory_path, filename)
            process_research_paper(pdf_path, prompt_path, output_folder, model_id)

Here is how to call the function with the correct parameters.

# Define paths
prompt_path = "./data/prompts/scientific_papers_prompt.txt"
directory_path = "./data"
output_folder = "./data/extracted_metadata"
process_directory(prompt_path, directory_path, output_folder, model_ID)

Successful processing shows the following messages, and we can see that each research paper has been processed.

Processing steps of the research papers (Image by Author)

The resulting JSON content for the YOLOv5 paper is given below, similar to the paper above.

{
  "PaperTitle": "Performance Study of YOLOv5 and Faster R-CNN for Autonomous Navigation around Non-Cooperative Targets",
  "PublicationYear": "2022",
  "Authors": [
    "Trupti Mahendrakar",
    "Andrew Ekblad",
    "Nathan Fischer",
    "Ryan T. White",
    "Markus Wilde",
    "Brian Kish",
    "Isaac Silver"
  ],
  "AuthorContact": [
    {
      "Name": "Trupti Mahendrakar",
      "Institution": "Florida Institute of Technology",
      "Email": "[email protected]"
    },
    {
      "Name": "Andrew Ekblad",
      "Institution": "Florida Institute of Technology",
      "Email": "[email protected]"
    },
    {
      "Name": "Nathan Fischer",
      "Institution": "Florida Institute of Technology",
      "Email": "[email protected]"
    },
    {
      "Name": "Ryan T. White",
      "Institution": "Florida Institute of Technology",
      "Email": "[email protected]"
    },
    {
      "Name": "Markus Wilde",
      "Institution": "Florida Institute of Technology",
      "Email": "[email protected]"
    },
    {
      "Name": "Brian Kish",
      "Institution": "Florida Institute of Technology",
      "Email": "[email protected]"
    },
    {
      "Name": "Isaac Silver",
      "Institution": "Energy Management Aerospace",
      "Email": "[email protected]"
    }
  ],
  "Abstract": "Autonomous navigation and path-planning around non-cooperative space objects is an enabling technology for on-orbit servicing and space debris removal systems. The navigation task includes the determination of target object motion, the identification of target object features suitable for grasping, and the identification of collision hazards and other keep-out zones. Given this knowledge, chaser spacecraft can be guided towards capture locations without damaging the target object or without unduly the operations of a servicing target by covering up solar arrays or communication antennas. One way to autonomously achieve target identification, characterization and feature recognition is by use of artificial intelligence algorithms. This paper discusses how the combination of cameras and machine learning algorithms can achieve the relative navigation task. The performance of two deep learning-based object detection algorithms, Faster Region-based Convolutional Neural Networks (R-CNN) and You Only Look Once (YOLOv5), is tested using experimental data obtained in formation flight simulations in the ORION Lab at Florida Institute of Technology. The simulation scenarios vary the yaw motion of the target object, the chaser approach trajectory, and the lighting conditions in order to test the algorithms in a wide range of realistic and performance limiting situations. The data analyzed include the mean average precision metrics in order to compare the performance of the object detectors. The paper discusses the path to implementing the feature recognition algorithms and towards integrating them into the spacecraft Guidance Navigation and Control system.",
  "SummaryAbstract": "This paper evaluates the performance of two deep learning-based object detection algorithms, YOLOv5 and Faster R-CNN, for autonomous navigation around non-cooperative space objects. Experimental data from formation flight simulations were used to test the algorithms under various conditions. The study found that while Faster R-CNN is more accurate, YOLOv5 offers significantly faster inference times, making it more suitable for real-time applications."
}

The AI created the following summary for the initial abstract, and once again, this looks great!

This paper evaluates the performance of two deep learning-based object detection algorithms, YOLOv5 and Faster R-CNN, for autonomous navigation around non-cooperative space objects. 
Experimental data from formation flight simulations were used to test the algorithms under various conditions. 
The study found that while Faster R-CNN is more accurate, YOLOv5 offers significantly faster inference times, making it more suitable for real-time applications.
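
With all papers processed, the saved .json files can be loaded back for downstream analysis. The load_extracted_metadata helper below is a minimal sketch, not part of the original notebook:

```python
import json
from pathlib import Path

def load_extracted_metadata(output_folder: str) -> list:
    """Load every saved .json result back into memory for analysis."""
    records = []
    for json_file in sorted(Path(output_folder).glob("*.json")):
        with open(json_file) as f:
            records.append(json.load(f))
    return records

# Example usage (paths from the article):
# for record in load_extracted_metadata("./data/extracted_metadata"):
#     print(record.get("PaperTitle"))
```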

Conclusion

This article provided a brief overview of applying LLMs to metadata extraction from complex documents, and the extracted JSON data can be stored in non-relational databases for further analysis.

Both LLMs and regex have their pros and cons for content extraction, and each should be applied wisely depending on the use case. The complete code is available on my GitHub, and subscribe to my YouTube for more content.

I hope this short tutorial helped you acquire new skill sets.

Also, if you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $5-a-month commitment, you unlock unlimited access to stories on Medium.

Would you like to buy me a coffee ☕️? → Here you go!

Feel free to follow me on Twitter, or say Hi on LinkedIn. It is always a pleasure to discuss AI stuff!

