Leveraging Gemini 1.5 for Efficient Information Extraction on Long PDFs

Published in Google Cloud - Community · Apr 25, 2024

Authors: Renato Leite | Egon Soares

Manually extracting specific details from long documents is time-consuming and mentally taxing. Recent advances in large language models (LLMs) can streamline this task.

This blog post explores two strategies that harness the power of LLMs, specifically Gemini 1.5 with its long context window and multimodality capabilities, to efficiently process and query lengthy PDF documents.

In the first strategy, you will use Gemini 1.5 to return the specific pages that contain the information; this is particularly useful when you want the end user to perform the last mile and extract the information themselves. In the second strategy, you will use Gemini to query the document directly, passing the PDF as context and asking a question. Both scenarios use a synthetically generated financial report to demonstrate the concept, along with a simple architecture to execute the code.

For each strategy, we propose a prompt and some additional steps to solve the task. The prompts were crafted specifically for these tasks to ensure accurate and efficient extraction of relevant information. Before you dive deeper into the solution, let’s review some common challenges in analyzing such documents.

Challenges of processing PDFs with LLMs

Here is a non-exhaustive list of challenges in processing PDFs with LLMs.

The model must support long contexts and multimodal inputs. Gemini 1.5 supports both natively and accepts PDF input directly, without any preprocessing of the document.
The document might have complex charts to analyze. In the sample chart below, the font color is the same as the bar color, which can be challenging for the model to interpret.

The PDF document might have dense and complex tables, with multiple variables.

The user query (question) about the PDF document might involve multiple variables and require some reasoning about them (sum, comparison, etc.). Example using the table above: “What is the difference in total net revenue between 2023 and 2022?”
If you are analyzing documents from different sources, there is a chance that these documents won’t share a standard format.

Next, you will explore two different approaches to tackle these challenges with Gemini 1.5.

Strategy 1) Returning pages from the PDF with information about the question

To analyze and extract the pages from the PDF that are related to the given question, you will use the following prompt.

Return the numbers of all pages in the document above that contain information related to the question below.
<Instructions>
 - Use the document above as your only source of information to determine which pages are related to the question below.
 - Return the page numbers of the document above that are related to the question. When in doubt, return the page anyway.
 - The response should be a JSON list, as shown in the example below.
</Instructions>
<Suggestions>
 - The document above is a financial report with various tables, charts, infographics, lists, and additional text information.
 - Pay CLOSE ATTENTION to the chart legends and chart COLORS to determine the pages. Colors may indicate which information is important for determining the pages.
 - The color of the chart legends represents the color of the bars in the chart.
 - Use ONLY this document as context to determine the pages.
 - In most cases, the page number can be found in the footer.
</Suggestions>
<Question>
{question}
</Question>
<ExampleJSONOutput>
{
 "pages": [1, 2, 3, 4, 5]
}
</ExampleJSONOutput>
json:

The following guidelines outline how to craft prompts for analyzing PDF documents. They are meant as a clear framework, not an exhaustive list of instructions.

Include the PDF: Place the entire PDF document before the prompt itself.
Format your prompt: Provide a structure to the prompt. In this case, we used tags like <Instructions> and <Suggestions> to indicate to the model where to find the information.
Source of Truth: Instruct the model to use the PDF as its sole source of information.
JSON Output: Specify that the output should be in JSON format, providing a generic example.
Guidance: Offer suggestions to aid the model’s understanding:

Document Summary: Briefly explain the document’s nature and content (e.g., financial report).
Chart Details: Emphasize attention to visual elements such as chart colors and legends, and to page footers.
Page References: Request that the output includes page numbers where relevant information is found.
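
The guidelines above can be sketched in Python as follows. This is a minimal sketch, not the authors' code: `PAGES_PROMPT_TEMPLATE` is an abbreviated copy of the prompt shown earlier, and `build_prompt` is a hypothetical helper name. Because the template contains literal braces in its JSON example, the sketch substitutes the question with `str.replace` rather than `str.format`, which would misread those braces as placeholders.

```python
# Abbreviated copy of the Strategy 1 prompt; the full text appears earlier in this post.
PAGES_PROMPT_TEMPLATE = """Return the numbers of all pages in the document above that contain information related to the question below.
<Instructions>
 - Use the document above as your only source of information to determine which pages are related to the question below.
 - The response should be a JSON list, as shown in the example below.
</Instructions>
<Question>
{question}
</Question>
<ExampleJSONOutput>
{
 "pages": [1, 2, 3, 4, 5]
}
</ExampleJSONOutput>
json:"""


def build_prompt(question: str, template: str = PAGES_PROMPT_TEMPLATE) -> str:
    """Insert the user's question into the prompt template (hypothetical helper)."""
    # Only the {question} placeholder is replaced, so the literal braces
    # of the JSON example are left untouched.
    return template.replace("{question}", question.strip())


prompt = build_prompt("What was the total net revenue in 2023?")
print(prompt)
```

When the request is sent, the PDF part goes first and this prompt second, which is why the prompt can refer to "the document above".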

In addition to the prompt, you can also perform the following tasks:

Gemini 1.5 Invocation and JSON Conversion: How to call the model and process the output.
PDF Filtering: Creating a new PDF containing only the relevant pages.
PDF Visualization: Visualizing the filtered PDF within a Colab notebook.
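
The first two tasks could look roughly like the sketch below. It is not the authors' notebook code: it assumes the Vertex AI Python SDK (`vertexai.generative_models`) and the `pypdf` package are installed, and the project ID, location, default model name, and all function names are placeholders chosen for illustration. The third-party imports sit inside the functions that need them, so the pure helpers work without those packages.

```python
import json
import re


def ask_gemini(pdf_bytes: bytes, prompt: str, project: str, location: str,
               model_name: str = "gemini-1.5-pro") -> str:
    """Send the PDF followed by the prompt and return the raw text response."""
    # Assumption: the Vertex AI SDK is installed and credentials are configured.
    import vertexai
    from vertexai.generative_models import GenerativeModel, Part

    vertexai.init(project=project, location=location)
    model = GenerativeModel(model_name)
    pdf_part = Part.from_data(data=pdf_bytes, mime_type="application/pdf")
    # The PDF goes first, the prompt second.
    response = model.generate_content([pdf_part, prompt])
    return response.text


def parse_pages(response_text: str) -> list[int]:
    """Extract page numbers from the model's JSON reply."""
    # The reply may carry extra text around the JSON; keep the first {...} or [...] span.
    match = re.search(r"\{.*\}|\[.*\]", response_text, flags=re.DOTALL)
    if match is None:
        return []
    data = json.loads(match.group(0))
    pages = data.get("pages", []) if isinstance(data, dict) else data
    return sorted({int(p) for p in pages})


def to_page_indices(pages: list[int], page_count: int) -> list[int]:
    """Convert 1-based page numbers to 0-based indices, dropping out-of-range values."""
    return [p - 1 for p in pages if 1 <= p <= page_count]


def filter_pdf(src_path: str, dst_path: str, pages: list[int]) -> None:
    """Write a new PDF that contains only the requested pages."""
    # Assumption: the pypdf package is installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in to_page_indices(pages, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as out_file:
        writer.write(out_file)


print(parse_pages('{"pages": [9, 3, 9]}'))         # [3, 9]
print(to_page_indices([3, 9, 40], page_count=12))  # [2, 8]
```

Note that this mapping assumes the page numbers returned by the model match the physical page order starting at 1. If the document has a cover page or unnumbered front matter, the printed footer numbers and the physical positions can differ, and the conversion would need an offset.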

The following is an example of a question asked about the document “Cymbal Bank — Financial Statements.pdf”. The model correctly identified the page that contains the information.

> Question

“From the consolidated balance sheet, what was the difference between the total assets from 2022 to 2023?”

> Output from the model

[9]

Source code for this example is available in the repository linked at the end of this post.

Strategy 2) Asking questions to the entire PDF

In this second approach, you will use Gemini 1.5 to answer your question using the entire PDF as the input context. Here is the prompt:

Use the document above to answer the question below. Follow the Instructions and Suggestions below as a guide to answering the question.
<Instructions>
- First, analyze the question below and return which variables need to be analyzed, from what time period (example: second quarter of 2020), and any other details present in the question.
- Then return an analysis of what is asked in the question.
- Finally, carefully analyze the document above and answer the question below completely and correctly, using the variables determined in the previous step.
- Explain how you arrived at this result.
- Answer ONLY what was asked.
</Instructions>
<Suggestions>
- The document above is a financial report with various tables, graphs, infographics, lists, and additional information in text.
- PAY VERY CLOSE ATTENTION to the legends of the graphs and the COLORS of the graphs to answer the question below. The colors may indicate which information is important to answer the question.
- The color of the graph legends represents the color of the graph bars.
- Use ONLY this document as context to answer the question below.
</Suggestions>
<Question>
{question}
</Question>
answer:

Here are some additional suggestions to build your prompt:

Variables: Ask the model to return all the variables it needs to analyze before answering the question.
Explanation: Ask the model to explain how it arrived at the answer.

The following is an example of a question asked about the document “Cymbal Bank — Financial Statements.pdf”. The model was able to reason about the question and respond with the correct answer.

> Question

“From the consolidated balance sheet, what was the difference between the total assets from 2022 to 2023?”

> Output from the model

## Analysis of the Question:
The question asks for the difference in **total assets** between the years **2022** and **2023** from the **Consolidated Balance Sheet**. This requires locating the relevant section within the document and identifying the values associated with each year.
## Locating the Information:
1. **Consolidated Balance Sheet:** The document provides a “Consolidated Balance Sheet” table which contains financial data for the years 2022 and 2023.
2. **Total Assets:** We need to identify the row labeled “Total assets” within the table.
3. **Values for 2022 and 2023:** We will find the corresponding values under the “12/31/2022” and “12/31/2023” columns.
## Calculation:
1. **2023 Total Assets:** $2,238,274 million
2. **2022 Total Assets:** $2,281,868 million
3. **Difference:** $2,281,868 million - $2,238,274 million = $43,594 million
## Answer:
The difference in total assets between 2022 and 2023 according to the Consolidated Balance Sheet is **$43,594 million**. This indicates a decrease in total assets from 2022 to 2023.
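
Because the prompt asks the model to explain how it arrived at the result, the reported figures can be checked independently. The snippet below is a small sanity check using the two values quoted in the output above; `check_difference` is a hypothetical helper, not part of the original code.

```python
def check_difference(earlier: int, later: int, reported: int) -> bool:
    """Return True if the reported absolute difference matches the two values."""
    return abs(earlier - later) == reported


# Values in millions of dollars, copied from the model output above.
total_assets_2022 = 2_281_868
total_assets_2023 = 2_238_274

print(total_assets_2022 - total_assets_2023)  # 43594
print(check_difference(total_assets_2022, total_assets_2023, 43_594))  # True
```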

Source code for this example is available in the repository linked at the end of this post.

Conclusion

Gemini 1.5’s long context window and multimodal capabilities offer powerful tools for extracting information from complex PDFs. Whether guiding users to specific pages or directly answering queries, these strategies demonstrate the model’s versatility. As LLMs continue to evolve, we can anticipate even more sophisticated interactions with unstructured data, opening up new possibilities for research, analysis, and decision-making across various domains.

Source code: https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples/tree/main/genai-on-vertex-ai/gemini/pdf_processing_with_gemini
