How to Scan Documents with GPT-4: A Practical Guide

Based on real case studies in invoice processing and receipt verification

Kacper Tokarczyk

Published in autoMEE · 7 min read · Sep 2, 2024

Introduction: Cutting Through the Paperwork with AI

Businesses today are drowning in paperwork. So much of what they do involves moving data from one place to another.

It’s the kind of work that doesn’t require much thought, just time and patience. It’s mundane, inefficient, and frankly, a waste of human potential.

SpongeBob running in panic with tons of papers — Just another day in the office

Look at the invoice… Manually enter the numbers into the system… (Stay awake, avoid mistakes)… Repeat.

But there is a smarter way.

In this article, we’ll show you how you can use GPT-4 to transform these tasks, saving time and effort.

Real-World Examples: Invoices and Receipts

We’ll look at two real-world examples of solutions we’ve developed for our clients:

automating invoice processing
verifying receipts for lotteries and contests.

These tasks require accuracy and speed, and AI, like GPT-4o, can make them seamless. Let’s dive in.

Getting Started: The Basics of Using GPT-4 for Document Scanning

Setting Up for Success

Before you start automating document scanning with GPT-4, you need to set the stage for success. Here are the key factors to consider:

Image Quality: High-quality scans help a great deal but are not strictly required. For invoices, this is usually straightforward since they often arrive in clear, standardized formats. Receipts are trickier, as customers often take the photos themselves, leading to varying quality. However, with the right approach, even imperfect images can be processed accurately.
Prompt: A clear instruction is crucial. It should describe specifically what data to extract from the document, where that data is located, and in what format it appears. It is also essential to request a structured JSON output so you can handle the returned data consistently.

Example prompt for scanning receipts:

If the image contains a receipt, scan the entire text from top to bottom and return the date and time, store name, receipt number, list of all products with quantities, and the total amount in a JSON format (without `json` formatting) — {“date_time”, “store_name”, “receipt_number”, “products”, “total”}.
Items on the receipt are typically in the format: PRODUCT NAME | TAX TYPE | QUANTITY x PRICE | TOTAL PRICE.
The receipt number is always located below the total amount/payment summary on the receipt. It is the first element after the total amount/payment and is usually on the same line as the date and time. Sometimes it is preceded by the letter “F”.
Important: “Transaction Number”, “System Number”, and BDO numbers are not receipt numbers. Look for the receipt number directly below the total amount/payment summary.
Product quantities might be indicated as units, quantity x price, or quantity * price (ensure you are summing quantities, not prices).

An example receipt with a few CocaCola products — Example receipt
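Even with an explicit format instruction, the reply comes back as plain text that still has to be parsed and checked. Here is a minimal sketch of that step, assuming the five keys requested in the prompt above. The names `stripCodeFence` and `parseReceiptReply` are our own illustrative helpers, not part of any OpenAI SDK.

```javascript
// Keys requested in the receipt prompt.
const REQUIRED_KEYS = ["date_time", "store_name", "receipt_number", "products", "total"];

// Three backticks: the Markdown code-fence marker models sometimes add.
const FENCE = "`".repeat(3);

// Hypothetical helper: remove a surrounding Markdown code fence, if any.
function stripCodeFence(text) {
  let cleaned = text.trim();
  if (cleaned.startsWith(FENCE)) {
    // Drop the whole opening fence line (with or without a "json" label).
    cleaned = cleaned.slice(cleaned.indexOf("\n") + 1);
  }
  if (cleaned.endsWith(FENCE)) {
    cleaned = cleaned.slice(0, -FENCE.length);
  }
  return cleaned.trim();
}

// Hypothetical helper: turn the reply text into a checked object.
function parseReceiptReply(replyText) {
  const data = JSON.parse(stripCodeFence(replyText)); // throws on malformed JSON

  if (typeof data !== "object" || data === null) {
    throw new Error("Reply is not a JSON object");
  }
  const missing = REQUIRED_KEYS.filter((key) => !(key in data));
  if (missing.length > 0) {
    throw new Error(`Reply is missing keys: ${missing.join(", ")}`);
  }
  if (!Array.isArray(data.products)) {
    throw new Error('"products" should be a list');
  }
  return data;
}
```

Rejecting incomplete replies early makes it easier to route a problematic receipt to manual review instead of passing bad data downstream.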

Choosing Your Setup: From Simple to Tailor-Made

With your prompt ready, the next step is setting up the infrastructure for automation. There are several ways to do this:

No-code automation in Zapier/Make: For most use cases, a straightforward setup using tools like Zapier or Make is sufficient. You can connect these tools to an email account to monitor for new document and image attachments, and then send them to OpenAI with specific instructions (example below).
Custom Solutions for Greater Flexibility: In cases where more control and customization are required, setting up a simple REST API server might be the way to go. This server can act like Zapier or Make, receiving documents or images via an endpoint, handling more complex pre-processing (like cleaning up image quality or handling different file formats), and managing post-processing tasks tailored to your specific needs. While this approach offers greater flexibility, it’s generally only necessary for more complex or large-scale implementations.

Example document data extraction automation prepared in Zapier — Example automation in Zapier
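For the custom route, the server can start out very small. The sketch below uses only Node's built-in `http` module and accepts a JPEG or PNG upload on a `/scan` endpoint. The endpoint path, the `createScanServer` and `toDataUrl` names, and the `extractFields` callback (which would wrap your OpenAI call) are all assumptions made for illustration, and pre-processing is left out.

```javascript
const ALLOWED_TYPES = new Set(["image/jpeg", "image/png"]);

// Hypothetical helper: wrap raw image bytes in a base64 data URL,
// a form the image_url field accepts in place of a public link.
function toDataUrl(bytes, mimeType) {
  return `data:${mimeType};base64,${bytes.toString("base64")}`;
}

function sendJson(res, status, payload) {
  res.writeHead(status, { "Content-Type": "application/json" });
  res.end(JSON.stringify(payload));
}

// extractFields is a placeholder for your own function that sends the
// image to the model and resolves with the extracted data.
async function createScanServer(extractFields) {
  // Dynamic import so the sketch loads under both CommonJS and ES modules.
  const http = await import("node:http");

  return http.createServer(async (req, res) => {
    const mimeType = req.headers["content-type"];
    if (req.method !== "POST" || req.url !== "/scan" || !ALLOWED_TYPES.has(mimeType)) {
      sendJson(res, 400, { error: "POST a JPEG or PNG image to /scan" });
      return;
    }
    try {
      // Collect the uploaded bytes from the request stream.
      const chunks = [];
      for await (const chunk of req) chunks.push(chunk);
      const imageUrl = toDataUrl(Buffer.concat(chunks), mimeType);
      sendJson(res, 200, await extractFields(imageUrl));
    } catch (err) {
      sendJson(res, 502, { error: "Extraction failed" });
    }
  });
}
```

To try it, create the server with your own extractor and call `listen`, for example `(await createScanServer(myExtractor)).listen(3000)`, where `myExtractor` is whatever function performs the model call.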

With these setups, you can streamline the document scanning process, making it efficient, scalable, and tailored to your specific needs.

Fine-Tuning for Accuracy: Tips and Tricks for Better Results

Now that we’ve covered the basic setup, let’s explore the nuances that will give us the accuracy we need. Ready to take it to the next level?

Breaking Bad series scene from the lab with a caption saying “let’s cook”

Page-by-Page vs. Combined Image Scanning: What Works Best

When using GPT-4 for document scanning, understanding how the model resizes images is crucial for maintaining accuracy. According to OpenAI’s pricing information, GPT-4o scales images down before processing them: the shorter side can be at most 768 pixels and the longer side at most 2048 pixels. For most documents, the shorter side is the width. The height can go up to 2048 pixels without issue, but beyond that the whole image is scaled down proportionally, so the width shrinks too. For example, if the height reaches 2050 pixels, the width drops to about 766 pixels.

OpenAI Vision pricing calculator (September 1st, 2024)
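The resizing rule can be written down as a short calculation. The sketch below follows the two steps OpenAI documented for high-detail images in 2024 (fit within 2048 x 2048, then shrink the shorter side to 768) and the published token formula of 85 base tokens plus 170 per 512-pixel tile. The function names are ours, and so is the assumption that small images are never scaled up; the live calculator may round slightly differently.

```javascript
// Sketch of the documented high-detail resizing steps for GPT-4o (2024).
// Assumption: images are only ever scaled down, never up.
function resizeForVision(width, height) {
  let w = width;
  let h = height;

  // Step 1: fit inside a 2048 x 2048 square, keeping the aspect ratio.
  const longest = Math.max(w, h);
  if (longest > 2048) {
    const scale = 2048 / longest;
    w *= scale;
    h *= scale;
  }

  // Step 2: bring the shorter side down to 768 pixels.
  const shortest = Math.min(w, h);
  if (shortest > 768) {
    const scale = 768 / shortest;
    w *= scale;
    h *= scale;
  }

  return { width: Math.round(w), height: Math.round(h) };
}

// 85 base tokens plus 170 tokens per 512 x 512 tile of the resized image.
function imageTokens(width, height) {
  const size = resizeForVision(width, height);
  const tiles = Math.ceil(size.width / 512) * Math.ceil(size.height / 512);
  return 85 + 170 * tiles;
}

// One scanned page vs. three such pages stacked into a single tall image.
console.log(resizeForVision(1240, 1754)); // { width: 768, height: 1086 }
console.log(resizeForVision(1240, 5262)); // { width: 483, height: 2048 }
```

In this example a single page keeps the full 768-pixel width, while the stacked image is squeezed to 483 pixels across, which is where the loss of legibility comes from.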

This resizing limitation means that combining multiple pages of a PDF document (such as an invoice) into one long image reduces the quality more as the number of pages increases, leading to a significant drop in accuracy. While combining pages might lower costs — since you’re only charged for processing one image — the trade-off in accuracy can be substantial.

Therefore, it’s recommended to send each page to OpenAI as a separate image; all pages can still be included in the same API call. While this approach may increase processing costs, it ensures that each page is processed at full resolution. For invoice scans, this method has worked reliably in our projects.
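In practice, sending pages separately means one text part followed by one image part per page inside a single user message. Here is a minimal sketch, assuming the PDF pages have already been rendered to images; `buildPageContent`, the prompt text, and the URLs are hypothetical placeholders.

```javascript
// Hypothetical helper: one text part, then one image part per page.
function buildPageContent(prompt, pageImageUrls) {
  return [
    { type: "text", text: prompt },
    ...pageImageUrls.map((url) => ({
      type: "image_url",
      // "high" requests the detailed processing mode for each page
      image_url: { url, detail: "high" },
    })),
  ];
}

// Placeholder URLs; base64 data URLs work in the same field.
const content = buildPageContent(
  "Extract the invoice number, dates and line items as JSON.",
  [
    "https://www.example.com/invoice_page1.jpg",
    "https://www.example.com/invoice_page2.jpg",
  ]
);

// The array is then sent as a single user message, for example:
// await openai.chat.completions.create({
//   model: "gpt-4o-2024-08-06",
//   messages: [{ role: "user", content }],
//   response_format: { type: "json_object" },
// });
```

Each page is resized and billed on its own, so the cost grows with the page count while every page keeps its full resolution.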

Few-Shot Prompting: Enhancing Accuracy for Complex Documents

Few-shot prompting is a technique where a model is provided with a small number of example inputs and outputs to help it understand the task better.

For documents like receipts, where formats and quality can vary widely, few-shot prompting can greatly enhance accuracy. This is especially helpful for extracting non-standardized information, such as receipt numbers or product listings that can appear in different formats. Although it does increase processing costs, the improvement in accuracy, especially for critical details, is worth it in most cases.

Here’s an example API call to OpenAI chat leveraging a few-shot approach:

import OpenAI from "openai";

// Reads the API key from the OPENAI_API_KEY environment variable
const openai = new OpenAI();

// Assumes `prompt` already holds the receipt-scanning instruction shown earlier
const response = await openai.chat.completions.create({
    model: "gpt-4o-2024-08-06",
    // response_format is a top-level request parameter, not a message field
    response_format: { type: "json_object" },
    messages: [
        {
            role: "user",
            content: [
                { type: "text", text: prompt },
                { type: "image_url", image_url: { url: "https://www.example.com/sample_receipt1.jpg" } }
            ]
        },
        {
            // Example answers are supplied as assistant turns;
            // JSON.stringify keeps the quotes in the store name valid JSON
            role: "assistant",
            content: JSON.stringify({
                date_time: "2024-03-07 10:44",
                store_name: 'Antoniuk Roman "ERA" Service',
                receipt_number: "00056",
                products: [
                    { name: "Lubella Noodles 400g", quantity: 1, price: 5.90 },
                    { name: "Lubella Shell Pasta 400g", quantity: 1, price: 5.60 },
                    { name: "Tymbark Apple Mint Drink 250ml", quantity: 1, price: 2.10 }
                ],
                total: 30.49
            })
        },
        {
            role: "user",
            content: [
                { type: "text", text: "Great! Now do the same for this receipt:" },
                { type: "image_url", image_url: { url: "https://www.example.com/sample_receipt2.jpg" } }
            ]
        },
        {
            role: "assistant",
            content: JSON.stringify({
                date_time: "2024-03-14 16:14",
                store_name: "1001 Drobiazgów Store",
                receipt_number: "315511",
                products: [
                    { name: "Sprite 1.5L", quantity: 1, price: 8.95 },
                    { name: "Gerard Princely Wafers 200g", quantity: 1, price: 6.85 },
                    { name: "Wawel Chocolate 90g", quantity: 1, price: 4.65 }
                ],
                total: 25.39
            })
        },
        // More receipt examples can follow...
        {
            role: "user",
            content: [
                { type: "text", text: "Great! Now do the same for this receipt:" },
                { type: "image_url", image_url: { url: "https://www.example.com/receipt_to_be_checked.jpg" } }
            ]
        }
    ],
    max_tokens: 1500,
    temperature: 0.5
});

// The extracted data is the JSON text of the first choice
const receipt = JSON.parse(response.choices[0].message.content);

In this example, we supply the model’s replies ourselves to show it what the correct output looks like. This works best with receipts that were not processed correctly during initial tests: the ones that are less readable or less standard.

Supplying these inputs and expected outputs helps the model handle other tricky edge cases accurately.

It’s important to note that every example we provide is another image that the model needs to process, so we are charged another fraction of a penny for each one.
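To put a rough number on that, the arithmetic is simple. The sketch below assumes about 1,105 input tokens per receipt photo (85 base tokens plus six tiles at 170 tokens each) and an input price of 2.50 USD per million tokens. Both values are illustrative assumptions, so check current pricing for your model before relying on them.

```javascript
// Hypothetical helper: extra input cost added by few-shot example images.
function fewShotImageCostUsd(exampleCount, tokensPerImage, usdPerMillionTokens) {
  return (exampleCount * tokensPerImage * usdPerMillionTokens) / 1_000_000;
}

// Assumed values, not official figures for your account or model.
const TOKENS_PER_IMAGE = 1105;     // 85 base + 170 x 6 tiles
const USD_PER_MILLION_INPUT = 2.5; // assumed input price

// Three example receipts attached to every request.
const extraCost = fewShotImageCostUsd(3, TOKENS_PER_IMAGE, USD_PER_MILLION_INPUT);
console.log(extraCost.toFixed(4)); // 0.0083
```

The text of the example replies adds a few more tokens on top of this, and the total applies to every request that carries the examples.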

Note: If you’d like to use this approach in Zapier, you would need to use the ChatGPT “API Request (Beta)” action and provide the examples in the body of the request (following the OpenAI Chat documentation).

Conclusion: Save Time, Save Effort

This setup, which combines structured prompts, few-shot examples, and an understanding of how OpenAI’s vision models resize images, is currently the most accurate and efficient approach we have found.

As AI keeps getting better, automating these processes will become even easier and faster. But there’s no need to wait — these solutions work great today, and you can start saving time and effort right now.

Your team will thank you for it.

Thanks for reading! If you found this guide helpful, give it a clap and follow for more practical guides based on real AI applications in business.

You can find us at automee.digital and socials.