GPT-4 Vision for Form and Table Understanding

Yogendra Sisodia
3 min read · Nov 13, 2023


Introduction

GPT-4 with vision, also referred to as GPT-4V or gpt-4-vision-preview, adds visual capabilities to the model, enabling it to process documents and images and answer questions about them. Previously, language model systems could only accept text as input, which restricted models such as GPT-4 to text-only scenarios.

Developers with GPT-4 access can now use the gpt-4-vision-preview model name to call the vision-enabled GPT-4 model. The Chat Completions API has also been extended to accept image inputs. Note that the Assistants API does not currently support image inputs.

OpenAI Guide:
https://platform.openai.com/docs/guides/vision

Early Experiments

I used two images, one for form understanding and one for table extraction, and tried several prompts to get answers, in particular in JSON format. At the time of writing, GPT-4 with vision does not support the message.name parameter, functions or tools, or the response_format parameter.
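Because response_format is unavailable, the model cannot be forced into JSON mode and often wraps its JSON answer in a markdown code fence. A small workaround, sketched below, is to strip any fence and attempt to parse what remains. The helper name extract_json is my own and not part of the OpenAI API.

```python
import json
import re

def extract_json(text):
    """Pull a JSON object out of a model reply that may wrap it in ``` fences.

    Hypothetical helper (not an OpenAI API feature): since the vision preview
    model ignores response_format, we parse whatever text it returns.
    """
    # Prefer the contents of a fenced code block if one is present
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None

reply = '```json\n{"date_start": "2023-01-01", "date_end": "2023-12-31"}\n```'
print(extract_json(reply))
```

Returning None on a parse failure lets the caller decide whether to retry the prompt or fall back to the raw text.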

Form Understanding and Table Extraction Samples

import os
import base64
import requests

# OpenAI API key (replace with your own key)
OPENAI_API_TOKEN = "sk-YOUR_KEY"
api_key = os.environ["OPENAI_API_KEY"] = OPENAI_API_TOKEN

# Function to encode an image file as base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def form_and_table_understanding(image_path, prompt_text):
    """Send an image plus a text prompt to GPT-4 with vision."""
    base64_image = encode_image(image_path)  # Path to your image
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    payload = {
        "model": "gpt-4-vision-preview",
        # "response_format": {"type": "json_object"},  # not supported by this model
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt_text},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    },
                ],
            }
        ],
        "max_tokens": 300,
    }
    response = requests.post(
        "https://api.openai.com/v1/chat/completions", headers=headers, json=payload
    )
    return response.json()

# Prompt 1 Form
image_path = "imgs/Vision_Form.png"
prompt_text = "what is date start and date end "
resp = form_and_table_understanding(image_path, prompt_text)
print(resp)

# Prompt 2 Form
image_path = "imgs/Vision_Form.png"
prompt_text = "what is date start and date end and give output as json with exact values and dont add anything extra "
resp = form_and_table_understanding(image_path, prompt_text)
print(resp)

# Prompt 3 Form
image_path = "imgs/Vision_Form.png"
prompt_text = "find all relevant information in terms of key value pairs "
resp = form_and_table_understanding(image_path, prompt_text)
print(resp)

# Prompt 1 Table
image_path = "imgs/Vision_Table.png"
prompt_text = "find out all tables"
resp = form_and_table_understanding(image_path, prompt_text)
print(resp)

# Prompt 2 Table
image_path = "imgs/Vision_Table.png"
prompt_text = "find out all tables and give output as json with exact values and dont add anything extra "
resp = form_and_table_understanding(image_path, prompt_text)
print(resp)
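The calls above print the full Chat Completions response dict. The answer text itself lives under choices[0].message.content, so a small accessor, sketched below under that assumption, keeps the prompt loop readable. The helper name answer_text is my own.

```python
def answer_text(resp):
    """Extract the assistant's text from a Chat Completions response dict.

    Returns the API error message instead if the call failed. The dict shape
    (choices -> message -> content) follows the Chat Completions API.
    """
    if "error" in resp:
        return f"API error: {resp['error'].get('message', 'unknown')}"
    return resp["choices"][0]["message"]["content"]

# Example with a response-shaped dict rather than a live API call:
sample = {"choices": [{"message": {"role": "assistant", "content": "Start: 2023-01-01"}}]}
print(answer_text(sample))  # Start: 2023-01-01
```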

Key Learnings

For form understanding, GPT-4V works very well, although it sometimes struggles to return clean JSON. For table extraction, however, results through the Chat Completions API were poor: the same table image produced noticeably better answers in the ChatGPT web interface than through the API.

Resources

Colab Link : https://colab.research.google.com/drive/1iKHuCuF5lgIX_78T9nLMg6XNEneIrZfQ?usp=drive_link



Yogendra Sisodia

AI Leader. Sales-Tech, Mar-Tech and Legal-Tech Specialist.