Revolutionizing OCR: Harnessing GPT Vision Models for PDF-to-Markdown Conversion

Jul 22, 2024

Flow chart of the whole process

Introduction

In the digital age, efficient document management is crucial. PDFs reign supreme for sharing documents, but extracting and converting their content, especially when the pages are images of text, can be a daunting task. Enter Optical Character Recognition (OCR) technology, now supercharged with the power of GPT Vision models.

The Power of OCR and Markdown Combined

OCR technology has long been the go-to solution for extracting text from images, making scanned documents and PDFs searchable and editable. But what if we could take it a step further? By combining OCR with Markdown conversion, we can transform static PDFs into well-structured, easily formatted documents.

In this article, we’ll explore a Python script that leverages GPT Vision models to perform OCR on PDF documents and convert the extracted text into beautifully structured Markdown. Whether you’re a developer, researcher, or anyone dealing with large volumes of PDF documents, this guide will show you how to streamline your document conversion process using cutting-edge AI technology.

The Magic Behind the Scenes

Our Python script automates the process of extracting text from PDF documents and converting it into Markdown format using GPT Vision models. Here’s a high-level overview of how it works:

PDF Processing: The script reads PDF documents, renders each page as an image, and encodes that image as a base64 string.
OCR with GPT Vision: These encoded images are then processed by a GPT Vision model, which performs OCR to extract the text.
Markdown Conversion: The extracted text is simultaneously converted into well-structured Markdown, preserving the original formatting.
Output Generation: The resulting Markdown content is saved, ready for further editing or integration into your workflow.
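The first two steps come down to wrapping raw image bytes in a data URL that a vision model can accept. Here is a minimal sketch, assuming the page has already been rendered to PNG bytes (PyMuPDF's `Pixmap.tobytes()` produces PNG by default). The helper name `to_data_url` is hypothetical and is not part of the script in this article.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Wrap raw image bytes in a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Illustrative input: the eight-byte signature that starts every PNG file.
png_signature = b"\x89PNG\r\n\x1a\n"
print(to_data_url(png_signature))
```

The MIME type in the URL should match the actual image encoding, so a PNG payload is declared as `image/png`.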

Setting Up Your Environment

Before we dive into the code, let’s get your environment set up:

Install the required libraries:

pip install pymupdf
pip install langchain_community
pip install langchain_core
pip install langchain_openai

Set up your API key:

export OPENAI_API_KEY='your_openai_api_key'
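`ChatOpenAI` picks the key up from the `OPENAI_API_KEY` environment variable, so it can be worth checking that the variable is set before any pages are rendered. The sketch below is one way to do that. `has_api_key` is a hypothetical helper, not something the libraries provide.

```python
import os
from typing import Mapping, Optional


def has_api_key(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if OPENAI_API_KEY is present and not blank."""
    source = os.environ if env is None else env
    return bool(source.get("OPENAI_API_KEY", "").strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; export it before running the script.")
```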

The Code: A Closer Look

import base64
import logging
from sys import argv

import pymupdf
from langchain_community.callbacks import get_openai_callback
from langchain_core.output_parsers.string import StrOutputParser
from langchain_core.prompts.chat import ChatPromptTemplate
from langchain_openai.chat_models import ChatOpenAI
from langchain_core.runnables import RunnableSerializable

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def system_prompt() -> str:
    return """You are an expert in optical character recognition (OCR) specializing in converting PDF images to Markdown format. 
Your task is to analyze images of PDF pages, accurately transcribe their content into well-structured Markdown. 
Follow these guidelines:

1. Examine the provided image(s) of PDF page(s) carefully.
2. Extract all text content from the image(s).
3. Convert the extracted text into properly formatted Markdown, preserving the original structure and layout.
4. Use appropriate Markdown syntax for headings, lists, tables, and other formatting elements.
5. For complex equations or formulas, use LaTeX syntax enclosed within $$ delimiters.
6. If there are images or diagrams, indicate their presence with a brief description in square brackets, e.g., [Image: diagram of a cell].
7. Maintain the logical flow and organization of the original document in your Markdown representation.
8. Return only the Markdown content without any additional explanations or markdown code block delimiters.

Proceed with the OCR and Markdown conversion task based on these instructions."""


def get_markdown_conversion_chain() -> RunnableSerializable:
    logger.info("Initializing the markdown conversion chain.")
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt_template = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt()),
            (
                "user",
                [
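                    {
                        "type": "image_url",
                        # Pixmap.tobytes() emits PNG by default, so declare image/png.
                        # The "detail" setting belongs inside the image_url object.
                        "image_url": {
                            "url": "data:image/png;base64,{image_data}",
                            "detail": "high",
                        },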
                    }
                ],
            ),
        ]
    )
    return prompt_template | llm | StrOutputParser()


def pdf_to_base64(pdf_path: str) -> list[str]:
    logger.info(f"Converting PDF at {pdf_path} to base64.")
    with pymupdf.open(pdf_path) as pdf_file:
        base64_pages = [
            base64.b64encode(page.get_pixmap().tobytes()).decode("utf-8")  # type: ignore
            for page in pdf_file
        ]
    logger.info(f"Converted {len(base64_pages)} pages to base64.")
    return base64_pages


def convert_pdf_to_markdown(pdf_paths: list[str]) -> list[str]:
    logger.info("Starting PDF to Markdown conversion process.")
    markdown_conversion_chain = get_markdown_conversion_chain()
    markdown_documents = []
    for path in pdf_paths:
        logger.info(f"Processing PDF: {path}")
        base64_pages = pdf_to_base64(path)
        markdown_pages = markdown_conversion_chain.batch(
            [{"image_data": page} for page in base64_pages]
        )
        markdown_documents.extend(markdown_pages)
    logger.info("PDF to Markdown conversion process completed.")
    return markdown_documents


if __name__ == "__main__":
    logger.info("Script execution started.")
    with get_openai_callback() as callback:
        response = convert_pdf_to_markdown(argv[1:])
        output_path = "src/output/markdown.md"
        with open(output_path, "w", encoding="utf-8") as output_file:
            output_file.write("\n".join(response))
        logger.info(f"Markdown content written to {output_path}.")
        print(callback)
    logger.info("Script execution finished.")
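One detail in `pdf_to_base64` is worth a note: `page.get_pixmap()` with no arguments renders at PyMuPDF's default of 72 dpi, which is one pixel per PDF point and can be coarse for small print. Recent PyMuPDF versions also accept a `dpi` argument, for example `page.get_pixmap(dpi=150)`. Whether a higher value improves the OCR result depends on the document, and larger images mean larger request payloads. The sketch below only does the arithmetic to show how the pixel size scales. `pixel_size` is a hypothetical helper and the A4 dimensions are an illustrative assumption.

```python
POINTS_PER_INCH = 72  # PDF user space uses 72 points per inch


def pixel_size(width_pt: float, height_pt: float, dpi: int = 72) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given dpi."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# An A4 page is roughly 595 x 842 points.
for dpi in (72, 150, 200):
    print(dpi, pixel_size(595, 842, dpi))
```

The actual pixmap may differ by a pixel because of rounding inside the renderer, so treat these numbers as estimates.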

Let’s break down the key components of our script:

System Prompt: We define a detailed prompt that instructs the GPT Vision model on how to perform OCR and convert the text to Markdown.
Markdown Conversion Chain: We set up a conversion chain using ChatOpenAI, which processes the base64-encoded PDF images and produces Markdown output.
PDF-to-Base64 Conversion: Each page of the PDF is rendered to an image and encoded as a base64 string for processing.
PDF-to-Markdown Conversion: The core function processes each encoded PDF page through the conversion chain.
Main Execution: The script handles command-line arguments, processes the PDFs, and writes the generated Markdown to an output file.
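Two practical gaps remain in the final step. The model may occasionally wrap a page in a code fence even though the prompt asks it not to, and writing to `src/output/markdown.md` fails if that directory does not exist yet. The sketch below shows one possible way to handle both. The helpers `strip_code_fence` and `save_pages` are hypothetical additions, not part of the original script.

```python
import tempfile
from pathlib import Path

# A Markdown code-fence marker (three backticks), built at runtime.
FENCE = "`" * 3


def strip_code_fence(markdown: str) -> str:
    """Remove a single wrapping code fence, if the whole page is wrapped in one."""
    lines = markdown.strip().splitlines()
    if len(lines) >= 2 and lines[0].startswith(FENCE) and lines[-1].strip() == FENCE:
        lines = lines[1:-1]
    return "\n".join(lines)


def save_pages(pages: list[str], output_path: Path) -> None:
    """Write cleaned pages to output_path, creating parent directories first."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    cleaned = [strip_code_fence(page) for page in pages]
    output_path.write_text("\n\n".join(cleaned), encoding="utf-8")


if __name__ == "__main__":
    # Demonstrate with a temporary directory instead of the real output path.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "output" / "markdown.md"
        save_pages([f"{FENCE}markdown\n# Page 1\n{FENCE}", "# Page 2"], target)
        print(target.read_text(encoding="utf-8"))
```

The fence check is deliberately simple: a page that legitimately starts with one code block and ends with another would be stripped incorrectly, so inspect the output before relying on it.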

Results and Performance

In our tests, we processed a three-page PDF with both the GPT-4o and GPT-4o-mini models and recorded the approximate cost of each run:

GPT-4o-mini: Cost approximately $0.01 for three pages
GPT-4o: Cost approximately $0.05 for three pages

Both models successfully captured formulas and tables in the proper format, with GPT-4o showing slightly better output quality.
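The cost figures above can be turned into a rough budget for larger jobs. The sketch below assumes cost scales linearly with page count, which is only an approximation: real cost depends on how much text each page holds and on current pricing. `estimate_cost` is a hypothetical helper, and the 300-page job is an illustrative number.

```python
def estimate_cost(pages: int, sample_cost: float, sample_pages: int) -> float:
    """Linearly extrapolate the cost of a sample run to a larger page count."""
    per_page = sample_cost / sample_pages
    return pages * per_page


# Approximate costs measured for the three-page test PDF.
sample_costs = {"gpt-4o-mini": 0.01, "gpt-4o": 0.05}

for model, cost in sample_costs.items():
    print(f"{model}: about ${estimate_cost(300, cost, 3):.2f} for 300 pages")
```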

Screenshot showing the formulas in LaTeX syntax

Conclusion

By using GPT Vision models to perform OCR, we’ve built a compact solution for converting PDF documents to Markdown. This approach saves time and reduces manual effort, and in our three-page test it captured formulas and tables in the proper format.

As AI technology continues to evolve, we can expect even more powerful and efficient tools for document processing. For now, this Python script offers a significant leap forward in automating the conversion of PDFs to editable, searchable, and beautifully formatted Markdown documents.

Are you ready to revolutionize your document management workflow? Give this script a try and experience the power of AI-driven OCR for yourself!