Text Extraction from Image using Google Gemini

Nishant Welpulwar
Google Cloud - Community
4 min read · Jan 21, 2024

Introduction

In today’s digital world, we often find ourselves dealing with documents in image formats (like scanned PDFs or photos). Optical Character Recognition (OCR) is the technology that bridges the gap between these image-based documents and the world of digital, editable text. OCR enables us to extract text from images, making it searchable, editable, and usable in various applications.

How OCR works

A typical OCR process involves the following steps:

1. Pre-processing: The image may be pre-processed to improve quality. This can include actions like:

  • Noise reduction (removing speckles or blur)
  • Image binarization (converting to black and white)
  • Deskewing (correcting tilted images)

2. Text Segmentation: The OCR system identifies lines of text, individual words, and eventually, individual characters within the image.

3. Character Recognition: This is the core part where patterns in the image are compared against known patterns of letters and symbols. Different OCR techniques can be used:

  • Pattern Matching: Comparing image patterns to a stored library of character templates.
  • Feature Extraction: Analyzing specific features (e.g., curves, lines) and classifying them into characters.
  • Machine Learning Algorithms: Using trained neural networks to recognize complex patterns in handwritten text or various fonts.

4. Post-processing: OCR output may contain errors due to image quality or language complexity. Post-processing can include:

  • Spell-checking or error correction
  • Language-specific rules to improve accuracy
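The pre- and post-processing steps above can be sketched in plain Python. This is a toy illustration only: the image is a list of grayscale rows, and the "noise reduction" is a naive stand-in for the median or morphological filters a real pipeline (e.g. OpenCV) would use.

```python
# Minimal sketch of OCR pre-processing on a grayscale image represented
# as a list of rows (0 = black ink, 255 = white background).

def binarize(image, threshold=128):
    """Binarization: convert grayscale pixels to pure black or white."""
    return [[0 if px < threshold else 255 for px in row] for row in image]

def remove_speckles(image):
    """Naive noise reduction: flip isolated black pixels that have no
    black 4-neighbours (a toy stand-in for a median filter)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            if image[y][x] == 0:
                neighbours = [
                    image[ny][nx]
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= ny < h and 0 <= nx < w
                ]
                if all(n == 255 for n in neighbours):
                    out[y][x] = 255  # isolated speckle: treat as background
    return out

scanned = [
    [200, 30, 210],
    [220, 25, 240],
    [90, 230, 215],   # bottom-left pixel is a stray dark speck
]
bw = binarize(scanned)
clean = remove_speckles(bw)  # the stray pixel is flipped to white
```

Segmentation and recognition would then run on `clean`; in practice all of these steps come ready-made in OCR libraries such as Tesseract.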

OCR in the age of LLMs

Large Language Models (LLMs) have revolutionized Optical Character Recognition (OCR) in several ways, making it significantly easier and more effective:

1. Enhanced Accuracy through Understanding Context

  • Traditional OCR systems often focused on recognizing individual characters in isolation. This could lead to errors, especially with poor image quality or complex layouts.
  • LLMs excel at understanding context. When processing scanned text, an LLM can leverage its vast knowledge base to:
    • Correct misrecognized characters: If a character is ambiguous, the LLM can consider the surrounding words and sentences to make a more informed prediction.
    • Handle variations: LLMs are trained on a massive and diverse text dataset, making them adaptable to different fonts, handwriting styles, and languages.
    • Understand document structure: LLMs can identify headings, paragraphs, and tables, aiding in structured text extraction.
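To see why context matters, here is a deliberately tiny illustration of context-aware correction. An LLM does this implicitly through learned language modeling; this sketch fakes it with a hypothetical vocabulary and a table of common OCR confusions.

```python
# Toy illustration (NOT how an LLM works internally): an ambiguous OCR
# character such as '0' vs 'O' is resolved by checking which candidate
# spelling forms a known word.

KNOWN_WORDS = {"invoice", "total", "100", "hello"}  # stand-in vocabulary
AMBIGUOUS = {"0": "o", "1": "l", "5": "s"}          # common OCR confusions

def correct_token(token):
    """Return the token unchanged if known; otherwise try swapping
    commonly confused characters and keep the first known spelling."""
    low = token.lower()
    if low in KNOWN_WORDS:
        return low
    for digit, letter in AMBIGUOUS.items():
        candidate = low.replace(digit, letter)
        if candidate in KNOWN_WORDS:
            return candidate
    return token  # no confident correction: leave as-is

print(correct_token("T0tal"))  # '0' misread for 'o' -> "total"
print(correct_token("100"))    # a genuine number is left alone
```

A character-by-character OCR engine has no way to distinguish these two cases; a model that conditions on surrounding text does.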

2. Direct Text Extraction without Traditional OCR Pipelines

  • Traditional OCR involves a multi-step process (preprocessing, segmentation, recognition). This pipeline can be complex and prone to errors.
  • LLMs can sometimes bypass this pipeline. You can feed an image directly to an LLM, and instruct it to extract the text. LLMs trained on both text and image data can “read” the text in the image and output it as digital text. This simplifies the OCR process and removes potential error points.

3. Handling Complex and Challenging Text

  • Poor Image Quality: LLMs, with their robust pattern recognition and contextual understanding, can often decipher text from low-resolution, noisy, or blurry images where traditional OCR might struggle.
  • Handwritten Text: LLMs trained on handwritten text datasets can handle the wide variations in handwriting styles, making OCR for handwritten documents more feasible.
  • Specialized Text (like in technical or legal documents): LLMs can be fine-tuned on domain-specific text. This means that they can be made to better understand and extract text from documents with specialized vocabulary or unique formatting.

4. Integration with Natural Language Processing (NLP)

  • LLMs go beyond simple text extraction. When combined with NLP capabilities, you can use LLMs to:
    • Summarize the extracted text from a document.
    • Classify the content based on its topic or sentiment.
    • Answer questions about the scanned document’s content.
    • Translate the extracted text into different languages.

OCR with Google Gemini

Google Gemini is a family of cutting-edge multimodal large language models developed by Google. At the heart of Gemini’s capabilities lies its multimodality — it can process and generate different types of data, including text, code, images, and audio.

With Gemini, text extraction takes only a few lines of code:

Step 1: Write a prompt instructing the model which fields to extract and asking it to return the output in JSON format.

Step 2: Provide the image and the prompt to the Gemini model.

Step 3: Convert the JSON output into a pandas DataFrame, which is easy to read and can also be stored in a database.
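The three steps can be sketched with the `google-generativeai` Python SDK. The model name, the invoice fields, and the example reply below are illustrative assumptions, not part of the original article; the API call itself requires a key, so it is kept inside a function that is not executed here.

```python
# Sketch of the three steps: prompt -> Gemini -> JSON -> DataFrame.

import json
import pandas as pd

# Step 1: a prompt naming the fields to extract (fields are assumptions).
PROMPT = """Extract the following fields from the invoice image and
return ONLY a JSON object: invoice_number, date, total_amount."""

def extract_fields(image_path, api_key):
    """Step 2: send the prompt and the image to Gemini.
    Not run here — it needs an API key; model name may change over time."""
    import google.generativeai as genai
    import PIL.Image

    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-pro-vision")
    response = model.generate_content([PROMPT, PIL.Image.open(image_path)])
    return response.text

def to_dataframe(json_text):
    """Step 3: turn the model's JSON reply into a one-row DataFrame."""
    record = json.loads(json_text)
    return pd.DataFrame([record])

# Hard-coded reply standing in for the model's output:
reply = '{"invoice_number": "INV-42", "date": "2024-01-21", "total_amount": 99.5}'
df = to_dataframe(reply)
print(df)
```

From the DataFrame, storing the result is one more line (e.g. `df.to_sql(...)` or `df.to_csv(...)`).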

Gradio

Gradio is an open-source Python library that is designed to rapidly build and share interactive machine learning demos or web applications. It provides a user-friendly interface for creating web-based visual components (like text boxes, image uploaders, dropdown menus, and more) that interact directly with your machine learning model or Python code.
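A minimal sketch of wrapping the extraction step in a Gradio app might look like this. The `extract_text` function is a self-contained placeholder; in a real app it would call Gemini as described above.

```python
# Minimal Gradio wiring: an image uploader feeding a text output box.

def extract_text(image):
    """Placeholder for the Gemini call: here it just reports the input
    size so the sketch runs without any API key."""
    if image is None:
        return "Please upload an image."
    return f"Received an image of size {image.size}."

def build_demo():
    """Wire the function to an upload box and a text box."""
    import gradio as gr  # imported lazily so the sketch runs without gradio

    return gr.Interface(
        fn=extract_text,
        inputs=gr.Image(type="pil"),
        outputs="text",
        title="Text Extraction with Gemini",
    )

if __name__ == "__main__":
    build_demo().launch()  # serves the demo locally in a browser
```

Swapping the placeholder for the real Gemini call gives you a shareable OCR demo in a few dozen lines.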
