Exploring the Microsoft Phi-3 Vision language model as an OCR for document data extraction
Examples of zero-shot OCR applications of the latest version of the Microsoft Phi-3 vision language model. I show how to extract data from documents such as an identity card, a driving license or a health insurance card by applying the Phi-3 model to images of the documents of interest.
The Phi-3 model is the latest version of the Microsoft small language model. It comes in four variants (check this link for more information: https://azure.microsoft.com/en-us/blog/new-models-added-to-the-phi-3-family-available-on-microsoft-azure/):
- Phi-3-mini. A 3.8B parameter language model, available in two context lengths (128K and 4K)
- Phi-3-small. A 7B parameter language model, available in two context lengths (128K and 8K)
- Phi-3-medium. A 14B parameter language model, available in two context lengths (128K and 4K)
- Phi-3-vision. A 4.2B parameter multimodal model with language and vision capabilities
In this post I am interested in the applications of the multimodal vision language model. As explained in the official documentation, Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model that provides uses for general purpose AI systems and applications with visual and text input capabilities which require:
- memory/compute constrained environments;
- latency bound scenarios;
- general image understanding;
- OCR;
- chart and table understanding.
In this post I am interested in checking the data extraction capabilities of the model when it is used as an OCR on personal documents such as the identity card, the driver license and the health insurance card. The documents used in this test are facsimiles: they are not original documents and do not belong to real people.
The second part of this story, where I discuss the application of the Phi-3 model to deformed documents and the computer vision techniques used to make those images more readable, is at this link:
You can find the complete notebook in my GitHub repo at this link
Model Instance
In order to use the model in inference mode, I built an environment as follows
conda create -n llm_images python=3.10
conda activate llm_images
pip install torch==2.3.0 torchvision==0.18.0
pip install packaging
pip install pillow==10.3.0 chardet==5.2.0 flash_attn==2.5.8 accelerate==0.30.1 bitsandbytes==0.43.1 Requests==2.31.0 transformers==4.40.2 albumentations==1.3.1 opencv-contrib-python==4.10.0.84 matplotlib==3.9.0
pip uninstall jupyter
conda install -c anaconda jupyter
conda update jupyter
pip install --upgrade 'nbconvert>=7' 'mistune>=2'
pip install cchardet
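Before moving on, a quick sanity check of the environment can save time. The snippet below is a minimal sketch of my own (not part of the original setup) that verifies PyTorch sees the GPU and that the pinned package versions were picked up:
# Sanity check: confirm GPU visibility and the versions of the key packages
import torch
import transformers
import flash_attn

print("CUDA available:", torch.cuda.is_available())       # should be True for GPU inference
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("torch version:", torch.__version__)                 # expected 2.3.0
print("transformers version:", transformers.__version__)   # expected 4.40.2
print("flash_attn version:", flash_attn.__version__)       # expected 2.5.8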
Once the environment is available, I downloaded the model from the Hugging Face repository
# Import necessary libraries
from PIL import Image
import requests
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
from transformers import BitsAndBytesConfig
import torch
from IPython.display import display
import time
# Define model ID
model_id = "microsoft/Phi-3-vision-128k-instruct"
# Load processor
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Define BitsAndBytes configuration for 4-bit quantization
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load model with 4-bit quantization and map to CUDA
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    quantization_config=nf4_config,
)
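As a quick check of the effect of the 4-bit quantization, one can print the memory occupied by the loaded weights. This is a small sketch relying on the get_memory_footprint method that transformers models expose:
# Report the memory occupied by the quantized model weights (in GB)
print("Model memory footprint: {:.2f} GB".format(model.get_memory_footprint() / 1024**3))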
Next, I prepared a Python function that takes as input the messages and the image path to send to the model, runs the inference, and prints the model's response.
def model_inference(messages, path_image):
    start_time = time.time()
    image = Image.open(path_image)
    # Prepare prompt with image token
    prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Process prompt and image for model input
    inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
    # Generate text response using model
    generate_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=500,
        do_sample=False,
    )
    # Remove input tokens from generated response
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
    # Decode generated IDs to text
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    display(image)
    end_time = time.time()
    print("Inference time: {}".format(end_time - start_time))
    # Print the generated response
    print(response)
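The function above displays the image and prints the response. For programmatic use, for example to parse the JSON shown in the next sections, a small variant that returns the decoded text can be handy. The sketch below is my own addition and simply reuses the processor and model objects defined earlier:
def model_inference_text(messages, path_image):
    # Same pipeline as model_inference, but the decoded response is returned instead of printed
    image = Image.open(path_image)
    prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
    generate_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=500,
        do_sample=False,
    )
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]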
In what follows I show how to extract the data from each document. Depending on whether the front or the back face of the document is processed, I prepared a specific prompt able to identify the fields whose data I want to extract.
Identity card OCR
Front face
For the front face of the Italian identity card I used the following prompt to extract the main personal data and output it in JSON format.
prompt_cie_front = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'Comune Di/ Municipality', 'COGNOME /Surname', 'NOME/NAME', 'LUOGO E DATA DI NASCITA/\
PLACE AND DATE OF BIRTH', 'SESSO/SEX', 'STATURA/HEIGHT', 'CITADINANZA/NATIONALITY',\
'EMISSIONE/ ISSUING', 'SCADENZA /EXPIRY'. Read the code at the top right and put it in the JSON field 'CODE'"}]
# Path to the local image
path_image = "/home/randellini/llm_images/resources/cie_fronte.jpg"
# inference
model_inference(prompt_cie_front, path_image)
For the above image I obtained the following output. Note that the unique card code is located at the top right of the card without any associated field name. To extract its value I specified in the prompt that the model has to read the code at the top right and put it in the JSON field named "CODE". The only error is that the first zero of the unique code has been mistaken for a capital O (a simple cleanup for this kind of confusion is sketched after the output).
Inference time: 9.793543815612793
{
"Comune Di/ Municipality": "SERENELLA MARITTIMA",
"COGNOME /Surname": "ROSSI",
"NOME/NAME": "BIANCA",
"LUOGO E DATA DI NASCITA": "PINO SULLA SPONDA DEL LAGO MAGGIORE (VA) 30.12.1964",
"SESSO/SEX": "F",
"STATURA/HEIGHT": "180",
"CITADINANZA/NATIONALITY": "ITA",
"EMISSIONE/ ISSUING": "30.05.2022",
"SCADENZA /EXPIRY": "30.12.2031",
"CODE": "CAO000AA"
}
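Confusions between the digit 0 and the capital letter O are a classic OCR issue. A minimal, purely illustrative cleanup (my own sketch, assuming the central block of the code is expected to contain digits only) could look like this:
def fix_zero_o_confusion(code, head=2, tail=2):
    # Assumes the characters between the first `head` and last `tail` positions are digits,
    # so any capital O found there is mapped back to a zero
    middle = code[head:-tail].replace("O", "0")
    return code[:head] + middle + code[-tail:]

# Example on the value extracted above
print(fix_zero_o_confusion("CAO000AA"))  # -> CA0000AA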
Back face
To extract the data from the back face I used the following prompt
prompt_cie_back = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'CODICE FISCALE/FISCAL CODE', 'ESTREMI ATTO DI NASCITA', 'INDIRIZZO DI RESIDENZA/RESIDENCE'"}]
# Path to the local image
path_image = "/home/randellini/llm_images/resources/cie_retro.jpg"
# inference
model_inference(prompt_cie_back, path_image)
I obtained the following result. There is just one error, namely that the third character of the fiscal code, a capital S, is missing (a simple structural check for this is sketched after the output).
Inference time: 4.082342147827148
{
"codice_fiscale": "RSBNC64T70G677R",
"estremi_atto_di_nascita": "00000.0A00",
"indirizzo_di_residenza": "Via Salaria, 712"
}
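Errors like the missing character are easy to detect automatically, since the Italian fiscal code has a fixed 16-character structure. The check below is a minimal sketch of mine; it validates only the basic pattern and ignores the so-called omocodia substitutions:
import re

def looks_like_fiscal_code(value):
    # Basic structure of an Italian codice fiscale: 6 letters, 2 digits, 1 letter,
    # 2 digits, 1 letter, 3 digits, 1 final check letter (16 characters in total)
    return re.fullmatch(r"[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]", value.upper()) is not None

print(looks_like_fiscal_code("RSBNC64T70G677R"))   # False: one character is missing
print(looks_like_fiscal_code("RSSBNC64T70G677R"))  # True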
Driver license OCR
For the front face of the Italian driver license I used the following prompt
prompt_ld_front = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'1.', '2.', '3.', '4a.', '4b.', '4c.', '5.','9.'"}]
# Path to the local image
path_image = "/home/randellini/llm_images/resources/patente_fronte.png"
# inference
model_inference(prompt_ld_front, path_image)
obtaining the following result
Inference time: 5.2030909061431885
{
"1": "ROSSI",
"2": "MARIA",
"3": "01/01/65",
"4a": "01/03/2014",
"4b": "01/01/2025",
"4c": "MIT-UCO",
"5": "A0A000000A",
"9": "B"
}
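Note that the dates in the response come back in mixed formats ("01/01/65" with a two-digit year versus "01/03/2014"). If the extracted values feed a downstream system, a small normalization step helps; the helper below is an illustrative sketch of mine that converts both variants to ISO format:
from datetime import datetime

def normalize_date(value):
    # Parse dd/mm/yy or dd/mm/yyyy and return an ISO formatted date string
    year_token = value.split("/")[-1]
    fmt = "%d/%m/%Y" if len(year_token) == 4 else "%d/%m/%y"
    try:
        parsed = datetime.strptime(value, fmt)
    except ValueError:
        return value  # leave the raw string untouched if it does not parse
    # %y maps 00-68 to 2000-2068; fold dates that end up in the future back by a century
    if parsed.year > datetime.now().year:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed.strftime("%Y-%m-%d")

print(normalize_date("01/01/65"))    # -> 1965-01-01
print(normalize_date("01/03/2014"))  # -> 2014-03-01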
For the back face of the Italian driver license I have not yet found a prompt able to read the values of the table with columns '9.', '10.', '11.' and '12.'. Furthermore, '12.' appears twice: first as the name of a column of the table, then as a field at the bottom left of the card.
This last field is important because it warns of particular obligations imposed on the driver. For example, code 01 indicates the obligation to drive with glasses or contact lenses.
Health insurance card OCR
Front face
To read the values of the front face of the Italian health insurance card I used the prompt
prompt_hic_front = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'Codice Fiscale', 'Sesso', 'Cognome', 'Nome', 'Luogo di nascita', 'Provincia', 'Data di nascita', 'Data di scadenza'"}]
# Path to the local image
path_image = "/home/randellini/llm_images/resources/tessera_sanitaria_fronte.jpg"
# inference
model_inference(prompt_hic_front, path_image)
I obtained the following result
Inference time: 7.003508806228638
```json
{
"Codice Fiscale": "RSSMRO62B25E205Y",
"Sesso": "M",
"Cognome": "ROSSI",
"Nome": "MARIO",
"Luogo di nascita": "CASSINA DE' PECCHI",
"Provincia": "MI",
"Data di nascita": "25/02/1962",
"Data di scadenza": "10/10/2019"
}
```
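Note that this time the model wrapped the JSON in markdown fences. When the extracted fields need to be consumed programmatically, the fences have to be stripped before parsing; the helper below is a small sketch of mine (usable, for instance, on the string returned by the model_inference_text variant shown earlier):
import json
import re

def parse_json_response(response):
    # Keep only the JSON object, dropping optional ```json ... ``` fences around it
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the model response")
    return json.loads(match.group(0))

# Example: fields = parse_json_response(model_inference_text(prompt_hic_front, path_image))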
Back face
To read the back face of the card I used the prompt
prompt_hic_back = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'3 Cognome', '4 Nome', '5 Data di nascita', '6 Numero identificativo personale', '7 Numero identificazione dell'istituzione', 'Numero di identificazione della tessera', '9 Scadenza'"}]
# Path to the local image
path_image = "/home/randellini/llm_images/resources/tessera_sanitaria_retro.jpg"
# inference
model_inference(prompt_hic_back, path_image)
obtaining
Inference time: 7.403932809829712
{
"3 Cognome": "ROSSI",
"4 Nome": "MARIO",
"5 Data di nascita": "25/02/1962",
"6 Numero identificativo personale": "RSSMRO62B25E205Y",
"7 Numero identificazione dell'istituzione": "0030 - LOMBARDIA",
"Numero di identificazione della tessera": "80380800301234567890",
"9 Scadenza": "01/01/2006"
}
If you liked this post and found it interesting, give it a clap!