Exploring the Microsoft Phi-3 Vision language model as an OCR for document data extraction
Examples of zero-shot OCR applications of the latest version of the Microsoft Phi-3 vision language model. I show how to extract data from documents such as an identity card, a driving license or a health insurance card by applying the Phi-3 model to images of the documents of interest.
The Phi-3 model is the latest version of the Microsoft small language model. It comes in four variants (check this link for more information: https://azure.microsoft.com/en-us/blog/new-models-added-to-the-phi-3-family-available-on-microsoft-azure/):
- Phi-3-mini. A 3.8B parameter language model, available in two context lengths (128K and 4K)
- Phi-3-small. A 7B parameter language model, available in two context lengths (128K and 8K)
- Phi-3-medium. A 14B parameter language model, available in two context lengths (128K and 4K)
- Phi-3-vision. A 4.2B parameter multimodal model with language and vision capabilities
In this post I am interested in the applications of the multimodal vision language model. As explained in the official documentation, Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model that provides uses for general purpose AI systems and applications with visual and text input capabilities which require:
- memory/compute constrained environments;
- latency bound scenarios;
- general image understanding;
- OCR;
- chart and table understanding.
In this post I am interested in checking the data extraction capabilities of the model when it is used as an OCR on personal documents such as the identity card, the driver license and the health insurance card. The documents used in this test are facsimiles: they are not original documents and do not belong to real people.
The second part of this story, where I discuss the application of the Phi-3 model to deformed documents and the computer vision techniques used to make those images more readable, is at this link:
You can find the complete notebook in my GitHub repo at this link
Model Instance
In order to use the model in inference mode, I built an environment as follows
conda create -n llm_images python=3.10
conda activate llm_images
pip install torch==2.3.0 torchvision==0.18.0
pip install packaging
pip install pillow==10.3.0 chardet==5.2.0 flash_attn==2.5.8 accelerate==0.30.1 bitsandbytes==0.43.1 Requests==2.31.0 transformers==4.40.2 albumentations==1.3.1 opencv-contrib-python==4.10.0.84 matplotlib==3.9.0
pip uninstall jupyter
conda install -c anaconda jupyter
conda update jupyter
pip install --upgrade 'nbconvert>=7' 'mistune>=2'
pip install cchardet
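Before moving on, a quick sanity check of the environment can save time. The snippet below is a minimal sketch of my own (not part of the original setup) that verifies PyTorch sees the GPU and that the pinned package versions were picked up:
# Sanity check: confirm GPU visibility and the versions of the key packages
import torch
import transformers
import flash_attn

print("CUDA available:", torch.cuda.is_available())       # should be True for GPU inference
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("torch version:", torch.__version__)                 # expected 2.3.0
print("transformers version:", transformers.__version__)   # expected 4.40.2
print("flash_attn version:", flash_attn.__version__)       # expected 2.5.8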
Once the environment is available, I downloaded the model from the Hugging Face repository
# Import necessary libraries
from PIL import Image
import requests
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
from transformers import BitsAndBytesConfig
import torch
from IPython.display import display
import time
# Define model ID
model_id = "microsoft/Phi-3-vision-128k-instruct"
# Load processor
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Define BitsAndBytes configuration for 4-bit quantization
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load model with 4-bit quantization and map to CUDA
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    quantization_config=nf4_config,
)
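As a quick check of the effect of the 4-bit quantization, one can print the memory occupied by the loaded weights. This is a small sketch relying on the get_memory_footprint method that transformers models expose:
# Report the memory occupied by the quantized model weights (in GB)
print("Model memory footprint: {:.2f} GB".format(model.get_memory_footprint() / 1024**3))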
Next, I prepared a Python function that takes as input the messages and the image path to send to the model, runs the inference, and prints the model's response.
def model_inference(messages, path_image):
    start_time = time.time()
    image = Image.open(path_image)
    # Prepare prompt with image token
    prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Process prompt and image for model input
    inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
    # Generate text response using model
    generate_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=500,
        do_sample=False,
    )
    # Remove input tokens from generated response
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
    # Decode generated IDs to text
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    display(image)
    end_time = time.time()
    print("Inference time: {}".format(end_time - start_time))
    # Print the generated response
    print(response)
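The function above displays the image and prints the response. For programmatic use, for example to parse the JSON shown in the next sections, a small variant that returns the decoded text can be handy. The sketch below is my own addition and simply reuses the processor and model objects defined earlier:
def model_inference_text(messages, path_image):
    # Same pipeline as model_inference, but the decoded response is returned instead of printed
    image = Image.open(path_image)
    prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
    generate_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=500,
        do_sample=False,
    )
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]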
In what follows I show how to extract the data from each document. Depending on whether the front or the back face of the document is processed, I prepared a specific prompt able to identify the fields whose data I want to extract.
Identity card OCR
Front face
For the front face of the Italian identity card I used the following prompt to extract the main personal data and output it in JSON format.
prompt_cie_front = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'Comune Di/ Municipality', 'COGNOME /Surname', 'NOME/NAME', 'LUOGO E DATA DI NASCITA/\
PLACE AND DATE OF BIRTH', 'SESSO/SEX', 'STATURA/HEIGHT', 'CITADINANZA/NATIONALITY',\
'EMISSIONE/ ISSUING', 'SCADENZA /EXPIRY'. Read the code at the top right and put it in the JSON field 'CODE'"}]
# Path to the local image
path_image = "/home/randellini/llm_images/resources/cie_fronte.jpg"
# inference
model_inference(prompt_cie_front, path_image)
For the above image I obtained the following output. Note that the unique card code is located at the top right of the card without any associated field name. To extract its value I specified in the prompt that the model has to read the code at the top right and put it in the JSON field named "CODE". The only error is that the first zero of the unique code has been mistaken for a capital O (a simple cleanup for this kind of confusion is sketched after the output).
Inference time: 9.793543815612793
{
"Comune Di/ Municipality": "SERENELLA MARITTIMA",
"COGNOME /Surname": "ROSSI",
"NOME/NAME": "BIANCA",
"LUOGO E DATA DI NASCITA": "PINO SULLA SPONDA DEL LAGO MAGGIORE (VA) 30.12.1964",
"SESSO/SEX": "F",
"STATURA/HEIGHT": "180",
"CITADINANZA/NATIONALITY": "ITA",
"EMISSIONE/ ISSUING": "30.05.2022",
"SCADENZA /EXPIRY": "30.12.2031",
"CODE": "CAO000AA"
}
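Confusions between the digit 0 and the capital letter O are a classic OCR issue. A minimal, purely illustrative cleanup (my own sketch, assuming the central block of the code is expected to contain digits only) could look like this:
def fix_zero_o_confusion(code, head=2, tail=2):
    # Assumes the characters between the first `head` and last `tail` positions are digits,
    # so any capital O found there is mapped back to a zero
    middle = code[head:-tail].replace("O", "0")
    return code[:head] + middle + code[-tail:]

# Example on the value extracted above
print(fix_zero_o_confusion("CAO000AA"))  # -> CA0000AA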
Back face
To extract the data from the back face I used the following prompt
prompt_cie_back = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'CODICE FISCALE/FISCAL CODE', 'ESTREMI ATTO DI NASCITA', 'INDIRIZZO DI RESIDENZA/RESIDENCE'"}]
# Path to the local image
path_image = "/home/randellini/llm_images/resources/cie_retro.jpg"
# inference
model_inference(prompt_cie_back, path_image)
I obtained the following result. There is just one error, namely that the third character of the fiscal code, a capital S, is missing (a simple structural check for this is sketched after the output).
Inference time: 4.082342147827148
{
"codice_fiscale": "RSBNC64T70G677R",
"estremi_atto_di_nascita": "00000.0A00",
"indirizzo_di_residenza": "Via Salaria, 712"
}
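Errors like the missing character are easy to detect automatically, since the Italian fiscal code has a fixed 16-character structure. The check below is a minimal sketch of mine; it validates only the basic pattern and ignores the so-called omocodia substitutions:
import re

def looks_like_fiscal_code(value):
    # Basic structure of an Italian codice fiscale: 6 letters, 2 digits, 1 letter,
    # 2 digits, 1 letter, 3 digits, 1 final check letter (16 characters in total)
    return re.fullmatch(r"[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]", value.upper()) is not None

print(looks_like_fiscal_code("RSBNC64T70G677R"))   # False: one character is missing
print(looks_like_fiscal_code("RSSBNC64T70G677R"))  # True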
Driver license OCR
For the front face of the Italian driver license I used the following prompt
prompt_ld_front = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'1.', '2.', '3.', '4a.', '4b.', '4c.', '5.','9.'"}]
# Path to the local image
path_image = "/home/randellini/llm_images/resources/patente_fronte.png"
# inference
model_inference(prompt_ld_front, path_image)
obtaining the following result
Inference time: 5.2030909061431885
{
"1": "ROSSI",
"2": "MARIA",
"3": "01/01/65",
"4a": "01/03/2014",
"4b": "01/01/2025",
"4c": "MIT-UCO",
"5": "A0A000000A",
"9": "B"
}
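Note that the dates in the response come back in mixed formats ("01/01/65" with a two-digit year versus "01/03/2014"). If the extracted values feed a downstream system, a small normalization step helps; the helper below is an illustrative sketch of mine that converts both variants to ISO format:
from datetime import datetime

def normalize_date(value):
    # Parse dd/mm/yy or dd/mm/yyyy and return an ISO formatted date string
    year_token = value.split("/")[-1]
    fmt = "%d/%m/%Y" if len(year_token) == 4 else "%d/%m/%y"
    try:
        parsed = datetime.strptime(value, fmt)
    except ValueError:
        return value  # leave the raw string untouched if it does not parse
    # %y maps 00-68 to 2000-2068; fold dates that end up in the future back by a century
    if parsed.year > datetime.now().year:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed.strftime("%Y-%m-%d")

print(normalize_date("01/01/65"))    # -> 1965-01-01
print(normalize_date("01/03/2014"))  # -> 2014-03-01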
For the back face of the Italian driver license I have not yet found a prompt able to read the values of the table with columns '9.', '10.', '11.' and '12.'. Furthermore, '12.' appears twice: first as the name of a column of the table, then as a field at the bottom left of the card.
This last field is important because it warns of particular obligations imposed on the driver. For example, code 01 indicates the obligation to drive with glasses or contact lenses.
Health insurance card OCR
Front face
To read the values of the front face of the Italian health insurance card I used the prompt
prompt_hic_front = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'Codice Fiscale', 'Sesso', 'Cognome', 'Nome', 'Luogo di nascita', 'Provincia', 'Data di nascita', 'Data di scadenza'"}]
# Path to the local image
path_image = "/home/randellini/llm_images/resources/tessera_sanitaria_fronte.jpg"
# inference
model_inference(prompt_hic_front, path_image)
I obtained the following result
Inference time: 7.003508806228638
```json
{
"Codice Fiscale": "RSSMRO62B25E205Y",
"Sesso": "M",
"Cognome": "ROSSI",
"Nome": "MARIO",
"Luogo di nascita": "CASSINA DE' PECCHI",
"Provincia": "MI",
"Data di nascita": "25/02/1962",
"Data di scadenza": "10/10/2019"
}
```
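Note that this time the model wrapped the JSON in markdown fences. When the extracted fields need to be consumed programmatically, the fences have to be stripped before parsing; the helper below is a small sketch of mine (usable, for instance, on the string returned by the model_inference_text variant shown earlier):
import json
import re

def parse_json_response(response):
    # Keep only the JSON object, dropping optional ```json ... ``` fences around it
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the model response")
    return json.loads(match.group(0))

# Example: fields = parse_json_response(model_inference_text(prompt_hic_front, path_image))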
Back face
To read the back face of the card I used the prompt
prompt_hic_back = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'3 Cognome', '4 Nome', '5 Data di nascita', '6 Numero identificativo personale', '7 Numero identificazione dell'istituzione', 'Numero di identificazione della tessera', '9 Scadenza'"}]
# Path to the local image
path_image = "/home/randellini/llm_images/resources/tessera_sanitaria_retro.jpg"
# inference
model_inference(prompt_hic_back, path_image)
obtaining
Inference time: 7.403932809829712
{
"3 Cognome": "ROSSI",
"4 Nome": "MARIO",
"5 Data di nascita": "25/02/1962",
"6 Numero identificativo personale": "RSSMRO62B25E205Y",
"7 Numero identificazione dell'istituzione": "0030 - LOMBARDIA",
"Numero di identificazione della tessera": "80380800301234567890",
"9 Scadenza": "01/01/2006"
}
If you liked this post and found it interesting, give it a clap!