Document Image Transformer: Introduction, Usage and Deployment

Document Image Transformer

Document Image Transformer (DiT) is a transformer model that can classify the category of a document from just a picture of it.

For example, you feed an image of a document to the model, and the model tells you what kind of document it is.


We will use HuggingFace and Pinferencia.

Pinferencia makes it super easy to serve any model with just three extra lines.
HuggingFace makes it easy to use the pre-trained model with just several lines.

Install Dependencies


pip install "transformers[pytorch]"

If it doesn’t work, please check the Installation page of the official Transformers documentation.


pip install "pinferencia[uvicorn]"

If it doesn’t work, please check the Install page of the official Pinferencia documentation.

Example Usage of the Model

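import base64
from io import BytesIO

from PIL import Image
from transformers import pipeline

# Load the DiT checkpoint fine-tuned for document classification
classifier = pipeline(model="microsoft/dit-base-finetuned-rvlcdip")

def classify(image_base64_str):
    # Decode the base64 string into a PIL image, then classify it
    image = Image.open(BytesIO(base64.b64decode(image_base64_str)))
    return classifier(images=image)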

We can get the base64-encoded string of our image with an online Image to Base64 converter.
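As an alternative to an online converter, the string can also be produced locally with Python's standard library. The following is only a minimal sketch: the helper name image_file_to_base64 and the file name document.png are hypothetical and not part of the original example.

```python
import base64
from pathlib import Path


def image_file_to_base64(path):
    """Read a file and return its contents as a base64-encoded ASCII string."""
    raw_bytes = Path(path).read_bytes()
    return base64.b64encode(raw_bytes).decode("ascii")


# Hypothetical usage: "document.png" stands in for your own image file.
# image_base64_str = image_file_to_base64("document.png")
```

The returned string can be passed directly to classify, which decodes it back into image bytes.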


The output is:

[{'score': 0.8400426506996155, 'label': 'presentation'},
{'score': 0.043046072125434875, 'label': 'advertisement'},
{'score': 0.024246374145150185, 'label': 'questionnaire'},
{'score': 0.014194409362971783, 'label': 'form'},
{'score': 0.013648252934217453, 'label': 'news article'}]

So, it thinks our image is most likely a presentation.
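Because the classifier returns a list of dictionaries with score and label keys, the most likely category can also be picked programmatically. Here is a minimal sketch that reuses the output shown above; the helper name top_label is hypothetical.

```python
def top_label(predictions):
    """Return the (label, score) pair with the highest score."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"], best["score"]


# The classifier output shown above, copied verbatim
predictions = [
    {"score": 0.8400426506996155, "label": "presentation"},
    {"score": 0.043046072125434875, "label": "advertisement"},
    {"score": 0.024246374145150185, "label": "questionnaire"},
    {"score": 0.014194409362971783, "label": "form"},
    {"score": 0.013648252934217453, "label": "news article"},
]

label, score = top_label(predictions)
print(f"{label} ({score:.1%})")  # prints: presentation (84.0%)
```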

Deploy the Model

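Create a file named app.py that defines the Pinferencia server object, service, and registers the classify function as a model. Then start the service: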


uvicorn app:service --reload

Wait for the model to be downloaded. When it’s finished, you’ll see uvicorn’s startup messages in the terminal.

Call the Service

You can use curl or the interactive API page from Pinferencia.
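For the programmatic route, the same kind of request can also be sent from Python. The sketch below rests on several assumptions that are not stated in this article: that the service listens on uvicorn's default address http://127.0.0.1:8000, that the model was registered under the hypothetical name dit, and that the predict endpoint follows the /v1/models/<name>/predict pattern with the input under a data key. Check each of these against your own app and the Pinferencia documentation before relying on it.

```python
import json
import urllib.request

# Assumptions: uvicorn's default host/port and a model registered as "dit".
BASE_URL = "http://127.0.0.1:8000"
MODEL_NAME = "dit"


def build_predict_request(image_base64_str):
    """Build (without sending) a POST request carrying the base64 image."""
    body = json.dumps({"data": image_base64_str}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/models/{MODEL_NAME}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(image_base64_str):
    """Send the request to the running service and return the decoded JSON."""
    request = build_predict_request(image_base64_str)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request is kept separate from sending it so that the URL and body can be inspected without a running server.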

Interactive API Page

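Open your browser, visit the address of the running service, and use the predict API on that page to send the base64 string of the image.

The page then displays the prediction result.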


If you want to know more about Pinferencia, visit its GitHub repository, underneathall/pinferencia.



