Published in


DPT : Segmentation Model Using Vision Transformer

This is an introduction to「DPT」, a machine learning model that can be used with ailia SDK. You can easily use this model to create AI applications using ailia SDK as well as many other ready-to-use ailia MODELS.


DPT (DensePredictionTransformers) is a segmentation model released by Intel in March 2021 that applies vision transformers to images. It can perform image semantic segmentation with 49.02% mIoU on ADE20K, and it can also be used for monocular depth estimation with an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network.


In DPT, vision transformers (ViT)are used instead of convolutional network. Using transformers allows to make more detailed and globally consistent predictions compared to convolutional networks. In particular, performance is improved when a large amount of training data is available.


The encoder divides the image into tiles, which are then tokenized (Embed in the graph above), and transformers process it. The process marked as Embed is a patch-based method to divide image into tiles, and tokenize the pixel feature map obtained by applying ResNet50 to the input image.

The decoder in DPT converts the output of each resolution of the transformer into an image like representation and uses a convolutional network to generate the segmentation image.

There are three model architectures defined in DPT: ViT-Base, ViT-Large, and ViT-Hybrid. ViT-Base performs patch-based embedding and has 12 transformer layers. ViT-Large performs the same embedding as ViT-Base, but has 24 transformer layers and a larger feature size. ViT-Hybrid performs embedding using ResNet50 and has 12 transformer layers.

DPT accuracy

DPT sets a new state of the art for the semantic segmentation task on ADE20K, a large data set with 150 classes.


It is also the state of the art after some fine-tuning on smaller datasets such as NYUv2, KITTI, and Pascal Context.


Below is a comparison of MiDaS and DPT for depth estimation. DPT is able to predict the depth inmore detail. It can also improve the accuracy of large homogeneous regions and relative positioning within an image, which is a shortcoming of convolution networks.


Below is a comparison for the segmentation task. DPT tends to produce more detailed output at object boundaries, and it tends to produce less cluttered output in some cases.



You can use the following commands to perform segmentation and depth estimation on the input images with ailia SDK.

$ python3 -i input.jpg -s output.png --task=segmentation -e 0$ python3 -i input.jpg -s output.png--task=monodepth -e 0

Here is a result you can expect.

Related topics

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store