DPT: Segmentation Model Using Vision Transformer

David Cochard
axinc-ai
Jun 10, 2021

This is an introduction to "DPT", a machine learning model that can be used with ailia SDK. You can easily use this model to create AI applications using ailia SDK as well as many other ready-to-use ailia MODELS.

Overview

DPT (Dense Prediction Transformers) is a segmentation model released by Intel in March 2021 that applies vision transformers to dense prediction tasks. It achieves 49.02% mIoU on ADE20K for semantic segmentation, and it can also be used for monocular depth estimation, with an improvement of up to 28% in relative performance compared to a state-of-the-art fully convolutional network.

Architecture

In DPT, vision transformers (ViT) are used as the backbone instead of convolutional networks. Transformers make it possible to produce more detailed and globally consistent predictions than convolutional networks. In particular, performance improves when a large amount of training data is available.

Source: https://arxiv.org/pdf/2103.13413

The encoder divides the input image into tiles, tokenizes them (the Embed block in the figure above), and processes the resulting tokens with transformer layers. The Embed step is either patch-based, dividing the image directly into tiles, or hybrid, tokenizing the pixel feature map obtained by applying ResNet50 to the input image.
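
As an illustration, here is a minimal sketch of the patch-based Embed step, assuming a plain ViT backbone rather than the ResNet50 hybrid. The sizes and names below are illustrative and are not taken from the official DPT code.

import torch
import torch.nn as nn

patch_size = 16  # each 16x16 tile becomes one token
embed_dim = 768  # token feature size of ViT-Base

# A strided convolution both cuts the image into non-overlapping tiles
# and linearly projects each tile into a token, in a single operation.
to_tokens = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 384, 384)         # one RGB image
tokens = to_tokens(image)                   # (1, 768, 24, 24)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 576, 768): 576 tokens of size 768
print(tokens.shape)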

The decoder in DPT reassembles the tokens output at several stages of the transformer into image-like feature maps at different resolutions, then fuses them with a convolutional network to generate the final segmentation image.
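
This reassemble idea can be sketched as follows: tokens from one transformer stage are folded back into a 2D map, projected, and resampled so that maps from several stages can be fused convolutionally. This is a simplified sketch under the same assumed dimensions as above; the real DPT decoder also handles a readout token and uses residual fusion blocks.

import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, out_channels = 768, 256
project = nn.Conv2d(embed_dim, out_channels, kernel_size=1)

def reassemble(tokens, grid=24, scale=2.0):
    # tokens: (batch, grid*grid, embed_dim) from one transformer stage
    b, n, d = tokens.shape
    fmap = tokens.transpose(1, 2).reshape(b, d, grid, grid)  # tokens back to a 2D map
    fmap = project(fmap)                                     # 1x1 convolution
    # resample to the resolution this stage contributes to the decoder
    return F.interpolate(fmap, scale_factor=scale, mode="bilinear", align_corners=False)

stage_tokens = torch.randn(1, 24 * 24, embed_dim)
print(reassemble(stage_tokens).shape)  # (1, 256, 48, 48)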

There are three model architectures defined in DPT: ViT-Base, ViT-Large, and ViT-Hybrid. ViT-Base performs patch-based embedding and has 12 transformer layers. ViT-Large performs the same embedding as ViT-Base, but has 24 transformer layers and a larger feature size. ViT-Hybrid performs embedding using ResNet50 and has 12 transformer layers.
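
For reference, the three variants can be summarized as a small configuration table. The layer counts come from the description above; the feature sizes are the standard ViT values and should be treated as assumptions here.

DPT_BACKBONES = {
    "ViT-Base":   {"embedding": "patch",    "layers": 12, "feature_size": 768},
    "ViT-Large":  {"embedding": "patch",    "layers": 24, "feature_size": 1024},
    "ViT-Hybrid": {"embedding": "ResNet50", "layers": 12, "feature_size": 768},
}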

DPT accuracy

DPT sets a new state of the art for the semantic segmentation task on ADE20K, a large dataset with 150 classes.

Source: https://arxiv.org/pdf/2103.13413

It is also the state of the art after some fine-tuning on smaller datasets such as NYUv2, KITTI, and Pascal Context.

Source: https://arxiv.org/pdf/2103.13413

Below is a comparison of MiDaS and DPT for depth estimation. DPT is able to predict the depth in more detail. It also improves accuracy in large homogeneous regions and in the relative arrangement of depth within the image, which are shortcomings of convolutional networks.

Source: https://arxiv.org/pdf/2103.13413

Below is a comparison for the segmentation task. DPT tends to produce more detailed output at object boundaries, and it tends to produce less cluttered output in some cases.

Source: https://arxiv.org/pdf/2103.13413

Usage

You can use the following commands to perform segmentation and depth estimation on the input images with ailia SDK.

$ python3 dense_prediction_transformers.py -i input.jpg -s output.png --task=segmentation -e 0
$ python3 dense_prediction_transformers.py -i input.jpg -s output.png --task=monodepth -e 0

Here is a result you can expect.
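
If you prefer calling the model directly from Python rather than through the sample script, the general pattern with the ailia SDK looks like the sketch below. The ONNX file names and the preprocessing are assumptions; see dense_prediction_transformers.py in ailia MODELS for the exact pipeline.

import ailia
import numpy as np

# env_id=0 selects the same execution environment as "-e 0" above
net = ailia.Net("dpt.onnx.prototxt", "dpt.onnx", env_id=0)  # hypothetical file names

# Dummy normalized RGB input in NCHW layout; real code would load an image
# and resize it to the resolution the model expects (e.g. 384x384).
image = np.random.rand(1, 3, 384, 384).astype(np.float32)
output = net.predict(image)
print(output.shape)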

Related topics

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.


Written by David Cochard

Engineer with 10+ years in game engines & multiplayer backend development. Now focused on machine learning, computer vision, graphics and AR.