Classification of transformers for solving various computer vision problems

Artemy Malkov, PhD
Product AI
Published in
2 min read · Aug 12, 2021

In recent years, transformer models have demonstrated strong performance across a wide range of language tasks, such as text classification and machine translation. The most popular implementations include BERT, GPT (versions 1–3), RoBERTa, and T5.

Transformer architectures are built on a self-attention mechanism, which learns the relationships between the elements of a sequence. Unlike recurrent networks, which process a sequence element by element and struggle to retain long-range context, transformers attend to the entire sequence at once, capturing long-distance relationships, and are easy to parallelize.
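To make the idea concrete, here is a minimal sketch of scaled dot-product self-attention. The function name, toy dimensions, and random weights are illustrative assumptions, not taken from the article:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence.

    x:             (seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    """
    q = x @ w_q  # queries
    k = x @ w_k  # keys
    v = x @ w_v  # values
    d_k = q.shape[-1]
    # Every position attends to every other position, so long-range
    # relationships are captured in a single, parallelizable step.
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Toy usage: a "sequence" of 4 tokens with 8-dimensional embeddings.
torch.manual_seed(0)
x = torch.randn(4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([4, 8])
```

In a full transformer block this operation is repeated across several heads and stacked with feed-forward layers, but the core mechanism is just the weighted mixing shown above.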

An important feature of these models is that they scale to very high model complexity and very large datasets.

The impact of transformer models has become even more evident with this scalability. For example, BERT-large has 340 million parameters, GPT-3 has 175 billion, and the recent Switch Transformer, a mixture-of-experts model, scales to 1.6 trillion parameters.

Transformers’ breakthrough in natural language processing (NLP) has generated a lot of interest in the computer vision community as well. Visual data has spatial and temporal structure, which calls for new network designs and training schemes. As a result, transformers and their variants have been successfully applied to image recognition, object detection, segmentation, super-resolution, video understanding, image generation, text synthesis from images, and more.

The slide below shows a classification of transformers for various computer vision tasks, example networks, and the accuracy currently achieved on public recognition benchmarks.

Original article written by Rinat S.

https://medium.com/@rinats
