Understanding Deep Self-attention Mechanism in Convolution Neural Networks

Limitations and improvements of encoder-decoder architectures in computer vision

Shuchen Du
AI Salon

Convolutional neural networks (CNNs) are widely used in deep learning and computer vision algorithms. Even though many CNN-based algorithms meet industry standards and can be embedded in commercial products, the standard CNN is still limited and can be improved in many aspects. This post uses semantic segmentation and the encoder-decoder architecture as examples to clarify those limitations and to explain why the self-attention mechanism can help mitigate them.

Limitations of standard encoder-decoder architecture

Fig. 1: a standard encoder-decoder architecture

The encoder-decoder architecture (Fig. 1) is a standard method in many computer vision tasks, especially pixel-level prediction tasks such as semantic segmentation, depth prediction, and some GAN-based image generators. In an encoder-decoder network, an input image is convolved, passed through ReLU activations, and pooled down to a latent representation, which is then recovered to an output image of the same size as the input. The architecture is symmetric and…
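As a rough illustration of this convolve-ReLU-pool encoder followed by a symmetric upsampling decoder, here is a minimal PyTorch sketch. The layer widths, depths, and kernel sizes are arbitrary choices for illustration, not taken from any particular paper:

import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Minimal symmetric encoder-decoder for pixel-level prediction."""
    def __init__(self, in_channels=3, out_channels=1):
        super().__init__()
        # Encoder: each stage convolves, applies ReLU, then pools (halving resolution).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # H/2 x W/2
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # H/4 x W/4 -> latent representation
        )
        # Decoder: transposed convolutions upsample back to the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2), nn.ReLU(),  # H/2 x W/2
            nn.ConvTranspose2d(32, out_channels, kernel_size=2, stride=2),   # H x W
        )

    def forward(self, x):
        latent = self.encoder(x)
        return self.decoder(latent)

# Usage: the output has the same spatial size as the input.
model = EncoderDecoder()
x = torch.randn(1, 3, 64, 64)
print(model(x).shape)  # torch.Size([1, 1, 64, 64])

Note that every output pixel in this sketch depends only on a local receptive field of the input; that locality is exactly the limitation the rest of this post discusses.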
