A Look at MobileNetV2: Inverted Residuals and Linear Bottlenecks

Luis Gonzales
7 min read · Nov 3, 2019


MobileNetV2 [2] introduces a new CNN layer, the inverted residual and linear bottleneck layer, enabling high accuracy/performance in mobile and embedded vision applications. The new layer builds on the depth-wise separable convolutions introduced in MobileNetV1 [1]. The MobileNetV2 network is built around this new layer and can be adapted to perform object classification and detection, and semantic segmentation.

Contents

  • Overview of Standard 2D Convolution
  • Depth-wise Separable Convolutions
  • Inverted Residual and Linear Bottleneck Layer
  • Model Architecture

Overview of Standard 2D Convolution

Before diving into the mechanics of the depth-wise separable convolution, let’s review the standard 2D convolution. Suppose a convolution operation transforms an input volume of dimensions Dᵤ x Dᵤ x M to an output volume of dimensions Dᵥ x Dᵥ x N, as shown in Fig. 1(a). Specifically, we require N filters, each of dimension Dᵣ x Dᵣ x M, as shown in Fig. 1(b).

Fig. 1: Standard 2D convolution shown in (a) with filters shown in (b)
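As a quick sanity check on these dimensions, here is a minimal sketch, assuming TensorFlow/Keras; the concrete values (Dᵤ = 7, M = 3, Dᵣ = 3, N = 128) are illustrative only.

```python
# Shape check for the standard 2D convolution described above (TensorFlow/Keras).
import tensorflow as tf

Du, M, Dr, N = 7, 3, 3, 128                 # input size, input channels, kernel size, filters
x = tf.random.normal((1, Du, Du, M))        # input volume: Du x Du x M
conv = tf.keras.layers.Conv2D(filters=N, kernel_size=Dr, padding="valid")
y = conv(x)

print(y.shape)          # (1, 5, 5, 128) -> Dv x Dv x N, with Dv = Du - Dr + 1
print(conv.kernel.shape)  # (3, 3, 3, 128) -> N filters, each of dimension Dr x Dr x M
```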

Interpreting the mappings in the depth-wise separable convolution is easier after first gaining an intuition for the standard 2D convolution. There are at least two possible interpretations of the standard convolution, both of which are helpful to understand.

Perhaps the more functionally correct interpretation of 2D convolution is portrayed in Fig. 2. A filter (shown in orange) is stepped along the spatial dimensions of an input volume (shown in blue). At every step, an inner product is taken between the overlapping regions of the input volume and the filter (shown in light red). In practice, the overlapping portions of both the filter and the input volume are vectorized and the dot product is taken between the two resulting vectors. Either way, a single value is computed, as shown by the red element in Fig. 2. Although functionally correct, this interpretation obscures the spatial filtering that occurs within the convolution.

Fig. 2: Functional interpretation of 2D convolution (source)
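A minimal NumPy sketch of this view: one output value is produced by flattening the overlapping patch and the filter and taking their dot product. The shapes and the chosen spatial step are illustrative only.

```python
# "Vectorize and dot product" view of a single convolution step (NumPy).
import numpy as np

Dr, M = 3, 3
volume = np.random.randn(7, 7, M)           # input volume
filt = np.random.randn(Dr, Dr, M)           # one filter spanning the full depth

i, j = 2, 4                                 # an arbitrary spatial step
patch = volume[i:i + Dr, j:j + Dr, :]       # overlapping region of the input

# Both the patch and the filter are flattened; a single dot product
# produces one scalar of the output feature map.
value = patch.reshape(-1) @ filt.reshape(-1)
print(value)
```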

The other interpretation of the standard 2D convolution places more emphasis on the spatial filtering that takes place. As in the interpretation above, a filter, shown in Fig. 3(b), is stepped along the spatial dimensions of an input volume, shown in Fig. 3(a). But rather than the inner product spanning the depth dimension, the inner product is taken on a per-channel basis. In other words, channel i of the input volume is convolved with channel i of the filter (a single-channel convolution), where i indexes along the depth dimension. The resulting volume is shown in Fig. 3(c). Finally, all resulting values are summed, producing a single value (not shown). Again, note that this interpretation emphasizes the spatial filtering and that, with N filters, each and every channel of the input volume is filtered N times, which begins to seem rather excessive.

Fig. 3: Input volume (a) and filter (b) are convolved on a per-channel basis, resulting in (c) (source)
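The following sketch illustrates this per-channel view at a single spatial step and confirms it agrees with the inner-product view above; shapes are again illustrative.

```python
# Per-channel view from Fig. 3 (NumPy): filter each input channel with the
# matching filter channel, then sum everything.
import numpy as np

Dr, M = 3, 3
patch = np.random.randn(Dr, Dr, M)          # overlapping region at one spatial step
filt = np.random.randn(Dr, Dr, M)

# Per-channel results (the intermediate volume in Fig. 3(c)), then a final sum.
per_channel = np.array([np.sum(patch[:, :, c] * filt[:, :, c]) for c in range(M)])
value = per_channel.sum()

# Identical to the single inner product from the first interpretation.
assert np.isclose(value, patch.reshape(-1) @ filt.reshape(-1))
```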

Depth-wise Separable Convolutions

Depth-wise separable convolutions were introduced in MobileNetV1 and are a type of factorized convolution that reduces computational cost compared to standard convolutions. The new MobileNetV2 layer incorporates depth-wise separable convolutions, so it's worth reviewing them.

Although the standard convolution performs both spatial filtering and linear combinations, the two stages cannot be decomposed or factorized apart. In contrast, the depth-wise separable convolution is explicitly structured around such a factorization. As before, suppose an input volume of Dᵤ x Dᵤ x M is transformed to an output volume of Dᵥ x Dᵥ x N, as shown in Fig. 4(a). The first set of filters, shown in Fig. 4(b), consists of M single-channel filters, mapping the input volume to Dᵥ x Dᵥ x M on a per-channel basis. This stage, known as the depth-wise convolution, produces a tensor resembling the intermediate result shown in Fig. 3(c) and achieves the spatial filtering component. In order to construct new features from those already captured by the input volume, we require a linear combination. To do so, 1x1 kernels are applied along the depth of the intermediate tensor, as shown in Fig. 4(c); this step is referred to as the point-wise convolution. N such 1x1 filters are used, resulting in the desired output volume of Dᵥ x Dᵥ x N.

Fig. 4: Desired input and output volumes shown in (a) obtained by applying two different sets of filters shown in (b) and (c)

The steps above are summarized in the example below, with an input volume of 7 x 7 x 3 and an output volume of 5 x 5 x 128.

Fig. 5: Concrete example of depth-wise separable convolutions (source)
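As a sketch of this concrete example, assuming TensorFlow/Keras, the two stages can be written with a DepthwiseConv2D followed by a 1x1 Conv2D:

```python
# Depth-wise separable convolution for the 7x7x3 -> 5x5x128 example (TensorFlow/Keras).
import tensorflow as tf

x = tf.random.normal((1, 7, 7, 3))

depthwise = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding="valid")  # M single-channel 3x3 filters
pointwise = tf.keras.layers.Conv2D(filters=128, kernel_size=1)               # N point-wise (1x1) filters

intermediate = depthwise(x)     # (1, 5, 5, 3): spatial filtering, per channel
y = pointwise(intermediate)     # (1, 5, 5, 128): new features via linear combinations
print(intermediate.shape, y.shape)
```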

The lowered computational cost of depth-wise separable convolutions comes predominantly from limiting the spatial filtering from M*N times in standard convolutions to M times. Standard convolution has a computational cost on the order of

Dᵣ² · M · N · Dᵥ²

while depth-wise separable convolution has a computational cost on the order of

Dᵣ² · M · Dᵥ² + M · N · Dᵥ²

Taking the ratio between the cost of depth-wise separable and standard convolution gives 1/N + 1/Dᵣ². N will often be greater than Dᵣ² in practical applications, particularly deeper in a network, so the ratio can be approximated by 1/Dᵣ². For reference, with 3x3 kernels, depth-wise separable convolutions enjoy nearly an order of magnitude fewer computations than standard convolutions.
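A quick back-of-the-envelope check of these costs, using illustrative dimensions only:

```python
# Multiply-accumulate counts for standard vs. depth-wise separable convolution.
Dr, Dv, M, N = 3, 5, 3, 128                 # illustrative values

standard = Dr**2 * M * N * Dv**2            # Dr^2 * M * N * Dv^2
separable = Dr**2 * M * Dv**2 + M * N * Dv**2   # depth-wise + point-wise

print(separable / standard)   # equals 1/N + 1/Dr^2 ~= 0.119
print(1 / N + 1 / Dr**2)
```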

Inverted Residual and Linear Bottleneck Layer

The premise of the inverted residual layer is that a) feature maps of interest can be encoded in low-dimensional subspaces and b) non-linear activations, despite their ability to increase representational complexity, result in a loss of information. These principles guide the design of the new convolutional layer.

The layer takes in a low-dimensional tensor with k channels and performs three separate convolutions. First, a point-wise (1x1) convolution is used to expand the low-dimensional input feature map to a higher-dimensional space suited to non-linear activations, and ReLU6 is applied. The expansion factor is referred to as t throughout the paper, leading to tk channels after this first step. Next, a depth-wise convolution is performed using 3x3 kernels, followed by ReLU6, achieving spatial filtering of the higher-dimensional tensor. Finally, the spatially-filtered feature map is projected back to a low-dimensional subspace using another point-wise convolution. The projection alone results in a loss of information, so, intuitively, it's important that the activation in this last step be linear (see below for an empirical justification). When the initial and final feature maps have the same dimensions (i.e., when the depth-wise convolution stride equals one and the input and output channel counts are equal), a residual connection is added to aid gradient flow during backpropagation. Note that the final two steps are essentially a depth-wise separable convolution with the added requirement of dimensionality reduction.

Fig. 6: Visualization of the intermediate feature maps in the inverted residual layer (source)
Table 1: Alternative representation of the inverted residual layer (source)

The authors stress two points throughout the paper: 1) the final 1x1 convolution that maps back to low-dimensional space should be followed by a linear activation and 2) the residual connections should be made between the low-dimensional feature maps. Empirical results for both are shown below in Fig. 7.

Fig. 7: The impact of non-linearities and various types of residual connections (source)

Fig. 8 compares the conventional residual block with the newly-proposed inverted residual block. The emphasis is on the fact that the inverted residual block sees low-dimensional feature maps at its input and output and that these feature maps are the result of linear activations. Additionally, non-linear activations are applied only to volumes that live in the intermediate, higher-dimensional subspaces.

Fig. 8: Comparison between the conventional residual layer and the inverted residual layer (source)

Last but not least, below is the inverted residual layer implemented as a Keras custom layer.
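The snippet below is a minimal sketch of such a layer, assuming TensorFlow 2.x Keras; the constructor arguments (out_channels, expansion, stride) are illustrative names rather than those of any particular codebase.

```python
# Inverted residual layer as a Keras custom layer (sketch).
import tensorflow as tf


class InvertedResidual(tf.keras.layers.Layer):
    def __init__(self, out_channels, expansion=6, stride=1, **kwargs):
        super().__init__(**kwargs)
        self.out_channels = out_channels
        self.expansion = expansion
        self.stride = stride

    def build(self, input_shape):
        in_channels = input_shape[-1]
        # Residual connection only when spatial and channel dimensions are preserved.
        self.use_residual = self.stride == 1 and in_channels == self.out_channels
        # 1x1 expansion to t*k channels (followed by ReLU6 in call()).
        self.expand = tf.keras.layers.Conv2D(self.expansion * in_channels, 1, use_bias=False)
        self.bn0 = tf.keras.layers.BatchNormalization()
        # 3x3 depth-wise convolution for spatial filtering (followed by ReLU6).
        self.depthwise = tf.keras.layers.DepthwiseConv2D(3, strides=self.stride,
                                                         padding="same", use_bias=False)
        self.bn1 = tf.keras.layers.BatchNormalization()
        # 1x1 projection back to a low-dimensional space, with linear activation.
        self.project = tf.keras.layers.Conv2D(self.out_channels, 1, use_bias=False)
        self.bn2 = tf.keras.layers.BatchNormalization()

    def call(self, inputs, training=False):
        x = tf.nn.relu6(self.bn0(self.expand(inputs), training=training))
        x = tf.nn.relu6(self.bn1(self.depthwise(x), training=training))
        x = self.bn2(self.project(x), training=training)   # no non-linearity here
        if self.use_residual:
            x = x + inputs                                  # residual between low-dim tensors
        return x
```

For example, `InvertedResidual(out_channels=16, expansion=6, stride=1)` applied to a tensor of shape (1, 56, 56, 16) returns a tensor of the same shape, with the residual connection active.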

Model Architecture

The MobileNetV2 network is predominantly built from the inverted residual layer introduced in the paper (referred to as “bottleneck” in Fig. 9), as shown below. Additionally, by using the network as a feature extractor, the model can be extended to perform object detection and semantic segmentation.

Fig. 9: MobileNetV2 network (source)
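As a sketch of the feature-extractor use case, assuming the tf.keras.applications implementation of MobileNetV2; the 224x224 input size and 10-class head are illustrative choices.

```python
# MobileNetV2 backbone as a frozen feature extractor with a small classification head.
import tensorflow as tf

backbone = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                             include_top=False,
                                             weights="imagenet")
backbone.trainable = False   # reuse the pretrained features

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```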

References

[1] A. Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," 2017

[2] M. Sandler et al., "MobileNetV2: Inverted Residuals and Linear Bottlenecks," 2018

[3] A. Howard et al., "Searching for MobileNetV3," May 2019

[4] K. Bai, "A Comprehensive Introduction to Different Types of Convolutions in Deep Learning," Feb. 2019
