An Overview on MobileNet: An Efficient Mobile Vision CNN

Srudeep PA
5 min read · Jun 10, 2020


MobileNet is a simple, efficient, and not very computationally intensive convolutional neural network for mobile vision applications. It is widely used in many real-world applications, including object detection, fine-grained classification, face attributes, and localization. In this article, I will give you an overview of MobileNet and explain how exactly it becomes such an efficient and lightweight neural network.

Application of MobileNet

Before moving further here is the reference research paper: https://arxiv.org/abs/1704.04861

What’s in the index:

  1. Depth-wise separable convolution
    — 1.1 Depth-wise convolution
    — 1.2 Point-wise convolution
  2. The entire Network Structure
  3. Parameters of MobileNet
    — 3.1 Width Multiplier
    — 3.2 Resolution Multiplier
  4. Comparison of MobileNet with Popular Models

1. Depth-wise separable convolution

The depth-wise separable convolution comprises two layers: the depth-wise convolution and the point-wise convolution. Basically, the first layer filters the input channels and the second layer combines them to create new features.

1.1 Depth-wise convolution

A depth-wise convolution applies a single filter to each input channel. This is different from a standard convolution, in which each filter is applied across all of the input channels.

Let’s take a standard convolution,

Standard Convolution

From the above image, the computational cost can be calculated as:

Standard convolution cost = DK × DK × M × N × DF × DF

Where DF is the spatial dimension of the input feature map and DK is the size of the convolution kernel. Here M and N are the numbers of input and output channels respectively.

For a standard convolution, the computational cost depends multiplicatively on the number of input and output channels and on the spatial dimensions of the input feature map and convolution kernel.
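The cost formula above is easy to sanity-check in code. Below is a minimal sketch (the function name and the example layer shape are my own, chosen for illustration):

```python
def standard_conv_cost(df, dk, m, n):
    """Multiply-adds of a standard convolution:
    DK * DK * M * N * DF * DF."""
    return dk * dk * m * n * df * df

# e.g. a 3x3 convolution over a 14x14 feature map, 512 -> 512 channels
print(standard_conv_cost(df=14, dk=3, m=512, n=512))  # 462422016
```

Note how every factor enters multiplicatively: doubling either the channel counts or the spatial resolution multiplies the cost accordingly.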

In the case of depthwise convolution, as seen in the image below, we have an input feature map of dimension DF × DF × M and M kernels of size DK × DK, each with a channel depth of 1.

Depth-wise Convolution

As per the above image, the total computational cost can be calculated as:

Depth-wise convolution cost = DK × DK × M × DF × DF

However, this step only filters the input channels; it does not combine them.

1.2 Point-wise Convolution

Since the depthwise convolution only filters the input channels, it does not combine them to produce new features. So an additional layer, called the pointwise convolution, is added; it computes a linear combination of the depthwise convolution's outputs using a 1 × 1 convolution.

Point-wise Convolution

As per the image, let's calculate the computational cost again:

Point-wise convolution cost = M × N × DF × DF

So the total computational cost of Depthwise separable convolutions can be calculated as:

Depthwise separable convolution cost = DK × DK × M × DF × DF + M × N × DF × DF
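The two-term cost above can be sketched the same way (again, the function name and example shape are my own):

```python
def depthwise_separable_cost(df, dk, m, n):
    """Multiply-adds of a depthwise separable convolution:
    DK*DK*M*DF*DF   (depthwise: one DKxDK filter per input channel)
    + M*N*DF*DF     (pointwise: 1x1 conv combining the channels)."""
    depthwise = dk * dk * m * df * df
    pointwise = m * n * df * df
    return depthwise + pointwise

# same example layer as before: 14x14 feature map, 512 -> 512 channels
print(depthwise_separable_cost(df=14, dk=3, m=512, n=512))  # 52283392
```

Compare this with the roughly 462 million mult-adds of the equivalent standard convolution: the pointwise term dominates, and the expensive DK × DK filtering now touches each channel only once.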

Comparing it with the computational cost of standard convolution, we get a reduction in computation, which can be expressed as:

(DK × DK × M × DF × DF + M × N × DF × DF) / (DK × DK × M × N × DF × DF) = 1/N + 1/DK²

To put the effectiveness of depthwise separable convolution in perspective, let's take an example.

Let's take N = 1024 and DK = 3, and plug the values into the equation.

We get 1/1024 + 1/9 ≈ 0.112, or in other words, standard convolution performs roughly 9 times as many multiplications as depthwise separable convolution.
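The worked example above can be reproduced with a one-line helper (a sketch; the name is my own):

```python
def reduction_ratio(dk, n):
    # cost(depthwise separable) / cost(standard) = 1/N + 1/DK^2
    return 1 / n + 1 / (dk * dk)

r = reduction_ratio(dk=3, n=1024)
print(round(r, 3))  # 0.112
```

Since the 1/N term is tiny for wide layers, the ratio is essentially 1/DK², i.e. an 8x-9x saving for the usual 3 × 3 kernels.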

2. The entire Network Structure

Below is the architecture table of MobileNet.

Network Architecture: MobileNet
Left: Standard Convolution followed by batch normalization and RELU. Right: Depthwise convolution layer and pointwise convolution layer, each followed by batch normalization and RELU.

From the above image, we can see that every convolution layer is followed by batch normalization and a ReLU. Also, a final average pooling is introduced just before the fully connected layer to reduce the spatial dimension to 1.

Note that the above architecture has 28 layers, counting depthwise and pointwise convolutions as separate layers.
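The 28-layer count is easy to verify from the architecture table: one full convolution, then 13 depthwise separable blocks (each contributing two layers), then the fully connected classifier. A quick sketch (the layer names here are my own shorthand, not from the paper):

```python
# MobileNet body: one standard 3x3 conv, then 13 depthwise separable
# blocks; each block counts as 2 layers (depthwise + pointwise).
layers = ["conv_3x3"]
for i in range(13):
    layers += [f"dw_conv_{i}", f"pw_conv_{i}"]
layers.append("fc")  # average pooling (not counted) then fully connected

print(len(layers))  # 28
```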

3. Parameters of MobileNet

Although the base MobileNet architecture is already small and not very computationally intensive, it has two global hyperparameters to reduce the computational cost even further: the width multiplier and the resolution multiplier.

3.1 Width Multiplier: Thinner Models

For further reduction of computational cost, the authors introduced a simple parameter called the width multiplier, also referred to as α.

For each layer, the width multiplier α is multiplied with the input and output channel counts (M and N) in order to thin the network.

So the computational cost with the width multiplier becomes:

DK × DK × αM × DF × DF + αM × αN × DF × DF

Here α ranges from 0 to 1, with typical values of 1, 0.75, 0.5, and 0.25. When α = 1 we have the baseline MobileNet, and when α < 1 we get a reduced MobileNet. The width multiplier has the effect of reducing computational cost by roughly α².
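The α² effect can be checked numerically; here is a sketch (the function name and example shape are my own):

```python
def width_mult_cost(df, dk, m, n, alpha=1.0):
    """Depthwise separable cost with width multiplier alpha
    applied to the input and output channel counts."""
    am, an = alpha * m, alpha * n
    return dk * dk * am * df * df + am * an * df * df

base = width_mult_cost(14, 3, 512, 512, alpha=1.0)
half = width_mult_cost(14, 3, 512, 512, alpha=0.5)
print(half / base)  # close to 0.25 (= alpha^2) when pointwise dominates
```

The depthwise term only shrinks by α, so the reduction is α² only approximately, but the pointwise term dominates in practice.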

3.2 Resolution Multiplier: Reduced Representation

The resolution multiplier, also known as ρ, is the second parameter for effectively reducing the computational cost.

For a given layer, the resolution multiplier ρ is applied to the input feature map's resolution. Now we can express the computational cost with both the width multiplier and the resolution multiplier as:

DK × DK × αM × ρDF × ρDF + αM × αN × ρDF × ρDF
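Putting both multipliers together, the combined saving is roughly α² × ρ². A sketch (names and example shape are my own):

```python
def mobilenet_cost(df, dk, m, n, alpha=1.0, rho=1.0):
    """Depthwise separable cost with width multiplier alpha and
    resolution multiplier rho (rho scales the feature-map size)."""
    am, an, rdf = alpha * m, alpha * n, rho * df
    return dk * dk * am * rdf * rdf + am * an * rdf * rdf

full = mobilenet_cost(14, 3, 512, 512)
small = mobilenet_cost(14, 3, 512, 512, alpha=0.5, rho=0.5)
print(small / full)  # roughly alpha^2 * rho^2 = 0.0625 here
```

Because ρ scales a quantity that appears squared, halving the input resolution alone cuts the cost by about 4x, on top of whatever α saves.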

4. Comparison of MobileNet with Popular Models

The below image shows that MobileNet holds its own against famous state-of-the-art models like GoogleNet and VGG 16 while using far fewer parameters.

MobileNet vs other State-Of-Art Models

The below image shows the difference between the depthwise separable model and the standard convolution model.

Depthwise Convolution vs Standard convolution

It is evident that the ImageNet accuracy is only about 1% lower than that of the standard convolution model, but with far fewer mult-adds and parameters.

To Conclude,

MobileNet not only competes with other state-of-the-art models, it does so with a much smaller and lighter network, and the depthwise separable convolution is what makes that possible.
