Explained: MobileNet V1

Zain Ehtesham
4 min read · Oct 23, 2022


The MobileNet V1 architecture is one of the most widely used deep network architectures for computer vision applications. If, like me, you have used a model based on this architecture alongside other models, then you probably know how fast and efficient it is on edge devices (mobile phones, dev boards such as the Raspberry Pi, etc.). But what exactly makes MobileNet V1 so fast and efficient? That is exactly the question this article answers.

To understand the basic building block of the MobileNet V1 architecture, there is one prerequisite: you should know how convolutional layers work. In case you are not too sure about that, here is a playlist by Andrew Ng that covers enough of the basics.

The main factor affecting the speed of a deep learning model is its “computational cost”: in simple terms, the number of multiplications needed to make a prediction. For convolutional layers this cost is significant. If we can somehow reduce it, our model becomes faster and more efficient, and that is exactly what MobileNet does: it reduces the number of multiplications needed to pass through a convolutional layer. To understand how MobileNet does that, let us look at an example.

Let's say there is an input image of dimensions 5x5x3, and we need to convolve it with 16 filters, each with dimensions 3x3x3.

Input image being convolved by 16 filters and producing an output of 3x3x16

Each filter is applied to all of the input channels at once. The calculation of the output dimensions is beyond the scope of this article, but you can read about it here.

The computational cost for a convolutional operation is calculated by the following formula,

computational_cost = filter_height * filter_width * num_of_input_channels * num_of_filters * output_height * output_width

For the convolutional operation shown in the image, the computational cost can be given as follows,

computational cost = 3x3x3x16x3x3 = 3,888 multiplications.

That’s the number of multiplications required to perform the convolutional operation shown in the image.
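As a quick sketch, the formula above can be wrapped in a small helper function (the function and parameter names here are my own, not from the MobileNet paper):

```python
def standard_conv_cost(filter_h, filter_w, in_channels, num_filters, out_h, out_w):
    """Multiplications needed for one standard convolution layer."""
    return filter_h * filter_w * in_channels * num_filters * out_h * out_w

# The example from the article: 16 filters of 3x3x3, producing a 3x3 output.
print(standard_conv_cost(3, 3, 3, 16, 3, 3))  # 3888
```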

MobileNet reduces the computational cost by using a concept known as Depth-wise separable convolutions.

Depth-wise Separable Convolution:

Depth-wise separable convolution consists of the following two steps:

  1. Depth-wise convolution
  2. Pointwise convolution

In depth-wise convolution, unlike normal convolution, each filter is reserved for one channel of the input image rather than all of them. This means each input channel is convolved with its own single filter, and the per-channel outputs are concatenated at the end to give the final output.

If we continue with our example, then during depth-wise convolution the 5x5x3 input image will be convolved with three 3x3 filters, one for each input channel, as shown below.

Depth-wise convolution performed on the input image that yields an output with dimensions 3x3x3
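To make the per-channel pairing concrete, here is a naive NumPy sketch of depth-wise convolution (valid padding, stride 1; written with explicit loops for clarity, not speed):

```python
import numpy as np

def depthwise_conv(image, filters):
    """Depth-wise convolution: channel c of the image is convolved
    only with filter c, and the per-channel outputs are stacked."""
    h, w, channels = image.shape
    fh, fw, _ = filters.shape
    out_h, out_w = h - fh + 1, w - fw + 1  # valid padding, stride 1
    out = np.zeros((out_h, out_w, channels))
    for c in range(channels):
        for i in range(out_h):
            for j in range(out_w):
                out[i, j, c] = np.sum(image[i:i+fh, j:j+fw, c] * filters[:, :, c])
    return out

image = np.random.rand(5, 5, 3)    # 5x5x3 input from the example
filters = np.random.rand(3, 3, 3)  # one 3x3 filter per input channel
print(depthwise_conv(image, filters).shape)  # (3, 3, 3)
```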

The formula to calculate the computational cost for a depth-wise convolution operation is a bit different,

computational_cost = filter_height * filter_width * num_of_input_channels * output_height * output_width

We can use the above formula to calculate the computational cost for the depth-wise convolution illustrated in the image above,

computational cost = 3x3x3x3x3 = 243 multiplications.
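The same helper-function style as before works here too (again, these names are my own illustration, not from the paper); note that the num_of_filters term from the standard formula disappears, because there is exactly one filter per input channel:

```python
def depthwise_conv_cost(filter_h, filter_w, in_channels, out_h, out_w):
    """Multiplications for a depth-wise convolution: one filter per channel."""
    return filter_h * filter_w * in_channels * out_h * out_w

# The example from the article: three 3x3 filters over a 3x3 output.
print(depthwise_conv_cost(3, 3, 3, 3, 3))  # 243
```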

But there is a problem: the standard convolutional layer had an output of dimensions 3x3x16, whereas depth-wise convolution gives us an output of dimensions 3x3x3.

This is where pointwise convolution comes in. Pointwise convolution is simply a convolutional operation with filters of spatial dimensions 1x1 (each spanning all input channels). So, if we use 16 such 1x1 filters, we get our desired output dimensions (i.e., 3x3x16). If you didn’t get the last part, refer here, then read the story again :)

Applying pointwise convolution to the output from depth-wise convolution.
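Because each filter is 1x1 spatially, pointwise convolution reduces to a dot product across channels at every spatial position, which is just a matrix multiplication. A NumPy sketch (names are my own):

```python
import numpy as np

def pointwise_conv(feature_map, filters):
    """1x1 convolution: at every spatial position, take a dot product
    across channels, producing one value per filter."""
    h, w, in_channels = feature_map.shape
    num_filters = filters.shape[1]  # filters: (in_channels, num_filters)
    flat = feature_map.reshape(h * w, in_channels)
    return flat.dot(filters).reshape(h, w, num_filters)

depthwise_out = np.random.rand(3, 3, 3)  # output of the depth-wise step
filters = np.random.rand(3, 16)          # 16 filters of size 1x1x3
print(pointwise_conv(depthwise_out, filters).shape)  # (3, 3, 16)
```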

We will use the same formula as that used for standard convolution to calculate the computational cost of pointwise convolution,

computational cost = 1x1x3x16x3x3 = 432 multiplications

Now the total computational cost for the depth-wise separable convolution applied to our example will be,

total computational cost = 243+432 = 675 multiplications.
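Putting the arithmetic from the example together in one place:

```python
standard = 3 * 3 * 3 * 16 * 3 * 3   # standard convolution: 3888 multiplications
depthwise = 3 * 3 * 3 * 3 * 3       # depth-wise step:      243
pointwise = 1 * 1 * 3 * 16 * 3 * 3  # pointwise step:       432
separable = depthwise + pointwise   # depth-wise separable: 675

reduction = (1 - separable / standard) * 100
print(f"{reduction:.2f}%")  # 82.64%
```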

This shows that by applying depth-wise separable convolution rather than standard convolution, we can reduce the computational cost by 82.64% (that’s A LOT!).

So, Depth-wise separable convolutions are responsible for making MobileNet V1 architecture so efficient and fast.

For more similar content subscribe to my email list and follow me on LinkedIn.
