Creating MobileNets with TensorFlow from scratch

Sumeet Badgujar
5 min read · Jul 17, 2021



MobileNet (Efficient Convolutional Neural Networks for Mobile Vision Applications) is an architecture that focuses on making deep learning networks very small and low-latency, so that the models can be easily deployed on mobile and embedded edge devices. MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks. Instead of the usual convolution, MobileNet uses a depthwise separable convolution to reduce computation time and parameter count: it applies a convolution to each channel of the image separately rather than to all channels as one block n times, and then uses a 1x1 convolution to produce n output filters.

MobileNet paper link: https://arxiv.org/pdf/1704.04861.pdf
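
Before building the network, here is a minimal shape check (an illustrative snippet of my own, assuming TensorFlow 2.x) showing the two steps in action: the depthwise convolution keeps the channel count, and the 1x1 pointwise convolution then maps it to n filters.

import tensorflow as tf

x = tf.random.normal((1, 12, 12, 3))        # a batch of one 12x12 RGB image
dw = tf.keras.layers.DepthwiseConv2D(5)(x)  # one 5x5 kernel per input channel
pw = tf.keras.layers.Conv2D(256, 1)(dw)     # 256 pointwise (1x1) filters
print(dw.shape, pw.shape)                   # (1, 8, 8, 3) (1, 8, 8, 256)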

Next, we look at what these blocks and layers look like, and how to implement them in Python.

Figure 1: The MobileNet frame (Source: Original MobileNet paper)

MobileNet starts with a basic 2D convolution layer. Then comes a series of depthwise separable convolution blocks attached one after another, with varying strides and filter counts.

Defining the convolutional block: each convolution in the network is followed by the same sequence, a BatchNormalization layer and then a ReLU activation, before being passed to the next block.

The first convolution block has 32 filters of kernel size (3x3) and a stride of 2, and as just described, it is followed by a BatchNormalization layer and a ReLU activation. This block can be written as the following code.

input = Input(input_shape)
x = Conv2D(32, 3, strides=(2, 2), padding='same', use_bias=False)(input)
x = BatchNormalization()(x)
x = ReLU()(x)

Then comes the main ingredient of the MobileNet architecture: the depthwise separable convolution layer. It is done in two steps: a depthwise convolution followed by a pointwise convolution.

Figure 2: Depthwise and Pointwise Convolution visualisation (Source:https://penseeartificielle.fr/)

In a normal convolution, each kernel has the same channel depth as the image and is applied across all channels as one block; in a depthwise convolution, a separate kernel is applied to each channel. A pointwise convolution, i.e. a 1x1 convolution, is then applied to the stacked output channels. The pointwise convolution is applied n times, which is computationally far cheaper than performing n full transformations on the image. To implement this, we can write the following functions.

def depth_block(x, strides):
    x = DepthwiseConv2D(3, strides=strides, padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    return x

def single_conv_block(x, filters):
    x = Conv2D(filters, 1, use_bias=False)(x)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    return x

Alright, but what’s the point of creating a depthwise separable convolution?

Suppose we have an image of size 12x12x3 to be transformed into an 8x8x256 output.

Let’s calculate the number of multiplications the computer has to do in the original convolution. There are 256 5x5x3 kernels that move 8x8 times. That’s 256x3x5x5x8x8=1,228,800 multiplications.

What about the separable convolution? In the depthwise convolution, we have 3 5x5x1 kernels that move 8x8 times (3 because of the 3 channels). That’s 3x5x5x8x8 = 4,800 multiplications. In the pointwise convolution, we have 256 1x1x3 kernels that move 8x8 times. That’s 256x1x1x3x8x8=49,152 multiplications. Adding them up together, that’s 53,952 multiplications.

53,952 is a lot less than 1,228,800. With fewer computations, the network is able to process more in a shorter amount of time.
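
These counts are easy to verify with a few lines of Python (a quick sketch of my own; the variable names are purely illustrative). The ratio also matches the reduction factor of 1/N + 1/D_K^2 that the paper derives for N output filters and a D_K x D_K kernel.

k, c_in, c_out, out_hw = 5, 3, 256, 8                # kernel size, channels in/out, output side

standard  = c_out * c_in * k * k * out_hw * out_hw   # 1,228,800
depthwise = c_in * k * k * out_hw * out_hw           # 4,800
pointwise = c_out * c_in * out_hw * out_hw           # 49,152
separable = depthwise + pointwise                    # 53,952

print(separable / standard)                          # ~0.0439, i.e. 1/c_out + 1/k**2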

Combination Layer:

Figure 3: The combination block (Source:Original MobileNet Paper)

The depthwise convolution and pointwise convolution functions are called together through a combination layer function. If you look closely at the architecture you will notice a pattern: for each filter count n, a depthwise separable block with stride 2 first reduces the spatial size, and it is followed by a block with the same filter count and stride 1.

Figure 4: Edited image of architecture from Paper
def combo_layer(x, filters, strides):
    x = depth_block(x, strides)
    x = single_conv_block(x, filters)
    return x

The channel count increases gradually from 32 to 1024 with each combination layer. At a channel output of 512, the block is called iteratively 5 times. Using this 5x stack gives the full MobileNet; without it the model becomes shallower, and is then called Shallow MobileNet.

for _ in range(5):
    x = combo_layer(x, 512, strides=(1, 1))

At the end, there is a GlobalAveragePooling layer, followed by the final output layer. The output layer is a Dense layer whose size is the number of classes; if there are 3 classes it would be Dense(3). The activation function used is Softmax.

x = GlobalAveragePooling2D()(x)
output = Dense(n_classes, activation='softmax')(x)

Now that we have all the blocks together, let’s merge them to see the entire MobileNet architecture.

Complete MobileNet architecture:

from tensorflow.keras.layers import (Input, Conv2D, DepthwiseConv2D,
                                     BatchNormalization, ReLU,
                                     GlobalAveragePooling2D, Dense)
from tensorflow.keras.models import Model

def depth_block(x, strides):
    x = DepthwiseConv2D(3, strides=strides, padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    return x

def single_conv_block(x, filters):
    x = Conv2D(filters, 1, use_bias=False)(x)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    return x

def combo_layer(x, filters, strides):
    x = depth_block(x, strides)
    x = single_conv_block(x, filters)
    return x

def MobileNet(input_shape=(224, 224, 3), n_classes=1000):
    input = Input(input_shape)

    x = Conv2D(32, 3, strides=(2, 2), padding='same', use_bias=False)(input)
    x = BatchNormalization()(x)
    x = ReLU()(x)

    x = combo_layer(x, 64, strides=(1, 1))
    x = combo_layer(x, 128, strides=(2, 2))
    x = combo_layer(x, 128, strides=(1, 1))
    x = combo_layer(x, 256, strides=(2, 2))
    x = combo_layer(x, 256, strides=(1, 1))
    x = combo_layer(x, 512, strides=(2, 2))
    for _ in range(5):
        x = combo_layer(x, 512, strides=(1, 1))
    x = combo_layer(x, 1024, strides=(2, 2))
    x = combo_layer(x, 1024, strides=(1, 1))

    x = GlobalAveragePooling2D()(x)
    output = Dense(n_classes, activation='softmax')(x)

    model = Model(input, output)
    return model

n_classes = 1000
input_shape = (224, 224, 3)
model = MobileNet(input_shape, n_classes)
model.summary()

Output : (Assuming 1000 final classes — last few lines of model summary)

Figure 5: Model summary last few layers
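
As an optional sanity check (my own addition, not part of the original walkthrough), the parameter count can be compared against the reference implementation that ships with Keras. The two layouts differ slightly (the built-in model uses ReLU6 and a dropout-plus-conv classifier head), so expect the totals to agree closely rather than necessarily exactly.

from tensorflow.keras.applications import MobileNet as KerasMobileNet

ours = MobileNet(input_shape=(224, 224, 3), n_classes=1000)
reference = KerasMobileNet(weights=None, classes=1000)
print(ours.count_params(), reference.count_params())  # both around 4.2 million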

And that’s how we can implement the MobileNets architecture.

References:

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861, 2017.
