MobileNets-v1 Paper Walkthrough

Karunesh Upadhyay
5 min read · Oct 3, 2023

--

Representation of Depthwise Separable Conv (Source : Link)

INTRODUCTION:

  1. The paper describes an efficient network architecture that uses depthwise separable convolutions and two global hyperparameters to build small, low-latency models that match the design requirements of mobile and embedded vision applications.
  2. The two global hyperparameters (a width multiplier and a resolution multiplier) trade off latency against accuracy, allowing the model builder to select the right-sized model for an application based on the constraints of the problem.

ARCHITECTURE:

Core components: depthwise separable filters, plus two model-shrinking hyperparameters, the width multiplier and the resolution multiplier.

(1) Depthwise Separable Convolution:

  • MobileNet is built on depthwise separable convolutions, a form of factorized convolution that splits a standard convolution into a depthwise convolution and a 1 × 1 pointwise convolution.
  • A standard convolution both filters and combines its input into a new set of outputs in a single step.
  • A depthwise separable convolution divides this single step into two steps, filtering and combining. The first step, the depthwise convolution, applies a separate filter to each input channel. The depth of each filter becomes 1, and the number of filters becomes M (the number of input channels) instead of N (the number of output channels).
  • The second step applies a pointwise (1 × 1) convolution to combine the filtered channels. Since the depthwise convolution only filters the input channels, this additional 1 × 1 layer is needed to combine the results into new features.
  • Summarising:
Standard vs Depthwise Separable Conv (Source : Link)
  • Cost Comparison:

(a) The combination of a depthwise convolution and a 1 × 1 (pointwise) convolution is called a depthwise separable convolution. For a DK × DK kernel, M input channels, N output channels, and a DF × DF output feature map, a standard convolution costs DK · DK · M · N · DF · DF mult-adds, while the depthwise separable version costs DK · DK · M · DF · DF + M · N · DF · DF, a reduction by a factor of 1/N + 1/DK². With 3 × 3 kernels, this is roughly 8 to 9 times less computation.

Computation Cost reduction (Source : Link)

(b) Within a depthwise separable layer, the depthwise convolution is much less computationally expensive than the pointwise convolution: since N ≫ DK², the 1 × 1 part dominates the cost. The snippet below checks these numbers.
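To make the savings concrete, here is a quick Python check of the cost formulas above for a single internal layer; the shape (DK = 3, M = N = 512, DF = 14) matches the example layer used in the paper.

```python
# One layer: Dk x Dk kernel, M input channels, N output channels,
# Df x Df output feature map (the paper's example layer shape).
Dk, M, N, Df = 3, 512, 512, 14

standard  = Dk * Dk * M * N * Df * Df   # filter and combine in one step
depthwise = Dk * Dk * M * Df * Df       # one Dk x Dk filter per input channel
pointwise = M * N * Df * Df             # 1 x 1 conv to combine channels

print(f"standard:  {standard:,}")                # 462,422,016 mult-adds
print(f"separable: {depthwise + pointwise:,}")   # 52,283,392 mult-adds
print((depthwise + pointwise) / standard)        # ~0.113 = 1/N + 1/Dk**2
```

Note that `pointwise` alone accounts for about 98% of the separable cost, which is exactly point (b) above.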

(2) Network Structure and Training:

  • MobileNet uses batchnorm and ReLU nonlinearities after both the depthwise and the pointwise layer. All layers follow this pattern except the final fully connected layer, which has no nonlinearity and feeds a softmax for classification (a code sketch of the block follows this section).
Building block of MobileNet (Source : Link)
  • Downsampling is handled with strided convolution in the depthwise convolutions as well as in the first layer.
  • A final average pooling reduces spatial resolution to 1 before the fully connected layer.
  • Counting depthwise and pointwise convolutions as separate layers, MobileNet has 28 layers. The whole network is built from depthwise separable convolutions except for the first layer, which is a full convolution.
MobileNet Architecture (Source : Link)
  • Resource utilization: MobileNet spends 95% of its computation time in 1 × 1 convolutions, which also hold 75% of the parameters; nearly all of the computation sits in 1 × 1 convolutions. Mult-Adds refers to the total number of multiply-add operations.
Resource Used per layer (Source : Link)
  • Less regularization and data augmentation: almost no weight decay (L2 regularization) is used on the depthwise filters, since they hold so few parameters (1.06%). The authors note that MobileNets require less regularization and data augmentation because small models have less trouble with overfitting. By the same logic they use no label smoothing, reduce the amount of image distortion during training, and use no side heads.
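A minimal PyTorch sketch of this building block (the helper names here are illustrative, not from the authors' code):

```python
import torch.nn as nn

def conv_bn(in_ch, out_ch, stride):
    """The first MobileNet layer: a full 3x3 convolution + BN + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def conv_dw(in_ch, out_ch, stride):
    """The repeated building block: 3x3 depthwise conv + BN + ReLU,
    then 1x1 pointwise conv + BN + ReLU (as in the figure above)."""
    return nn.Sequential(
        # groups=in_ch makes the 3x3 conv depthwise (one filter per channel);
        # a stride of 2 here handles downsampling
        nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, 1, 0, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```

Stacking `conv_bn` once, then 13 `conv_dw` blocks, then average pooling and a fully connected softmax layer gives the 28 layers counted above (each `conv_dw` contributes 2 layers).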

(3) Hyperparameters: Width Multiplier and Resolution Multiplier

  • Although the MobileNet architecture is already small and low-latency, a specific use case or application may require the model to be even smaller and faster.

Width Multiplier (𝛼):

  • The role of the width multiplier is to thin a network uniformly at each layer.
  • The number of input channels M becomes 𝛼M and the number of output channels N becomes 𝛼N.
  • Updated computation cost: DK · DK · 𝛼M · DF · DF + 𝛼M · 𝛼N · DF · DF.
  • The cost is reduced roughly quadratically, by 𝛼².
  • It can be applied to any model architecture to trade off accuracy, latency, and size in a reasonable way.
  • The reduced structure needs to be trained from scratch (a quick check of the quadratic effect follows this list).
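A back-of-the-envelope check, reusing the example layer shape from the snippet above (values illustrative):

```python
Dk, M, N, Df = 3, 512, 512, 14

def separable_cost(alpha=1.0):
    """Mult-adds of one depthwise separable layer thinned by alpha."""
    m, n = round(alpha * M), round(alpha * N)
    return Dk * Dk * m * Df * Df + m * n * Df * Df

for alpha in (1.0, 0.75, 0.5, 0.25):
    print(alpha, round(separable_cost(alpha) / separable_cost(), 3))
# ratios ~ 1.0, 0.566, 0.254, 0.066 -- tracking alpha**2
```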

Resolution Multiplier (⍴):

  • The resolution multiplier is applied to the input image, and the internal representation of every layer is subsequently reduced by the same multiplier.
  • In practice, ⍴ is set implicitly by changing the input image resolution.
  • Thus, the net computation cost becomes DK · DK · 𝛼M · ⍴DF · ⍴DF + 𝛼M · 𝛼N · ⍴DF · ⍴DF.
  • The cost is reduced roughly quadratically, by ⍴² (see the combined check below).
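Combining both multipliers, one layer's cost (per the formula above) scales roughly with 𝛼² · ⍴². A small sketch, with an illustrative 160 × 160 input standing in for the default 224 × 224:

```python
def cost(alpha=1.0, rho=1.0, Dk=3, M=512, N=512, Df=14):
    """Mult-adds of one depthwise separable layer under both multipliers."""
    m, n, f = round(alpha * M), round(alpha * N), round(rho * Df)
    return Dk * Dk * m * f * f + m * n * f * f

# alpha = 0.75 with a 160x160 input (rho = 160/224):
print(round(cost(0.75, 160 / 224) / cost(), 3))  # ~0.289 ~ 0.75**2 * (160/224)**2
```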

(4) Overall Effect of Depthwise Separable Convolutions and the Hyperparameters:

  • The following table shows the total parameters (in millions) as each change is applied in sequence.
(Source : Link)

EXPERIMENTS:

(1) Depthwise vs Full Convolution:

  • Replacing full convolutions with depthwise separable convolutions reduces accuracy by only about 1%, with huge savings in mult-adds and parameters.
Conv vs MobileNet (Source : Link)

(2) Thin Models (𝛼) vs Shallow Models:

  • To make a shallow model, the 5 layers of separable filters with feature map size 14 × 14 × 512 were removed (refer to the network architecture image above). With similar computation and parameter counts, the thin model is about 3% more accurate than the shallow model.
Shallow vs Thin (Source : Link)

(3) Width Multiplier (𝛼) and Resolution Multiplier (⍴):

  • Accuracy drops off smoothly as 𝛼 and ⍴ shrink, until the architecture is made too small.
Width and Resolution Multiplier (Source : Link)

(4) MobileNet vs Other Architectures:

  • MobileNet is highly competitive in terms of both accuracy and parameter count.
Comparison of Architectures (Source : Link)
  • MobileNet also shows a drastic drop in mult-add operations and parameter counts, with almost identical accuracy, on tasks such as fine-grained classification, face attribute classification, and object detection.

CONCLUSION:

  • MobileNet’s core innovation is its use of depthwise separable convolutions, which drastically reduce parameter counts and computational demands, making it exceptionally efficient.
  • What also sets MobileNet apart is its adaptability through the width and resolution multipliers, which let it flexibly shrink to fit specific resource constraints.
  • Together, these features let MobileNet strike a practical balance between speed and accuracy across a wide spectrum of applications.

REFERENCE:

(1) MobileNet-v1 paper: Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", arXiv:1704.04861, 2017.
