MobileNets-v1 Paper Walkthrough

Karunesh Upadhyay
5 min read · Oct 3, 2023

--

Representation of Depthwise Separable Conv (Source : Link)

INTRODUCTION:

  1. The paper describes an efficient network architecture that uses depthwise separable convolutions and two global hyperparameters to build small, low-latency models that match the design requirements of mobile and embedded vision applications.
  2. The two global hyperparameters (a width multiplier and a resolution multiplier) trade off latency against accuracy, allowing the model builder to select the right-sized model for an application based on the constraints of the problem.

ARCHITECTURE:

Core components: depthwise separable filters, plus two model-shrinking hyperparameters, the width multiplier and the resolution multiplier.

(1) Depthwise Separable Convolution:

  • MobileNet is built on depthwise separable convolutions, a form of factorized convolution that splits a standard convolution into a depthwise convolution and a 1 × 1 pointwise convolution.
  • A standard convolution both filters and combines its input into a new set of outputs in a single step.
  • A depthwise separable convolution divides this single step into two steps, filtering and combining. The first step, the depthwise convolution, applies a separate filter to each input channel. The depth of each filter becomes 1, and the number of filters becomes M (the number of input channels) instead of N (the number of output channels).
  • The second step applies a pointwise (1 × 1) convolution to combine the filtered channels. Since the depthwise convolution only filters the input channels, this additional 1 × 1 layer is needed to combine the results into new features.
  • Summarising:
Standard vs Depthwise Separable Conv (Source : Link)
  • Cost Comparison:

(a) The combination of a depthwise convolution and a 1 × 1 (pointwise) convolution is called a depthwise separable convolution. For a DK × DK kernel, M input channels, N output channels, and a DF × DF output feature map, a standard convolution costs DK · DK · M · N · DF · DF mult-adds, while the depthwise separable version costs DK · DK · M · DF · DF + M · N · DF · DF, a reduction by a factor of 1/N + 1/DK². With 3 × 3 kernels, this is roughly 8 to 9 times less computation.

Computation Cost reduction (Source : Link)

(b) Within a depthwise separable layer, the depthwise convolution is much less computationally expensive than the pointwise convolution: since N ≫ DK², the 1 × 1 part dominates the cost. The snippet below checks these numbers.
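To make the savings concrete, here is a quick Python check of the cost formulas above for a single internal layer; the shape (DK = 3, M = N = 512, DF = 14) matches the example layer used in the paper.

```python
# One layer: Dk x Dk kernel, M input channels, N output channels,
# Df x Df output feature map (the paper's example layer shape).
Dk, M, N, Df = 3, 512, 512, 14

standard  = Dk * Dk * M * N * Df * Df   # filter and combine in one step
depthwise = Dk * Dk * M * Df * Df       # one Dk x Dk filter per input channel
pointwise = M * N * Df * Df             # 1 x 1 conv to combine channels

print(f"standard:  {standard:,}")                # 462,422,016 mult-adds
print(f"separable: {depthwise + pointwise:,}")   # 52,283,392 mult-adds
print((depthwise + pointwise) / standard)        # ~0.113 = 1/N + 1/Dk**2
```

Note that `pointwise` alone accounts for about 98% of the separable cost, which is exactly point (b) above.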

(2) Network Structure and Training:

  • MobileNet uses batchnorm and ReLU nonlinearities after both the depthwise and the pointwise layer. All layers follow this pattern except the final fully connected layer, which has no nonlinearity and feeds a softmax for classification (a code sketch of the block follows this section).
Building block of MobileNet (Source : Link)
  • Downsampling is handled with strided convolution in the depthwise convolutions as well as in the first layer.
  • A final average pooling reduces spatial resolution to 1 before the fully connected layer.
  • Counting depthwise and pointwise convolutions as separate layers, MobileNet has 28 layers. The whole network is built from depthwise separable convolutions except for the first layer, which is a full convolution.
MobileNet Architecture (Source : Link)
  • Resource utilization: MobileNet spends 95% of its computation time in 1 × 1 convolutions, which also hold 75% of the parameters; nearly all of the computation sits in 1 × 1 convolutions. Mult-Adds refers to the total number of multiply-add operations.
Resource Used per layer (Source : Link)
  • Less regularization and data augmentation: almost no weight decay (L2 regularization) is used on the depthwise filters, since they hold so few parameters (1.06%). The authors note that MobileNets require less regularization and data augmentation because small models have less trouble with overfitting. By the same logic they use no label smoothing, reduce the amount of image distortion during training, and use no side heads.
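A minimal PyTorch sketch of this building block (the helper names here are illustrative, not from the authors' code):

```python
import torch.nn as nn

def conv_bn(in_ch, out_ch, stride):
    """The first MobileNet layer: a full 3x3 convolution + BN + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def conv_dw(in_ch, out_ch, stride):
    """The repeated building block: 3x3 depthwise conv + BN + ReLU,
    then 1x1 pointwise conv + BN + ReLU (as in the figure above)."""
    return nn.Sequential(
        # groups=in_ch makes the 3x3 conv depthwise (one filter per channel);
        # a stride of 2 here handles downsampling
        nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, 1, 0, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```

Stacking `conv_bn` once, then 13 `conv_dw` blocks, then average pooling and a fully connected softmax layer gives the 28 layers counted above (each `conv_dw` contributes 2 layers).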

(3) Hyperparameters: Width Multiplier and Resolution Multiplier

  • Although the MobileNet architecture is already small and low-latency, a specific use case or application may require the model to be even smaller and faster.

Width Multiplier (𝛼):

  • The role of the width multiplier is to thin a network uniformly at each layer.
  • The number of input channels M becomes 𝛼M and the number of output channels N becomes 𝛼N.
  • Updated computation cost: DK · DK · 𝛼M · DF · DF + 𝛼M · 𝛼N · DF · DF.
  • The cost is reduced roughly quadratically, by 𝛼².
  • It can be applied to any model architecture to trade off accuracy, latency, and size in a reasonable way.
  • The reduced structure needs to be trained from scratch (a quick check of the quadratic effect follows this list).
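A back-of-the-envelope check, reusing the example layer shape from the snippet above (values illustrative):

```python
Dk, M, N, Df = 3, 512, 512, 14

def separable_cost(alpha=1.0):
    """Mult-adds of one depthwise separable layer thinned by alpha."""
    m, n = round(alpha * M), round(alpha * N)
    return Dk * Dk * m * Df * Df + m * n * Df * Df

for alpha in (1.0, 0.75, 0.5, 0.25):
    print(alpha, round(separable_cost(alpha) / separable_cost(), 3))
# ratios ~ 1.0, 0.566, 0.254, 0.066 -- tracking alpha**2
```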

Resolution Multiplier (⍴):

  • The resolution multiplier is applied to the input image, and the internal representation of every layer is subsequently reduced by the same multiplier.
  • In practice, ⍴ is set implicitly by changing the input image resolution.
  • Thus, the net computation cost becomes DK · DK · 𝛼M · ⍴DF · ⍴DF + 𝛼M · 𝛼N · ⍴DF · ⍴DF.
  • The cost is reduced roughly quadratically, by ⍴² (see the combined check below).
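Combining both multipliers, one layer's cost (per the formula above) scales roughly with 𝛼² · ⍴². A small sketch, with an illustrative 160 × 160 input standing in for the default 224 × 224:

```python
def cost(alpha=1.0, rho=1.0, Dk=3, M=512, N=512, Df=14):
    """Mult-adds of one depthwise separable layer under both multipliers."""
    m, n, f = round(alpha * M), round(alpha * N), round(rho * Df)
    return Dk * Dk * m * f * f + m * n * f * f

# alpha = 0.75 with a 160x160 input (rho = 160/224):
print(round(cost(0.75, 160 / 224) / cost(), 3))  # ~0.289 ~ 0.75**2 * (160/224)**2
```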

(4) Overall Effect of Depthwise Separable Convolutions and the Hyperparameters:

  • The following table shows the total parameters (in millions) as each change is applied in sequence.
(Source : Link)

EXPERIMENTS:

(1) Depthwise vs Full Convolution:

  • Replacing full convolutions with depthwise separable convolutions reduces accuracy by only about 1%, with huge savings in mult-adds and parameters.
Conv vs MobileNet (Source : Link)

(2) Thin Models (𝛼) vs Shallow Models:

  • To make a shallow model, the 5 layers of separable filters with feature map size 14 × 14 × 512 were removed (refer to the network architecture image above). With similar computation and parameter counts, the thin model is about 3% more accurate than the shallow model.
Shallow vs Thin (Source : Link)

(3) Width Multiplier (𝛼) and Resolution Multiplier (⍴):

  • Accuracy drops off smoothly as 𝛼 and ⍴ shrink, until the architecture is made too small.
Width and Resolution Multiplier (Source : Link)

(4) MobileNet vs Other Architectures:

  • MobileNet is highly competitive in terms of both accuracy and parameter count.
Comparison of Architectures (Source : Link)
  • MobileNet also shows a drastic drop in mult-add operations and parameter counts, with almost identical accuracy, on tasks such as fine-grained classification, face attribute classification, and object detection.

CONCLUSION:

  • MobileNet’s core innovation is its use of depthwise separable convolutions, which drastically reduce parameter counts and computational demands, making it exceptionally efficient.
  • What also sets MobileNet apart is its adaptability through the width and resolution multipliers, which let it flexibly shrink to fit specific resource constraints.
  • Together, these features let MobileNet strike a practical balance between speed and accuracy across a wide spectrum of applications.

REFERENCE:

(1) MobileNet-v1 paper: Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", arXiv:1704.04861, 2017.
