(ML) MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

10 min read · Apr 16, 2023

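MobileNets are a class of efficient models for mobile and embedded vision applications. They are based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks. The paper introduces two simple global hyperparameters that efficiently trade off latency against accuracy, letting model builders choose a model of the right size for their application based on the constraints of the problem. The authors present extensive experiments on resource and accuracy trade-offs and show strong performance compared with other popular models on ImageNet classification. They then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases, including object detection, fine-grained classification, face attributes, and large-scale geolocalization.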

MobileNet models can be applied to various recognition tasks for efficient on device intelligence

Standard CNN

In the figure on the left, the input tensor has two channels. When we convolve it with 3x3 filters, each filter is itself three-dimensional, with size 3x3x2 (one 3x3 slice per input channel). With 3 such filters, the layer has 3x3x2x3 = 54 parameters.

3 x 3 x 2 (input channel) x 3 (filter count) = 54

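Assuming a stride of 1 and padding, the output feature map G of a standard convolution is computed from the input feature map F and the kernel K as:

G(k, l, n) = Σ_{i,j,m} K(i, j, m, n) · F(k+i−1, l+j−1, m)

The computational cost of a standard convolution is:

DK · DK · M · N · DF · DF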

DF : spatial width and height of the input tensor (feature map)
DK : spatial width and height of the filter
M : number of input channels
N : number of filters (output channels)
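As a quick check of the counting above, here is a minimal Python sketch that computes the parameters and multiply-adds of a standard convolution. It assumes the cost formula DK · DK · M · N · DF · DF from the MobileNets paper and ignores biases; the function names are made up for illustration.

```python
def standard_conv_params(dk, m, n):
    """Weights in a standard convolution: N filters of size DK x DK x M (no biases)."""
    return dk * dk * m * n


def standard_conv_mult_adds(dk, m, n, df):
    """Multiply-adds, assuming stride 1 and padding so the output stays DF x DF."""
    return dk * dk * m * n * df * df


# The toy example from the text: 3x3 filters, 2 input channels, 3 filters.
toy_params = standard_conv_params(dk=3, m=2, n=3)

# A larger, hypothetical layer: 3x3 filters, 512 -> 512 channels on a 14x14 map.
layer_cost = standard_conv_mult_adds(dk=3, m=512, n=512, df=14)

print(toy_params, layer_cost)
```

The toy layer reproduces the 54 parameters counted by hand, while the larger layer shows how quickly the cost grows once M and N are both in the hundreds.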

Depthwise Separable Convolution

Depthwise separable convolution is a form of factorized convolution: it splits a standard convolution into a depthwise convolution followed by a pointwise convolution, which is a 1×1 convolution.

Depthwise Convolution

In the figure on the left, the input tensor has two channels. In a depthwise convolution, the number of filters is determined by the input tensor: each input channel is convolved with its own single-channel filter. The parameter count is therefore 3x3x2 = 18.

3 x 3 x 2 (input channel) = 18 parameters

The computational cost of a depthwise convolution is DK · DK · M · DF · DF, where:

DF : spatial width and height of the input tensor (feature map)
DK : spatial width and height of the filter
M : number of input channels
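To illustrate "one filter per channel", the following is a minimal pure-Python sketch of a depthwise convolution on nested lists. For brevity it assumes stride 1 and no padding (so the output shrinks), which differs from the padded case discussed in the text; `depthwise_conv2d` is a hypothetical helper, not a library function.

```python
def depthwise_conv2d(x, kernels):
    """Depthwise convolution with stride 1 and no padding.

    x:       list of M channels, each an H x W list of lists
    kernels: list of M filters, each a K x K list of lists
    Channel m is convolved only with kernel m; channels are never mixed.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        h, w, k = len(channel), len(channel[0]), len(kernel)
        out.append([
            [
                sum(channel[r + i][c + j] * kernel[i][j]
                    for i in range(k) for j in range(k))
                for c in range(w - k + 1)
            ]
            for r in range(h - k + 1)
        ])
    return out


# Two 4x4 input channels (all ones, all twos) and two 3x3 filters of ones.
x = [[[1] * 4 for _ in range(4)], [[2] * 4 for _ in range(4)]]
kernels = [[[1] * 3 for _ in range(3)] for _ in range(2)]

y = depthwise_conv2d(x, kernels)
num_params = sum(len(k) * len(k[0]) for k in kernels)
print(len(y), num_params)
```

The output still has two channels, one per input channel, and the two 3x3 filters hold the 18 parameters counted above.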

Pointwise convolution

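The figure on the left shows the feature map produced by applying the depthwise convolution to the input tensor. Next, 1x1 filters are used to combine its channels. With 3 such filters, this step has 1x1x2x3 = 6 parameters.

1 x 1 x 2 (input channel) x 3 (filter count) = 6 parameters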

1×1 Convolutional Filters called Pointwise Convolution in the context of Depthwise Separable Convolution
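A pointwise convolution is just a weighted sum across channels at every pixel. The sketch below shows this in pure Python under that reading; `pointwise_conv2d` and the example values are illustrative assumptions, and biases are left out.

```python
def pointwise_conv2d(x, weights):
    """1x1 convolution: mix M input channels into N output channels.

    x:       list of M channels, each an H x W list of lists
    weights: list of N filters, each a list of M scalars (a 1x1xM filter)
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [
            [sum(wm * x[m][r][c] for m, wm in enumerate(filt)) for c in range(w)]
            for r in range(h)
        ]
        for filt in weights
    ]


# Two 2x2 input channels and three 1x1x2 filters.
x = [[[1, 2], [3, 4]], [[10, 20], [30, 40]]]
weights = [[1, 0], [0, 1], [1, 1]]

y = pointwise_conv2d(x, weights)
num_params = sum(len(filt) for filt in weights)
print(len(y), num_params)
```

The first two filters simply copy a channel each and the third adds the two channels together, which is the "fusing" step the text describes.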

Compare this with the standard convolution: it needed 54 parameters, whereas the depthwise separable convolution needs only 18 + 6 = 24.
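The same comparison can be written as a few lines of Python. This is only a sketch with hypothetical helper names, counting weights and ignoring biases.

```python
def depthwise_params(dk, m):
    """One DK x DK filter per input channel."""
    return dk * dk * m


def pointwise_params(m, n):
    """N filters of size 1 x 1 x M."""
    return m * n


dk, m, n = 3, 2, 3                      # the toy example from the text
standard = dk * dk * m * n
separable = depthwise_params(dk, m) + pointwise_params(m, n)
print(standard, separable)
```

The gap widens as the channel counts grow, because the standard count scales with DK · DK · M · N while the separable count scales with DK · DK · M + M · N.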

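The cost of a depthwise separable convolution is the sum of the depthwise and pointwise costs, DK · DK · M · DF · DF + M · N · DF · DF, where: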

DF : spatial width and height of the input tensor (feature map)
DK : spatial width and height of the filter
M : number of input channels
N : number of 1x1 filters (output channels)

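Relative to a standard convolution, the depthwise separable convolution reduces computation by a factor of 1/N + 1/DK². With 3x3 filters, this works out to roughly 8 to 9 times less computation.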
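The saving can be checked numerically. The sketch below assumes the cost formulas from the MobileNets paper (DK · DK · M · N · DF · DF for a standard convolution, DK · DK · M · DF · DF + M · N · DF · DF for the separable one) and a hypothetical layer with DK = 3, M = N = 512, DF = 14.

```python
import math


def standard_cost(dk, m, n, df):
    return dk * dk * m * n * df * df


def separable_cost(dk, m, n, df):
    depthwise = dk * dk * m * df * df
    pointwise = m * n * df * df
    return depthwise + pointwise


dk, m, n, df = 3, 512, 512, 14
ratio = separable_cost(dk, m, n, df) / standard_cost(dk, m, n, df)
closed_form = 1 / n + 1 / dk ** 2       # the ratio simplifies to 1/N + 1/DK^2

print(round(ratio, 4), math.isclose(ratio, closed_form))
```

For this layer the ratio is about 0.113, that is, a little under 9 times fewer multiply-adds.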

Network Structure and Training

Left: Standard convolutional layer with batchnorm and ReLU
Right: Depthwise Separable convolutions with Depthwise and Pointwise layers followed by batchnorm and ReLU

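Apart from the first layer, which is a standard convolution, every layer uses depthwise separable convolution for feature extraction.

MobileNet spends 95% of its computation time in 1 × 1 convolutions, which also hold 75% of the parameters.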
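The 95% and 75% figures can be roughly reproduced by tallying multiply-adds and weights per layer type. The sketch below assumes the layer configuration from Table 1 of the MobileNets paper (a stride-2 3x3 convolution with 32 filters, 13 depthwise separable blocks, then a 1000-way fully connected layer), treats the last block as stride 1, and ignores biases and batchnorm; `mobilenet_breakdown` is a hypothetical helper.

```python
# (stride, output channels) of the 13 depthwise separable blocks (assumed from the paper).
BLOCKS = [(1, 64), (2, 128), (1, 128), (2, 256), (1, 256), (2, 512),
          (1, 512), (1, 512), (1, 512), (1, 512), (1, 512),
          (2, 1024), (1, 1024)]


def mobilenet_breakdown(resolution=224, num_classes=1000):
    """Return {layer type: (mult-adds, parameters)} for the baseline network."""
    size = resolution // 2               # the first 3x3 convolution has stride 2
    channels = 32                        # and produces 32 channels from RGB input
    totals = {"conv": [9 * 3 * 32 * size * size, 9 * 3 * 32],
              "dw": [0, 0], "pw": [0, 0]}
    for stride, out_channels in BLOCKS:
        size //= stride                  # the depthwise step applies the stride
        totals["dw"][0] += 9 * channels * size * size
        totals["dw"][1] += 9 * channels
        totals["pw"][0] += channels * out_channels * size * size
        totals["pw"][1] += channels * out_channels
        channels = out_channels
    # Global average pooling leaves a 1x1 map, so the classifier costs one
    # multiply-add per weight.
    totals["fc"] = [channels * num_classes, channels * num_classes]
    return {name: tuple(v) for name, v in totals.items()}


breakdown = mobilenet_breakdown()
total_mult_adds = sum(v[0] for v in breakdown.values())
total_params = sum(v[1] for v in breakdown.values())
pw_mult_add_share = breakdown["pw"][0] / total_mult_adds
pw_param_share = breakdown["pw"][1] / total_params
print(f"1x1 conv: {pw_mult_add_share:.1%} of mult-adds, {pw_param_share:.1%} of parameters")
```

Under these assumptions the 1x1 layers account for just under 95% of the multiply-adds and just under 75% of the weights, in line with the figures quoted above.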

Width Multiplier: Thinner Models

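To make MobileNet even smaller, the paper introduces a width multiplier α that thins every layer by shrinking its channel counts, which cuts the parameters further. With width multiplier α, the cost of a depthwise separable convolution becomes:

DK · DK · αM · DF · DF + αM · αN · DF · DF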

Here α ∈ (0, 1], with typical settings of 1, 0.75, 0.5, and 0.25. α = 1 is the baseline MobileNet, and α < 1 gives reduced MobileNets with fewer parameters.
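To see how α affects model size, the sketch below counts the weights of one depthwise separable layer for the typical α values. It assumes that α scales both the input and output channel counts, as in the paper; the layer size (DK = 3, M = N = 512) and the function name are illustrative.

```python
def separable_params(dk, m, n, alpha=1.0):
    """Weights of a depthwise separable layer thinned by width multiplier alpha."""
    m_thin, n_thin = round(alpha * m), round(alpha * n)
    return dk * dk * m_thin + m_thin * n_thin


baseline = separable_params(3, 512, 512)
params_by_alpha = {a: separable_params(3, 512, 512, a) for a in (1.0, 0.75, 0.5, 0.25)}

for alpha, params in params_by_alpha.items():
    print(alpha, params, round(params / baseline, 3))
```

Because the pointwise term dominates and contains α twice, the parameter count shrinks roughly with α².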

Resolution Multiplier: Reduced Representation

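The second hyperparameter for reducing the computational cost of the network is the resolution multiplier ρ, which shrinks the input tensor. With both multipliers applied, the cost of a depthwise separable convolution becomes:

DK · DK · αM · ρDF · ρDF + αM · αN · ρDF · ρDF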

Here ρ ∈ (0, 1] is usually set implicitly by choosing the input resolution of the network: 224, 192, 160, or 128. ρ = 1 is the baseline MobileNet, and ρ < 1 gives MobileNets with reduced computation.

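The example above applies to an internal MobileNet layer with DK = 3, M = 512, N = 512, DF = 14.
DF : spatial width and height of the input tensor (feature map)
DK : spatial width and height of the filter
M : number of input channels
N : number of 1x1 filters (output channels)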
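The effect of both multipliers on this layer can be sketched as follows. The code assumes the cost formula DK · DK · αM · ρDF · ρDF + αM · αN · ρDF · ρDF from the paper and rounds the scaled sizes to integers; ρ = 0.714 (roughly a 160-pixel input instead of 224) is an illustrative choice, and the function name is hypothetical.

```python
def separable_mult_adds(dk, m, n, df, alpha=1.0, rho=1.0):
    """Multiply-adds of a depthwise separable layer with width and resolution multipliers."""
    m_thin, n_thin = round(alpha * m), round(alpha * n)
    df_small = round(rho * df)
    depthwise = dk * dk * m_thin * df_small * df_small
    pointwise = m_thin * n_thin * df_small * df_small
    return depthwise + pointwise


settings = {
    "baseline": (1.0, 1.0),
    "alpha=0.75": (0.75, 1.0),
    "alpha=0.75, rho=0.714": (0.75, 0.714),
}
results = {name: separable_mult_adds(3, 512, 512, 14, alpha, rho)
           for name, (alpha, rho) in settings.items()}

for name, cost in results.items():
    print(f"{name}: {cost / 1e6:.1f}M mult-adds")
```

Note that ρ changes only the multiply-adds: the filters keep the same shape, so the parameter count depends on α alone.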

Experiments

Depthwise Separable vs Full Convolution MobileNet

The MobileNet built with depthwise separable convolutions has fewer parameters than the one built with full convolutions, while its accuracy is only slightly lower.

Narrow vs Shallow MobileNet

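The shallow MobileNet is a modified MobileNet v1 in which the five layers of depthwise separable filters with feature size 14x14x512 are removed. It is compared against 0.75 MobileNet, the narrower model that shrinks the channels instead.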

MobileNet Width Multiplier

Comparison of different channel reduction ratios (width multiplier)

MobileNet Resolution

Comparison of different input image reductions (resolution multiplier)

Computation (Mult-Adds) and accuracy on the ImageNet benchmark

The number of parameters and accuracy on the ImageNet benchmark

MobileNet Comparison to Popular Models

Smaller MobileNet Comparison to Popular Models

MobileNet for Stanford Dogs

Compared with Inception V3, MobileNet reaches nearly the same performance with almost 7 times fewer parameters.

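Performance of PlaNet using the MobileNet architecture. Percentages are the fraction of the Im2GPS test dataset that was localized within a certain distance from the ground truth. The numbers for the original PlaNet model are based on an updated version with an improved architecture and training dataset.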

COCO object detection results comparison using different frameworks and network architectures

Example object detection results using MobileNet SSD

MobileNet Distilled from FaceNet

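The FaceNet model is a state-of-the-art face recognition model [25]. It builds face embeddings based on the triplet loss. To build a mobile FaceNet model, the authors use distillation, training MobileNet by minimizing the squared differences between its outputs and those of FaceNet.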
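The distillation objective described here can be sketched as a squared-difference loss between teacher and student outputs. This is only an illustration: whether the differences are summed or averaged is not stated in the text (a plain sum is assumed), and the function name and embedding values are hypothetical.

```python
def distillation_loss(teacher_out, student_out):
    """Sum of squared differences between teacher and student outputs."""
    return sum((t - s) ** 2 for t, s in zip(teacher_out, student_out))


# Hypothetical 3-dimensional embeddings for a single training image.
facenet_embedding = [1.0, 2.0, 3.0]      # teacher (FaceNet) output
mobilenet_embedding = [1.0, 2.5, 2.0]    # student (MobileNet) output

loss = distillation_loss(facenet_embedding, mobilenet_embedding)
print(loss)
```

The loss is zero only when the student reproduces the teacher's output exactly, so minimizing it pushes MobileNet toward FaceNet's embeddings without needing the original triplet labels.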

Conclusion

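MobileNets are built on depthwise separable convolutions. The authors walk through several design decisions that make the model run more efficiently. They also show how to build smaller and faster MobileNets using the width multiplier and the resolution multiplier, trading off accuracy to reduce model size and latency. Compared with other popular models, MobileNets show superior size, speed, and accuracy characteristics. The authors conclude by demonstrating MobileNet's effectiveness across a range of applications, and they plan to release the models in TensorFlow to help more people adopt and explore MobileNets.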


Written by YEN HUNG CHENG