# EfficientNet and EfficientDet Explained

EfficientDet is an object detector built on top of EfficientNet, so we’ll look at the latter first.

## EfficientNet

The paper sets out to explore the following problem: given a baseline model, i.e. a CNN architecture, how can we scale the model to get better accuracy? There are 3 ways to scale a model.

- Resolution — input resolution, e.g. (224, 224) to (512, 512)
- Depth — number of conv blocks (layers), e.g. ResNet-18 to ResNet-101
- Width — number of channels in each conv block, e.g. 8, 16, 32 to 16, 32, 64

While different models choose different scaling techniques, their experiments show that the accuracy gains saturate quickly if only one kind of scaling is applied. If only the input resolution is increased, the given number of conv blocks isn’t enough to retain the receptive field. Also, with more channels we can hope to extract finer features from the new high-resolution input. Increasing the network depth works only up to a point, even with batch norm and skip connections to combat the vanishing gradient problem. And increasing the number of channels with the same number of layers creates a bottleneck.

Thus, there is a relationship between the resolution, depth and width of the tensors and the network. For most cases the CNN can be modelled as a compound function **F** applied to a tensor of shape **H, W, C**. And since most models repeat their conv blocks, the depth of the network can be modelled as the number of times (**L**) we apply **F** to the tensor. So we can choose to increase H, W, C and L.
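As a toy sketch of this modelling, the network stage is just the same function F applied L times to a tensor of shape (H, W, C). The function used here is purely illustrative, not a real conv block:

```python
import numpy as np

# Model a CNN stage as one function F applied L times to a tensor.
# F here is an illustrative shape-preserving transform, not a real conv block.
def apply_stage(x, F, L):
    for _ in range(L):
        x = F(x)
    return x

x = np.zeros((224, 224, 8))            # a tensor of shape (H, W, C)
y = apply_stage(x, lambda t: np.tanh(t + 1.0), L=4)
print(y.shape)  # shape preserved: (224, 224, 8)
```

Scaling depth means increasing L; scaling width or resolution means changing C or (H, W) of the input tensor.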

They introduce a solution called **compound scaling** which depends on 4 hyperparameters: ϕ, α, β, γ. Here ϕ is the compound scaling factor, chosen by the user based on the available compute resources, while α, β, γ determine the relative scaling ratios for depth, width and resolution respectively. The FLOPS of a CNN are proportional to α, β², γ². Read the appendix for how. Assuming we have twice the resources, then

α · β² · γ² ≈ 2, where α ≥ 1, β ≥ 1, γ ≥ 1

And any further scaling follows the form of

depth: d = α^ϕ,  width: w = β^ϕ,  resolution: r = γ^ϕ

where ϕ is the user-chosen hyperparameter. They also introduce a new baseline architecture found using Neural Architecture Search (think AutoML). Using this baseline model they perform a grid search for the best-performing values of α, β, γ, which turn out to be 1.2, 1.1 and 1.15 respectively. They then reuse the same values for all values of ϕ. Even though a separate grid search for each variant would give better values and thus better performance, it is not worth the time and compute.

The main observation is to scale depth, width and resolution together in the ratio α : β : γ to get the most accuracy.
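Compound scaling can be sketched in a few lines using the grid-searched ratios above. The base depth/width/resolution values below are illustrative, not the exact EfficientNet-B0 configuration:

```python
# Grid-searched ratios from the paper: alpha (depth), beta (width), gamma (resolution).
# Note alpha * beta**2 * gamma**2 ≈ 1.92 ≈ 2, so FLOPS roughly double per unit of phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi, base_depth, base_width, base_resolution):
    """Scale depth, width and resolution together by the compound factor phi."""
    depth = round(base_depth * ALPHA ** phi)            # layers: d = alpha^phi
    width = round(base_width * BETA ** phi)             # channels: w = beta^phi
    resolution = round(base_resolution * GAMMA ** phi)  # input size: r = gamma^phi
    return depth, width, resolution

# Illustrative base values, not the real B0 config.
print(compound_scale(1, base_depth=18, base_width=32, base_resolution=224))
```

Each increment of ϕ multiplies all three dimensions by their fixed ratios, rather than growing one dimension in isolation.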

## EfficientDet

To understand EfficientDet we first need to know about the Feature Pyramid Network (FPN). It’s based on the traditional idea of running the algorithm on multiple resolutions of the same image, hoping to capture both small- and large-scale phenomena. Here, however, instead of using the image at different resolutions, it uses feature maps at different resolutions.

In the image above, the bottom-up pathway represents the traditional backbone of a CNN, and the top-down pathway represents feature fusion at multiple scales. The idea of the lateral connection is to combine feature-rich but low-resolution feature maps with high-resolution but less semantically meaningful feature maps. The predict block applies a 3x3 convolution and produces the final feature maps on which further operations (detection, segmentation) can be performed.

Now that FPN is clear, let’s move on to EfficientDet. The paper makes two major additions on top of EfficientNet, namely **BiFPN** and a new **compound scaling** method. EfficientDet uses EfficientNet as its backbone and adds a bidirectional feature pyramid network to help with multi-scale feature fusion.

BiFPN has 5 modifications over a normal FPN.

- Instead of only top-down feature fusion, it adds another bottom-up fusion branch
- It has skip connections from the initial feature maps to the fused feature maps
- Nodes with only one input are removed, because they contribute less to feature fusion than other nodes
- The entire module is repeated multiple times
- Features are not summed directly; instead a weighted average is used, hoping that feature maps at different resolutions contribute to the fusion in different proportions. Unbounded weights cause problems in backprop, so they need to be normalised. They tried applying softmax to the weights, which worked but slowed down training. So instead each weight is passed through a ReLU and divided by the sum of all the weights (plus a small ε) to normalise them.
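The weighted fusion in the last point can be sketched as follows. This is a NumPy toy, not the paper's implementation, and the ε value is an assumption:

```python
import numpy as np

# BiFPN-style "fast normalized fusion": each input feature map gets a scalar
# weight (learnable in the real network); weights go through ReLU and are
# divided by their sum plus a small epsilon, instead of a softmax.
def fast_normalized_fusion(features, weights, eps=1e-4):
    """features: list of same-shape arrays; weights: one scalar per feature."""
    w = np.maximum(weights, 0.0)   # ReLU keeps each weight non-negative
    w = w / (w.sum() + eps)        # normalise so the weights sum to ~1
    return sum(wi * f for wi, f in zip(w, features))

f1 = np.ones((4, 4))
f2 = np.full((4, 4), 3.0)
fused = fast_normalized_fusion([f1, f2], np.array([1.0, 1.0]))
print(fused[0, 0])  # roughly 2.0, the equal-weight average of 1 and 3
```

Compared to softmax, this avoids the exponentials while still keeping the fused output bounded by the inputs.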

The need for a new scaling technique comes from the fact that the BiFPN is an additional module in the network, and it too can be scaled. Unlike EfficientNet, though, no grid-searched heuristic is given here: the input resolution and the depth of the BiFPN increase linearly with ϕ, while the width of the BiFPN increases exponentially.
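These heuristics can be sketched as below. The constants follow the formulas reported in the paper, but treat them as approximate: the paper additionally rounds the resulting widths, which this sketch does not:

```python
# Sketch of EfficientDet's compound scaling heuristics as a function of phi.
def efficientdet_config(phi):
    return {
        "input_resolution": 512 + 128 * phi,     # grows linearly with phi
        "bifpn_depth": 3 + phi,                  # BiFPN repeats, linear in phi
        "bifpn_width": int(64 * (1.35 ** phi)),  # BiFPN channels, exponential in phi
    }

print(efficientdet_config(0))  # the smallest variant, EfficientDet-D0
```

So a single ϕ jointly grows the resolution, the number of BiFPN repeats and the BiFPN channel count.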

## Appendix

Let T1 and T2 be the input and output tensors, with dimensions H, W, C1 and H, W, C2 respectively. The FLOPS required for the convolution with a kernel of size K × K are

FLOPS ≈ H · W · C1 · C2 · K²

Note: The number of kernels is determined by the number of channels of the output tensor, i.e. C2 in this case. Thus doubling the number of channels doubles both C1 and C2 and results in a 4x increase in FLOPS, which is why FLOPS scale with β². Likewise, doubling the resolution doubles both H and W (hence γ²), while doubling the depth only doubles the number of such convolutions (hence α).
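The count above can be sketched directly; this counts multiply-accumulates only and ignores bias, stride and padding:

```python
# FLOPS for a single K x K convolution layer, following the appendix:
# each of the H*W output positions applies C2 kernels of size K*K*C1.
def conv_flops(h, w, c_in, c_out, k):
    return h * w * c_in * c_out * k * k

base = conv_flops(56, 56, 32, 64, 3)       # baseline layer
doubled = conv_flops(56, 56, 64, 128, 3)   # both channel counts doubled
print(doubled / base)  # 4.0 — doubling the width quadruples the FLOPS
```

The same function shows the resolution effect: doubling both h and w also multiplies the result by 4.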