The Efficiency of Densenet
From ResNets, Highway Networks, and deep and wide neural networks, people have tried adding more inter-layer connections beyond the direct connections between adjacent layers to improve information flow. Like ResNet, DenseNet adds shortcuts among layers. Different from ResNet, a layer in DenseNet receives all the outputs of its previous layers and concatenates them along the depth dimension. In ResNet, a layer only receives the output from the layer two or three steps back, and those outputs are added together element-wise at the same depth, so adding shortcuts does not change the depth. In other words, in ResNet the output of layer k is x[k] = f(w * x[k-1] + x[k-2]), while in DenseNet it is x[k] = f(w * H(x[k-1], x[k-2], …, x[0])), where H means stacking along the depth dimension. Besides, ResNet makes learning the identity function easy, while DenseNet adds the identity function directly through concatenation.
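The two kinds of shortcut can be sketched with plain arrays (a minimal NumPy sketch; the depths and spatial sizes here are made up for illustration):

```python
import numpy as np

# Two feature maps from earlier layers, shape (depth, height, width).
x_prev1 = np.random.randn(64, 32, 32)
x_prev2 = np.random.randn(64, 32, 32)

# ResNet-style shortcut: element-wise addition, depth unchanged.
resnet_out = x_prev1 + x_prev2
print(resnet_out.shape)    # (64, 32, 32)

# DenseNet-style shortcut: concatenation along the depth dimension,
# so the output depth is the sum of the input depths.
densenet_out = np.concatenate([x_prev1, x_prev2], axis=0)
print(densenet_out.shape)  # (128, 32, 32)
```

The addition keeps the tensor shape fixed, while the concatenation is exactly what makes the depth grow layer by layer in a dense block.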
DenseNet is more efficient on some image classification benchmarks. The following charts show that, for the same level of accuracy, DenseNet needs far fewer parameters and much less computation than ResNet.
DenseNet contains a feature layer (a convolutional layer) capturing low-level features from images, several dense blocks, and transition layers between adjacent dense blocks.
A dense block contains several dense layers. Each dense layer is composed as follows:
The depth of a dense layer’s output is the ‘growth_rate’. As every dense layer receives all the outputs of its previous layers, the input depth for the kth layer is (k-1) * growth_rate + input_depth_of_first_layer. This is also where the name ‘growth_rate’ comes from: if we keep adding layers to a dense block, the depth grows linearly. Say the growth rate is 30; after 100 layers the depth will be over 3000, and we run into a computation explosion. To circumvent this, the paper applies a transition layer to reduce and abstract the features after each dense block, which itself has a limited number of dense layers. I really like to use this image of the Big Bang for demonstration. Unfortunately it is not copyleft.
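The linear growth is easy to check with a few lines (a sketch; the initial input depth of 64 is an assumption for illustration, not from the text):

```python
def input_depth(k, growth_rate, first_input_depth):
    """Input depth of the k-th dense layer (1-indexed): the first layer's
    input depth plus one growth_rate slab per preceding dense layer."""
    return (k - 1) * growth_rate + first_input_depth

# With a growth rate of 30 and an assumed initial depth of 64,
# the layer after 100 dense layers sees an input over 3000 channels deep.
print(input_depth(101, 30, 64))  # 3064
```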
To reduce the computation, a 1x1 convolutional layer (the bottleneck layer) is added, which gives the second convolutional layer a fixed input depth. It is also easy to see that the size (width and height) of the feature maps stays the same through a dense layer, which makes it easy to stack any number of dense layers together to build a dense block. For example, densenet121 has four dense blocks, which have 6, 12, 24, and 16 dense layers respectively. With this repetition, it is not that difficult to reach 121 layers :-).
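The 121 in the name can be reproduced by adding everything up (a sketch; it assumes the usual convention of counting only convolutional and fully connected layers):

```python
block_config = [6, 12, 24, 16]  # dense layers per block in densenet121

layers = 1                       # the initial feature (convolutional) layer
layers += 2 * sum(block_config)  # each dense layer = 1x1 bottleneck + 3x3 conv
layers += len(block_config) - 1  # one 1x1 conv per transition layer
layers += 1                      # the final classification (fully connected) layer

print(layers)  # 121
```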
Traditionally in a CNN, the size of every layer’s output decreases in order to abstract higher-level features. In DenseNet, the transition layers take this responsibility while the dense blocks keep the spatial size. Every transition layer contains a 1x1 convolutional layer and a 2x2 average pooling layer with a stride of 2 to reduce the size to half. Be aware that a transition layer also receives all the outputs from all the layers of its preceding dense block. So the 1x1 convolutional layer reduces the depth to a fixed number, while the average pooling reduces the size.
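What the average pooling does to the feature-map size can be sketched with NumPy (a toy sketch; a real transition layer would first apply the 1x1 convolution to reduce the depth, which is omitted here):

```python
import numpy as np

# A feature map of shape (depth, height, width) with even height and width;
# the depth 256 is an arbitrary example value.
x = np.random.randn(256, 32, 32)

# 2x2 average pooling with stride 2: split the map into 2x2 tiles
# and average each tile, halving the width and height.
d, h, w = x.shape
pooled = x.reshape(d, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

print(pooled.shape)  # (256, 16, 16)
```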
As we have seen, DenseNet uses much fewer parameters than ResNet, so its parameters are more representative on average. The above chart uses the L1 norm of the weights to represent the degree of utilisation of each input. We can see that layers tend to use the information from their closer previous layers, which means the closer layers are not bypassed. The transition layers might be the most interesting ones from this point of view, as shown in the chart. Therefore, there are fewer redundant layers, and fewer redundant layers means more parameter efficiency and less computation.
The architecture graph
Here you can see the detailed architecture of densenet121 as printed out in PyTorch.