Exploring DenseNets and a comparison with other Deep Architectures

Sumanth S Rao · Published in Analytics Vidhya · May 8, 2020

We all know about Convolutional Neural Networks and how effective they are at learning non-linear, complex representations. The trouble is that as the problem gets more complex, we add more layers and make the model deeper, and several issues come to the surface. This makes handling deep neural networks tricky.

Over the last decade, we have seen a multitude of deep network architectures successfully address these issues, and the one thing common to all of them is some architectural novelty to curb vanishing gradients, overfitting, and parameter explosion. We will explore how these architectures are built to address these issues and how DenseNet is novel in its approach. All the content is based on the research paper “Densely Connected Convolutional Networks” by Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger.

The current state-of-the-art deep neural net architectures

ResNets and Highway Networks bypass signal from one layer to the next via identity connections; that is, they pass the input from the previous layer on to the next layer as is, which preserves the input features for each layer to process. Stochastic depth shortens ResNets by randomly dropping layers during training to allow better information and gradient flow.

FractalNets repeatedly combine several parallel layer sequences with different numbers of convolutional blocks to obtain a large nominal depth while maintaining many short paths in the network. Google’s Inception also uses multiple filter sizes and combines the resulting feature maps. Although these approaches vary in network topology and training procedure, they all share a key characteristic:

Create short paths from early layers to later layers.

How dense are these DenseNets?

DenseNets are based on a simple connectivity pattern: each layer in the network is connected to every other layer directly. Yes, you read that right: directly. To preserve the feed-forward nature, each layer obtains inputs from all preceding layers and passes its own feature-maps on to all subsequent layers.

To reiterate this idea: in a normal feed-forward network of L layers, there are L connections, one between each layer and the next. In a DenseNet, there are L(L+1)/2 direct connections.

A typical dense block: each layer is connected to every other layer.
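As a minimal sketch of what these direct connections look like in code (PyTorch is assumed here, and the channel counts are made up for illustration), each layer simply receives the concatenation of the block input and every earlier layer’s output:

```python
import torch
import torch.nn as nn

# Minimal sketch of dense connectivity (PyTorch assumed; sizes are illustrative).
# Each layer sees the concatenation of the block input and all previous outputs.
class TinyDenseBlock(nn.Module):
    def __init__(self, in_channels=3, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            # the i-th layer receives in_channels + i * growth_rate channels
            nn.Conv2d(in_channels + i * growth_rate, growth_rate, kernel_size=3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # gather all preceding feature-maps
            features.append(out)                     # expose this output to all later layers
        return torch.cat(features, dim=1)

block = TinyDenseBlock()
print(block(torch.randn(1, 3, 32, 32)).shape)  # [1, 3 + 4*12, 32, 32] -> [1, 51, 32, 32]
```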

Let’s look inside this densely connected network!

So what advantages do these connections offer over other architectures?

Fewer parameters — Yes! Even with the additional connections, these networks have comparatively fewer parameters. Because each layer receives so much input connectivity, DenseNet layers can be very narrow (e.g. 12 filters per layer); each layer adds only a small set of feature-maps to the network and keeps the rest unchanged, and the final classifier makes its decision based on all the feature-maps in the network.

Better information flow — Since the layers are narrow and the input is preserved, gradients and information flow freely through the network. The last layer has access to the input and all intermediate feature-maps, which amounts to an implicit “deep supervision”.

Less overfitting — These connections also have a regularizing effect, reducing overfitting and making training easier.

What is the Novelty of DenseNet that makes it stand out?

Feature reuse — Instead of drawing representational power from extremely deep or wide architectures, DenseNets exploit the potential of the network through feature reuse: concatenating feature-maps learned by different layers increases variation in the input of subsequent layers and improves efficiency. This constitutes a major difference between DenseNets and ResNets, which combine features by summation. Compared to Inception networks (which also concatenate features from different layers), DenseNets are simpler and more efficient.

The Architecture

A deep DenseNet with three dense blocks. The layers between two adjacent blocks are referred to as transition layers and change feature-map sizes via convolution and pooling.

The above diagram depicts the architecture of a typical DenseNet model. I will not go into great detail here, as the paper explains it better. At a high level, we have dense blocks composed of these interconnected layers, where each layer consists of Batch Normalisation followed by ReLU and a convolution. The dense blocks are connected to each other by transition layers made up of convolution and pooling.
Now that we know these basic elements, we can build our own architecture by stacking any number of dense blocks, and we can customize each layer by choosing the filter size and the number of feature-maps according to our problem.
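Here is a rough sketch of those building blocks, again assuming PyTorch. The bottleneck (1×1) layers and compression factor described in the paper are left out, so treat this as an illustration of the structure rather than a faithful reimplementation:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One layer inside a dense block: BatchNorm -> ReLU -> 3x3 convolution,
    with its output concatenated onto the running feature-map stack."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.net(x)], dim=1)  # append k new feature-maps

class Transition(nn.Module):
    """Between dense blocks: 1x1 convolution plus 2x2 average pooling
    to change the number and spatial size of the feature-maps."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.net(x)

def dense_block(in_channels, growth_rate, num_layers):
    """Stack DenseLayers; the i-th layer sees in_channels + i * growth_rate channels."""
    return nn.Sequential(*[
        DenseLayer(in_channels + i * growth_rate, growth_rate)
        for i in range(num_layers)
    ])
```

Chaining dense_block and Transition modules, with global pooling and a linear classifier on top, gives the overall layout shown in the figure above.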

Growth rate in DenseNets

The number of feature-maps each layer produces decides the number of parameters and the complexity of the model. This hyperparameter is called the “growth rate”.

If each function H(L) produces k feature maps, it follows that the Lth layer has

k0 + k × (L − 1) input feature-maps,

where k0 is the number of channels in the input layer and k is the growth rate. This means that with every added layer, we add k more feature-maps to the total, and hence to the total number of parameters; this number k determines the number of filters in each layer. Even with a small growth rate such as k = 12, which results in comparatively few parameters, the network yields state-of-the-art results. The growth rate regulates how much new information each layer contributes to the global state and hence governs the information flow of the model.
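As a concrete, purely illustrative example with k0 = 16 input channels and a growth rate of k = 12, the number of input feature-maps grows linearly with depth:

```python
# Input feature-maps seen by the L-th layer inside a dense block: k0 + k * (L - 1)
k0, k = 16, 12  # illustrative values; the paper reports strong results with k as small as 12
for L in range(1, 6):
    print(f"layer {L}: {k0 + k * (L - 1)} input feature-maps, {k} new feature-maps produced")
# layer 1: 16, layer 2: 28, layer 3: 40, layer 4: 52, layer 5: 64
```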

Performance and Results

Let’s now look at how efficient this new approach is!

Accuracy — The model was trained on benchmark datasets such as CIFAR, SVHN, and ImageNet. It outperformed the then state-of-the-art models such as ResNets and FractalNets on the CIFAR and SVHN datasets. That’s impressive!

Capacity and parameter efficiency — DenseNets have high capacity, in the sense that as we increase the number of layers and the growth rate, accuracy increases correspondingly. This indicates that the model is not succumbing to overfitting and can learn more complex problems. The model is also parameter efficient: a 250-layer model has only 15.3M parameters, yet it consistently outperforms models such as FractalNet and Wide ResNets that have more than 30M parameters.

A final Summary

  • The model introduces direct connections between any two layers with the same feature-map size.
  • DenseNets scale naturally to hundreds of layers while exhibiting no optimization difficulties.
  • DenseNets tend to yield consistent improvement in accuracy with a growing number of parameters, without any signs of performance degradation or overfitting. Under multiple settings, they achieved state-of-the-art results across several highly competitive datasets.
  • Moreover, DenseNets require substantially fewer parameters and less computation to achieve state-of-the-art performances.

Overall, since the authors trained the model with settings chosen to be directly comparable with existing architectures, DenseNets can likely be improved further with more detailed tuning of hyperparameters and learning rate schedules, and they hold great promise for deep learning!

All the credits to Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger.
Original paper — “Densely Connected Convolutional Networks”
