Understanding Hinton’s Capsule Networks. Part IV: CapsNet Architecture

Part of Understanding Hinton’s Capsule Networks Series:

Part I: Intuition
Part II: How Capsules Work
Part III: Dynamic Routing Between Capsules
Part IV: CapsNet Architecture (you are reading it now)

Introduction

In this part, I will walk through the architecture of the CapsNet. I will also take a shot at calculating the number of trainable parameters in the CapsNet. My resulting number is around 8.2 million trainable parameters, which differs from the 11.36 million officially referred to in the paper. The paper itself is not very detailed, and it therefore leaves some open questions about the specifics of the network implementation that are, as of today, still unanswered because the authors have not provided their code. Nonetheless, I still think that counting parameters in a network is a good exercise purely for learning purposes, as it forces one to understand all the building blocks of a particular architecture.

Part I. Encoder.

CapsNet encoder architecture. Source: original paper.

Layer 1. Convolutional layer

Input: 28x28 image (one color channel).
Output: 20x20x256 tensor.
Number of parameters: 20992.
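
For illustration, here is a minimal PyTorch sketch of this layer and a check of the parameter count. The 9x9 kernel size and stride of 1 come from the paper, not from the summary above; with a 9x9 kernel and stride 1, the spatial size drops from 28 to 28 − 9 + 1 = 20, which explains the 20x20x256 output.

```python
import torch.nn as nn

# Conv1: 256 filters of size 9x9 over the single input channel, stride 1
# (kernel size and stride as described in the paper), followed by ReLU.
conv1 = nn.Conv2d(in_channels=1, out_channels=256, kernel_size=9, stride=1)

# 9 * 9 * 1 * 256 weights + 256 biases = 20,736 + 256 = 20,992
print(sum(p.numel() for p in conv1.parameters()))  # 20992
```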

Layer 2. PrimaryCaps layer

Input: 20x20x256 tensor.
Output: 6x6x8x32 tensor.
Number of parameters: 5308672.
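
One way to read this layer (my interpretation, since the paper leaves implementation details open) is as 32 parallel convolutional "capsules", each producing an 8-dimensional output per spatial location, with 9x9 kernels and stride 2 as in the paper. A stride of 2 on a 20x20 input gives ⌊(20 − 9)/2⌋ + 1 = 6, hence the 6x6 grid. A quick sketch and parameter check:

```python
import torch.nn as nn

# PrimaryCaps read as 32 convolutional capsules, each with 8 output channels
# (one 8D capsule vector per spatial position), 9x9 kernels, stride 2.
primary_caps = nn.ModuleList(
    [nn.Conv2d(in_channels=256, out_channels=8, kernel_size=9, stride=2)
     for _ in range(32)]
)

# Per capsule: 9 * 9 * 256 * 8 weights + 8 biases = 165,896
# Times 32 capsules: 165,896 * 32 = 5,308,672
print(sum(p.numel() for p in primary_caps.parameters()))  # 5308672
```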

Layer 3. DigitCaps layer

Input: 6x6x8x32 tensor.
Output: 16x10 matrix.
Number of parameters: 1497600.
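
This count deserves a note. Each of the 6·6·32 = 1152 primary capsule outputs gets its own 8x16 weight matrix W_ij to each of the 10 digit capsules, which gives 1152 · 10 · 8 · 16 = 1,474,560 weights. The remaining 23,040 in the total above comes from counting one coupling coefficient c_ij and one routing logit b_ij per capsule pair (1152 · 10 each); whether those should count as trainable parameters is debatable, since dynamic routing recomputes them rather than learning them by gradient descent. A quick arithmetic check:

```python
# DigitCaps parameter count, broken down
lower_caps = 6 * 6 * 32          # 1152 eight-dimensional primary capsule outputs
w_ij = lower_caps * 10 * 8 * 16  # one 8x16 prediction matrix per (i, j) pair -> 1,474,560
c_and_b = 2 * lower_caps * 10    # coupling coefficients c_ij + routing logits b_ij -> 23,040
print(w_ij + c_and_b)            # 1497600
```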

The loss function

The loss function might look complicated at first sight, but it really is not. It is very similar to the SVM loss function. In order to understand the main idea of how it works, recall that the output of the DigitCaps layer is 10 sixteen-dimensional vectors. During training, for each training example, one loss value is calculated for each of the 10 vectors according to the formula below, and then the 10 values are added together to calculate the final loss. Because we are dealing with supervised learning, each training example has a correct label; in this case it is a ten-dimensional one-hot encoded vector with 9 zeros and a 1 at the correct position. In the loss function formula, the correct label determines the value of T_c: it is 1 if the correct label corresponds to the digit of this particular DigitCap and 0 otherwise.
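
For reference, here is a minimal NumPy sketch of the margin loss, using the constants from the paper (m⁺ = 0.9, m⁻ = 0.1, λ = 0.5). This is a sketch of the math only, not the authors' implementation:

```python
import numpy as np

def margin_loss(digit_caps_norms, target_one_hot, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss for a single training example.

    digit_caps_norms: length-10 array of DigitCap output vector lengths ||v_c||.
    target_one_hot:   length-10 one-hot label (the T_c in the formula).
    """
    present = np.maximum(0.0, m_plus - digit_caps_norms) ** 2  # penalizes a short vector for the correct digit
    absent = np.maximum(0.0, digit_caps_norms - m_minus) ** 2  # penalizes long vectors for the wrong digits
    per_digit = target_one_hot * present + lam * (1.0 - target_one_hot) * absent
    return per_digit.sum()  # the 10 per-digit losses are summed into the final loss
```

In words: the loss for the correct digit is zero once its vector is longer than 0.9, and the loss for each of the other nine digits is zero once its vector is shorter than 0.1.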

Color coded loss function equation. Source: author, based on original paper.
Loss function value for correct and incorrect DigitCap. Note that the red graph is “squashed” vertically compared to the green one. This is due to the lambda multiplier from the formula. Source: author.

Part II. Decoder.

CapsNet decoder architecture. Source: original paper.
Top row: original images. Bottom row: reconstructed images. Source: original paper.

Layer 4. Fully connected #1

Input: 16x10.
Output: 512.
Number of parameters: 82432.

Layer 5. Fully connected #2

Input: 512.
Output: 1024.
Number of parameters: 525312.

Layer 6. Fully connected #3

Input: 1024.
Output: 784 (which after reshaping gives back a 28x28 decoded image).
Number of parameters: 803600.
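
Putting the three layers together, here is a minimal PyTorch sketch of the decoder, assuming ReLU activations on the first two layers and a sigmoid on the output as in the paper, with a parameter check:

```python
import torch.nn as nn

# Decoder: three fully connected layers that map the masked 16x10 DigitCaps
# output back to a 28x28 reconstruction of the input digit.
decoder = nn.Sequential(
    nn.Linear(16 * 10, 512),  # 160 * 512 + 512   =  82,432
    nn.ReLU(inplace=True),
    nn.Linear(512, 1024),     # 512 * 1024 + 1024 = 525,312
    nn.ReLU(inplace=True),
    nn.Linear(1024, 784),     # 1024 * 784 + 784  = 803,600
    nn.Sigmoid(),             # pixel intensities in [0, 1]; reshape to 28x28
)

print(sum(p.numel() for p in decoder.parameters()))  # 1411344
```

Adding the encoder counts gives 20,992 + 5,308,672 + 1,497,600 + 1,411,344 = 8,238,608, the roughly 8.2 million trainable parameters mentioned in the introduction.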

Conclusion

This wraps up the series on the CapsNet. There are many other very good resources around the internet; if you would like to learn more about this fascinating topic, please have a look at this awesome compilation of links about CapsNets.

Thanks for reading! If you enjoyed it, hit that clap button below and subscribe to updates on my website! It would mean a lot to me and encourage me to write more stories like this.

You can follow me on Twitter. Let’s also connect on LinkedIn.