
A Deep Dive into Capsule Networks

The theory behind Geoffrey Hinton’s fascinating creation.

Tanishq Gautam · Published in Analytics Vidhya · 5 min read · Jun 28, 2020


Let’s begin with human visual recognition. When we see an object (any real object is composed of smaller parts), our eyes make fixation points, and the relative positions of these fixation points help our brain recognize the object. This way, the brain does not need to process every minute detail; it combines hierarchical information to identify the whole.

The assumption behind Capsule Networks, or simply CapsNet, is that there are capsules (groups of neurons) that tell whether certain objects are present in an image. For each such object there is a capsule which gives us the probability that the entity exists, along with the instantiation parameters of that entity.

Therefore, in a simple definition, a CapsNet takes an image as input and finds what kind of object is present in it and what its instantiation parameters (rotation, thickness, etc.) are.

What is a Capsule?

A capsule is a group of neurons whose activity vector represents the instantiation parameters of an object. Let’s break it down:
Active capsules at a lower level make predictions for the parameters of higher-level capsules. For example, lower-level capsules can represent features like a rectangle or a triangle, whereas a higher-level capsule represents the house built from them. High-level capsules become active when many low-level capsules come to an agreement, and this is how a hierarchy of features is established.
Next, the probability that an entity is present is given by the length of its activity vector, and the entity’s properties are encoded in the orientation of the vector.

We use a special non-linear transformation called the squashing function to make sure the length of the output vector lies between 0 and 1 (so it can be read as a probability) while preserving its orientation; it serves as the activation function for capsule networks.
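A minimal NumPy sketch of the squashing function (the formula v = (‖s‖² / (1 + ‖s‖²)) · s/‖s‖ is from the original paper; the function name and the small eps added for numerical stability are my own):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Short vectors shrink toward length 0, long vectors approach
    # length 1; the direction of s is preserved.
    squared_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = squared_norm / (1.0 + squared_norm)
    return scale * s / np.sqrt(squared_norm + eps)

v = squash(np.array([3.0, 4.0]))   # length 5 squashes to 25/26 ~ 0.96
```

Note that the output length never quite reaches 1, which is what lets it act as a probability.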

Architecture

Source: “Dynamic Routing Between Capsules” (Sabour, Frosst, and Hinton, 2017)

The figure above shows a simple, shallow three-layer architecture: two convolutional layers and one fully connected layer.

The levels of the Capsule Network can be divided as follows:

  • Level 1 : Input layer
    At this level, the input image is fed into the net.
  • Level 2 : CNN layer
    At this level, 256 filters of size 9 x 9 are applied to the entire image with a stride of 1, followed by a ReLU activation. The output dimension becomes 20 x 20 x 256.
  • Level 3 : Primary Capsules
    The primary capsules represent the lowest level of multi-dimensional entities and are connected to the digit capsules. There are 32 capsule channels, each a 6 x 6 grid of 8-dimensional capsules, giving 32 x 6 x 6 = 1152 capsules in total; every capsule within a 6 x 6 grid shares the same weight matrix Wij, as shown in the figure above.
  • Level 4 : Digit Capsules
    The digit capsules represent higher-level entities and become active through agreement among the primary capsules in the layer below. There are 10 capsules in this layer, one per digit class, each having 16 dimensions.
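The shapes in the list above follow from simple convolution arithmetic. Here is a quick check, assuming the 28 x 28 MNIST input used in the paper:

```python
def conv_out(size, kernel, stride):
    # Spatial size after a valid (no-padding) convolution.
    return (size - kernel) // stride + 1

conv1 = conv_out(28, 9, 1)       # Conv1: 9x9 kernel, stride 1 -> 20
grid = conv_out(conv1, 9, 2)     # PrimaryCaps conv: 9x9 kernel, stride 2 -> 6
num_primary = 32 * grid * grid   # 32 capsule channels on the 6x6 grid -> 1152
```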

Routing by Agreement

This is the key aspect of the Capsule Network.
A CapsNet consists of several layers. Capsules in the lower layer correspond to simple entities; these low-level capsules combine their predictions to form the output of the high-level capsules.
There is a coupling effect: when some low-level capsules agree on the presence of a high-level entity, the high-level capsule corresponding to that entity sends feedback to those low-level capsules, which increases their bet on that high-level capsule. The contributions of disagreeing capsules are suppressed, and this eliminates noise.

Dynamic Routing Algorithm:

Given: prediction vectors û_j|i, number of routing iterations r

for all capsule i in layer l and capsule j in layer (l+1): b_ij = 0
for r iterations do:

  1. for all capsules i in layer l: c_i = softmax(b_i) (these are the coupling coefficients)

  2. for all capsules j in layer (l+1): s_j = Σ_i c_ij û_j|i (the output vector is the weighted sum of prediction vectors)

  3. for all capsules j in layer (l+1): v_j = squash(s_j) (activation function)

  4. for all capsule i in layer l and capsule j in layer (l+1): b_ij = b_ij + û_j|i · v_j (agreement update)

return v_j
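The routing loop can be sketched in NumPy (a minimal illustration of the algorithm, not the paper's implementation; the shapes and the default of 3 iterations are assumptions):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, num_iterations=3):
    # u_hat: prediction vectors, shape (num_lower, num_upper, dim_upper)
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))  # routing logits b_ij
    for _ in range(num_iterations):
        # 1. coupling coefficients: softmax of b_i over the upper capsules j
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c = c / c.sum(axis=1, keepdims=True)
        # 2. s_j: weighted sum of prediction vectors
        s = np.einsum('ij,ijd->jd', c, u_hat)
        # 3. v_j: squash the raw output
        v = squash(s)
        # 4. agreement update: b_ij += u_hat_j|i . v_j
        b = b + np.einsum('ijd,jd->ij', u_hat, v)
    return v

v = dynamic_routing(np.random.RandomState(0).randn(8, 10, 16) * 0.1)
```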

The last step in the loop is where the routing actually happens: the dot product û_j|i · v_j measures the agreement between a low-level capsule’s prediction and a high-level capsule’s output, and increases the routing weight b_ij accordingly. This, in fact, is how CapsNet works.

Loss Function

The loss function used for capsule networks is a linear combination of a margin loss and a reconstruction loss (the latter scaled down so it does not dominate training). For each digit capsule k, the margin loss is L_k = T_k max(0, m⁺ − ‖v_k‖)² + λ (1 − T_k) max(0, ‖v_k‖ − m⁻)², where T_k = 1 if class k is present. The first term is active for the correct class and the second for incorrect classes; the paper uses m⁺ = 0.9, m⁻ = 0.1, and λ = 0.5.
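A minimal NumPy sketch of the margin loss with the paper's hyperparameters (m⁺ = 0.9, m⁻ = 0.1, λ = 0.5); the reconstruction term is omitted here:

```python
import numpy as np

def margin_loss(v_norms, labels, m_plus=0.9, m_minus=0.1, lam=0.5):
    # v_norms: (batch, num_classes) lengths of the digit-capsule vectors
    # labels:  (batch, num_classes) one-hot targets (T_k in the paper)
    present = labels * np.maximum(0.0, m_plus - v_norms) ** 2
    absent = lam * (1.0 - labels) * np.maximum(0.0, v_norms - m_minus) ** 2
    return np.sum(present + absent, axis=1).mean()

# A confident correct prediction incurs no loss; a wrong one is penalized.
good = margin_loss(np.array([[0.95, 0.05]]), np.array([[1.0, 0.0]]))
bad = margin_loss(np.array([[0.00, 1.00]]), np.array([[1.0, 0.0]]))
```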

Why Capsule Networks?

The idea of CapsNet was introduced to overcome problems and difficulties that occur in convolutional neural networks. A CNN, being a traditional neural network, is made up of individual neurons; capsule networks enhance it by imposing structure and grouping neurons into capsules. There are two issues with CNNs: max pooling and translation invariance.

Max Pooling :- A CNN detects the occurrence of attributes and features in an image and, based on these features, predicts whether an object exists. But the issue is that a CNN only considers whether the features are present, irrespective of their relative positions. This problem is caused by max pooling. Capsule networks solve it by capturing relationships that take both the features and their pose into account.

Translation Invariance :- The max-pooling problem leads to translation invariance. That is, even if an object is rotated clockwise or anticlockwise by some number of degrees, a CNN will still identify it as the same object without registering how it is posed.
CapsNet solves this problem through equivariance: the capsule outputs change predictably with the rotation and proportion of the input, so the pose information is preserved rather than discarded.

Advantages and Disadvantages of CapsNet

A capsule network recognizes pictures and objects efficiently regardless of the point of view. It also uses capsules to group neurons, thereby reducing the number of connections required compared with a CNN, while preserving the position and pose information of an object.
Finally, a CapsNet does not throw away information between layers the way pooling in a CNN does, which is why a CapsNet model can be preferred over a CNN.

CapsNet is a new concept and has not yet matured in the field of neural networks. As a result, it still has shortcomings, such as the time required for routing and inadequate testing on larger, more complex datasets.

To conclude, CapsNet has shown a lot of promise, and I believe that over time it is bound to become more efficient and a staple classification network.
