Capsule Networks

A recent paper co-authored by Geoffrey Hinton has received a lot of media coverage because of the promising advance it describes for neural networks.

The advance is called “Capsule Networks”, and its latest implementation is described in the paper “Dynamic Routing Between Capsules”.

To be clear, capsule networks were first introduced by Hinton in 2011, but they remained somewhat dormant because he had a hard time making them work.

The recent research paper from October 2017 seems to be paving the way for a future use of this technology.

So what is a Capsule Network?

(For a more detailed explanation, check Aurélien Géron’s great video).

In computer graphics, we start with an abstract representation of an object, through its “instantiation parameters” and then apply a rendering function in order to obtain an image.

Think of the “instantiation parameters” as “pose parameters”, for example. In this case, the object could be represented by its x, y location and an angle value.

Inverse graphics, on the other hand, goes precisely the opposite way. We start from an image and, through inverse rendering, find which objects it contains and what their instantiation parameters are.
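To make the two directions concrete, here is a toy sketch in Python with NumPy. Everything in it is my own illustration and not taken from the paper: the hypothetical render function draws a short stroke from three pose parameters (x, y, angle), and the hypothetical inverse_render function recovers them by brute-force search. A capsule network does not search like this; it learns to predict the parameters directly.

```python
import numpy as np

def render(x, y, angle_deg, size=16, length=5):
    """Toy rendering function: draw a short stroke centred at (x, y)."""
    image = np.zeros((size, size))
    theta = np.deg2rad(angle_deg)
    # Walk along the stroke and switch on every pixel it touches.
    for t in np.linspace(-length / 2, length / 2, 4 * length):
        col = int(round(x + t * np.cos(theta)))
        row = int(round(y + t * np.sin(theta)))
        if 0 <= row < size and 0 <= col < size:
            image[row, col] = 1.0
    return image

def inverse_render(image, angles=(0, 45, 90, 135)):
    """Toy inverse rendering: try every pose and keep the one that best reproduces the image."""
    size = image.shape[0]
    best_pose, best_error = None, np.inf
    for x in range(size):
        for y in range(size):
            for angle in angles:
                error = np.abs(render(x, y, angle, size) - image).sum()
                if error < best_error:
                    best_pose, best_error = (x, y, angle), error
    return best_pose

image = render(x=6, y=9, angle_deg=45)   # graphics: parameters -> image
print(inverse_render(image))             # inverse graphics: image -> parameters
```

The search only works here because the set of candidate poses is tiny; it is meant to show the direction of the mapping, nothing more.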

A Capsule Network is a neural network that performs inverse graphics.

This network is composed of many capsules, and each of them is a function that aims to predict the presence and the instantiation parameters of an object at a given location within an image.

The arrows in the representation below represent the output vector of a capsule.

The length of the activation vector is the estimated probability that the object is indeed present within the image.

The orientation of the vector encodes the object’s estimated pose (instantiation) parameters.

To obtain this “representation” of the instantiation parameters, we apply several convolutional layers to the image, which output an array containing a number of feature maps.

Then we need to reshape this array to get a set of vectors for each location.
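A minimal sketch of that reshape step, using NumPy. The sizes (a 6 x 6 grid, 256 feature maps, 8-dimensional vectors) are assumptions picked for illustration, and the feature maps are random numbers standing in for the real output of the convolutional layers.

```python
import numpy as np

# Stand-in for the convolutional output: 6 x 6 locations, 256 feature maps (assumed sizes).
feature_maps = np.random.default_rng(0).normal(size=(6, 6, 256))

capsule_dim = 8                                  # assumed length of each capsule vector
capsules_per_location = 256 // capsule_dim       # 32 vectors at every location

# Group the 256 values at each location into 32 vectors of 8 values each.
vectors = feature_maps.reshape(6, 6, capsules_per_location, capsule_dim)
print(vectors.shape)  # (6, 6, 32, 8)
```

Nothing is computed here: the same numbers are simply regrouped so that each location holds a set of vectors instead of a flat list of activations.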

Finally, we “squash” the vectors to ensure that none is longer than 1 (since the length of a vector represents a probability, it cannot exceed 100%). “Squashing” keeps each vector’s direction and brings its length to a value between 0 and 1.
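Here is a sketch of the squashing function as given in the “Dynamic Routing Between Capsules” paper: the vector s is scaled by ||s||^2 / (1 + ||s||^2) along its own direction. The small eps term is my addition to avoid dividing by zero, and the example vectors are made up.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink vectors to a length below 1 while keeping their direction."""
    squared_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    norm = np.sqrt(squared_norm + eps)           # eps is an assumption, for numerical safety
    return (squared_norm / (1.0 + squared_norm)) * (s / norm)

# Three made-up vectors: medium, very short, and very long.
vectors = np.array([[3.0, 4.0], [0.1, 0.0], [30.0, 40.0]])
squashed = squash(vectors)
lengths = np.linalg.norm(squashed, axis=-1)
print(lengths)
```

Short vectors end up with a length close to 0 and long vectors with a length close to 1, which is what lets the length be read as a probability.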

One of the main advantages of a capsule network is that it preserves an object’s location and pose within an image: when the object moves, the capsule’s output vector changes accordingly. This property is called “equivariance”.

A convolutional neural network, on the other hand, loses part of this location information because its pooling layers keep only the strongest activations.
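A small NumPy sketch shows how pooling discards position. The hypothetical max_pool_2x2 helper and the two 4 x 4 inputs are made up for illustration: the active pixel sits in a different place in each input, yet the pooled outputs are identical.

```python
import numpy as np

def max_pool_2x2(image):
    """2x2 max pooling with stride 2 (height and width must be even)."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4))
a[0, 0] = 1.0        # feature in the top-left corner of the first 2x2 block

b = np.zeros((4, 4))
b[1, 1] = 1.0        # same feature, shifted by one pixel inside that block

print(np.array_equal(a, b))                              # the inputs differ
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # the pooled outputs do not
```

After pooling, the network can no longer tell which of the two positions the feature came from.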

Another important feature of the capsule network is its ability to analyze the hierarchy of objects and, in a way, the “interactions” among them within the image.

This characteristic is obtained because every capsule in the first layer tries to predict the output of every capsule in the next layer.

During training, the network learns a transformation matrix for each pair of capsules in the first and second layers, which allows the “discovery” of the relationships between objects.
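The prediction step can be sketched in a few lines of NumPy: each first-layer capsule output u_i is multiplied by a matrix W_ij to predict the output of second-layer capsule j. The capsule counts and vector sizes are assumptions for illustration, and the matrices are random here, whereas a real network would learn them during training.

```python
import numpy as np

rng = np.random.default_rng(0)
num_lower, num_upper = 6, 3      # assumed number of capsules in each layer
dim_lower, dim_upper = 8, 16     # assumed vector sizes in each layer

# Outputs of the first-layer capsules (random stand-ins).
u = rng.normal(size=(num_lower, dim_lower))

# One transformation matrix per (first-layer, second-layer) pair.
# Learned in a real network; random here.
W = rng.normal(size=(num_lower, num_upper, dim_upper, dim_lower))

# u_hat[i, j] = W[i, j] @ u[i]: capsule i's prediction for capsule j.
u_hat = np.einsum("ijkl,il->ijk", W, u)
print(u_hat.shape)  # (6, 3, 16)
```

Each first-layer capsule therefore produces one prediction vector per second-layer capsule, and the matrices are where the part-to-whole relationships end up being stored.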

As I mentioned at the beginning of the post, for a more detailed explanation I refer you to Aurélien Géron’s video, which gives an incredible overview of the subject. (Geoffrey Hinton himself even praised it!)

This blog post was inspired by Aurélien Géron’s video.