Dynamic Routing Between Capsules

Kevin Shen
Mini Distill
Jun 18, 2018

Capsules are one of the recent works to come out of Geoff Hinton’s group, which is why the idea is so hyped (I think an independent implementation came out the day after the paper went on arXiv). The goal of capsules is to add equivariance (to rotation, scale, color, etc.) to deep learning models. Before we begin our discussion, a distinction needs to be made between a model that is invariant and a model that is equivariant under transformations of the input. Mathematically, a model f is invariant to a function g if:

f(g(I)) = f(I)

Here I is the input, perhaps an image. The output of the model f does not change if I is first perturbed by g.

A model f is equivariant to a function g if:

f(g(I)) = g´(f(I))

Here g´ is another function. In plain words, perturbing I by g before applying the model f leads to a predictable change g´ in the output of f.

Examples of invariances we might care about in a machine learning application: translation, rotation, scale, illumination.

In practice, invariance means your model is robust to a perturbation of the input: the output does not change despite the change in the input. Equivariance, on the other hand, means your model accounts for the perturbation, so that a change in the input leads to a corresponding change in the output. Convolutional neural networks are invariant to translational perturbations: we can still detect the cat if it has been shifted over by 2 pixels in the original image. However, CNNs are not equivariant, because given that we’ve detected a cat, there’s no way to know whether the input image was shifted by 2 pixels or not. As we will see, capsule networks aim to be equivariant instead of invariant to perturbations.
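To make the distinction concrete, here is a toy numpy sketch (a made-up max-based “detector,” not a real CNN): its maximum response is invariant to a shift, while the location of that maximum is equivariant.

```python
import numpy as np

# Toy "detector": the maximum response is an invariant read-out,
# while the location of that maximum is an equivariant read-out.
def detect(image):
    score = image.max()
    location = np.unravel_index(image.argmax(), image.shape)
    return score, location

image = np.zeros((8, 8))
image[2, 3] = 1.0                                    # a "cat" at position (2, 3)

shifted = np.roll(image, shift=(0, 2), axis=(0, 1))  # shift 2 pixels to the right

s1, loc1 = detect(image)
s2, loc2 = detect(shifted)

print(s1 == s2)    # True: the score does not change under the shift
print(loc1, loc2)  # (2, 3) vs (2, 5): the location shifts along with the input
```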

In his lectures on capsules, Hinton likes to mention “capturing the linear (input) manifold.” What he means by this is that most image transformations you can think of lie on a linear manifold in high dimensional input space. In other words, while images are high dimensional inputs (each pixel in an image equals one feature dimension), you can use a much simpler model to describe translation or rotation; specifically, you only need a linear model with a few parameters. Modeling a transformation shouldn’t be as hard as learning a separate CNN filter for each scale or each rotation.

Consider translation. For each location (x, y) in the original image, the new pixel value at that location under a translation (u, v) is I´(x, y) = I(x + u, y + v). The set of all translational variants of image I can be described by all the pixel values of I plus just two parameters, u and v. There is an asymmetry between how easy it is to describe all translational variants of I and how difficult it is to train a model f to correctly classify those variants. This is because the model has not captured the linear manifold of the image. Of course, CNNs were invented to solve this exact problem: CNNs do capture the linear input manifold of translation. However, other transformations such as rotation can arguably be described just as simply as translation (using a few parameters) but are difficult for CNNs to correctly classify. Capsules try to extend the idea of “capturing the linear manifold” beyond translation.
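A minimal numpy sketch of this asymmetry (using cyclic translation that wraps at the borders, for simplicity): a 4×4 image lives in 16-dimensional pixel space, yet the whole family of its translational variants is indexed by just two parameters (u, v).

```python
import numpy as np

def translate(I, u, v):
    # Cyclic translation: I'(x, y) = I(x + u, y + v), wrapping at the borders.
    return np.roll(I, shift=(-u, -v), axis=(0, 1))

I = np.arange(16.0).reshape(4, 4)

# Each variant is a point in 16-dimensional pixel space, but the family
# is fully described by I plus the two parameters (u, v).
family = {(u, v): translate(I, u, v) for u in range(4) for v in range(4)}
print(len(family))  # 16 variants from just two integer parameters
```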

While neural nets represent a neuron with a scalar value, the main idea of capsules is to represent a neuron with a vector value. Furthermore, each capsule should represent an object. The dimensions of the vector represent what Hinton calls the “instantiation parameters” of the object, which are essentially its aesthetic properties: pose, deformation, hue, texture, etc. The magnitude of the vector is related to whether the object is present in the image or not. The authors propose that the dimension of the vector increase with layer number, since higher layers represent higher level objects (and therefore require more capacity). This vector representation is the fundamental difference between capsules and regular neural networks and is what allows capsule networks to be equivariant. By computing only scalar neurons, NNs throw out information related to the transformation of the input image. In contrast, capsule networks maintain this information in the dimensions of the capsule.
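The paper realizes the “magnitude encodes presence” idea with a squashing nonlinearity (it reappears as equation 3 of the inner loop below). A small numpy sketch:

```python
import numpy as np

def squash(s):
    # Maps a raw capsule vector to a length in (0, 1) while keeping its
    # direction: length reads as P(object present), direction as pose.
    norm_sq = np.dot(s, s)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq)

weak   = squash(np.array([0.1, 0.0]))   # short vector: object probably absent
strong = squash(np.array([10.0, 0.0]))  # long vector: object almost surely present

print(np.linalg.norm(weak))    # ~0.01
print(np.linalg.norm(strong))  # ~0.99
```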

Now for the details of the model:

  • There is an assumed hierarchy of objects represented by capsules. At lower levels, capsules represent low level objects such as nose, eyes, mouth. A higher level capsule would represent a face.
  • If a higher level capsule turns on, then the evidence from the lower level capsules must be consistent. E.g. a horizontal mouth and a diagonal nose do not make a face. If the evidence is not consistent, the higher capsule is pruned as a hypothesis (for what exists in the input image). This is the essence of the algorithm.
High level idea of capsule networks. Each box is a neuron which should activate for a particular object in the image and is represented by a vector value. The top neuron activates for a face while the neurons in the previous layer activate for mouth, nose, and eyes respectively. Each vector value contains information about the state of the object. For example, the face has two parameters specifying the scale and rotation of the face. Note that these parameters are learned rather than predetermined.

The forward pass is described below for neuron i of layer L and neuron j of layer L+1. There is an outer loop, which trains the parameters of the model, and an inner loop, which computes the activations of the next layer given the activations of the previous layer. The purpose of the outer loop is to predict higher level objects based on lower level objects in the image. The purpose of the inner loop is to reconcile all the predictions and check them for consistency.

First let’s discuss the outer loop which resembles the forward pass of regular neural networks. Here’s the relevant notation:

  • u_i: output of the previous layer’s capsule i
  • W_ij: weights of the capsule network, which are learned by backpropagation
  • û_j|i: predicted vector for capsule j given the output of capsule i
  • v_j: output of capsule j

As we will see in a moment, there’s a difference between the predicted vector û_j|i for capsule j and its actual output vector v_j. This is because an inner loop must run to check that all predictions are consistent. In the outer loop, the parameters W_ij of the model are learned as they would be in a regular NN. The activations of higher level objects are predicted from the activations of lower level objects (e.g. W_ij tells you about the existence of a 45 degree tilted mouth given two 45 degree tilted parallel lines). û_j|i is capsule i’s prediction of the activation of capsule j. It’s computed as

û_j|i = W_ij u_i
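A quick numpy sketch of this prediction step, with made-up layer sizes (3 lower level capsules of dimension 8, 2 higher level capsules of dimension 16):

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, d_in = 3, 8      # e.g. 3 lower capsules (mouth, nose, eyes), 8-D each
n_out, d_out = 2, 16   # e.g. 2 higher capsules, 16-D each

u = rng.normal(size=(n_in, d_in))                # u_i: lower-layer outputs
W = rng.normal(size=(n_in, n_out, d_out, d_in))  # W_ij: learned by backprop

# û_j|i = W_ij u_i: capsule i's prediction for the pose of capsule j
u_hat = np.einsum('ijab,ib->ija', W, u)
print(u_hat.shape)  # (3, 2, 16): one prediction vector per (i, j) pair
```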

In the inner loop, we check to see if the predictions/proposals are consistent (a horizontal mouth and diagonal nose does not make a face) before finalizing the output v_j of capsule j.

First I’ll give an overview of the math before giving an intuitive explanation of the inner loop in terms of a democratic voting process. The notation for the inner loop:

  • c_ij: coupling coefficient
  • b_ij: unnormalized coupling coefficient
  • s_j: weighted instantiation of object j

The algorithm for the inner loop:

Input - û_j|i or predictions/proposals for the next layer based on the previous layer

Output - v_j or activations for the next layer

Repeat until convergence (with all b_ij initialized to 0):

(1) c_ij = exp(b_ij) / Σ_k exp(b_ik)

(2) s_j = Σ_i c_ij û_j|i

(3) v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · s_j / ‖s_j‖

(4) a_ij = û_j|i · v_j

(5) b_ij ← b_ij + a_ij

Now let’s try to explain the math using an analogy. Colloquially, each capsule in the lower layer gets a limited amount of voting power in a democratic process to decide which objects exist in the next layer. c_ij is the percentage of “votes” capsule i devotes to its proposal for capsule j. Naturally this needs to be normalized over all the other capsules k that i can contribute to in the next layer (equation 1). In equation 2, capsule j receives a ballot from each capsule in the previous layer specifying the likelihood of j existing as well as a proposed vector value for j, and a weighted average is taken over all proposals. We can think of this as a “community” activation proposal.

In equation 3, the community activation is passed through a regularized sigmoid-like function: x²/(1 + x²) has the general behavior of a sigmoid. When s_j is large, the output v_j has a length of approximately 1; when s_j is small, the length of v_j is much smaller than 1. In equation 4, we dot the community proposal against the proposal from each capsule of the previous layer, checking how well each lower level capsule’s proposal agrees with the community proposal. If there is good agreement between the community proposal for capsule j and the individual proposal of capsule i, then i devotes more of its voting power to capsule j in the next iteration of the inner loop (equation 5). By repeating this process, capsules that propose consistent vectors for the next layer come together and vote for the same capsules in the next layer.
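Putting the five equations together, here is a compact numpy sketch of the inner loop (the layer sizes and the random û_j|i are made up for illustration; the paper runs a fixed small number of iterations rather than iterating to convergence):

```python
import numpy as np

def squash(s, eps=1e-9):
    # Equation 3: shrink length into (0, 1) while preserving direction.
    norm_sq = np.sum(s * s, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def route(u_hat, n_iters=3):
    """Dynamic routing. u_hat has shape (n_in, n_out, d_out): û_j|i."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))  # unnormalized coupling coefficients
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # (1) votes of i
        s = np.einsum('ij,ijd->jd', c, u_hat)                 # (2) weighted sum
        v = squash(s)                                         # (3) squash
        a = np.einsum('ijd,jd->ij', u_hat, v)                 # (4) agreement
        b = b + a                                             # (5) update votes
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(3, 2, 16))  # 3 lower capsules predicting 2 higher ones
v = route(u_hat)
print(v.shape)                     # (2, 16)
print(np.linalg.norm(v, axis=1))   # each capsule's length lies in (0, 1)
```

Note that only W_ij is learned by backpropagation; the coupling coefficients c_ij are recomputed from scratch by this loop on every forward pass.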

We have just described the mechanism of a single capsule layer. A capsule network is constructed by stacking multiple capsule layers on top of some regular CNN layers. The final layer contains N capsules, where N is the number of classification classes. The magnitude of each capsule represents the probability of that class existing in the image, while the pose of the class is described by the vector value. Routing between two capsule layers requires

k² · c_L · c_{L+1} · v_L · v_{L+1}

weights, where v are the capsule dimensions, c is the number of channels, and k is the filter size. This is typically on the order of 10⁶.
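As a sanity check of the 10⁶ figure, we can count the routing weights between the paper’s PrimaryCaps layer (a 6×6 grid of 32 capsule types, each 8-dimensional) and its DigitCaps layer (10 capsules, each 16-dimensional); each (i, j) pair gets its own 8×16 matrix W_ij:

```python
# Weight count for routing in the paper's MNIST architecture.
n_primary = 6 * 6 * 32      # number of lower-level (primary) capsules
d_primary = 8               # their dimension (v_L)
n_digit, d_digit = 10, 16   # higher-level capsules and their dimension (v_{L+1})

n_weights = n_primary * n_digit * d_primary * d_digit
print(n_weights)  # 1474560, on the order of 10^6
```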

The authors showed moderate improvements over CNNs on classifying overlapping MNIST digits and on recognizing rotated 3D objects. For more results, refer to the paper. Another cool thing they tried is reconstructing the input image from the capsule parameters of the last layer. By doing so, they show that the capsule network is learning the instantiation parameters of the object.

As we have mentioned, while capsule networks have the capacity to learn pose equivariance, it’s not clear that they will in practice. What prevents the capsule network from using two separate capsules for the same image at different rotations? Perhaps the answer is regularization. Maybe a model with sufficient capacity and good regularization will capture the linear manifolds in the simplest way possible, but this is just speculation.
