Available Now: Open-Source Implementation of Hinton’s “Matrix Capsules with EM Routing”

Ashley Gritzman
Published in Analytics Vidhya · Aug 26, 2019


Geoffrey Hinton has been talking about “capsule networks” for a long time (and when he talks, we listen), so when his team published two papers on this topic towards the end of 2017, it naturally created quite a stir in the machine learning community.

The first paper titled “Dynamic Routing Between Capsules” presented a capsule network architecture featuring dynamic routing-by-agreement that reached state-of-the-art performance on MNIST, and beat CNNs at recognising overlapping digits. Thankfully 🙏, the code for this paper was made available on GitHub. The second paper titled “Matrix Capsules with EM Routing” addressed some of the deficiencies of the first, and reduced the test error on smallNORB by 45% compared to state-of-the-art. Unfortunately this time the research community was not so lucky, and no code was made available (Booo… 👎). This left interested researchers with the frustrating task of trying to implement the paper and reproduce the benchmarks on their own.

While writing our own implementation, we noticed several common mistakes in other open source implementations that we came across. In this post we share some of these learnings, specifically focusing on three implementation pitfalls and how to avoid them. While our implementation is a considerable improvement over currently available implementations, it still falls slightly short of the performance reported by Hinton et al. But we happily make the full source code for this implementation available on GitHub (Applause 👏👏👏).

Caution 🚧: we are about to dive head-first into the details, so if you are not familiar with Hinton’s paper then we recommend starting with Jonathan Hui’s blog, then reading the paper itself, before venturing down below👇. If you are an eager beaver and itching to see the code, then feel free to skip the details below and head over to the GitHub repo.

Pitfall 1: What’s wrong with an only-child?

AKA: The problem of parent capsules having only one child capsule

The EM routing algorithm is a form of cluster finding which iteratively adjusts the assignment probabilities between child capsules and parent capsules.

Figure: When EM routing starts out, all child capsules are shared evenly between parent capsules. But as routing proceeds, parent capsules compete for the children until they have sole custody.

At the start of EM routing, the output of each child capsule is evenly shared between all of the parent capsules. As EM routing proceeds, the affinity of parent capsules for particular child capsules increases, and eventually parent capsules may have “sole custody” of child capsules. This does not cause a problem as long as a parent capsule has several children, but the “only child” scenario, where a parent capsule receives input from only one child capsule, does cause a problem. This situation is similar to a clustering scenario in which a cluster has only one data point.

Since EM routing fits a Gaussian distribution to each of the parent capsules, we need to calculate the mean and variance of each parent capsule. In the only-child scenario, the parent capsule has only one child and so the variance is 0. This causes numerical instability when calculating the activation cost, which contains a log(σ) term that is undefined at σ=0. Also, if the variance is 0, then the probability density is infinite at the mean and zero everywhere else, which again causes numerical issues.

The prevalence of the only child problem depends on the ratio of child capsules to parent capsules. If the ratio is high (e.g. 100:1) meaning many child capsules feeding fewer parent capsules, then the problem only occurs at higher routing iterations. Whereas, if the ratio is low and approaches 1:1, or even lower (i.e. more parent capsules than child capsules), then the only child problem starts occurring at a lower number of routing iterations.

Once we understand this problem, it is actually very simple to address 🥳. We just impose a lower bound on the variance by adding ε=10⁻⁴, and we are good to go.
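To make this concrete, here is a minimal NumPy sketch of the variance calculation with the lower bound applied (this is an illustration, not the code from our repo; the names and tensor shapes are simplified):

```python
import numpy as np

EPSILON = 1e-4  # lower bound added to the variance

def parent_mean_and_variance(votes, rr):
    """Weighted mean and variance of one parent capsule's votes.

    votes: (n_children, pose_dim) votes v_ij from the children of this parent
    rr:    (n_children,) assignment probabilities R_ij for this parent
    """
    rr = rr[:, None]
    rr_sum = rr.sum() + EPSILON                           # guard the division too
    mu = (rr * votes).sum(axis=0) / rr_sum                # weighted mean
    var = (rr * (votes - mu) ** 2).sum(axis=0) / rr_sum   # weighted variance
    return mu, var + EPSILON                              # keeps log(sigma) finite

# Only-child scenario: one active child means the variance would otherwise be exactly 0
mu, var = parent_mean_and_variance(np.array([[0.3, -1.2, 0.7]]), np.array([1.0]))
print(np.log(np.sqrt(var)))  # finite thanks to epsilon, instead of -inf
```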

Pitfall 2: Being fair to the parents

AKA: Normalising the amount of routing data assigned to parent capsules

When determining the cost of activating a parent capsule, we sum the assignment probabilities R_ij over all child capsules i, giving the total amount of routing data assigned to parent capsule j. This total scales the cost, and the (negated, scaled) cost is then put through the logistic function: if the cost is high then the parent capsule will not be activated, but if the cost is low then the parent capsule will be activated.
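For reference, here is the cost and activation from the M-step of the paper, restated in the paper’s notation (β_u and β_a are learned parameters, λ is the inverse temperature, and h indexes the pose dimensions); this is where the sum of routing data enters as a scaling factor:

```latex
\mathrm{cost}^h_j = \left(\beta_u + \log \sigma^h_j\right) \sum_i R_{ij},
\qquad
a_j = \mathrm{logistic}\!\left(\lambda\left(\beta_a - \sum_h \mathrm{cost}^h_j\right)\right)
```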

Let’s ponder the following question: do all parent capsules have an equal chance of being activated? 🤔

A good place to start is to try to figure out whether all the parent capsules expect to receive the same amount of routing data. To investigate this further, we will consider the “smaller” capsule network configuration from the paper (A=64, B=8, C=D=16).

The table below shows a summary of the layers in the smaller capsule network. We use the following terminology: K is the kernel size, S is the stride, Ch is the number of channels in a regular convolution, I is the number of input capsule types, O is the number of output capsule types, W & H are the spatial width and height.

Table: Layers in the “smaller” capsule network architecture. Notice how parent capsules in the class_caps layer receive more routing data than parent capsules in conv_caps1 or conv_caps2.

To get an idea of how much routing data R_ij a parent capsule in a particular layer can expect, we calculate the mean_data for the layer, which is just the total number of child capsules divided by the total number of parent capsules. (P.S. we assume here that all the child capsules are active, which is unlikely, but we roll with it as the conclusion is still valid.)

For the final class_caps layer, the spatial dimensions of the child tensor are flattened, so child capsules are fully connected to the class capsules.

Having a look at the table, we see that the mean_data of the class_caps layer is 80.0, which is roughly 30× larger than the two conv_caps layers. So parent capsules in the class_caps layer receive much more routing data than the other layers.
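As a quick sanity check of the calculation, here is a tiny helper (the function and the 5×5 spatial size of the child layer are assumptions chosen for illustration so that the numbers line up with the 80.0 quoted above, not values read from our network definition):

```python
def mean_data(child_w, child_h, child_types, parent_w, parent_h, parent_types):
    """Expected routing data per parent: total child capsules / total parent capsules."""
    return (child_w * child_h * child_types) / (parent_w * parent_h * parent_types)

# class_caps: the child grid is flattened and fully connected to the 5 class capsules
print(mean_data(child_w=5, child_h=5, child_types=16,
                parent_w=1, parent_h=1, parent_types=5))  # -> 80.0
```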

So why exactly is this a problem?

Well, looking back at the equation for the cost, we see that the total routing data Σᵢ R_ij acts as a scaling factor. If the routing data is too small, then the cost will be close to zero for all the parent capsules; but if the routing data is too large, then the cost will vary wildly and we will be operating in the saturation regions of the logistic function (where the gradient is ≈ zero 😱). What we actually want is for the input to the logistic function to be nicely distributed over a sensible range, say [-5, 5]. But now we are stuck: if we are in a sensible range for the class_caps layer, then the conv_caps layers with 30× less routing data will be a problem, and vice versa.

So how do we fix this then?

Again the fix is pretty simple once we have understood the problem 🎉. We just normalise the total routing data Σᵢ R_ij by the mean_data for that layer, which actually improves both the speed of training and the accuracy.
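In code the fix is a single extra division before the cost is computed. A minimal sketch with made-up numbers (the names are illustrative, not taken from the repo):

```python
def normalise_routing_data(rr_sum, n_child_caps, n_parent_caps):
    """Scale the total routing data per parent by the layer's mean_data."""
    mean_data = n_child_caps / n_parent_caps
    return rr_sum / mean_data

# A class_caps-like layer and a conv_caps-like layer now feed the cost
# calculation on a comparable scale (both values come out around 1.0):
print(normalise_routing_data(80.0, n_child_caps=400, n_parent_caps=5))   # 1.0
print(normalise_routing_data(2.5, n_child_caps=500, n_parent_caps=200))  # 1.0
```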

Pitfall 3: Fighting over the children

AKA: Parent capsules at different positions competing for child capsules

Let’s consider the case of a 1D convolutional capsule layer, with a kernel size of 3 and a stride of 1, and where both child and parent layers contain only 1 capsule type. Simple enough.

Figure: Each parent capsule receives votes from 3 child capsules, but the number of parent capsules competing for each child differs depending on the spatial position.

In the M-step, the kernel slides over the child capsules so each parent receives votes from 3 child capsules. In the E-step, child capsules at the edges receive feedback from only 1 parent capsule each, while child capsule C₃ at the center receives feedback from 3 parent capsules (the maximum is equal to the kernel size K).

When we look at this simple case, it is clear that child capsules receive feedback from parent capsules at different spatial positions, and therefore these parent capsules must compete for the vote of the child capsule. The competition happens in the update of the assignment probabilities of the E-step, where we normalise across all parent capsules competing for a particular child capsule. This point was further clarified by Hinton & Co. in response to a question on OpenReview.net.

But when we had a look at several implementations on GitHub, we found that incorrect normalisation in the E-step is a common mistake. In particular, they normalise only across parent capsule types, and not across parent capsule positions. This prevents parent capsules at different positions from competing for child capsules. The correct method is to normalise across all parent capsules that receive input from a particular child capsule, which will include normalising across parent capsule types and parent capsule positions.
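Spelled out, the E-step update looks like this (paper notation, where I write Ω(i) for the set of all parent capsules, across every type and every spatial position, that receive input from child capsule i):

```latex
R_{ij} \leftarrow \frac{a_j \, p_{ij}}{\sum_{j' \in \Omega(i)} a_{j'} \, p_{ij'}}
```

Restricting the denominator to the parent capsule types at a single position is exactly the mistake described above.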

Alas 😞, this time the solution is not so easy to implement, so stick with me here, we’ll go slowly…

The first step is to look at the kernel size and stride to figure out which child capsules fall within the receptive field of each parent capsule. We then tile these child capsules accordingly so that we can multiply by the K transformation matrices to give us the “votes”. The trick here is that we need to keep track of where each of these child capsules came from (we will need this later), so we store the mapping between child capsules and parent capsules in a 2D binary matrix called the spatial routing map.
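Here is a toy construction of such a map for the 1D example above (kernel 3, stride 1, one capsule type, 5 child positions; the function is illustrative, not the tensor code from the repo). Note how the column sums recover the competition counts described earlier: edge children are claimed by 1 parent, the centre child by 3.

```python
import numpy as np

def spatial_routing_map_1d(n_children, kernel, stride):
    """Binary matrix of shape (n_parents, n_children), where entry [j, i] = 1
    if child capsule i falls inside the receptive field of parent capsule j."""
    n_parents = (n_children - kernel) // stride + 1
    routing_map = np.zeros((n_parents, n_children), dtype=np.int32)
    for j in range(n_parents):
        routing_map[j, j * stride : j * stride + kernel] = 1
    return routing_map

rmap = spatial_routing_map_1d(n_children=5, kernel=3, stride=1)
print(rmap)
# [[1 1 1 0 0]
#  [0 1 1 1 0]
#  [0 0 1 1 1]]
print(rmap.sum(axis=0))  # [1 2 3 2 1] -> parents competing for each child
```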

The votes are then scaled by the corresponding assignment probabilities R_ij, and used in the M-step to compute the mean μ, standard deviation σ, and activation a of each parent capsule.

Code: Our implementation of the M-step, where we keep track of the connections between child capsules and parent capsules in the spatial routing map.
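The original post embeds a screenshot of the TensorFlow code at this point. Since that image doesn’t carry over to text, here is a heavily simplified NumPy sketch of the same M-step logic for a single capsule type (names and shapes are illustrative, and the scaling of R_ij by the child activations is left out for brevity):

```python
import numpy as np

EPSILON = 1e-4

def m_step(rr, votes, beta_u, beta_a, inv_temp, mean_data):
    """Simplified M-step for one convolutional capsule layer (single capsule type).

    rr:    (n_parents, kernel) assignment probabilities R_ij of each parent's children
    votes: (n_parents, kernel, pose_dim) votes from the children in each receptive field
    """
    rr = rr[..., None]                                   # broadcast over pose dimensions
    rr_sum = rr.sum(axis=1) + EPSILON                    # total routing data per parent
    mu = (rr * votes).sum(axis=1) / rr_sum               # weighted mean of the votes
    var = (rr * (votes - mu[:, None, :]) ** 2).sum(axis=1) / rr_sum + EPSILON  # Pitfall 1 fix
    cost = (beta_u + 0.5 * np.log(var)) * (rr_sum / mean_data)                 # Pitfall 2 fix
    act = 1.0 / (1.0 + np.exp(-inv_temp * (beta_a - cost.sum(axis=-1))))       # activations a_j
    return mu, np.sqrt(var), act
```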

Moving on to the E-step.

We look at each vote that contributes to a parent capsule, and compute how much it differs from the mean μ of that parent capsule. This gives us the probability density p_ij of vote v_ij.
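The density is the standard Gaussian from the paper, evaluated per pose dimension h and multiplied over the H dimensions:

```latex
p^h_{ij} = \frac{1}{\sqrt{2\pi \left(\sigma^h_j\right)^2}}
\exp\!\left(-\frac{\left(v^h_{ij} - \mu^h_j\right)^2}{2\left(\sigma^h_j\right)^2}\right),
\qquad
p_{ij} = \prod_{h=1}^{H} p^h_{ij}
```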

Now time for the big competition… 🏆 We need to reorganise things so that we can see which parent capsules are competing for each of the child capsules. To do this we need to remember where each of the probabilities p_ij came from. Ahhh, good thing we kept track of exactly this in the spatial routing map; wipe off the dust, it’s time to put it to work. We use the spatial routing map to align all of the probabilities for a particular child capsule in one column. This creates a sparse representation of the probability densities, which we then multiply by the activations of the parent capsules a_j. All that remains is to normalise over all parent capsules competing for each child capsule, which is just dividing by the column totals, and we are done! 🎉

Well, actually we are only done with the first iteration. We just update the assignment probabilities R_ij, and then rinse and repeat for the next, and the next, and the next iteration.

Code: Our implementation of the E-step, where we use the spatial routing map to align the competing parent capsules for a particular child capsule in one column.
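Again, the original post shows a screenshot of the TensorFlow code here. A simplified NumPy sketch of the same idea for a single capsule type (illustrative names, reusing the routing map from the earlier toy example):

```python
import numpy as np

def e_step(p, act, routing_map):
    """Simplified E-step: normalise over ALL parents competing for each child.

    p:           (n_parents, kernel) densities p_ij of the children in each receptive field
    act:         (n_parents,) parent activations a_j from the M-step
    routing_map: (n_parents, n_children) binary spatial routing map
    """
    mask = routing_map.astype(bool)
    n_parents, n_children = routing_map.shape
    # Scatter the dense (parent, kernel) densities into sparse (parent, child) columns
    sparse = np.zeros((n_parents, n_children))
    sparse[mask] = (act[:, None] * p).ravel()
    # Each column holds every parent competing for that child: divide by the column total
    sparse /= sparse.sum(axis=0, keepdims=True) + 1e-12
    # Gather back to the dense (parent, kernel) layout used by the M-step
    return sparse[mask].reshape(n_parents, -1)

# Toy example: the 1D routing map from earlier (5 children, kernel 3, stride 1)
routing_map = np.array([[1, 1, 1, 0, 0],
                        [0, 1, 1, 1, 0],
                        [0, 0, 1, 1, 1]])
rr = e_step(p=np.random.rand(3, 3), act=np.random.rand(3), routing_map=routing_map)
print(rr.shape)  # (3, 3): ready for the next M-step
```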

Results

AKA: Results

The graph below shows the test accuracy of our implementation after each training epoch for 1–3 iterations of EM routing. We achieve our best accuracy of 95.4% with 2 routing iterations, and with 3 iterations we get 93.7%. The table shows how our results stack up against other open source implementations available on GitHub: yl-1993, www0wwwjs1, Officium (as recorded on 28 May 2019). Our accuracy of 95.4% is a 3.6 percentage point improvement on the previous best open source implementation at 91.8%, however it is still a bit below the accuracy of Hinton & Co. at 97.8%. To our knowledge, our implementation is currently the best open source implementation available.

Acknowledgements

  1. Jonathan Hui’s blog, “Understanding Matrix capsules with EM Routing (Based on Hinton’s Capsule Networks)”
  2. Questions and answers on OpenReview, “Matrix capsules with EM routing”
  3. Suofei Zhang’s implementation on GitHub, “Matrix-Capsules-EM-Tensorflow”
  4. Guang Yang’s implementation on GitHub, “CapsulesEM”

Links to Code & Paper

Click here for the GitHub repo containing the code, and here for the arXiv paper.

We welcome contributions, so if you have ideas on how to improve the accuracy even further, please submit a pull request on GitHub.

How to cite this work

If you find this post or the code useful in your academic work, please cite as follows:

A. Gritzman, “Avoiding Implementation Pitfalls of Matrix Capsules with EM Routing by Hinton et al.”, in Joint Workshop on Human Brain and Artificial Intelligence (HBAI) at IJCAI’19, Macao, 2019.

If you found this post helpful, please go ahead and give us some claps so this article will rank higher and others will see it 👏👏👏. Don’t be shy, you can hit that clap button to your heart’s content (well actually I think the max is 50 claps, but we’ll worry about that when you get there 😜).
