Pushing the limits of Machine Learning with Capsule Networks

Roger Fong
Published in Picterra
Jan 13, 2021

In our previous post about pushing industry limits we talked about SAR data. We now turn our attention to the domain of machine learning, and in particular to a family of deep learning networks called Capsule Networks.

What are Capsule Networks?

Capsule networks first hit the limelight in 2017 as one of Geoffrey Hinton’s (considered to be one of the “fathers” of deep learning) groundbreaking innovations. Normally, and indeed in other blog posts, we’ve often said that the specific architecture you use is not a huge factor in the performance of your model; it might shift results by a few percentage points, but nothing that would be very significant in a real-world application. To clarify, however: this mostly refers to architectures within the family of convolutional neural networks (CNNs), which is basically the only type of network currently used for imagery-based tasks (YOLO, Mask R-CNN, U-Net, all of the famous architectures you’ve heard of, are CNNs). Capsule Networks are very different from the CNN family of networks.

Long story short, capsule networks have the ability to encode hierarchical information. The easiest example is identifying what is and isn’t a face. We know that the eyes are on top, the nose a bit below in the middle, and the mouth below that. Now take that face and swap the positions of an eye and the nose, or rotate the nose 90 degrees, and so on. With a CNN it is very possible that the model will still think it is a face. It sees that all the right elements to make up a face are there, but what it misses is the relationship between these elements: where each part of the face sits and how the parts are oriented relative to each other. Taking it one step further, the face is part of a head, which in turn has relationships with larger body parts like the legs, the torso, and the arms. Now we’re starting to build a hierarchy of relationships. This is exactly what capsule networks are (in theory) able to capture, and at a conceptual level it is a major step in the right direction, since these kinds of hierarchical relationships are exactly how we as humans reason about what is and isn’t a specific type of object.

An image from this article, which provides a more in-depth look at capsule networks. A CNN could easily think that the re-organised “face” on the right is a proper face.
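To make the part-whole idea a bit more concrete, here is a minimal NumPy sketch of the “routing-by-agreement” mechanism from Hinton’s original capsule paper. This is our own illustrative simplification (the function names, shapes, and iteration count are assumptions, not code from SegCaps or any production system): lower-level capsules (“parts”, like a nose or an eye) each emit a pose vector predicting each higher-level capsule (“wholes”, like a face), and predictions that agree with one another get routed more strongly.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash non-linearity: shrinks short vectors towards 0 and long vectors
    towards length 1, so a capsule's length can be read as an existence
    probability while its orientation encodes the entity's pose."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement between two capsule layers.

    u_hat: (num_in, num_out, dim_out) prediction vectors, i.e. what each
           lower-level capsule (a "part") predicts for each higher-level
           capsule (a "whole").
    Returns: (num_out, dim_out) output capsule vectors.
    """
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))  # routing logits, start uniform
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over output capsules
        s = np.einsum('ij,ijd->jd', c, u_hat)                 # weighted sum of part predictions
        v = squash(s)                                          # candidate "whole" capsules
        b = b + np.einsum('ijd,jd->ij', u_hat, v)             # parts that agree get routed more
    return v

# Toy usage: 32 part capsules voting for 10 whole capsules with 16-dim poses.
rng = np.random.default_rng(0)
parts = rng.normal(size=(32, 10, 16))
wholes = dynamic_routing(parts)
print(wholes.shape, np.linalg.norm(wholes, axis=-1))  # lengths land in [0, 1)
```

The key difference from a CNN’s max-pooling is visible in the last line of the loop: instead of throwing away pose information, the routing step rewards part predictions that are consistent with the emerging whole, which is how the eye/nose/mouth arrangement gets enforced.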

So, let’s ship it?

It all sounds quite sexy on paper, but in reality this technology is still in its infancy. It’s rather slow, hard to train properly, hard to get to converge, and so on. In fact, in the existing literature it has only really been tested on simple imagery like pictures of digits and on biomedical imagery (from microscopes). Running capsule networks on natural imagery remains an open challenge that seems difficult to tackle. Much more technical and mathematical progress needs to be made in academia on this type of network to make it truly viable and impactful in real-life applications. It’s also important for those of us on the industry side of things to keep testing these advancements on real-life applications, to measure and stay aware of that progress.

This is just one of many experiments we’ve been playing with here at Picterra: we know it’s important not only to take advantage of existing mature technologies, but also to push the limits of the most recent advances, whether or not they are ready for production. Below are some tests we’ve run on detecting buildings, comparing a traditional CNN architecture (U-Net) to a capsule-network-based segmentation model, SegCaps.

Detections using a CNN-based architecture (U-Net), trained on a small low-shot dataset (fewer than 20 annotations)
Some honest-to-god, non-cherry-picked results using SegCaps, trained on the same dataset as above

As you can see, the results are comparable, but nothing to write home about. Nevertheless, capsule networks are so different from the standard CNNs we’re used to that, from a technical perspective, it’s fascinating to see them produce anything even remotely reasonable at all!
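If you want to put a number on “comparable”, a standard way is per-image Intersection-over-Union (IoU) between each model’s predicted mask and the ground truth. The sketch below is purely illustrative; the masks are random stand-ins, not our actual U-Net or SegCaps outputs:

```python
import numpy as np

def iou(pred, target, eps=1e-8):
    """Intersection-over-Union between two binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)

# Hypothetical masks: a ground truth plus outputs from two models.
# In practice these would be thresholded probability maps from each network.
rng = np.random.default_rng(42)
gt = rng.random((256, 256)) > 0.7
unet_pred = np.logical_and(gt, rng.random((256, 256)) > 0.1)   # stand-in prediction
segcaps_pred = np.logical_and(gt, rng.random((256, 256)) > 0.2)

print(f"U-Net IoU:   {iou(unet_pred, gt):.3f}")
print(f"SegCaps IoU: {iou(segcaps_pred, gt):.3f}")
```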

We also tried the exact same capsule-network-based model that we used for the images above on the much larger SpaceNet 2 dataset, which has many thousands of annotations. However, it failed to converge and did not produce any output. This is a clear sign that more academic work is needed to adapt this technology for use on real-life imagery. Until then, we’ll keep an eye on progress in this field and keep on pushing!

That’s all from us for now; we’ll be sure to keep you updated on our latest advances. As a sneak peek, keep an eye out for our upcoming features, workflows and blocks, which will let our users harness much more powerful and customisable functionality to suit the needs of their domain.

Picterra out!
