Neural networks which see a little more like we do

A novel approach to object recognition and segmentation by Geoffrey Hinton, Sara Sabour and Nicholas Frosst.

Aaron Baw
The Startup
5 min read · Nov 30, 2019


I’m writing this to better understand an awesome idea I came across in the Computer Vision space.

For my Bachelor’s Dissertation I was fortunate enough to take a deep dive into the field’s state of the art in Object Recognition and Localisation. Papers like YOLO (and its later v2 and v3 iterations, the last of which is an entertaining read to say the least), as well as R-CNN, Fast R-CNN and Faster R-CNN, were some of the most talked about at the time.

YOLO Object Detection works by decomposing an image into an S×S grid, where each grid square gets a prediction detailing how confident the network is that an object of some class resides there. The network also produces a number of bounding boxes, each detailing how confident the network is that an object resides in that box. It’s a very explicit separation of the localisation-classification dilemma. Classifications and bounding boxes are then merged to produce the final detections as output. [Image courtesy of Joseph Redmon et al.]

Object Detection is an interesting and difficult problem in Computer Vision, because it really combines two underlying problems: object localisation and classification. The novelty in approaches like the YOLO architecture is that they solve both of these problems in a single forward pass of the network. (Only looking once!)
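
To make that single-pass idea a little more concrete, here’s a minimal NumPy sketch (not the authors’ code) of how a YOLO-v1-style output tensor is laid out and turned into candidate detections. The grid size, box count and class count are the values the original paper uses for PASCAL VOC; the random tensor and the threshold are purely illustrative stand-ins.

```python
import numpy as np

# Toy YOLO-v1-style output: an S x S grid, B boxes per cell, C classes.
# Each box carries (x, y, w, h, confidence); class probabilities are shared per cell.
S, B, C = 7, 2, 20                                  # values used for PASCAL VOC
preds = np.random.rand(S, S, B * 5 + C)             # stand-in for a real forward pass

boxes = preds[..., :B * 5].reshape(S, S, B, 5)      # per-box geometry + confidence
class_probs = preds[..., B * 5:]                    # per-cell class probabilities

# Class-specific score for every box: P(class | object) * P(object)
scores = boxes[..., 4:5] * class_probs[:, :, None, :]    # shape (S, S, B, C)

# Keep boxes whose best class score clears a (purely illustrative) threshold;
# non-max suppression would then merge the overlapping survivors.
keep = scores.max(axis=-1) > 0.2
print(keep.sum(), "candidate detections before non-max suppression")
```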

One issue with these approaches is that they operate primarily in the pixel domain. The strength of these networks comes from their ability to capture learned features at lower levels with each of their filters, and to recognise higher and higher level features at each progressive layer of the network, going from ‘lines and edges’ to ‘outlines and shapes’ to ‘this is a cat’, for example.

What if, instead of combining features of an image to determine whether or not it contains a cat, we could learn a more abstract representation, closer to how we might biologically model and interpret visual stimuli, and use that to decide what’s in the image?

That appears to be what Geoffrey Hinton, et al, propose in their incredibly interesting Dynamic Routing Between Capsules paper.

The paper proposes a number of novel techniques and ideas, which allow for impressive generalisation capability while significantly reducing parameters compared to CNN baselines.

Capsules are small groups of active neurons. A capsule responds to particular features in an image, similar to a filter, but with a marked difference: the activation length (i.e. the length of the capsule’s output vector) denotes the probability that the feature being looked for exists in the portion of the image being scanned, while each component of the output vector represents a particular value corresponding to a parameter of that feature.

For example, a particular capsule may denote the presence of fur in an image. A long activation length would indicate a high probability that the cat image we have fed into the network has fur, while each individual component of the capsule’s output vector would correspond to a particular instantiation parameter of this property, such as how thick or coarse the fur is, along with other parameters such as colour, albedo, and so on.
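
The paper makes this ‘length as probability’ reading work with a squashing non-linearity that shrinks short vectors towards zero and long ones towards unit length. Below is a minimal NumPy sketch of it; the 8-dimensional ‘fur’ capsule and its values are made up purely for illustration.

```python
import numpy as np

def squash(s, eps=1e-8):
    """The paper's non-linearity: shrinks short vectors towards zero and long
    ones towards unit length, so a capsule's length reads as a probability."""
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

# A made-up 8-D 'fur' capsule output: its length ~ P(fur is present), and its
# components ~ instantiation parameters (thickness, colour, albedo, ...).
raw = np.array([2.0, -1.5, 0.3, 0.0, 1.1, -0.2, 0.7, 0.4])
fur = squash(raw)
print("activation length (probability that fur is present):", np.linalg.norm(fur))
```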

Illustration of this semantic composition idea, courtesy of jhui.

So how do all of these capsules work together? Dynamic routing. This is an interesting idea whereby child capsules effectively choose their own parents, and can determine where to send their activations for further processing. Like a semantic parse tree, the idea here is to build up a higher-level representation of, say, a cat from lower-level features such as fur, whiskers, shape, and so on. The marked difference compared to Convolutional Neural Networks (CNNs) is this routing process.
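
Here is a simplified sketch of the routing-by-agreement loop described in the paper: each child capsule starts with uniform couplings to its parents and then raises the coupling to whichever parents its prediction agrees with. The toy shapes and random predictions below are placeholders, not real network outputs.

```python
import numpy as np

def squash(s, eps=1e-8):
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def route(u_hat, iterations=3):
    """Simplified routing-by-agreement. u_hat[i, j] is child capsule i's
    prediction for parent capsule j; children raise the coupling to whichever
    parents their predictions agree with."""
    n_children, n_parents, _ = u_hat.shape
    b = np.zeros((n_children, n_parents))                       # routing logits
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)    # softmax over parents
        s = (c[..., None] * u_hat).sum(axis=0)                  # weighted sum per parent
        v = squash(s)                                           # parent outputs
        b += (u_hat * v[None]).sum(axis=-1)                     # agreement update
    return v

# Toy example: 6 child capsules predicting 2 parent capsules of dimension 4.
u_hat = np.random.randn(6, 2, 4)
parents = route(u_hat)
print("parent activation lengths:", np.linalg.norm(parents, axis=-1))
```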

Understanding how Capsule Networks represent features

The notion that these capsules each learn some kind of internal representation of the object being recognised led to some interesting ways of deconstructing and visualising exactly what instantiation parameters each capsule learned.

By taking the activations of the final Capsule layer as a latent encoded space, and training a decoder network, the authors were able to reconstruct the original input image from the final layer activations. By taking it a step further, they could then tweak each of the vector components of the final activations to see what kind of effect this would have on the reconstructed image, and therefore understand which instantiation parameters the capsule considered. They found that almost all capsules accounted for width, as well as ‘localised’ features particular to specific digits, such as the length of the arch of a ‘2’ or its curvature.

Variations in the instantiation parameters learned by a capsule. [Image courtesy of Geoffrey Hinton et al]
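
As a rough sketch of that perturbation experiment (assuming a trained capsule network and decoder, which the placeholder values below only stand in for): mask out all but the winning digit capsule, nudge one of its 16 dimensions across the range the paper sweeps, and decode each tweaked activity vector.

```python
import numpy as np

# Hypothetical stand-ins: `digit_caps` is the 10 x 16 final-layer activation for
# one image, and `decoder` is the trained reconstruction network (which sees all
# ten capsules, with everything except the chosen one masked to zero).
digit_caps = np.random.randn(10, 16) * 0.2
decoder = lambda activity: np.zeros((28, 28))      # placeholder for the real decoder

# Pick the winning digit capsule (longest activation), nudge one of its dimensions
# across the range swept in the paper, and decode each tweaked activity vector to
# see what that dimension controls in the reconstruction.
winner = int(np.argmax(np.linalg.norm(digit_caps, axis=-1)))
dim = 5                                            # an arbitrary instantiation parameter
reconstructions = []
for delta in np.arange(-0.25, 0.30, 0.05):
    masked = np.zeros_like(digit_caps)
    masked[winner] = digit_caps[winner]
    masked[winner, dim] += delta
    reconstructions.append(decoder(masked.ravel()))
```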

Segmenting digits

If that wasn’t enough, the novel architecture showed an impressive capability to segment overlapping digits from one another. This really blew my mind. Because capsules learn a more abstract representation of what a ‘digit’ is, there is less reliance on classification solely within the pixel domain. Traditional approaches will typically segment images pixel-by-pixel and assign each pixel to a class. Capsules, however, are able to leverage dynamic routing to develop a kind of ‘parallel attention’ mechanism, as described in the paper. What this means is that a single pixel can be assigned to one or more digits (so long as they are not two instances of the same digit class; this is an assumed limitation of the network that helps to increase its efficiency and reduce the total number of parameters needed).

The results are impressive. Take a look below and see if you can do a better job at segmenting these digits than the network.

Digit segmentation on the MNIST dataset. In each cell, the top image shows the overlapped digits, where the original digits are denoted by the L(x, y) labels, and the decoded representations by the R(a, b) labels.

Generating these deconstructions was also fascinating. The team took the two longest vector activations from the final layer of the network and, using the same decoder network trained previously, generated each digit separately, colouring them red and green respectively.
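
In sketch form (again with placeholder activations and a stand-in for the trained decoder), the procedure looks something like this:

```python
import numpy as np

# Hypothetical stand-ins, as in the sketch above: capsule activations for an
# overlapping-digit image and the trained decoder network.
digit_caps = np.random.randn(10, 16) * 0.2
decoder = lambda activity: np.zeros((36, 36))      # placeholder for the real decoder

# Decode the two longest capsule activations separately, masking out everything
# else each time; colouring one reconstruction red and the other green gives the
# segmentation figures from the paper.
lengths = np.linalg.norm(digit_caps, axis=-1)
top_two = np.argsort(lengths)[-2:]
segments = []
for k in top_two:
    masked = np.zeros_like(digit_caps)
    masked[k] = digit_caps[k]
    segments.append(decoder(masked.ravel()))
```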

Final thoughts

One of the most promising features of this novel approach is the impressive generalisation capability afforded by dynamic routing, along with a drastic reduction in the number of parameters required to achieve similar performance to baseline CNNs.

I’m particularly excited about what this could mean for generation, too. It would be exciting to explore how we could learn representations for concepts such as people, vehicles, or anything else, and use them to generate new instances of that object by decoding the vector activations from the network with slight perturbations in each of the instantiation parameters. Imagine training the network to recognise people and then being able to ‘tweak’ what they look like (changing hair colour, height, and so on), à la The Sims!

If you thought this was at all interesting, you should check out the original paper, which goes into far more detail and does a great job of explaining the dynamic routing, the training process and the architecture. My hope was to make things a little more accessible and give those of you in a rush a readable ‘run-down’ of this novel idea.
