How do humans recognise objects from different angles? An explanation of one-shot learning.

This is the second in a series of articles explaining the work we’re doing at Syntropy, and tracking our progress as we make headway on some of the unsolved (or unsatisfactorily solved) problems in machine learning. These articles are split into technical (for machine learning professionals) and non-technical (for a more general audience) pieces. This article is non-technical, and will have a technical follow-on article.

In our last article, we explained that the visual world is made up of a hierarchy of parts. Bikes are made up of handlebars, wheels, pedals, etc; wheels are made up of tyres, spokes, a hub, etc; and at the lowest levels, everything is colours, edges, shapes and textures. At each layer of this hierarchy, our brains are invariant to some degree of change. At the lower levels of the visual hierarchy, invariance allows you to recognise a rectangle or line even when it is skewed, rotated or scaled; and at the higher levels it allows you to recognise people and objects regardless of viewing angle, lighting conditions, or context. If this isn’t conceptually clear, just go back and read the first four paragraphs of that article.

The astute reader may have picked up on a problem we left unaddressed in the last article. If a concept is invariant to change, we are left with the following question: how do we tell apart different arrangements of the same invariant parts?

Three arrangements of the same two parts.

In the above image, each of the three shapes is an arrangement of the same two invariant concepts. The first two we can recognise as both being the capital letter T, but the third is clearly not — even though it contains the same parts. What this tells us is that it’s not just the presence of the parts that defines an object, but also the relationships between them. The second T still looks like a T because the two parts still join in the same place relative to each other, and have been rotated to the same degree. The third doesn’t look like a T because the parts now have different relationships — they have rotated in opposite directions, and join in a different relative place.
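To make the idea of "relationships between parts" concrete, here is a minimal sketch of our own construction (none of these names or encodings come from the article's demos): each part is given a pose, and an object is defined by the pose of its parts relative to one another, so transforming every part together preserves the object while transforming parts independently destroys it.

```python
import math

# Hypothetical sketch: a "part" is a pose (x, y, rotation in degrees).
# An object is defined by the *relative* pose between its parts, so a
# global rotation of all parts leaves the relative rotation unchanged.

def relative_rotation(part_a, part_b):
    """Orientation of part_b expressed relative to part_a."""
    return (part_b[2] - part_a[2]) % 360

def rotate(part, degrees):
    """Rotate a part's position and orientation about the origin."""
    x, y, r = part
    rad = math.radians(degrees)
    return (x * math.cos(rad) - y * math.sin(rad),
            x * math.sin(rad) + y * math.cos(rad),
            (r + degrees) % 360)

# A capital T: a horizontal bar with a vertical stem joined below it.
bar, stem = (0.0, 1.0, 0.0), (0.0, 0.0, 90.0)

# Rotating the whole object preserves the relationship between parts...
assert relative_rotation(rotate(bar, 45), rotate(stem, 45)) == relative_rotation(bar, stem)
# ...but rotating the parts in opposite directions does not.
assert relative_rotation(rotate(bar, 45), rotate(stem, -45)) != relative_rotation(bar, stem)
```

The two assertions mirror the second and third shapes in the image: a shared rotation keeps the T a T, while opposite rotations break it.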

This tells us a few things about how our brain works. First, even though we have a tolerance for variance, we can still see that there has been a change. Second, we can describe what this change was (rotation), which means we have deconstructed change as a concept into a set of independent dimensions (rotation, translation, colour, brightness, etc). Finally, the dimensions we use to describe changes are common between parts, and we can relate them together. To see this for yourself, try imagining the following image, but with the colour changed to red, and rotated 90 degrees.

Imagine the black parts are red, and the whole thing is rotated 90 degrees.

You’ve probably never seen this exact combination of lines and shapes before, but you can still quite easily imagine it rotated, and in a different colour. This implies that we employ a common set of dimensions of change to both recognise and imagine how objects look from different angles, without having to see them from every angle first.

In 2011, Geoffrey Hinton, Alex Krizhevsky and Sida Wang released a paper called ‘Transforming Autoencoders’, which laid out a theory commonly referred to as ‘Capsules Theory’. This paper showed that given a common set of dimensions that describe how each visual concept can transform, a network could accurately predict and classify unseen variations of an input, having only seen the original input once (or just a few times). This ability to accurately classify objects after seeing them only once is called ‘one-shot learning’, and is something that humans can do naturally, but has proven difficult to replicate well thus far in machines. The architecture described in the capsules paper achieved one-shot learning, but prior knowledge of the transformational changes was required to train the system. For this reason it would be very difficult to scale the system to real-world vision applications — we simply don’t have access to that required training data.

So what needs to be done to create a more scalable capsules architecture? Let’s start by going over what a ‘capsule’ is, and what it does. Here is the description of a capsule from the original paper.

Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations and it outputs both the probability that the entity is present within its limited domain and a set of “instantiation parameters” that may include the precise pose, lighting and deformation of the visual entity relative to an implicitly defined canonical version of that entity.

That’s a pretty dense sentence, but it essentially means that each capsule represents a visual concept that is invariant to some degree of changes like lighting, viewing angle, etc. If that part sounds familiar it’s because that is exactly what our previous article demonstrated. What a capsule does that our previous demo does not is provide “instantiation parameters” along the dimensions of change. In other words, not only can it identify that an object is present, but also its precise position, angle of rotation, size, etc.
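As a rough illustration of what a capsule's output looks like, here is a hypothetical sketch. The names and values are ours, not from the paper, and the detector body is a stand-in (a trained capsule would compute these from its inputs); the point is the interface: a presence probability plus instantiation parameters along the dimensions of change.

```python
from dataclasses import dataclass

@dataclass
class CapsuleOutput:
    presence: float   # probability the entity is present in the capsule's domain
    x: float          # horizontal position
    y: float          # vertical position
    rotation: float   # degrees, relative to a canonical version of the entity
    scale: float

def detect_square(image_features):
    """Stand-in for a trained 'square' capsule. A real capsule would
    compute presence and pose from image_features; here the values
    are hard-coded purely to show the shape of the output."""
    return CapsuleOutput(presence=0.97, x=12.0, y=88.0, rotation=10.0, scale=1.4)

out = detect_square(None)
if out.presence > 0.5:
    print(f"square at ({out.x}, {out.y}), rotated {out.rotation} degrees")
```

This is exactly the difference described above: not just "there is a square", but where it is, how it is rotated, and how big it is.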

The capsules architecture relies on prior knowledge of transformational changes during training, but we humans don’t have these changes labelled for us when we’re learning to see. We are able to deconstruct our visual world into a common set of dimensions of change like position, lighting conditions and rotation simply by observing. Similar to our proposition in the last article, we propose that humans do this dimension separation by leveraging episodic, or sequential, data.

In our previous demo, we showed that episodes can be used to group all the variations of a visual concept into manifold detectors. In our architecture, each manifold can detect a visual concept in all of its variations, but it doesn’t give us back any information about the current variation. It can tell us “there is a square in this image” but not “the square is rotated to 10 degrees, is relatively large, and is located near the bottom left corner”. The following demonstration shows how we might conceivably build upon our manifold detectors to allow them to provide this information using episodic data. In other words, how we can turn our manifold detector architecture into a more scalable version of capsules.

Below is an example of a manifold representing a heart in different positions and rotations. Each point in the 3D visualisation represents a particular version of the heart. Initially the system is disordered, so moving the sliders around does not produce anything useful, but after some training (press play) it will automatically order itself, discovering the underlying latent dimensions without any prior knowledge of what the dimensions mean. Once organised, the sliders will each represent a single dimension of change.

Mobile readers will get a better experience viewing directly on CodePen

The key to this unsupervised unfolding is in episodic data. Each training sample contains two versions of the heart that are next to each other in transformational space. That is, along each axis of transformation the two versions are either identical or differ by a single step. This constraint on the training data is not considered prior knowledge; it simply reflects how objects actually change in the real world — they move and rotate gradually rather than jumping between arbitrary poses.
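The "neighbours in transformational space" constraint can be sketched as follows. This is our own minimal construction, not code from the demo: the axes, step sizes, and `neighbour` function are illustrative assumptions, and the `render` function that would turn a pose into an image is deliberately left out.

```python
import random

# Assumed dimensions of change and one "step" of change per axis.
AXES = ("x", "y", "rotation")
STEP = {"x": 1.0, "y": 1.0, "rotation": 10.0}

def neighbour(pose):
    """Return a pose where each axis is unchanged or moved by one step,
    i.e. a point adjacent to `pose` in transformational space."""
    return {axis: value + random.choice([-1, 0, 1]) * STEP[axis]
            for axis, value in pose.items()}

pose = {"x": 5.0, "y": 3.0, "rotation": 40.0}
pair = (pose, neighbour(pose))
# One training episode would be (render(pair[0]), render(pair[1])),
# where render (not shown) draws the heart at the given pose.
```

Because every episode pairs poses that differ by at most one step per axis, the training data itself encodes which variations are "close", which is what lets the manifold order itself without labelled transformations.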

We dive deeper into how this works, related work and historical inspiration in the technical follow-on article: Dimensional Reduction via Sequential Data.

If you’re interested in following along as we expand on these ideas then please subscribe, or follow us on Twitter. If you’re interested in our work or have feedback after reading this, please comment, or reach out via email (info at syntropy dot xyz) or Twitter.