Unsupervised learning of a useful hierarchy of visual concepts — Part 1

This is the first of a series of articles explaining the work we’re doing at Syntropy, and tracking our progress as we make headway on some of the unsolved (or unsatisfactorily solved) problems in machine learning. These articles are split into technical (for machine learning professionals) and non-technical (for a more general audience). This article is non-technical, and will have a technical follow-on article.

This article originally formed the first half of Unsupervised learning of a useful hierarchy of visual concepts — Part 2. The non-technical portion of the original article has been extracted, leaving the technical portion as a follow-on.

Look at the picture above. It’s a picture of a bike — that’s obvious — but it’s a bike you’ve never seen before, so how do you know it’s a bike? I expect you’ll list the properties of bikes that we can see here — two wheels, pedals, handlebars, seat, etc. That raises a further question though — how do you know those are wheels if you’ve never seen those exact wheels before? The answer here — tyres, spokes, hub — reveals that the properties that make up a bike are themselves made up of other properties. Our visual world is composed of a hierarchy of parts. At the top of the hierarchy there are abstract concepts like bike, car, dog; and if you follow it all the way down you’ll find that everything is made up of basic shapes, lines, colours and textures.

So is that the answer? We see a bike because it’s made up of the properties that define a bike? Actually, that’s only part of the answer — there’s another question that will throw a spanner in the works.

Two pedals. The same, but different.

The two pictures above are of the same pedal. Again, it seems obvious, but if you followed the hierarchy down to the bottom you’d notice that actually, the shapes that make up the first image are different to the shapes that make up the second. So how do we know they’re the same when, seemingly, they are made up of different parts?

The answer is that at each level of the visual hierarchy, your brain has learned to tolerate some amount of variance in the parts that compose a particular concept. This is called invariance. At the lower levels of the visual hierarchy, invariance allows you to recognise a rectangle or line even when it is skewed, rotated or scaled; and at the higher levels it allows you to recognise people and objects regardless of viewing angle, lighting conditions, or context.


If we want to build a computer vision system that can see like humans do, then it must have a hierarchy of visual concepts, where each is invariant to some degree of change in the parts that compose it.

Machine learning algorithms take varying approaches to finding invariance in visual data. Deep learning, currently in vogue, takes what you might call a ‘brute force’ approach — show the system thousands or millions of images, label each one (bike, car, dog), and it will eventually discover correlations between similarly labelled images. This works astonishingly well when applied to narrowly defined classification problems, but the invariance that emerges from this process is very much inferior to a human’s hierarchy of invariance.

First, while the system itself does learn invariances throughout the hierarchy, they’re not organised in any useful way, but scattered around indiscriminately within each layer. This makes them uninterpretable at any level other than the top (where we assigned the labels), and is the reason we refer to deep learning as a ‘black box’. If we could look inside and see what the system was learning then we could do far more with it than just recognise bikes and dogs. We could ask further questions about the dog (tail length, fur colour, etc), identify why the system did or didn’t correctly recognise the dog, and easily rectify any weaknesses in the system — all things humans can do naturally.

Secondly, to learn anything at all, deep learning vision systems require thousands of labelled images for each thing you want them to recognise. Compare this with humans, who learn to recognise objects and people before they can even talk — well before they receive anything that could be called labelled data. Learning without labels is called unsupervised learning, while learning from labels is called supervised learning. Much of the learning that humans do is unsupervised, while deep learning is supervised.

Fully-supervised learning is problematic because it can only find invariance by recognising statistical similarities between pictures that share the same label. These are not necessarily true invariances. For example, of 1,000 pictures labelled ‘dog’, 95% might contain whiskers, which is good, but 80% might also contain grass. Does grass define a dog? Of course not! But if the data is skewed in some way like this then the discovered invariances might not map well to the real world.

The skewed invariances problem is often tolerable for deep learning applications because the training data is usually similar to the test data. It becomes a problem though when we need a more general system that we can build upon to perform new tasks. If the hierarchy of invariance was robust, like a human’s, then it would also be reusable. This is the reason humans can learn to recognise something after seeing it only a few times — we are building upon a good mental model of the visual world. Deep learning vision systems do not have a robust mental model, leaving them useful only for performing the exact task they were trained to perform.

Our goal at Syntropy is to help computers understand the world the same way that humans do. To achieve this, we need an algorithm that can learn the same hierarchy of invariant concepts that humans build internally, without relying on labelled data. This can be phrased in machine learning terms as unsupervised learning of a hierarchy of invariant parts. The rest of this article details some initial stages of our approach towards this goal.


In a typical artificial neural network, each neuron is a feature detector. It is looking for some specific thing in the input which, if found, will cause it to activate. In the first few layers of a deep learning network, the neurons typically work together to detect edges, lines and basic shapes. One of these layers might contain dozens of combinations of neurons that are looking for vertical edges in varying positions, lengths and degrees of rotation; each different to the next, but similar enough that we can still consider them all to be vertical edges. These combinations of neurons could together be said to cover the whole manifold of possible vertical edge variations. The problem is, there’s nothing explicitly tying the combinations together. The neurons are not organised in any meaningful way, but distributed randomly throughout the layer, interspersed with other neurons that are detecting totally different things.
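To make the idea of many detectors jointly covering one manifold concrete, here is a minimal sketch in Python. It is purely illustrative, not part of our system: the hand-built sign filters, the 15×15 size and the particular angles are all assumptions chosen for clarity, standing in for the learned filters a real network would contain.

```python
import numpy as np

def edge_filter(angle_deg, size=15):
    """A crude oriented-edge template: +1 on one side of a line
    through the centre of the patch, -1 on the other."""
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[:size, :size] - size // 2
    d = xs * np.cos(theta) + ys * np.sin(theta)  # signed distance to the line
    return np.sign(d)

# A small 'layer' of edge detectors at nearby orientations --
# each different from the next, but all still roughly vertical.
angles = (60, 75, 90, 105, 120)
bank = {a: edge_filter(a) for a in angles}

def response(patch, filt):
    """Activation of one detector: correlation with the input patch."""
    return float((patch * filt).sum())

# A 75-degree edge activates the 75-degree detector most strongly,
# while its neighbours respond more weakly.
patch = edge_filter(75)
best = max(bank, key=lambda a: response(patch, bank[a]))
print(best)  # 75
```

Together these five detectors respond to the whole range of near-vertical edges, yet nothing in the code (or in a typical trained network) records that they belong together — which is exactly the missing organisation described above.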

Our objective is to organise these feature detectors into an explicit structure, grouping them together in such a way that each group could be considered a manifold detector. That is, a group of neurons that can activate in different ways to detect all the variations of a particular visual concept. A hierarchy of manifold detectors would directly mirror the visual hierarchy of the real world, be easy to inspect and explain, and form a useful schema that can be built upon to rapidly learn new things. A successful implementation should identify the same object in various positions by recognising that even as the position changes, the set of activated manifold detectors remains constant. This will be our yardstick for measuring progress against this objective.


We trained our network on the Omniglot data set, a set of 1623 different handwritten characters from 50 different alphabets. Omniglot comes with images, labels and stroke data, but for our experiments we have used only the images.

A sample of characters from the Omniglot data set

We feed our network a single character at a time, transforming the input in subtle ways across a few input frames to simulate video input, or the way a human might see something move. Using this data, our system is able to construct a map of invariant manifold detectors, each of which represents a ‘part’ in a variety of poses and positions. Each manifold detector attempts to reconstruct the part that it identified, and the reconstructions generated by the activated manifolds can be combined to reconstruct the full input. This is demonstrated below; colours are used only as a visual aid.
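The system itself is covered in the technical follow-on, but the pseudo-video input can be sketched. Everything below (the `jitter_frames` helper, the toy character, the shift range) is a hypothetical illustration of the idea, not our actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter_frames(image, n_frames=5, max_shift=1):
    """Turn one static character image into a short pseudo-video:
    each frame is the image nudged by a small random translation."""
    frames = [image]
    for _ in range(n_frames - 1):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        frames.append(np.roll(image, (dy, dx), axis=(0, 1)))
    return np.stack(frames)

# A toy 'character': a diagonal stroke on a 16x16 canvas.
char = np.zeros((16, 16))
for i in range(4, 12):
    char[i, i] = 1.0

video = jitter_frames(char)
# Every frame still contains the same stroke, just moved slightly --
# exactly the kind of variation the manifold detectors must tolerate.
```

The point of feeding frames rather than isolated images is temporal continuity: parts that persist across adjacent frames, while shifting slightly, give the system an unlabelled signal that those variations belong to the same underlying concept.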

Left column: Input. Middle column: Manifold reconstructions. Right column: Combined reconstruction.

The following diagram shows the last 15 reconstructions generated by some of the manifold detectors. Note that in general, each is reconstructing a range of variations of a particular type of part. This demonstrates that the manifold detector has an invariance to some degree of pose and positional change.

Last 15 reconstructions performed by each manifold.

Finally, as stated earlier, a successful implementation should identify an object in various positions by recognising that even as the position changes, the set of activated manifold detectors remains the same. The following diagram shows that our system exhibits this ability.

Left: Transforming input. Right: Reconstruction.

We go into more depth on the inspirations, related work, and technical implementation details in the follow-on article: Unsupervised Learning of a Useful Hierarchy of Visual Concepts — Part 2. The follow-on article is more technical and is aimed at machine learning professionals. The general reader can continue on to the next non-technical article: How do humans recognise objects from different angles? An explanation of one-shot learning.

If you’re interested in following along as we expand on these ideas then please subscribe, or follow us on Twitter. If you have feedback after reading this, please comment, or reach out via email (info at syntropy dot xyz) or Twitter. Finally, if you’re interested in our work, please get in touch — we are always looking to expand our team.