Unsupervised learning of a useful hierarchy of visual concepts — Part 2

Matt Way
Syntropy
Aug 18, 2017


This article is part of a series explaining the work we’re doing at Syntropy and tracking our progress as we make headway on some of the unsolved (or unsatisfactorily solved) problems in machine learning. The articles are split into technical pieces (for machine learning professionals) and non-technical pieces (for a more general audience).

This article is a technical follow-on to Unsupervised learning of a useful hierarchy of visual concepts — Part 1.

Most attempts at unsupervised learning for vision involve what is called self-supervision. Given a particular input, rather than learning a relationship between the input and a label, the system learns a model that can reconstruct, or redraw, that original input. The intuition is that if the system can reconstruct the input well, it might have learned a useful internal representation of the data. The key word here is “might”, however, as the ability to reconstruct data says nothing specific about the representation itself. This idea was popularised by the autoencoder architecture, which has since spawned a large family of variants focused on learning different kinds of internal representations. Despite their popularity, and a brief period in which they were used to pretrain supervised networks, autoencoders remain largely unused because they fail to learn good enough representations (the variational autoencoder is promising, but will be discussed in later articles). The reason for this is what’s called the information loss problem.
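To make the setup concrete, here is a minimal sketch of a vanilla autoencoder with a bottleneck, assuming a PyTorch-style model; the layer sizes and batch are illustrative, not taken from our work.

```python
# Minimal vanilla autoencoder sketch (illustrative sizes, PyTorch assumed).
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=64):
        super().__init__()
        # Bottleneck: hidden_dim << input_dim forces the model to compress.
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)           # internal representation
        return self.decoder(z), z     # reconstruction and code

model = Autoencoder()
loss_fn = nn.MSELoss()                # self-supervised target: the input itself
x = torch.rand(32, 784)               # e.g. a batch of flattened MNIST digits
x_hat, z = model(x)
loss = loss_fn(x_hat, x)              # no labels anywhere
```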

The popular idea is that neural networks tend to lose information when going from input to output. If a network throws away the wrong information (information you might need to perform some task), it is problematic and needs to be improved. Some argue that supervised signals like labels are the only way to achieve this (due to joint learning), yet our brains appear to solve the problem in an unsupervised way. We believe light can be shed on it by comparing vanilla and sparse autoencoders. Vanilla autoencoders compress the data space by creating a bottleneck in the model, which forces it to reconstruct inputs well while using fewer resources. Sparse autoencoders, on the other hand, try to learn features that activate more rarely, so that only a few activations fire for each input. They are worse at reconstruction, yet interestingly are much better at tasks like classification. But if the vanilla autoencoder is better at reconstruction, then arguably it retains more information than its sparse counterpart. How do you make sense of this weird contrast?
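Continuing the sketch above, a sparse variant keeps the same reconstruction objective but adds an activation penalty (an L1 penalty here; KL-divergence penalties are also common); the weighting value is purely illustrative.

```python
# Sparse variant: same reconstruction objective plus an activation penalty.
sparsity_weight = 1e-3                # illustrative value

x_hat, z = model(x)
recon_loss = loss_fn(x_hat, x)
sparsity_loss = z.abs().mean()        # L1 penalty pushes most activations toward zero
loss = recon_loss + sparsity_weight * sparsity_loss
```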

MNIST Vanilla autoencoder features (left), Sparse autoencoder features (right)

If you look at a single sparse autoencoder feature, by definition it is activated by a much smaller number of inputs, inputs that have some particular thing in common. We like to say that these features have high usefulness, because each represents a building block in the input space that is common enough to warrant its existence: a building block that helps perform good reconstructions, but is also good for classification, as it only activates for inputs that share what it represents. Interestingly, the more building blocks you allow a sparse autoencoder to learn, the better it becomes. So why does neither of these networks do well when compared to supervised models?

If you imagine the complexity of the real world, the number of potentially useful building blocks is staggeringly large. It is for this reason that a model needs to compress its data as well as possible. As illustrated in the crude diagram above, supervised deep learning achieves this combination of compression and usefulness, while the two types of autoencoder do not. Vanilla autoencoders achieve decent information retention but poor usefulness, while sparse autoencoders are good at finding useful features but cannot scale well. As you add more compression to a sparse autoencoder, its usefulness quickly degrades: as you reduce the sparsity, the model has no guidance on the best way to compress the data (blue line), and simply compresses in whatever way satisfies the objective of producing good reconstructions.

Compression, in this light, becomes a question of grouping. Any group of sparse autoencoder features can be compressed to use fewer resources, but at the expense of usefulness, because the compression throws away separability information. This is where prior invariant knowledge becomes important. If you knew which sparse autoencoder features represented the same thing semantically, those features would become good candidates for compression into a single group. As mentioned in earlier writings, humans don’t learn from static data, but from data that changes over time. This data is powerful because it contains information about how things we consider semantically the same appear from one moment to the next. We believe the brain uses this extra information as guidance for compressing these building-block features in a way that keeps them maximally useful, while making them scalable through appropriate compression.

By learning the type of invariant manifold structure described earlier, you can achieve the desired mix of compression and usefulness. The individually learned manifolds behave like a sparse dictionary, in that they provide good usefulness. Then, by also learning a heavily compressed representation of each individual manifold, a scalable compression can be achieved while avoiding unwanted information loss. Usefulness does not drop, because the compression is over groups that contain semantically similar, or invariant, features.

There has been a lot of work over the last two decades that revolves around these ideas, though it has most likely come from different intuitions. The earliest was the Adaptive-Subspace Self-Organising Map. It was inspired by neuroscience, utilised episodic data, and attempted to build a topographic map like the one seen in V1 of the visual cortex. However, it could only select one manifold per input, and it used a simple type of sparse compression. Following the topographic path were [3,4], which overcame the limitation of selecting only a single manifold per input. These worked by trying to maximise the sparsity of feature neighbourhoods, which resulted in a similar emergent topography. They also used a simple linear compression, but did not utilise episodic data, which limits the combined manifold features to being close in Euclidean space.

Sparse topographic features from [4]

Building upon the idea of lasso regularisation (or basis pursuit denoising), [6,7,8,9,10,13] utilise a more complex sparse group lasso regularisation. While the structure of these models is very similar to the ideas described earlier, they are less about directly learning manifolds or invariance, and more about maximising selection sparsity. The primary difference from the topographic methods above is that these explicitly group separate manifolds, rather than being constrained to learn an overlapping map. The shortcomings of these methods are that they don’t use episodic data, they require batching (so must operate offline), and they are much further from biological plausibility.
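For reference, the sparse group lasso penalty combines an L1 term over individual coefficients with an L2 term over predefined groups, so whole groups can switch off together. A rough numpy sketch, where the function name, group structure, and weights are all illustrative:

```python
import numpy as np

def sparse_group_lasso_penalty(z, groups, lam_l1=0.1, lam_group=0.1):
    """Penalty encouraging both individual and group-level sparsity.

    z      : 1-D array of feature activations (coefficients)
    groups : list of index arrays, one per group (e.g. one per candidate manifold)
    """
    l1_term = np.abs(z).sum()                                # within-group sparsity
    group_term = sum(np.linalg.norm(z[g]) for g in groups)   # whole groups switch off together
    return lam_l1 * l1_term + lam_group * group_term

z = np.random.randn(12)
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]  # illustrative grouping
penalty = sparse_group_lasso_penalty(z, groups)
```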

Single layer manifold map architecture.

Our approach to group-based compression is an amalgamation of two existing algorithms. First, to perform the sparse manifold selection we use orthogonal matching pursuit. In this algorithm, manifolds compete for selection, with the ‘winner’ being the one that best represents some aspect of the input. The reconstruction generated by the winning manifold is then subtracted from the input, and the remaining manifolds compete again for what is left. The process repeats until some threshold is reached. Orthogonal matching pursuit is simple, works online, and is a potentially good analogue for how the cortex does manifold selection. By competing to phase-lock onto certain aspects of the input, a cortical area could conceivably both activate for, and effectively remove, that aspect of the input, rendering it invisible to the other manifolds competing for the remainder. This would also allow the same manifold to be selected multiple times, by utilising different temporal frames; a potential solution to the duplication problem from [12].
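As a rough illustration of that selection loop (not our exact implementation), here is a small numpy sketch that treats each manifold as a block of dictionary columns; the helper name, threshold, and shapes are hypothetical.

```python
import numpy as np

def manifold_omp(x, manifolds, threshold=1e-3, max_selections=5):
    """Greedy, group-wise selection in the spirit of orthogonal matching pursuit.

    x         : input vector, shape (dim,)
    manifolds : list of (dim, k_i) matrices, each holding one manifold's features
    """
    residual = x.copy()
    selected = []
    for _ in range(max_selections):
        if len(selected) == len(manifolds):
            break
        # Each manifold 'competes': score it by how well it explains the residual.
        scores = []
        for i, M in enumerate(manifolds):
            if i in selected:
                scores.append(-np.inf)               # no re-selection within a frame
                continue
            coeffs, *_ = np.linalg.lstsq(M, residual, rcond=None)
            scores.append(np.linalg.norm(M @ coeffs))
        winner = int(np.argmax(scores))
        selected.append(winner)

        # 'Orthogonal' step: refit jointly over everything selected so far,
        # then subtract that reconstruction from the original input.
        D = np.hstack([manifolds[i] for i in selected])
        coeffs, *_ = np.linalg.lstsq(D, x, rcond=None)
        residual = x - D @ coeffs
        if np.linalg.norm(residual) < threshold:     # stop once the input is explained
            break
    return selected, residual
```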

Second, to compress the intra-manifold features we use an online implementation of non-negative matrix factorisation. Rather than learning a reconstruction directly, this learns sparsity through a nonlinear, competitive dynamic process. By combining feedforward Hebbian learning with locally connected anti-Hebbian learning, neurons are penalised when they co-activate, promoting an optimal level of sparsity. This is interesting because the usual sparsity parameter does not need to be provided, and the architecture has a very close analogue in the input layer L4 of a cortical column [15].
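A very rough sketch of the kind of update dynamics we mean, in the style of Hebbian/anti-Hebbian similarity-matching networks; the learning rate, iteration count, and update forms are illustrative assumptions rather than our exact implementation.

```python
import numpy as np

def hebbian_antihebbian_step(x, W, L, lr=0.01, n_iters=50):
    """One online update: feedforward Hebbian weights W, lateral anti-Hebbian weights L.

    Neurons that co-activate strengthen their mutual inhibition (L), which
    penalises co-activation and drives the code toward sparsity; non-negativity
    is enforced by clipping activations at zero.
    """
    y = np.zeros(W.shape[0])
    # Run the competitive dynamics to (approximate) equilibrium for this input.
    for _ in range(n_iters):
        y = np.maximum(0.0, W @ x - L @ y)     # rectification keeps the code non-negative

    # Hebbian update: strengthen weights between active inputs and active outputs.
    W += lr * (np.outer(y, x) - y[:, None] * W)
    # Anti-Hebbian update: co-active output pairs inhibit each other more strongly.
    L += lr * (np.outer(y, y) - L)
    np.fill_diagonal(L, 0.0)                   # no self-inhibition
    return y, W, L
```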

By using episodic data, we are able to lock the manifold selections across multiple frames. This provides the compression guidance explained in the information loss section above. It also has a biological analogue in the Mexican-hat connectivity patterns of L2/3 in the cortical column. [16] shows that this type of connectivity causes bubbles, or blobbing, to occur after sufficient input activity, and that these are self-sustaining even after the input patterns have changed, keeping the lower layers active and able to learn patterns they may not otherwise have selected for.
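In pseudocode terms, the idea is simply to run the selection once at the start of an episode and hold it fixed for the remaining frames while the intra-manifold features continue to learn. A hypothetical sketch reusing the manifold_omp helper above, where learn_fn stands in for whatever intra-manifold learning step is used:

```python
def process_episode(frames, manifolds, learn_fn):
    """Lock manifold selection across an episode of temporally adjacent frames.

    frames   : sequence of input vectors from the same short episode
    learn_fn : updates the selected manifolds' internal features for one input
    """
    selected, _ = manifold_omp(frames[0], manifolds)   # select once, on the first frame
    for x in frames:
        # The locked selection tells every later frame which manifolds to train,
        # even if those manifolds would not have won the competition on their own.
        learn_fn(x, [manifolds[i] for i in selected])
    return selected
```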

While this architecture shows interesting results, it is only one part of a larger system. We describe the next part in How do humans recognise objects from different angles? An explanation of one-shot learning.

If you’re interested in following along as we expand on these ideas then please subscribe, or follow us on Twitter. If you have feedback after reading this, please comment, or reach out via email (info at syntropy dot xyz) or Twitter. Finally, if you’re interested in our work, please get in touch — we are always looking to expand our team.
