How can we make machines understand the world like humans do?
Syntropy is on a mission to exponentially increase the productivity of every human being on the planet. To achieve this, we believe we need to solve a fundamental problem in machine learning — how can we make machines comprehend the world like humans do? In this article we’ll dive into this mission in more detail, highlight the problems we need to overcome to realise it, and explain the principles and insights that are guiding our work.
Without providing a recap of our collective human history, it is safe to say our progress thus far has been incredible. From nomadic hunter-gatherer tribes to industrialised megacities, this progress has been driven by two complementary ideas: leveraging technology to increase productivity, and sharing our discoveries with others. The most recent step change in our collective output is thanks to the internet, which has dramatically sped up knowledge propagation between people.
Once likely driven by pure survival, modern progress now seems to be about maximising our time: more time for family and friends, more time to pursue our goals, dreams and passions. This was the original promise of automation, and if you ignore our arguably inept social and economic structures for dealing with it, many of us are indeed seeing net benefits to our freedoms in daily life. In reality, however, this perceived increase looks fairly futile when you consider the image below.
Above is your entire waking life in weeks. This is all you get before you die. You may have seen an image like this before, and may, like others, find it somewhat depressing. And you should. But there is hope. You see, every year our average life span changes based on better standards of living and better medical technology. At present it takes more than a year’s worth of progress in these areas to extend our life expectancy by one year, but if this could be accelerated to the point where the same progress is compressed into a year or less, then for all practical purposes our lifespans would be open ended. This idea is known as the longevity escape velocity. It may feel like a distant dream of science fiction, and if our rate of progress stayed unchanged then we would agree. But we are notorious for mistakenly thinking our progress is linear.
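To make the escape-velocity condition concrete, here is a minimal sketch in Python. It assumes a deliberately crude model that does not come from any demographic source: each calendar year uses up one year of remaining life expectancy, and progress hands back a fixed `gain_per_year`. The function name, its parameters and the numbers are all hypothetical, chosen only for illustration.

```python
def years_survived(initial_remaining, gain_per_year, horizon=200):
    """Count the years until remaining life expectancy runs out.

    Each simulated year costs one year of remaining life and adds
    `gain_per_year` years from progress (an assumed constant rate).
    Returns None if expectancy is still positive after `horizon` years.
    """
    remaining = initial_remaining
    for year in range(1, horizon + 1):
        remaining += gain_per_year - 1.0
        if remaining <= 0:
            return year
    return None


# Illustrative inputs only: 40 years of expectancy left at the start.
slow_progress = years_survived(40, 0.25)   # progress returns 3 months per year
escape_velocity = years_survived(40, 1.0)  # progress returns a full year per year
```

In this toy model any gain below one year per year only delays the end, while a gain of one year or more per year means the countdown never finishes within the simulated horizon.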
Rather than addressing our mortality by focusing on specific medical problems, we believe that it is much more prudent to hack the acceleration of progress itself. We can do this by speeding up the rate at which we are able to automate processes. Humans have been getting better at this over time, but we are currently held back by the bottleneck of software development. Developing software is still an extremely inefficient, costly process that usually requires a large amount of time and expertise. Imagine how many useful programs don’t get written because of these disincentives. If we were able to communicate with and teach computers the same way that we might teach another human, we would afford every person the power to automate any process as though they had their own team of software engineers. This would facilitate a step change in global productivity.
To the casual observer, it would appear that we are already making great headway on this problem. DeepMind’s AlphaGo has now beaten the world №1 Go player, self-driving cars are allowed to drive themselves in 9 US states, computers are getting “super-human” scores on object recognition tasks, VCs are pouring billions of dollars into AI-related startups, and “deep learning” is in the headlines every week. With all this happening, it’s easy to imagine that human-like AI is only a stone’s throw away, but the reality is that there is still a fundamental problem to overcome — we don’t yet know how to program computers to learn a useful representation of the world.
To get good at playing Go, DeepMind trained AlphaGo on a database of 30 million moves, before having it play against itself thousands of times to generate more data. To achieve human-level performance at object recognition, a system typically requires millions of labelled training examples. Tesla has well over 1 billion miles’ worth of labelled training data for its self-driving cars, but they still can’t handle poor weather conditions, and struggle on roads without lane markings. Clearly, humans don’t require anywhere near this much labelled training data to learn. As a child, you need only be told a handful of times what a tree is before you can recognise other trees. We can’t do the same with computers.
If you trained a computer to classify 1000 different animals, then wanted to add worms to the list, you’d have to start again and re-learn all 1001 types of animals at once. We humans don’t need to forget everything we know to incorporate a new piece of knowledge. This problem is called catastrophic forgetting, and is so named because if you tried to learn only the new thing, it would be at the expense of the accuracy of the current knowledge. This forces almost all systems to remain static after training, making ongoing (continual) learning impractical. We also have difficulty taking the skills learned in one domain and transferring or building upon them in a new domain. For example, it is difficult to take a system that was trained to classify cars, and leverage its existing knowledge to answer questions about objects that aren’t cars.
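Catastrophic forgetting can be reproduced on a very small scale. The sketch below is a toy, not a deep network and not our method: it uses a plain multiclass perceptron on a single made-up feature per animal, and every name and number in it is hypothetical. It trains on two old classes, then on the new class alone, then starts again with everything at once.

```python
def predict(weights, x):
    """Return the index of the class with the highest linear score."""
    scores = [w * x + b for w, b in weights]
    # max() returns the first maximum, so ties go to the lowest class index.
    return max(range(len(scores)), key=lambda k: scores[k])


def train(weights, data, epochs=20):
    """Multiclass perceptron: on a mistake, raise the true class and lower the guess."""
    for _ in range(epochs):
        for x, label in data:
            guess = predict(weights, x)
            if guess != label:
                w, b = weights[label]
                weights[label] = (w + x, b + 1)
                w, b = weights[guess]
                weights[guess] = (w - x, b - 1)
    return weights


def accuracy(weights, data):
    """Fraction of (feature, label) pairs classified correctly."""
    return sum(predict(weights, x) == label for x, label in data) / len(data)


# Hypothetical 1-D feature per animal: class 0 = cat, 1 = dog, 2 = worm.
old_animals = [(-1, 0), (1, 1)]
new_animal = [(3, 2)]

weights = train([(0, 0)] * 3, old_animals)
acc_before = accuracy(weights, old_animals)

train(weights, new_animal)  # learn only the new class
acc_after = accuracy(weights, old_animals)

# Start again from scratch with old and new examples together.
joint = train([(0, 0)] * 3, old_animals + new_animal)
acc_joint = accuracy(joint, old_animals + new_animal)
```

In this toy the old animals are classified perfectly before the worm is introduced, training on the worm alone wipes out accuracy on them, and retraining on all three together recovers it. The mechanism is the same one described above: the new examples overwrite weights that the old classes depended on.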
Another big problem with current deep learning systems is that they are essentially black boxes, inscrutable from the outside. While they might make very accurate predictions, it’s practically impossible to decipher why they made a particular prediction. This means that when they are wrong, we can’t see exactly why, and we must supply more data, better data, or altered parameters in the hope of improving them. This problem precludes the use of these models in any application that requires decision explainability. It’s also been found that models can be quite easily fooled into making a wrong decision by changing an input in ways imperceptible to humans. These types of inputs, called adversarial images, are not only problematic in terms of security (a malicious actor could trick a system into making a specific decision), they also speak to the mysterious nature of the inner workings of these models.
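A rough intuition for why tiny changes can flip a decision comes from linear models: many small per-pixel shifts, each pushed in the direction that hurts the score, add up. The sketch below illustrates this with a fast-gradient-sign style perturbation on a hypothetical 100-pixel linear scorer. The weights, the image and `epsilon` are invented for illustration, and real attacks on deep networks are more involved.

```python
def score(weights, pixels):
    """Linear classifier score: positive means class A, negative means class B."""
    return sum(w * p for w, p in zip(weights, pixels))


def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


# Hypothetical model and input; pixel values sit near 0.5 on a 0-to-1 scale.
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]
image = [0.5 + 0.005 * w for w in weights]

# Move every pixel by at most epsilon against the gradient of the score.
# For a linear model that gradient is simply the weight vector.
epsilon = 0.01
adversarial = [p - epsilon * sign(w) for p, w in zip(image, weights)]

original_score = score(weights, image)
adversarial_score = score(weights, adversarial)
largest_change = max(abs(a - p) for a, p in zip(adversarial, image))
```

No pixel moves by more than about 1% of its range, yet the score changes sign, so the predicted class flips.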
These problems are all symptoms of the fundamental problem of representation learning, the process of forming a mental model of the world through experience. Humans can learn what a tree is with very little actual teaching, and we can learn what worms are without having to relearn about trees. We can do this because we learn without being explicitly taught. We learn by experiencing the world and building a mental model of all our experiences, so when our parent points to that tall, green thing and tells us it’s a tree, we’re assigning a link between our existing model of trees and our existing model of the word tree. Computers are currently taught in the reverse order: we show a picture of a tree, tell the computer it’s a tree, and ask it to rearrange its mental model so that it’s more likely to recognise the next tree. Repeat a few million times and we have a computer that knows what trees are (and worms, and cars, and cups of coffee), but in the computer’s universe there are as many types of objects as it’s been told about, and no more.
Computers don’t build mental models like ours because they don’t live in our world. When we show a computer that picture of a tree, we generally ask it to learn one thing: tree. If we want to know what type of leaves or bark a tree has then we can add more labels to our training data, like leaf, trunk, bark and branch type, but we can only ask questions about the things we’ve labelled, and we can’t feasibly label every factlet that can be gleaned from every image. Computers live in a world of labels, whilst humans live in a world of experiences.
Consider the following sentence.
The trophy would not fit in the brown suitcase because it was too big.
If I asked you what the word “it” refers to, you’d say the trophy. It’s clear to us because we know big things don’t fit inside little things. If I replaced the word big with small, then suddenly “it” would refer to the suitcase. This question is from the Winograd Schema Challenge, a set of similarly ambiguous questions that are easily answered by humans using common sense. The highest scoring artificial challenger to date scored 58% accuracy on the questions, little better than random choice, and well short of the 90% accuracy required to unlock the $25,000 prize. These questions are easy for us to answer because they relate to our mental model of the world, but they’re hard for computers to understand because their mental model is not so robust.
If we had a system that could build a good internal representation of the world like humans do, then we could show it just one or two labelled examples of trees, and have it understand what trees are. We could inspect the system and understand what it knows about trees that allows it to identify one. If the model were more interpretable, we could also ask many different questions of it. Not just “What is in this image?”, but all sorts of questions about the scene hierarchy, part relationships, similarities and, importantly, why particular choices were made, and how to fix problems. We could also continue to expose the system to new information, and have it perpetually improve upon its internal representation.
At Syntropy, we are developing new neural network architectures, inspired by machine learning and neuroscience, that allow a computer to learn a good internal representation of the world. While this goal isn’t entirely new, we distinguish ourselves by following some guiding principles and key insights that we believe will result in better solutions. These are explained below.
First, we’ve constrained ourselves to only using the same data available to humans. That means no (or very few) labels, especially early in training. This is called unsupervised learning. It’s clear that unsupervised learning is necessary to build a good world model: there’s just no way we could learn enough about the world using human-provided labels. Geoffrey Hinton confirms this with numbers in his 2014 AMA.
The brain has about 10¹⁴ synapses and we only live for about 10⁹ seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning since the perceptual input (including proprioception) is the only place we can get 10⁵ dimensions of constraint per second.
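The arithmetic behind that quote is short enough to check directly. The snippet below is only a worked restatement of the two order-of-magnitude figures quoted above; the variable names are ours.

```python
synapses = 10**14          # rough number of parameters, as quoted above
lifetime_seconds = 10**9   # rough length of a life, as quoted above

# Constraints needed per second of life to pin down every parameter once.
constraints_per_second = synapses // lifetime_seconds

# Sanity check on the lifetime figure, converted to years.
lifetime_years = lifetime_seconds / (365.25 * 24 * 3600)
```

Dividing the two figures gives 10⁵ constraints per second, and 10⁹ seconds works out to a little under 32 years, so the lifetime figure is an order-of-magnitude estimate. Human-provided labels cannot realistically arrive at anything like that rate, whereas raw perceptual input can.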
Humans also experience data not as discrete, unrelated inputs, but as a stream of sequential data points. When you look around the room, your eyes are taking in sequential frames of vision, and even when you lock your gaze, your eyes are making microsaccades (small, involuntary movements) around your point of focus. All our sensory modalities deliver signals in this way: a constant stream of continuously transforming data. Sequential data provides us with weak labelling, as each successive frame is likely to contain the same things. It also provides information about how things can legally change over time. Other methods, like Slow Feature Analysis, and learning to generate future frames, utilise the same types of data. However, we believe the associated models don’t capture the information in the best way.
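The weak labelling that sequential data provides can be expressed as a “slowness” score: a feature that tracks what is in the scene should change little from one frame to the next. The sketch below computes the kind of quantity that Slow Feature Analysis tries to minimise, on two made-up feature traces. It only evaluates the objective, it does not learn anything, it is not our model, and the signals and names are hypothetical.

```python
import math


def slowness(signal):
    """Mean squared frame-to-frame change after scaling to zero mean, unit variance."""
    mean = sum(signal) / len(signal)
    variance = sum((v - mean) ** 2 for v in signal) / len(signal)
    scaled = [(v - mean) / math.sqrt(variance) for v in signal]
    steps = [(b - a) ** 2 for a, b in zip(scaled, scaled[1:])]
    return sum(steps) / len(steps)


frames = range(100)
# Hypothetical feature values read off 100 consecutive frames.
slow_feature = [math.sin(2 * math.pi * t / 100) for t in frames]  # drifts gently
fast_feature = [math.sin(2 * math.pi * t / 10) for t in frames]   # flickers quickly

slow_value = slowness(slow_feature)
fast_value = slowness(fast_feature)
```

Scaling to unit variance matters: without it, a feature could score as “slow” simply by being constant. The gently drifting trace scores far lower than the flickering one, which is the property a slowness-based learner rewards.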
Secondly, we enforce explicit structure inside our models, aiming to achieve easier inspection of the internal workings, and an intuitive mapping of the hierarchies of our visual world. Almost all models today are learned end to end without any structural constraints, at the expense of any understanding of the model’s internals. This means that not only can they not be understood in terms of process or function, but they have to be tested with inefficient, indirect methods. A model with an interpretable structure can instead be inspected and tested empirically, part by part, much as in regular software development.
Finally, our overarching, but loosest principle, is that we are guided by neuroscience. The constraints above are analogous to what we currently know about the brain and its connection patterns, and our architectures are, for the most part, biologically plausible. While this isn’t a constraint we feel we need to follow by the book, we believe that there is value not only in building new useful models, but also in working towards an understanding of how the brain functions. To put it another way, we aren’t specifically trying to replicate the brain, but if there is already a machine that exhibits the behaviours we are trying to emulate, then it makes sense to refer to it for inspiration.
If we can build useful systems that adhere to the above principles, then any good solution we arrive at should address all the problems listed earlier. This will put us one step closer to achieving our mission.
We’ll be releasing a series of explainers of our work over the coming months. If you think you’ll be interested in them then please subscribe, or follow us on Twitter. If you have feedback after reading this, please comment, or reach out via email (info at syntropy dot xyz) or Twitter. Finally, if you’re interested in our mission and ideas, please get in touch — we are always looking to expand our team. The first technical article can be found here: https://medium.com/syntropy-ai/unsupervised-learning-invariant-manifolds-fcd0c4d3e7ef