By Sergey Nikolenko, Chief Research Officer at Neuromation
Sergey is a researcher in machine learning (deep learning, Bayesian methods, natural language processing, and more) and the analysis of algorithms (network algorithms, competitive analysis). He has authored more than 120 research papers and several books, and has taught courses on machine learning, deep learning, and other topics. He has extensive experience with industrial projects (Neuromation, SolidOpinion, Surfingbird, Deloitte Analytics Institute).
This article unpacks how certain parts of the brain can learn to perform tasks they weren’t originally designed to do.
Neuroplasticity is another part of this issue. Experiments have demonstrated that different areas of the brain can easily learn to do things they were seemingly not designed for. Neurons are the same everywhere, yet different areas of the brain are responsible for different functions: there is Broca's area, responsible for speech, an area responsible for vision (actually, a lot of areas — vision is very important for humans), and so forth. Nevertheless, we can break down these notional biological borders.
This man is learning to see with his tongue. He attaches electrodes to his tongue, puts a camera on his forehead, and the camera streams its image to the electrode array pressing against his tongue. People stick this device on and walk around with it for a few days, with their eyes open, naturally. The part of the brain that receives signals from the tongue starts to figure out what's going on: these signals feel a lot like something that comes from the eyes. If you subject somebody to this for a week and then blindfold him, he will actually be able to see with his tongue! He can now recognize simple shapes and doesn't bump into walls.
The man in this photo has turned into a bat. He’s walking around blindfolded, using an ultrasonic scope whose signals reach his tactile neurons through his skin. With a sonar like this, a human being can develop echolocation abilities within a few days of training. We do not have a special organ that can discern ultrasound, so you have to attach a scope to your body. However, we can relatively easily learn to process this information, meaning that we can walk in the dark and not bump into any walls.
All of this shows that the brain can adapt to a very large number of different data sources. Hence, the brain probably has a "common algorithm" that can extract meaning from whatever it takes in. This common algorithm is the Holy Grail of modern artificial intelligence (a recent popular book on machine learning by Pedro Domingos was called The Master Algorithm). It appears that deep learning is the closest the field has come to this master algorithm so far.
Naturally, one has to be cautious when making claims about whether all of this is like what the brain does. "Could a neuroscientist understand a microprocessor?", a recent noteworthy article, tries to elucidate how effective current approaches in neurobiology are at analyzing a very simple "brain", like a basic Apple I processor or Space Invaders on Atari. We will return to this game soon enough and won't go into much detail about the results, but we do recommend reading the paper. Spoiler alert: modern neurobiology couldn't figure out a single thing about Space Invaders.
Unstructured information (texts, pictures, music) is processed in the following way: there is raw input, then meaningful features are extracted from it, and then classifiers are built on top of those features. The most complicated part of this process is understanding how to pick good features out of unstructured input. Up until recently, systems for processing unstructured information worked as follows: people attempted to select good features manually and then assessed the quality of relatively simple regressors and classifiers built on top of these features.
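This classical pipeline can be sketched in a few lines. The example below is illustrative, not from the article: the "raw input" is a synthetic noisy signal, the hand-designed features are FFT band energies, and the classifier on top is a simple logistic regression trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "raw input": noisy sine waves at two frequencies (two classes).
def make_signal(freq, n=64):
    t = np.linspace(0, 1, n)
    return np.sin(2 * np.pi * freq * t) + 0.1 * rng.standard_normal(n)

X_raw = np.array([make_signal(f) for f in [3] * 50 + [7] * 50])
y = np.array([0] * 50 + [1] * 50)

# Hand-crafted features: signal energy in low vs. high frequency bands.
def extract_features(signal):
    spectrum = np.abs(np.fft.rfft(signal))
    return np.array([spectrum[:5].sum(), spectrum[5:].sum()])

X = np.array([extract_features(s) for s in X_raw])

# A simple classifier on top: logistic regression fit by gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * (p - y).mean()

accuracy = ((1 / (1 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
```

The hard intellectual work here is hidden in `extract_features`: a human had to know that band energies separate these signals. Everything the classifier sees depends on that manual choice.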
Take Mel Frequency Cepstral Coefficients (MFCC), which have long been commonly used as features in speech recognition systems, for example. In 2000, the European Telecommunications Standards Institute defined a standardized MFCC algorithm to be used in mobile phones; all of these algorithms were laid out by hand. Up until a certain point, manually-extracted features dominated machine learning. For instance, SIFT (Scale Invariant Feature Transform), which enables one to detect and describe local features in images based on Gabor filters and the like, was commonly used in computer vision.
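To give a feel for how much hand design goes into such features, here is a simplified numpy-only sketch of the MFCC pipeline: windowed frame → power spectrum → triangular mel filterbank → log → DCT. The filterbank layout and coefficient counts below are illustrative defaults, not the ETSI standard.

```python
import numpy as np

sr, n_fft, n_mels, n_mfcc = 16000, 512, 26, 13

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

# Triangular mel filters spanning 0 .. sr/2, spaced evenly on the mel scale.
mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
fbank = np.zeros((n_mels, n_fft // 2 + 1))
for i in range(n_mels):
    l, c, r = bins[i], bins[i + 1], bins[i + 2]
    fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
    fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

def mfcc(frame):
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    log_mel = np.log(fbank @ power + 1e-10)   # log mel-band energies
    n = np.arange(n_mels)
    # DCT-II matrix: decorrelates the log energies into cepstral coefficients.
    dct = np.cos(np.pi / n_mels * (n + 0.5)[None, :] * np.arange(n_mfcc)[:, None])
    return dct @ log_mel

frame = np.sin(2 * np.pi * 440 * np.arange(400) / sr)   # 25 ms of a 440 Hz tone
features = mfcc(frame)
```

Every design decision here — the mel scale, the number of filters, the DCT truncation — was chosen by hand, which is exactly the kind of work deep learning later automated.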
Overall, people have come up with many approaches to feature extraction but still cannot duplicate the brain's incredible success. Moreover, the brain shows no biological predetermination: there are no neurons genetically created only for producing speech, remembering people's faces, and so on. It looks like any area of the brain can learn to do anything. Whatever the brain actually does, we would naturally like to learn to select features automatically, so as to build complex AI from large models of interconnected neurons transmitting all sorts of signals to one another. Most likely, humans simply lack the resources to manually develop the best possible features for images or speech.
Artificial neural networks
When Frank Rosenblatt introduced his perceptron, everyone started imagining that machines would become truly smart any day now. His network learned to recognize letters in photographs, which was very cool for the late 1950s. Later, neural networks made up of many perceptrons were developed; they could be trained with backpropagation (short for "backward propagation of errors"). Basically, backpropagation is a method for computing the gradients of the error function, which gradient descent then uses to update the network's weights.
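Backpropagation is short enough to show in full on a toy network. The sketch below (my own minimal example, not from the article) pushes one input through a one-hidden-layer network, then applies the chain rule from the output back to the input, and checks one gradient entry against a numerical derivative.

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.standard_normal(3)               # input
t = 1.0                                  # target
W1 = rng.standard_normal((4, 3)) * 0.5   # input-to-hidden weights
W2 = rng.standard_normal(4) * 0.5        # hidden-to-output weights

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward pass: keep intermediate values for the backward pass.
h = sigmoid(W1 @ x)                      # hidden activations
y = W2 @ h                               # network output
loss = 0.5 * (y - t) ** 2                # squared error

# Backward pass (backpropagation): chain rule, output to input.
dy = y - t                               # dL/dy
dW2 = dy * h                             # dL/dW2
dh = dy * W2                             # dL/dh
dW1 = np.outer(dh * h * (1 - h), x)      # dL/dW1, using sigmoid' = h * (1 - h)

# Sanity check: compare one entry with a finite-difference derivative.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
numeric = (0.5 * (W2 @ sigmoid(W1p @ x) - t) ** 2 - loss) / eps
```

The key point is that each layer's gradient is computed from the gradient of the layer above it, so the cost of the backward pass is about the same as the forward pass, no matter how deep the network is.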
The idea of automatic differentiation was floating around as early as the 1960s, but Geoffrey Hinton, a British-Canadian computer scientist who has been one of the leading researchers on deep learning, rediscovered backpropagation and expanded its scope. Incidentally, George Boole, one of the founders of mathematical logic, was Hinton's great-great-grandfather.
Multi-layer neural networks were developed in the second half of the 1970s. There weren’t any technical barriers in place at that time. All you had to do was take a network with one layer of neurons, then add a hidden layer of neurons, and then another. That got you a deep network, and, formally speaking, backpropagation works in exactly the same way on it. Later on, researchers started using these networks for speech and image recognition systems. Then recurrent neural networks (RNN), time delay neural networks (TDNN), and others followed; however, by the end of the 1980s it became evident that there were several significant problems with neural network learning.
First off, let us touch upon a technical problem. A neural network needs good hardware to learn to act intelligently. In the late eighties and early nineties, research on speech recognition using neural networks looked something like this: tweak a hyperparameter, let the network train for a week, look at the outcome, tweak the hyperparameters, wait another week, rinse, repeat. Of course, these were very romantic times, but since tuning the hyperparameters of a neural network is nearly as important as the architecture itself, achieving a good result on each specific task required either too much time or hardware more powerful than what was available.
As for the core problem, backpropagation does work formally, but not always in practice. For a long time, researchers weren't able to efficiently train neural networks with more than two hidden layers due to the vanishing gradients problem: when you compute a gradient with backpropagation, it may decrease exponentially as it propagates from the output neurons back to the input neurons. The opposite problem — exploding gradients — would crop up in recurrent networks: if one unrolls a recurrent network through time, the gradient may spin out of control and start growing exponentially.
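The vanishing effect is easy to demonstrate numerically. In this sketch (an illustration of mine, not from the article), a gradient is pushed backwards through a stack of sigmoid layers; since the sigmoid's derivative is at most 0.25, the gradient norm typically shrinks geometrically with depth.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

depth, width = 20, 10
weights = [rng.standard_normal((width, width)) * 0.3 for _ in range(depth)]

# Forward pass: remember each layer's activations for the backward pass.
h = rng.standard_normal(width)
activations = [h]
for W in weights:
    h = sigmoid(W @ h)
    activations.append(h)

# Backward pass: track how the gradient norm changes layer by layer.
grad = np.ones(width)
grad_norms = []
for W, a in zip(reversed(weights), reversed(activations[1:])):
    grad = W.T @ (grad * a * (1 - a))   # chain rule through one sigmoid layer
    grad_norms.append(np.linalg.norm(grad))

# grad_norms[0] is near the output, grad_norms[-1] near the input:
# after 20 layers the gradient is many orders of magnitude smaller.
```

With larger weights the same loop can instead blow up, which is the exploding-gradients side of the coin that plagues unrolled recurrent networks.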
Eventually, these problems led to the "second winter" of neural networks, which lasted through the 1990s and early 2000s. As John Denker, a neural networks researcher, wrote in 1994, "neural networks are the second best way of doing just about anything" (the less-known second half of this quote is: "…and genetic algorithms are the third"). Nonetheless, a true revolution in machine learning occurred ten years ago. In the mid-2000s, Geoffrey Hinton and his research group discovered a method of training deep neural networks. Initially, they did this for deep belief networks based on Boltzmann machines, and then they extended this approach to traditional neural networks.
What was Hinton's idea? We have a deep network that we want to train. As we know, layers close to the network's output can learn well using backpropagation. How can we train what's close to the input, though? At first, we train the first layer by unsupervised learning. After that, the first layer will already be extracting some features, looking for what the input data points have in common. After doing that, we pre-train the second layer, using the results of the first as its inputs, and then the third. Eventually, once we've pre-trained all the layers, we use the result as a first approximation and fine-tune the entire deep network to our specific task with backpropagation. This is an excellent approach… and, of course, it was first introduced back in the seventies and eighties. However, much like regular backpropagation, it worked poorly. Yann LeCun's team achieved great success in computer vision in the early 1990s with autoencoders, but, generally speaking, their method didn't work better than solutions based on manually-designed features. In short, Hinton can take credit for making this approach work for deep neural networks (and it would be too long and complicated to explain exactly what he did).
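The greedy layer-wise scheme described above can be sketched as follows. This is a deliberately minimal illustration of the general idea, not Hinton's actual procedure (which used restricted Boltzmann machines): each layer here is pre-trained as a small autoencoder on the previous layer's codes, and the resulting weights would then initialize a deep network for fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Unsupervised pre-training of one layer as a shallow autoencoder:
# learn W so that a linear decoder V can reconstruct x from sigmoid(W @ x).
def pretrain_layer(X, n_hidden, lr=0.1, epochs=200):
    n_in = X.shape[1]
    W = rng.standard_normal((n_hidden, n_in)) * 0.1   # encoder (kept)
    V = rng.standard_normal((n_in, n_hidden)) * 0.1   # decoder (discarded later)
    for _ in range(epochs):
        H = sigmoid(X @ W.T)                # codes
        err = H @ V.T - X                   # reconstruction error
        V -= lr * err.T @ H / len(X)
        dH = err @ V * H * (1 - H)          # backprop through the encoder
        W -= lr * dH.T @ X / len(X)
    return W

# Greedy stacking: each layer is pre-trained on the previous layer's codes.
X = rng.standard_normal((200, 16))          # stand-in for real training data
W1 = pretrain_layer(X, 8)
H1 = sigmoid(X @ W1.T)
W2 = pretrain_layer(H1, 4)
# W1, W2 now serve as a first approximation for a deep network,
# which would be fine-tuned end-to-end with backpropagation.
```

The point of the trick is initialization: each layer already extracts something meaningful before end-to-end backpropagation ever runs, so the fine-tuning phase no longer has to fight vanishing gradients from a random starting point.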
However, researchers only gained sufficient computational capabilities to apply this method by the end of the 2000s. The main technological breakthrough occurred when Ruslan Salakhutdinov (also advised by Hinton) managed to shift the training of deep networks to GPUs. One can view this training as a large number of relatively independent and relatively undemanding computational processes, which is perfect for highly parallel GPU architectures, so everything started working much faster. By now, you simply have to use GPUs to train deep learning models efficiently, and for GPU manufacturers like NVIDIA, deep learning has become a primary application that carries the same weight as modern games. Take a look at the NVIDIA CEO's pitch here.