Deep learning has been the sexiest term in machine learning. For business leaders accustomed to taking a “deep dive” on an issue, or imagining a marine explorer surveying the ocean’s depths, the term deep learning implies that it is the best, most advanced technique available.
Moreover, deep learning involves the creation of “neural networks.” Separate from its actual meaning, the term is a marketer’s dream, since it semantically associates the technology with the power and mystery of the human brain.
In reality, deep learning and neural networks are poor substitutes for the fullness of the human brain, and they need not be so mysterious. While variations of deep learning are the most advanced techniques for tackling several types of problems, deep learning has its limitations, and it’s not the best technical solution in every case. In this article, we’re pulling back the curtain. Once you grasp a few key deep learning concepts, you may still choose to embrace the term for marketing purposes. But at least you’ll have a clearer understanding of what it’s all about. We’ll start with some historical context.
A Brief History Lesson
First, a brief history. Neural networks are not new; the basic concept is over fifty years old, and another key component, back-propagation, emerged in 1986. Cutting-edge technologies like TensorFlow still rely on back-propagation today. The convolutional neural network was developed in 1989.
The late 1980s were, in a sense, an “AI spring.” Innovation was blooming, and hope was in the air. Neural networks began entering health systems and major corporations eager to take advantage. And they failed catastrophically. The amount of data, processing power, and business understanding were insufficient to allow for their success. Just as 1950s teenagers weren’t ready for Marty McFly’s edgy rock ’n’ roll, businesses weren’t ready for neural networks. But their kids, so to speak, were going to love it.
The “AI winter” thawed a bit as the technology improved and ventured back into the wild, but a resurgent cold streak took hold from 2005–2010. Neural networks were banished. If you said the word neural network at a machine learning conference in 2010, you’d be laughed out of the room. To be taken seriously, proponents of neural networks changed the name to deep learning.
Over the last decade, the winter has turned into a blazing-hot summer with no apparent end in sight. The scale of today’s data and the computing power of graphics processing units make it possible to finally put deep learning into practice after all these years.
Certain classes of models have proven effective for particular tasks. For example, a convolutional neural network (CNN) is well suited to analyzing images, while a recurrent neural network (RNN) is well suited to modeling or analyzing sequences like text or audio.
So — what is it, really? Let’s find out.
A Brief Algebra Refresher
At the simplest level, deep learning is based on math, so let’s build up the concepts we need by taking a trip back to algebra class for a moment. Remember this classic equation?

y = mx + b
The idea is that you take some input x, plug it into the formula where m and b are actual numbers, and you get the output y. If you know the values of m and b, you can draw the function as a line on a graph and then, for any x you want, simply consult the graph to find the corresponding y.
If you don’t know the values of m and b, but you have a lot of examples of x, y coordinates, you can plot those points on the graph, figure out the line that best represents them all, and then consult that line to figure out new values of y for any new x values you want to check. You could, of course, represent this graph as y = mx + b, choosing the m and b values that form the line you want.
Oh, and congratulations, by the way — you’ve just done some machine learning, albeit in its absolutely simplest form. Using linear regression, you’ve used sample data to find an algorithm (in this case, the ultra-simple y=mx+b) that can predict the value of new data. Not very deep, yet. But it’s a start.
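To make this concrete, here is a minimal sketch of that line-fitting step in plain Python. The sample points are invented for illustration; the closed-form least-squares formula finds the m and b that best fit them.

```python
# Sample (x, y) points that roughly follow a line (made-up data)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]  # close to y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares solution for a single feature:
# the slope m that minimizes the total squared error
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - m * mean_x  # the line passes through the mean point

# Use the fitted line to predict y for an x we haven't seen
y_new = m * 6.0 + b
```

Once m and b are computed, predicting for new inputs is just plugging into the formula, which is exactly the “consult the line” step described above.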
A Regression Progression
Let’s give our linear regression example some specifics. Imagine you’d like an algorithm that will predict the lifespan of a woman looking to buy insurance, but you could only plug one attribute of that person into your formula. Age is probably a good choice here, so x represents the person’s age. The age will be multiplied by some number m, and then offset by some number b to make the math work out properly.
But of course, age shouldn’t be the only consideration. There are many other attributes of the woman that you might want to factor in. In machine learning, these attributes are called “features,” and in this case they may include weight, height, years spent smoking, hours spent exercising, and so on. We can call these features x1, x2, etc., and the actual equation looks like this:
y=m1x1 + m2x2 + m3x3 + . . . + b
To tweak this only slightly further, let’s change the m from algebra into a w, because in machine learning, this element is called a weight. We’ll keep the term b intact, as here it stands for bias. This is not the kind of bias that gets companies in trouble, but merely an adjustment made to the weighted sum that’s an important part of linear regression.
y= w1x1 + w2x2 + w3x3 + w4x4 . . . + b
Features connected with larger weights are more important to the final result. For instance, if x1 is the years spent smoking, its corresponding weight should be much larger than, say, the weight for the number of languages spoken. A polyglot’s mental flexibility may help extend their lives a bit, but their smoking habits are probably a much bigger factor in how many more years they can be expected to live.
Algebra lovers may have realized that our function is now in too many dimensions for a human to visualize. Remember, the xs in the function are all different features — if it’s helpful, you could imagine replacing x2 with y, x3 with z, etc. Once you get beyond z, you’ve moved beyond three-dimensional space. Conceptually, however, the function is still pretty simple. The machine learning process remains the same: if you have a big set of features and their corresponding outputs, you can train a machine to determine the weights that best fit that data. The technique for doing this is called gradient descent, though the details of how it works aren’t important for our purposes. Ultimately, you’ll have a weighted equation that you can use on feature inputs you’ve never seen before. You’re still doing a form of linear regression.
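To give a flavor of how gradient descent finds those weights, here is a toy sketch in plain Python. The “true” weights and the synthetic training data are invented for illustration; the loop repeatedly nudges each weight a little in the direction that reduces the squared prediction error.

```python
import random

# Invent a "true" relationship y = w1*x1 + w2*x2 + w3*x3 + b,
# then generate synthetic training examples from it.
random.seed(0)
true_w = [2.0, -1.0, 0.5]
true_b = 3.0
data = []
for _ in range(200):
    x = [random.uniform(-1, 1) for _ in range(3)]
    y = sum(wi * xi for wi, xi in zip(true_w, x)) + true_b
    data.append((x, y))

# Start with all weights at zero and learn them from the data
w = [0.0, 0.0, 0.0]
b = 0.0
lr = 0.1  # learning rate: how big each correction step is

for _ in range(500):  # many passes of small corrections
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(w, x)) + b
        err = pred - y
        # Nudge each weight opposite the gradient of the squared error
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        b -= lr * err
```

After training, the learned weights land very close to the ones used to generate the data, which is the whole point: gradient descent recovers the relationship from examples alone.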
But what if your problem requires not an absolute number, like years of life left, but rather a probability, like the chance a person will get sick within the next 30 days? Not a problem. We can take the same weighted function and run it through a special function called a sigmoid function, represented by the symbol σ. Put it together and it looks like this:
y= σ (w1x1 + w2x2 + w3x3 + w4x4 . . . + b)
For our purposes, it’s not critical to understand how the sigmoid function works, and more important to understand that it will produce a number between 0 and 1. In our example, a result of 0.97 would indicate high probability that a patient would get sick within the next 30 days. Machine learning often deals with probabilities: predictive text generators, for instance, predict the probability of what the next word in a sentence will be (and then suggest the words with the highest probability). Algorithms to classify objects assess the probability that a given image shows a dog or a cat.
Running the result through the sigmoid function turns the process from linear regression into logistic regression. The machine learning challenge is the same: figure out what the weights of this equation should be, given a large enough sample set of known x and y values. Still not deep learning, but we’re almost there. And we’ve already picked up a few terms you can use to impress your colleagues.
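Here is a quick sketch of that step in plain Python. The weights, bias, and feature values are hypothetical; the point is just that the sigmoid squashes any weighted sum into a number between 0 and 1.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights and bias
weights = [0.8, -0.5]
bias = -1.0

# Hypothetical features, e.g. x1 = years smoking, x2 = weekly exercise hours
features = [3.0, 2.0]

# Same weighted sum as linear regression...
z = sum(w * x for w, x in zip(weights, features)) + bias
# ...then squashed into a probability by the sigmoid
prob = sigmoid(z)
```

Whatever the weighted sum works out to, the output is always a valid probability, which is why the sigmoid is the standard final step for yes/no predictions.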
Hiding in Plain Sight
As useful as logistic regression is, it has a fundamental flaw: with only a set of weights, it has just one way of describing the data. The real world is much messier than a simple textbook example; there may be features of the data we can’t see. Features of the data may interact with each other in important ways. To take a simple example, the ratio of years spent smoking to years spent in college may turn out to be important to predicting lifespan. The point is, our basic models of how the world works are probably missing some very important details — and we don’t even know what those details are.
It turns out that it is possible to account for these “hidden” or “latent” features mathematically. The key is to use logistic regression to calculate multiple sets of weights describing a data set, as opposed to just one. There are, it turns out, often different ways to skin the cat. If you think of a set of weights as a filter for what’s important in a data set, then calculating different weights is essentially applying more than one filter.
The output of those equations can then become a new set of features — features that, unlike the first set, were until now unknown. Those features can then be used to calculate another set of weights in our final equation that describes the original training data.
Let’s say we start with three features, and we want to calculate four new features using four sets of weights before calculating the final outcome. This figure shows how this looks graphically:
Let’s pause for a moment and point out a few things:
1) This illustration shows a neural network. And while that name may conjure up a sci-fi robot brain, you’ll notice that it’s made entirely of understandable math.
2) You may also notice that the math we’ve been doing isn’t particularly more complex than logistic regression. In fact, we’ve basically just applied logistic regression to . . . logistic regression. Machine learning requires a lot of math, but that math isn’t particularly hard.
3) This neural network only has one layer. But if we can do this for one layer . . . why not add another one? We can calculate multiple sets of weights for the features in Layer 1, use the outputs of those equations as the features for Layer 2, and then use those features to calculate the final set of weights. By adding layers, we make our machine learning model deeper, and we can account for more and more non-linear relationships. And now, finally, we’re ready to say it: we’re doing deep learning.
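To show that “logistic regression applied to logistic regression” really is all there is to it, here is a forward pass through a tiny one-hidden-layer network in plain Python. All the weights, biases, and inputs are hypothetical; in a real network they would be learned from data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: 3 input features -> 4 hidden features -> 1 output
hidden_weights = [
    [0.2, -0.4, 0.1],   # weights feeding hidden node 1
    [0.5, 0.3, -0.2],   # hidden node 2
    [-0.1, 0.6, 0.4],   # hidden node 3
    [0.3, -0.3, 0.2],   # hidden node 4
]
hidden_biases = [0.1, -0.1, 0.0, 0.2]
output_weights = [0.7, -0.5, 0.3, 0.6]
output_bias = -0.2

x = [1.0, 0.5, -1.5]  # the original three features

# Layer 1: each hidden node runs its own logistic regression on the inputs,
# producing four new "hidden" features
hidden = [
    sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
    for ws, b in zip(hidden_weights, hidden_biases)
]

# Layer 2: the final output is logistic regression on those hidden features
y = sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)
```

Stacking more layers just means repeating the Layer 1 step with the previous layer’s outputs as inputs; nothing deeper than weighted sums and sigmoids is going on.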
How do we choose the number of layers and nodes in the network above? That’s part of what data scientists must figure out, and it’s not always a straightforward process. It can be a matter of repeated experimentation: a scientist builds a neural network that fits the training data and then tests it on other data to guard against overfitting. The process is inherently uncertain, which helps explain why data scientists sometimes give imprecise answers about how long a process might take.
Deep Learning Lessons
We’ve tried to keep this section as simple as possible, but we get it: not everybody loved algebra class, and for some readers, seeing so many equations causes an automatic synaptic shutdown. Here are the key business lessons to walk away with:
Deep Learning Is the Best Technique — Sometimes
For text and image data, deep learning techniques are currently the state of the art. But for tabular data, other, more traditional methods of machine learning can work just fine. In health data systems, for instance, basic logistic regression might be more than enough to solve a problem.
One way to understand this intuitively is to understand the concept of feature engineering. In traditional machine learning, data scientists need to actively select and test all the features they believe are important to a problem. For tabular data, represented by rows and columns, the features to choose from are built right into the data set. Feature engineering makes sense.
For text and image data, however, the relevant features, and the relationships between them, aren’t as obvious. How do you know what groups of pixels or words to engineer? This is why deep learning is a better approach: it doesn’t require as much feature engineering since the neural network figures out hidden features itself, layer by layer.
To bring William of Ockham’s famous Razor into the conversation, if you have two models that are equally good at describing the data, pick the simplest one.
Deeper Isn’t Always Better
A passing awareness of deep learning might lead you to believe that the more layers in a neural network, the better it will be. After all, if your understanding of a subject is deeper than mine, you are smarter about that subject.
A neural network that is too deep, however, is in danger of overfitting. Add layer after layer, and you can eventually get a network that fits your training data perfectly but poorly fits any new data it receives. To use a more apt human analogy, it’s like a person going so deep on a subject that they lose touch with the real world.
Deep Learning Needs A LOT of Data
If traditional machine learning is data-hungry, deep learning is data-ravenous. Remember that all of the layers of a neural network have different parameters, and the more parameters a network has, the more data it takes to train. Thanks to transfer learning, however, not all of that data needs to be specific to the problem at hand.
Sorry, No Magic
There is nothing magical, or even particularly smart, about neural networks. Don’t let the word “neural” confuse you into thinking that they come anywhere close to the flexibility and power of the human brain. With proper data and training, a neural network can expertly perform a very limited range of tasks. But if you give a dog/cat detector a picture of a person, it won’t know what it’s looking at. In fact, you could train a network on a billion pictures of cats, but, if none of those pictures were upside down, the network would be flummoxed by the first reversed cat it sees.
When data scientists work on a deep learning project, they need time to make alterations to attributes like nodes and layers, evaluate the model, and then make still more tweaks. They need time to try, fail, and try again. Ironically, it can be hard to predict how long a predictive model will take to build. The experimental nature of this process may surprise or frustrate executives used to leading traditional software development projects.
Robbie Allen is a Senior Advisor to Infinia ML, a team of data scientists, engineers, and business experts putting machine learning to work.