Machine learning is a lot like teenage sex
To play off of Dan Ariely, machine learning is a lot like teenage sex. Everybody talks about it. Only some really know how to do it. Everyone thinks everyone else is doing it. So, everyone claims they’re doing it.
What does it mean to train an algorithm? What steps does a neural network actually take to be able to predict something? Machine learning often leaves us with more questions than answers.
Although advances in computing power, data management, and cloud computing have made machine learning easy to apply, understanding it is still hard. There is little accessible, understandable material covering the mechanics behind machine learning. Here, we pop the hood on machine learning and explain its engine in simple terms.
In a jungle of terms such as supervised vs. unsupervised or regression vs. classification, it’s hard to cut through the jargon to figure out how everything works. However, beneath these complex layers lie a few ideas central to machine learning that can propel your understanding from the big picture to a deeper, more mechanical one.
Machine learning is concerned with finding a rule (think of a function) that predicts output y from input x, using parameters, also called weights, w. When we “train” an algorithm, we use our x, y, and w to define a cost function to minimize. Typically, the cost function is some variation of the difference between the predicted values and the actual values. This function defines how “wrong” our prediction is — which explains why it is called the cost function. Errors are costly! Because our x and y are known before training, we search for the w that takes us to the lowest point of the cost function. Finding this minimum locates the optimal w, the one that produces predictions as close as possible to the true answers.
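To make this concrete, here is a minimal sketch in Python of a cost function for a one-predictor linear rule. The names `predict` and `cost` are our own illustrative choices, not from any library:

```python
def predict(w, x):
    """Our rule: predicted y from input x, using weights w = (intercept, slope)."""
    intercept, slope = w
    return intercept + slope * x

def cost(w, xs, ys):
    """Mean squared difference between predictions and the actual values."""
    errors = [(predict(w, x) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

# A perfect fit costs nothing; a wrong guess costs more.
xs, ys = [1, 2, 3], [3, 5, 7]   # data generated by y = 1 + 2x
print(cost((1, 2), xs, ys))     # 0.0 -- the true weights
print(cost((0, 0), xs, ys))     # much larger
```

Training amounts to searching over `w` for the value that makes `cost` as small as possible.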
A Basic Machine Learning Example
Let’s take a simple example of linear regression, taught in most basic statistics classes. Think scatterplots. When we only have one predictor, the function we are trying to fit is of the form y = intercept + slope*x. We have data, x, and the value we are trying to predict, y. Our cost function in this case is the sum of the squared differences between the predicted points and their true values. Conceptually, we want to fit a line through a cloud of points so that this sum is as small as possible. The line is ultimately defined by w, which here represents the slope and intercept. Essentially, we find the intercept and slope that give us the lowest error.
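In this simple case, the intercept and slope that minimize the sum of squared differences can be computed directly with the standard least-squares formulas. A minimal sketch (the function name is our own):

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns the (intercept, slope)
    minimizing the sum of squared differences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 1 + 2x
print(fit_line(xs, ys))    # (1.0, 2.0)
```

The fact that we can solve this with a formula is the exception, not the rule, as the next section explains.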
This example, a 2-dimensional linear regression, is one of the simplest machine learning cases. In reality, other machine learning concepts like regularization or neural networks create considerably more complex scenarios, which often makes finding the minimum of the cost function much harder. These problems create cost functions which can’t be minimized by hand — we need to use an algorithm to approximate the minimum of the cost function. One of the most common algorithms to do this is called stochastic gradient descent (SGD). SGD follows a path to find the minimum of a function, in our case, our cost function.
To start, we define an initial set of parameters. They can be anything. For linear regression, say the slope and intercept both start at zero, so the initial equation of the line is 0 + 0*Study Time. We then sum the squared differences to get an estimate of the cost. From this point, we find the gradient, the multivariable version of the derivative. We travel in the direction of the negative gradient — it means we are going “downhill”. Downhill means we are approaching a minimum. We repeat this process until we hit what appears to be a minimum value. The stochastic part of stochastic gradient descent helps ensure that we don’t get caught in a “local” minimum, i.e., a valley that isn’t the actual lowest point of the function. The algorithm does so by using only a random (hence, stochastic) portion of the data at each iteration to calculate the gradient.
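The steps above can be sketched in a few lines for our linear regression. This is an illustrative implementation, not a production one; the learning rate, batch size, and epoch count are arbitrary choices of ours:

```python
import random

def sgd_fit(xs, ys, lr=0.01, epochs=2000, batch=2, seed=0):
    """Stochastic gradient descent for y = intercept + slope * x.
    Each step uses a random mini-batch (the 'stochastic' part)."""
    rng = random.Random(seed)
    intercept, slope = 0.0, 0.0    # start anywhere; zeros are fine
    data = list(zip(xs, ys))
    for _ in range(epochs):
        sample = rng.sample(data, batch)
        # Gradient of the mean squared error over the mini-batch
        g_int = sum(2 * (intercept + slope * x - y) for x, y in sample) / batch
        g_slope = sum(2 * (intercept + slope * x - y) * x for x, y in sample) / batch
        # Step "downhill": against the gradient
        intercept -= lr * g_int
        slope -= lr * g_slope
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]              # y = 1 + 2x
b0, b1 = sgd_fit(xs, ys)
print(round(b0, 2), round(b1, 2))  # close to 1 and 2
```

Notice that we never solve any equation; we just repeatedly nudge the weights in the direction that lowers the cost.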
What about neural networks?
These concepts are great and handy, but what do they have to do with the much-sought-after neural network (or, even better, a deep neural network)? Understanding the already half-century-old concept of neural networks requires exactly the concepts we learned above. Strides in computing and data collection have now made neural networks feasible, and companies are applying them in scenarios such as loan disbursement or scheduling meetings.
Many articles paint picturesque descriptions of neural networks as mathematical representations of the neurons in our brains. Do you know how neurons work? Most people don’t. The comparison isn’t very useful for understanding neural networks, so let’s go back to our machine learning building blocks. At the basic level, neural networks consist of three kinds of layers: an input layer, one or more hidden layers, and an output layer.
Let’s say we want to predict how long it takes for someone to get to work. We decide to use two variables to figure this out: how far the person lives from work and the number of cars on the road when they leave. We would structure our network as shown below: two input nodes, a single hidden layer of three nodes, and one output node. Choosing three nodes in one hidden layer is a matter of setting hyperparameters — numbers that define the structure of our network, such as the number of hidden layers or the number of nodes in each, before we train it. Training a neural network doesn’t find these numbers, and finding the right set of hyperparameters is often difficult. Additionally, the more complex the network, the more complex the functions it can model.
First, our data comes in through the input layer. Then a weight is applied to it, and the result is pushed through an activation function. An activation function simply takes inputs and defines an output based on them. Setting all input values to 1 is an activation function. Multiplying every input by 2 is an activation function. There are, of course, far more sophisticated activation functions.
A commonly used activation function is the sigmoid, the “S”-shaped function applied in our single hidden layer. In this case, we apply the sigmoid function to all of the data that flows through each node. Once we have calculated everything, we apply another set of weights to compute our predictions. Thus, our output (our prediction) is determined by our data and our weights.
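Putting the pieces together, a single forward pass through a 2-input, 3-hidden-node, 1-output network might look like the sketch below. The weights are made-up numbers for illustration; training would find real ones:

```python
import math

def sigmoid(z):
    """The S-shaped activation: squashes any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: inputs -> weighted sums -> sigmoid -> weighted sum -> prediction."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# Hypothetical weights -- training, not us, should choose these.
x = [10.0, 30.0]    # miles from work, cars on the road
w_hidden = [[0.1, 0.02], [-0.05, 0.01], [0.2, -0.03]]
b_hidden = [0.0, 0.0, 0.0]
w_out = [12.0, 8.0, 15.0]
b_out = 5.0
print(forward(x, w_hidden, b_hidden, w_out, b_out))  # predicted commute time
```

The prediction is nothing mysterious: weighted sums and a squashing function, applied layer by layer.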
When we train our network, we are actually minimizing our cost function. We are saying that, given our data, there is some set of weights which minimizes our errors. The layers, combined with their weights and activation functions, define some rule. Armed with stochastic gradient descent and a defined cost, we can march down the hills to settle in the valley (the minimum) of our cost function. This determines our optimal set of weights. Knowing these weights, we can then make predictions on unseen “test” data to see how our model performs in the wild.
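That march down the hill can be sketched generically. The finite-difference gradient below is a simple stand-in for the calculus (real libraries compute gradients analytically), and the bowl-shaped toy cost is our own illustration:

```python
def numeric_grad(f, w, eps=1e-5):
    """Approximate the gradient of f at w by finite differences."""
    g = []
    for i in range(len(w)):
        w_hi = w[:]; w_hi[i] += eps
        w_lo = w[:]; w_lo[i] -= eps
        g.append((f(w_hi) - f(w_lo)) / (2 * eps))
    return g

def train(f, w, lr=0.1, steps=500):
    """March downhill on the cost f until we settle near a minimum."""
    for _ in range(steps):
        g = numeric_grad(f, w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Toy cost: a bowl whose lowest point is at w = [3, -2]
cost = lambda w: (w[0] - 3) ** 2 + (w[1] + 2) ** 2
print(train(cost, [0.0, 0.0]))  # approaches [3, -2]
```

Swap in a cost built from a network's forward pass and its training data, and this same loop is, conceptually, how the network's weights get found.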
Most machine learning algorithms work with the same basic building blocks: some mathematical “rule” and a cost are defined, and the cost is minimized to find the best weights, or parameters, to predict with. With endless data and easily accessible computing power, getting started with machine learning has never been easier.
Chances are, if you made it this far, you avoided teenage sex anyway.