Machine Learning 101

In this blog post we’ll briefly cover the following topics to give you a very basic introduction to machine learning:

  • What is machine learning?
  • Training machine learning models.
  • Optimising parameters.
  • Neural networks.

Don’t worry if you’re not an expert — the only knowledge you need for this blog post is basic high school maths.

What is machine learning?

The Oxford Dictionary defines Machine Learning as:

“The capacity of a computer to learn from experience”

The goal of machine learning is to come up with algorithms that can learn how to perform a certain task based on example data.

Here’s an example. Let’s say we want to write a program to play the game Go. We could write this program by manually defining rules on how to play the game. We might program some opening strategies and decision rules, for example that it’s better to capture a stone than not.

But there’s a problem. Programming these rules manually means that they can quickly become quite complex, and they are limited by the strategies we as programmers can come up with. A better solution is to build machine learning algorithms, which can learn how to play Go based on examples and experience, just like humans would. This is what DeepMind did with their AlphaGo program, a machine learning algorithm based on deep learning that turned out to be so good that it won against the (human) Go world champion.

Training machine learning models

Machine learning algorithms train models based on examples of labelled data. A machine learning algorithm typically defines a model with tunable parameters and an optimisation algorithm, as illustrated below. The model takes input in the form of data (x) and generates an output (y) based on the input data and its parameters. The optimisation algorithm tries to find the combination of parameters for which, given an example input x, the model’s output y is as close as possible to the expected output. The trained model then represents a specific function f that, given x, produces output y. So: y=f(x).

Pipeline of training a machine learning model.

Optimisation

There are many ways to find the best combination of parameters so that the output y of model f is as close to the expected output as possible given input x. One way would be to try out all possible combinations of parameters and select the combination that gives the best results. This might work if there are only a limited number of parameter combinations, but for typical machine learning models that have thousands or even millions of parameters, it’s completely impractical. Luckily (and thanks to the invention of 17th-century mathematician Newton), there’s a much better way of finding the optimal solution for some types of models.
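To get a feel for why trying every combination breaks down, here is a minimal Python sketch. The grid of 100 candidate values per parameter is an assumption chosen purely for illustration, and count_combinations is a hypothetical helper name.

```python
def count_combinations(values_per_parameter: int, n_parameters: int) -> int:
    """Number of parameter settings an exhaustive search would have to try."""
    return values_per_parameter ** n_parameters


# Assumption: we only try 100 candidate values for each parameter.
for n_parameters in (1, 2, 3, 10):
    total = count_combinations(100, n_parameters)
    print(f"{n_parameters:>2} parameters -> {total:,} combinations")
```

Even on this coarse grid, a model with just ten parameters already has 10^20 combinations to evaluate.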

Newton and Leibniz — https://xkcd.com/626/

Newton’s invention is the derivative (its generalisation to several parameters is known as the gradient). The derivative of a function describes how the function changes with respect to one of its parameters, and its sign tells us in which direction the function increases. If we have a function f that has parameter p, then the change, df, of the function f with respect to the change, dp, of the parameter p is written as df(p)/dp.

Derivative (gradient) df(p)/dp of f(p) = p⋅sin(p^2) for different values of p.
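For the function in the figure, the product and chain rules give df(p)/dp = sin(p^2) + 2⋅p^2⋅cos(p^2). The sketch below compares that formula with a numerical estimate of the slope. The helper numerical_derivative and its step size h are assumptions added for illustration.

```python
import math


def f(p: float) -> float:
    """The example function from the figure: f(p) = p * sin(p^2)."""
    return p * math.sin(p ** 2)


def df_dp(p: float) -> float:
    """Analytic derivative of f, obtained with the product and chain rules."""
    return math.sin(p ** 2) + 2 * p ** 2 * math.cos(p ** 2)


def numerical_derivative(func, p: float, h: float = 1e-6) -> float:
    """Central-difference estimate of the slope of func at p."""
    return (func(p + h) - func(p - h)) / (2 * h)


for p in (0.0, 0.5, 1.0, 1.5):
    analytic = df_dp(p)
    numeric = numerical_derivative(f, p)
    print(f"p={p:.1f}  analytic={analytic:+.4f}  numerical={numeric:+.4f}")
```

The two columns should agree to several decimal places, which is a handy sanity check whenever you derive a derivative by hand.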

So how can this derivative be used to make the model’s optimisation more efficient? Assume that we have some data (x, t) so that input x corresponds to target t. This data is plotted as follows:

Labelled data (x,t)

If we now want to create a model that best approximates target t for given input x for all given examples, then we can try to fit a straight line through the origin (this is also known as linear regression). This straight line can be represented by the function y=f(x) with f(x)=p⋅x where p is the only parameter of the model (note that p represents the slope of the line). This model can be represented visually as:

Representation of our model y=f(x)

To find the parameter p for which y=x⋅p is as close as possible to t for all given examples (x,t), we have to define a measure of “closeness” in a mathematical way. This measure is known as a cost function. A typical cost function for this problem sums the squared differences between target t and model output y, |t-y|², over all examples (x,t). The final cost function becomes ∑|t - (x⋅p)|², where the sigma represents the sum. Because this example is quite simple, we can visualise this cost function for a range of values of the parameter p:

Cost function for our example.
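As a rough sketch, the model y=x⋅p and its cost ∑|t - (x⋅p)|² could be written in Python as follows. The data points are made up for illustration and are not the data plotted in this post, and the function names predict and cost are my own.

```python
def predict(x: float, p: float) -> float:
    """Model output y = x * p: a straight line through the origin with slope p."""
    return x * p


def cost(p: float, xs: list, ts: list) -> float:
    """Sum of squared differences between targets t and outputs y = x * p."""
    return sum((t - predict(x, p)) ** 2 for x, t in zip(xs, ts))


# Made-up data for illustration, not the data plotted in this post.
xs = [0.1, 0.3, 0.5, 0.7, 0.9]
ts = [0.3, 0.5, 1.1, 1.3, 1.9]

for p in (0.0, 1.0, 2.0, 3.0):
    print(f"p={p:.1f}  cost={cost(p, xs, ts):.3f}")
```

For this made-up data the cost is lowest near p=2 among the values tried, which mirrors the bowl shape of the cost curve.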

To find the best parameter p we need to minimise the cost function ∑|t - (x⋅p)|², so that the outputs y are as close to the targets t as possible. We could try out many candidate values of p, compute the cost summed over all samples (x,t) for each one, and select the value of p with the lowest cost. That is possible for a single parameter, but it quickly becomes infeasible as the number of parameters grows.

This is where the derivative comes into play. With the derivative, we can simply pick a random starting value for p and repeatedly step in the direction opposite to the derivative, moving towards the lowest point of the cost function. This process of descending along the derivative (gradient) is known as gradient descent. It is illustrated below, where we start at p=0.3 and follow the gradient for 12 steps while improving the fit of the model to the data (line fitted in the right figure). We stop fitting the model when the cost no longer decreases much; the final parameter found is p=1.94, with a cost of 0.451. Note that the final line fits the data (x,t) much better than the initial line.

Gradient descent optimisation.
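Here is a minimal sketch of gradient descent for the y=x⋅p model. It uses the fact that the derivative of the cost ∑|t - (x⋅p)|² with respect to p is ∑2⋅x⋅(x⋅p - t). The starting value p=0.3 and the 12 steps follow the description above, but the data and the learning rate of 0.1 (how far we move on each step) are assumptions, so the numbers it prints will not match the 1.94 and 0.451 from the figure.

```python
def cost(p, xs, ts):
    """Sum of squared errors for the model y = x * p."""
    return sum((t - x * p) ** 2 for x, t in zip(xs, ts))


def cost_gradient(p, xs, ts):
    """Derivative of the cost with respect to p: sum of 2 * x * (x * p - t)."""
    return sum(2 * x * (x * p - t) for x, t in zip(xs, ts))


def gradient_descent(xs, ts, p_start=0.3, learning_rate=0.1, n_steps=12):
    """Repeatedly step against the gradient, starting from p_start."""
    p = p_start
    for step in range(n_steps):
        # Move in the direction opposite to the derivative to reduce the cost.
        p -= learning_rate * cost_gradient(p, xs, ts)
        print(f"step {step + 1:>2}: p={p:.3f}  cost={cost(p, xs, ts):.4f}")
    return p


# Made-up data for illustration, not the data plotted in this post.
xs = [0.1, 0.3, 0.5, 0.7, 0.9]
ts = [0.3, 0.5, 1.1, 1.3, 1.9]

p_fitted = gradient_descent(xs, ts)
```

A fixed number of steps keeps the sketch short. A stopping rule based on how much the cost still decreases, as described above, would be a natural refinement.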

Neural networks

This, in essence, is what happens if we train a neural network model. However, more typical neural network models are made up of much more complicated functions than our y=x⋅p model. There is a large variety of neural network models, but typically they are all differentiable and can be optimised with gradient descent as we’ve illustrated in this blog post.

A typical neural network used in computer vision, for example, will consist of multiple layers. Each layer will have hundreds or thousands of parameters and will be followed by a nonlinear function. Having multiple layers in a neural network is where the term “Deep Learning” comes from. The benefit of using multiple layers in the model is that each layer can use the information extracted in the previous layer to build up a more complex representation of the data. It’s because of this that neural networks have been shown to be so powerful, successfully trained to recognise cats in videos, recognise speech, and even play Atari video games.
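To make the idea of stacked layers a bit more concrete, here is a tiny, untrained sketch in plain Python. The layer sizes, the tanh nonlinearity and the random initial parameters are all assumptions chosen for illustration; real networks are far larger and are usually built with a dedicated library.

```python
import math
import random


def dense_layer(inputs, weights, biases):
    """One layer: a weighted sum per output unit, followed by a tanh nonlinearity."""
    return [
        math.tanh(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def random_layer(n_inputs, n_outputs, rng):
    """Randomly initialised parameters for one layer (untrained)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_outputs)]
    biases = [0.0] * n_outputs
    return weights, biases


def forward(inputs, layers):
    """Feed the input through every layer; each layer sees the previous layer's output."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


rng = random.Random(0)  # fixed seed so the run is repeatable
layers = [random_layer(3, 4, rng), random_layer(4, 2, rng)]  # 3 inputs -> 4 hidden -> 2 outputs
output = forward([0.5, -0.2, 0.1], layers)
print(output)
```

Every weight and bias here is a parameter like p in our line-fitting example, and training would adjust all of them with gradient descent.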

If you’d like to play around with small neural network examples, try Google’s TensorFlow Playground. If you’re more technically minded and would like to learn more, you can try to implement your own models with the help of my tutorial on how to implement neural networks.