Machine Learning 101

In this blog post we’ll briefly cover the following topics to give you a very basic introduction to machine learning:

• What is machine learning?
• Training machine learning models.
• Optimising parameters.
• Neural networks.

Don’t worry if you’re not an expert — the only knowledge you need for this blog post is basic high school maths.

What is machine learning?

The Oxford Dictionary defines Machine Learning as:

“The capacity of a computer to learn from experience”

The goal of machine learning is to come up with algorithms that can learn how to perform a certain task based on example data.

Here’s an example. Let’s say we want to write a program to play the game Go. We could write this program by manually defining rules on how to play the game. We might program some opening strategies and decision rules — that it’s better to capture a stone than not, for example.

But there’s a problem. Rules programmed by hand quickly become quite complex, and they are limited to the strategies we as programmers can come up with. A better solution is to use machine learning: the algorithm learns how to play Go from examples and experience, just as humans do. This is what DeepMind did with their AlphaGo program, a machine learning algorithm based on deep learning that turned out to be so good it won against the (human) Go world champion.

Training machine learning models

Machine learning algorithms train models based on examples of labelled data. A machine learning algorithm typically defines a model with tunable parameters and an optimisation algorithm, as illustrated below. The model takes input in the form of data (x) and generates an output (y) based on the input data and its parameters. The optimisation algorithm searches for the combination of parameters that brings the model’s output y as close to the expected output as possible for each example x. The trained model represents a specific function f that, given x, produces output y. So: y=f(x).
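The setup above can be sketched in a few lines of Python. The tiny two-parameter model, the squared-error measure, and the example values are purely illustrative:

```python
# A model produces an output y from input x and its tunable parameters;
# training searches for the parameters that make y match the target.
def model(x, params):
    a, b = params  # two tunable parameters: slope and offset
    return a * x + b

def squared_error(y, target):
    # a simple measure of how far the output is from the expected output
    return (y - target) ** 2

# one labelled example: input x = 2.0 with expected output t = 5.0
x, t = 2.0, 5.0
params = (1.0, 0.0)  # an untrained starting guess
err = squared_error(model(x, params), t)  # 9.0 before any optimisation
```

An optimisation algorithm would then repeatedly adjust `params` to drive this error down.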

Optimisation

There are many ways to find the combination of parameters for which the output y of model f is as close to the expected output as possible for a given input x. One way would be to try out all possible combinations of parameters and select the combination that gives the best results. This might work if there is only a limited number of parameter combinations, but for typical machine learning models, which have thousands or even millions of parameters, it’s completely impractical. Luckily (and thanks to an invention of the 17th-century mathematician Newton), there’s a much better way of finding the optimal solution for some types of models.
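Trying out combinations can be illustrated with a brute-force search over a one-parameter model. The data values and the candidate grid below are made up for the example:

```python
# Brute-force parameter search: evaluate a grid of candidate values for
# the single parameter p and keep the one with the lowest cost.
xs = [1.0, 2.0, 3.0]
ts = [2.1, 3.9, 6.2]  # targets that roughly follow t = 2 * x

def cost(p):
    return sum((t - p * x) ** 2 for x, t in zip(xs, ts))

candidates = [i / 10 for i in range(41)]  # p = 0.0, 0.1, ..., 4.0
best = min(candidates, key=cost)          # best slope on this grid: 2.0
```

With one parameter this is 41 evaluations; with thousands of parameters the grid explodes exponentially, which is why this approach doesn’t scale.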

That invention of Newton is the derivative (generalised to functions of several parameters as the gradient). The derivative of a function describes how the function changes with respect to one of its parameters, and its sign tells us in which direction the function increases. If we have a function f with parameter p, then the change df of the function f with respect to a change dp of the parameter p is written df(p)/dp.
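A quick way to see the derivative at work is to approximate it numerically with a small step dp. The example function f(p) = p² below is just an illustration:

```python
# Approximate the derivative df(p)/dp with a small finite step dp.
def f(p):
    return p * p  # example function: f(p) = p^2, so analytically df/dp = 2p

def derivative(f, p, dp=1e-6):
    return (f(p + dp) - f(p)) / dp

d = derivative(f, 3.0)  # roughly 6, matching the analytic 2 * 3
```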

So how can this derivative be used to make the model’s optimisation more efficient? Assume that we have some data (x, t) so that input x corresponds to target t. This data is plotted as follows:

If we now want to create a model that best approximates target t for given input x for all given examples, then we can try to fit a straight line through the origin (this is also known as linear regression). This straight line can be represented by the function y=f(x) with f(x)=p⋅x where p is the only parameter of the model (note that p represents the slope of the line). This model can be represented visually as:
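The one-parameter model y = f(x) = p⋅x is a one-liner in Python; the slope value below is just an example:

```python
# Linear regression through the origin: the single parameter p is the
# slope of the line, and the model output is y = p * x.
def f(x, p):
    return p * x

y = f(2.0, 1.5)  # with slope p = 1.5, input x = 2.0 gives y = 3.0
```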

To find the parameter p so that y=x⋅p is as close to t as possible for all given examples (x,t), we have to define a measure of “closeness” in a mathematical way. This measure is also known as a cost function. A typical cost function for this problem sums, over all examples (x,t), the squared differences between target t and model output y: |t-y|². The final cost function becomes ∑|t - (x⋅p)|², where the sigma (∑) represents the sum over all examples. Because this example is quite simple, we can actually visualise this cost function easily for all parameters p:
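The cost function ∑|t - (x⋅p)|² can be evaluated directly for candidate slopes p. The data points below are made up so that the best slope sits near p = 1:

```python
# Cost function: the sum of squared differences |t - x * p|^2 over all
# examples (x, t), as a function of the single parameter p.
xs = [1.0, 2.0, 3.0, 4.0]
ts = [1.2, 2.1, 2.9, 4.1]  # targets close to the line t = x

def cost(p):
    return sum(abs(t - x * p) ** 2 for x, t in zip(xs, ts))

# evaluating a few candidate slopes shows a minimum near p = 1
costs = {p: cost(p) for p in (0.5, 1.0, 2.0)}
```

Plotting `cost(p)` over a range of p values gives exactly the bowl-shaped curve described in the text.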