Introduction to Machine Learning Pt 1
written by Stephen Wilson
This is the first blog post in a series (brought to you by Scout24’s Data Science team) with the aim of providing some knowledge and understanding surrounding data science and machine learning. As data scientists at Scout24 we work closely with market segments across our real-estate and automotive platforms to build machine learning solutions such as personalised recommendations and price validation, with the overall goal of providing relevant and useful content that inspires your best decisions (we are in the business of connecting people, cars and homes). In the course of our work, we encounter lots of people who are really interested in finding out more about machine learning and what advantages it can bring to business, but who don’t know where to start or who find some of the introductory material a little intimidating.
This initial blog series is for them and is intended to give a gentle introduction to some of the core principles and concepts involved (hopefully) without getting overly technical, but still with enough detail to get the intuition underlying them. Over the next few posts we’ll cover:
- what machine learning is
- supervised and unsupervised machine learning techniques
- worked-out toy examples of some machine learning problems to give an intuitive understanding of what is involved
We’ll also provide some links to some further resources, for those who want more.
MACHINE LEARNING: WHAT IS IT?
A popular definition you will probably come across if you Google “what is machine learning” is that it is the “field of computer science that allows a system to learn from data without being explicitly programmed”. But what exactly does this mean? A system, in this context, is a computer or a network of computers (the “machines” in “machine learning”). When we speak of “learning from data” we usually mean recognising some pattern or structure in a data set. Imagine someone showed you a mixed collection of images from AutoScout24 and ImmobilienScout24 without telling you where they came from. It probably wouldn’t take you long to recognise a pattern and realise that the images fell into two broad categories: cars and homes.
Machine learning describes how we can get computers to uncover similar patterns in data without having to give explicit task-specific instructions (i.e. without having to program rules like “if the image shows an object with wheels, it is a car”).
Here’s another example: At ImmobilienScout24 we have many listings for apartments for sale. Each listing contains a price as well as a description of the apartment in terms of a number of features, e.g. the apartment size, its location, the number of rooms and so on. We might ask: Is it possible for a machine learning system to automatically predict the listing price of an apartment if we provide only the descriptive data about apartment size, location etc as input?
Yes, it turns out that this is possible. In order to make this happen we need to prepare a data set. Here is a small toy example showing two property listings which gives an idea of how a portion of such a data set might look.
Now suppose that we have some new listings from some new customers, but they don’t know how much they should list their property for on ImmobilienScout24. We might like to provide a service to them that can suggest a listing price that is reasonably accurate with regard to the current market.
To do this we can provide the dataset to the machine learning system along with some general instructions to automatically learn the relationship between input features like size, location, number of rooms and the target output of the listing price.
The general instructions we provide will usually include:
- a starting point so the system knows how to begin
- which columns in the data set are to be used as the inputs
- which column is the target output
- a way for the system to check if it has made any errors
- a way for the system to correct any errors in such a way so that they become smaller and smaller with each iteration
- a way for the system to know when to stop
Usually when the size of the errors no longer changes from iteration to iteration we can say that the system has found the best possible fit for the data and the machine learning task has completed.
Because the instructions we provide to the system are broad and generic, we don’t need to explicitly program it with rules that are specific to the task we are trying to solve. This means that we can apply the same approach to a wide range of different problems. The generic nature of this approach is one of the most powerful aspects of machine learning.
TYPES OF MACHINE LEARNING PROBLEMS
The listing price prediction problem given above is an example of what is called “supervised learning”. Supervised learning is when the task is to learn a mapping from a set of input features to some output variable. In our example, the input is the set of features describing each apartment (size, rooms, location) and the output is the listing price. The data we want the system to learn from contains both the input and output and the learning task is guided (“supervised”) by this because the system can check predictions it makes against the “ground truth” listing prices we provide, make small adjustments and iteratively improve.
Another example of supervised learning might involve a task where we would like to learn the mapping from a set of input features to an output variable that can only take on a fixed number of values. At ImmobilienScout24 we often want to be able to look at how someone interacts with the platform and whether they belong to a specific user group or not. For instance, we might like to be able to automatically predict the probability that someone on our platform is a homeowner or not, so that we can provide a more personalised service and suggest products and services to them that they might find useful.
Again, we first need to prepare a dataset. It will take the same basic form as the one in the listing price example above, with rows and columns. For this task, however, each row would correspond to a single user on ImmobilienScout24. The input columns could represent the behaviour of a user on the platform over several sessions (what pages they visit, what they click on, whether they request additional information) and the output column could take on two fixed values: “homeowner” or “not a homeowner”.
If the output of a supervised task is a continuous variable (a variable that can take on any value) such as the listing price, the technique we apply to learn the mapping between input and output is called regression. If the output can only take on fixed, discrete values (homeowner/non-homeowner, 1/0) then the technique we apply is called classification.
In contrast to supervised learning, there is also “unsupervised learning”, which describes machine learning tasks where we do not have labelled output. The task is therefore not to learn a mapping between a set of inputs and outputs, but rather to uncover structure and relationships in the data. A common application of unsupervised learning is to automatically divide a dataset into groups, based on how similar the data are to each other. We might like, for example, to identify groups of users based on what type of AutoScout24 listings they view. Again, we need to prepare a dataset. Each row in the dataset corresponds to the viewed listing history for a single user. However, unlike the supervised learning examples above, this time we don’t know beforehand to which group a user belongs, so we can’t provide a final column containing a label. Instead, we can apply one of several techniques collectively called “clustering”, which allow us to measure the similarity between the rows in the dataset and group the most similar ones together into subgroups or “clusters”.
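To make the idea of grouping similar rows a little more concrete, here is a minimal sketch. It uses k-means, which is one common clustering technique; the post does not say which technique is used in practice, so treat that choice as an assumption. The data are also made up: each row is a hypothetical user, described by how many city-car listings and how many SUV listings they viewed.

```python
import math

def distance(a, b):
    """Straight-line (Euclidean) distance between two rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(rows, centroids, iterations=10):
    """Group rows around the nearest centroid, then move each centroid
    to the average of its group, and repeat a fixed number of times."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for row in rows:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(row, centroids[i]))
            clusters[nearest].append(row)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters

# Hypothetical data: [city-car listings viewed, SUV listings viewed] per user.
rows = [[9, 1], [1, 9], [8, 2], [2, 8]]

# Use the first two rows as starting centroids (an arbitrary choice).
clusters = k_means(rows, centroids=[rows[0], rows[1]])
print(clusters)
```

Note that no label column is supplied anywhere: the two groups emerge purely from how close the rows are to each other.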
Ok, now that we have outlined some of the broad types of machine learning problems, let’s look at some more detailed examples so that you can get the intuition behind each of them. We’ll return to the listing price problem above and describe what regression involves.
A REGRESSION EXAMPLE
Recall that as we are trying to predict a continuous value (the price) this is an example of regression. In fact, this specific type of problem is usually called a linear regression, so-called because we assume that there is a linear relationship between the inputs and output. To keep things simple, in this example we’ll just consider a single input (the size of the apartment in square metres) and use this to try to predict the listing price.
The table below shows a constructed dataset that we can use as an example to illustrate some of the principles involved.
When we plot the apartment size along the x-axis and the listing price on the y-axis we can see that there does appear to be some sort of relationship in the data. As the size of the apartment increases, the listing price also seems to increase. If there were a perfect linear relationship between the input and output, then you would be able to draw a straight line through all the data points. That is clearly not the case in this example. The goal of linear regression is to find the best fitting line through the data, such that all data points are as close as possible to it.
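Now for a quick refresher. You might remember from school that we can write the equation for a line as: y = b + mx, where m is the slope of the line and b is the point where it crosses the y-axis (the intercept).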
Knowing the equation of a line means that we can calculate the value of y if we are given x (and vice versa). If the equation for the line is y=3+2x then we can work out that when x is 2, y is 7.
y = 3 + 2x
= 3 + 2 * 2
= 3 + 4
= 7
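The same calculation can be written as a couple of lines of Python. This is only an illustrative sketch; the function names are made up for this example.

```python
def line(x, b, m):
    """Return y for the straight line y = b + m * x."""
    return b + m * x

def solve_for_x(y, b, m):
    """The 'vice versa' direction: recover x from y (m must not be zero)."""
    return (y - b) / m

# Intercept b = 3 and slope m = 2, as in the worked example above.
y = line(2, b=3, m=2)
print(y)                             # 7
print(solve_for_x(y, b=3, m=2))      # 2.0
```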
The principle behind linear regression is the same. In a regression problem we assume that the output can be approximated by a linear combination of the inputs plus some constant. We further assume that there will be some irreducible error that we need to take into account as well, so we add that to our equation:
listing price = (some constant value) + (apartment size * some weight) + error
or a little more formally as:
Y = β₀ + β₁ * X + ϵ
where Y denotes our target variable, X our input variable and β₀ and β₁ represent the model weights or coefficients.
The goal in our linear regression problem is to find values β₀ and β₁ such that when we apply them to Xᵢ (the size of a single apartment) we can predict a value for Yᵢ (the listing price) that is as close as possible to the actual listing price in our training data. The role of the error term and what it can mean for your model is fascinating, but beyond the scope of an introductory blogpost. For now, you can understand that our model is an approximation of the real world and not a perfect one. The error term represents what we don’t know about the relationship between our input and output.
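As a rough sketch of what “applying the weights” means, the snippet below predicts a price from the apartment size and then looks at what is left over. The weights, sizes and prices are all hypothetical values chosen for illustration; they are not taken from real listings.

```python
def predict_price(size_sqm, beta_0, beta_1):
    """Predicted listing price: constant + (apartment size * weight)."""
    return beta_0 + beta_1 * size_sqm

# Hypothetical weights, chosen only for illustration.
beta_0 = 50_000.0   # constant value
beta_1 = 3_000.0    # weight per square metre

# Hypothetical apartments: sizes in square metres and actual listing prices.
sizes = [40, 65, 90]
actual_prices = [180_000, 240_000, 320_000]

predictions = [predict_price(s, beta_0, beta_1) for s in sizes]
print(predictions)   # [170000.0, 245000.0, 320000.0]

# Whatever the line does not explain is left over as error.
leftover = [y - y_hat for y, y_hat in zip(actual_prices, predictions)]
print(leftover)      # [10000.0, -5000.0, 0.0]
```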
So how do we find the values for β₀ and β₁? Recall that we outlined the basic principles of supervised learning already. We need:
- a starting point for the system so that it can begin to make predictions,
- a way to measure the error between the predictions when compared against the “ground truth” values,
- a mechanism to correct those errors so that the system improves with each iteration,
- a way to know when to stop.
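The four ingredients listed above can be sketched as a short loop. The post does not name the correction mechanism, so this sketch assumes gradient descent, which is one widely used option, and uses the mean squared error as the measure of error. The data, the learning rate and the tolerance are hypothetical, and the inputs are deliberately small numbers; with real sizes in square metres and prices in euros you would typically need to rescale the data or use a much smaller learning rate.

```python
def fit_line(xs, ys, learning_rate=0.05, tolerance=1e-12, max_iterations=10_000):
    """Fit y = beta_0 + beta_1 * x by gradient descent (an assumed technique)."""
    beta_0, beta_1 = 0.0, 0.0            # 1. a starting point
    n = len(xs)
    previous_error = float("inf")
    for _ in range(max_iterations):
        predictions = [beta_0 + beta_1 * x for x in xs]
        errors = [y - y_hat for y, y_hat in zip(ys, predictions)]
        mean_squared_error = sum(e * e for e in errors) / n   # 2. measure the error

        # 4. stop once the error no longer changes between iterations
        if abs(previous_error - mean_squared_error) < tolerance:
            break
        previous_error = mean_squared_error

        # 3. correct the weights: take a small step against the gradient
        gradient_0 = -2 * sum(errors) / n
        gradient_1 = -2 * sum(e * x for e, x in zip(errors, xs)) / n
        beta_0 -= learning_rate * gradient_0
        beta_1 -= learning_rate * gradient_1
    return beta_0, beta_1

# Hypothetical toy data that lies exactly on the line y = 3 + 2x.
xs = [1, 2, 3, 4]
ys = [5, 7, 9, 11]

beta_0, beta_1 = fit_line(xs, ys)
print(beta_0, beta_1)   # should end up very close to 3 and 2
```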
STARTING POINT FOR THE REGRESSION MODEL
As a starting point, we simply assign some random values to β₀ and β₁ (or set them to zero), plug in our input values and compute the corresponding predicted values. Side note: in machine learning the true values in the training set are usually denoted as y and the predicted values as ŷ, usually called “y hat”.
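Here is a small sketch of both starting options. The apartment sizes, the seed and the range of the random values are hypothetical choices made for this example.

```python
import random

def predict(size_sqm, beta_0, beta_1):
    """Predicted value (y hat) for one apartment."""
    return beta_0 + beta_1 * size_sqm

sizes = [45, 60, 85]   # hypothetical apartment sizes in square metres

# Option 1: start with both weights set to zero.
zero_start = [predict(x, 0.0, 0.0) for x in sizes]
print(zero_start)      # [0.0, 0.0, 0.0]

# Option 2: start with small random weights (seeded so the run is repeatable).
rng = random.Random(42)
beta_0 = rng.uniform(-1.0, 1.0)
beta_1 = rng.uniform(-1.0, 1.0)
random_start = [predict(x, beta_0, beta_1) for x in sizes]
print(random_start)
```

Either way the first predictions will be poor. That is expected: they only give the system something to measure and correct.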
MEASURING THE ERROR
We can measure the size of the error between an individual prediction and the actual listing price by simply subtracting one from the other, to see how far off we are. However, we need a way to estimate how well the predictions are doing across the entire data set and not simply on an individual case-by-case basis.
To get an overall picture of how good the predictions are across the entire training set, we need a single metric that can measure this. One of the most common metrics used in linear regression to do this is called the root mean squared error which is a long name for a simple concept. And in fact, the long name is a good mnemonic for the steps to calculate the metric. Which is handy. If we take the final term in the name and work backwards, you can see clearly how to calculate this metric.
- Error: We have already touched on this. To find the error between an individual prediction and its actual price, we simply subtract one from the other: (y - ŷ)
- Squared: We square the error (multiply it by itself) to get rid of any negative sign that may be present after calculating the error. We do this because we don’t care whether the prediction is above or below the actual price. We just want to know how “far away” it is. Squaring the error will always give us a value that is never negative: (y - ŷ)²
- Mean: We add up all the squared errors in the training set and take the average, because we are trying to estimate how well the model’s predictions are doing across the entire data set and taking the average squared error gives us one way to do this. This operation can be represented using the notation (1/n) Σᵢ (yᵢ - ŷᵢ)², where n is the number of rows in the training set
- Root: Because we squared the error in step 2 to get rid of any negative sign, the result is in squared units. We take the square root as the final step to bring it back to the same units as the price, which gives us our final metric: RMSE = √( (1/n) Σᵢ (yᵢ - ŷᵢ)² )
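The four steps above translate almost word for word into code. This is a minimal sketch using only the Python standard library; the prices are hypothetical and given in thousands of euros to keep the numbers small.

```python
import math

def root_mean_squared_error(actual, predicted):
    errors = [y - y_hat for y, y_hat in zip(actual, predicted)]   # Error
    squared_errors = [e ** 2 for e in errors]                     # Squared
    mean_squared = sum(squared_errors) / len(squared_errors)      # Mean
    return math.sqrt(mean_squared)                                # Root

# Hypothetical listing prices in thousands of euros.
actual = [200, 300, 400, 500]
predicted = [198, 302, 398, 502]

rmse = root_mean_squared_error(actual, predicted)
print(rmse)   # 2.0
```

Every prediction here is off by exactly 2 (sometimes above, sometimes below), so the metric comes out as 2.0: on average the predictions are about two thousand euros away from the actual prices.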
I am one of the Data Scientists in Residence and work mainly for Scout24’s real estate platform ImmobilienScout24. I have a PhD in Computer Science and a background in computational linguistics and speech processing.