What is linear regression and how to use it with Python Scikit
Linear regression is a supervised (teacher-based) learning algorithm. This means we work with our model in two stages:
- We train our model on a dataset with known answers (and test it to estimate how well the model performs).
- Then we use the trained model to predict values for a dataset with unknown answers.
How linear regression works
The idea of linear regression is pretty simple. Let’s imagine the following dataset (a simple set of x and y values):
Now what if we tried to draw a line that is as close as possible to all of our points? We’ll end up with something like this:
This red line is the best fit line for our set of points. It is basically a calculated function (e.g. y=2x+5). The process of finding this function is called linear (because we get a line as a result) regression (because we simplify the set of points to a function).
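To make the idea of a best fit line concrete, here is a minimal sketch using NumPy. The five sample points are made up for illustration (they are not the dataset from the picture), and the fit uses ordinary least squares, which picks the line with the smallest sum of squared vertical distances to the points.

```python
import numpy as np

# Hypothetical sample points scattered around y = 2x + 5
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([5.1, 6.9, 9.2, 10.8, 13.1])

# A degree-1 polynomial fit is a least-squares straight line
slope, intercept = np.polyfit(x, y, 1)

print(f"best fit line: y = {slope:.2f}x + {intercept:.2f}")
```

The recovered slope and intercept land close to 2 and 5, the function the sample points were scattered around.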
Now, if we calculate the y values again for our known points, we get values on the best fit line:
Yes, we will see some error (its size depends on the quality of our model). But instead of a limited set of points we now have a continuous line that allows us to predict y values for new x coordinates. And that is exactly what we need, to predict new values:
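As a rough sketch of both steps, the snippet below recomputes y for the known points, looks at the error, and then evaluates the line at x coordinates that were never in the dataset. The sample points and the new x values are illustrative assumptions.

```python
import numpy as np

# Hypothetical sample points scattered around y = 2x + 5
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([5.1, 6.9, 9.2, 10.8, 13.1])
slope, intercept = np.polyfit(x, y, 1)

# Values on the best fit line for the known x coordinates
y_fitted = slope * x + intercept

# Error: how far each original point is from the line
errors = y - y_fitted
print("errors:", errors)

# Predictions for x coordinates that were not in the dataset
x_new = np.array([5.0, 10.0])
y_new = slope * x_new + intercept
print("predictions:", y_new)
```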
Using linear regression in Python
Prepare dataset
Let’s generate 100 points based on a known function with random deviations to simulate “randomness”:
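One possible way to build such a dataset is sketched below. The known function y = 2x + 5, the noise range of ±10 and the fixed seed are all assumptions chosen for illustration.

```python
import numpy as np

# Fixed seed (an assumption) so the example is reproducible
rng = np.random.default_rng(42)

# 100 x coordinates: 0, 1, ..., 99
x = np.arange(100, dtype=float)

# Known function y = 2x + 5 plus a random deviation in [-10, 10)
y = 2 * x + 5 + rng.uniform(-10, 10, size=100)

print(x[:5], y[:5])
```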
Create and train model
Smart people usually take 75% of the dataset to actually train the model and leave 25% to estimate its quality. Let’s create and train a model on the first 75 points of our dataset:
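A minimal sketch of this step with scikit-learn's `LinearRegression` follows. The dataset is regenerated here so the snippet runs on its own; the function y = 2x + 5, the noise range and the seed are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative dataset: y = 2x + 5 with random deviations (assumed values)
rng = np.random.default_rng(42)
x = np.arange(100, dtype=float)
y = 2 * x + 5 + rng.uniform(-10, 10, size=100)

# scikit-learn expects features as a 2D array: (n_samples, n_features)
X = x.reshape(-1, 1)

# First 75 points for training, remaining 25 for testing
X_train, y_train = X[:75], y[:75]
X_test, y_test = X[75:], y[75:]

model = LinearRegression()
model.fit(X_train, y_train)

# The learned line: slope in coef_, offset in intercept_
print(f"y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}")
```

Because the points are ordered by x, taking the first 75 means the test points all lie to the right of the training range. For real data a shuffled split, for example with `train_test_split` from `sklearn.model_selection`, is often a better choice.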
Predict
To predict values we can now use the predict() method of our trained model:
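The call could look like the sketch below, which repeats the assumed dataset and training step so it is runnable on its own. `predict()` takes the same 2D feature layout that was used for training.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative dataset: y = 2x + 5 with random deviations (assumed values)
rng = np.random.default_rng(42)
x = np.arange(100, dtype=float)
y = 2 * x + 5 + rng.uniform(-10, 10, size=100)
X = x.reshape(-1, 1)

# Train on the first 75 points
model = LinearRegression().fit(X[:75], y[:75])

# Predict y for the 25 points the model has not seen
X_test, y_test = X[75:], y[75:]
y_pred = model.predict(X_test)

# Compare a few predictions with the actual values
for actual, predicted in zip(y_test[:5], y_pred[:5]):
    print(f"actual: {actual:7.2f}  predicted: {predicted:7.2f}")
```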
To estimate the quality of our model, let’s plot our initial data points together with the predicted and actual values:
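A possible plotting sketch with matplotlib is shown below. The dataset, the seed and the output file name `linear_regression.png` are assumptions. The figure is saved to a file here, and in an interactive session `plt.show()` can be used instead.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Illustrative dataset: y = 2x + 5 with random deviations (assumed values)
rng = np.random.default_rng(42)
x = np.arange(100, dtype=float)
y = 2 * x + 5 + rng.uniform(-10, 10, size=100)
X = x.reshape(-1, 1)

# Train on the first 75 points, predict the last 25
model = LinearRegression().fit(X[:75], y[:75])
y_pred = model.predict(X[75:])

fig, ax = plt.subplots()
ax.scatter(x[:75], y[:75], color="green", label="train")
ax.scatter(x[75:], y[75:], color="blue", label="test (actual)")
ax.scatter(x[75:], y_pred, color="red", label="test (predicted)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Hypothetical output file name
fig.savefig("linear_regression.png")
```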
What we will see is:
Green points are our 75 training points. Blue points are the 25 actual values used for testing our model. And red points are the predicted values for the 25 test points. As we can see, the predicted values all lie on a single line, the best fit line that our trained model found.
Summary
Linear regression is one of the basic supervised (teacher-based) machine learning algorithms. It can easily be used with the Python scikit-learn module:
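The whole workflow fits in a few lines. This is a compact sketch on a tiny noise-free dataset (made up for illustration) that lies exactly on y = 2x + 5:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Known answers: points that lie exactly on y = 2x + 5 (illustrative)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([5.0, 7.0, 9.0, 11.0])

# Train the model
model = LinearRegression().fit(X, y)

# Predict y for an x with an unknown answer
prediction = model.predict(np.array([[10.0]]))
print(prediction[0])
```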