What is linear regression and how to use it with Python Scikit

Denys Golotiuk

Published in DataDenys, Jun 23, 2022

Linear regression is a supervised (teacher-based) learning algorithm. This means we work with our model in two stages:

First, we train the model on a dataset with known answers (and test it to estimate how well the model performs).
Then we use the trained model to predict values for a dataset with unknown answers.

How linear regression works

The idea of linear regression is pretty simple. Let’s imagine the following dataset (a simple set of x and y values):

Now, what if we tried to draw a line that lies as close as possible to all of our points, that is, a line that minimizes the overall distance between itself and the points? We’ll end up with something like this:

This red line is the best fit line for our set of points. It is basically a calculated function (e.g. y=2x+5). The process of finding this function is called linear regression: linear because the result is a line, and regression because we reduce the set of points to a function.
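One common way to find such a line is ordinary least squares, which picks the slope and intercept that minimize the sum of squared vertical distances between the points and the line. The sketch below is only an illustration under that assumption; the five sample points are made up so that they roughly follow y=2x+5.

```python
import numpy as np

# Hypothetical sample points scattered around y = 2x + 5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([7.2, 8.9, 11.1, 12.8, 15.3])

# Ordinary least squares for a single feature:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_mean, y_mean = x.mean(), y.mean()
slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
intercept = y_mean - slope * x_mean

print(f"best fit line: y = {slope:.2f}x + {intercept:.2f}")
```

The resulting slope and intercept should land close to 2 and 5, the values the made-up points were built around.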

Now, if we calculate the y values again for our known x coordinates, we get values that lie on the best fit line:

Yes, there will be some error between the actual and the calculated values (how much depends on the quality of our model). But instead of a limited set of points, we now have a continuous line that lets us predict y values for new x coordinates. And that is exactly what we need: predicting new values.
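To make the last two paragraphs concrete, here is a small sketch that fits a line, recalculates y for the known points, measures the error, and then predicts y for new x coordinates. The sample points and the new x values are hypothetical, and np.polyfit with degree 1 is used here simply as one convenient way to get a least squares line.

```python
import numpy as np

# Hypothetical sample points scattered around y = 2x + 5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([7.2, 8.9, 11.1, 12.8, 15.3])

# A degree-1 polynomial fit is a least squares straight line;
# polyfit returns coefficients from the highest degree down
slope, intercept = np.polyfit(x, y, deg=1)

# Values on the best fit line for the known x coordinates
y_fitted = slope * x + intercept

# Error between actual and fitted values
residuals = y - y_fitted
mse = np.mean(residuals ** 2)
print("mean squared error:", mse)

# Predict y for x coordinates that were not in the dataset
x_new = np.array([6.0, 7.5])
y_new = slope * x_new + intercept
print("predictions:", y_new)
```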

Using linear regression in Python

Prepare dataset

Let’s generate 100 points based on a known function with random deviations to simulate “randomness”:
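The original snippet is not reproduced here, so the following is a minimal sketch of this step under stated assumptions: the known function is taken to be y=2x+5, the deviations are uniform in the range -10 to 10, and the column names x and y, as well as the fixed seed, are choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Fixed seed so the "random" dataset is reproducible (assumption)
rng = np.random.default_rng(42)

# 100 x values: 0, 1, ..., 99
x = np.arange(100)

# Assumed known function y = 2x + 5 plus a random deviation in [-10, 10)
noise = rng.uniform(-10, 10, size=100)
y = 2 * x + 5 + noise

df = pd.DataFrame({"x": x, "y": y})
print(df.head())
```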

Generate random 100-points dataframe

Create and train model

A common practice is to use about 75% of the labeled dataset to actually train the model and hold back the remaining 25% to estimate its quality. Let’s create and train a model on the first 75 points of our dataset:
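A minimal sketch of this step with scikit-learn's LinearRegression could look like the following. The dataset is regenerated inside the block so that it runs on its own; the function y=2x+5, the noise range, the seed, and the names df, train, and test are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumed dataset: y = 2x + 5 with uniform noise in [-10, 10)
rng = np.random.default_rng(42)
x = np.arange(100)
df = pd.DataFrame({"x": x, "y": 2 * x + 5 + rng.uniform(-10, 10, size=100)})

# First 75 points for training, remaining 25 for testing
train = df.iloc[:75]
test = df.iloc[75:]

# scikit-learn expects a 2D feature table, hence train[["x"]]
model = LinearRegression()
model.fit(train[["x"]], train["y"])

# Slope and intercept of the best fit line found by the model
print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
```

Because the rows here are ordered by x, taking the first 75 means the test points all lie to the right of the training points. scikit-learn also offers train_test_split in sklearn.model_selection, which shuffles the rows by default, if a random split is preferred.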

Train model on first 75 points of our dataset

Predict

To predict values, we can now use the predict() method of our trained model:
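As a sketch of this step, the block below trains a model and then calls predict() on the 25 held-out points. It rebuilds the data so that it is runnable on its own; the function y=2x+5, the noise range, the seed, and the variable names are assumptions, not the original code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumed dataset: y = 2x + 5 with uniform noise in [-10, 10)
rng = np.random.default_rng(42)
x = np.arange(100)
df = pd.DataFrame({"x": x, "y": 2 * x + 5 + rng.uniform(-10, 10, size=100)})

train, test = df.iloc[:75], df.iloc[75:]

model = LinearRegression()
model.fit(train[["x"]], train["y"])

# predict() takes the same 2D feature layout that fit() received
predicted = model.predict(test[["x"]])

# Put actual and predicted values side by side for comparison
result = test.assign(predicted=predicted)
print(result.head())
```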

Predicting values using linear regression model

To estimate the quality of our model, let’s plot our initial data points together with the predicted and actual values:
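One possible way to draw such a plot is with matplotlib, as sketched below. The data, the seed, and the output file name linear_regression.png are assumptions for this illustration; in an interactive session, plt.show() could be used instead of saving to a file.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Assumed dataset: y = 2x + 5 with uniform noise in [-10, 10)
rng = np.random.default_rng(42)
x = np.arange(100)
df = pd.DataFrame({"x": x, "y": 2 * x + 5 + rng.uniform(-10, 10, size=100)})

train, test = df.iloc[:75], df.iloc[75:]
model = LinearRegression().fit(train[["x"]], train["y"])
predicted = model.predict(test[["x"]])

fig, ax = plt.subplots()
# Training points, actual test values, and predicted test values
ax.scatter(train["x"], train["y"], color="green", label="train (75 points)")
ax.scatter(test["x"], test["y"], color="blue", label="test, actual (25 points)")
ax.scatter(test["x"], predicted, color="red", label="test, predicted")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Hypothetical output file name
fig.savefig("linear_regression.png")
```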

What we will see is:

Green points are the 75 training points. Blue points are the actual values of the 25 test points, and red points are the values predicted for those same 25 points. As we can see, the predicted values all lie on a single line: the best fit line that our trained model found.

Summary

Linear regression is one of the basic supervised (teacher-based) machine learning algorithms. It is easy to use with Python’s scikit-learn library.
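As a compact recap, the whole workflow might look like the following sketch. The dataset (y=2x+5 with uniform noise), the seed, and the variable names are assumptions; the score() call, which returns the R² coefficient of determination, is added here as one simple numeric way to estimate model quality alongside the plot.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# 1. Prepare a dataset (assumed: y = 2x + 5 with noise in [-10, 10))
rng = np.random.default_rng(42)
x = np.arange(100)
df = pd.DataFrame({"x": x, "y": 2 * x + 5 + rng.uniform(-10, 10, size=100)})

# 2. Split: first 75 points to train, last 25 to test
train, test = df.iloc[:75], df.iloc[75:]

# 3. Create and train the model
model = LinearRegression()
model.fit(train[["x"]], train["y"])

# 4. Predict values for the test points
predicted = model.predict(test[["x"]])

# 5. Estimate quality: R^2 on the held-out points (1.0 is a perfect fit)
r2 = model.score(test[["x"]], test["y"])
print("R^2 on test points:", r2)
```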
