Linear Regression in Python

Lucas Moncada
16 min read · Dec 14, 2021

Breaking Down the “Hello World” of Machine Learning

Linear regressions are often one of the first things an aspiring data scientist/machine learning engineer learns at the beginning of their journey. Thus, I thought I would write this article for an imaginary version of myself learning about linear regression for the first time. First, I break down the basics of what linear regressions are and how they might be used practically. After this, I show a lower-level implementation of a linear regression model in code to facilitate diving into the specifics. Here we go!

What is a Linear Regression?

A linear regression is a model that attempts to learn the relationship between one or more features (aka inputs) and a continuous numerical target variable. The goal of a linear regression model is to predict this continuous target variable when given some input. Let’s look at a straightforward example: using an employee's years of experience to predict their salary.

Linear Regression Intuition

Forget about linear regressions for a second and consider how an individual’s relevant experience might be related to salary in the workplace. A good guess would be to predict that as experience increases, salary increases as well. Maybe it looks something like this.

This is interesting, but the obvious next question is: by how much? Well, that is actually a big part of the point behind linear regression; we want to find out by how much salary increases as experience increases. Let us call this “how much” number W. The mathematically inclined will likely imagine some equation like the one below.
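
Salary = W × Experience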

This is an interesting idea. So as a crude example, if your salary increased by $10,000 per year of experience you obtained as a Data Scientist, you might imagine an equation like this.
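
Salary = $10,000 × Experience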

Moreover, you could expect a relationship like the below graph.

Cool, that looks good, right?… Right? Well, if you look more closely at the x-axis you might notice something peculiar: this is only for experienced data scientists. What does this same relationship look like for new grad data scientists?

Now the problem is visible. Our current formula does not account for an employee’s base salary! This is certainly not a relationship a reasonable employer could use. So how might we account for that? Again, those of you who have taken high-school-level functions may have a knee-jerk reaction: “add a bias.” What does this mean? Well, quite simply, we can add a fixed amount of Salary on top of Experience multiplied by W. We will call this fixed/base amount of salary B. Mathematically, we can write this as
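
Salary = B + (W × Experience)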

Now, as an example of this formula in action, let’s assume a base salary of $80,000 and change W from +$10,000/year to +$5,000/year of experience. Hence, the mathematical representation is
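
Salary = $80,000 + ($5,000 × Experience)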

This results in the following graph.

Great! Now that we have defined this heuristic, we still have to figure out what a linear regression is, right? On the contrary, the intuition we have arrived at is known as a Simple Linear Regression. We can generalize our formula by replacing Experience with X, representing some form of input, and substituting y in place of Salary.
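
y = B + (W × X)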

Great, now take a second to infer how one might write this equation if we were using just a TV Screen’s Area to try and predict the price of that TV.

We could write that formula something like:
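
Price = B + (W × Screen Area)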

Using a Linear Regression Model

Now that we have an intuition for what a linear regression model is, how can we use it? Here I will show how to use linear regression to predict salary given an employee's experience. I will be using a more extensive dataset this time to better represent an applicable example. First, I will show the implementation using Python’s excellent machine learning library, Scikit-learn (aka sklearn); then later in this article, I will dive into what is actually going on under the hood.

Our main steps to using this linear regression model, while employing some general machine learning techniques, will be:

  1. Get the data and split it into a training set and test set
  2. Initialize our Linear Regression Model
  3. Fit our Linear Regression Model on the data
  4. Use our Linear Regression Model to predict unseen data

1. Getting the Data and Splitting it

First, we should get our data and take a quick look at it.

# Import the relevant libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Salary_Data.csv')  # get the dataset
YearsExperience = dataset['YearsExperience'].values  # extract the column as a NumPy array
Salary = dataset['Salary'].values

Visualizing the data in a scatter plot we get a similar-looking graph as above, except instead of a linear regression line, we are just showing the data points.
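
If you want to reproduce a scatter plot like that yourself, a minimal matplotlib sketch (assuming the YearsExperience and Salary arrays loaded above) could look like this:

plt.scatter(YearsExperience, Salary)  # plot each experience/salary pair as a point
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()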

Now let us split our data into a training set and test set. If you are unfamiliar with this idea, do not worry, it will make more sense later. For now, just realize that we are saving some portion of the experience and salary data pairs in a test set, to use in a later step. We will be working with the training set in the next couple of steps.

from sklearn.model_selection import train_test_split

Experience_train, Experience_test, Salary_train, Salary_test = train_test_split(
    YearsExperience, Salary, test_size=0.2, random_state=123)  # Taking 20% of the data into the test set

Now we have a training set comprised of Experience_train and Salary_train. After preparing our data, we can get to the linear regression modelling.

2. Initializing a Linear Regression Model

What does it mean to initialize a model? Well, this step consists of telling Python (and sklearn) what specifics our model will entail when we create it. Simple linear regression models do not have many specifics to alter, but one example is excluding the bias variable that served as our base salary in the prior example. Here, we will create a default linear regression model; however, I encourage you to check out the sklearn documentation for their linear regression implementation¹.

Here is the Python code for initializing a linear regression.

from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()

That’s it for initializing the model. Very easy with sklearn, right?!
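
As an aside, the “no bias” variant mentioned above corresponds to LinearRegression’s fit_intercept parameter. A quick sketch of that variant (not the model we use here) would be:

# A variant with no bias/intercept term, i.e. no base salary
no_bias_regression = LinearRegression(fit_intercept=False)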

3. Fitting the Linear Regression on the Data

This is a critical step for a linear regression model, and for all supervised machine learning models (models trained on labelled data). Let’s quickly recall our formula for linear regression.

Here we will allow the model to learn from the data, aka the experience and salary pairs, in order to find the best B and W for this data. What does that mean? Basically, we will try a bunch of combinations of the bias variable B and the slope variable W to find the best fit for our data. Okay, but what does it mean to have the best fit on our data? Again, we can figure this out by thinking logically.

Consider the following graphs, and try to articulate why you think one of the linear regression lines better fits the data.

The green line indicates the linear regression

Did you get an answer? Try to consider which line better represents the data. Clearly, it is the first image, but why? One might say because the line is closer to the data… and you would be right. The formal idea is actually called ordinary least squares, and the main concept is simply looking at the differences between the line and the actual data points. We will look into this a bit further later in the article, but for now, let us see how to implement this in sklearn.

linear_regression.fit(Experience_train.reshape(-1, 1), Salary_train)

Wow, just one line, that’s awesome! Thanks, David Cournapeau (original author of sklearn). Also, note that we used reshape(-1, 1) on the experience values; this is because the fit() method expects a 2-D array, and this is just a simple way to convert a 1-D array to 2-D. What does our linear regression look like now? See below for the fit.
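
By the way, if you are curious what values of B and W the model actually learned, sklearn exposes them on the fitted model as intercept_ and coef_. A quick sketch (assuming the model fitted above):

# Peek at the parameters the model learned
print('B (intercept / base salary):', linear_regression.intercept_)
print('W (slope / increase per year):', linear_regression.coef_[0])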

Looks great, better than the above two fits! Now we have a model that understands our data, what next? Finding out if our model performs well on unseen data.

4. Predicting New Data with our Linear Regression

It is very important in machine learning to measure whether one’s model predicts well not just on the data used to train it, but also on new, unseen data. Now, remember that bit of data we saved in step 1 and called the test data? In this step we use this data, which was excluded from training, to see how well our linear regression performs. Thus, let us write the code to make predictions on the test set.

Salary_predictions = linear_regression.predict(Experience_test.reshape(-1, 1))

Once again, very simple. Note the reshape() function being used once again to transform the 1-D array into the required 2-D array format. When we visualize how this model performs on the new data we see the below graph.

This looks pretty good, the line seems to cut through the data points fairly well. This means our model is generalizing well on unseen data!
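
If you prefer a single number over a picture, one option is sklearn’s mean_absolute_error (a metric we will come back to later in this article). A quick sketch using the test predictions above:

from sklearn.metrics import mean_absolute_error

print('Mean Absolute Error:', mean_absolute_error(Salary_test, Salary_predictions))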

In summary, we got our data and split it into a training and test set according to typical ML practices. Next, we initialized a default linear regression model using sklearn. Then we used our data to train the linear regression to understand the relationship between experience and salary in our dataset. Finally, we used the model to predict unseen data to check if the linear regression generalizes well.

Linear Regressions — Under the Hood

As a software engineer, I have always understood machine learning concepts better when seeing them implemented from scratch in code. This is an idea Jeremy Howard utilizes in his fantastic course and book, Deep Learning for Coders². Hence, we will attempt this approach here with linear regressions.

Formally defining the efficacy of the model fit

Earlier in this article, we discussed an over-simplified idea of when a linear regression properly fits the data; we eye-balled how close the linear regression line came to the observed data points. This, however, is obviously not formal enough to instruct Python. Thus, we will look at one very common way to define how well the model fits the data, called Ordinary Least Squares.

The ordinary least squares method measures how effective a fit is by comparing the model’s individual predictions to the actual labels of the data. For instance, if my model predicted that an individual with 5 years of experience made $80,000, but they actually made $75,000, my model would be overpredicting this observation by $5,000. The ordinary least squares formulation is simply the sum of all the model’s errors, squared.
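
OLS loss = (ŷ₁ − y₁)² + (ŷ₂ − y₂)² + … + (ŷₙ − yₙ)² = Σ (ŷᵢ − yᵢ)²

where ŷᵢ is the model’s prediction and yᵢ is the actual observed value.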

By taking the square of the differences we make sure that overpredictions and underpredictions do not cancel each other out. This also penalizes large errors much more than smaller errors (e.g. 0.5² = 0.25 while 4² = 16). Defining this function in Python can look like this.

def OLS_loss(predictions, observations):
    return np.sum((predictions - observations)**2)

# Example Input
preds = np.array([1, 2, 3])
obs = np.array([2, 4, 6])
OLS_loss(preds, obs)  # -> 14

Notably, practitioners often refer to this measure of the efficacy of a model’s fit as a loss function.

How the Model Learns

When a machine learning engineer or data scientist speaks of a model learning, they often mean its parameters being adjusted to better represent the training data. Recalling our equation defined previously, the two parameters that the model should properly set are B, the intercept, and W, the coefficient of the X variable.

How are these parameters changed when given training data? One very common idea in machine learning is to use something called gradient descent.

Nonetheless, let us forget about this lingo for a moment and work from first principles. In the last subsection, we described how to formally define how effective a model’s fit is on our training data. That is great, and as we can see from our definition, we want to minimize this function as much as possible to have the smallest errors. One might suggest trying many different combinations of B and W and then taking the parameters with the lowest OLS score. Great suggestion! Unfortunately, considering both of the parameters are continuous values, this brute-force strategy is dubious even in this simple example and becomes intractable as the number of parameters grows.

Maybe we could try a random combination and then try to improve the parameters from there. Let us set B and W both equal to 1 and see what our model fit looks like.
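
As a rough sketch, reusing the OLS_loss function and the training arrays from earlier, we can compute how badly this starting guess does:

B, W = 1, 1                                  # naive starting parameters
initial_preds = B + Experience_train * W     # predictions of this untrained model
OLS_loss(initial_preds, Salary_train)        # an enormous loss relative to the salary scale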

As one can see, this linear regression line is not near the actual data points and is thus not a great model. Our OLS loss comes out to 42333966023, which is indeed a very large number relative to our salary magnitudes. Obviously, we can see that we need to increase B and W, but how can we explain this to Python? Considering we want to modify some parameters given some changing value (our loss function), we can look to calculus, the study of continuous change. Specifically, we would like to know how the OLS loss function changes as we manipulate B and W, which can be denoted with the following complex-looking equation.
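
∂Loss/∂B = 2 × Σ (ŷᵢ − yᵢ)
∂Loss/∂W = 2 × Σ (ŷᵢ − yᵢ) × xᵢ

where ŷᵢ = B + (W × xᵢ) is the model’s prediction for observation i.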

If this is beyond your mathematical understanding, feel free to just take away that the top and bottom elements are functions for calculating how the OLS loss changes as B and W change, respectively. Great! So now we can use these two functions to modify B and W.

Simple Example for Illustration

A good visual image to have of this process is a ball rolling down a hill to get to the lowest possible point. The lowest possible point is where the OLS loss is the minimum possible, and we get there by using the above gradient equations to lead us in making the right changes to our parameters.

However, instead of drawing out this entire curve of the OLS loss seen in the above image, we use a much more “greedy” approach to save on computational expense. We simply consider the current slope (gradient) of the loss function and move in the direction that decreases it. Thus, the following graph is more indicative of how our model learns.

Consider now that the purple slope is what we use to modify the example parameter. Specifically, the larger the slope, the more we adjust the parameter. When we continue this learning process a few times it might look something like this.

Here we can see that the OLS loss gets closer and closer to our goal state of the lowest possible OLS loss. Also notice that the slopes of the purple lines decrease as the parameter changes get smaller. If you haven’t realized already, we have actually arrived at the gradient descent algorithm I alluded to, through intuitive steps. Now let’s apply this to our experience and salary example and see what the code looks like.

Gradient Descent Learning Implementation

At this point, we are nearly ready to implement the code for how a linear regression model can “learn” from data, known commonly in the field as fitting a model on data. The one thing we left out is something called a learning rate, which is basically how large an update to the parameters the model makes based on the gradient. Typically, this is just a multiplicative factor denoted as alpha (α), as seen below.
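
B ← B − α × ∂Loss/∂B
W ← W − α × ∂Loss/∂W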

Where does this learning rate value come from? This is an excellent question and to my knowledge, the answer to this is still developing; however, currently, practitioners typically try to find the right learning rate iteratively, using some reasonable starting point. There are even some algorithms out there to find the right learning rate for more complex models³. Nonetheless, let’s see what the code looks like for an individual fit.

def single_fit(X, y, B, W, lr):
    y_predictions = B + X * W                      # Predictions for the gradients
    grad_b = -2 * np.sum(y_predictions - y)        # Negative of the gradient wrt B
    grad_w = -2 * np.sum((y_predictions - y) * X)  # Negative of the gradient wrt W
    B += grad_b * lr                               # Updating B param
    W += grad_w * lr                               # Updating W param
    return B, W

This should make sense given what we have discussed before. If you are still confused, try specifying what does not make sense and feel free to leave a comment/question. Let us see how our linear regression model changes after we fit the model a few times using a learning rate of 0.00025.
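
For instance, a minimal sketch of running several of these fits in a loop (assuming the single_fit function and training arrays above, and an arbitrary starting point for B and W) might be:

B, W = 10000, 5000   # arbitrary starting parameters
for i in range(20):  # repeat the single fit several times
    B, W = single_fit(Experience_train, Salary_train, B, W, lr=0.00025)
print('Learned B:', B, 'Learned W:', W)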

We can see that the model is getting closer and closer to the observations and thus is learning from the data. If all goes well, your model should converge on some parameters after fitting the model on the data several times. Thus, you might end up with a model similar to the following image.

As can be seen here by the very similar 20th fit and 50th fit, this model converges after several training fits. The converged state seems to fit the data fairly effectively, meaning we are done, right? Close, but not quite. We should see how our model generalizes to new data.

Making Predictions on New Data

Computing the predictions is actually very simple once you have your test data set up (as we did previously). We just have to make a forward pass on our test data using our linear regression equation. Below is the relevant code snippet.

def predict(X, B, W):
    return B + X * W

If this is a familiar equation then you are paying great attention, we have been using this equation throughout already!

Now that we have our predictions, let us see if we can get some intuitive idea of how well our model is performing on new data. Of course, we can still visualize our model since we are using a simple example, so let’s do that first.

This looks reasonable! But what if the data has more than one dimension and we cannot visualize it? Well, one approach to understanding our model’s performance is using the mean absolute error (MAE). Explicitly, this equation returns the average error that our model makes on the data. The formal definition is the following intuitive equation.
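
MAE = (1/n) × Σ |ŷᵢ − yᵢ|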

Where n is the number of observations

The code for this formula is straightforward.

def MAE(y_predictions, y_true):
    return np.sum(abs(y_predictions - y_true)) / len(y_true)

Awesome! When we pass in our model’s predictions we get a mean absolute error of ~4500. This simply means that, on average, our model’s predictions of an individual’s salary are off by about $4,500. Not terrible, but you could argue that this isn't amazing either.

Summarizing what we have

This has been hopefully a helpful look under the hood of the linear regression model so that you can better understand the ideas behind it. To summarize what we have learned, let’s organize our components into a linear regression class.

class LinearRegressionOLS():

    # Setting the relevant model characteristics
    def __init__(self, B=10000, W=5000, lr=0.00025, n_iters=20):
        self.B = B
        self.W = W
        self.lr = lr
        self.n_iters = n_iters

    # Training loop for the model to learn on the data
    def fit(self, X, y):
        self.costs = []

        for i in range(self.n_iters):
            y_predictions = self.B + X * self.W            # Predictions for the gradients
            grad_b = -2 * np.sum(y_predictions - y)        # Negative of the gradient wrt B
            grad_w = -2 * np.sum((y_predictions - y) * X)  # Negative of the gradient wrt W
            self.B += grad_b * self.lr                     # Updating B param
            self.W += grad_w * self.lr                     # Updating W param
            self.costs.append(np.sum((y_predictions - y)**2))  # Track the OLS loss at this iteration

        return self

    # Making predictions on data using the current model
    def predict(self, X):
        return self.B + X * self.W

    # Return the Mean Absolute Error
    def MAE(self, y_predictions, y_true):
        return np.sum(abs(y_predictions - y_true)) / len(y_true)

To use this linear regression class we can implement very similar code as in the previous section of this article with sklearn’s linear regression.

linear_regression = LinearRegressionOLS()
linear_regression = linear_regression.fit(Experience_train, Salary_train)
predictions = linear_regression.predict(Experience_test)
print('Mean Absolute Error:', linear_regression.MAE(predictions, Salary_test))

Now you know the fundamentals of a linear regression model! A grasp of this fundamental algorithm is useful for understanding more complex models such as Decision Trees and Neural Networks.

Thanks so much for reading! I hope this article was of some use to you; please feel free to leave any comments, questions, and feedback! If you are a beginner at machine learning and are wondering what some next steps are, I would recommend looking at multiple linear regressions and logistic regressions. Furthermore, one book I loved as a very solid introduction to machine learning was Python Machine Learning by Sebastian Raschka and Vahid Mirjalili.

¹ https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

² https://www.amazon.ca/Deep-Learning-Coders-fastai-PyTorch/dp/1492045527

³ https://fastai1.fast.ai/callbacks.lr_finder.html#Learning-Rate-Finder
