Linear Regression from Scratch: Statistical Approach

Let’s do this without an optimizer. Just a handful of statistical formulas.

--

A linear regression line.

Hello world! Welcome back to my blog! Today I want to share another project I’ve just finished: a linear regression. But wait, this one is different! Usually I use either Scikit-Learn or Keras to do machine learning stuff, but this time I decided to solve the regression problem with a statistical approach. That’s essentially why we don’t need an optimization algorithm for this task, which makes this project relatively simple compared to other machine learning tasks.

Note: I share the full code used in this project at the end of this article.

Student exam scores dataset

Before we jump into the code, I want you to know that the dataset used in this project is taken from here. The dataset itself is very small: it contains only 25 samples, which we are going to split into train and test sets. Let’s first load the modules and the data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('student_scores.csv')
df.head()

As you can see in the code above, I use Pandas to load the CSV file and Matplotlib to display the data and the regression line. The computation itself will be done with NumPy. Anyway, after we run the code above, the following output should appear:

The first 5 rows of our dataset.

Basically, the table above tells us that we have 2 columns in our dataset, namely Hours (how long a student spends studying) and Scores (the score they got on an exam). Our objective is to predict a student’s exam score based on their studying duration. Therefore, to make things more straightforward, I want to take the values of the Hours column as the x data, while the Scores become the y data. I will also take the first 15 samples as the training set, while the rest are used for testing.

# take the first 15 samples for training and the remaining 10 for testing
x_train = df['Hours'].values[:15]
y_train = df['Scores'].values[:15]
x_test = df['Hours'].values[15:]
y_test = df['Scores'].values[15:]

Now that we have defined x and y, we can display them in a scatter plot. The blue circles and green dots in the figure below represent the train and test data respectively.

plt.figure(figsize=(10,7))
plt.title('Train/test data distribution')
plt.scatter(x_train, y_train, c='b', s=36)
plt.scatter(x_test, y_test, c='g', s=14)
plt.xlabel('Hours of study')
plt.ylabel('Score')
plt.show()
All samples displayed in a scatter plot.

Linear regression

The point of linear regression is to create a straight line that approximates all data points with minimum error, assuming that the independent variable (x) has a linear relationship with its dependent variable (y). The regression line itself is defined by a simple linear function, and our objective is to figure out the values of b_0 and b_1 that produce the best-fit line for our samples.

Regression line function: y_hat = b_0 + b_1 · x.

In fact, there are closed-form formulas for both b_0 and b_1, computed directly from the coordinates of the existing data points. Here are the formulas:

Formula to calculate b_1: b_1 = Σ (x_i − x_bar)(y_i − y_bar) / Σ (x_i − x_bar)²
Formula to calculate b_0: b_0 = y_bar − b_1 · x_bar

In this case, x_bar represents the mean of all studying durations in hours, while y_bar represents the average score of all students. Meanwhile, x_i and y_i denote the values of each individual sample. Below is the code implementation of both formulas.

def coefficients(x, y):
    x_mean, y_mean = np.mean(x), np.mean(y)

    # slope: covariance of x and y divided by the variance of x
    b1 = sum([(x_i-x_mean)*(y_i-y_mean) for x_i, y_i in zip(x, y)]) / sum([(x_i-x_mean)**2 for x_i in x])
    # intercept: the line must pass through the point (x_mean, y_mean)
    b0 = y_mean - b1 * x_mean

    return np.array([b0, b1])

We can see here that the coefficients() function takes 2 arguments, where x and y are the arrays of samples in our training set. The output of this function is an array with 2 elements which hold b_0 and b_1 respectively. Now it’s time to feed the function with our x_train and y_train data.

b0, b1 = coefficients(x_train, y_train)
print('b0 :', b0)
print('b1 :', b1)

If we run the code above, we should get the following output.

The values of b_0 and b_1 obtained from our train data.
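As a quick sanity check (my own addition, not part of the original walkthrough), we can compare these coefficients against NumPy’s built-in np.polyfit, which should return the same values, slope first:

# np.polyfit(x, y, 1) fits a degree-1 polynomial and returns [slope, intercept]
b1_np, b0_np = np.polyfit(x_train, y_train, 1)
print(np.allclose([b0_np, b1_np], coefficients(x_train, y_train)))  # expected: True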

Now that the values of the two variables have been obtained, we can start drawing the regression line. np.linspace() in the first line of the code below creates an array of 100 evenly spaced numbers ranging from 0 to 10 (inclusive). Next, every element of x_line is processed with our b_0 and b_1, and the results are stored in the y_line array. This operation is simply the straight-line formula, where b_0 is the intercept and b_1 is the slope. Finally, once both x_line and y_line have been defined, we can plot the regression line using the plt.plot() function.

x_line = np.linspace(0, 10, 100)
y_line = b0 + b1*x_line
plt.figure(figsize=(10,7))
plt.scatter(x_train, y_train, c='b', s=36)
plt.plot(x_line, y_line, c='r')
plt.title('Regression line on train data')
plt.xlabel('Hours of study')
plt.ylabel('Score')
plt.show()
Regression line on train data.

Furthermore, we can also display the regression line on our test data. However, it’s important to remember that the line was fitted on the training data alone, without taking the samples in our test set into account. So the regression line below might not be the best-fit model for the test data.

plt.figure(figsize=(10,7))
plt.scatter(x_test, y_test, c='g', s=14)
plt.plot(x_line, y_line, c='r')
plt.title('Regression line on test data')
plt.xlabel('Hours of study')
plt.ylabel('Score')
plt.show()
The same regression line displayed on test data distribution.

Prediction

The core of a linear regression model is the pair of values b_0 and b_1, which we just obtained in the previous part. Now we are going to generate predictions for both x_train and x_test using the predict() function defined below.

def predict(data, coef):
    # apply y_hat = b_0 + b_1 * x to every sample
    predictions = [coef[0] + coef[1]*x_i for x_i in data]
    return predictions

The data argument above is just an array containing all the samples we want to predict, whereas the second argument (coef) is an array holding the values of b_0 and b_1. The function returns a list of predicted values.
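Side note (my own addition, not in the original post): since the inputs here are NumPy arrays, the same prediction can be written as a single vectorized expression through broadcasting, just like the y_line computation earlier.

def predict_vectorized(data, coef):
    # same result as predict(), but NumPy broadcasting replaces the loop
    return coef[0] + coef[1] * np.asarray(data)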

Now let’s first use the function to predict our train data.

The x values of our train data.
train_preds = predict(x_train, [b0, b1])
train_preds
Predictions on train data.

Next, we will do the exact same thing on test data.

The x values of our test data.
test_preds = predict(x_test, [b0, b1])
test_preds
Predictions on test data.

In the next section we are going to find out the error values obtained from our prediction on train and test samples.

Error values

In classification tasks, we commonly use the accuracy score to evaluate the performance of a model. However, we cannot do that in regression problems since the labels are continuous numbers. What we can do in a regression task is compute an error value, which denotes how close our predictions are to the actual values. There are several formulas we can choose for that, such as MAE (Mean Absolute Error), MSE (Mean Squared Error), and RMSE (Root Mean Squared Error). In this case, I decided to use RMSE.

Root Mean Square Error formula: RMSE = sqrt( (1/n) · Σ (yhat_i − y_i)² ).

The RMSE formula above is pretty straightforward to understand. If we turn it into a function, it takes only 2 arguments: the predicted and actual values, represented by yhat and y respectively. The index i in the formula indicates that the calculation is done iteratively on each predicted-actual pair, and the variable n represents the number of samples being compared. Below is the code implementation, which works exactly like the RMSE formula above. Additionally, it’s important to keep in mind that the actual and predicted arrays have to be the same length for this to work.

def rmse(actual, predicted):
    # square each error, take the mean, then take the square root
    return np.sqrt(sum([(pred_i-act_i)**2 for act_i, pred_i in zip(actual, predicted)]) / len(actual))
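For comparison, here is a minimal sketch of the other two metrics mentioned earlier, MAE and MSE, written in the same style (my own addition, not part of the original post):

def mae(actual, predicted):
    # average of the absolute errors
    return sum([abs(pred_i-act_i) for act_i, pred_i in zip(actual, predicted)]) / len(actual)

def mse(actual, predicted):
    # average of the squared errors, i.e. RMSE without the square root
    return sum([(pred_i-act_i)**2 for act_i, pred_i in zip(actual, predicted)]) / len(actual)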

Now let’s calculate the error values produced by the predictions on both the train and test data. We can see in the figure below that the test error is slightly greater than the train error. In fact, this is expected, since we fitted the regression line only on the train data.
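Here’s a sketch of the calls that produce these numbers, using the rmse() function we just defined:

print('Train RMSE:', rmse(y_train, train_preds))
print('Test RMSE :', rmse(y_test, test_preds))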

Error on train and test data calculated using RMSE.

That’s all for this post. Feel free to comment if there’s a mistake in my explanation! See you!

Here’s the code:
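Everything from the snippets above, combined into a single runnable script:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def coefficients(x, y):
    # closed-form least squares estimates of intercept (b0) and slope (b1)
    x_mean, y_mean = np.mean(x), np.mean(y)
    b1 = sum([(x_i-x_mean)*(y_i-y_mean) for x_i, y_i in zip(x, y)]) / sum([(x_i-x_mean)**2 for x_i in x])
    b0 = y_mean - b1 * x_mean
    return np.array([b0, b1])

def predict(data, coef):
    # apply y_hat = b0 + b1 * x to every sample
    return [coef[0] + coef[1]*x_i for x_i in data]

def rmse(actual, predicted):
    # root mean squared error between actual and predicted values
    return np.sqrt(sum([(pred_i-act_i)**2 for act_i, pred_i in zip(actual, predicted)]) / len(actual))

# load the dataset and split it: first 15 samples for training, the rest for testing
df = pd.read_csv('student_scores.csv')
x_train = df['Hours'].values[:15]
y_train = df['Scores'].values[:15]
x_test = df['Hours'].values[15:]
y_test = df['Scores'].values[15:]

# fit the regression line
b0, b1 = coefficients(x_train, y_train)
print('b0 :', b0)
print('b1 :', b1)

# draw the regression line over the train data
x_line = np.linspace(0, 10, 100)
y_line = b0 + b1*x_line
plt.figure(figsize=(10,7))
plt.scatter(x_train, y_train, c='b', s=36)
plt.plot(x_line, y_line, c='r')
plt.title('Regression line on train data')
plt.xlabel('Hours of study')
plt.ylabel('Score')
plt.show()

# predict and evaluate
train_preds = predict(x_train, [b0, b1])
test_preds = predict(x_test, [b0, b1])
print('Train RMSE:', rmse(y_train, train_preds))
print('Test RMSE :', rmse(y_test, test_preds))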

