Linear Regression from Scratch in Python | PYTHOLABS

Pytholabs Research
5 min read · Jan 30, 2019

Update: We have introduced an interactive learning platform for machine learning / AI. Check out this blog in interactive mode.

In this post we will walk through one of the simplest algorithms in machine learning and build it from scratch in Python using NumPy.

Supervised machine learning is broadly classified into 2 tasks:

Regression: predicting continuous variables, e.g., temperature.

Classification: predicting discrete variables, e.g., dog/cat.

In Linear Regression we establish a linear relationship between the input variables (X) and a single output variable (Y). When the input (X) is a single variable, the model is called Simple Linear Regression; when there are multiple input variables (X), it is called Multiple Linear Regression.

A linear equation is always a polynomial of degree 1 (for example, x + 2y + 3 = 0). In the two-dimensional case it always forms a line; in higher dimensions it forms a plane or hyperplane. Its “shape” is always perfectly straight, with no curves of any kind. This is why we call such equations linear.

Simple Linear Regression

Model Representation

In this problem, we have an input variable X and one output variable Y, and we want to build a linear relationship between these variables. Here the input variable is called the Independent Variable and the output variable is called the Dependent Variable. We can define this linear relationship as follows:

Y = β0 + β1 * X

# code in Python
predict = lambda x, b0, b1: b0 + b1 * x
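For example (a quick sanity check of our own), with intercept b0 = 4 and slope b1 = 3, the prediction at x = 2 is 4 + 3 * 2:

predict(2, 4, 3)  # returns 10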

β1 is called the slope or coefficient and β0 is called the intercept/bias coefficient. β0 gives an extra degree of freedom to the model. This equation is similar to the line equation

y = m*x + b

with m = β1 (slope) and b = β0 (intercept). So in this Simple Linear Regression model we want to find the best-fit line for our dataset: the line between X and Y that best estimates the relationship between X and Y.

But how do we find these coefficients? That’s the learning procedure. We can find these using different optimization approaches.

Optimization refers to the task of minimizing/maximizing an objective function f(x) parameterized by x. In machine/deep learning terminology, it is the task of minimizing the cost/loss function J(w) parameterized by the model's parameters w.

Types of optimization algorithms:

  • One-step optimization algorithms, e.g., the Ordinary Least Squares (OLS) estimator.
  • Iterative optimization algorithms that converge to an acceptable solution regardless of the parameter initialization, such as gradient descent.

We will use the Ordinary Least Squares estimator for Simple Linear Regression and the Gradient Descent approach for Multiple Linear Regression later in this post. To give a feel for the iterative family, a minimal sketch of gradient descent for the simple case follows.
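This sketch is only illustrative and not part of the original post (the function name gradient_descent and the hyperparameters lr and n_iters are our choices). It repeatedly nudges both coefficients against the gradient of the mean squared error:

import numpy as np

def gradient_descent(x, y, lr=0.01, n_iters=1000):
    b0, b1 = 0.0, 0.0  # start from an arbitrary initialization
    m = len(y)  # number of samples
    for _ in range(n_iters):
        y_pred = b0 + b1 * x  # current predictions
        # partial derivatives of the MSE with respect to b0 and b1
        d_b0 = (-2 / m) * np.sum(y - y_pred)
        d_b1 = (-2 / m) * np.sum((y - y_pred) * x)
        b0 -= lr * d_b0  # step against the gradient
        b1 -= lr * d_b1
    return b0, b1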

Ordinary Least Square Method

Earlier in this post we discussed that we are going to approximate the relationship between X (temperature) and Y (ice cream sales) with a line.

import numpy as np
X = 2 * np.random.rand(100, 1)  # 100 random inputs in [0, 2)
y = 4 + 3 * X + np.random.randn(100, 1)  # the line y = 4 + 3x plus Gaussian noise

This creates a few random data points. If we plot these scatter points in 2D space, we will get something like the following image.
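A quick way to reproduce such a plot (matplotlib is our choice here; the original post only shows the resulting image):

import matplotlib.pyplot as plt
plt.scatter(X, y)  # raw data points
plt.xlabel('X (temperature)')
plt.ylabel('y (ice cream sales)')
plt.show()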

3 different models fitted to the dataset

Here SSr refers to the Sum of Squared Residuals, the error reported for each model in the image.

You can observe 3 lines in the image. The blue line is the best model, since it is closest to the data points; a good model will always have the least error. We can find this line by reducing the error, where the error of each point is the vertical distance between the line and that point.

The total error of the model is the sum of the squared errors of all the points, i.e.:

SSr = Σᵢ (yᵢ − ŷᵢ)², for i = 1, …, m

where m = total number of samples in the dataset and ŷᵢ is the model's prediction for the i-th point.
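In NumPy, for candidate coefficients b0 and b1, this is a one-liner (using the predict lambda defined earlier):

ssr = np.sum((y - predict(X, b0, b1)) ** 2)  # sum of squared residuals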

You might have noticed that we are squaring each of the distances. This is because some points lie above the line and some below it, so without squaring, the positive and negative errors would cancel out. We can minimize the error in the model by minimizing SSr, and after working through the mathematics of the minimization (setting the partial derivatives with respect to β0 and β1 to zero), we get:

OLS Estimator:

β1 = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²

β0 = ȳ − β1 * x̄

x̄ is the mean value of the input variable X and ȳ is the mean value of the output variable Y.

Now we have the model. You can find the full derivation of the OLS Estimator here. Now we will implement this model in Python.
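The implementation might look something like the following minimal sketch (the helper name fit_ols and the flattening of the toy arrays are our assumptions, not from the original post):

def fit_ols(x, y):
    # closed-form OLS estimates for the intercept b0 and slope b1
    x_mean, y_mean = x.mean(), y.mean()
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b0 = y_mean - b1 * x_mean
    return b0, b1

# fit on the toy data generated above (flattened to 1-D arrays)
b0, b1 = fit_ols(X.ravel(), y.ravel())
print(b0, b1)  # should come out close to 4 and 3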

Evaluating The Model

Let's check how good our model is by using 2 metrics:

RMSE Score:

# MSE (Mean Squared Error): average of the squared residuals
predict = lambda x, b0, b1: b0 + b1 * x
mse = np.sum((y - predict(x, b0, b1)) ** 2) / len(y)
# RMSE: square root of the MSE
rmse = mse ** (1 / 2)

R² Score:

R² measures the proportion of the variance in Y that is explained by the model: 1 minus the ratio of the residual sum of squares to the total sum of squares.

def r2(y_, y):
    # total sum of squares: variance of y around its mean
    sst = np.sum((y - y.mean()) ** 2)
    # residual sum of squares: error of the predictions y_
    ssr = np.sum((y_ - y) ** 2)
    return 1 - (ssr / sst)
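For example, on the toy data from earlier (this usage is our addition and assumes b0 and b1 hold the fitted coefficients):

y_pred = predict(X.ravel(), b0, b1)
print(r2(y_pred, y.ravel()))  # close to 1 for a good fit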

Adjusted R² Score :

The adjusted R-squared is a modified version of R-squared that accounts for the number of predictors in the model. It increases only if a new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, though it usually is not, and it is always lower than the R-squared.

# n = len(y) samples, k = X.shape[1] predictors
adjusted_r_squared = 1 - (1 - r_squared) * (len(y) - 1) / (len(y) - X.shape[1] - 1)

Sklearn Approach:

Scikit-learn (sklearn) is a collection of popular classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN.

We are using Pandas to handle our dataset efficiently. It provides high-performance data manipulation and analysis tools built on powerful data structures.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# load the dataset
data = pd.read_csv('height_weight.csv')
# let's see how it looks
data.head()

# Let's prepare our data so the model can be fitted.
# scikit-learn cannot use a rank-1 array as input, so reshape X into a column vector
X = data.Weight.values
X = X.reshape(-1, 1)
# target variable
y = data.Height.values


# Initializing Model
reg = LinearRegression()
# Fitting training data
reg = reg.fit(X, y)
# Y Prediction
Y_pred = reg.predict(X)
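We can also inspect the learned coefficients, which correspond to the β0 and β1 from earlier (intercept_ and coef_ are standard attributes of scikit-learn's LinearRegression):

print(reg.intercept_)  # β0, the intercept
print(reg.coef_)  # β1, one slope per input feature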

Model evaluation:

# calculating RMSE and R² score
mse = mean_squared_error(y, Y_pred)
rmse = np.sqrt(mse)
r2_score = reg.score(X, y)
print(rmse)
print(r2_score)

For amazing courses, check out https://pytholabs.com/
