Linear Regression From Scratch

Diego Hurtado
23 min read · Jan 15, 2023


Linear Regression Model Representation & Implementation From Scratch in Python

Introduction

The goal of this notebook is to implement linear regression from scratch using its mathematical model representation in Python: first solving a simple linear regression with one variable, then solving a multiple linear regression model for housing price prediction and comparing the results with the pre-built classes of sklearn, and finally using a more complex dataset, the House Prices — Advanced Regression Techniques competition, to predict sale prices.

The whole process is documented in a Jupyter Notebook.

Linear Regression

Linear regression is a statistical method used to model the linear relationship between a dependent variable, denoted as Y, and one or more independent variables, denoted as X. [1]

The goal of linear regression is to find the line of best fit, the line that minimizes the sum of squared vertical distances between the data points and the line.

The equation for a simple linear regression model with one predictor variable is:

Y = b0 + b1 * X

Where b0 is the intercept term and b1 is the coefficient for the predictor variable X.

In a multiple linear regression model with multiple predictor variables, the equation can be written as:

Y = b0 + b1 * X1 + b2 * X2 + … + bn * Xn

where X1, X2, …, Xn are the predictor variables and b1, b2, …, bn are the corresponding coefficients.

Linear regression can be used to make predictions about the value of the dependent variable given a set of predictor variables. The coefficients of the model can be estimated using a variety of techniques, such as the ordinary least squares method.
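For a quick illustration of how ordinary least squares estimates these coefficients, the sketch below solves the least-squares problem with NumPy on a small made-up dataset (the numbers are invented for demonstration only):

import numpy as np

# Toy data (made up for illustration): y roughly follows 2 + 3 * x
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

# Build the design matrix [1, x] and solve the least-squares problem
# for the intercept b0 and the slope b1
A = np.column_stack([np.ones_like(x_toy), x_toy])
b0, b1 = np.linalg.lstsq(A, y_toy, rcond=None)[0]

print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}")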

Linear regression is a widely used and well-understood technique that is simple to implement and easy to interpret. It is a powerful tool for understanding the relationships between different variables and can be used to make informed decisions.

Inspirations:

The model formulation, content, and implementation are based on the Machine Learning Specialization, a program created in collaboration between DeepLearning.AI and Stanford Online that teaches the fundamentals of machine learning and how to build machine learning models in Python using NumPy and popular machine learning libraries.

I am also using ChatGPT to describe some concepts, e.g., "Explain simple linear regression with one variable."

Simple Linear Regression with One Variable

Linear regression with one variable is a statistical method used to model the relationship between a single independent variable (x) and a dependent variable (y). It assumes that the relationship between the two variables is linear, meaning that y is a linear function of x. The goal of linear regression is to find the best-fitting line through the data points, which can be used to make predictions about the value of y for any given value of x. The best-fitting line is determined by finding the values of the slope and y-intercept that minimize the sum of the squared differences between the predicted and actual values of y. This method is also known as simple linear regression [1]

Dataset

The file dataset1.txt contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. [3]

Note: this dataset does not need any preprocessing.

Notation:

m = Number of training examples

X = ‘Input’ variables / Features

y = ‘Output’ variable / ‘target’ variable

The dataset is loaded from the data file into the variables X and y:

import numpy as np
import pandas as pd

# Read the text file (comma-separated values)
df = pd.read_csv('https://raw.githubusercontent.com/DiegoHurtad0/Linear-Regression-Model-Representation-Implementation-From-Scratch-using-Python/main/data/dataset1.txt', skiprows=0, header=None, names=['population', 'profit'])

# Form the usual "X" matrix and "y" vector
X = df['population'].values
y = df[['profit']].values

m = y.size  # number of training examples

# Reshape y to an m x 1 matrix
y = y.reshape((m, 1))

df.head(3)

Plotting the Data
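A minimal sketch of that scatter plot with matplotlib, assuming df holds the population and profit columns loaded above:

import matplotlib.pyplot as plt

# Scatter plot of profit vs. city population
plt.figure(figsize=(8, 5))
plt.scatter(df['population'], df['profit'], marker='x', c='red')
plt.xlabel('Population of city in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.title('Food truck profit vs. city population')
plt.show()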

Gradient Descent

Gradient descent is an optimization algorithm used to minimize a function. It works by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. [4]

In mathematical terms, suppose we have a function f(x) that we want to minimize, where x is a vector of parameters. The gradient of the function with respect to the parameters is a vector of partial derivatives of the function with respect to each parameter

The gradient points in the direction of greatest increase of the function, so to minimize the function we move in the opposite direction, which is the direction of steepest descent. This is done by updating the parameters against the gradient:

x := x - η * ∇f(x)

where η is the learning rate, which determines the step size at each iteration.

The process of updating the parameters in this way is repeated until the cost function converges to a minimum or until a pre-determined number of iterations is reached.
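As a tiny, self-contained illustration of this update rule (not the regression code yet), here is gradient descent minimizing f(x) = x^2, whose gradient is 2x:

# Gradient descent on f(x) = x^2, whose gradient is 2x.
# With learning rate 0.1 the iterates shrink toward the minimum at x = 0.
x = 5.0    # starting point
eta = 0.1  # learning rate

for i in range(50):
    grad = 2 * x        # gradient of f at the current point
    x = x - eta * grad  # step in the direction of steepest descent

print(f"x after 50 iterations: {x:.6f}")  # very close to 0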

We use gradient descent to fit the linear regression parameters θ to our dataset.

Update Equations

The objective of linear regression is to minimize the cost function

J(θ) = 1/(2m) * Σ (hθ(x(i)) - y(i))², summed over the m training examples,

where the hypothesis hθ(x) is given by the linear model:

hθ(x) = θ0 + θ1 * x

Recall that the parameters of your model are the θj values. These are the values you will adjust to minimize the cost J(θ). One way to do this is to use the batch gradient descent algorithm. In batch gradient descent, each iteration performs the update (simultaneously for all j):

θj := θj - α * (1/m) * Σ (hθ(x(i)) - y(i)) * xj(i)

With each step of gradient descent, your parameters θj come closer to the optimal values that will achieve the lowest cost J(θ).

Implementation

We have already set up the data for linear regression. In the following cell, we add another dimension to our data to accommodate the θ0 intercept term.

# Add a column of ones to X for the intercept term θ0.
# np.stack joins the arrays along a new axis: axis=0 would stack them as rows
# (training examples), axis=1 stacks them as columns (features).
X = np.stack([np.ones(m), X], axis=1)

Computing the cost 𝐽(𝜃)

As you perform gradient descent to minimize the cost function J(θ), it is helpful to monitor the convergence by computing the cost.

In this section, you will implement a function to calculate 𝐽(𝜃) so you can check the convergence of your gradient descent implementation.

The function computeCost computes J(θ): the cost of using theta as the parameter for linear regression to fit the data points in X and y. As you are doing this, remember that the variables X and y are not scalar values, but matrices whose rows represent the examples from the training set.

def computeCost(X, y, theta):
    """Compute the cost J(theta) of fitting the data points in X and y with theta."""
    m = y.size  # number of training examples
    J = 1. / (2. * m) * np.sum((np.dot(X, theta) - y) ** 2)
    return J

Gradient descent method

Next, you will implement gradient descent in a function. The loop structure has been written for you, and you only need to supply the updates to θ within each iteration.

As you program, make sure you understand what you are trying to optimize and what is being updated. Keep in mind that the cost J(θ) is parameterized by the vector θ, not by X and y. That is, we minimize the value of J(θ) by changing the values of the vector θ, not by changing X or y. Refer to the equations above if you are uncertain. A good way to verify that gradient descent is working correctly is to look at the value of J(θ) and check that it is decreasing with each step.

The starter code for the function gradientDescent calls computeCost on every iteration and saves the cost to a Python list. Assuming you have implemented gradient descent and computeCost correctly, your value of J(θ) should never increase and should converge to a steady value by the end of the algorithm.

def gradientDescent(X, y, theta, alpha, num_iters):
    # Initialize some useful values
    m = y.shape[0]  # number of training examples
    theta = theta.copy()  # work on a copy so the array passed in is not modified
    J_history = []  # use a Python list to save the cost in every iteration

    for i in range(num_iters):
        theta = theta - (alpha / m) * np.dot(X.T, (np.dot(X, theta) - y))
        # save the cost J in every iteration
        J_history.append(computeCost(X, y, theta))

    return theta, J_history

After you are finished, call the implemented gradientDescent function and print the computed θ. We initialize the θ parameters to 0 and the learning rate α to 0.01. Execute the following cell to check your code.

# initialize fitting parameters
theta = np.zeros((2, 1))
# some gradient descent settings
iterations = 1500
alpha = 0.01

# run gradient descent
theta, J_hist = gradientDescent(X, y, theta, alpha, iterations)

print('Theta found by gradient descent:')
print('𝜃 = ' + str(theta[0][0]) + ' , ' + str(theta[1][0]) )

Theta found by gradient descent:
𝜃 = -3.6302914394043593, 1.1663623503355818

Visualizing Linear Fit

We will use the final parameters to plot the linear fit.
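A sketch of that plot, assuming X is the m x 2 matrix with the added column of ones and theta is the vector returned by gradientDescent above:

import matplotlib.pyplot as plt

# Scatter of the training data plus the fitted line h(x) = theta0 + theta1 * x
plt.figure(figsize=(8, 5))
plt.scatter(X[:, 1], y, marker='x', c='red', label='Training data')
plt.plot(X[:, 1], np.dot(X, theta), c='blue', label='Linear regression')
plt.xlabel('Population of city in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.legend()
plt.show()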

Prediction of Profits for a Food Truck

Now that we have built our linear regression model, we can test some predictions.

prediction_Value = 7 (Population in 10,000s)

Predictions from Linear Regression Algorithm from scratch

predict = np.dot([[1, prediction_Value]], theta)
print('The 𝜃 value: ' + str(theta[0][0]) + ' , ' + str(theta[1][0]) )
print('For population = 70,000, we predict a profit of $', round(predict[0][0]*10000, 2))

The 𝜃 value: -3.6302914394043593, 1.1663623503355818
For population = 70,000, we predict a profit of $ 45342.45

Predictions of LinearRegression Model from sklearn

We can compare the results with the LinearRegression class from sklearn's linear_model module, which implements ordinary least squares linear regression.

from sklearn.linear_model import LinearRegression

X = df[['population']]
y = df[['profit']]

reg = LinearRegression().fit(X, y)
print('Score:')
print(reg.score(X, y))

print('The 𝜃 value: ' + str(reg.intercept_[0]) + ' , ' + str(reg.coef_[0][0]))
print('For population = 70,000, we predict a profit of $', round(reg.predict(np.array([[7]]))[0][0] * 10000, 2))

Score:
0.7020315537841397
The 𝜃 value: -3.89578087831185, 1.1930336441895935
For population = 70,000, we predict a profit of $ 44554.55

Visualizing 𝐽(𝜃)

To understand the cost function 𝐽(𝜃) better, you will now plot the cost over a 2-dimensional grid of 𝜃0 and 𝜃1 values. You will not need to code anything new for this part, but you should understand how the code you have written already is creating these images.

In the next cell, the code is set up to calculate 𝐽(𝜃) over a grid of values using the computeCost function that you wrote. After executing the following cell, you will have a 2-D array of 𝐽(𝜃) values. Then, those values are used to produce surface and contour plots of 𝐽(𝜃) using the matplotlib plot_surface and contourf functions. The plots should look something like the following:

The purpose of these graphs is to show you how J(θ) varies with changes in θ0 and θ1. The cost function J(θ) is bowl-shaped and has a global minimum. (This is easier to see in the contour plot than in the 3D surface plot.) This minimum is the optimal point for θ0 and θ1, and each step of gradient descent moves closer to this point.
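The grid-evaluation cell is not reproduced here; the sketch below shows how it can be done with the computeCost function we wrote. The design matrix is rebuilt (X was overwritten by the sklearn comparison above) and the grid ranges follow the original exercise:

import numpy as np
import matplotlib.pyplot as plt

# Rebuild the m x 2 design matrix and the m x 1 target vector
X_design = np.stack([np.ones(m), df['population'].values], axis=1)
y_vec = df[['profit']].values.reshape((m, 1))

# Grid of theta0 / theta1 values over which to evaluate the cost
theta0_vals = np.linspace(-10, 10, 100)
theta1_vals = np.linspace(-1, 4, 100)
J_vals = np.zeros((theta0_vals.size, theta1_vals.size))

for i, t0 in enumerate(theta0_vals):
    for j, t1 in enumerate(theta1_vals):
        J_vals[i, j] = computeCost(X_design, y_vec, np.array([[t0], [t1]]))

# J_vals is indexed as [theta0, theta1]; transpose it to match meshgrid's layout
T0, T1 = np.meshgrid(theta0_vals, theta1_vals)
J_vals = J_vals.T

fig = plt.figure(figsize=(12, 5))

# Surface plot of the cost
ax = fig.add_subplot(121, projection='3d')
ax.plot_surface(T0, T1, J_vals, cmap='viridis')
ax.set_xlabel('theta0')
ax.set_ylabel('theta1')

# Contour plot with the theta found by gradient descent marked with an x
ax2 = fig.add_subplot(122)
ax2.contourf(T0, T1, J_vals, levels=np.logspace(-2, 3, 20), extend='max')
ax2.plot(theta[0], theta[1], 'rx')
ax2.set_xlabel('theta0')
ax2.set_ylabel('theta1')
plt.show()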

Linear regression with multiple variables

In this part, you will implement linear regression with multiple variables to predict the prices of houses. Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices.

In the context of house price prediction, linear regression can be used to predict the value of a house based on a set of predictor variables such as size, location, age, and number of bedrooms.

The goal of linear regression is to find the line of best fit that minimizes the distance between the points and the line. The equation of this line can be used to make predictions about the value of a house given a set of predictor variables.

Dataset

The file dataset2.txt contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house. [3]

Note: this dataset does not need any cleaning, only the feature normalization applied below.

Notation:

m = Number of training examples

X = ‘Input’ variables / Features

y = ‘Output’ variable / ‘target’ variable

#Load Data
df = pd.read_csv('https://raw.githubusercontent.com/DiegoHurtad0/Linear-Regression-Model-Representation-Implementation-From-Scratch-using-Python/main/data/dataset2.txt', skiprows = 0, header = None, names = ['size sq-ft', 'bedrooms', 'price'])
X = df[['size sq-ft', 'bedrooms']].values
y = df[['price']].values
m = y.size
#Reshape y to a mx1 matrix
y = y.reshape((m, 1))

df.head(5)

Feature Normalization

Feature normalization is a technique used to transform the values of numeric columns in a dataset to a common scale, without distorting differences in the range or distribution of the data. This can be useful for machine learning models that use optimization algorithms that require features to be on a similar scale in order to work effectively. [5]

There are several methods for normalizing features, but one common method is to subtract the mean of the feature from each value, and then divide the result by the standard deviation of the feature. This scales the feature values to have zero mean and unit variance.

For example, suppose we have a dataset with a numeric column x. To normalize this feature, we can apply the following transformation to each value xi in the column:

xi_normalized = (xi - mean(x)) / std(x)

where mean(x) and std(x) are the mean and standard deviation of the feature x, respectively.

Another method for normalizing features is to scale the values to a fixed range, such as [0, 1] or [-1, 1]. This can be done by subtracting the minimum value of the feature from each value, and then dividing the result by the range (max - min) of the feature.

Normalizing features can help machine learning algorithms to converge faster and perform better, but it is important to remember to apply the same normalization transformation to the test set when evaluating the model.
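A minimal sketch of the min-max scaling described above, on a made-up array (the housing example below keeps using the z-score normalization):

import numpy as np

# Made-up feature values for illustration
x = np.array([50.0, 20.0, 80.0, 35.0, 65.0])

# Min-max scaling to the range [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)  # 0.0 corresponds to the minimum value, 1.0 to the maximum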

This technique is used in many machine learning libraries such as scikit-learn, TensorFlow, and Keras.

Source: scikit-learn documentation (https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)

Feature Normalization from scratch

def featureNormalize(X):
    """Return the z-score normalized features along with the mean and standard deviation of each column."""
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma

# call featureNormalize on the loaded data
X_norm, mu, sigma = featureNormalize(X)
print('Computed mean:', mu)
print('Computed standard deviation:', sigma)

Computed mean: [2000.68085106 3.17021277]
Computed standard deviation: [7.86202619e+02 7.52842809e-01]

Feature Normalization with the StandardScaler class

Source: scikit-learn documentation (https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)

from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X)
# standardize the loaded data with the fitted scaler
X_scaled = scaler.transform(X)
print('Computed mean:', scaler.mean_)
print('Computed standard deviation:', scaler.scale_)

Computed mean: [2000.68085106 3.17021277]
Computed standard deviation: [7.86202619e+02 7.52842809e-01]

Gradient Descent Multi

Previously, you implemented gradient descent on a univariate regression problem. The only difference now is that there is one more feature in the matrix 𝑋. The hypothesis function and the batch gradient descent update rule remain unchanged. [3]

You should complete the code for the functions computeCostMulti and gradientDescentMulti to implement the cost function and gradient descent for linear regression with multiple variables. If your code in the previous part (single variable) already supports multiple variables, you can use it here too. Make sure your code supports any number of features and is well-vectorized. You can use the shape property of numpy arrays to find out how many features are present in the dataset.

Implementation Note: In the multivariate case, the cost function can also be written in the vectorized form J(θ) = 1/(2m) * (Xθ - y)ᵀ (Xθ - y), where X is the m x (n+1) design matrix whose rows are the training examples (with the leading column of ones) and y is the m x 1 vector of targets.

def computeCostMulti(X, y, theta):
    m = y.shape[0]  # number of training examples
    J = 1. / (2. * m) * np.sum((np.dot(X, theta) - y) ** 2)
    return J

The gradient descent update is vectorized in the same way:

def gradientDescentMulti(X, y, theta, alpha, num_iters):
    m = y.shape[0]  # number of training examples

    # make a copy of theta, which will be updated by gradient descent
    theta = theta.copy()

    J_history = []

    for i in range(num_iters):
        theta = theta - (alpha / m) * np.dot(X.T, (np.dot(X, theta) - y))
        # save the cost J in every iteration
        J_history.append(computeCostMulti(X, y, theta))

    return theta, J_history

Selecting learning rates

In this part of the exercise, you will try out different learning rates for the dataset and find one that converges quickly. You can change the learning rate by modifying the value of alpha in the code below.

Use your implementation of the gradientDescentMulti function and run gradient descent for about 50 iterations at each chosen learning rate. The function also returns the history of J(θ) values in a list.

After the last iteration, plot the J values against the number of iterations.
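A sketch of that experiment, assuming X_norm, mu, sigma, y, m, and gradientDescentMulti are defined as above; the candidate learning rates below are just reasonable guesses:

import numpy as np
import matplotlib.pyplot as plt

# Add the intercept column to the normalized features
X_ready = np.concatenate([np.ones((m, 1)), X_norm], axis=1)

num_iters = 50
alphas = [0.3, 0.1, 0.03, 0.01]  # candidate learning rates (arbitrary choices)

plt.figure(figsize=(8, 5))
for alpha in alphas:
    theta = np.zeros((3, 1))
    theta, J_history = gradientDescentMulti(X_ready, y, theta, alpha, num_iters)
    plt.plot(range(1, num_iters + 1), J_history, label=f'alpha = {alpha}')

plt.xlabel('Number of iterations')
plt.ylabel('Cost J(theta)')
plt.legend()
plt.show()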

Predict House Price

Now that we have implemented the model, we can start making predictions.

Regression Evaluation Metrics

Regression evaluation metrics are used to evaluate the performance of a regression model. These metrics provide a quantitative measure of the accuracy of the model’s predictions and allow us to compare the performance of different models.

There are several common regression evaluation metrics, including:

Mean absolute error (MAE): This metric measures the average absolute difference between the predicted values and the true values. It is calculated as the sum of the absolute differences between the predicted and true values, divided by the number of samples.

Mean squared error (MSE): This metric measures the average squared difference between the predicted values and the true values. It is calculated as the sum of the squared differences between the predicted and true values, divided by the number of samples.

Root mean squared error (RMSE): This metric is the square root of the mean squared error. It is often used because it is in the same units as the original data, which makes it easier to interpret.

R-squared (R2): This metric is a measure of the goodness of fit of the model. It typically ranges from 0 to 1 (it can be negative for a model that fits worse than predicting the mean), with higher values indicating a better fit.

There are other regression evaluation metrics that can be used, depending on the specific needs of the problem. It is important to choose an appropriate metric that reflects the goals of the model and the characteristics of the data.
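A quick sketch of these four metrics computed with scikit-learn on made-up values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.0])

mae = mean_absolute_error(y_true, y_hat)
mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE
r2 = r2_score(y_true, y_hat)

print(f'MAE:  {mae:.3f}')
print(f'MSE:  {mse:.3f}')
print(f'RMSE: {rmse:.3f}')
print(f'R2:   {r2:.3f}')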

Values:

size_ft = 1650 # house size in sq-ft
bathrooms = 3 # number of bedrooms

Predict House Price using our model
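A sketch of this step using the functions defined earlier: normalize the features, add the intercept column, run gradient descent, and then normalize the query point with the same mu and sigma. The learning rate and iteration count below are assumptions:

# Normalize the features and add the intercept column
X_norm, mu, sigma = featureNormalize(X)
X_ready = np.concatenate([np.ones((m, 1)), X_norm], axis=1)

# Fit theta with batch gradient descent (alpha and num_iters are assumptions)
alpha = 0.1
num_iters = 400
theta = np.zeros((3, 1))
theta, J_history = gradientDescentMulti(X_ready, y, theta, alpha, num_iters)
print('theta computed from gradient descent:', theta.ravel())

# The query point must be normalized with the same mu and sigma as the training data
query = np.array([[1, (size_ft - mu[0]) / sigma[0], (bathrooms - mu[1]) / sigma[1]]])
price = np.dot(query, theta)[0][0]
print('The predicted price of a 1650 sq-ft, 3 bedroom house:', round(price, 1))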

theta computed from gradient descent: [[340412.56301439] [109370.05670466] [-6500.61509507]]

The predicted price of a 1650 sq-ft, 3 bedroom house: 293098.5

Predict House Price using LinearRegression Model From sklearn.linear_model

X = X_scaled
y = df[['price']]

reg = LinearRegression().fit(X, y)
print('The accuracy of the Linear Regression model:')
print(reg.score(X, y))  # reg.score returns the R^2 on the training data

print('The 𝜃 value: ' + str(reg.intercept_[0]) + ' , ' + str(reg.coef_[0][0]))
print('Predicted price of a 1650 sq-ft, 3 br house (sklearn.linear_model):', round(reg.predict(np.array([[(size_ft - mu[0])/sigma[0], (bathrooms - mu[1]) / sigma[1]]]))[0][0], 1) )

The accuracy of the Linear Regression model:
0.7329450180289143
The 𝜃 value: 340412.6595744681 , 109447.7964696418
Predicted price of a 1650 sq-ft, 3 br house (sklearn.linear_model): 293081.5

We can observe that the fit is not that good: an R2 of about 0.73.

Underfitting

Underfitting in linear regression occurs when the model is not complex enough to capture the underlying relationship between the independent and dependent variables in the data. This can happen for a few reasons:

  1. The model is too simple: If the model only has a few parameters, it may not be able to capture the complexity of the data. In such cases, a more complex model with more parameters may be required.
  2. The data is not linear: If the data has a non-linear relationship between the independent and dependent variables, a linear model will not be able to fit the data well.
  3. Insufficient data: With limited data, the model may not be able to learn the underlying relationship between the independent and dependent variables.

Underfitting can be identified by high error on both the training data and the validation data: the model cannot capture the training data well, so it also fails to generalize to new, unseen data.

How to solve Underfitting in linear regression model

There are several ways to solve underfitting in linear regression:

  1. Adding more features: By adding more features, the model can become more complex, allowing it to capture the underlying relationship between the independent and dependent variables more accurately.
  2. Using non-linear features: Adding non-linear features or interactions between features can help the model capture non-linear relationships in the data.
  3. Using a more complex model: Instead of using a linear model, using a more complex model like polynomial regression, decision tree regression or random forest regression may be able to better fit the data.
  4. Increasing the amount of data: With more data, the model can learn the underlying relationship between the independent and dependent variables more accurately.
  5. Regularization: Regularization is a technique that helps to prevent overfitting by adding a penalty term to the loss function to discourage large coefficients. L1 and L2 regularization are the most popular ways of regularizing linear regression.
  6. Cross-validation: Using cross-validation techniques (such as k-fold cross-validation) can help to identify and avoid overfitting by evaluating the performance of the model on unseen data.

It’s important to note that there’s no one-size-fits-all solution to underfitting and it may require trying different approaches and experimenting with different parameters to find the best solution.

In this case, we can observe that we only have 2 features and that the sample size is small (47 examples).

Setting up train and test split

Setting up a train and test split is a common technique used in machine learning to evaluate the performance of a model. The goal of the train and test split is to evaluate the model’s ability to generalize to new, unseen data.

In the train and test split, the original dataset is randomly divided into a training set and a testing set. The model is trained on the training set, and then its performance is evaluated on the testing set. This allows the model to be trained and tested on different data, which gives a more realistic evaluation of its performance.

The train and test split is useful because it allows the model to be trained and evaluated in a controlled way. It is important to evaluate the model on unseen data, as this allows us to gauge its performance on data that it has not seen before and determine whether it is overfitting or underfitting.

In summary, setting up a train and test split is an important step in the machine learning process as it helps to evaluate the model’s ability to generalize to new, unseen data and ensure that it is not overfitting or underfitting.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

Linear Regression with split dataset

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

#Create linear regression
reg = LinearRegression()

#Train the model using the training sets
reg.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = reg.predict(X_test)
print('The accuracy of the Linear Regression',r2_score(y_test,y_pred))
print('The 𝜃 value: ' + str(reg.intercept_[0]) + ' , ' + str(reg.coef_[0][0]))
print('Predicted price of a 1650 sq-ft, 3 br house (linear_model.LinearRegression):', round(reg.predict(np.array([[(size_ft - mu[0])/sigma[0], (bathrooms - mu[1]) / sigma[1]]]))[0][0], 1) )
print('MSE: ', mean_squared_error(y_test, y_pred))  # mean squared error (its square root would be the RMSE)

The accuracy of the Linear Regression 0.8268330726601523
The 𝜃 value: 341272.94128357305 , 108050.93183004484
Predicted price of a 1650 sq-ft, 3 br house (linear_model.LinearRegression): 292732.0
MSE: 2226159168.988815

Regularization

L1 and L2 regularization are techniques used to prevent overfitting in machine learning models. Overfitting occurs when a model is too complex and has too many parameters, which can lead to poor performance on new, unseen data. Regularization is a way of limiting the complexity of a model by adding a penalty term to the loss function. This penalty term discourages the model from learning too many parameters, which can help improve its generalization ability.

In summary, L1 and L2 regularization are similar in that they both add a penalty term to the loss function to prevent overfitting. The main difference is that L1 regularization encourages sparsity in the model by setting some weights to zero, while L2 regularization encourages weight values to be small but non-zero.

Lasso Regression (L1 Regularization)

L1 regularization, also known as Lasso regularization, adds a penalty term to the objective function that is proportional to the absolute value of the coefficients. This results in some coefficients being exactly equal to zero, effectively eliminating them from the model. L1 regularization is useful for feature selection, as it can automatically select the most important features.

from sklearn.linear_model import Lasso

lasso=Lasso(alpha=0.8)
lasso.fit(X_train, y_train)
y_pred3=lasso.predict(X_test)

print('The accuracy of the Lasso Regression',lasso.score(X_test, y_test))
print('Predicted price of a 1650 sq-ft, 3 br house (linear_model.Lasso):', round(lasso.predict(np.array([[(size_ft - mu[0])/sigma[0], (bathrooms - mu[1]) / sigma[1]]]))[0], 2) )

Ridge Regression (L2 Regularization)

L2 regularization, also known as Ridge regularization, adds a penalty term to the objective function that is proportional to the square of the coefficients. This results in the coefficients being small, but non-zero. L2 regularization is useful for preventing overfitting and improving the generalization of the model.

from sklearn.linear_model import Ridge

#Create Ridge regression
ridge = Ridge()

#Train the model using the training sets
ridge.fit(X_train, y_train)

# Make predictions using the testing set - Ridge Regression
test_ridge = ridge.predict(X_test)
print('The accuracy of the Ridge Regression is', r2_score(y_test, test_ridge))
print('Predicted price of a 1650 sq-ft, 3 br house (linear_model.Ridge):', round( ridge.predict(np.array([[(size_ft - mu[0])/sigma[0], (bathrooms - mu[1]) / sigma[1]]]))[0][0] , 2) )

Hyperparameter tuning

Hyperparameter tuning is the process of adjusting the hyperparameters of a machine learning model to optimize its performance. Hyperparameters are parameters that are set prior to training the model and are not learned from the training data.

Hyperparameter tuning is an important step in the machine learning process because the performance of the model can be significantly affected by the choice of hyperparameters. Different hyperparameter values can result in drastically different model performance, and finding the optimal values can be a challenging task.

There are several methods for hyperparameter tuning, including manual tuning, grid search, and random search. In manual tuning, the hyperparameters are manually adjusted by the user. In grid search, a predefined set of hyperparameter values is specified, and the model is trained and evaluated using all possible combinations of these values. In random search, the hyperparameter values are chosen randomly from a predefined range.

Hyperparameter tuning can be a time-consuming process, but it is an important step in the machine learning workflow as it can significantly improve the performance of the model.

Hyperparameter tuning Ridge Regression (L2 Regularization)

#find best alpha for Ridge Regression
from sklearn.model_selection import GridSearchCV

param_grid={'alpha': np.arange(1, 500, 10)} # alpha values from 1 to 500 in steps of 10
ridge=Ridge()
ridge_best_alpha=GridSearchCV(ridge, param_grid)
ridge_best_alpha.fit(X_train,y_train)

print("Best alpha for Ridge Regression:",ridge_best_alpha.best_params_)
print("Best score for Ridge Regression with best alpha:",ridge_best_alpha.best_score_)

Hyperparameter tuning Lasso Regression (L1 Regularization)

from sklearn.model_selection import GridSearchCV

param_grid={'alpha': np.arange(0, 1, 0.1)} # alpha values from 0 to 1 in steps of 0.1
lasso=Lasso()
lasso_best_alpha=GridSearchCV(lasso, param_grid)
lasso_best_alpha.fit(X_train,y_train)

print("Best alpha for Lasso Regression:",lasso_best_alpha.best_params_)
print("Best score for Lasso Regression with best alpha:",lasso_best_alpha.best_score_)

Best score for Lasso Regression with best alpha: 0.15732202460557382

# use the best alpha found by the grid search
alpha = lasso_best_alpha.best_params_['alpha']

# Initialising Lasso() with above alpha
lasso = Lasso(alpha=alpha)

#fitting model
lasso.fit(X_train,y_train)

print('The accuracy of the Lasso Regression',lasso.score(X_test, y_test))
print('Predicted price of a 1650 sq-ft, 3 br house (linear_model.Lasso):', round(lasso.predict(np.array([[(size_ft - mu[0])/sigma[0], (bathrooms - mu[1]) / sigma[1]]]))[0], 2) )

The accuracy of the Lasso Regression 0.826833643868359
Predicted price of a 1650 sq-ft, 3 br house (linear_model.Lasso): 292732.05

Hyperparameter tuning ElasticNet Regression

from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

# list of alpha values to tune
params = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]}

# defining cross validation folds as 8
folds = 8

# Initialising ElasticNet()
elasticnet = ElasticNet()

# same setup as for Ridge tuning, except the estimator here is ElasticNet
grid_cv_model = GridSearchCV(
    estimator=elasticnet,
    param_grid=params,
    scoring='neg_mean_absolute_error',
    cv=folds,
    return_train_score=True,
    verbose=1
)

# fitting the cross-validated model
grid_cv_model.fit(X_train, y_train)

# Checking the best alpha found by the grid search
print(grid_cv_model.best_params_)

alpha = grid_cv_model.best_params_["alpha"]

# Defining ElasticNet with the above alpha
elasticnet = ElasticNet(alpha=alpha)

# fitting the elastic net
elasticnet.fit(X_train, y_train)

pred_by_elasticnet = elasticnet.predict(X_test)

print('The accuracy of the ElasticNet Regression is', round(r2_score(y_test, pred_by_elasticnet), 2))
print(f'ElasticNet RMSE: {np.sqrt(mean_squared_error(y_test, pred_by_elasticnet))}')
print('Predicted price of a 1650 sq-ft, 3 br house (linear_model.ElasticNet):', round(elasticnet.predict(np.array([[(size_ft - mu[0])/sigma[0], (bathrooms - mu[1]) / sigma[1]]]))[0], 2) )

Predicted price of a 1650 sq-ft, 3 br house (linear_model.ElasticNet): 295701.11

The next phase is to apply these concepts to a more complex dataset.

For that, I am using the House Prices — Advanced Regression Techniques competition dataset from Kaggle [6].


Before we start modeling, we need to explore the dataset.

EDA

Exploratory Data Analysis (EDA) is an approach to analyzing and summarizing datasets to gain insights and understanding about the data. It is a crucial step in the data science process as it helps identify patterns, trends, and relationships in the data that can inform later steps in the process such as building models or making decisions. EDA typically involves visualizing the data using graphs and plots, as well as applying statistical techniques to summarize and understand the data. It is called “exploratory” because the analyst is often unsure about what they are looking for in the data and is using EDA to help them discover what the data can tell them.

We have 1460 rows and 81 columns
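A few typical EDA commands for this step, assuming the Kaggle train.csv has been loaded into df:

# Basic shape and column overview
print(df.shape)  # (1460, 81)
print(df.dtypes.value_counts())

# Summary statistics of the target variable
print(df['SalePrice'].describe())

# Columns with the most missing values
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0].head(10))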

Normalization, outlier removal, and skewness
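This step is not reproduced in the article, so the sketch below only illustrates one common approach for this dataset: log-transform the skewed SalePrice and drop a few extreme GrLivArea outliers. The thresholds are assumptions, not taken from the original notebook:

import numpy as np

# Log-transform the right-skewed target so its distribution is closer to normal
# (later predictions are then on the log scale)
df['SalePrice'] = np.log1p(df['SalePrice'])

# Drop a handful of extreme outliers: very large houses sold at low prices
# (the thresholds are illustrative)
df = df[~((df['GrLivArea'] > 4000) & (df['SalePrice'] < np.log1p(300000)))]

print(df.shape)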

Filling Missing Values

from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator

class DataImputer(BaseEstimator, TransformerMixin):
    """
    Impute missing values in a single column with the given strategy.
    """

    def __init__(self, strategy=None, filler=None):
        self.strategy = strategy
        self.filler = filler

    def fit(self, X, column, strategy):
        # fill the missing values of the column according to the chosen strategy
        if strategy == 'mean':
            X[column] = X[column].fillna(X[column].mean())
        elif strategy == 'median':
            X[column] = X[column].fillna(X[column].median())
        elif strategy == 'mode':
            X[column] = X[column].fillna(X[column].mode().iloc[0])
        elif strategy == 'zero':
            X[column] = X[column].fillna(0)
        elif strategy == 'none':
            X[column] = X[column].fillna("None")

        return X[column]

Filling missing values

imputer = DataImputer()

# df_clean maps each column of the housing data to the fill strategy it needs
df_clean = pd.read_csv('https://raw.githubusercontent.com/DiegoHurtad0/Linear-Regression-Model-Representation-Implementation-From-Scratch-using-Python/main/data/data_cleaning_hp.csv')

# LotFrontage is filled with the median LotFrontage of the house's neighborhood
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

## loop to fill the missing values using the DataImputer class
for i in df_clean['fill type'].unique():
    strategy = i
    for j in df_clean[df_clean['fill type'] == i]['column name'].tolist():
        column = j
        print(strategy, column)
        df[column] = imputer.fit(df, column, strategy)

For this version, I am using the same variables as in the previous exercise: the total square footage and the number of bedrooms.

X = df[['Totalsqr', 'BedroomAbvGr']].values
y = df[['SalePrice']].values
m = y.size
#Reshape y to a mx1 matrix
y = y.reshape((m, 1))

df.head(5)

Feature scaling

from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X)

# call featureNormalize on the loaded data
X_norm, mu, sigma = featureNormalize(X)

print('Computed mean:', mu)
print('Computed standard deviation:', sigma)

# standardize the same data with the fitted StandardScaler
X_scaled = scaler.transform(X)
print('Computed mean:', scaler.mean_)
print('Computed standard deviation:', scaler.scale_)

X = X_scaled

Train and test datasets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

Linear regression


size_ft = 2416 # total square footage
bathrooms = 3 # number of bedrooms

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

#Create linear regression
reg = LinearRegression()

#Train the model using the training sets
reg.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = reg.predict(X_test)
print('The accuracy of the Linear Regression',r2_score(y_test,y_pred))
print('The 𝜃 value: ' + str(reg.intercept_[0]) + ' , ' + str(reg.coef_[0][0]))
print('Predicted price of a ' + str(size_ft) + ' sq-ft, 3 br house (linear_model.LinearRegression):', round(reg.predict(np.array([[(size_ft - mu[0])/sigma[0], (bathrooms - mu[1]) / sigma[1]]]))[0][0], 1) )
print('MSE: ', mean_squared_error(y_test, y_pred))

Ridge Regression

from sklearn.linear_model import Ridge

#Create Ridge regression
ridge = Ridge()

#Train the model using the training sets
ridge.fit(X_train, y_train)

# Make predictions using the testing set - Ridge Regression
test_ridge = ridge.predict(X_test)
print('The accuracy of the Ridge Regression is', r2_score(y_test, test_ridge))
print('Predicted price of a ' + str(size_ft) + ' sq-ft, 3 br house (linear_model.Ridge):', round( ridge.predict(np.array([[(size_ft - mu[0])/sigma[0], (bathrooms - mu[1]) / sigma[1]]]))[0][0] , 2) )

Predictions (on the log-transformed sale price scale):

Actual sale price: 12.247

Linear Regression: 12.3

Lasso Regression: 12.28

Conclusions:

The goal of this post is to explain the model representation of linear regression and its implementation from scratch, in order to understand what is behind the pre-built model classes and to compare the from-scratch implementation with the pre-built classes of sklearn.

Future Work:

Explain more details of the preprocessing process

Use all the columns

Feature engineering, feature selection

Use more complex methods like:

  • XGBoost Regressor
  • Light Gradient Boosting Regressor
  • Support Vector Regressor
  • Gradient Boosting Regressor
  • Random Forest Regressor

Jupyter Notebook:

https://github.com/DiegoHurtad0/Linear-Regression-Model-Representation-Implementation-From-Scratch-using-Python

Contact Me:


LinkedIn: Diego Gustavo Hurtado Olivares

References

[1] Douglas C. Montgomery, Elizabeth A. Peck, and G. Geoffrey Vining, Introduction to Linear Regression Analysis. John Wiley & Sons, Hoboken, New Jersey. ISBN 978-0-470-17248-9.

[2] Brown, Tom B., et al. "Language Models are Few-Shot Learners." arXiv preprint arXiv:2005.14165 (2020).

[3] Machine Learning Specialization. Available at: https://in.coursera.org/specializations/machine-learning-introduction (Accessed: January 13, 2023).

[4] “A Stochastic Approximation Method” by Robbins and Monro (1951) which introduced the stochastic gradient descent algorithm.

[5] Preprocessing data. Available at: https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler (Accessed: January 13, 2023).

[6] House Prices — Advanced Regression Techniques. Available at: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview/description (Accessed: January 13, 2023).


Diego Hurtado

MSc, Senior Data Scientist with strong data visualization skills and a creative mind for building enterprise-level ML solutions.