Linear Regression From Scratch In Python

Amit Subhash Chejara
11 min readJun 23, 2024

--

Many packages in Python provide the ease of implementing linear regression in just a few lines of code. However, understanding what lies beneath the surface is equally important to gain a deeper understanding of the concepts. In this article, we will learn to implement linear regression from scratch in Python to grasp the fundamental principles.

Before diving into the implementation, it is assumed that you have already prepared your data in a suitable format. This includes ensuring that your predictor variables are in a numerical format and that your outcome variable is continuous. Additionally, it is assumed that your data is free from missing values and outliers, and that any necessary data preprocessing steps have been taken. With your data ready, we can now proceed to implement linear regression from scratch in Python.

Splitting The Data Into Training And Test Sets

The code begins by importing the necessary libraries, including NumPy, Matplotlib, and Pandas. It then assumes that the data is stored in a DataFrame and converts it into numpy arrays for easier manipulation. The DataFrame is split into training, cross-validation, and test sets. The training set comprises 60% of the data, while the remaining 40% is divided equally between cross-validation and the test set. This split allows for the evaluation of the model’s performance on unseen data.

# Importing the necessary libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


# Splitting the dataset into the Training set, Cross Validation set, and Test set

# Assuming the data is in a DataFrame with more than one column and rows and one target column
df = pd.DataFrame("Your dataframe path here")

# Convert DataFrame to numpy arrays
X = df[['X1', 'X2', ...,"Xn"]].values # Choose the feature columns you want to include
Y = df['Your target column'].values

# Calculate the number of rows in the dataset
n = len(Y)

# Split 60% of the dataset as the training set
train_end = int(0.6 * n)
x_train = X[:train_end]
y_train = Y[:train_end]

# Split the remaining 40% into two: one half for cross validation and the other for the test set
cv_end = int(0.5 * (n - train_end))
x_cv = X[train_end:train_end + cv_end]
y_cv = Y[train_end:train_end + cv_end]
x_test = X[train_end + cv_end:]
y_test = Y[train_end + cv_end:]

print(f"""The shape of the training set (input) is: {x_train.shape}\n
The shape of the training set (target) is: {y_train.shape}\n
The shape of the cross validation set (input) is: {x_cv.shape}\n
The shape of the cross validation set (target) is: {y_cv.shape}\n
The shape of the test set (input) is: {x_test.shape}\n
The shape of the test set (target) is: {y_test.shape}""")

The code prints the shapes and sizes of each dataset, providing a clear understanding of the data distribution. This information is crucial for model development and evaluation. For instance, knowing the number of samples and features in each set helps in determining the appropriate model architecture and hyperparameters.

The code prepares the data for modeling by splitting it into input features (X) and the target variable (Y). This separation is necessary for training a linear regression model, which requires separate inputs and outputs. The input features are further split into training, cross-validation, and test sets, ensuring that each set is used for a specific purpose: training the model, evaluating its performance on unseen data, and testing its generalization capabilities.

Defining Functions To Visualize The Data

The code defines two functions to visualize the data: plot_feature_vs_target and plot_feature_vs_target_with_predictions. These functions are used to plot each feature against the target variable, providing a visual representation of the relationship between the features and the target. This is an essential step in exploratory data analysis, as it helps in understanding the distribution of the data and identifying any potential issues or anomalies.

import matplotlib.pyplot as plt
# Visualizing the dataset
def plot_feature_vs_target(x_train, y_train, x_cv, y_cv, x_test, y_test):

# Assuming X_train, y_train, X_cv, y_cv, X_test, y_test are your training,
# cross-validation, and test sets respectively

# Assuming feature_names is a list of the names of your features
feature_names = ['Feature 1', 'Feature 2', 'Feature 3', 'Feature 4', 'Feature 5', 'Add more features according to your dataset']

# Plot each feature against the target variable
num_features = x_train.shape[1]
fig, axs = plt.subplots(num_features, figsize=(8, 6 * num_features))

for i in range(num_features):
axs[i].scatter(x_train[:, i], y_train, label='Training Set')
axs[i].scatter(x_cv[:, i], y_cv, label='Cross-Validation Set')
axs[i].scatter(x_test[:, i], y_test, label='Test Set')
axs[i].set_title(f'{feature_names[i]} vs Target')
axs[i].set_xlabel(feature_names[i])
axs[i].set_ylabel('Target')
axs[i].legend()

plt.tight_layout()
plt.show()
plot_feature_vs_target(x_train, y_train, x_cv, y_cv, x_test, y_test)

def plot_feature_vs_target(x_train, y_train, y_preds_train, x_cv=None, y_cv=None, y_preds_cv=None, x_test=None, y_test=None, y_preds_test=None):
# Assuming feature_names is a list of the names of your features
feature_names = ['Feature 1', 'Feature 2', 'Feature 3', 'Feature 4', 'Feature 5']

# Plot each feature against the target variable
num_features = x_train.shape[1]
fig, axs = plt.subplots(num_features, figsize=(8, 6 * num_features))

for i in range(num_features):
axs[i].scatter(x_train[:, i], y_train, label='Training Set (Actual)')
axs[i].scatter(x_train[:, i], y_preds_train, label='Training Set (Predicted)', marker='o')

if x_cv is not None and y_cv is not None:
axs[i].scatter(x_cv[:, i], y_cv, label='Cross-Validation Set (Actual)')
if y_preds_cv is not None:
axs[i].scatter(x_cv[:, i], y_preds_cv, label='Cross-Validation Set (Predicted)', marker='x')

if x_test is not None and y_test is not None:
axs[i].scatter(x_test[:, i], y_test, label='Test Set (Actual)')
if y_preds_test is not None:
axs[i].scatter(x_test[:, i], y_preds_test, label='Test Set (Predicted)', marker='^')

axs[i].set_title(f'{feature_names[i]} vs Target')
axs[i].set_xlabel(feature_names[i])
axs[i].set_ylabel('Target')
axs[i].legend()

plt.tight_layout()
plt.show()

plot_feature_vs_target(x_train, y_train, y_preds_train, x_cv, y_cv, y_preds_cv, x_test, y_test, y_preds_test)

The plot_feature_vs_target function takes in the training, cross-validation, and test sets, along with their corresponding target variables. It then plots each feature against the target variable for each set, allowing for a comparison of the data distribution across different subsets. This is useful for identifying any differences or inconsistencies in the data between the sets.

The plot_feature_vs_target_with_predictions function takes in the training, cross-validation, and test sets, along with their corresponding target variables and predicted values. It plots each feature against the target variable for each set, including the actual and predicted values. This allows for a visual comparison of the actual and predicted values, helping in evaluating the performance of the model.

Both functions provide customization options, such as the ability to specify the names of the features and the markers used for the actual and predicted values. This allows for a tailored visualization that meets the specific needs of the analysis.

Normalizing The Data

Before implementing the linear regression model, it is important to normalize the data. Normalization is a crucial step in data preprocessing, as it helps to ensure that the features are on a similar scale, which can improve the performance of the model.

The code defines a function called normalize that performs z-score normalization on the input data. Z-score normalization is a common method of normalization, where the data is transformed by subtracting the mean and dividing by the standard deviation of each feature.

The formula for z-score normalization is:
x(normalized​) = (xμ)/σ
where μ is the mean of the feature and σ is the standard deviation of the feature.​

import numpy as np

# Formula to normalize the data is:
# (x - mean) / standard deviation
# This is z-score normalization
# defining a function to normalize the data
def normalize(x):
return (x - np.mean(x, axis=0)) / np.std(x, axis=0)

x_train_norm = normalize(x_train)
x_cv_norm = normalize(x_cv)
x_test_norm = normalize(x_test)

print(x_train_norm.shape, x_cv_norm.shape, x_test_norm.shape)

The code then applies the normalize function to the training, cross-validation, and test sets, creating the normalized versions of the input features: x_train_norm, x_cv_norm, and x_test_norm.

By normalizing the data, the features are transformed to have a mean of 0 and a standard deviation of 1. This ensures that the features are on a similar scale, which can improve the stability and convergence of the linear regression model.

Defining The Linear Regression Model

The code defines a class called Linear_regression to represent the linear regression model. The model is initialized with two parameters: weights and bias. The weights are initialized as a 1D array of zeros, and the bias is initialized as a single zero value. The weights_2d is created by expanding the weights array into a 2D array with one column, which is necessary for matrix multiplication.

import numpy as np

class Linear_regression():
def __init__(self):
self.weights = np.zeros(x_train_norm.shape[1])
self.weights_2d = np.expand_dims(self.weights, axis=1)
self.bias = np.zeros(1)

def predict(self,x):
y_preds_2d = np.matmul(self.weights_2d.T,x.T)+self.bias
y_preds = np.squeeze(y_preds_2d)

return y_preds

# Getting the initial model parameters
model = Linear_regression()
print("The initial weights are: ", model.weights_2d)
print("The initial bias is: ", model.bias)
print("The shape of the initial weights is: ", model.weights_2d.shape)
print("The shape of input is: ", x_train_norm.shape)

The model defined in this code is a simple linear regression model that predicts the output based on the input features. The model’s weights and bias are initialized to zero, which means that the model does not have any prior knowledge or bias. The model’s predictions are calculated by multiplying the input features with the model’s weights and adding the bias.

The predict method is used to make predictions using the model. It takes in the input features x and returns the predicted output. The prediction is calculated by multiplying the input features with the model's weights and adding the bias. The result is a 2D array where each row corresponds to a sample in the input data.

The code prints the initial values of the model’s weights and bias. The shape of the initial weights is a 2D array with one column, which is necessary for matrix multiplication. The shape of the input data is a 2D array where each row corresponds to a sample in the input data.

Training The Model

The training process involves updating the model’s parameters using the derivatives calculated in the derivatives function. The model is trained using the gradient descent algorithm, which iteratively updates the parameters to minimize the cost function. The training process consists of multiple epochs, where each epoch involves calculating the derivatives and updating the parameters. The learning rate determines how much the parameters are updated in each iteration. The model's parameters are updated as follows:

  1. Weight Update: For each weight, the update is calculated as the derivative multiplied by the learning rate.
  2. Bias Update: The bias is updated as the derivative multiplied by the learning rate.

The model’s parameters are updated in each epoch until the desired number of epochs is reached. The cost function is printed at regular intervals to monitor the training process.

import numpy as np

# Defining the cost function
def my_cost_fn(y_train, y_preds, rp):
# Compute squared error
squared_error = np.square(np.subtract(y_preds, y_train))

weights_squared_sum = np.sum(np.square(model.weights_2d.squeeze()))

regularization = (rp * weights_squared_sum)/(2*len(y_train))

# Compute average squared error
avg_squared_error = np.mean(squared_error) / 2

# Add regularization to the cost
cost = avg_squared_error + regularization

return cost

reg_params = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1] # These are the different values of regularization parameter values

for rp in reg_params:
print("Costs with regularization parameter = ", rp)
print("The cost for the training set is: ", my_cost_fn(y_train.squeeze(), y_preds_train, rp))
print("The cost for the cross validation set is: ", my_cost_fn(y_cv, y_preds_cv, rp))
print("The cost for the test set is: ", my_cost_fn(y_test, y_preds_test, rp))
print("\n")

# defining the derivatives.
def derivatives(y_preds, y_train, x_train,rp):
d={"d_w":[], "d_b":[]}
dJ_dw_list=[]
for j in range(len(model.weights_2d.squeeze())):
regularization = (rp*np.sum(model.weights_2d.squeeze()))/(len(y_train))
for i in range(x_train.shape[0]):
dJ_dw_list.append((y_preds[i] - y_train[i])*x_train[i][j])
d["d_w"].append(np.mean(np.array(dJ_dw_list))+regularization)
dJ_dw_list.clear()

dJ_db_list=[]
for i in range(x_train.shape[0]):
dJ_db_list.append((y_preds[i] - y_train[i]))
d["d_b"].append(np.mean(np.array(dJ_db_list)))
return d

#training the model using Gradient descent Algorithm
epochs = # prefered number of epochs
learning_rate = # your prefered learning rate

for epoch in range(epochs):
d=derivatives(y_preds_train,y_train,x_train_norm,rp)
for j in range(len(model.weights_2d)):
model.weights_2d[j] = model.weights_2d[j] - (learning_rate*d["d_w"][j])
model.bias = model.bias - (learning_rate*d["d_b"][0])

y_preds_train = model.predict(x_train_norm)

cost_fn = my_cost_fn(y_train.squeeze(), y_preds_train,rp)


if epoch%30==0:
print("cost function:",cost_fn)

Defining The Cost Function

Line 1: def my_cost_fn(y_train, y_preds, rp):

This line defines a function named my_cost_fn that takes three parameters: y_train, y_preds, and rp. The function calculates the cost of the model given the actual output (y_train), the predicted output (y_preds), and the regularization parameter (rp).

Line 2: squared_error = np.square(np.subtract(y_preds, y_train))

This line calculates the squared error between the actual output (y_train) and the predicted output (y_preds). The np.subtract function subtracts the actual output from the predicted output, and the np.square function squares the result. The squared error is then stored in the squared_error variable.

Line 3: weights_squared_sum = np.sum(np.square(model.weights_2d.squeeze()))

This line calculates the sum of the squares of the model’s weights. The model.weights_2d variable contains the model's weights, and the squeeze method is used to remove any extra dimensions. The np.square function squares the weights, and the np.sum function calculates the sum of the squared weights. The result is stored in the weights_squared_sum variable.

Line 4: regularization = (rp * weights_squared_sum)/(2*len(y_train))

This line calculates the regularization term. The regularization term is a penalty term that is added to the cost function to prevent overfitting. The term is calculated as the product of the regularization parameter (rp) and the sum of the squares of the model's weights, divided by twice the number of training samples (2*len(y_train)).

Line 5: avg_squared_error = np.mean(squared_error) / 2

This line calculates the average squared error. The np.mean function calculates the mean of the squared error, and the result is divided by 2 to get the average squared error.

Line 6: cost = avg_squared_error + regularization

This line calculates the total cost by adding the average squared error and the regularization term.

Line 7: return cost

This line returns the calculated cost.

Calculating The Derivatives

d={"d_w":[], "d_b":[]}

  • This line initializes a dictionary d with two keys: d_w and d_b. These keys will be used to store the derivatives of the weights and bias, respectively.

dJ_dw_list=[]

  • This line initializes an empty list dJ_dw_list to store the derivatives of the weights.

for j in range(len(model.weights_2d.squeeze())):

  • This loop iterates over the number of weights in the model.

regularization = (rp*np.sum(model.weights_2d.squeeze()))/(len(y_train))

  • This line calculates the regularization term. The regularization term is a penalty term that is added to the cost function to prevent overfitting. The term is calculated as the product of the regularization parameter rp and the sum of the squared weights, divided by the number of training samples.

for i in range(x_train.shape):

  • This nested loop iterates over the training samples.

dJ_dw_list.append((y_preds[i] - y_train[i])*x_train[i][j])

  • This line calculates the derivative of the cost function with respect to the current weight by multiplying the difference between the predicted and actual output with the corresponding input feature.

d["d_w"].append(np.mean(np.array(dJ_dw_list))+regularization)

  • This line calculates the mean of the derivatives in the dJ_dw_list list, adds the regularization term, and appends the result to the d_w list in the d dictionary.

dJ_dw_list.clear()

  • This line clears the dJ_dw_list to prepare for the next weight.

dJ_db_list=[]

  • This line initializes an empty list dJ_db_list to store the derivatives of the bias.

for i in range(x_train.shape):

  • This loop iterates over the training samples.

dJ_db_list.append((y_preds[i] - y_train[i]))

  • This line calculates the derivative of the cost function with respect to the bias by subtracting the actual output from the predicted output.

d["d_b"].append(np.mean(np.array(dJ_db_list)))

  • This line calculates the mean of the derivatives in the dJ_db_list list and appends it to the d_b list in the d dictionary.

return d

  • This line returns the d dictionary, which contains the derivatives of the weights and bias.

Finally, the code trains a linear regression model using the gradient descent algorithm. It iterates over a specified number of epochs, updating the model’s weights and bias at each iteration. The derivatives of the cost function with respect to the weights and bias are calculated using the derivatives function, and these derivatives are used to update the model's parameters. The model's predictions are calculated using the predict method, and the cost function is calculated using the my_cost_fn function. The cost function is printed at regular intervals to monitor the training process. The model's parameters are updated using the gradient descent algorithm, which adjusts the parameters in the direction of the negative gradient of the cost function. This process is repeated until the desired number of epochs is reached, resulting in a trained model that can be used to make predictions on new data.

Follow on:

Twitter(X)
LinkedIn
GitHub

Thank you for reading !!!

--

--