Mastering Linear Regression: A Comprehensive Guide from Simple to Polynomial Models with Detailed Math and Code

Ebad Sayed
17 min read · May 8, 2024


Linear regression, a foundational algorithm in data science, plays a pivotal role in predicting continuous outcomes. It models the relationship between variables by assuming a linear connection between the independent and dependent variables, and it seeks the optimal line that minimizes the sum of squared differences between predicted and actual values. Applied in domains like economics and finance, this method is used to analyze and forecast data trends. It extends to multiple linear regression, which involves several independent variables, and is closely related to logistic regression, which is suited to binary classification problems.

Any ML model follows this workflow:
Data → Model → Cost function (J) → Optimization of J

Here we will discuss three types of Linear Regression:

  1. Simple Linear Regression
  2. Multiple Linear Regression
  3. Polynomial Linear Regression

Simple Linear Regression

In a simple linear regression, there is one independent variable (x) and one dependent variable (y). The model estimates the slope and intercept of the line of best fit, which represents the relationship between the variables. The slope represents the change in the dependent variable for each unit change in the independent variable, while the intercept represents the predicted value of the dependent variable when the independent variable is zero.

MODEL
To calculate the best-fit line, linear regression uses the traditional slope-intercept form of a line: Y = W*X + b
where Y = Dependent variable, X = Independent variable, W = Slope, b = constant/Intercept.

COST FUNCTION (J)
The goal of the linear regression algorithm is to find the best values for W and b that give the best-fit line. The best-fit line is the line with the least error, meaning the error between predicted values and actual values should be minimal.

Random Error :- In regression, the difference between the observed value of the dependent variable(Yi) and the predicted value (Y_predicted) is called the residuals or random error.

error = Y_predicted - Yi

where Y_predicted = W*X + b.
The best fit line is a line that fits the given scatter plot in the best way. Mathematically, the best fit line is obtained by minimizing the Residual Sum of Squares (RSS).

The Cost Function helps to work out the optimal values for W and b, which provides the best fit line for the data points.
NOTE → The cost function should be continuous, differentiable and convex in nature.
In linear regression, the Mean Squared Error (MSE) is used as the cost function; it is the average of the squared errors between Y_predicted and Yi.
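Concretely, with n training points, the MSE cost can be written as follows (a standard formulation; the from-scratch code later in this post uses the same 1/n convention):

$$ J(W, b) = \frac{1}{n} \sum_{i=1}^{n} \big( Y_{pred,i} - Y_i \big)^2, \qquad Y_{pred,i} = W X_i + b $$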

GRADIENT DESCENT
Gradient Descent is one of the optimization algorithms used to minimize the cost function and reach the optimal solution. To find that solution we need to reduce the cost function (MSE) over all data points. This is done by updating the values of W and b iteratively until we reach an optimal solution.
A regression model uses gradient descent to update the coefficients of the line: it starts from initial (often random) coefficient values and then iteratively updates them to reduce the cost function until it reaches a minimum.

As our cost function is convex, it has one global minimum, and to reach it we need to take steps toward it. If we take small steps it will take a long time to reach the minimum, and if we take larger steps we will reach the bottom sooner but risk overshooting it. In ML, the size of these steps is controlled by the Learning Rate, which decides how fast the algorithm converges to the minimum.

Mathematically, we can locate the minimum by differentiating the cost function and setting the derivative to zero (slope/gradient = 0). So we differentiate the cost function with respect to our parameters (W and b) in order to update the values of W and b.

Derivation of Gradient Descent
We take the gradient of the cost function (J) by computing its partial derivatives with respect to W and b.

Here the partial derivatives are the gradients, and they are used to update the values of W and b. Alpha (α) is the learning rate.
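As a sketch, the resulting gradients and update rules look like this (the constant factor of 2 from differentiating the square is dropped, as it is in the code below, since it can be absorbed into the learning rate α):

$$ \frac{\partial J}{\partial W} = \frac{1}{n} \sum_{i=1}^{n} \big( Y_{pred,i} - Y_i \big) X_i, \qquad \frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} \big( Y_{pred,i} - Y_i \big) $$

$$ W \leftarrow W - \alpha \frac{\partial J}{\partial W}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b} $$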

Evaluation Metrics for Linear Regression

The strength of any linear regression model can be assessed using various evaluation metrics. These evaluation metrics usually provide a measure of how well the observed outputs are being generated by the model.

1. Coefficient of Determination or R-squared (R2)
R-Squared is a number that explains the amount of variation that is explained/captured by the developed model. It always ranges between 0 and 1. Overall, the higher the value of R-squared, the better the model fits the data.

Residual sum of Squares (RSS) is defined as the sum of squares of the residual for each data point in the plot/data. It is the measure of the difference between the expected and the actual observed output.

Total Sum of Squares (TSS) is defined as the sum of squared deviations of the data points from the mean of the response variable.
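Putting these together, R-squared can be written as (consistent with the definitions above):

$$ RSS = \sum_{i=1}^{n} \big( Y_i - Y_{pred,i} \big)^2, \qquad TSS = \sum_{i=1}^{n} \big( Y_i - \bar{Y} \big)^2, \qquad R^2 = 1 - \frac{RSS}{TSS} $$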


2. Root Mean Squared Error (RMSE) and Residual Standard Error (RSE)
The Root Mean Squared Error is the square root of the mean of the squared residuals. It specifies the absolute fit of the model to the data, i.e. how close the observed data points are to the predicted values. Mathematically it can be represented as RMSE = sqrt( (1/n) * Σ (Yi - Y_predicted)^2 ).

To make this estimate unbiased, one has to divide the sum of the squared residuals by the degrees of freedom rather than by the total number of data points. This quantity is then called the Residual Standard Error (RSE). For simple linear regression it can be represented as RSE = sqrt( RSS / (n - 2) ), since two parameters (W and b) are estimated from n data points.

R-squared is often a better comparison measure than RMSE, because the value of the Root Mean Squared Error depends on the units of the variables (i.e. it is not a normalized measure) and can change when the units of the variables change.
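As a quick illustration (a minimal sketch, assuming a y_test array of actual values and a pred array of predictions such as those produced by the from-scratch model later in this post), these metrics can also be computed with scikit-learn:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Assumes y_test (actual values) and pred (model predictions),
# e.g. from the from-scratch model trained later in this post.
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)           # same units as the target variable
r2 = r2_score(y_test, pred)   # unit-free; 1.0 is a perfect fit

print("RMSE:", rmse)
print("R-squared:", r2)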

Assumptions in Linear Regression

1. Linearity: There needs to be a linear relationship between the dependent variable and the independent variable(s).

2. Independence of residuals: The error terms should not depend on one another (as in time-series data, where the next value depends on the previous one); there should be no correlation between the residual terms. The presence of such correlation is known as autocorrelation.

3. Normal distribution of residuals: The residuals should follow a normal distribution with a mean equal to zero or close to zero. This is checked in order to verify whether the selected line is actually the line of best fit. If the error terms are non-normally distributed, it suggests that there are a few unusual data points that must be studied closely to build a better model.

4. The equal variance of residuals: The error terms must have constant variance. This phenomenon is known as Homoscedasticity. The presence of non-constant variance in the error terms is referred to as Heteroscedasticity. Generally, non-constant variance arises in the presence of outliers or extreme leverage values.
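As an aside, a simple way to eyeball several of these assumptions is to plot the residuals. The sketch below assumes a fitted model reg and held-out arrays X_test and y_test, like those created in the from-scratch example that follows:

import matplotlib.pyplot as plt

# Assumes a fitted model `reg` and held-out arrays X_test, y_test,
# like the ones created in the from-scratch example below.
fitted = reg.predict(X_test)
residuals = y_test - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs fitted values: look for no visible pattern (linearity)
# and a roughly constant spread (homoscedasticity).
ax1.scatter(fitted, residuals, s=15)
ax1.axhline(0, color="black", linewidth=1)
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")

# Histogram of residuals: should look roughly normal and centred at zero.
ax2.hist(residuals, bins=20)
ax2.set_xlabel("Residual")
ax2.set_ylabel("Frequency")

plt.tight_layout()
plt.show()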

Code of Simple Linear Regression from Scratch

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
X,y = datasets.make_regression(n_samples=100, n_features=1, noise=20, random_state=4)
X_train, X_test ,y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

fig = plt.figure(figsize=(8,6))
plt.scatter(X[:,0], y, color="b", marker="o", s=30)
plt.show()

The first line generates a synthetic dataset for a simple linear regression problem. It creates X as the input feature (a single feature in this case) and y as the target variable. n_samples specifies the number of data points, n_features is the number of features in the dataset (1 in this case), noise controls the amount of noise in the data, and random_state ensures reproducibility.
Then we split the dataset into training and test sets using the train_test_split function from scikit-learn. It takes the input features X and the target variable y, along with test_size as the proportion of the dataset to include in the test split (20%).
Then we create a scatter plot with a specified size of 8x6 inches using Matplotlib's figure function.
X[:,0] selects the first (and only) feature from the input features X, and y is the target variable. color="b" sets the color of the markers to blue, marker="o" specifies circular markers, and s=30 sets the marker size to 30 points.

class LinearRegression:
    def __init__(self, lr=0.002, n_iters=1000):
        self.lr = lr              # learning rate
        self.n_iters = n_iters    # number of gradient descent iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iters):
            # predictions with the current parameters
            y_pred = np.dot(X, self.weights) + self.bias

            # gradients of the MSE cost with respect to weights and bias
            dw = (1/n_samples) * np.dot(X.T, (y_pred - y))
            db = (1/n_samples) * np.sum(y_pred - y)

            # gradient descent update
            self.weights = self.weights - self.lr * dw
            self.bias = self.bias - self.lr * db

    def predict(self, X):
        y_pred = np.dot(X, self.weights) + self.bias
        return y_pred
reg = LinearRegression()
reg.fit(X_train, y_train)
pred = reg.predict(X_test)
print(pred)

This class implements a simple linear regression model.
The __init__() is the constructor method for the class. It initializes the linear regression model with two parameters:

  • lr: Learning rate, which controls the size of the step taken during gradient descent.
  • n_iters: Number of iterations for which the gradient descent algorithm will run.

The fit() function fits the model to the training data. It takes two arguments: the input features (X) and the target variable (y). It initializes the weights and bias to zero and then runs gradient descent for n_iters iterations, updating them with the gradients of the MSE cost.

The predict() function predicts the target variable for new input features ‘X’ using the learned weights and bias.

mse = np.mean((y_test - pred)**2)
print(mse) #--> 361.2812896055198
y_pred_line = reg.predict(X)
cmap = plt.get_cmap('viridis')
fig = plt.figure(figsize=(8,6))
m1 = plt.scatter(X_train, y_train, color=cmap(0.9), s=10)
m2 = plt.scatter(X_test, y_test, color=cmap(0.5), s=10)
plt.plot(X, y_pred_line, color='black', linewidth=2, label='Prediction')
plt.show()

Multiple Linear Regression

Multiple linear regression is a technique to understand the relationship between a single dependent variable and multiple independent variables. The formulation is similar to simple linear regression, with the small change that instead of a single weight, you now have a weight for every variable used. The formula is given as: Y = W1*X1 + W2*X2 + … + Wm*Xm + b
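The same model can be written compactly in vector form (just a restatement of the equation above, with w the vector of weights and x the vector of features):

$$ Y = \mathbf{w}^\top \mathbf{x} + b = \sum_{j=1}^{m} W_j X_j + b $$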

Considerations of Multiple Linear Regression
All the four assumptions made for Simple Linear Regression still hold true for Multiple Linear Regression along with a few new additional assumptions.
1. Overfitting: When more and more variables are added to a model, the model may become far too complex and usually ends up memorizing all the data points in the training set. This phenomenon is known as the overfitting of a model. This usually leads to high training accuracy and very low test accuracy.
2. Multicollinearity: It is the phenomenon where a model with several independent variables, may have some variables interrelated.
3. Feature Selection: With more variables present, selecting the optimal set of predictors from the pool of given features (many of which might be redundant) becomes an important task for building a relevant and better model.

Multicollinearity: Because multicollinearity makes it difficult to find out which variable is actually contributing towards the prediction of the response variable, it can lead one to draw incorrect conclusions about the effect of a variable on the target variable. Though it does not affect the precision of the predictions, it is essential to properly detect and deal with the multicollinearity present in the model, as the random removal of any of these correlated variables can cause the coefficient values to swing wildly and even change signs.

Multicollinearity can be detected using the following methods.
1. Pairwise Correlations: Checking the pairwise correlations between different pairs of independent variables can throw useful insights in detecting multicollinearity.
2. Variance Inflation Factor (VIF): Pairwise correlations may not always be useful, because it is possible that no single variable can completely explain another variable, yet several variables combined might be able to do so. To check for these sorts of relations between variables, one can use VIF. VIF explains the relationship of one independent variable with all the other independent variables and is given by

VIF = 1 / (1 - R^2)

where R^2 is obtained by regressing that independent variable on all the other independent variables.

The common heuristic for VIF values is: if VIF > 10, the value is definitely high and the variable should be dropped; if VIF is between 5 and 10, the variable may be valid but should be inspected first; and if VIF < 5, it is considered a good VIF value.
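As a sketch of how this could be computed in practice, statsmodels provides a variance_inflation_factor helper. The DataFrame below is purely hypothetical (made-up features x1, x2, x3), constructed so that two features are strongly correlated:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical feature matrix: x2 is deliberately close to 2*x1 to show a high VIF.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.1, size=100),  # strongly correlated with x1
    "x3": rng.normal(size=100),                      # independent feature
})

# VIF is computed on a design matrix that includes a constant term.
X_design = sm.add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X_design.values, i) for i in range(1, X_design.shape[1])],
    index=df.columns,
)
print(vif)  # x1 and x2 should show large VIF values (> 10); x3 should be close to 1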

Overfitting and Underfitting in Linear Regression

There have always been situations where a model performs well on training data but not on test data. While training models on a dataset, overfitting and underfitting are the most common problems people face.

Before understanding overfitting and underfitting one must know about bias and variance.

Bias :- Bias is a measure of how accurate the model is likely to be on future, unseen data. Complex models, given enough training data, can make accurate predictions, whereas models that are too naive are likely to perform badly. Simply put, bias is the error arising from the model's overly simple assumptions about the training data.
Generally, linear algorithms have a high bias, which makes them fast to learn and easier to understand, but in general less flexible, implying lower predictive performance on complex problems.

Variance :- Variance is the sensitivity of the model to the training data, i.e. it quantifies how much the model changes when the input data changes. Ideally, the model shouldn't change too much from one training dataset to the next, which would mean that the algorithm is good at picking out the hidden underlying patterns between the inputs and the output variable.
Ideally, a model should have low variance, meaning the model doesn't change drastically when the training data changes (it is generalizable). Higher variance makes a model change drastically even with a small change in the training dataset.

Bias Variance Tradeoff

In the pursuit of optimal performance, a supervised machine learning algorithm seeks to strike a balance between low bias and low variance for increased robustness. In the realm of machine learning, there exists an inherent relationship between bias and variance, characterized by an inverse correlation.

  • Increased bias leads to reduced variance.
  • Conversely, heightened variance results in diminished bias.

Finding an equilibrium between bias and variance is crucial, and algorithms must navigate this trade-off for optimal outcomes. In practice, calculating precise bias and variance error terms is challenging due to the unknown nature of the underlying target function.

Overfitting
When a model learns each and every pattern and noise in the data to such an extent that it affects the model's performance on unseen future data, it is referred to as overfitting. The model fits the data so well that it interprets noise as patterns in the data.

When a model has low bias and higher variance it ends up memorizing the data and causing overfitting. Overfitting causes the model to become specific rather than generic. This usually leads to high training accuracy and very low test accuracy.

Detecting overfitting is useful, but it doesn’t solve the actual problem. There are several ways to prevent overfitting, which are stated below:

  • Cross-validation
  • If the training data is too small, add more relevant and clean data.
  • If the training data has too many features, do some feature selection and remove the unnecessary ones.
  • Regularization
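As a minimal sketch of the last item, scikit-learn's Ridge estimator adds an L2 penalty on the coefficients; combined with cross-validation it can curb overfitting (this assumes X_train and y_train arrays such as those used elsewhere in this post, and alpha is a hyperparameter you would normally tune):

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Assumes X_train and y_train arrays such as those used elsewhere in this post.
# alpha controls the strength of the L2 penalty and would normally be tuned.
ridge = Ridge(alpha=1.0)
scores = cross_val_score(ridge, X_train, y_train, cv=5, scoring="r2")
print(scores.mean())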

Underfitting
Underfitting is not discussed as often as overfitting. When the model fails to learn from the training dataset and is also not able to generalize on the test dataset, it is referred to as underfitting. This type of problem can be easily detected by the performance metrics.

When a model has high bias and low variance it ends up not generalizing the data and causing underfitting. It is unable to find the hidden underlying patterns from the data. This usually leads to low training accuracy and very low test accuracy. The ways to prevent underfitting are stated below,

  • Increase the model complexity
  • Increase the number of features in the training data
  • Remove noise from the data

Code of Multiple Linear Regression

X, y = datasets.make_regression(n_samples=100, n_features=3, noise=20, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

fig = plt.figure(figsize=(8,6))
plt.scatter(X[:,0], y, color="b", marker="o", s=30)
plt.show()
reg = LinearRegression()
reg.fit(X_train, y_train)

pred = reg.predict(X_test)
print(pred)
mse = np.mean((y_test - pred)**2)
print(mse) #--> 698.925115269247
def plot_multiple_linear_regression(X, y, y_pred, title='Multiple Linear Regression'):
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection='3d')

    # actual vs predicted targets plotted against the first two features
    ax.scatter(X[:, 0], X[:, 1], y, color='blue', label='Actual')
    ax.scatter(X[:, 0], X[:, 1], y_pred, color='red', label='Predicted')

    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.set_zlabel('Target')

    ax.set_title(title)
    ax.legend()

    plt.show()
plot_multiple_linear_regression(X_test, y_test, pred)

Polynomial Linear Regression

A simple linear regression algorithm only works well when the relationship in the data is linear. But suppose we have non-linear data; then linear regression will not be able to draw a best-fit line, and simple regression analysis fails in such conditions. For data with a curved (non-linear) relationship, the fitted straight line stays far from many of the points and does not come close to reality, as the quadratic dataset below will show. Hence, we introduce polynomial regression to overcome this problem, which helps identify the curvilinear relationship between independent and dependent variables.

How Does Polynomial Regression Handle Non-Linear Data?
Polynomial regression is a form of linear regression in which, because of the non-linear relationship between the dependent and independent variables, we add polynomial terms to the linear model, converting it into a polynomial regression model.
In polynomial regression, the relationship between the dependent variable and the independent variable is modeled as an nth-degree polynomial function. When the polynomial is of degree 2, it is called a quadratic model; when the degree is 3, it is called a cubic model, and so on.
Suppose we have a dataset where variable X represents the independent data and Y the dependent data. Before feeding the data to the model, in the preprocessing stage we convert the input variables into polynomial terms of some chosen degree.
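For a single input variable X and a chosen degree n, the resulting model has the form below; note that the weights still enter linearly, which is why this remains a linear regression problem:

$$ Y = b + W_1 X + W_2 X^2 + \cdots + W_n X^n $$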

Polynomial regression models are usually fitted with the method of least squares; by the Gauss–Markov theorem, the least squares estimator has the smallest variance among all linear unbiased estimators of the coefficients.
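For reference, if X denotes the matrix of transformed polynomial features (including a column of ones for the intercept), the least squares solution can be written in closed form as:

$$ \hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y} $$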

Code of Polynomial Linear Regression

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
X = 6 * np.random.rand(200, 1) - 3
y = 0.8 * X**2 + 0.9*X + 2 + np.random.randn(200, 1)

plt.plot(X, y, 'b.')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

We create an array X of shape (200, 1) containing random values between -3 and 3. np.random.rand() generates uniform random numbers between 0 and 1, which are then scaled and shifted to lie between -3 and 3.
Then we generate the target variable y based on the equation y = 0.8x^2 + 0.9x + 2, where x is the input variable X. It first calculates 0.8 * X^2 for the quadratic term and 0.9 * X for the linear term, and adds 2 as the constant term. It then adds some random noise using np.random.randn(200, 1) to make the data more realistic; the noise is sampled from a normal (Gaussian) distribution with mean 0 and standard deviation 1.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
lr = LinearRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)
print(r2_score(y_test, y_pred)) #--> 0.2678387104012654
plt.plot(x_train, lr.predict(x_train), color="r")
plt.plot(X, y, "b.")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

Applying Polynomial Regression with One Variable

poly = PolynomialFeatures(degree=2, include_bias=True)
x_train_trans = poly.fit_transform(x_train)
x_test_trans = poly.transform(x_test)

lr = LinearRegression()
lr.fit(x_train_trans, y_train)
y_pred = lr.predict(x_test_trans)
print(r2_score(y_test, y_pred)) #--> 0.7692773830829487
print(lr.coef_)
print(lr.intercept_)
X_new = np.linspace(-3, 3, 200).reshape(200, 1)
X_new_poly = poly.transform(X_new)
y_new = lr.predict(X_new_poly)
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.plot(x_train, y_train, "b.",label='Training points')
plt.plot(x_test, y_test, "g.",label='Testing points')
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()

Changing the Degree of Polynomial

Now we will design a function that helps us find the right value for the degree. Here we apply all the preprocessing steps from above inside a function and plot the resulting prediction line on the data. All we need to do is pass the degree, and it will build a model and plot a graph for that particular degree. We will create a pipeline of preprocessing steps to streamline the process.

Data → Polynomial Transform → Linear Regression → Predictions

def polynomial_regression(degree):
    X_new = np.linspace(-3, 3, 100).reshape(100, 1)

    # pipeline: polynomial features -> standard scaling -> linear regression
    polybig_features = PolynomialFeatures(degree=degree, include_bias=False)
    std_scaler = StandardScaler()
    lin_reg = LinearRegression()
    polynomial_regression = Pipeline([
        ("poly_features", polybig_features),
        ("std_scaler", std_scaler),
        ("lin_reg", lin_reg),
    ])
    polynomial_regression.fit(X, y)
    y_newbig = polynomial_regression.predict(X_new)

    # plotting prediction line
    plt.plot(X_new, y_newbig, 'r', label="Degree " + str(degree), linewidth=2)
    plt.plot(x_train, y_train, "b.", linewidth=3)
    plt.plot(x_test, y_test, "g.", linewidth=3)
    plt.legend(loc="upper left")
    plt.xlabel("X")
    plt.ylabel("y")
    plt.axis([-3, 3, 0, 10])
    plt.show()

When we run the function with high degrees like 10, 15, and 25, the model tries to overfit the data: the prediction line slowly loses its overall shape and starts chasing the individual training points, bending to catch every small change in the training data.

polynomial_regression(10)
polynomial_regression(15)
polynomial_regression(25)

This is a practical illustration of the problem with a high-degree polynomial, so it is necessary to choose an optimum value for the degree.
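One reasonable (though not the only) way to choose that degree is to cross-validate a PolynomialFeatures + LinearRegression pipeline over a range of candidate degrees; the sketch below assumes the X and y arrays generated earlier in this section:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Assumes X and y are the quadratic dataset generated earlier in this section.
for degree in range(1, 11):
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("lin_reg", LinearRegression()),
    ])
    score = cross_val_score(model, X, y.ravel(), cv=5, scoring="r2").mean()
    print("degree =", degree, " mean CV R^2 =", round(score, 3))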

Polynomial Regression with Multiple Columns

We have seen polynomial regression with one variable. Most of the time, however, the input data has many columns, so how do we apply polynomial regression and visualize the result in 3-dimensional space? It can feel like a hectic task for beginners, so let's work through how to perform polynomial regression in 3D space.

# 3D polynomial regression
x = 7 * np.random.rand(100, 1) - 2.8
y = 7 * np.random.rand(100, 1) - 2.8
z = x**2 + y**2 + 0.2*x + 0.2*y + 0.1*x*y +2 + np.random.randn(100, 1)
import plotly.express as px
fig = px.scatter_3d(x=x.ravel(), y=y.ravel(), z=z.ravel())
fig.show()
X = np.hstack((x, y, x**2, y**2, x*y))
reg = LinearRegression().fit(X, z)

# Generate grid points for surface plot
x_input, y_input = np.meshgrid(np.linspace(-2.8, 4.2, 100), np.linspace(-2.8, 4.2, 100))
X_input = np.hstack((
    x_input.reshape(-1, 1),
    y_input.reshape(-1, 1),
    x_input.reshape(-1, 1)**2,
    y_input.reshape(-1, 1)**2,
    x_input.reshape(-1, 1) * y_input.reshape(-1, 1),
))
z_final = reg.predict(X_input)
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter3d(x=x.ravel(), y=y.ravel(), z=z.ravel(), mode='markers', marker=dict(size=3), name='Data'))
fig.add_trace(go.Surface(x=x_input, y=y_input, z=z_final.reshape(x_input.shape), name='Fitted Surface'))
fig.update_layout(scene=dict(xaxis_title='X', yaxis_title='Y', zaxis_title='Z'), title='3D Polynomial Regression')
fig.show()

You can view the code explained here on my GitHub.

Next :- Mastering Logistic Regression


Ebad Sayed

I am currently a final year undergraduate at IIT Dhanbad, looking to help out aspiring AI/ML enthusiasts with easy AI/ML guides.