ML From Scratch Part 02 —
Linear & Polynomial Regression
In-Depth

Published in

𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨

10 min readFeb 28, 2024

Linear Regression is a simple ML Algorithm, but also it’s a stepping stone into the ocean of Machine Learning & Deep Learning. It’s because of the Gradient Descent Algorithm (which is part of linear regression) which is a crucial optimization algorithm in Deep Learning.

This is Part 2, of my ML From Scratch Series,

ML From Scratch Part 01 — K-Nearest Neighbor (KNN)

Implementing KNN using Python.

medium.com

What is a Model?

It is a Mathematical Equation meant to capture patterns in data and generalize those findings to make future predictions; programmatically it’s an input-output system.

Above is a LEGO ad, and I find it interesting, The models are also meant for this (capture general patterns in data), but nowadays the above Mona-Lisa has more resolution 😂 (gpt4 has ~1.7 trillion parameters).

Prerequisites

Before going any further you have to make sure you know the Slope-Intercept form for a line, this form is also called General Linear Model in the data science field, you will find the importance of this later in this article.

To fully grasp everything written in this article, you will have to have some knowledge of Linear Algebra & Differential Calculus, but I will try my best to make it very clear and understandable.

Linear Regression

N = 50
x = np.linspace(0, 1, N)
y = x + np.random.randn(N)/4

plt.scatter(x, y, alpha=.75, edgecolors='k')
plt.show()

Consider the above data’s x-values as features and y-values as some outputs of the given features, as you can see we have a clear trend in the data, now how do we capture it?

I will be explaining two methods, which are

Least Squares
Gradient Descent

The goal of both methods is to find a slope & intercept of a line that passes through the data which has the least possible distance with each of the data points (with minimum error) for an n dimensional data.

Ordinary Least Squares (OLS)

Below we are finding an equation to calculate the slopes and intercepts for a line that has the least error in capturing the trend in the data.

So now we know how to calculate the slope and intercepts,

# Note: @ in np.ndarray type means dot-product
X = np.vstack((np.ones(len(x)), x)).T
betas = np.linalg.inv(X.T@X) @ X.T @ y
yHat = X@betas

plt.plot(x, yHat, 'r', alpha=.3, linewidth=2)
plt.scatter(x, y, alpha=.75, edgecolors='k')
plt.show()

Note: This method still works even if we have more than 3-dimensional data.

So now we can predict values and stuff, but we got the simple line because the data was very basic, what if the data were like the following?

N = 100
x = np.linspace(-1, 1, N)
y = 2*x**4 + x**3 + 2*x**2 + np.random.randn(N)*.4
X = np.vstack((np.ones(len(x)), x)).T

betas = np.linalg.inv(X.T@X) @ X.T @ y
yHat = X@betas
plt.plot(x, yHat, 'r', alpha=.3, linewidth=2)
plt.scatter(x, y, alpha=.75, edgecolors='k')
plt.show()

Now we struggle to find the best-fit line, for this data, we have to make use of Polynomial Regression.

Polynomial Regression

If you understand the linear regression, then it is just a piece of cake.

Previously we couldn’t capture the trend in the dataset was that a line with a simple equation (y = mx + b) can only be a straight line, to form a curved line we have to have a polynomial equation.

Note: in the equation for a polynomial if you consider the input values as constants then it is a linear equation, let me show you;

So if we preprocess the X values(inputs) by powering with a given order/degree to corresponding values, we can achieve the Polynomial Regression.

# from sklearn.preprocessing import PolynomialFeatures
import numpy as np
import matplotlib.pyplot as plt

def polynomial_features(X, order=2):
    if order < 2:
        return X

    output = np.array(X**1)

    # here we starts with 2 is becuase the first column is already calculated above
    for d in range(2, order + 1):
        output = np.hstack((output, X**d))

    return output

N = 100
x = np.linspace(-1, 1, N)
y = 2*x**4 + x**3 + 2*x**2 + np.random.randn(N)*.4
x = x.reshape(-1, 1)

X_poly = polynomial_features(x, order=4)
'''or use sklearn as shown below'''
# poly_feat = PolynomialFeatures(degree=4, include_bias=False)
# X_poly = poly_feat.fit_transform(x)

# add the np.ones to incorporate the intercept term
X_poly = np.hstack((X_poly, np.ones((len(X_poly), 1))))

# calculate the beta values
betas = np.linalg.inv(X_poly.T@X_poly) @ X_poly.T @ y
yHat = X_poly@betas

plt.plot(x, yHat, 'r', alpha=.3, linewidth=2)
plt.scatter(x, y, alpha=.75, edgecolors='k')
plt.show()

But wait, in the above example we generated the data by ourself so we were able to decide that degree/order of 4 is the best fit, but when working with real data how do we decide which degree/order to choose?

Choosing the degree has huge affects in the calculation, see the blow plot to see the effect of choosing the degree/order;

Some degrees tend to underfit some tend to overfit, We can use Bayes Information Criteria (BIC) to solve this problem.

Bayes Information Criteria

The solution is to calculate BIC value for some given degrees and from that take the degree that corresponds to the least BIC which will be the perfect value for that specific dataset.

N = 100
x = np.linspace(-1, 1, N)
y = 2*x**4 + x**3 + 2*x**2 + np.random.randn(N)*.4
x = x.reshape(-1, 1)

def get_bic(x, y, order):
    poly_feat = PolynomialFeatures(degree=order, include_bias=True)
    X_poly = poly_feat.fit_transform(x)
    
    betas = np.linalg.inv(X_poly.T@X_poly) @ X_poly.T @ y
    yHat = X_poly@betas
    
    sse = sum((y-yHat) ** 2)
    n = len(y)
    bic = n*np.log(sse) + order*np.log(n)
    return bic

orders = np.arange(1, 21)
bic = [get_bic(x, y, o) for o in orders]
BEST_ORDER = orders[np.argmin(bic)]

plt.plot(orders, bic, 'ks-')
plt.xlabel('Orders')
plt.ylabel('BICs')
plt.title(f'{BEST_ORDER} is the best polynomial order for this data.')
plt.show()

See even though we generated the data using a 4th-order polynomial function because of the random values added to it, the data varied a bit that’s why we got 6 as the best polynomial order (which is correct for this specific data).

Problems of using the Least-Square Method

So, now you know all about using Least-Squares for linear/polynomial regression models, but it isn’t the industry standard because it has some problems; which are,

Scalability: For small datasets finding the coefficients for the model like this (using the whole dataset at a time) is not a problem, but for large datasets, it can be computationally expensive and memory-intensive.
Flexibility: In multiple linear regression (when you have multiple predictor variables), the closed-form solution becomes more complex, and in some cases, it may not be feasible to compute.
Regularization: Regularization helps prevent overfitting in more complex models, but it is challenging to incorporate in the closed-form solution.
Online Learning: In scenarios where data arrives in a streaming fashion (online learning), The closed-form solution would require recomputing the entire solution every time new data is added.
Non-linear Models: In non-linear models, there is no closed-form solution available such as in neural networks.
etc…

All of these problems can be solved by using Gradient Descent instead.

Gradient Descent

Gradient descent is an optimization algorithm used to find the optimal values of parameters in a model that minimizes a given cost function.

This is completely different from the previous method we taught, the steps are,

Predict some values first using the General Linear Model [y=mx+b](weights are randomly initialized).
Then using some loss function we calculate the difference between predicted values and desired values (loss).
After that using calculus we will find partial derivatives of the weights and biases that participate in the loss function.
Finally, we tune the weights and biases using the calculated partial derivatives.
Repeat the whole operation for n number of epochs.

Step 01

N = 100
X = (np.linspace(-1, 1, N) + np.random.randn(N)).reshape(-1, 1)
Y = X[:, 0] + np.random.randn(N)/2

n_samples, n_features = X.shape
weights = np.random.randn(n_features)
bias = 0

y_pred = X @ weights + bias # @ symbol in numpy.ndarray means 'dot product'

Step 02

For the loss function, here we are using Mean-Squared Error

We don’t need to calculate the MSE loss We just need the partial derivative of the weights and biases of the MSE loss, so we can skip to the next step without calculating MSE in Python.

Step 03

It is highly suggested that before going ahead one should know what derivative and partial derivative mean, let me make a short description of it,

The derivative of a function indicates how much the output of the function can change with an infinitesimal change in the input, but what if multiple inputs are going to the function then we gotta know how each of the inputs contributes to the output of the function, right? that’s why we calculate the partial derivative with respect to each of the inputs in the function.

Below is an elaborated calculation of partial derivative with respect to both m & b in the MSE loss.

Since 2/n and summation terms are independent of m & b, and do not affect the derivative calculation with respect to these variables; so we kept those terms outside of the partial derivative calculation and put them back at the end of it (so don’t get confused).

# calculate the partial derivatives
dw = (2/n_samples) * X.T @ (y_pred - y)
db = (2/n_samples) * -np.sum(y_pred - y)

Now we need to subtract each of the weights and biases with the corresponding partial derivative multiplied by a constant learning_rate term;

Whole Algorithm

class LinearRegression:
    def __init__(self, learning_rate=0.001, n_epoch=1000):
        self.lr = learning_rate
        self.n_epoch = n_epoch
        self.weights = None  # slope
        self.bias = None  # y_intercept

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.random.randn(n_features)
        self.bias = 0

        for _ in range(self.n_epoch):
            y_predicted = np.dot(X, self.weights) + self.bias

            dw = (2/n_samples) * X.T @ (y_predicted - y)
            db = (2/n_samples) * np.sum(y_predicted - y)

            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict(self, X):
        y_predicted = X @ self.weights + self.bias
        return y_predicted

N = 100
X = (np.linspace(-5, 5, N) + np.random.randn(N)*.6).reshape(-1, 1)
y = X[:, 0] + np.random.randn(N)

model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

plt.plot(X[:, 0], y_pred, label='predicted')
plt.plot(X[:, 0], y, 'o', label='data points')
plt.legend()
plt.show()

We can also do the polynomial regression as we explained before, if you perform the preprocessing of the input variables and then pass that into the Algorithm then it will work like a charm,💖 you can do that by yourself, I’m leaving that up to you (do it as an exercise, don’t hate me please 😅).

GitHub — rohit-krish/MLFromScratch-Medium-Article-Series

Contribute to rohit-krish/MLFromScratch-Medium-Article-Series development by creating an account on GitHub.

github.com

Conclusion

So we are finally done huh? it may seem like a lot to take at first but if you read it one more I’m sure you can understand all of this, I elaborately explained every tiny topic in it, that’s because understanding this algorithm is very crucial in your data-science carrier;

I coded the entire thing from scratch, that is because I found most people don’t know how exactly these work, they only know how to call some prewritten functions in scikit-learn or other libraries, but data science is not that, You have to get your hands dirty to create something marvelous;

Like I always say, when working with real-world projects you don’t wanna get worried about all the nerdy stuff, then you can use libraries such as scikit-learn for that;

Below is the code to work with scikit-learn Python API,

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import datasets

# create data
X, y = datasets.make_regression(
    n_samples=100, n_features=2, noise=20
)

# split for test & train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression() # init model
model.fit(X_train, y_train) # train model
y_pred = model.predict(X_test) # predict

If this article helps you in any way, i really appreciate if you could support me through my BuyMeACoffee page, it really motivates me to produce amazing articles for you guys 🥰.

Have a good day, 😍 👋

ML From Scratch Part 02 — Linear & Polynomial RegressionIn-Depth