ML From Scratch Part 02 โ
Linear & Polynomial Regression
In-Depth
Linear Regression is a simple ML Algorithm, but also itโs a stepping stone into the ocean of Machine Learning & Deep Learning. Itโs because of the Gradient Descent Algorithm (which is part of linear regression) which is a crucial optimization algorithm in Deep Learning.
This is Part 2, of my ML From Scratch Series,
What is a Model?
It is a Mathematical Equation meant to capture patterns in data and generalize those findings to make future predictions; programmatically itโs an input-output system.
Above is a LEGO ad, and I find it interesting, The models are also meant for this (capture general patterns in data), but nowadays the above Mona-Lisa has more resolution ๐ (gpt4 has ~1.7 trillion parameters).
Prerequisites
Before going any further you have to make sure you know the Slope-Intercept form for a line, this form is also called General Linear Model in the data science field, you will find the importance of this later in this article.
To fully grasp everything written in this article, you will have to have some knowledge of Linear Algebra & Differential Calculus, but I will try my best to make it very clear and understandable.
Linear Regression
N = 50
x = np.linspace(0, 1, N)
y = x + np.random.randn(N)/4
plt.scatter(x, y, alpha=.75, edgecolors='k')
plt.show()
Consider the above dataโs x-values as features and y-values as some outputs of the given features, as you can see we have a clear trend in the data, now how do we capture it?
I will be explaining two methods, which are
- Least Squares
- Gradient Descent
The goal of both methods is to find a slope & intercept of a line that passes through the data which has the least possible distance with each of the data points (with minimum error) for an
n
dimensional data.
Ordinary Least Squares (OLS)
Below we are finding an equation to calculate the slopes and intercepts for a line that has the least error in capturing the trend in the data.
So now we know how to calculate the slope and intercepts,
# Note: @ in np.ndarray type means dot-product
X = np.vstack((np.ones(len(x)), x)).T
betas = np.linalg.inv(X.T@X) @ X.T @ y
yHat = X@betas
plt.plot(x, yHat, 'r', alpha=.3, linewidth=2)
plt.scatter(x, y, alpha=.75, edgecolors='k')
plt.show()
Note: This method still works even if we have more than 3-dimensional data.
So now we can predict values and stuff, but we got the simple line because the data was very basic, what if the data were like the following?
N = 100
x = np.linspace(-1, 1, N)
y = 2*x**4 + x**3 + 2*x**2 + np.random.randn(N)*.4
X = np.vstack((np.ones(len(x)), x)).T
betas = np.linalg.inv(X.T@X) @ X.T @ y
yHat = X@betas
plt.plot(x, yHat, 'r', alpha=.3, linewidth=2)
plt.scatter(x, y, alpha=.75, edgecolors='k')
plt.show()
Now we struggle to find the best-fit line, for this data, we have to make use of Polynomial Regression.
Polynomial Regression
If you understand the linear regression, then it is just a piece of cake.
Previously we couldnโt capture the trend in the dataset was that a line with a simple equation (y = mx + b
) can only be a straight line, to form a curved line we have to have a polynomial equation.
Note: in the equation for a polynomial if you consider the input values as constants then it is a linear equation, let me show you;
So if we preprocess the X
values(inputs) by powering with a given order/degree to corresponding values, we can achieve the Polynomial Regression.
# from sklearn.preprocessing import PolynomialFeatures
import numpy as np
import matplotlib.pyplot as plt
def polynomial_features(X, order=2):
if order < 2:
return X
output = np.array(X**1)
# here we starts with 2 is becuase the first column is already calculated above
for d in range(2, order + 1):
output = np.hstack((output, X**d))
return output
N = 100
x = np.linspace(-1, 1, N)
y = 2*x**4 + x**3 + 2*x**2 + np.random.randn(N)*.4
x = x.reshape(-1, 1)
X_poly = polynomial_features(x, order=4)
'''or use sklearn as shown below'''
# poly_feat = PolynomialFeatures(degree=4, include_bias=False)
# X_poly = poly_feat.fit_transform(x)
# add the np.ones to incorporate the intercept term
X_poly = np.hstack((X_poly, np.ones((len(X_poly), 1))))
# calculate the beta values
betas = np.linalg.inv(X_poly.T@X_poly) @ X_poly.T @ y
yHat = X_poly@betas
plt.plot(x, yHat, 'r', alpha=.3, linewidth=2)
plt.scatter(x, y, alpha=.75, edgecolors='k')
plt.show()
But wait, in the above example we generated the data by ourself so we were able to decide that degree/order of 4 is the best fit, but when working with real data how do we decide which degree/order to choose?
Choosing the degree has huge affects in the calculation, see the blow plot to see the effect of choosing the degree/order;
Some degrees tend to underfit some tend to overfit, We can use Bayes Information Criteria (BIC) to solve this problem.
Bayes Information Criteria
The solution is to calculate BIC
value for some given degrees and from that take the degree that corresponds to the least BIC
which will be the perfect value for that specific dataset.
N = 100
x = np.linspace(-1, 1, N)
y = 2*x**4 + x**3 + 2*x**2 + np.random.randn(N)*.4
x = x.reshape(-1, 1)
def get_bic(x, y, order):
poly_feat = PolynomialFeatures(degree=order, include_bias=True)
X_poly = poly_feat.fit_transform(x)
betas = np.linalg.inv(X_poly.T@X_poly) @ X_poly.T @ y
yHat = X_poly@betas
sse = sum((y-yHat) ** 2)
n = len(y)
bic = n*np.log(sse) + order*np.log(n)
return bic
orders = np.arange(1, 21)
bic = [get_bic(x, y, o) for o in orders]
BEST_ORDER = orders[np.argmin(bic)]
plt.plot(orders, bic, 'ks-')
plt.xlabel('Orders')
plt.ylabel('BICs')
plt.title(f'{BEST_ORDER} is the best polynomial order for this data.')
plt.show()
See even though we generated the data using a 4th-order polynomial function because of the random values added to it, the data varied a bit thatโs why we got 6 as the best polynomial order (which is correct for this specific data).
Problems of using the Least-Square Method
So, now you know all about using Least-Squares for linear/polynomial regression models, but it isnโt the industry standard because it has some problems; which are,
- Scalability: For small datasets finding the coefficients for the model like this (using the whole dataset at a time) is not a problem, but for large datasets, it can be computationally expensive and memory-intensive.
- Flexibility: In multiple linear regression (when you have multiple predictor variables), the closed-form solution becomes more complex, and in some cases, it may not be feasible to compute.
- Regularization: Regularization helps prevent overfitting in more complex models, but it is challenging to incorporate in the closed-form solution.
- Online Learning: In scenarios where data arrives in a streaming fashion (online learning), The closed-form solution would require recomputing the entire solution every time new data is added.
- Non-linear Models: In non-linear models, there is no closed-form solution available such as in neural networks.
- etcโฆ
All of these problems can be solved by using Gradient Descent instead.
Gradient Descent
Gradient descent is an optimization algorithm used to find the optimal values of parameters in a model that minimizes a given cost function.
This is completely different from the previous method we taught, the steps are,
- Predict some values first using the General Linear Model [y=mx+b](weights are randomly initialized).
- Then using some loss function we calculate the difference between predicted values and desired values (loss).
- After that using calculus we will find partial derivatives of the weights and biases that participate in the loss function.
- Finally, we tune the weights and biases using the calculated partial derivatives.
- Repeat the whole operation for
n
number ofepochs
.
Step 01
N = 100
X = (np.linspace(-1, 1, N) + np.random.randn(N)).reshape(-1, 1)
Y = X[:, 0] + np.random.randn(N)/2
n_samples, n_features = X.shape
weights = np.random.randn(n_features)
bias = 0
y_pred = X @ weights + bias # @ symbol in numpy.ndarray means 'dot product'
Step 02
For the loss function, here we are using Mean-Squared Error
We donโt need to calculate the MSE loss We just need the partial derivative of the weights and biases of the MSE loss, so we can skip to the next step without calculating MSE in Python.
Step 03
It is highly suggested that before going ahead one should know what derivative and partial derivative mean, let me make a short description of it,
The derivative of a function indicates how much the output of the function can change with an infinitesimal change in the input, but what if multiple inputs are going to the function then we gotta know how each of the inputs contributes to the output of the function, right? thatโs why we calculate the partial derivative with respect to each of the inputs in the function.
Below is an elaborated calculation of partial derivative with respect to both m
& b
in the MSE loss.
Since 2/n and summation terms are independent of m
& b
, and do not affect the derivative calculation with respect to these variables; so we kept those terms outside of the partial derivative calculation and put them back at the end of it (so donโt get confused).
# calculate the partial derivatives
dw = (2/n_samples) * X.T @ (y_pred - y)
db = (2/n_samples) * -np.sum(y_pred - y)
Now we need to subtract each of the weights and biases with the corresponding partial derivative multiplied by a constant learning_rate
term;
Whole Algorithm
class LinearRegression:
def __init__(self, learning_rate=0.001, n_epoch=1000):
self.lr = learning_rate
self.n_epoch = n_epoch
self.weights = None # slope
self.bias = None # y_intercept
def fit(self, X, y):
n_samples, n_features = X.shape
self.weights = np.random.randn(n_features)
self.bias = 0
for _ in range(self.n_epoch):
y_predicted = np.dot(X, self.weights) + self.bias
dw = (2/n_samples) * X.T @ (y_predicted - y)
db = (2/n_samples) * np.sum(y_predicted - y)
self.weights -= self.lr * dw
self.bias -= self.lr * db
def predict(self, X):
y_predicted = X @ self.weights + self.bias
return y_predicted
N = 100
X = (np.linspace(-5, 5, N) + np.random.randn(N)*.6).reshape(-1, 1)
y = X[:, 0] + np.random.randn(N)
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
plt.plot(X[:, 0], y_pred, label='predicted')
plt.plot(X[:, 0], y, 'o', label='data points')
plt.legend()
plt.show()
We can also do the polynomial regression as we explained before, if you perform the preprocessing of the input variables and then pass that into the Algorithm then it will work like a charm,๐ you can do that by yourself, Iโm leaving that up to you (do it as an exercise, donโt hate me please ๐ ).
Conclusion
So we are finally done huh? it may seem like a lot to take at first but if you read it one more Iโm sure you can understand all of this, I elaborately explained every tiny topic in it, thatโs because understanding this algorithm is very crucial in your data-science carrier;
I coded the entire thing from scratch, that is because I found most people donโt know how exactly these work, they only know how to call some prewritten functions in scikit-learn
or other libraries, but data science is not that, You have to get your hands dirty to create something marvelous;
Like I always say, when working with real-world projects you donโt wanna get worried about all the nerdy stuff, then you can use libraries such as scikit-learn
for that;
Below is the code to work with scikit-learn Python
API,
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import datasets
# create data
X, y = datasets.make_regression(
n_samples=100, n_features=2, noise=20
)
# split for test & train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression() # init model
model.fit(X_train, y_train) # train model
y_pred = model.predict(X_test) # predict
If this article helps you in any way, i really appreciate if you could support me through my BuyMeACoffee page, it really motivates me to produce amazing articles for you guys ๐ฅฐ.
Have a good day, ๐ ๐