Machine Learning Deep Dive #2: Linear Regression

Buse Bilgin · Published in turkcell · Dec 4, 2022 · 6 min read

Welcome to the second article of the “Machine Learning Deep Dive” biweekly series! This week we will talk about linear regression and compare two different approaches: Ordinary Least Squares and Gradient Descent. As in the first article, there is a separate GitHub repository containing sample code and examples prepared from scratch and/or using open-source Python libraries. Wish you pleasant reading!

Regression 101: What is Regression?

Anyone interested in machine learning has heard of these two words: regression and classification. Since these two methods are used for different types of analyses, it is important to know their differences. Classification is used to assign data to specific groups, i.e. to determine which class a sample belongs to: for example, filtering e-mails as spam or not. Regression, on the other hand, is used to estimate continuous values: for example, a weather forecast. Regression models fall into two main categories, linear and non-linear, and regression is widely used in many scientific and engineering disciplines.

Linear Regression: Simple but Effective

Linear regression is a supervised learning technique that models the relationship between two variables, a dependent and an independent variable, by fitting a linear equation to the observed data. So what do we mean by that? Finding a linear relationship that describes the given data actually means finding the line that best represents the dataset. Considering the simplest scenario, we can construct such a line using the following equation:
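
y = b₀ + b₁x

where b₀ is the intercept and b₁ is the coefficient (the slope of the line).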

If we find the intercept and coefficient values using the dataset (X) and labels (y), we can determine the most appropriate line. After calculating these two unknowns, we have all the information necessary to make new predictions: we simply read off the value on the regression line that corresponds to the new data point. Sounds logical, doesn’t it? The most critical part is selecting the intercept and coefficient parameters that best represent the dataset. But how do we make this choice? There are two different ways: one uses the entire dataset in a single step to make this estimation, while the other completes the process iteratively. Let’s dive into these methods and learn more about them!

#1: Ordinary Least Squares: One-Step Solution

Let’s revisit our purpose: we want to create a line such that the difference between the obtained estimates and the actual values is minimal (or even zero if possible); in other words, we want to find the best b values. OLS finds the best b parameters by minimizing the sum of squared errors between each sample’s estimate and its true value (we use the squared error because negative and positive distances should not affect the optimization differently). If we refresh our dusty knowledge of linear algebra to calculate the b values, we get the following equation (if you want to follow the calculations step by step, check out Ethem Alpaydın’s book):
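
b = (XᵀX)⁻¹ Xᵀ y

Here X is the matrix of input samples (with a column of ones added for the intercept) and y is the vector of labels.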

And, that’s it! Did anything catch your attention? The OLS method allows us to calculate all the parameters we are looking for in a single step; we do not need any iterations. Additionally, because it solves a single equation, the whole process is quite straightforward and simple to understand. Besides these advantages, OLS also has some disadvantages. It assumes the independent variables are not strongly correlated with one another; multicollinearity violates this assumption, because a change in one variable affects the other variables in a way that also affects the dependent variable. As a result, the coefficient estimates become less precise and the model less reliable. In addition, multiplying (and inverting) the entire matrix at once can be a problem when the dataset is large.
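
To make the one-step nature of OLS concrete, here is a minimal from-scratch sketch of the normal equation using NumPy. The function names and array shapes are illustrative assumptions, not the code from the article’s repository:

```python
import numpy as np

def ols_fit(X, y):
    """X: (n_samples, n_features) array, y: (n_samples,) array."""
    # Prepend a column of ones so the intercept b0 is learned
    # together with the coefficients.
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    # Normal equation: b = (X^T X)^(-1) X^T y.
    # The pseudo-inverse is used instead of a plain inverse for numerical stability.
    return np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y

def ols_predict(X, b):
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    return X_b @ b
```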

OLS is a method frequently used in linear regression. In fact, the LinearRegression estimator of the Scikit-Learn library is OLS based.
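
For comparison, here is a small usage sketch of Scikit-Learn’s LinearRegression on toy data; the synthetic data and parameter values are assumptions made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y = 3x + 2 plus a little noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(scale=0.5, size=100)

model = LinearRegression()  # OLS-based solver under the hood
model.fit(X, y)
print(model.intercept_, model.coef_)  # expected to be close to 2 and [3]
```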

#2: Gradient Descent: The Star of Machine Learning

Courtesy: https://www.mltut.com/stochastic-gradient-descent-a-super-easy-complete-guide/

GD is an iterative optimization algorithm that is widely used in many different machine learning models. In contrast to OLS, gradient descent does not solve a closed-form equation. It starts with a random solution set and reaches the best parameters by changing them iteratively. At each step, the parameters are updated and the effect of this change on the error is evaluated. So how do we know how to update to get the best parameter set? For this, we need to define a loss function and an update function.
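
For linear regression, a common formulation (and the one assumed in the rest of this section) uses the mean squared error as the loss function J and a step against the gradient as the update:

J(b) = (1/2n) · Σᵢ (ŷᵢ − yᵢ)²
b := b − α · ∂J/∂b

Here n is the number of samples, ŷᵢ is the prediction for sample i, and α is the learning rate discussed below.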

The GD algorithm generates random b values in the first step and starts the optimization. It makes an estimation using these b values and compares the results with the real values using the loss function (J). We can better understand the working principle of GD using the figure above. In the figure, the x-axis shows a parameter (weight) of our model and the y-axis shows the cost value we obtain; this kind of graph is frequently used to represent the solution space. Our aim is to find the values that give the minimum cost.

For example, let’s assume that the random b values we created take us to the leftmost point on the graph. We calculate the derivative at that starting point and use the tangent line to determine how steep the slope is. The slope then guides the modifications to the parameters, i.e. the weights and bias. The slope is steep at the starting point, but as new parameters are generated the steepness should steadily diminish until we hit the point of convergence, the lowest point on the curve. The α in the update function, called the learning rate, controls how big the steps we take while updating are. If we select a small learning rate, it might take too much time to reach the convergence point, or we might get stuck in a local minimum. Conversely, if we go with a large learning rate, we might overshoot and never reach the minimum point. Therefore, careful selection of the learning rate significantly affects the performance of the algorithm.
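
Here is a minimal gradient descent sketch for linear regression that follows the steps above. The parameter names (learning_rate, n_iters) and the use of the mean squared error loss are assumptions for illustration, not the exact code in the article’s repository:

```python
import numpy as np

def gd_fit(X, y, learning_rate=0.01, n_iters=1000):
    """X: (n_samples, n_features) array, y: (n_samples,) array."""
    n_samples, n_features = X.shape
    X_b = np.c_[np.ones((n_samples, 1)), X]  # add a bias (intercept) column
    rng = np.random.default_rng(0)
    b = rng.normal(size=n_features + 1)       # start from random b values

    for _ in range(n_iters):
        y_pred = X_b @ b
        # Gradient of the MSE loss J with respect to b
        grad = (X_b.T @ (y_pred - y)) / n_samples
        # Update step: move against the gradient, scaled by the learning rate (alpha)
        b -= learning_rate * grad
    return b
```

In practice the features are usually scaled before running this loop, since the learning rate that works well depends heavily on the scale of X.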

Courtesy: https://www.ibm.com/cloud/learn/gradient-descent

If you want to develop a GD-based linear model using Scikit-Learn, you can use the SGDRegressor estimator.
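
A small usage sketch, assuming the same toy X and y from the earlier LinearRegression example; the hyperparameter values are illustrative defaults, not tuned choices:

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# SGDRegressor is sensitive to feature scaling, so it is usually preceded by a scaler.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(eta0=0.01, max_iter=1000),
)
model.fit(X, y)  # X, y as in the earlier LinearRegression example
print(model.named_steps["sgdregressor"].coef_)
```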

Python Implementation: Let’s Get Our Hands Dirty!

Let’s write some code now! I have prepared a GitHub repository with the code of the Linear Regression algorithm based on Ordinary Least Squares and Gradient Descent, written both from scratch and using the predefined functions of the Scikit-Learn library. You can analyze the performance by comparing the results of the two implementations with each other. You can find the repository here.

If you want to talk about machine learning or about my article, you can contact me via my LinkedIn account. Stay tuned for the third article of the series!


Buse Bilgin · turkcell

Electronics Engineer || ML Enthusiast || R&D Engineer || Technology Follower