Demystifying Linear Regression: A Beginner’s Guide

Rayyan Physicist
7 min read · Apr 8, 2024


Let’s dive into linear regression. Regression is a type of supervised learning task in which the algorithm’s goal is to predict a continuous numerical output, or target variable. The output is a real-valued number, and the algorithm’s objective is to learn a mapping from the input features to this continuous output.

Examples of regression tasks include home price prediction, black hole mass prediction, stock price prediction, age estimation, etc.

Linear Regression:

Linear regression is like a best-fit line on a graph. Imagine you have a bunch of points on a graph, and you want to draw a straight line that goes through them as closely as possible. That’s essentially what linear regression does, but it’s not just about drawing a line — it’s about using that line to make predictions.

Linear regression is a fundamental statistical and machine-learning technique for solving regression problems. It models the relationship between a dependent variable (or target) and one or more independent variables (or features).

Univariable Linear Regression:

Let’s start with univariable linear regression. “Uni” means one, so univariable means we’re working with just one variable. Picture this: you have a list of houses and their prices. You want to know if there’s a relationship between the size of a house (say, in square feet) and its price. Univariable linear regression helps us figure out whether there is a straight-line relationship between these two variables. In other words, in univariable linear regression there is only one input feature and one target variable.

The equation for univariable linear regression is:

y = wx + b

y = target variable

w = slope of the straight line (the weight)

x = input feature

b = y-intercept
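
To make this concrete, here is a tiny Python sketch. The values of w and b are made up purely for illustration; in practice they are learned from the data (we’ll see how below).

```python
# Univariable linear regression: y = w*x + b
# w and b below are made-up numbers purely for illustration;
# in practice they are learned from the data (see gradient descent below).
w = 150.0    # assumed price increase per extra square foot
b = 50_000   # assumed base price (the y-intercept)

def predict_price(size_sqft):
    """Predict a house price from its size using y = w*x + b."""
    return w * size_sqft + b

print(predict_price(1200))  # 150 * 1200 + 50000 = 230000.0
```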

Multivariable Linear Regression:

Now, let’s kick it up a notch with multivariable linear regression. “Multi” means more than one, so here we’re working with multiple variables. In addition to house size, let’s say we also want to consider the number of bedrooms and bathrooms in predicting the price of a house.

Instead of just one x-variable (like house size), we now have multiple x-variables (house size, number of bedrooms, number of bathrooms). We still want to find a best-fit line, but this time it’s a best-fit plane or hyperplane in higher dimensions. This plane represents the relationship between all the input variables (features) and the output variable (price). With this model, we can predict house prices based on multiple factors.

The equation for multivariable linear regression is:

y = w1x1 + w2x2 + w3x3 + … + wnxn + b

where x1, x2, x3, …, xn are the input features and w1, w2, …, wn are the learnable parameters (weights). We adjust these weights and the bias b so that the line (or plane) fits the training data well.
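
To illustrate, here is a minimal NumPy sketch of such a prediction; all of the weights and feature values are made-up numbers.

```python
import numpy as np

# Multivariable linear regression: y = w1*x1 + w2*x2 + ... + wn*xn + b
# All numbers below are made up purely for illustration.
w = np.array([120.0, 8_000.0, 5_000.0])  # weights for size, bedrooms, bathrooms
b = 40_000.0                             # bias (the intercept)

x = np.array([1500, 3, 2])               # one house: 1500 sqft, 3 beds, 2 baths
y_hat = np.dot(w, x) + b                 # weighted sum of features plus bias
print(y_hat)                             # 120*1500 + 8000*3 + 5000*2 + 40000 = 254000.0
```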

Now, how do we adjust the learnable parameters to find the best-fit line?

In linear regression, the goal is to find the best-fit line (or plane in multivariable regression) that minimizes the difference between the predicted values and the actual target values in the training data. This process involves adjusting the learnable parameters and biases in the model.

Loss function:

For that purpose, we calculate a cost (or loss) function, and our main goal is to reduce it. The loss function quantifies how well the model is performing. In linear regression, a common loss function is the Mean Squared Error (MSE), which is the average squared difference between the predicted values and the actual target values.

J(w, b) = (1/2m) ∑ i=1->m ( ŷ(i) − y(i) )²

(The extra factor of 1/2 is only there to make the gradient formulas below come out cleaner; minimizing this cost is equivalent to minimizing the MSE.)

As mentioned above, our main task is to reduce this loss (the MSE). We calculate it after each iteration of training and then adjust the parameters; for that we use gradient descent, a powerful optimization algorithm.
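
As a quick sketch, this cost can be computed with NumPy as follows; the prediction and target arrays are made up just to show the calculation.

```python
import numpy as np

def compute_cost(y_hat, y):
    """Cost J(w, b) = (1/2m) * sum((y_hat - y)^2), i.e. half the MSE."""
    m = len(y)
    return np.sum((y_hat - y) ** 2) / (2 * m)

# Made-up predictions and targets, just to show the calculation.
y_hat = np.array([230.0, 310.0, 180.0])  # model predictions
y = np.array([250.0, 300.0, 175.0])      # actual target values
print(compute_cost(y_hat, y))            # (400 + 100 + 25) / 6 = 87.5
```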

Gradient Descent:

One common method for adjusting the learnable parameters is gradient descent. Gradient descent is an optimization algorithm used to minimize the loss function, which measures the difference between the predicted values and the actual target values.

The formulas for adjusting the parameters are below; here J(w, b) is nothing but the loss function.

w = w − α * ∂/∂w J(w, b)

b = b − α * ∂/∂b J(w, b)

Here α is the learning rate, a hyperparameter that controls the size of the steps taken during gradient descent. It determines how quickly or slowly the model learns. A higher learning rate can lead to faster convergence but risks overshooting the optimal solution, while a lower learning rate takes longer to converge but is less likely to overshoot. A typical starting value is 0.001, though you can choose others.

where,

∂/∂w J(w, b) = (1/m) ∑ i=1->m ( ŷ(i) − y(i) ) x(i)

∂/∂b J(w, b) = (1/m) ∑ i=1->m ( ŷ(i) − y(i) )

Where:

  • ŷ(i) represents the predicted output for the ith example.
  • y(i) represents the actual output for the ith example.
  • x(i) represents the input features for the ith example.
  • m is the total number of examples in the dataset.
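
Putting the update rules together, here is a minimal from-scratch sketch of gradient descent for the one-variable case; the toy data and the learning rate are made up for illustration.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, num_iters=1000):
    """Fit y ≈ w*x + b using the update rules above (one-variable case)."""
    m = len(y)
    w, b = 0.0, 0.0                        # start from zero parameters
    for _ in range(num_iters):
        y_hat = w * x + b                  # predictions for all m examples
        dw = np.sum((y_hat - y) * x) / m   # ∂J/∂w = (1/m) Σ (ŷ(i) − y(i)) x(i)
        db = np.sum(y_hat - y) / m         # ∂J/∂b = (1/m) Σ (ŷ(i) − y(i))
        w -= alpha * dw                    # w = w − α ∂J/∂w
        b -= alpha * db                    # b = b − α ∂J/∂b
    return w, b

# Made-up toy data that roughly follows y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
w, b = gradient_descent(x, y)
print(w, b)  # should come out close to 2 and 1
```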

Say we train the model for 10 iterations on the training data set. In each iteration, predictions are made for all examples and the cost function is calculated; initially its value might be high, but via gradient descent we adjust the parameters, and as the number of iterations increases the cost (loss) gradually decreases.

So that’s how the linear regression model is trained.

Note: There are two common issues in machine learning and statistical modeling: underfitting and overfitting.

Underfitting:

Underfitting occurs when a model is too simple to capture the underlying patterns in the data, both on the training set and on unseen data. It essentially means the model is not complex enough to represent the relationships between the input features and the target variable, so it does not predict the output well and its accuracy is very low. Solutions to overcome this problem are:

1) Increase model complexity.

2) Add more features.

3) Feature engineering (creating new input features from existing ones; for example, for home price prediction, if you are given the width and length of a home you can create a new feature, the area of the home, by multiplying the width and length you already know; a quick example follows this list).

4) Remove noise (clean data to remove outliers).

5) Increase training data.

6) Reduce regularization.

7) Run the optimization algorithm (e.g., gradient descent) for more iterations.
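
As promised above, here is a small feature-engineering sketch with pandas; the data and column names are made up purely for illustration.

```python
import pandas as pd

# Made-up toy data; the column names are just for illustration.
homes = pd.DataFrame({
    "width_ft":  [30, 25, 40],
    "length_ft": [50, 40, 60],
    "price":     [300_000, 220_000, 450_000],
})

# Feature engineering: create a new feature from existing ones.
homes["area_sqft"] = homes["width_ft"] * homes["length_ft"]
print(homes)
```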

Overfitting:

Overfitting occurs when a model is excessively complex and fits the noise in the training data rather than the underlying patterns. The model is too flexible and essentially memorizes the training data rather than generalizing from it. In this case the cost function (which quantifies the error between predicted and expected values as a single real number) is very low, close to 0, and the accuracy on the training data is around 100%, yet performance on new, unseen data is poor. Solutions to overcome this problem are:

1) Collect more training data

2) Feature scaling ( Feature scaling is a preprocessing technique in machine learning that helps bring all the features (variables) of your dataset onto a similar scale.)

3) Feature engineering

4) Apply regularization.

Regularization is a powerful technique to deal with the overfitting problem.

Regularization:

Regularization is a technique used in machine learning and statistics to prevent overfitting, which occurs when a model fits the training data too closely, capturing noise and making it less effective at making predictions on new, unseen data. Regularization adds a penalty term to the model’s loss function, discouraging it from fitting the training data too precisely and encouraging it to find a simpler, more generalizable solution. There are 2 of its common techniques.

A. L1 regularization (lasso):

L1 regularization adds the absolute values of the model’s parameters to the loss function. Lasso is useful for feature selection and simplifying models.

B. L2 regularization (ridge):

L2 regularization adds the squared values of the model’s parameters to the loss function. Ridge helps reduce model complexity and is effective when all features are potentially relevant.
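
Here is a minimal scikit-learn sketch of both techniques; the toy data is randomly generated, and alpha here is the regularization strength (the penalty weight), not the learning rate.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up toy data: 50 examples, 3 features, with the 3rd feature irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=50)

# Scaling the features first is standard practice before regularized regression.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)

print(ridge.named_steps["ridge"].coef_)  # all coefficients shrunk, none exactly zero
print(lasso.named_steps["lasso"].coef_)  # the irrelevant feature's weight driven to ~0
```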

Implementation of Linear Regression in Python:

Let’s build a profit-prediction model for a restaurant franchise using univariable linear regression.

The code is available at https://github.com/mrnust/Profit_Prediction_for_Restaurant_Franchise-using-univariable-Linear-Regression/blob/main/Profit_Prediction_for_Restaurant_Franchise.ipynb
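
For a quick feel of the idea without opening the notebook, here is a minimal scikit-learn sketch. The numbers are made up, and the choice of city population as the feature and restaurant profit as the target is my assumption based on the repository name, not taken from the notebook itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up numbers; the real dataset lives in the notebook linked above.
population = np.array([[6.1], [5.5], [8.5], [7.0], [5.8]])  # feature (one column)
profit = np.array([17.5, 9.1, 22.2, 17.6, 12.0])            # target

model = LinearRegression().fit(population, profit)
print(model.coef_[0], model.intercept_)  # the learned w and b
print(model.predict([[7.5]]))            # prediction for a new input
```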

Challenges and limitations:

Limited to Linear Relationships:

Linear regression assumes that the relationship between the independent and dependent variables is linear. If the true relationship is non-linear, linear regression may not accurately capture it.

Sensitive to Outliers:

Linear regression can be sensitive to outliers, which are data points that deviate significantly from the rest of the data. Outliers can heavily influence the estimated coefficients and decrease the model’s predictive accuracy.

Assumption of Independence:

Linear regression assumes that the observations are independent of each other. If there is autocorrelation or dependence among the observations, the model’s coefficients may be biased and its standard errors may be underestimated.

Hope it helps!
Feel free to reach out with any suggestions or queries: https://www.linkedin.com/in/md-rayyan/
