Regression Revolution: Get Your Data Groove On with Linear Beats!

Krupa Dharamshi
7 min read · May 15, 2024

About Linear Regression

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a straight line to the observed data points. It helps analyze and quantify the association between variables, enabling predictions and understanding of trends in data.

Think of linear regression like drawing the best straight line through a bunch of dots on a graph. It helps us see the relationship between two things, like how test scores change as study hours increase. It’s a way to predict one thing based on another, using a simple straight line.

Figure: Simple Linear Regression

Three main reasons why we use linear regression:

  1. Simple Interpretation: It provides a straightforward interpretation of the relationship between variables. With a linear equation, it’s easy to understand how a change in one variable affects another.
  2. Prediction: Linear regression allows us to predict the value of the dependent variable based on the values of the independent variables. This predictive capability is valuable in various fields for making informed decisions.
  3. Understanding Relationships: It helps us understand the relationship between variables by quantifying how they are associated. This understanding is crucial for making hypotheses, testing theories, and drawing conclusions.

When should we use Linear Regression?

  1. Linear Relationships: When the relationship between the dependent variable and the independent variable(s) appears to be linear. If you can visually observe a linear trend in your data, linear regression may be appropriate.
  2. Interpretability: When you need a simple and interpretable model that provides insights into the relationship between variables. Linear regression coefficients represent the magnitude and direction of the relationship, making it easy to understand.
  3. Predictive Modeling: When you want to make predictions based on continuous variables. Linear regression provides a straightforward framework for predicting values within the range of the independent variables, making it useful in forecasting and regression analysis.

Statistical Analysis for Linear Regression

To calculate the best-fit line, linear regression uses the traditional slope-intercept form, given below:

Yi = β0 + β1Xi

where Yi = dependent variable, β0 = constant/intercept, β1 = slope/coefficient, and Xi = independent variable.

This algorithm models the linear relationship between the dependent (output) variable Y and the independent (predictor) variable X using the straight line Y = β0 + β1X.

But how does linear regression find the best-fit line?

The goal of the linear regression algorithm is to find the values of β0 and β1 that give the best-fit line. The best-fit line is the one with the least error, meaning the difference between the predicted values and the actual values is as small as possible.
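For simple linear regression, these best values can be written in closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. Here is a minimal NumPy sketch on made-up study-hours data:

```python
import numpy as np

# Made-up example: study hours (x) vs. test scores (y)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 58, 61, 68, 72], dtype=float)

# Closed-form least-squares estimates:
# slope b1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
# intercept b0 = mean_y - b1 * mean_x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"Best-fit line: Y = {b0:.1f} + {b1:.1f} * X")  # Y = 47.2 + 5.0 * X
print(f"Prediction for X = 6: {b0 + b1 * 6:.1f}")     # 77.2
```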

Types of Linear regression

  1. Simple Linear Regression
  2. Multiple Linear Regression
  3. Polynomial Regression
  4. Ridge Regression
  5. Lasso Regression
  6. ElasticNet Regression
  7. Quantile Regression
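Most of these variants share the same fit/predict interface in scikit-learn. A minimal sketch of how each is instantiated (the alpha values are arbitrary illustrations, not tuned recommendations):

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, QuantileRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

ols = LinearRegression()                      # plain least squares
ridge = Ridge(alpha=1.0)                      # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1)                      # L1 penalty can zero out coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)    # mix of L1 and L2 penalties
quantile = QuantileRegressor(quantile=0.5)    # median regression (scikit-learn >= 1.0)

# Polynomial regression: expand the features, then fit an ordinary linear model
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
```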

What is the best-fit line?

The best-fit line is the line that fits the given scatter plot in the best way. Mathematically, it is obtained by minimizing the Residual Sum of Squares (RSS), written out below.
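With predictions Ŷi = β0 + β1Xi, the quantity being minimized over all data points is:

RSS = Σ (Yi − Ŷi)² = Σ (Yi − β0 − β1Xi)²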

Steps to implement Linear Regression

  1. Import Libraries: Import necessary libraries such as NumPy, Pandas, and scikit-learn.
  2. Load Data: Load your dataset into a Pandas DataFrame.
  3. Split Data: Split the dataset into training and testing sets using train_test_split from scikit-learn.
  4. Instantiate Model: Create an instance of the linear regression model from scikit-learn.
  5. Fit Model: Fit the model to the training data using the fit method.
  6. Make Predictions: Use the trained model to make predictions on the test data using the predict method.
  7. Evaluate Model: Evaluate the model’s performance using evaluation metrics such as mean squared error or R-squared.
  8. Visualize Results: Optionally, visualize the model’s predictions compared to the actual values using plots or charts.
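Putting those eight steps together, here is a minimal end-to-end sketch on synthetic data (the dataset and variable names are invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Steps 1-2: libraries imported above; synthetic data stands in for a real CSV
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))                     # e.g., study hours
y = 47.2 + 5.0 * X.ravel() + rng.normal(0, 3, size=200)   # linear trend plus noise

# Step 3: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 4-5: instantiate and fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 6: predict on the test set
y_pred = model.predict(X_test)

# Step 7: evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse:.2f}, R^2: {r2_score(y_test, y_pred):.3f}")

# Step 8: visualize actual vs. predicted values
plt.scatter(X_test, y_test, label="actual")
plt.scatter(X_test, y_pred, color="red", marker="x", label="predicted")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
```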

Loss Function / Cost Function

A loss function, also known as a cost function or error function, is a mathematical function that measures the difference between the actual values of the dependent variable and the values predicted by a model. In the context of linear regression, the loss function quantifies how well the model’s predictions match the observed data. The goal is to minimize this difference, indicating a better fit of the model to the data.

In linear regression, the most commonly used loss function is the Mean Squared Error (MSE) or one of its variants. The MSE is calculated by taking the average of the squared differences between the actual and predicted values for all data points in the dataset, as shown below.
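In symbols, for n data points with actual values Yi and predicted values Ŷi:

MSE = (1/n) Σ (Yi − Ŷi)²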

Types of Cost Functions:

Mean Squared Error (MSE):

  • Calculates the average of the squared differences between the actual and predicted values.

Root Mean Squared Error (RMSE):

  • The square root of the MSE, providing an interpretable measure of the average deviation in the same units as the dependent variable.

Mean Absolute Error (MAE):

  • Calculates the average of the absolute differences between the actual and predicted values.
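In the same notation, RMSE = √MSE and MAE = (1/n) Σ |Yi − Ŷi|.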

Evaluation Metrics

Evaluation metrics for linear regression assess the performance of the model in predicting the dependent variable based on the independent variables.
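Most of these metrics come straight from scikit-learn's metrics module (RMSE can be derived from MSE). A small sketch, using placeholder arrays in place of a real model's output:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Placeholder values standing in for y_test (actual) and y_pred (predicted)
y_test = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.1, 9.3])

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is just the square root of MSE
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}, MAE: {mae:.3f}, R^2: {r2:.3f}")
```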

Difference between Evaluation Metrics and Cost Functions

Evaluation metrics and cost functions serve different purposes in the context of machine learning models, although they are related. Here’s the difference:

Evaluation Metrics:

  • Evaluation metrics are used to assess the performance of a machine learning model. They provide a quantitative measure of how well the model is performing on a particular task or dataset.
  • Evaluation metrics are typically chosen based on the specific objectives and requirements of the problem at hand. For example, in a regression problem, common evaluation metrics include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared.
  • Evaluation metrics help practitioners understand how well a model generalizes to new, unseen data and compare the performance of different models.

Cost Functions:

  • Cost functions, also known as loss functions or objective functions, are used during the training process of a machine learning model to quantify how well the model is performing on the training data.
  • Cost functions measure the difference between the predicted values of the model and the actual values in the training data. The goal during training is to minimize this difference, i.e., minimize the cost function.
  • The choice of cost function depends on the type of machine learning task (e.g., regression, classification) and the specific requirements of the problem. For example, in linear regression, the cost function is often the Mean Squared Error (MSE).

Why are they the same in the case of Simple Linear Regression?

In simple linear regression, the Mean Squared Error (MSE) serves as both the cost function during training and the evaluation metric. It measures the average of the squared differences between predicted and actual values. Minimizing MSE during training optimizes model parameters, while lower MSE values indicate better performance on new data. This convergence simplifies the training and evaluation process for simple linear regression models.

Application of Simple Linear Regression

  1. Sales Forecasting: Simple linear regression can be used to predict sales based on a single predictor variable, such as advertising spending. For example, a company may use historical data on advertising expenditures and corresponding sales figures to develop a linear regression model. This model can then be used to forecast future sales based on planned advertising budgets.
  2. GPA Prediction: In education, simple linear regression can be used to predict a student’s GPA based on a single predictor variable, such as the number of hours spent studying per week. By analyzing past student data, a university or educational institution can develop a linear regression model to understand the relationship between study hours and GPA. This model can then be used to predict the GPA of future students based on their study habits.

About Notebook and Dataset

The dataset Walmart Sales contains information about weekly sales in different stores along with various other features. Here’s a brief description of the dataset:

  • Store: The store number.
  • Date: The date of the sales data.
  • Weekly_Sales: The total sales for the week.
  • Holiday_Flag: A binary flag indicating whether the week includes a holiday (1) or not (0).
  • Temperature: The temperature on the date of sales.
  • Fuel_Price: The fuel price on the date of sales.
  • CPI: The Consumer Price Index on the date of sales.
  • Unemployment: The unemployment rate on the date of sales.

Dataset link — https://www.kaggle.com/datasets/mikhail1681/walmart-sales

Notebook link — https://www.kaggle.com/code/krupadharamshi/linear-regression-kdp

The code in the notebook demonstrates building a simple linear regression model on this dataset. Here's a summary of the steps:

  1. Load the Data: The dataset is loaded into a pandas DataFrame.
  2. Prepare the Data: Features (independent variables) and the target variable (Weekly_Sales) are selected from the dataset.
  3. Split the Data: The dataset is split into training and testing sets using train_test_split().
  4. Build the Model: A simple linear regression model is created using LinearRegression() from scikit-learn.
  5. Train the Model: The model is trained on the training data using the fit() method.
  6. Make Predictions: Predictions are made on the testing data using the predict() method.
  7. Evaluate the Model: Model performance is evaluated using mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and R-squared (R2) score.
  8. Visualize the Results: Actual vs predicted values are plotted to visualize the performance of the model.

Overall, the code demonstrates a basic workflow for building and evaluating a simple linear regression model using Python and scikit-learn.
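For reference, here is a condensed sketch of that workflow. The CSV file name and the exact feature selection are assumptions based on the column descriptions above, not a copy of the notebook; see the linked notebook for the actual code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Load the data (file name assumed; point this at your local copy of the Kaggle CSV)
df = pd.read_csv("Walmart.csv")

# Select features and target (Date is left out here for simplicity)
features = ["Store", "Holiday_Flag", "Temperature", "Fuel_Price", "CPI", "Unemployment"]
X = df[features]
y = df["Weekly_Sales"]

# Split, fit, and predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate with MSE, RMSE, MAE, and R-squared
mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse:.2f}, RMSE: {np.sqrt(mse):.2f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f}, R^2: {r2_score(y_test, y_pred):.3f}")
```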

Thank you for reading my blog on linear regression! For more insights, feel free to check out my GitHub and Kaggle profiles:

GitHub: https://github.com/krupa2002

Kaggle: https://www.kaggle.com/krupadharamshi

LinkedIn: https://www.linkedin.com/in/krupa-d-89233a226/

Keep learning and exploring data! ✨
