Linear Regression Model

Nermeen Abd El-Hafeez
8 min read · Aug 28, 2023

What is linear regression?

Linear regression is a supervised machine learning algorithm used for modeling the relationship between a dependent variable (also called the target or outcome) and one or more independent variables (features). The algorithm assumes a linear relationship between the independent variables and the dependent variable, which means that the relationship can be represented by a linear equation (a straight line when there is a single independent variable).

Independent and Dependent variable

Dependent Variable: The dependent variable, also referred to as the “target variable,” is the key factor you aim to predict based on the values of other variables. It is represented as “Y” in the linear regression equation. This variable’s value depends on the variations of the independent variables.

Independent Variables: Independent variables, often termed “features,” are the factors that contribute to the behavior of the dependent variable. They are represented as “X” in the linear regression equation. These variables are manipulated or observed to understand their impact on the dependent variable’s outcome.

Dependent vs. Independent Variables

In this dataset, several columns are identified as independent variables, namely Gender, Age, Occupation, Sleep Duration, Physical Activity Level, BMI Category, Blood Pressure, Heart Rate, and Daily Steps. These independent variables collectively influence or impact the dependent variable, which is the occurrence of sleep disorders. The relationship between these independent variables and the likelihood of sleep disorders varies based on the values within each of these columns. The objective is to analyze how changes in the independent variables correlate with changes in the sleep disorder occurrence.

What is Simple linear regression?

Simple Linear Regression is a statistical method used to model the relationship between a single independent variable and a dependent variable. It seeks to find the best-fitting linear equation that describes how changes in the independent variable impact changes in the dependent variable.

The equation for simple linear regression can be expressed as:

y = b0 + b1⋅x

Where:

  • y is the dependent variable (the variable you’re trying to predict).
  • x is the independent variable (the variable that you’re using to make predictions).
  • b0 is the intercept or constant term, representing the value of y when x is 0.
  • b1 is the coefficient of the independent variable, representing the change in y for a unit change in x.

Example:

In a simple linear regression scenario, you have one independent variable and one dependent variable. For instance, let’s consider predicting students’ exam scores based on the number of hours they studied.

  • Dependent Variable: Exam Score
  • Independent Variable: Study Hours

The linear regression equation is: Exam Score = b0 + b1 * Study Hours
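To make the exam-score example concrete, here is a minimal sketch that estimates b0 and b1 with the standard least-squares formulas in plain Python. The study hours and scores are made-up numbers for illustration, and the helper name fit_simple_linear_regression is hypothetical.

```python
# Hypothetical data: hours studied and the exam score each student obtained.
study_hours = [1.0, 2.0, 3.0, 4.0, 5.0]
exam_scores = [52.0, 55.0, 61.0, 64.0, 68.0]


def fit_simple_linear_regression(x, y):
    """Return (b0, b1) that minimize the sum of squared errors for y = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: makes the line pass through the point (x_mean, y_mean).
    b0 = y_mean - b1 * x_mean
    return b0, b1


b0, b1 = fit_simple_linear_regression(study_hours, exam_scores)
predicted_score = b0 + b1 * 6.0  # prediction for a student who studies 6 hours
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, predicted score for 6 hours = {predicted_score:.2f}")
```

With these numbers, b1 is about 4.1, meaning each extra hour of study is associated with roughly 4.1 more points, and b0 is about 47.7, the score the line predicts for zero hours.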

What is Multiple linear regression?

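Multiple Linear Regression is an extension of simple linear regression that involves modeling the relationship between a dependent variable and two or more independent variables. It allows you to understand how multiple independent variables jointly impact the dependent variable, with each coefficient measuring the effect of one variable while the others are held constant. Interactions between variables are only captured if you add explicit interaction terms as extra features.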

The equation for multiple linear regression can be expressed as:

y = b0 + b1⋅x1 + b2⋅x2 + … + bn⋅xn

Where:

  • y is the dependent variable (the variable you’re trying to predict).
  • x1,x2,…,xn are the independent variables (features or factors that influence the dependent variable).
  • b0 is the intercept term, which represents the value of y when all x variables are zero.
  • b1,b2,…,bn are the coefficients associated with the respective independent variables. They represent the change in y for a one-unit change in each x variable, while holding the other variables constant.

Example:

In multiple linear regression, you have more than one independent variable influencing the dependent variable. Let’s consider predicting house prices based on various features: square footage, number of bedrooms, and neighborhood.

  • Dependent Variable: House Price
  • Independent Variables: Square Footage, Number of Bedrooms, Neighborhood

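The linear regression equation is: House Price = b0 + b1 * Square Footage + b2 * Number of Bedrooms + b3 * Neighborhood

Because Neighborhood is a categorical feature, it has to be encoded numerically (for example, as 0/1 indicator columns) before it can enter the equation.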
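The house-price example can be sketched with NumPy's least-squares solver. Everything below is an assumption made for illustration: the six houses, the two neighborhood labels, and the "true" coefficients used to generate the prices. The neighborhood is turned into a 0/1 indicator so that it can be multiplied by a coefficient.

```python
import numpy as np

# Hypothetical houses: square footage, bedrooms, and a neighborhood label.
square_footage = np.array([1000.0, 1500.0, 1200.0, 2000.0, 1800.0, 900.0])
bedrooms = np.array([2.0, 3.0, 3.0, 4.0, 3.0, 1.0])
neighborhood = np.array(["suburb", "downtown", "suburb", "downtown", "suburb", "downtown"])

# Encode the categorical feature as a 0/1 indicator ("suburb" is the baseline).
is_downtown = (neighborhood == "downtown").astype(float)

# Prices generated from known coefficients so the fit is easy to check.
price = 50_000 + 100 * square_footage + 10_000 * bedrooms + 20_000 * is_downtown

# Design matrix: a column of ones for the intercept b0, then one column per feature.
X = np.column_stack([np.ones_like(square_footage), square_footage, bedrooms, is_downtown])

# Least-squares solution for [b0, b1, b2, b3].
coefficients, *_ = np.linalg.lstsq(X, price, rcond=None)
b0, b1, b2, b3 = coefficients
print(f"b0={b0:.1f}, b1={b1:.1f}, b2={b2:.1f}, b3={b3:.1f}")
```

Since the prices contain no noise, the solver recovers the generating coefficients up to floating-point error. Here b3 reads as the price difference between a downtown house and a suburb house of the same size and bedroom count.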

Some methods for evaluating a linear regression model

Mean Squared Error (MSE):

Mean Squared Error (MSE) serves as a fundamental metric to gauge the accuracy of statistical models. It quantifies the average squared difference between the predicted values and the actual observations. For a model with no error at all, the MSE is exactly zero. As the model’s errors grow, the MSE grows with the square of those errors, so a few large errors raise it much more than many small ones. This metric is also referred to as Mean Squared Deviation (MSD).

The Mean Squared Error (MSE) is calculated using the following formula:

MSE = (1/n) ⋅ Σ (yᵢ − ŷᵢ)², summed over i = 1, …, n

Where:

  • n is the number of data points in the dataset.
  • yᵢ is the actual target value of the i-th data point.
  • ŷᵢ is the predicted value of the i-th data point.
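As a quick check of the MSE definition, here is a small sketch in plain Python. The actual and predicted values are invented for illustration, and mse_from_scratch is a hypothetical helper name.

```python
def mse_from_scratch(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((actual - predicted) ** 2 for actual, predicted in zip(y_true, y_pred)) / n


# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Errors are 0.5, 0.0, -1.5, -1.0; their squares are 0.25, 0.0, 2.25, 1.0.
mse = mse_from_scratch(y_true, y_pred)
print("MSE:", mse)  # 0.875
```

The largest error (1.5) contributes 2.25 of the 3.5 total, which shows how squaring emphasizes big misses.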

Root Mean Squared Error (RMSE):

The Root Mean Squared Error (RMSE) is the square root of the MSE. It measures the typical size of the difference between the predicted values of a model and the actual observed values, expressed in the same units as the target variable. When the residuals average to zero, RMSE is approximately their standard deviation. A residual is the vertical distance between an individual data point and the regression line.

The Root Mean Squared Error (RMSE) is calculated using the following formula:

RMSE = √( (1/n) ⋅ Σ (yᵢ − ŷᵢ)² ), summed over i = 1, …, n

Where:

  • n is the total number of data points.
  • yᵢ represents the actual observed value of the target variable for the i-th data point.
  • ŷᵢ represents the predicted value of the target variable for the i-th data point.
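A minimal sketch of RMSE in plain Python, using invented values and a hypothetical helper name (rmse_from_scratch). It takes the square root of the mean squared error, which brings the result back to the units of the target.

```python
import math


def rmse_from_scratch(y_true, y_pred):
    """Square root of the average squared difference between actual and predicted values."""
    n = len(y_true)
    mse = sum((actual - predicted) ** 2 for actual, predicted in zip(y_true, y_pred)) / n
    return math.sqrt(mse)


# Hypothetical actual and predicted values (mean squared error is 0.875).
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

rmse = rmse_from_scratch(y_true, y_pred)
print(f"RMSE: {rmse:.4f}")  # 0.9354
```

If the target were measured in dollars, an RMSE of about 0.94 would mean predictions are typically off by a bit less than one dollar.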

Mean Absolute Error (MAE):

Mean Absolute Error (MAE) is a metric used to measure the average absolute difference between the predicted values and the actual values in a regression model. It provides an indication of how close the predictions are to the true values on average.

The Mean Absolute Error (MAE) is calculated using the following formula:

MAE = (1/n) ⋅ Σ |yᵢ − ŷᵢ|, summed over i = 1, …, n

Where:

  • n is the total number of data points.
  • yᵢ represents the actual observed value of the target variable for the i-th data point.
  • ŷᵢ represents the predicted value of the target variable for the i-th data point.
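The sketch below computes MAE in plain Python and compares it with RMSE on two invented sets of predictions. The helper names and the numbers are assumptions for illustration. Both prediction sets have the same MAE, but the one with a single large miss has a higher RMSE, because RMSE squares the errors before averaging.

```python
def mae_from_scratch(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def rmse_from_scratch(y_true, y_pred):
    """Square root of the average squared difference."""
    return (sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


actual = [10.0, 10.0, 10.0, 10.0]
evenly_off = [11.0, 11.0, 11.0, 11.0]    # every prediction is off by 1
one_big_miss = [10.0, 10.0, 10.0, 14.0]  # three perfect predictions, one off by 4

print("MAE :", mae_from_scratch(actual, evenly_off), mae_from_scratch(actual, one_big_miss))
print("RMSE:", rmse_from_scratch(actual, evenly_off), rmse_from_scratch(actual, one_big_miss))
# MAE : 1.0 1.0
# RMSE: 1.0 2.0
```

This is one reason to report both metrics: a gap between RMSE and MAE hints that a few predictions are far off.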

R-squared:

R-squared, also known as the Coefficient of Determination, is a statistical measure used to assess the goodness of fit of a regression model. It indicates the proportion of the variance in the dependent variable that is explained by the independent variables in the model. In other words, R-squared quantifies the proportion of the variability in the response variable that the model captures.

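For a least-squares model evaluated on its own training data, R-squared values range from 0 to 1, where:

  • An R-squared value of 0 signifies that the model explains none of the variation in the dependent variable.
  • An R-squared value of 1 signifies that the model explains all of the variation in the dependent variable.

On data the model was not trained on, R-squared can also be negative, which means the model predicts worse than always predicting the mean.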

The R-squared is calculated using the following formula:

R² = SSR / SST = 1 − SSE / SST

Where:

  • SSR is the Sum of Squares due to Regression. It represents the sum of the squared differences between the predicted values (fitted values) and the mean of the dependent variable: SSR = Σ (ŷᵢ − ȳ)²
  • SSE is the Sum of Squared Errors. It represents the sum of the squared differences between the actual observed values of the dependent variable and the predicted values (fitted values) from the regression model: SSE = Σ (yᵢ − ŷᵢ)²
  • SST is the Total Sum of Squares. It quantifies the total variability of the dependent variable as the sum of squared differences between the actual observed values and their mean: SST = Σ (yᵢ − ȳ)²

For a least-squares fit that includes an intercept, evaluated on the training data, these quantities satisfy:

SST = SSR + SSE

In these sums, i runs from 1 to n, and:

  • n is the total number of data points.
  • yᵢ represents the actual observed value of the target variable for the i-th data point.
  • ŷᵢ represents the predicted value of the target variable for the i-th data point.
  • ȳ represents the mean of the observed values of the target variable.
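The quantities SSR, SSE and SST described above can be computed by hand. The sketch below uses invented study-hours data and the least-squares line for that data (intercept 47.7, slope 4.1), both of which are assumptions made for illustration. It checks that SST equals SSR plus SSE for this fit and derives R-squared from them.

```python
# Hypothetical data and the least-squares line fitted to it (y = 47.7 + 4.1 * x).
study_hours = [1.0, 2.0, 3.0, 4.0, 5.0]
exam_scores = [52.0, 55.0, 61.0, 64.0, 68.0]
predictions = [47.7 + 4.1 * x for x in study_hours]

y_mean = sum(exam_scores) / len(exam_scores)

# Variation explained by the model: fitted values versus the mean.
ssr = sum((y_hat - y_mean) ** 2 for y_hat in predictions)
# Variation left unexplained: actual values versus fitted values.
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(exam_scores, predictions))
# Total variation: actual values versus the mean.
sst = sum((y - y_mean) ** 2 for y in exam_scores)

r_squared = 1 - sse / sst

print(f"SSR = {ssr:.1f}, SSE = {sse:.1f}, SST = {sst:.1f}")
print(f"R-squared = {r_squared:.4f}")
```

Here SSR is about 168.1, SSE about 1.9 and SST is 170, so roughly 98.9% of the variation in the scores is explained by study hours in this made-up example.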

What is the sklearn library?

scikit-learn (also known as sklearn) is an open-source machine learning library for Python. It provides a wide range of tools and functionalities for various machine learning tasks such as classification, regression, clustering, dimensionality reduction, model selection, and more. scikit-learn is built on top of other popular libraries such as NumPy, SciPy, and Matplotlib.

Linear Regression Using sklearn

1 - Import the necessary libraries:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

2 - Prepare your data:

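Assuming you have your data stored in X (independent variables, a 2-D array with one row per sample and one column per feature) and y (dependent variable, a 1-D array with one value per sample). The x1, x2, ... and y1, y2, ... entries below are placeholders for your own numeric values: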

X = np.array([[x1, x2, ...], [x1, x2, ...], ...])  # Independent variables
y = np.array([y1, y2, ...]) # Dependent variable
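If you do not have a dataset at hand, one option is to generate synthetic data so the remaining steps can be run end to end. The sketch below is only an assumption for illustration: the number of samples, the value ranges, and the "true" relationship y = 4 + 3*x1 - 2*x2 are all made up.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the example is reproducible

n_samples, n_features = 100, 2
# Independent variables: one row per sample, one column per feature.
X = rng.uniform(low=0.0, high=10.0, size=(n_samples, n_features))

# Hypothetical relationship with random noise added to mimic real measurements.
noise = rng.normal(loc=0.0, scale=1.0, size=n_samples)
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + noise  # dependent variable

print(X.shape, y.shape)  # (100, 2) (100,)
```

Knowing the coefficients used to generate the data makes it easy to judge later whether the fitted model recovered something close to them.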

3 - Split the data into training and testing sets:

Splitting the data into training and testing sets is a crucial step in the process of training and evaluating machine learning models. It involves dividing the available dataset into two distinct subsets: the training set and the testing set.

  • Training Set:

The training set is the portion of the dataset that is used to train or “teach” the machine learning model. It contains input data (independent variables) along with their corresponding target values (dependent variables). The model learns patterns, relationships, and features from this data to make accurate predictions or classifications.

  • Testing Set:

The testing set is a separate portion of the dataset that the model has never seen during training. It is used to evaluate the model’s performance and assess its ability to generalize to new, unseen data. The testing set also has target values, but they are kept aside: the model receives only the input data when making predictions.

The model’s predictions on the testing set are compared to the actual target values (which were withheld during training) to measure its accuracy.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
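To see what the split produces, here is a small sketch on a toy array of 10 samples (the data is made up so that each target equals half of its row's first feature). With test_size=0.2, two samples go to the testing set and eight to the training set, and each row of X stays paired with its own target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20, dtype=float).reshape(10, 2)  # 10 samples, 2 features: [0, 1], [2, 3], ...
y = np.arange(10, dtype=float)                 # one target per sample: 0, 1, ..., 9

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)

# Rows are shuffled, but features and targets remain aligned.
print(np.array_equal(X_train[:, 0] / 2, y_train))  # True
```

The random_state argument fixes the shuffle, so running the code again gives the same split.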

4 - Create a linear regression model:

model = LinearRegression()

5 - Train the model:

Training a machine learning model is the process of “learning” from the data and adjusting the model’s parameters to minimize the difference between its predictions and the actual target values.

model.fit(X_train, y_train)
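After fitting, the learned parameters are available as model.intercept_ (b0) and model.coef_ (b1, b2, ...). The sketch below uses a tiny noise-free dataset generated from y = 1 + 2*x1 + 3*x2, which is an assumption made so the expected result is known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data generated exactly from y = 1 + 2*x1 + 3*x2.
X_train = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 2.0], [2.0, 3.0], [3.0, 1.0]])
y_train = 1.0 + 2.0 * X_train[:, 0] + 3.0 * X_train[:, 1]

model = LinearRegression()
model.fit(X_train, y_train)

print("Intercept (b0):", model.intercept_)
print("Coefficients (b1, b2):", model.coef_)
```

Because the data has no noise, the fitted intercept and coefficients match the generating values 1, 2 and 3 up to floating-point error. On real data they will only be estimates.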

6 - Make predictions:

y_pred = model.predict(X_test)
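Under the hood, predict() evaluates the fitted linear equation for each row of the input. The following sketch, on made-up data that follows y = 1 + 2*x exactly, compares predict() with the same calculation done by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data that follows y = 1 + 2*x exactly.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X_train, y_train)

X_test = np.array([[5.0], [10.0]])
y_pred = model.predict(X_test)

# The same predictions computed manually: b0 + b1 * x for each row.
manual = model.intercept_ + X_test @ model.coef_

print(y_pred)                        # approximately [11. 21.]
print(np.allclose(y_pred, manual))   # True
```

Note that X_test must have the same number of columns as the data used in fit(), even when there is only one feature.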

7 - Evaluation:

# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse) # RMSE is the square root of MSE (the squared=False argument is no longer available in recent scikit-learn releases)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the evaluation results
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Absolute Error (MAE):", mae)
print("R-squared (Coefficient of Determination):", r2)
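Putting the seven steps together, here is a minimal end-to-end sketch that can be run as is. The synthetic dataset (200 samples generated from y = 4 + 3*x1 - 2*x2 plus noise) is an assumption for illustration; with your own data you would replace only the lines that create X and y.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Step 2: synthetic data (hypothetical relationship y = 4 + 3*x1 - 2*x2 + noise).
rng = np.random.default_rng(seed=42)
X = rng.uniform(0.0, 10.0, size=(200, 2))
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 1.0, size=200)

# Steps 3 to 6: split the data, create the model, train it, and predict.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Step 7: evaluate on the held-out testing set.
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Intercept: {model.intercept_:.2f}, coefficients: {np.round(model.coef_, 2)}")
print(f"MSE: {mse:.3f}  RMSE: {rmse:.3f}  MAE: {mae:.3f}  R-squared: {r2:.3f}")
```

The fitted coefficients should land close to 3 and -2, and the RMSE should be close to 1, the standard deviation of the noise that was added.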

In conclusion, linear regression holds a fundamental position in the field of data science. It is often the first model to try on a regression problem, and it stays useful well beyond that first step: its coefficients are easy to interpret, and the evaluation metrics covered here make it simple to check how well it predicts. Together, these help you extract insights from data and make informed, data-driven decisions.

Nermeen Abd El-Hafeez

Passionate data science enthusiast, 2 years of self-study. Proficient in Python, skilled in data analysis, machine learning, and deep learning techniques.