ML Series: Day 4 — Multiple Linear Regression (MLR)

Ebrahim Mousavi
11 min read · Jan 15, 2024


Figure 1. Multiple Linear Regression vs Simple Linear Regression

Multiple linear regression is a generalized form of simple linear regression that uses several independent variables to predict the dependent variable. According to Table 1, using engine size (x_1), number of cylinders (x_2), and fuel consumption (x_3) to estimate the emitted CO2 (y) is multiple linear regression. The multiple linear regression formula is shown in Equation 1.

Equation 1 for multiple linear regression:

y = b_0 + b_1 x_1 + b_2 x_2 + … + b_m x_m

In Equation 1, m refers to the number of independent variables. Training the model is similar to simple linear regression, with the difference that instead of determining a and b, here we have to find m+1 suitable numbers (b_0, b_1, …, b_m). Multiple linear regression makes it possible to show how effective the independent variables (x_1, x_2, x_3, …, x_m) are in estimating the dependent variable (y), and to measure how a change in each independent variable changes the dependent variable.
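
To make Equation 1 concrete, here is a minimal NumPy sketch that evaluates the prediction for a single sample with three features; the coefficient and feature values below are made up purely for illustration:

import numpy as np

# hypothetical coefficients: b_0 (intercept) and b_1 ... b_3
b0 = 110.0
b = np.array([12.5, 7.0, 20.3])

# one sample: engine size x_1, number of cylinders x_2, fuel consumption x_3
x = np.array([2.0, 4.0, 8.5])

# Equation 1: y = b_0 + b_1*x_1 + b_2*x_2 + ... + b_m*x_m
y_hat = b0 + np.dot(b, x)
print(y_hat)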

Table 1 shows the independent variables and the dependent variable in multiple linear regression:

Table 1. Independent variables and the dependent variable

As mentioned earlier, the data cannot be visualized in more than three dimensions, but the target values (labels) can be plotted against each of the independent variables (x_i). The data introduced in this chapter has seven characteristics or features; Figure 2 plots the dependent variable, CO2 emissions (CO2_Emissions), against three of these features (independent variables).

Figure 2. Drawing dependent variable data for each of the features

As can be seen, the points are relatively scattered, which shows that the dependent variable (y) is a function of several independent variables; in other words, it is not affected by only one variable, and several variables play a role in determining the output value.
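
A plot like Figure 2 can be produced with matplotlib. This is a minimal sketch, assuming the fuel-consumption data is loaded from a CSV file; the file name and column names here are assumptions for illustration, not necessarily the exact ones from the original dataset:

import pandas as pd
import matplotlib.pyplot as plt

# assumed file and column names for the fuel-consumption data from the earlier days
df_fuel = pd.read_csv('FuelConsumption.csv')
features = ['Engine_Size', 'Cylinders', 'Fuel_Consumption']

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, features):
    # scatter of the dependent variable against one independent variable
    ax.scatter(df_fuel[col], df_fuel['CO2_Emissions'], s=10)
    ax.set_xlabel(col)
    ax.set_ylabel('CO2_Emissions')
plt.tight_layout()
plt.show()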

I am going to write code that solves multiple linear regression on a dataset called “50 Startups Data”, which is well suited to this task. According to the Kaggle website:

The dataset we see here contains data about 50 startups. It has 5 columns: “R&D Spend”, “Administration”, “Marketing Spend”, “State”, “Profit”.
The first 3 columns indicate how much each startup spends on Research and Development, how much they spend on Marketing, and how much they spend on administration costs; the State column indicates which state the startup is based in, and the last column states the profit made by the startup.

If you would like to access the code and dataset, please open my GitHub repository and download the “50_Startups.csv” file.

First of all, we should import some required libraries:

Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Then we load the CSV dataset with pandas:

Load dataset

df = pd.read_csv('./Datasets/50_Startups.csv')
df.head()
Figure 3. The first five rows of data
print(df.dtypes)
Figure 4. Types of columns or features

We should convert the “object” type to the “category” type, which is possible with the following pandas command:

df['State'] = df['State'].astype('category')
print(df.dtypes)
Figure 5. Converted object to category

This is the time to clean and pre-process the data:

1. Check and handle missing values

We can see null values in pandas with the following command:

df.isnull().sum()
Figure 6. See null values

As you see in the above figure, we have two null values in the second column (Administration) and one in the third column (Marketing Spend). To visualize missing values you can also use another great library called Missingno. Missingno is a Python library used for visualizing missing data in datasets. It provides a convenient way to identify patterns and understand the extent of missing values in a dataset.

import missingno as msno
msno.matrix(df)
Figure 7. Missingno output for finding missing values

You can use the fillna() function to replace NaN values in a pandas DataFrame with a specific value. A common choice is the median, so I fill the missing values of each column with that column’s median.

Check Administration feature:

m = df['Administration'].median()
print(f"Median for Administration feature is:", m)

# Median for Administration feature is: 122699.795

The number of missing values for each feature is computed with this command:

print(df['Administration'].isna().sum())
# 2

Finally, fill the missing values with this command (fill with the median):

df['Administration'].fillna(m, inplace=True)

print(df['Administration'].isna().sum())
# 0

We should also fill the missing value for the other feature (Marketing Spend):

med_marketing = df['Marketing Spend'].median()
df['Marketing Spend'].fillna(med_marketing, inplace=True)
df['Marketing Spend'].isna().sum()
# 0

After filling the missing values with the median of each feature, we look at the output of the msno library again:

msno.matrix(df)
Figure 8. Missingno output after filling missing values

2. Encoding categorical features

Before I explain what encoding is and how to do it, take a look at the current output of the DataFrame:

df.head()
Figure 9. The output of data before encoding

In machine learning, it is common to have categorical variables (features) that contain non-numerical values. However, most machine learning algorithms require numerical inputs. To address this, we perform a process called “encoding categorical features”, which involves converting these categorical variables into numerical representations that can be understood and processed by machine learning models.

There are several common methods for encoding categorical features:

  1. One-Hot Encoding: For each categorical variable, a new binary feature is created for each unique category. The binary feature takes a value of 1 if the category is present and 0 otherwise. One-hot encoding expands the feature space but preserves the information about the categories. It is commonly performed with the pd.get_dummies() function in pandas. For example, our data has a column called “State” with the categories “New York”, “California”, and “Florida”; pd.get_dummies() creates a new column for each category and assigns a value of 1 or 0 depending on whether that category is present. The result will have columns like “State_California”, “State_Florida”, and “State_New York” with binary values indicating the presence or absence of each category in the original “State” column.
  2. Label Encoding: Each category is assigned a unique integer label. The labels are assigned in a way that preserves the ordinal relationship between the categories, if any. We do not need this method here, because the State feature has no order; a quick sketch is shown right after this list just for reference.
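
For reference only (we stick with one-hot encoding in this article), label encoding is a one-liner either with pandas category codes or with scikit-learn’s LabelEncoder; this is just a sketch:

# label encoding with pandas: each category of 'State' gets an integer code
df['State_code'] = df['State'].cat.codes

# equivalent with scikit-learn
from sklearn.preprocessing import LabelEncoder
df['State_code'] = LabelEncoder().fit_transform(df['State'])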

This is how we apply one-hot encoding with pandas:

df_encoded = pd.get_dummies(df, columns=['State'], dtype=np.float64)
df_encoded.head()
Figure 10. The output of data after encoding the ‘State’ feature

3. Change the order of columns

Our label is Profit, and it is common to place the label in the last column. With the following code we can arrange the columns in any order we like:

df_encoded = df_encoded[['R&D Spend', 'Administration', 'Marketing Spend',
'State_California', 'State_Florida', 'State_New York', 'Profit']]

df_encoded.head()
Figure 11. Reordered features

4. Rename the columns so that they are code-friendly

In data analysis and machine learning, it is generally recommended to avoid using column names with spaces and instead convert them to a format that is more convenient and compatible with programming languages. The most common approach is to replace spaces with underscores or another suitable character.

df_encoded.rename(columns={'R&D Spend': 'R&D_Spend',
'Marketing Spend': 'Marketing_Spend',
'State_New York': 'State_New_York'}, inplace=True)
df_encoded.head()
Figure 12. Renamed columns

5. Box Plot for Outliers

Outliers are data points that significantly deviate from the majority of the data in a dataset. They can be observations that are unusually large or small compared to the other values. Outliers can occur due to various reasons such as measurement errors, data corruption, or rare events.

It is important to consider outliers in machine learning models because they can have a significant impact on the model’s performance and results. Outliers can distort statistical measures such as the mean and standard deviation, leading to biased estimates and inaccurate predictions. They can also affect the assumptions of certain algorithms, such as linear regression, which assume that the data follows a normal distribution.

One commonly used technique to identify and visualize outliers is the box plot (also known as a box-and-whisker plot). A box plot provides a visual summary of the distribution of a dataset by displaying the quartiles, median, and potential outliers. The box represents the interquartile range (IQR), which contains the middle 50% of the data. The whiskers extend from the box to the minimum and maximum data points within a certain range (often 1.5 times the IQR). Data points outside this range are considered potential outliers and are plotted as individual points.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 4, figsize=(20, 8))
axes_flat = axes.flatten()

for i, col in enumerate(df_encoded.columns):
    ax = axes_flat[i]
    ax.boxplot(df_encoded[col])
    ax.set_title(col)
Figure 13. Box plot for finding outliers

Note: When working with one-hot encoded features, there is no need to check for outliers as you would with numerical features. Since one-hot encoded features consist only of binary values (0 or 1), there is no numerical distribution to assess for outliers; instead of measuring the magnitude or deviation of values, they only indicate the existence or absence of a certain category. Thus, the notion of outliers is irrelevant for one-hot encoded features. Nevertheless, I plotted the box plot for all features.

There are several approaches to removing outliers from a dataset; one common and useful method is based on the IQR, which I explain below:

  • Calculate the interquartile range (IQR) for the dataset by subtracting the first quartile (Q1) from the third quartile (Q3).
  • Define a lower bound (Q1 - 1.5 * IQR) and an upper bound (Q3 + 1.5 * IQR).
  • Remove any data points that fall below the lower bound or above the upper bound.

This is the image that shows how the IQR method works:

Figure 14. IQR method for outlier detection

Calculate the first quartile (q1), third quartile (q3), and interquartile range (iqr) for each column in df_encoded. This helps in identifying the range within which most of the data points lie.

q1 = df_encoded.quantile(0.25)
q3 = df_encoded.quantile(0.75)
iqr = q3 - q1

Calculate the lower and upper cutoff values based on the interquartile range. These cutoff values serve as thresholds for determining outliers. Data points below cutoff_low or above cutoff_high are considered potential outliers.

cutoff_low  = q1 - (1.5 * iqr)
cutoff_high = q3 + (1.5 * iqr)

Create a boolean mask that identifies which data points in df_encoded fall within the acceptable range defined by the cutoff values. This mask is True for data points that are not considered outliers and False for potential outliers.

mask = (df_encoded >= cutoff_low) & (df_encoded <= cutoff_high)

Apply the boolean mask to filter out the potential outlier rows from df_encoded. mask.all(axis=1) checks if all the values in each row of the mask are True, indicating that none of the columns in that row contain outliers. The resulting df_filtered contains only the rows without outliers.

df_filtered = df_encoded[mask.all(axis=1)]

The integrated code for removing outliers is here:

q1 = df_encoded.quantile(0.25)
q3 = df_encoded.quantile(0.75)
iqr = q3 - q1

cutoff_low = q1 - (1.5 * iqr)
cutoff_high = q3 + (1.5 * iqr)

mask = (df_encoded >= cutoff_low) & (df_encoded <= cutoff_high)

df_filtered = df_encoded[mask.all(axis=1)]
df_filtered.head()
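
Since, as noted above, the one-hot columns cannot really contain outliers, an optional variation is to apply the same IQR filter only to the continuous columns. This is just a sketch using the column names of this dataset, not what we do in the rest of the article:

numeric_cols = ['R&D_Spend', 'Administration', 'Marketing_Spend', 'Profit']

q1 = df_encoded[numeric_cols].quantile(0.25)
q3 = df_encoded[numeric_cols].quantile(0.75)
iqr = q3 - q1

# keep a row only if all of its continuous values fall inside the IQR bounds
mask_num = (df_encoded[numeric_cols] >= q1 - 1.5 * iqr) & (df_encoded[numeric_cols] <= q3 + 1.5 * iqr)
df_filtered_num = df_encoded[mask_num.all(axis=1)]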

It is time to create a new DataFrame that contains the rows of df_encoded that are identified as outliers based on the boolean mask.

# see outliers with ~ [~mask.all]
df_outliers = df_encoded[~mask.all(axis=1)]
df_outliers.head()
Figure 15. A new DataFrame that contains outliers

6. Feature selection/Reduction

Feature selection or reduction with correlation involves identifying and selecting or reducing features in a dataset based on their correlation with the target variable or with each other. By examining the strength and direction of the relationships between features, this approach helps to identify the most relevant and informative features for a given task, potentially improving model performance and reducing computational complexity.

While the Pearson correlation coefficient is commonly used for measuring linear correlation, other correlation measures can capture different types of relationships. For example:

  1. Pearson correlation coefficient: It measures the strength and direction of the linear relationship between two continuous variables.
  2. Spearman correlation coefficient: It assesses the monotonic relationship between variables, i.e. any increasing or decreasing trend, regardless of linearity.
  3. Kendall’s tau: It measures the ordinal association between variables, which is suitable for ranked or ordinal data.

Here we compute the Pearson correlation matrix for our filtered data:

correlation = df_filtered.corr(method='pearson')
correlation
Figure 16. Correlation between features
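
If you want to compare the Pearson result with the rank-based measures listed above, the same pandas call accepts them as the method argument (scipy may be needed under the hood); a small sketch:

spearman_corr = df_filtered.corr(method='spearman')
kendall_corr = df_filtered.corr(method='kendall')

# correlation of each feature with the label, sorted, for both measures
print(spearman_corr[['Profit']].abs().sort_values(by='Profit', ascending=False))
print(kendall_corr[['Profit']].abs().sort_values(by='Profit', ascending=False))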

We need to know the impact of each feature on the label (Profit), and with these lines of code we can see it:

corr = df_filtered.corr()
corr[['Profit']].abs().sort_values(by='Profit', ascending=False)
Figure 17. Sorted Correlation based on label

Now we apply this idea: feature selection with correlation means keeping the features that are most related to the target, so I want to remove the features that have a low correlation with the label (Profit):

df_copy = df_filtered.copy()
df_copy.drop(['State_California', 'State_Florida', 'State_New_York'], axis=1, inplace=True)
df_copy.head()
Figure 18. Dataset after dropping some features

Train test split

Here, we assign the X and y variables: X holds the independent variables (features) and y holds the dependent variable (label).

X = df_copy.iloc[:, :-1].values
y = df_copy.iloc[:, -1].values

Then we split the data into train and test sets.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

x_train.shape, y_train.shape, x_test.shape, y_test.shape
# ((37, 3), (37,), (10, 3), (10,))

Model

from sklearn.linear_model import LinearRegression

# create model
model = LinearRegression()

Train

model.fit(x_train, y_train)
model.intercept_, model.coef_

Test

# Test model score for train data
model.score(x_train, y_train)
# 0.966
# Test model score for test data
model.score(x_test, y_test)
# 0.91

Predict

y_hat = model.predict(x_test[[0], :])
y_hat

# array([103132.59393452])
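
To sanity-check this prediction, we can compare it with the true profit of the same test row:

# predicted vs. actual profit for the first test sample
print("predicted:", y_hat[0])
print("actual:", y_test[0])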

In Part 4, we talked about multiple linear regression and how we should preprocess the data. In Part 5: Machine Learning Series: Day 5 — Nonlinear regression, we discuss nonlinear regression, a form of regression analysis in which observational data are modeled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables.

If you like the article and would like to support me make sure to:

👏 Clap for the story (as much as you liked it) and follow me 👉
📰 View more content on my medium profile
🔔 Follow Me: LinkedIn | Medium | GitHub | Twitter
