Linear Regression: Machine Learning

TC. Lin
5 min read · Jan 2, 2023


This article continues from the previous: Data Preprocessing.

In general, there are 4 main types of learning:
1. Supervised Learning
2. Unsupervised Learning
3. Transfer Learning
4. Reinforcement Learning

Regression models are under supervised learning, and are used for predicting a continuous numerical value such as house prices, salary, stocks, etc.

Before diving deep into linear regression, we first have to understand Simple Linear Regression.

Simple Linear Regression

This might sound familiar, as we have all seen the formula ‘y = mx + c’ at some point in our lives, and this is exactly the equation for simple linear regression.

(Image source: https://www.superdatascience.com/)

Mathematics again? Don’t worry, I will start by explaining the meaning of the formula, so bear with me.

Using the following dataset as an example:
(Imagine you are trying to predict a person’s income depending on the years of experience that this person has worked, and the existing data is given below)

(Showing only the first 5 rows of the dataset)

After plotting all the data points, we can draw the line of best fit across them.

In this case, for the formula y = mx + c:
> y: the value that we are trying to predict (salary, the dependent variable)
> x: the years of experience (feature, the independent variable)
> c: the y-intercept, i.e. the salary when x = 0
> m: the slope coefficient, i.e. how much the salary increases for each extra year of experience (the steepness of the line of best fit).

After getting the line of best fit, all of the future predictions we make will be values along this line.
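The whole idea can be sketched in a few lines of Python. The numbers below are made up for illustration (not the article's dataset), and chosen to be perfectly linear so the fitted line recovers them exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: years of experience -> salary (perfectly linear on purpose)
X = np.array([[1], [2], [3], [4], [5]])            # years of experience
y = np.array([40000, 45000, 50000, 55000, 60000])  # salary

model = LinearRegression().fit(X, y)

m = model.coef_[0]    # slope: extra salary per extra year
c = model.intercept_  # y-intercept: predicted salary at 0 years

# Any future prediction lies on the line of best fit: y = m*x + c
pred = model.predict(np.array([[6]]))[0]
print(m, c, pred)  # m ≈ 5000, c ≈ 35000, pred ≈ 65000
```
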

Phew! That ain’t that hard, right?

Multiple Linear Regression

After understanding Simple Linear Regression, I bet multiple linear regression isn’t a hard topic for you now.

Simply put, it is just a Simple Linear Regression with multiple features.
From the example above, what if we have an extra column ‘Age’?
> In this case, both ‘Age’ and ‘Years of Experience’ will affect the final outcome of the dependent variable, salary.

The equation for Multiple Linear Regression is:

y = b0 + b1X1 + b2X2 + … + bnXn

(Image source: https://www.superdatascience.com/)

By following the above equation, we can treat X1 as ‘Years of Experience’ and X2 as ‘Age’.

Similar to Simple Linear Regression, each feature (Xn) in a dataset has its own slope coefficient (bn), and we simply add up the product of each Xn & bn, and finally add the y-intercept to it.
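As a quick sanity check, the equation above is just a sum of products plus the intercept. A tiny sketch with made-up coefficients (not fitted from any real data):

```python
import numpy as np

# Made-up coefficients, not fitted from real data
b0 = 30000                 # y-intercept
b = np.array([2500, 400])  # b1 for 'Years of Experience', b2 for 'Age'
x = np.array([5, 30])      # X1 = 5 years of experience, X2 = age 30

# y = b0 + b1*X1 + b2*X2: sum the products, then add the intercept
y = b0 + np.dot(b, x)
print(y)  # 30000 + 12500 + 12000 = 54500
```
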

Told you! It’s not that hard, is it?

Applying Linear Regression Model

Now comes the fun part: let’s train a model using Linear Regression.

As the previous article mentioned, we will make use of our amazing libraries: NumPy, Pandas, Matplotlib, and Scikit-Learn.

First, we import our data:

import pandas as pd

df = pd.read_csv('data.csv')
df.head() # Showing only the first 5 rows

Then, we have to separate our data into independent variables and the variable that we want to predict respectively.

X = df.drop('Profit', axis=1).values # Dropping 'Profit' column
y = df['Profit'].values # Keeping only 'Profit' column

Secondly, we have to encode our categorical data.

As the previous article mentioned, we have to turn categorical data into numerical data, and in this case, it is the ‘State’ column.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# ColumnTransformer applies the transformation to the chosen column(s)
# OneHotEncoder is the tool we use to encode our categorical values
ct = ColumnTransformer(transformers=[('encoder',
                                      OneHotEncoder(),
                                      [3])], # 'State' is located at column index 3
                       remainder='passthrough') # Keep the values in all other columns

X = ct.fit_transform(X)

After applying the above code… voila!

The first 3 columns are the encoded categorical values.
> As there are a total of 3 categories in a column, 3 new columns were made.
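To see what the encoder actually does, here is a tiny self-contained sketch. The rows below are made up and only mimic the shape of the article's dataset (three numerical columns plus a 'State' column at index 3):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows shaped like the article's data: [R&D, Admin, Marketing, State]
X = np.array([
    [165349.2, 136897.8, 471784.1, 'New York'],
    [162597.7, 151377.6, 443898.5, 'California'],
    [153441.5, 101145.6, 407934.5, 'Florida'],
], dtype=object)

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [3])],
    remainder='passthrough')

X_enc = ct.fit_transform(X)
# The encoded 'State' columns come first (categories in alphabetical order),
# followed by the untouched numerical columns
print(X_enc)
```

Each row gets exactly one 1 among the first three columns, marking which state it belongs to.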

Splitting the dataset

As mentioned in the preprocessing article, we then split our data into a training set & a test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Training the model

Now comes the fun part, we start training our model!

With Scikit-Learn, things couldn’t be easier.

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

regressor.fit(X_train, y_train)

We simply import the Linear Regression model, fit (train) it using our training data, and it is done!
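Once fitted, the learned y-intercept (b0) and slope coefficients (bn) can be inspected via intercept_ and coef_. A self-contained sketch with synthetic data (not the article's dataset), generated from a known equation so the fit recovers it exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 10 + 3*x1 + 2*x2,
# so fitting should recover exactly those numbers
rng = np.random.default_rng(0)
X = rng.random((20, 2))
y = 10 + 3 * X[:, 0] + 2 * X[:, 1]

regressor = LinearRegression().fit(X, y)
print(regressor.intercept_)  # b0, close to 10.0
print(regressor.coef_)       # [b1, b2], close to [3.0, 2.0]
```
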

Predicting the Test Set using our trained model

y_preds = regressor.predict(X_test)

Our trained model will predict the ‘Profit’ variable with our given Test Set:

Predicted results

Amazing! Isn’t it?
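One handy trick for eyeballing how close the predictions are is to stack the predicted and actual values side by side. The numbers below are made-up stand-ins for y_preds and y_test:

```python
import numpy as np

# Made-up stand-ins for the real y_preds / y_test
y_preds = np.array([103015.2, 132582.3, 71976.1])
y_test = np.array([103282.4, 144259.4, 71498.5])

# Stack them as two columns: predicted on the left, actual on the right
comparison = np.column_stack((y_preds, y_test))
print(comparison)
```
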

To look at how well our model is doing, we can compute the R² score on the test set:

from sklearn.metrics import r2_score

r2_score(y_test, y_preds)

Nice! Out of a maximum score of 1, we get 0.9347!

That wasn’t hard right?

Although the theory might be a little complicated, applying it becomes so easy, thanks to Scikit-Learn!

Even though applying ML algorithms is easy, it is worthwhile knowing what is under the hood, especially to get an idea of how the models work.

Do not be intimidated by what is coming up, all of these have been taken care of for you by Scikit-Learn when you apply it.

Assumptions when applying Linear Regression

There is an amazing image by https://www.superdatascience.com/ listing the assumptions of Linear Regression. In short, the standard assumptions are:

1. Linearity: the relationship between the features and the dependent variable is linear
2. Homoscedasticity: the residuals have constant variance
3. Multivariate normality: the residuals are roughly normally distributed
4. Independence of errors: the residuals are not correlated with each other
5. Lack of multicollinearity: the features are not highly correlated with each other

These tell us when Linear Regression can be correctly applied.
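A couple of these assumptions can be eyeballed from the residuals (actual minus predicted values). A tiny sketch with made-up numbers standing in for y_test and y_preds:

```python
import numpy as np

# Made-up stand-ins for y_test / y_preds
y_test = np.array([103282.4, 144259.4, 71498.5, 110352.3])
y_preds = np.array([103015.2, 132582.3, 71976.1, 116161.2])

# Residuals should scatter around zero with no obvious pattern:
# roughly what the linearity & homoscedasticity assumptions ask for
residuals = y_test - y_preds
print(residuals)
```

In practice you would plot the residuals against the predicted values and look for funnel shapes or curves.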

Eliminating unnecessary features in Linear Regression

Imagine that you have 20 features to predict one single value, there is a high probability that some of the features are not necessary.

If we train our model with all unnecessary features, this may result in a poor training process.

Some of the common ways to build our models are:

  1. Backward Elimination
  2. Forward Selection
  3. Bidirectional Elimination
  4. All Possible Models

Going through the above methods by hand takes a lot of time. Luckily enough, we have Scikit-Learn!
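For example, Scikit-Learn's RFE (Recursive Feature Elimination) automates something close to Backward Elimination: it repeatedly drops the weakest feature until the requested number remains. A sketch on synthetic data where only two of twenty features actually matter:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 20 features, but only features 0 and 1 actually drive y
rng = np.random.default_rng(42)
X = rng.random((100, 20))
y = 5 * X[:, 0] + 3 * X[:, 1]

# RFE repeatedly refits the model and drops the feature
# with the smallest coefficient
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X, y)
print(np.where(selector.support_)[0])  # indices of the surviving features
```
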

> Continue reading: Polynomial Regression
