Linear Regression is simple!

#1 of Machine Learning Algorithms Series

Bala Kowsalya
Nov 8

I know you are curious about Machine Learning and scratching your head about where to start. You are on the right track. In this series of blog posts, we are going to dive deeper into ML algorithms.

Let’s get started!
If you are a newbie, you might have heard of Linear Regression, which is a great first step for getting your feet wet with ML algorithms. Linear Regression is used to predict continuous target values.

By the end of this tutorial, you will understand,

  • What is Linear Regression?
  • Terminologies
  • Linear Regression with one variable — Univariate regression
  • Linear Regression with multiple variables — Multivariate regression
  • Implementation of the algorithm using ‘sklearn’ in Python

What is Linear Regression?

As the name suggests, the Linear Regression model figures out the linear relationship between the feature variable(s) and a target variable so that it can be used to predict the target value for an unknown input value.

To make it simpler, let us take an example: suppose you have a dataset with columns Age and Height (here, Age is the input feature and Height is the target value). Height increases as age increases, so there is a roughly linear relationship between the two.

Now we need to find the best fitting regression line that can pass through Age and Height data.

But, how to decide which line is the best fitting one?

Answer: 💡 The line with the minimum difference between the actual values and the predicted values is considered the best-fitting regression line for a given dataset.

Regression Line in Age — Height dataset

Once we have found the best-fitting line for the data, we can predict the unknown height of a person at a particular age using this regression line.
For example, using the regression line (shown in the image below), we can predict that the height of a person aged 5 years 6 months will be approximately 1.1 meters.

Predicting the unknown target value

This is how linear regression is used to predict unknown target values for a given set of feature values: you can predict any future/unknown target value from historical data. I hope it is now clear what linear regression is.

Still, how this algorithm is implemented and how we arrived at that regression line seems mysterious, right?

To demystify how this algorithm works, we need to cover some terminology and understand how exactly we found our best-fitting line. That is where the actual implementation of this algorithm lies.


Understanding Linear Regression Better

Linear Regression attempts to predict the dependent variable(y) based on the values of the independent variable(x). It figures out the linear relationship between the feature(x) and the target variable(y).

The linear relationship between the variables x and y can be expressed in an equation like this,

y = mx + b (linear equation)

where b is the intercept and m is the slope of the line. For a given dataset of x and y values, Linear Regression finds the optimal values for the slope and the intercept, yielding the line with the least error.
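To make the equation concrete, here is a minimal sketch in plain Python (the slope and intercept values are illustrative, not fitted from real data):

```python
# Predict y from x on the line y = m*x + b.
def predict(x, m, b):
    return m * x + b

# Illustrative parameters: slope m = 2.0, intercept b = 1.0
print(predict(3.0, 2.0, 1.0))  # 7.0
```

Fitting the model is exactly the problem of choosing m and b so that predictions like this stay close to the observed data.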

Process of fitting a linear regression model

The graphical representation above depicts the iterations in which the linear regression algorithm searches for the best-fit line for the dataset. During those iterations, the model is trained to predict the target values (y) for unknown inputs.


Technically speaking,

Hypothesis

The hypothesis h(x) is a function that takes x values as input and predicts values of y. Since the predicted values are not exact but the closest estimate, this equation is called the hypothesis.

Overview of prediction

Cost function

We measure the accuracy of our hypothesis function using a cost function. The cost function measures the difference between the actual target values and the predicted target values. Here it is the Mean Squared Error (halved, as shown below).

Cost Function

The mean is halved (1/2) because the derivative of the squared term cancels out the 1/2 in the gradient descent computation.
We need to minimize the cost function to obtain the most accurate predictions.
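As a rough sketch, the halved-MSE cost can be computed like this in plain Python (the data points are made up for illustration):

```python
# Cost J(m, b) = (1 / 2n) * sum((m*x_i + b - y_i)^2) over the dataset.
def cost(m, b, xs, ys):
    n = len(xs)
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

xs = [1, 2, 3]
ys = [2, 4, 6]                 # generated from y = 2x
print(cost(2.0, 0.0, xs, ys))  # 0.0 — the line y = 2x fits exactly
print(cost(1.0, 0.0, xs, ys))  # larger cost for a worse line
```

The best-fitting line is simply the (m, b) pair that makes this number as small as possible.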

Gradient descent

Gradient descent is an optimization algorithm that minimizes the cost function by iteratively moving in the direction of steepest descent. It is used to update the parameters of the hypothesis.

Iterative gradient descent algorithm

As you can see in the first graph, the parameters of the hypothesis are computed iteratively until the algorithm reaches a minimum and the cost function can no longer be reduced. Comparing the two graphs side by side, you can see that we get the best-fitting regression line at that minimum. (For linear regression the cost function is convex, so this local minimum is also the global minimum.)
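The update loop described above can be sketched as follows — a minimal batch gradient descent for the univariate case, on made-up data (not the article's dataset):

```python
# Batch gradient descent minimizing the halved-MSE cost J(m, b).
def gradient_descent(xs, ys, lr=0.1, iters=1000):
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        # Partial derivatives of J with respect to m and b
        grad_m = sum((m * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= lr * grad_m       # step opposite the gradient
        b -= lr * grad_b
    return m, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]              # generated from y = 2x + 1
m, b = gradient_descent(xs, ys)
print(round(m, 3), round(b, 3))  # converges near m = 2, b = 1
```

Each iteration nudges m and b a small step downhill, which is exactly what the iterative graphs above are showing.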

Learning Rate

The learning rate is a hyperparameter that controls how much we adjust the parameters of the hypothesis on each step towards the best-fitting regression line. A lower value makes us take baby steps towards the minimum; too large a value might overshoot and never converge.
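A tiny experiment (on made-up data) makes both behaviors visible — here we fit only the slope of y = m*x so the effect is easy to see:

```python
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]              # true slope is 2

def run(lr, iters=50):
    """Gradient descent on the slope only; returns the final m."""
    m, n = 0.0, len(xs)
    for _ in range(iters):
        grad = sum((m * x - y) * x for x, y in zip(xs, ys)) / n
        m -= lr * grad
    return m

print(run(0.1))   # converges to ~2
print(run(0.3))   # overshoots on every step and diverges
```

With lr = 0.1 the estimate settles at the true slope; with lr = 0.3 each step jumps past the minimum and the error grows without bound.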


Practical Implementation

‘Learning by doing is the best way to learn’

Univariate Linear Regression

Univariate Linear Regression is an algorithm to find the linear relationship between a single feature variable(x) and a target variable(y). Here, the dependent variable relies only on a single feature variable.

Let’s implement the Univariate Linear Regression algorithm on a simple dataset that has only 14 entries.

Importing the data

Download the dataset from this link. The dataset has two columns,
1) Experience in years and 2) Salary in rupees
From our intuition, we can guess that the salary will change with respect to the experience of an employee. Now let’s create a machine learning model to predict the salary of a person with some ‘x’ years of experience.


  • Import the packages needed
# import the packages needed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
  • Import the data from a .csv file to a variable using Pandas’ .read_csv()
data = pd.read_csv('linear-regression-dataset.csv')
  • Create feature(X) and target(y) variable from the dataset
col = data.shape[1]        # number of columns in the dataset
X = data.iloc[:, :-1]      # all columns except the last → feature(s)
y = data.iloc[:, col-1]    # last column → target
  • Plot the data
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X,y,label='Training Data')
ax.legend(loc=2)
ax.set_xlabel('Experience(Years)')
ax.set_ylabel('Salary(Rs)')
ax.set_title('Experience Vs Salary')
Plotting the data: Experience vs Salary
  • Split the data into train and test set using train_test_split function in sklearn.
    We usually divide the dataset into a train set and a test set. This practice lets us train the model on one part of the data and evaluate its accuracy by predicting the target values of the held-out test set, which helps us detect overfitting.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)
  • Import linear_model from sklearn package and train the model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)

  • Now the model can be used to predict the target values of test data
y_pred = model.predict(X_test)
y_pred

array([[ 5100.92944233],
[18486.098341 ],
[10678.08315011]])

  • You can see the difference in values between actual target data and predicted target data.
print(y_test)
  • Evaluate the model using score() — for regression, this returns the R² score
model.score(X_test,y_test)

0.9455565414771729
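For regression, score() returns the R² (coefficient of determination) rather than a classification accuracy. A minimal sketch of the same computation, on illustrative numbers:

```python
# R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals
# and SS_tot is the total sum of squares around the mean of y.
def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

print(r2([1, 2, 3], [1, 2, 3]))        # 1.0 — perfect predictions
print(r2([1, 2, 3], [1.1, 2.0, 2.9]))  # slightly below 1.0
```

A value near 1.0 means the regression line explains most of the variance in the target, which is why the 0.94 above indicates a good fit.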

  • Plot and visualize the data and the regression line
y_pred = model.predict(X_test).flatten() 
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(X_test, y_pred ,'r', label='Prediction')
ax.scatter(X_test, y_test, label='Test Data')
ax.legend(loc=2)
ax.set_xlabel('Experience(Years)')
ax.set_ylabel('Salary(Rs)')
ax.set_title('Experience Vs Salary Prediction')
Regression Line and the test set data

Yes, you did it! You are halfway through linear regression.
The target variable does not always depend on a single feature variable; in many cases it depends on multiple feature variables. That is where the Multiple Linear Regression method comes in.


Multiple Linear Regression

Multiple linear regression relates multiple independent variables to a dependent variable. Everything else is similar to what we discussed for Univariate Regression.
Let’s take an example data and try to implement & predict the values.

Importing the data

Download the dataset from this link. The dataset contains data about 50 startups with column headers, 1) R&D spend, 2) Administration, 3) Marketing spend, 4) State and 5) Profit

#Importing packages needed
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Importing dataset
data = pd.read_csv('50_Startups.csv')
data.head()
Sample Data from 50_Startups dataset

The last column Profit alone is assigned to variable y (target variable)

#Defining feature and target variables
X = data.iloc[:, :-1].values
y = data.iloc[:, 4].values

Encoding categorical data

Machine learning models are based on mathematical equations and can operate only on numeric values, so we need to transform the categorical data into numerical values.
There are several methods available for transforming categorical data; the most commonly used is the one-hot encoding technique.

What is one-hot encoding?

It splits a column that has ‘n’ categories into ‘n’ separate columns, populating each new column with 0 or 1 according to each row’s category.
I took the example screenshots below from geeksforgeeks.com to illustrate it.

Sample dataset
Transformation of columns after one-hot encoding
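As a quick alternative sketch, pandas’ get_dummies does the same one-hot splitting in a single call (the ‘State’ values below are made up for illustration):

```python
import pandas as pd

# One made-up categorical column with 3 distinct states
df = pd.DataFrame({'State': ['New York', 'California', 'Florida', 'New York']})

# get_dummies creates one 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=['State'])
print(encoded.columns.tolist())  # one column per state
```

This is handy when your data is already in a DataFrame and you do not need the encoder as a reusable pipeline step.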
# Preprocessing the data to encode categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelEncoderX = LabelEncoder()
X[:,3] = labelEncoderX.fit_transform(X[:,3])
oneHotEncoder = OneHotEncoder(categorical_features = [3])
X = oneHotEncoder.fit_transform(X).toarray()
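Note that the categorical_features argument has been removed in newer scikit-learn versions; if the snippet above fails for you, a ColumnTransformer achieves the same encoding of column 3. A sketch on two made-up rows:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Two made-up rows: 3 numeric columns plus a categorical 'State' column
X = np.array([[165349.2, 136897.8, 471784.1, 'New York'],
              [162597.7, 151377.6, 443898.5, 'California']], dtype=object)

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), [3])],  # one-hot encode column index 3
    remainder='passthrough'              # keep the numeric columns as-is
)
X_encoded = ct.fit_transform(X)
print(X_encoded.shape)  # (2, 5): 2 state columns + 3 numeric columns
```

ColumnTransformer also skips the LabelEncoder step, since OneHotEncoder can handle string categories directly.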

Splitting data into two halves

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)

Fitting the model and evaluation

from sklearn.linear_model import LinearRegression

# Fitting the model
model = LinearRegression()
model.fit(X_train,y_train)
# Predicting test values
y_pred = model.predict(X_test)
y_pred
# Accuracy measure
model.score(X_test,y_test)

0.9347068473282303

y_pred

array([103015.20159796, 132582.27760816, 132447.73845174, 71976.09851258,
178537.48221055, 116161.24230166, 67851.69209676, 98791.73374686,
113969.43533013, 167921.06569551])

y_test

array([103282.38, 144259.4 , 146121.95, 77798.83, 191050.39, 105008.31,
81229.06, 97483.56, 110352.25, 166187.94])

You can see that the actual and predicted values match closely, with an R² score of about 0.93.
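Beyond score(), sklearn.metrics offers error measures in the target’s own units, which are often easier to interpret. A quick sketch on a few of the values above:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# A few actual vs predicted values from the test set above
y_true = [103282.38, 144259.40, 146121.95]
y_pred = [103015.20, 132582.28, 132447.74]

mae = mean_absolute_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5
print(mae)   # average absolute error, in rupees
print(rmse)  # penalizes large errors more heavily
```

Reporting the error in rupees alongside R² gives a more tangible sense of how far off the predictions are.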

That’s it. This is how regression methods are used to predict the unknown/future target values for the feature value(s).

I hope this article gives you an understanding of how to practically apply the linear regression algorithm to your own datasets.
Transform data into insights! 💡✔
Applaud yourself!

Happy Learning! 👏

Data Science Everywhere

Find the best articles related to Data Science

