Chapter 3: Simple Linear Regression

Yashithi Dharmawimala
Machine Learning for beginners
6 min read · Jan 4, 2021

What is Regression?

Regression is a method for determining the statistical relationship between a dependent variable and one or more independent variables. When we model the dependent variable as a linear function of a single independent variable, it is known as Simple Linear Regression, whereas using two or more independent variables is known as Multiple Linear Regression.

Linear Regression

In this blog post, we will be looking into Linear Regression.

As mentioned above, here we will be dealing with only two variables: one independent variable (X) and one dependent variable (Y). The goal is to obtain a relationship between these two variables such that, once a completely new X value is given, the machine learning algorithm predicts its corresponding Y value. Here's a teeny bit of math:

y = mx + b

Here the dependent variable is y and the independent variable is x, while m is the slope of the line and b is the intercept. This equation allows us to build a relationship between x and y. For instance, the mark that a student receives for a test (y) based on how many hours he or she studies for the test (x).

Consider the following data set:

The goal in linear regression is to find the best fit line for the given data set.

So how could we find this best fit line?

Least Square Method

Suppose we need to know the marks a student will get if they study for 5 hours. The line predicts 36 marks whereas the actual value is only 20. We must therefore try to minimize this distance as much as possible so that the prediction is close to the actual value. To minimize this distance we use the Least Squares method.

Here we take the vertical distance (the error) between each data point and the line, square it, and add all these squares up; the best fit line is the one that minimizes this sum of squared errors.
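
To make this concrete, here is a minimal NumPy sketch of the least-squares calculation using the closed-form formulas for the slope and intercept (the hours and marks values below are made-up numbers purely for illustration):

# Least Squares by hand with NumPy (illustrative numbers)
import numpy as np
hours = np.array([1, 2, 3, 5, 6, 8])
marks = np.array([10, 18, 25, 20, 45, 60])
# Slope m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept b = mean(y) - m * mean(x)
m = np.sum((hours - hours.mean()) * (marks - marks.mean())) / np.sum((hours - hours.mean()) ** 2)
b = marks.mean() - m * hours.mean()
print(m, b)        # slope and intercept of the best fit line
print(m * 5 + b)   # predicted marks for 5 hours of study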

Let’s start implementing, shall we?!

Python

Data set

The dataset and the code for this example can be accessed:

# Simple Linear Regression in Python
# DATA PREPROCESSING
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('marks_data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

The above code shows how data preprocessing can be done for this dataset. If this code feels unfamiliar, refer to this blog post on data preprocessing before proceeding.
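
As a quick sanity check (assuming the preprocessing code above has been executed), you can print the shapes of the resulting arrays to confirm that roughly two thirds of the rows went into the training set:

# Check the sizes of the train/test split (test_size = 1/3)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)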

Now that the data is preprocessed, let's see how to fit the simple linear regression model to the training set.

# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)

Here we imported the LinearRegression class and created an object of this class called 'regressor'. Then we used the fit method available in this class to fit the model to the training set. The parameters needed for the fit method are the X and y training data, which we extracted earlier as X_train and y_train.
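
Once the model is fitted, you can also inspect the slope and intercept it has learned. Here is a small sketch assuming the regressor object from the code above; coef_ and intercept_ are the attributes scikit-learn's LinearRegression exposes for the fitted parameters:

# The fitted line is: marks = coef_ * hours + intercept_
print(regressor.coef_[0])    # slope (m)
print(regressor.intercept_)  # intercept (b)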

After executing these lines of code, we have successfully trained our first machine learning model. Now let’s see how this simple regression machine learns the correlations in the training set and how it can predict the test set observations.

# Predicting the Test set Results
y_pred = regressor.predict(X_test)

The predict method takes the test set and builds the relevant predictions according to the regression model we prepared earlier. Once you execute this line of code you will be able to compare the vectors y_pred and y_test to see how accurate the predictions of your machine learning model are!
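
To go beyond eyeballing y_pred against y_test, here is a minimal sketch that scores the predictions with scikit-learn's metrics and plots the fitted line over the training data, assuming the variables from the code above:

# Evaluating the predictions
from sklearn.metrics import mean_absolute_error, r2_score
print(mean_absolute_error(y_test, y_pred))   # average error in marks
print(r2_score(y_test, y_pred))              # closer to 1 means a better fit
# Visualising the Training set results
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Marks vs Hours (Training set)')
plt.xlabel('Hours')
plt.ylabel('Marks')
plt.show()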

R

The above simple regression model can be coded using R as follows:

# Simple Linear Regression in R
# Importing the dataset
dataset = read.csv('marks_data.csv')

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Marks, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)

# Fitting Simple Linear Regression to the Training set
regressor = lm(formula = Marks ~ Hours, data = training_set)

# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)

# Visualising the Training set results
library(ggplot2)
ggplot() +
  geom_point(aes(x = training_set$Hours, y = training_set$Marks), colour = 'red') +
  geom_line(aes(x = training_set$Hours, y = predict(regressor, newdata = training_set)), colour = 'blue') +
  ggtitle('Marks vs Hours (Training set)') + xlab('Hours') + ylab('Marks')

# Visualising the Test set results
ggplot() +
  geom_point(aes(x = test_set$Hours, y = test_set$Marks), colour = 'red') +
  geom_line(aes(x = training_set$Hours, y = predict(regressor, newdata = training_set)), colour = 'blue') +
  ggtitle('Marks vs Hours (Test set)') + xlab('Hours') + ylab('Marks')

So far we have learned how to import a dataset and how to split it into a training set and a test set.

Now let’s look at how to perform simple linear regression with R.

Here we declared a new variable 'regressor' and used the 'lm' function to fit the linear model. The parameters of this function specified here are the formula and the dataset; the rest of the parameters take their default values. In the formula, the '~' indicates that Marks is modelled as a function of the number of Hours. Next, we specify the data needed to fit the model, which is the training set.

Finally, we predict the test set results using the predict function, which takes the regressor and the test set (as newdata) as parameters. After running that line of code, we can see the predictions made by our model.

Predictions obtained
Original data

By comparing the 5th and 8th predictions with the original data, you can observe that these predictions are close to the actual values.

In the code, we have plotted two graphs to observe the predictions clearly.

And we are done. Congratulations! You now know exactly how to train a simple linear regression model and make predictions with it.

See you in my next blog post to learn multiple linear regression!
