Linear Regression for Newbies

The Theory
Linear regression is the process of fitting a straight line to a dataset so that the line can be used to predict one variable from another.
Single Variable Linear Regression
The Mathematics
The equation of a line is y = m*x + c
Where,
y = dependent variable
x = independent variable
m = slope
c = intercept
The algorithm tries to fit a line to the data by adjusting the values of m and c. Its objective is to find values of m and c such that, for any given value of x, the line predicts the corresponding value of y as closely as possible.
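One way to make "predicting as closely as possible" concrete is to measure how far a candidate line's predictions fall from the observed values. The sketch below is only an illustration: the helper name mean_squared_error_of_line and the toy numbers are assumptions, not part of the datasets used in this article.

```python
import numpy as np

def mean_squared_error_of_line(x, y, m, c):
    """Average squared gap between the line m*x + c and the observed y (hypothetical helper)."""
    predictions = m * x + c
    return np.mean((y - predictions) ** 2)

# Toy data lying exactly on y = 2x + 1 (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

error_good = mean_squared_error_of_line(x, y, m=2.0, c=1.0)  # the line the data came from
error_bad = mean_squared_error_of_line(x, y, m=1.0, c=0.0)   # a poorer guess
print(error_good, error_bad)
```

The smaller the error, the better the choice of m and c; fitting the line amounts to searching for the pair with the smallest error.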
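There are various ways in which we can obtain the values of m and c:
1. Statistical approach (a closed-form formula such as ordinary least squares)
2. Iterative approach (repeated small updates such as gradient descent)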
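The two approaches can be sketched side by side. This is a minimal illustration on made-up data, assuming ordinary least squares for the statistical approach and plain gradient descent on the mean squared error for the iterative one; the function names, learning rate and step count are hypothetical choices, not taken from any library.

```python
import numpy as np

def fit_closed_form(x, y):
    """Statistical approach: least-squares slope and intercept from means and deviations."""
    x_mean, y_mean = x.mean(), y.mean()
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    c = y_mean - m * x_mean
    return m, c

def fit_gradient_descent(x, y, learning_rate=0.05, steps=5000):
    """Iterative approach: repeatedly nudge m and c down the gradient of the mean squared error."""
    m, c = 0.0, 0.0
    for _ in range(steps):
        residual = (m * x + c) - y
        m -= learning_rate * 2.0 * np.mean(residual * x)
        c -= learning_rate * 2.0 * np.mean(residual)
    return m, c

# Toy data lying exactly on y = 3x + 2 (made up for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

m_stat, c_stat = fit_closed_form(x, y)
m_iter, c_iter = fit_gradient_descent(x, y)
print(m_stat, c_stat)
print(m_iter, c_iter)
```

On this toy data both approaches arrive at essentially the same line; the iterative one needs a suitable learning rate and enough steps to get there.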
Here we use the scikit-learn library. Its LinearRegression estimator solves the ordinary least squares problem directly (the statistical approach) rather than through iterative updates.
The Dataset
Each dataset consists of two columns, X and Y.
Where
For the List Price vs. Best Price for a New GMC Pickup dataset
X = List price (in $1000) for a GMC pickup truck
Y = Best price (in $1000) for a GMC pickup truck
The data is taken from Consumer’s Digest.
For the Fire and Theft in Chicago dataset
X = fires per 100 housing units
Y = thefts per 1000 population within the same Zip code in the Chicago metro area
The data is taken from the U.S. Commission on Civil Rights.
For the Auto Insurance in Sweden dataset
X = number of claims
Y = total payment for all the claims in thousands of Swedish Kronor
The data is taken from the Swedish Committee on Analysis of Risk Premium in Motor Insurance.
For the Gray Kangaroos dataset
X = nasal length (mm × 10)
Y = nasal width (mm × 10)
for a male gray kangaroo from a random sample of such animals
The data is taken from the Australian Journal of Zoology, Vol. 28, p607–613.
The Code
The code is written in three phases:
- Data preprocessing phase
- Training
- Prediction and plotting
The data preprocessing phase
Imports
NumPy is imported for array processing. Python's built-in lists are general-purpose containers rather than numerical arrays; the NumPy library adds fast n-dimensional arrays to Python.
pandas is a Python library for working with tables. Imported data is usually tabular, so pandas is imported to make the tables easy to manipulate.
Matplotlib is a Python library for plotting graphs. We use it to visualize the results.
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Reading the dataset from data
This line of code uses the read_csv function of the pandas library to import the dataset from the data folder and store it in the dataset variable.
# Reading the dataset from data
dataset = pd.read_csv(r'..\data\prices.csv')
On viewing the dataset, it contains two columns, X and Y:
X is the independent variable
Y is the dependent variable
Inference
An x-value of 7.6 corresponds to a y-value of 157
An x-value of 7.1 corresponds to a y-value of 174
And so on
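To get a feel for what read_csv returns, here is a small sketch that reads CSV text from memory instead of a file. The io.StringIO stand-in is an assumption made to keep the example self-contained, and it holds only the two rows quoted above, not a full dataset.

```python
import io
import pandas as pd

# Stand-in for the CSV file: a header plus the two rows quoted in the text
csv_text = "X,Y\n7.6,157\n7.1,174\n"

dataset = pd.read_csv(io.StringIO(csv_text))
print(dataset.head())         # first rows of the table
print(list(dataset.columns))  # column names
print(dataset.shape)          # (number of rows, number of columns)
```

With a real file, the only change would be passing the file path to read_csv instead of the in-memory text.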
Creating Dependent and Independent variables
The X column of the dataset is extracted into a variable X as a NumPy array, and the Y column likewise into y.
X is the independent variable
y is the dependent variable
# Creating Dependent and Independent variables
X = dataset['X'].values
y = dataset['Y'].values
Selecting a column such as dataset['X'] on its own returns a pandas Series object.
So, the values attribute is used to obtain a NumPy array.
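The difference between the Series and the array can be checked directly. This is a small sketch on a hand-built table that contains only the two rows quoted earlier; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

dataset = pd.DataFrame({"X": [7.6, 7.1], "Y": [157, 174]})

column = dataset["X"]          # a pandas Series (values plus an index)
values = dataset["X"].values   # the underlying NumPy array

print(type(column).__name__)
print(type(values).__name__)
print(values.shape)            # one-dimensional at this point
```

The pandas documentation recommends dataset["X"].to_numpy() as the newer spelling of the same conversion.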
Visualizing the data
This step simply shows what the dataset looks like.
# Visualizing the data
title='Linear Regression on <Dataset>'
x_axis_label = 'X-value < The corresponding attribute of X in dataset >'
y_axis_label = 'y-value < The corresponding attribute of Y in dataset >'
plt.scatter(X,y)
plt.title(title)
plt.xlabel(x_axis_label)
plt.ylabel(y_axis_label)
plt.show()
On visualization, the data appears as a scatter plot.
Each point on the plot is a data point; for the GMC pickup dataset it shows the list price on the x-axis and the best price on the y-axis.
The X and Y attributes vary with the dataset.
Splitting the data into training set and test set
We split the whole dataset into a training set and a test set. The training set is used for fitting the line to the data, and the test set is used to check how good the line is for the data.
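Conceptually, the split amounts to shuffling the row indices and slicing them into two parts. The sketch below is a NumPy-only illustration of that idea, not scikit-learn's implementation; the helper name simple_train_test_split, the 20% test fraction and the seed are all assumptions.

```python
import numpy as np

def simple_train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle row indices, then slice them into a test part and a training part (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (made up for illustration)
X = np.arange(10, dtype=float)
y = 2.0 * X + 1.0

X_train, X_test, y_train, y_test = simple_train_test_split(X, y)
print(len(X_train), len(X_test))
```

Each x-value stays paired with its y-value because the same indices are used to slice both arrays.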
# Splitting the data into training set and test set
from sklearn.model_selection import train_test_split
# train_test_split returns the training parts before the test parts; 20% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Reshaping the NumPy arrays since the scikit-learn model expects 2-D arrays
The scikit-learn model used later expects a 2-D array of shape (length, 1).
# Reshaping the NumPy arrays since the scikit-learn model expects 2-D arrays
X_train = X_train.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
This code simply converts each one-dimensional array into a 2-D array in which every element is itself a one-element array.
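The effect of the reshape is easiest to see on a tiny array. A minimal sketch with made-up numbers:

```python
import numpy as np

flat = np.array([1.0, 2.0, 3.0])   # one-dimensional, shape (3,)
column = flat.reshape(-1, 1)        # -1 lets NumPy infer the number of rows

print(flat.shape)
print(column.shape)
print(column)
```

The values are unchanged; only the shape differs, with each value now sitting in its own row of a single-column array.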
The Training phase
Importing the linear model from sklearn framework
LinearRegression is imported from the scikit-learn library, and lr is an instance of LinearRegression.
Training happens in the fit method: the independent and dependent variables are passed to fit, which fits a line to the data provided.
# Importing the linear model from sklearn framework
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X=X_train, y=y_train)
The Prediction phase
Predicting the Results
Using the trained linear regression model, we predict the values for the test data. The y_pred variable contains the predicted y-values for the test x-values.
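For intuition, a fitted single-variable model predicts by evaluating m*x + c. The sketch below checks this on made-up data, using the coef_ and intercept_ attributes that scikit-learn's LinearRegression exposes after fitting; the names ending in _demo and the numbers are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on y = 2x + 1 (made up for illustration)
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([3.0, 5.0, 7.0, 9.0])

lr_demo = LinearRegression().fit(X_demo, y_demo)
m = lr_demo.coef_[0]      # fitted slope
c = lr_demo.intercept_    # fitted intercept

X_new = np.array([[5.0], [6.0]])
by_hand = m * X_new[:, 0] + c        # evaluate the line ourselves
by_model = lr_demo.predict(X_new)    # let the model do it
print(by_hand, by_model)
```

Both computations should agree, which shows that predict is doing nothing more mysterious than evaluating the fitted line.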
# Predicting the Results
y_pred = lr.predict(X_test)
Visualizing the Results
Having predicted the y-values for a set of x-values, we visualize the results to check how well the line fits.
In the plot, the red points are the actual values of the data points, and the blue line shows the predictions.
# Visualizing the Results
plt.scatter(X_test,y_test,c='red')
plt.plot(X_test,y_pred)
plt.title(title)
plt.xlabel(x_axis_label)
plt.ylabel(y_axis_label)
plt.show()
Similarly,
Graph results for all the datasets
There are some outliers in the Fire and Theft graph. Because train_test_split shuffles the data randomly, the results may not be the same for every execution of the code.
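If repeatable results are wanted, the randomness of the split can be pinned down with the random_state parameter of train_test_split. A minimal sketch on made-up data; the value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration)
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10)

# The same random_state gives the same shuffle on every call
first = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
second = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```

With the split fixed, the fitted line and the plots stay the same from one run to the next.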
Link to Jupyter Notebook
Link to Code