A Noob’s Guide to Practical Machine Learning: Implementing the Linear Regression Algorithm

Rajesh Hadiya · Noob Devs · Jul 11, 2019


Hello folks, welcome to this new series of practical machine learning tutorials. In this series, I’ll cover both the theoretical and the practical side of machine learning. This is Part 1, in which we will understand and implement the linear regression algorithm, which is widely used in ML for regression tasks. So, let’s get started.

Machine Learning is one of today’s most discussed buzzwords. You can see applications of ML everywhere, including but not limited to Google Assistant, Apple’s Siri, and self-driving cars. It has enormous potential to benefit us in unimaginable ways, so many developers want to learn it. But out of fear of the mathematics used in ML, and for lack of proper knowledge of how to implement the algorithms in practice, they abandon it midway or never even start. Let me clear one thing up: you don’t need a Fields Medal (the mathematician’s equivalent of a Nobel Prize) to start learning ML, and you don’t need a gigantic GPU-powered system to implement common ML algorithms as a beginner. All you need is a basic understanding of mathematics and a PC to learn and implement ML algorithms.

In this tutorial, we will start with the linear regression algorithm. We will first understand the theory behind it and then implement it practically using Python.

What is Linear Regression?

According to Wikipedia, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). In other words, we have one or more independent variables and a dependent variable, and we have to find the relationship between them. For example, suppose we want to predict the weather in a certain area. We have independent variables like minimum temperature, maximum temperature, and humidity, and one dependent variable: the temperature value we want to find. We have to work out how all these independent variables affect the dependent one, so that we can predict the temperature.

Let us understand it from a mathematical perspective. Consider the image below for reference:

Figure 1

Here, we can see lots of points scattered on the plane. The main aim of linear regression is to draw a straight line that is as close as possible to all the points, i.e. with the least total error (the distance between the line and each point).

To achieve this, we need the equation of a straight line: y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope, and b is the intercept. You can see what the intercept and slope look like in Figure 1. We will use the same equation in our algorithm, as shown below:

Figure 2

How it works

For a basic understanding, suppose you want to figure out the maximum temperature based on the minimum temperature. Here, the minimum temperature is the independent variable and the maximum temperature is the dependent variable. Our equation would look like this:

Max_Temp = B0 + B1*Min_Temp

What linear regression does is compute this equation over many rows of the given dataset and figure out the ideal values of B0 and B1, which can then be used to derive Max_Temp for other values of Min_Temp.
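For intuition, here is a minimal sketch, using made-up temperature values rather than a real dataset, that estimates B0 and B1 with NumPy’s polyfit:

import numpy as np

# Hypothetical minimum/maximum temperature pairs (made-up values)
min_temp = np.array([2.0, 5.0, 8.0, 12.0, 15.0])
max_temp = np.array([9.0, 13.0, 15.0, 20.0, 24.0])

# A degree-1 polynomial fit is a straight line: Max_Temp = B0 + B1 * Min_Temp
# polyfit returns coefficients highest degree first, so the slope comes first
b1, b0 = np.polyfit(min_temp, max_temp, 1)
print("B0 (intercept) :", b0)
print("B1 (slope) :", b1)

# Use the fitted line to estimate Max_Temp for an unseen Min_Temp
print("Predicted Max_Temp for Min_Temp=10 :", b0 + b1 * 10)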

Now, let’s see what happens when there are multiple independent variables and one dependent variable. Our equation will look like this:

Figure 3: y = b0 + b1*x1 + b2*x2 + … + bn*xn

Here, similar to the example above, linear regression will try to find the ideal values of b0, b1, b2, …, bn. b0 is known as the intercept and b1, b2, …, bn are known as the coefficients.

Later, we will use the values of the intercept and coefficients to predict the dependent variable’s value for given values of the independent variables.
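For example, once the intercept and coefficients are known, making a prediction is just evaluating that equation. A minimal sketch with hypothetical numbers (these are not values from a trained model):

import numpy as np

# Hypothetical learned parameters: intercept b0 and coefficients b1..bn
b0 = 1.5
coefficients = np.array([0.4, -2.1, 0.03])

# One sample of independent variables x1..xn (made-up values)
x = np.array([3.0, 0.5, 12.0])

# y = b0 + b1*x1 + b2*x2 + ... + bn*xn
y = b0 + np.dot(coefficients, x)
print("Predicted value :", y)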

So far, we have seen the theory of how linear regression works. Now we will implement it in Python.

Implementation of Multivariate Linear Regression

In this tutorial, we are going to use JetBrains’ PyCharm IDE, but you are free to use whatever suits your needs. If you are using PyCharm, first set up your IDE as described here.

We will use this algorithm to predict the quality of wine based on various factors such as chlorides, alcohol, and pH. You can find the required dataset here.

Next, we need to import a bunch of packages to perform various mathematical operations and to implement the actual algorithm.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plot
import seaborn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

Quick note: the pandas and numpy libraries are used to perform various mathematical operations on the dataset, matplotlib is used to visualize (plot) our data, seaborn is used to visualize statistical data, and sklearn (scikit-learn) provides the actual algorithm and the train/test split.

Next, import the dataset using the line of code below. Here, we use pandas’ read_csv() function, which reads Comma Separated Value (CSV) data from a file and assigns it to the dataset variable.

dataset = pd.read_csv("D://Rajesh/ML/Dataset/wine_quality.csv")

In ML, data exploration is key. First we need to analyze the data manually to understand what it contains and the relationships between its columns.

print(dataset.shape)
print(dataset.head())

Here, the shape attribute gives the (rows, columns) of the dataset, and head() prints its first five rows, as shown below:

Now, we will check whether any column contains null data. If there is such a column, we will fill it using the fillna() method.

print(dataset.isnull().any())
dataset = dataset.fillna(method='ffill')

Next, we will visualize the range of values in the quality column, since quality is what we are going to predict.

plot.figure(figsize=(15,10))
plot.tight_layout()
seaborn.distplot(dataset['quality'])
plot.show()

It will output the graph shown below. We can see that the quality value mostly lies between 5 and 6.

Graph of quality attribute

In ML, there are two terms you should know: features and labels. To put it simply, features are the independent variables, and labels are the dependent variables that we want to predict. So now we will separate the features and labels into two variables.

X = dataset[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']].values
y = dataset['quality'].values

Since we are using the same dataset for both training and testing the algorithm, we will split it into 80% training data and 20% testing data with the line of code below.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

We have used the train_test_split() function to split our data into training and testing sets. We will train the algorithm on the training data and later compare its predictions against the testing data.

Now, we will create an object of the LinearRegression class and call its fit() method to train the algorithm on the training data. This process is known as training a machine learning model. So simple, isn’t it?!

regressor = LinearRegression()
regressor.fit(X_train, y_train)

Remember the intercept and coefficients we discussed earlier? We will print them to see what values our algorithm has derived.

print('Intercept is : ', regressor.intercept_)
column_names = dataset.columns.values[:-1]
coef_df = pd.DataFrame(regressor.coef_, column_names, columns=['Coefficient'])
print(coef_df)

Here, we have created a pandas DataFrame to print the coefficients in a more readable manner. It will show output like this:

We can see that the coefficient of chlorides is negative. It means that if we increase chlorides by one unit, quality decreases by about 1.873407. Similarly, if we increase residual sugar by one unit, quality increases by about 0.027870.
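As a quick sanity check, you can verify that predict() is just the equation from Figure 3 applied to the features. This sketch reuses the regressor and X_test objects from the code above:

# Recompute the first test prediction by hand: intercept + sum(coef * feature)
manual = regressor.intercept_ + np.dot(regressor.coef_, X_test[0])
print("Manual prediction :", manual)
print("predict() output :", regressor.predict(X_test[:1])[0])  # should match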

Finally, we will predict the value of quality based on the test data.

y_predict = regressor.predict(X_test)

It’s time to compare our predicted data against the actual test data to check whether the algorithm predicted correct values. We will print the top 25 values of the predicted data and the actual test data.

df = pd.DataFrame({'Actual': y_test, 'Predicted':y_predict})
df1 = df.head(25)
print(df1)

It will print output like this:

From the above data, we can see that our algorithm predicted values quite close to the actual data. The values are not exactly the same, but they are satisfactory. We can also plot these values to visualize them.

df1.plot(kind='bar',figsize=(15,10))
plot.grid(which='major', linewidth=0.2, linestyle='-', color='green')
plot.show()

It will plot the graph as shown below:

Actual vs Predicted value of quality

Now, we will evaluate the performance of the algorithm using three error metrics.

Mean Absolute Error (MAE) is the mean of the absolute values of the errors. With yᵢ the actual value, ŷᵢ the predicted value, and n the number of samples, it is calculated as:

MAE = (1/n) × Σ |yᵢ − ŷᵢ|

Mean Squared Error (MSE) is the mean of the squared errors and is calculated as:

MSE = (1/n) × Σ (yᵢ − ŷᵢ)²

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

RMSE = √((1/n) × Σ (yᵢ − ŷᵢ)²)

print("Mean Absolute Error : ",metrics.mean_absolute_error(y_test,y_predict))
print("Mean Squared Error : ",metrics.mean_squared_error(y_test,y_predict))
print("Root Mean Squared Error : ",np.sqrt(metrics.mean_absolute_error(y_test,y_predict)))

It will print the following output:

Mean Absolute Error : 0.4696330928661105
Mean Squared Error : 0.384471197820124
Root Mean Squared Error : 0.6200574

From the above values, we can say that our algorithm is not extremely accurate, but it still produces satisfactory predictions.
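One informal way to put these numbers in context is to compare the RMSE with the mean of the quality values; as a rough rule of thumb (not a formal test), an RMSE around 10% of the mean is often considered acceptable for a first model:

# Rough sanity check: RMSE relative to the average quality score
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_predict))
print("Mean quality :", np.mean(y_test))
print("RMSE / mean :", rmse / np.mean(y_test))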

Implementation of Linear Regression with categorical values

Suppose you are working at an insurance company and you want to predict the expenses of a person of age x, living in region y, with BMI (body mass index) z. You already have a dataset like the one below:

Now, if you apply the same approach as above, you will get errors. Why? Because the dataset contains some values that are non-numerical.

Linear regression is very good at performing mathematical operations on numerical values, but it fails when you supply non-numeric values such as strings. These values are known as categorical values. So, how do we deal with such cases? We have to convert all categorical values into numeric ones before feeding them to our algorithm. Let’s implement it practically. You can download the required dataset here.

dataset = pd.read_csv('D://Rajesh/ML/Dataset/insurance.csv')

We will now plot the data to analyze which region has the most people in the dataset. This kind of exploration, analyzing a dataset to summarize its main characteristics, is known as Exploratory Data Analysis (EDA).

plot.figure(figsize=(15,10))
plot.tight_layout()
region_count = dataset['region'].value_counts()
seaborn.barplot(x=region_count.index, y=region_count.values, alpha=0.9)
plot.show()

Here, we create a figure 15 inches wide and 10 inches tall to plot our data. Then we count the occurrences of each distinct value in the region column and plot them, calling matplotlib’s show() method to display the result.

Exploratory Data Analysis

Now, we need to convert these categorical values into numeric values. How do we do that? There are multiple methods available, such as integer encoding and one-hot encoding; you can read about the various methods online. Here we are going to use one-hot encoding. So what is one-hot encoding?

Suppose we have a column named region with 4 values: northeast, northwest, southeast, southwest. There are 4 categories, so 4 binary variables are needed. A “1” is placed in the binary variable for the region under consideration and “0” in the variables for the other regions. After applying one-hot encoding, it looks like this:

region      region_northeast   region_northwest   region_southeast   region_southwest
northeast           1                  0                  0                  0
northwest           0                  1                  0                  0
southeast           0                  0                  1                  0
southwest           0                  0                  0                  1
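You can see this expansion concretely with a tiny sketch on a made-up single-column DataFrame (the values here are illustrative only):

import pandas as pd

# A tiny, made-up DataFrame with one categorical column
demo = pd.DataFrame({'region': ['northeast', 'northwest', 'southeast', 'southwest']})

# Full one-hot encoding: one binary column per category
print(pd.get_dummies(demo, columns=['region']))

# With drop_first=True, the first category becomes the reference
# and only k-1 dummy columns are kept
print(pd.get_dummies(demo, columns=['region'], drop_first=True))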

These variables are also called dummy variables. pandas provides a built-in function, get_dummies(), to convert categorical data into dummy variables. We will create a DataFrame that contains the original columns plus the dummy variables:

dummy_cat_df = pd.get_dummies(dataset, columns=['sex', 'smoker', 'region'], drop_first=True)

Here, we specify the names of the columns that contain categorical data. drop_first=True tells pandas to drop the first dummy column for each categorical column. So if a categorical column has k distinct values, get_dummies() drops the first of the newly created columns and keeps k−1 of them in the dataset, as shown below.

If you omit drop_first=True, it will also keep the sex_female, smoker_no, and region_northeast columns. Since those serve as reference categories, there is no need to keep them.

Now, we will separate features and labels as shown below:

X = dummy_cat_df.drop('expenses', axis=1).values
y = dummy_cat_df['expenses'].values

Here, variable X contains all columns except expenses; axis=1 indicates that drop() removes a column rather than a row. Similarly, variable y contains only the values of the expenses column.

Now, we will again split the data into 80% training data and 20% testing data, initialize the linear regression algorithm, and train it on the training data.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

You can print the intercept and coefficients as before, and call the predict() method to make predictions on the test data.

print('Intercept is : ', regressor.intercept_)
column_names = dummy_cat_df.drop('expenses', axis=1).columns.values

coef_df = pd.DataFrame(regressor.coef_, column_names, columns=['Coefficient'])
print(coef_df)

y_predict = regressor.predict(X_test)

df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_predict.flatten()})
print(df.head(25))

Now, we will plot the actual vs predicted values to analyze what our algorithm has predicted.

df1 = df.head(25)
df1.plot(kind='bar', figsize=(15, 10))
plot.title('Actual vs Predicted Value of Expense')
plot.grid(which='major', linestyle='-', linewidth=0.2, color='green')
plot.show()

From the above graph, we can see that our algorithm has predicted reasonably good values for the test data. It misses the actual value by a fair margin in some cases, but overall its performance is acceptable.
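As with the wine model, you can quantify this with the same error metrics, reusing the metrics module imported earlier:

print("Mean Absolute Error : ", metrics.mean_absolute_error(y_test, y_predict))
print("Mean Squared Error : ", metrics.mean_squared_error(y_test, y_predict))
print("Root Mean Squared Error : ", np.sqrt(metrics.mean_squared_error(y_test, y_predict)))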

So folks, that was all about linear regression. I hope you will now be able to implement it yourself.

Final Conclusion

Linear regression performs very well at predicting values that have a linear relationship with the independent variables. However, most real-world phenomena aren’t linear in nature, so we need other algorithms to solve real-world problems. We will cover them in the next part of this series.

For the complete code sample, you can check out my GitHub repository here.
