Supervised Learning Methods using Python

Himanshu Singh
8 min read · Jun 7, 2018


Header image source: http://bigdata-madesimple.com

Machine Learning can be classified into three types:

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement Learning

Supervised learning is the area of Machine Learning where we have a set of independent variables that help us analyse a dependent variable and the relationship between them. Whatever we want to predict is called the dependent variable, while the variables we use to predict it are called independent variables. Suppose we want to predict a person's age based on their height and weight: height and weight are the independent variables, while age is the dependent variable. We will work through this idea in detail and apply several algorithms to compare their accuracy.

Unsupervised learning is the area where we don't have a dependent variable. We just have a collection of variables, and we try to find similarities between observations and group them into clusters. Reinforcement learning is the field where a machine performs actions and observes the results; based on the results it learns, and then repeats the process until it understands the entire relationship between action and result. Unsupervised and reinforcement learning are out of the scope of this post.

In this post we are going to talk about supervised learning methods and their application in Python.

Supervised learning is often used for classification problems, where we predict which category an observation belongs to. In this post we will explore the Titanic data set from Kaggle. It gives us 11 variables from which we have to predict whether a passenger survived the disaster or not. Let's explore the data set before applying the different algorithms for the prediction.

import pandas as pd

# Load the Kaggle Titanic training set and preview the first five rows
titanic = pd.read_csv("train.csv")
titanic.head()

Pandas is used for table manipulation. Using the pandas package we load the Titanic training set (the train.csv file from Kaggle), and then with the head() function we look at the first five rows. The output looks like this:

You can see that we have 12 variables: 11 independent and 1 dependent (the Survived column).
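Before cleaning anything, it can also help to survey the column types and missing values. The snippet below is an optional quick check, not part of the original walkthrough, using standard pandas calls:

titanic.info()           # column dtypes and non-null counts
titanic.isnull().sum()   # null values per column
titanic.describe()       # summary statistics for the numeric columns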

Now that we know about the dataset, let’s first talk about the Supervised algorithms that we are going to apply on the above dataset:

  1. Logistic Regression
  2. K-NN Algorithm
  3. Naive Bayes
  4. Linear Support Vector Machines
  5. Non-Linear Support Vector Machines
  6. Decision Trees
  7. Random Forest

In this post I am not going to explain all the algorithms; I will focus on applying them to the above data set. But for those who want the explanations as well, I have given links to good sites that explain each of the above algorithms.

So let's start the process. The whole exercise can be broken down into the following parts:

  1. Data Pre-processing & Cleansing
  2. Splitting Data into Training and Test Set
  3. Applying all the above algorithms
  4. Comparing the accuracy scores

Data Preprocessing

We will start with the data preprocessing of the Titanic data set. The preprocessing steps are given below:

  1. Divide the data into two Data Frames: Categorical & Numerical
  2. Categorical will contain all the columns from the data set that hold categories; Numerical will contain all the columns that hold numbers
  3. We will drop a column from either Data Frame if:
    a. the column is not important, or
    b. the column contains more than 80% null values
  4. Next we will take care of the remaining null values. If a column has fewer than 80% null values, we will replace them with the most frequent category (for categorical columns) or with the mean of the column (for numerical columns)
  5. If a column has around 60–80% null values, and we feel that the column is important, we can create a new category to replace all the null values
  6. For the numerical columns, we will draw box plots to check for outliers, and replace any outliers we find (a minimal sketch of this step appears after the numerical cleanup below)

Let's apply all of the above approaches to the data set.
import numpy as np

titanic_cat = titanic.select_dtypes(object)     # columns holding categories
titanic_num = titanic.select_dtypes(np.number)  # columns holding numbers

The above lines create two Data Frames, one containing the categories and one containing the numbers. Their head() outputs are given below:

titanic_cat.head()
titanic_num.head()

Let's first look at the Categorical Data Frame.

We can drop the Name and Ticket columns, as they are not important for building our model.

titanic_cat.drop(['Name','Ticket'], axis=1, inplace=True)

Now our data frame looks like this:

Next step is to look at the Null Values present in the above columns:

titanic_cat.isnull().sum()

The above line gives us the following output:

So we have 687 null values in the Cabin column and 2 in the Embarked column. We have 891 rows in total, and 687 of them have a null Cabin, which works out to around 77%, close to our 80% threshold. We could either drop the column or replace the nulls with the most frequent category. For now, let's do the replacement.

# Replace the nulls with the most frequent category in each column
titanic_cat.Cabin.fillna(titanic_cat.Cabin.value_counts().idxmax(), inplace=True)
titanic_cat.Embarked.fillna(titanic_cat.Embarked.value_counts().idxmax(), inplace=True)

The above two lines replace all the null values with the most frequent category in each column. If we check for null values again, we get the following result:

We have successfully removed all the null values. Now our data set looks like this:

titanic_cat.head(20)

Now, the next step is to replace all the categories with numerical labels, otherwise we will not be able to apply our algorithms to them. For that we will use the LabelEncoder class from scikit-learn.

from sklearn.preprocessing import LabelEncoder

# apply() runs fit_transform on each column, turning every
# category into an integer label
le = LabelEncoder()
titanic_cat = titanic_cat.apply(le.fit_transform)

The above lines transform all the categories into numbers. We can see the data set:

titanic_cat.head()
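One caveat about the approach above: because a single LabelEncoder instance is reused through apply(), only the last column's mapping survives in le.classes_. If you later need to map the numbers back to the original categories, a common alternative, sketched below with an illustrative encoders dict, is to fit one encoder per column:

# Alternative to the apply() call above: keep one fitted encoder
# per column so the encodings can be inverted later
encoders = {}
for col in titanic_cat.columns:
    enc = LabelEncoder()
    titanic_cat[col] = enc.fit_transform(titanic_cat[col])
    encoders[col] = enc

# e.g. encoders['Sex'].inverse_transform([0, 1]) recovers the labels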

Now we are done processing the Categorical Data Frame. Next we need to work on the Numerical Data Frame.

titanic_num.isna().sum()

The above line gives us the following output:

Only one column, Age, contains null values. Let's replace them with the mean.

# Replace the null ages with the column mean, then re-check for nulls
titanic_num.Age.fillna(titanic_num.Age.mean(), inplace=True)
titanic_num.isna().sum()

The above lines give us the output below:

Now that we have removed all the null values, we will drop the unnecessary columns. PassengerId is just a row identifier with no predictive value, so we will drop it.

titanic_num.drop(['PassengerId'], axis=1, inplace=True)
titanic_num.head()
Output:
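Step 6 of our preprocessing plan was to check the numerical columns for outliers with box plots. That step is not shown on screen here, but a minimal sketch, using the common 1.5 * IQR rule and capping values rather than deleting rows, could look like this (the choice of columns is illustrative):

import matplotlib.pyplot as plt

# Box plots to eyeball outliers in the numerical columns
titanic_num.boxplot()
plt.show()

# Cap values outside the 1.5 * IQR whiskers at the boundary
for col in ['Age', 'Fare']:
    q1, q3 = titanic_num[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    titanic_num[col] = titanic_num[col].clip(lower, upper)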

Now that we are done with the data preprocessing steps, we will combine the two Data Frames back into one.

titanic_final = pd.concat([titanic_cat,titanic_num],axis=1)
titanic_final.head()

Output:

Let's start dividing the data set into a training set and a test set.

Splitting Data

Let's define our dependent and independent variables. The dependent variable will be Survived, because we want to predict whether a person is going to survive or not; the independent variables will be the remaining columns. Given below is the code for the partition.

X = titanic_final.drop(['Survived'], axis=1)  # independent variables
Y = titanic_final['Survived']                 # dependent variable

Now we will take 80% of the data as our training set, and the remaining 20% as our test set.

# Simple sequential 80/20 split on row position
X_train = np.array(X[0:int(0.80*len(X))])
Y_train = np.array(Y[0:int(0.80*len(Y))])
X_test = np.array(X[int(0.80*len(X)):])
Y_test = np.array(Y[int(0.80*len(Y)):])
len(X_train), len(Y_train), len(X_test), len(Y_test)
Output:
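A sequential slice like this keeps the original row order, which can bias the split if the file is sorted in any way. An optional alternative is scikit-learn's train_test_split, which shuffles the rows first; the random_state below is an arbitrary value chosen for reproducibility:

from sklearn.model_selection import train_test_split

# Shuffled 80/20 split; stratify keeps the survival ratio similar
# in the training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=42, stratify=Y)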

Now let's start applying our supervised learning algorithms.

Algorithms Applications

All the machine learning models we need live in the scikit-learn package (sklearn). We will apply each of the models listed above to the processed Titanic data set and compare their accuracy scores. First, let us import all the algorithms.

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

Now let's initialize them in their respective variables:

LR = LogisticRegression()
KNN = KNeighborsClassifier()
NB = GaussianNB()
LSVM = LinearSVC()
NLSVM = SVC(kernel='rbf')
DT = DecisionTreeClassifier()
RF = RandomForestClassifier()

Now that we have initialized all our algorithms, the next step is to train each model on our training set:

LR_fit = LR.fit(X_train, Y_train)
KNN_fit = KNN.fit(X_train, Y_train)
NB_fit = NB.fit(X_train, Y_train)
LSVM_fit = LSVM.fit(X_train, Y_train)
NLSVM_fit = NLSVM.fit(X_train, Y_train)
DT_fit = DT.fit(X_train, Y_train)
RF_fit = RF.fit(X_train, Y_train)

Our trained models are now saved in the variables above. We just need to predict on the test set and compare the accuracy scores. Let's see how it's done:

LR_pred = LR_fit.predict(X_test)
KNN_pred = KNN_fit.predict(X_test)
NB_pred = NB_fit.predict(X_test)
LSVM_pred = LSVM_fit.predict(X_test)
NLSVM_pred = NLSVM_fit.predict(X_test)
DT_pred = DT_fit.predict(X_test)
RF_pred = RF_fit.predict(X_test)

Accuracy Score

from sklearn.metrics import accuracy_score

print("Logistic Regression is %f percent accurate" % (accuracy_score(LR_pred, Y_test)*100))
print("KNN is %f percent accurate" % (accuracy_score(KNN_pred, Y_test)*100))
print("Naive Bayes is %f percent accurate" % (accuracy_score(NB_pred, Y_test)*100))
print("Linear SVMs is %f percent accurate" % (accuracy_score(LSVM_pred, Y_test)*100))
print("Non Linear SVMs is %f percent accurate" % (accuracy_score(NLSVM_pred, Y_test)*100))
print("Decision Trees is %f percent accurate" % (accuracy_score(DT_pred, Y_test)*100))
print("Random Forests is %f percent accurate" % (accuracy_score(RF_pred, Y_test)*100))

Output:

Results from all algorithms
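As an aside, the seven fit/predict/print blocks above can be collapsed into a single loop. This is purely a stylistic sketch, equivalent to the code already shown:

# Optional refactor: iterate over the models in a dict
models = {
    "Logistic Regression": LR, "KNN": KNN, "Naive Bayes": NB,
    "Linear SVM": LSVM, "Non-Linear SVM": NLSVM,
    "Decision Tree": DT, "Random Forest": RF,
}
for name, model in models.items():
    pred = model.fit(X_train, Y_train).predict(X_test)
    print("%s is %f percent accurate" % (name, accuracy_score(pred, Y_test)*100))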

We can see from the above results that Random Forest gives the best accuracy. It should be mentioned that I have not done any kind of model improvement to increase the accuracy; I have applied all the algorithms directly to the Titanic data set. In my next post I will apply model improvement techniques, and then we'll look at the accuracy outputs again.

Himanshu Singh

ML Consultant, Researcher, Founder, Author, Trainer, Speaker, Story-teller Connect with me on LinkedIn: https://www.linkedin.com/in/himanshu-singh-2264a350/