Beginner's Guide to Classification in Machine Learning

Ahaan Goswami
Published in Analytics Vidhya
4 min read · Oct 13, 2019

Classification falls under supervised learning. It identifies the class to which each data element belongs and is best used when the output takes a finite set of discrete values. In this article, I’m going to compare some popular classification models: CART, Perceptron, Logistic Regression, Neural Networks, and Random Forest.

Dataset

For simplicity, I’ve used a small fertility dataset that contains over 100 instances and 9 features:

  • Season in which the analysis was performed
  • Age
  • Childhood diseases
  • Accident or serious trauma
  • Surgical intervention
  • High fevers in the last year
  • Alcohol consumption
  • Smoking habit
  • Hours spent sitting per day

The dataset used can be found here.

Before loading the data, we will need to import these libraries:

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

After this, we can read the data by running:

path= '<path-to-file>'
data = pd.read_csv(path)

Preprocessing

In the real world, you will almost always have to preprocess and normalize your data. However, the features in our dataset are already encoded as numbers and scaled. For example, the seasons winter, spring, summer, and fall are represented as -1, -0.33, 0.33, and 1. The only part that needs preprocessing is the last column, i.e. the output: ‘N’ needs to be converted to 1 and ‘O’ to 0. This can be done by running this command:

# Map the string labels to integers: 'N' -> 1, 'O' -> 0
data['Output'] = data['Output'].map({'N': 1, 'O': 0})
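The season values quoted above look like four categories spread evenly over the range [-1, 1]. As a minimal sketch, assuming that is how the encoding was produced (the dataset documentation is the authority here, and the variable names are purely illustrative), the same numbers can be reproduced with NumPy:

```python
import numpy as np

# Assumption: the four seasons are placed at evenly spaced points in [-1, 1].
seasons = ["winter", "spring", "summer", "fall"]
codes = np.round(np.linspace(-1, 1, len(seasons)), 2)

# Pair each season with its numeric code.
season_to_code = dict(zip(seasons, codes.tolist()))
print(season_to_code)
```

Encoding an ordered category this way keeps every feature on a comparable scale, which matters for the linear models and the neural network used later.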

Next, we need to separate the features (X) from the target (Y). The ‘Output’ column is our target Y, and the remaining columns make up the feature matrix X. After this, the data is divided into training and testing sets. The most common ratio is 70:30. Here, X_train and Y_train will contain 70% of the dataset, and X_test and Y_test will contain the remaining 30%.

Y = data['Output']
X = data.iloc[:,:-1]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
    test_size=0.3)  # a float is a fraction of the rows; an int would be a row count
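Because the split is random, the accuracies reported below will vary from run to run. The following sketch uses a synthetic stand-in for the data (the array names, the seed, and the use of random_state and stratify are my additions, not part of the original code) to show how a 70:30 split can be made reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 rows, 9 features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(100, 9))
y_demo = rng.integers(0, 2, size=100)

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle,
# and stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
```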

Analysis of Different Models

Import these libraries:

from sklearn.linear_model import Perceptron
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

Perceptron

A perceptron is a single-layer neural network. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.

ppn = Perceptron(max_iter=100, eta0=0.5)
ppn.fit(X_train, Y_train)
y_pred = ppn.predict(X_test)
accuracy = accuracy_score(Y_test, y_pred)
print('Accuracy: %.2f' % (accuracy * 100))

Here, max_iter is the maximum number of passes (epochs) over the training data and eta0 is the constant by which the updates are multiplied, i.e. the learning rate. The accuracy of this model was 83.33%.
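To make the role of eta0 concrete, here is a minimal sketch of the textbook perceptron update for a single sample. It is an illustration only, not scikit-learn's internal implementation, and the function name perceptron_step is hypothetical:

```python
import numpy as np

def perceptron_step(w, b, x, y, eta0=0.5):
    """Apply one textbook perceptron update for a sample x with label y (0 or 1)."""
    y_hat = 1 if np.dot(w, x) + b > 0 else 0  # linear predictor + threshold
    error = y - y_hat                         # 0 when the prediction is correct
    # The update is scaled by eta0; nothing changes when error is 0.
    return w + eta0 * error * x, b + eta0 * error

# Start from zero weights and show one misclassified positive sample.
w, b = np.zeros(2), 0.0
w, b = perceptron_step(w, b, np.array([1.0, 2.0]), 1)
print(w, b)
```

A larger eta0 moves the weights further on every mistake, while a correct prediction leaves them untouched.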

Note: After applying L2 and elasticnet regularization, the result remained the same, whereas with L1 regularization the accuracy dropped to 73.33%. You can check this yourself by adding one more parameter to the first line of the code: penalty='l1', penalty='l2', or penalty='elasticnet'.
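The comparison in the note can be scripted as a loop over the penalty options. This is a sketch on synthetic data from make_classification (a stand-in, not the fertility dataset), so the numbers it prints will not match the ones quoted above; the fixed random_state values are also my assumption:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same shape as the article's dataset.
X_demo, y_demo = make_classification(n_samples=100, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Train one perceptron per penalty setting and record its test accuracy.
results = {}
for penalty in (None, "l1", "l2", "elasticnet"):
    ppn = Perceptron(max_iter=100, eta0=0.5, penalty=penalty, random_state=0)
    ppn.fit(X_tr, y_tr)
    results[penalty] = accuracy_score(y_te, ppn.predict(X_te))

for penalty, acc in results.items():
    print(f"penalty={penalty}: {acc * 100:.2f}")
```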

Logistic Regression

Logistic regression is the go-to method for binary classification problems (problems with two output values). It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

lg_reg = LogisticRegression()
lg_reg.fit(X_train, Y_train)
y_pred = lg_reg.predict(X_test)
accuracy = accuracy_score(Y_test, y_pred)
print('Accuracy: %.2f' % (accuracy * 100))

We don't need any additional parameters for this model. The first two lines of the code create the logistic regression model and train it on the training data. The next line predicts the output for X_test. The accuracy of this model was 86.67%.

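Note: L1 and L2 regularization had no effect on the accuracy of the model. Elasticnet regularization could not be applied with these settings: in scikit-learn, LogisticRegression supports penalty='elasticnet' only with solver='saga' and an explicit l1_ratio, so the limitation comes from the solver rather than from the size of the dataset.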
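For readers who want to try elasticnet with logistic regression, here is a minimal sketch on synthetic stand-in data. It assumes a scikit-learn release that accepts the penalty argument together with solver='saga'; the l1_ratio of 0.5 and the raised max_iter are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the fertility dataset.
X_demo, y_demo = make_classification(n_samples=100, n_features=9, random_state=0)

# Elastic net mixes L1 and L2; l1_ratio=0.5 weights them equally.
# 'saga' is the solver that supports this penalty.
enet = LogisticRegression(
    penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000
)
enet.fit(X_demo, y_demo)

print("Training accuracy: %.2f" % (enet.score(X_demo, y_demo) * 100))
```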

CART-Decision Tree

Decision trees are among the most popular tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label.

# Default tree (no limit on the number of leaves)
classifier = DecisionTreeClassifier()
classifier.fit(X_train, Y_train)
y_pred = classifier.predict(X_test)

# Tree limited to at most 60 leaf nodes
classifier = DecisionTreeClassifier(max_leaf_nodes=60)
classifier.fit(X_train, Y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(Y_test, y_pred)
print('Accuracy: %.2f' % (accuracy * 100))

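Here, max_leaf_nodes caps the number of leaves. When it is set, the tree is grown in best-first fashion, where the best nodes are those giving the largest relative reduction in impurity. The value 60 is arbitrary; leaving it as None (the default) places no limit on the number of leaves. The accuracy of this model was 76.67%.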
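The effect of max_leaf_nodes is easy to see by counting leaves. The sketch below uses synthetic stand-in data with some label noise (flip_y), and a cap of 5 chosen only to make the difference visible:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y adds label noise so the full tree grows larger.
X_demo, y_demo = make_classification(
    n_samples=100, n_features=9, flip_y=0.2, random_state=0
)

# An unrestricted tree versus one capped at 5 leaves.
full = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
capped = DecisionTreeClassifier(max_leaf_nodes=5, random_state=0).fit(X_demo, y_demo)

print("Leaves without a cap:", full.get_n_leaves())
print("Leaves with max_leaf_nodes=5:", capped.get_n_leaves())
```

With only around 100 rows, a cap as high as 60 may never be reached, in which case it behaves much like the default.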

Random Forest

Random Forests operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.

classifier = RandomForestClassifier(n_estimators=100, criterion='gini')
classifier.fit(X_train, Y_train)
y_pred = classifier.predict(X_test)

accuracy = accuracy_score(Y_test, y_pred)
print('Accuracy: %.2f' % (accuracy * 100))

n_estimators is the number of trees in the forest, and criterion is the function used to measure the quality of a split. It can be either the Gini index or entropy (both resulted in the same accuracy in this case). The accuracy of this model was 80%.
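The Gini versus entropy comparison can be repeated with a short loop. Again, this is a sketch on synthetic stand-in data with fixed seeds (my assumption), so the printed values are illustrative rather than a reproduction of the result above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the fertility dataset.
X_demo, y_demo = make_classification(n_samples=100, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit one forest per split criterion and record its test accuracy.
scores = {}
for criterion in ("gini", "entropy"):
    forest = RandomForestClassifier(
        n_estimators=100, criterion=criterion, random_state=0
    )
    forest.fit(X_tr, y_tr)
    scores[criterion] = accuracy_score(y_te, forest.predict(X_te))

for criterion, acc in scores.items():
    print(f"{criterion}: {acc * 100:.2f}")
```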

Neural Network

To create the neural network, we will use Keras with the TensorFlow backend. For this, we will need the following imports:

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. Neural networks can adapt to changing input, so the network generates the best possible result without needing to redesign the output criteria.

model = Sequential()
model.add(Dense(9, input_dim=9, activation='relu'))
model.add(Dense(7, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# One-hot encode the labels to match the 2-unit softmax output
y_test_cat = to_categorical(Y_test)
y_train_cat = to_categorical(Y_train)
model.fit(X_train, y_train_cat, epochs=100, batch_size=10)

Here, the first 9 is the number of units in the first hidden layer and input_dim=9 is the number of features in the dataset; 7 is the number of neurons in the second hidden layer, and 2 is the number of possible output classes. epochs is the number of full passes over the training data, and batch_size is the number of samples processed before each weight update.

_,accuracy=model.evaluate(X_test,y_test_cat)
print('Accuracy: %.2f'%(accuracy*100))

The accuracy of this model was 86.67%.
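The to_categorical calls above turn the 0/1 labels into one-hot rows, which is why the output layer has 2 units. The following NumPy sketch shows an equivalent transformation on a few made-up labels; it is an illustration of the idea, not the Keras implementation:

```python
import numpy as np

# Made-up integer class labels (0 or 1).
labels = np.array([1, 0, 1, 1])

# Row i of the 2x2 identity matrix is the one-hot vector for class i,
# so indexing with the labels yields one row per sample.
one_hot = np.eye(2)[labels]
print(one_hot)

# argmax over each row recovers the original integer label,
# which is how a softmax output is turned back into a class.
recovered = one_hot.argmax(axis=1)
print(recovered)
```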

Conclusion

After training and testing five different classification models on the dataset, Logistic Regression and the Neural Network had the highest accuracy (86.67%), followed by Perceptron (83.33%), Random Forest (80%), and CART Decision Tree (76.67%).
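When reading these numbers, keep the size of the test set in mind. As a quick sketch, assuming a 30-sample test set (which is what the reported percentages are consistent with), each accuracy corresponds to a whole number of correct predictions:

```python
# Assumption: the test set holds 30 samples.
n_test = 30

# Accuracy (in percent) for a given number of correct predictions.
accuracies = {
    correct: round(correct / n_test * 100, 2) for correct in (26, 25, 24, 23)
}

for correct, acc in accuracies.items():
    print(f"{correct}/{n_test} correct -> {acc}%")
```

Under that assumption, neighbouring models in the ranking differ by a single test sample, so the ordering could easily change with a different random split.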
