7 types of Multi*-Classification

Inside AI

Rupak (Bob) Roy - II
TheLeanProgrammer
29 min read · Jun 25, 2021


A full guide to multi-class classification with KNN, logistic regression, support vector machines, kernel SVM, naive Bayes, decision trees, random forests, deep learning, and even Grid Search.

As usual,

Hi! How are you doing? I hope it's great……

Today let's understand and perform all the major types of classification for a multi-class target variable.

Let’s get started. We will use a dataset that has 7 types/categories of glass. The dataset is available at UCI: https://archive.ics.uci.edu/ml/datasets/Glass+Identification

Number of Attributes: 10 (including an Id#) plus the class attribute

— all attributes are continuously valued

Let’s get started with our commonly used Classification method:

1.) Logistic Regression then we will use

2.) Knn

3.) Support Vector Machine

4.) Kernel SVM

5.) Naive Bayes

6.) Decision Tree Classification

7.) Random Forest Classification

Any other classifier you’d like to see? Let me know in the comments below.

Till here it’s the same as before: load the data, then split it into X and Y, where Y is the dependent/target variable in the 9th column (glass categories) and the rest, columns 0 to 8, are the independent variables X.

Note: in Python the index positions of columns start from 0, not from 1.
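Here is a minimal sketch of that step, assuming the UCI file has been saved locally as ‘glass.csv’ with the Id column already dropped (the file name and column handling are my assumptions):

```python
import pandas as pd

# Assumed local copy of the UCI glass data, Id column removed, header row present
dataset = pd.read_csv('glass.csv')
X = dataset.iloc[:, 0:9].values   # columns 0-8: the independent variables
y = dataset.iloc[:, 9].values     # column 9: the glass category (target)
```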

Then, before we split the data into train and test sets, we need to check for any class imbalance. If one of the categories is far smaller than the rest, it’s often better to remove that category, since there isn’t enough data to learn the cause-effect relationship from it. As a rule of thumb, I make sure each category has at least 5–10% of the total observations.
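A quick way to inspect the class balance could look like this (just a sketch):

```python
import pandas as pd

# Share of each glass category; very small shares may not be worth keeping
print(pd.Series(y).value_counts(normalize=True))
```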

After this step, we will transform all the columns (independent variables) onto one standard scale/range, which reduces the spread and magnitude of the data points without losing the original meaning of the data.

It also helps the algorithm compute the data faster and more efficiently.
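A typical split-and-scale step with scikit-learn might look like this (the test size and random seed are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training data only
X_test = sc.transform(X_test)        # apply the same scaling to the test data
```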

Now it's time to fit the data with logistic regression and predict on the test set.
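A minimal sketch of the fit-and-predict step:

```python
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=0)  # handles multi-class out of the box
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
```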

DONE… !!! super easy isn’t it ?

Let’s compare the predicted results with our original dataset.

Flatten/ravel helps to represent the data as a 1-dimensional array, like a list.
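One simple way to put actual and predicted values side by side (a sketch):

```python
import pandas as pd

comparison = pd.DataFrame({'Actual': y_test.flatten(),
                           'Predicted': y_pred.flatten()})
print(comparison.head(10))
```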

Actual Vs Predicted
Multi-Class Confusion Matrix

What we can understand from this confusion matrix: 11 data points that are actually class 0 were detected as class 0, and 3 data points that are actually class 0 were detected as class 1. The same goes for the 2nd row: 10 data points are actually class 1 but were detected as class 0, 12 data points are actually class 1 and were correctly detected as class 1, and so on.

Alright, another metric to evaluate the model performance is metrics.accuracy_score.
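Both evaluations together, as a sketch:

```python
from sklearn import metrics

cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred))
```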

Well, our classifier didn’t work well. No worries! We will try another, more powerful classifier and see if it improves, but before that let me put all the pieces together in case you wish to use it as a template.

Next is KNN.

What is KNN?

K-Nearest Neighbors (KNN) is one of the simplest algorithms used in Machine Learning for regression and classification. KNN algorithms classify new data points based on similarity measures (e.g. Euclidean distance function).

Classification is done by a majority vote to its neighbors (K).

K Nearest Neighbors plot

Let’s get started on how to apply KNN for Multi-Classification problems.

Till here it’s the same as before, load the data, define X and Y, split the data, and then scale the independent variables

NOW we will fit the KNN to our training dataset, with K nearest neighbors = 9, metric = ‘minkowski’ (a generalized distance metric), and p = 2, the power parameter for the Minkowski metric. When p = 1 this is equivalent to the Manhattan distance (l1), and p = 2 gives the Euclidean distance (l2).
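A sketch of the KNN fit with those settings:

```python
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=9, metric='minkowski', p=2)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
```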

That’s it!

Let’s check the model accuracy

Not much!!! Well, we definitely learned how to apply KNN to a multi-classification problem.

Here are all the pieces of KNN together

Next is SVM, another powerful classifier.

SUPPORT VECTOR MACHINE

What is SVM?

SVM is a supervised machine learning algorithm that can be used for classification or regression problems

In brief, the working principle of SVM is to find the hyper-plane that separates the classes with the largest distance to the nearest data points of either class. This distance is called the margin.

SVM is highly preferred by many as it produces significant accuracy with less computation power.

Let’s understand this with the help of an example.

Well, till here it’s the same as the others. First we import the data, define X & Y, split the data into train and test sets, and scale the independent variables to reduce the magnitude/spread of the data points without losing their original meaning.

It's time to fit the SVM to the training set.
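A minimal sketch of the linear SVM fit:

```python
from sklearn.svm import SVC

classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
```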

Okay, we did improve a bit. I believe the data is just too non-linearly separable. Let’s try an advanced version of SVM called kernel SVM.

What is Kernel SVM?

The complexity of a linear SVM grows with the size of the dataset. In simple words, the kernel SVM with ‘rbf’ transforms complex non-linear data into a higher-dimensional space where the classes become separable.

Kernel SVM helps to transform non-linear data into a high-dimensional space.
Converting to 3D space makes it possible to separate the data points.

Usually linear and polynomial kernels are less time-consuming but provide less accuracy than the rbf or Gaussian kernels.

So, the rule of thumb is: use linear SVMs (or logistic regression) for linear problems, and nonlinear kernels such as the Radial Basis Function kernel for non-linear problems.

Let’s compare the linear SVM with the radial basis (rbf) kernel SVM.

Well, till here it’s the same as everywhere else: load the data, define X and Y, split the data, and scale to a standard range to reduce the magnitude of the data without losing its original meaning.

Now we will fit the data in both Linear as well as Kernel ‘rbf’ SVM to compare both of them.
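A sketch of fitting both kernels so they can be compared (the variable names cm and cm1 mirror the captions below):

```python
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

svm_linear = SVC(kernel='linear', random_state=0).fit(X_train, y_train)
svm_rbf = SVC(kernel='rbf', random_state=0).fit(X_train, y_train)

cm = confusion_matrix(y_test, svm_linear.predict(X_test))   # linear SVM
cm1 = confusion_matrix(y_test, svm_rbf.predict(X_test))     # kernel 'rbf' SVM
print('Linear SVM accuracy:', accuracy_score(y_test, svm_linear.predict(X_test)))
print('Kernel rbf accuracy:', accuracy_score(y_test, svm_rbf.predict(X_test)))
```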

Confusion matrices: top (cm) | linear SVM, bottom (cm1) | kernel ‘rbf’ SVM

The confusion matrix shows that the kernel SVM performs better at identifying true positives and true negatives than the linear SVM.

Evaluation Metrics | First | Linear SVM, Second | Kernel ‘rbf’ SVM

The accuracy score of our kernel SVM model is better than that of the linear SVM.

Hence Kernel SVM performs better than Linear SVM.

Well, that’s not enough, we have a more powerful classifier.

Let me put all the codes together for Kernel SVM

Next, after SVM, we have Naïve Bayes.

What is Naive Bayes in short?

Naïve Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem.

Naive Bayes: P(c|x) = P(x|c) × P(c) / P(x)

P(c|x) is the posterior probability of class (target) given predictor (attribute).

  • P(c) is the prior probability of class.
  • P(x|c) is the likelihood which is the probability of predictor given class.
  • P(x) is the prior probability of predictor.

Likelihood: How probable is the evidence given that our hypothesis is true.

Prior: How probable was our hypothesis before observing the evidence?

Posterior: How probable is our hypothesis given the observed evidence?

Marginal: How probable is the new evidence under all possible hypotheses?

How Naive Bayes works is a long chapter in itself; if you are interested in going further in-depth, you can visit my other articles. However,

in short, Naive Bayes uses class probabilities to classify the observations.

Let’s see how we can apply Naïve Bayes to a multi-classification problem.
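A sketch with scikit-learn’s Gaussian Naive Bayes, assuming the same scaled train/test split as before:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
```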

Naïve Bayes didn’t perform well for this data, and that makes sense: Naïve Bayes is usually better suited to textual data.

Well, we have 2 more powerful algorithms to go.

Let me put all the Naive Bayes codes together

Next is Decision Trees / Rule-based Classifier.

What are Decision Trees?

Decision Trees are a non-parametric supervised learning method used for both classification and regression tasks. The goal is to create a model that predicts the value of a target variable by learning simple decision rules derived from the data features.

The decision rules are generally in the form of if-then-else statements. The deeper the tree, the more complex the rules and the fitter the model.

A decision tree gives output in a tree-like graph with nodes. Take this graph as an example, beautifully explained.

Decision Trees | Graph Credit ~ TDS

Let’s get hands-on experience on how to perform Decision trees.
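A minimal sketch of the decision tree fit (‘entropy’ as the split criterion is one common choice, not necessarily the article’s exact setting):

```python
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
```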

So what we got

Confusion Matrix: Decision Tree

Confusion matrix | The decision tree is performing better at identifying true positives than Naïve Bayes.

The accuracy score of our Decision Tree model is better than Naïve Bayes

Hence Decision Tree is performing better for this non-linearly separable data.

Wait: since decision trees are rule-based classifiers and we can generate rules, let’s visualize them and see what we get.

The tree is finally exported, and we can visualize it using http://www.webgraphviz.com/ by copying the contents of the ‘multi-class_tree.dot’ file.
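The export itself might look like this (a sketch; the feature names assume the nine UCI glass measurements were kept in order):

```python
from sklearn.tree import export_graphviz

export_graphviz(classifier, out_file='multi-class_tree.dot',
                feature_names=['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe'])
```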

Decision Tree Classification with http://www.webgraphviz.com/

Well, it’s a long tree, very difficult to put everything out here. Still, our classifier did perform better than the previous classifiers.

Next, we have another classifier called Random Forest, an upgraded version of the decision tree classifier.

Let me put all of the Decision Tree Classifier codes together.

Next RANDOM FOREST

What is a random forest?

Random Forest is an upgraded version of decision trees. As the name suggests, it consists of a large number of individual decision trees that operate as an ensemble. Thus we are combining the predictive power of several decision trees to get more accuracy.

Random Forest Graphical Representation

Let’s get started with the help of an example

Till here it’s the same basic data pre-processing: loading the data, defining X & Y, splitting the data into train and test sets, and normalizing/scaling to reduce the magnitude/spread of the data points.

Now we will fit the random forest to the dataset. We will also fit a decision tree so that we can compare their performance later.
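A sketch of fitting both models and comparing their balanced accuracy:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import balanced_accuracy_score

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)   # default settings
dt = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print('Random forest:', balanced_accuracy_score(y_test, rf.predict(X_test)))
print('Decision tree:', balanced_accuracy_score(y_test, dt.predict(X_test)))
```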

Well, we did increase the Balanced Accuracy by 4% with the default random forest settings.

Now, to find the best settings it is not possible to try every combination one by one; it’s a tedious, time-consuming process and not very productive. For that, we have an automated way to find the optimal settings for each classifier, called GRID SEARCH. But before we move on to Grid Search, let me put all the pieces of the random forest together so that you can use it as a template later.

Next Grid Search

Grid Search

Finding the best parameters by manual tuning is tedious and time-consuming, as there are so many parameters to be tested over and over again. To overcome this, we will look into ‘GRID SEARCH’, a method that automates the task of finding the best model parameters for us.

We will divide this into 2 sections: a) Grid Search for finding the best hyperparameters for our machine learning model b.) Grid Search for Deep Learning models.

Let’s start with a) Grid Search for machine learning models

Then we will split the data into train and test sets and scale our data before we fit our model. For this example, we will use the Random Forest classifier (RF), which so far has the highest accuracy score with default parameters.

Now let’s automate the search for the best parameters of our Random Forest model.
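A sketch of a grid search over n_estimators (the candidate values and cv=5 are illustrative, not the article’s exact grid):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [100, 500, 1000, 1500, 2000]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                    cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
print('Best parameters:', grid.best_params_)
```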

And what we have…….. Best Parameters for n_estimators: 1500

Okay, let’s retrain with n_estimators = 1500 and see if it improves our model.

Nice! It did improve from 0.70 to 0.72. You can use these codes as a template with a few modifications, like the list of parameters for different types of classifiers; to see a classifier’s parameters you can simply select the classifier name, e.g. ‘svm’, and press ‘ctrl’ + ‘i’.

Let me put all of the pieces together.

Still not satisfied with the accuracy level, are you? Have you tried the next-generation machine learning technique, deep learning?

So what are we waiting for? Let’s try it.

I believe you are already aware of how neural networks work. If not, don’t worry, there are plenty of resources available on the web to get started with. However, I will also walk you briefly through what a neural network is and how it learns.

Parts Of Neuron

In this diagram, dendrites are the receivers of the neuron, while the axon is the transmitter of the neuron’s signal.

What is a neuron?

In artificial intelligence, a neuron is a mathematical function that models the functioning of a biological neuron. Typically, a neuron computes the weighted average of its inputs, and this sum is passed through a nonlinear function, also called an activation function, such as sigmoid or ReLU.

Now if we put this in a flow diagram it will look something like this

Simple Neuron Network Diagram

In reality, of course, we are going to have larger and more complex neural networks.

Multi-layer Neuron Network

How does it learn?

The network processes data forward and propagates the error backward (this is known as backpropagation), adjusting the weights over and over again to reduce the error/loss. Once it reaches the point where further updates don’t improve the accuracy, the parameter settings are saved as the final weights. There are different methods to minimize the loss; one of them is Gradient Descent.

Gradient Descent is an optimization algorithm often used for finding those weights.
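At its core it repeats one simple update; here is a minimal sketch of that step (the learning rate value is illustrative):

```python
import numpy as np

def gradient_descent_step(w, grad, learning_rate=0.01):
    """Move the weights a small step against the gradient of the loss.
    The variants below differ only in how much data is used to compute grad
    and how often this update is applied."""
    return w - learning_rate * np.asarray(grad)
```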

Types of Gradient Descent

1. Batch Gradient Descent: it calculates the error for each example in the training dataset but only updates the model after all training examples have been evaluated. In other words, it takes the whole data and adjusts weights with iterations & iterations.

Pros:

a) Fewer updates to the model means this variant of gradient descent is more computationally efficient than stochastic gradient descent.

b) The decreased update frequency results in a more stable error gradient, which may lead to more stable convergence.

Cons:

a.) However, the more stable error gradient may result in premature convergence of the model to a less optimal set of parameters.

b.) It requires the entire training set to be in memory and available to the algorithm; thus, with respect to training speed, it may become slow for a large dataset.

2. Stochastic Gradient Descent calculates the error and updates the model for each example in the training dataset.

In other words: one row at a time, adjusting the weights with each iteration. This helps escape local minima on the way to the global minimum, and each individual update is faster.

Pros:

a.) This variant is simpler to understand and implement for beginners

b.) The frequent updates immediately give an insight into the performance of the model and the rate of improvement.

c.) The increased model update frequency one row at a time can result in faster learning on some problems.

Cons:

a.) However, updating the model so frequently is more computationally expensive than the other variants of gradient descent, especially when training models on a large dataset.

b.) The frequent updates can result in a noisy gradient signal, which may cause the model parameters, and in turn the model error, to jump around.

3. Mini-Batch Gradient Descent is a variation of the gradient descent algorithm that splits the training set into small batches that are used to calculate the model error and update the model coefficients.

Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.

Pros:

a.) The model update frequency is higher than batch gradient descent which allows for a more robust convergence and avoiding local minima.

b.) The batch updates provide a computationally more efficient process than stochastic gradient descent.

c.) The batching gives efficiency both in memory (not having to hold all the training data at once) and in algorithm implementation.

Cons:

a.) Mini-batch requires the configuration of an additional ‘mini-batch’ size hyperparameter for the learning algorithm.

b.) Error information must be accumulated across mini-batches of training examples, like batch gradient descent, thus requiring more computational work.

THE MOST COMMONLY USED OPTIMIZER IN DEEP LEARNING is Adam, another optimization algorithm.

NOW, since we have an idea of how neural networks work, let’s get started with a real-life example.

First, we will import the data and the libraries as we go.

Then, as usual, define what is X and what is Y. I have also added groupby(y).size() to check for any imbalanced classes.

Now the interesting and most important part of performing multi-classification in deep learning is to encode the target variable (y): we convert each category into a dummy (one-hot) variable so the network can classify each category. Done, that’s it!
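One simple way to do that encoding (a sketch; pandas.get_dummies is one option, keras.utils.to_categorical is another):

```python
import pandas as pd

# Each glass category becomes its own 0/1 dummy column
y_encoded = pd.get_dummies(y).values
```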

We will split the data into train and test sets as usual, and then there is one more simple and super fast step: feature scaling, to bring the magnitudes into a small range. This reduces the workload for the ANN without compromising the original meaning of the data.

Thus scaling neither adds noise nor loses the original meaning of the data.

# AND WE ARE DONE WITH THE DATA PREPARATION !!!!!!!!!

#LET’s START THE FUN PART- CREATING A NEURAL NETWORK!!!

# A small note on Keras and TensorFlow, the buzzwords that we hear all the time.

TensorFlow is an end-to-end open-source platform. It’s a comprehensive and flexible ecosystem of tools, libraries, and other resources that provide workflows with high-level APIs.

Keras, on the other hand, is a high-level neural networks library that is running on top of TensorFlow, CNTK, and Theano. Using Keras in deep learning allows developers to easily build neural networks without worrying much about the mathematical aspects of tensor algebra, numerical techniques, and optimization methods. Keras was developed with the objective of allowing people to write their own scripts without having to learn the backend in detail.

Let’s Get Back To The Track!

# We will add and connect layers using .add and Dense with units = 30. Hmm, what does that 30 mean?

30 refers to the number of nodes/neurons in the layer; usually we choose around half the number of columns (variables) we have in our dataset.

Next, we have kernel_initializer = ‘uniform’, where ‘uniform’ is the function that initializes the weights before they are optimized by stochastic gradient descent or any other optimizer like ‘adam’. What is an optimizer? We will get to that part in a few seconds.

activation = ‘relu’ stands for the rectified linear unit, the rectifier that introduces non-linearity.

ReLU is linear for all positive values and zero for all negative values. The downside of being zero for all negative values is a problem called “dying ReLU”: a ReLU neuron is “dead” if it’s stuck on the negative side and always outputs 0. The dying problem is likely to occur when the learning rate is too high or there is a large negative bias. ‘Leaky ReLU’ and ‘ELU’ are good alternatives to try. Other variants include ReLU-6, Concatenated ReLU (CReLU), Exponential Linear (ELU, SELU), and Parametric ReLU.

The last one, ‘input_dim’, simply refers to the number of input columns (input dimensions).

Further, we will add a second layer the same way we did above; the only difference is that we don’t need to specify ‘input_dim’, because it is inferred from the first layer, whose output dimension is 30.

The ‘relu’ activation is typically used for a regression output, while the ‘softmax’ activation is used when we need a multi-class classification output; here the final Dense layer has 8 units, one per encoded class.

The compile step ties all the layers together; in other words, it configures how the network will calculate and update its weights (settings).

optimizer = ‘adam’, which, just like Stochastic Gradient Descent (SGD), optimizes the algorithm to find the best set of weights in the neural network, starting from the initial weights created by the kernel_initializer = ‘uniform’ that we set a while ago.

loss = ‘binary_crossentropy’ is the function used to calculate the loss for a binary classification problem; for regression it’s typically RMSE (Root Mean Square Error), and for multi-class classification we use loss = ‘categorical_crossentropy’.

metrics = [‘accuracy’] simply tells Keras to report the accuracy of the model.

Finally, we will wrap our model function with the ‘KerasClassifier’ wrapper, where ‘build_fn’ refers to the model-building function, epochs = 350 is the number of passes over the training data, batch_size = 1 is the number of rows used at a time to train our model, and verbose = 1 just displays the progress during training.

It's time to fit our model with X_train and y_train… done!
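A condensed sketch of everything described above, with the layer sizes from the article (30 units per hidden layer, 8 output classes) and 9 input features assumed; the wrapper import may differ depending on your Keras version (newer setups use scikeras):

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

def build_model():
    model = Sequential()
    model.add(Dense(units=30, kernel_initializer='uniform',
                    activation='relu', input_dim=9))   # input_dim assumed: 9 features
    model.add(Dense(units=30, kernel_initializer='uniform', activation='relu'))
    model.add(Dense(units=8, kernel_initializer='uniform', activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

classifier = KerasClassifier(build_fn=build_model, epochs=350,
                             batch_size=1, verbose=1)
classifier.fit(X_train, y_train)   # y_train is the one-hot encoded target from earlier
```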

Wow, now we have an accuracy of 99%, the highest of all; that’s why deep learning is so popular for non-linearly separable data. However, tuning deep learning models can be a bit difficult as there are lots of parameters to tune, but we can also use GRID Search to automate finding the best parameters for each scenario. Search for Grid Search from my profile if you wish to see ‘how to use Grid Search for Deep Learning’ in detail.

Our model is ready to predict new data.

The values from 1 to 7 are the predicted classes.

Done! We can also use K-fold cross-validation to validate our model.
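A sketch of that cross-validation step (10 folds is an illustrative choice):

```python
from sklearn.model_selection import cross_val_score, KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(classifier, X_train, y_train, cv=kfold)  # wrapped Keras model
print('Mean CV accuracy:', scores.mean())
```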

The whole code will look something like this.

Congratulations! We have completed them all. It’s a long blog; I tried to keep it as short as possible while keeping the important concepts intact.

I hope you have enjoyed it.

Feel Free to ask because “Curiosity Leads To Perfection”

Some of my alternative internet presences are Facebook, Instagram, Udemy, Blogger, Issuu, and more.

Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy

Stay tuned for more updates.! have a good day….


~ Be Happy and Enjoy!


Don’t forget to follow The Lean Programmer Publication for more such articles, and subscribe to our newsletter tinyletter.com/TheLeanProgrammer
