ML06: Intro to Multi-class Classification

Vaibhav Malhotra
Oct 14, 2020 · 4 min read

This is a continuation of Mathematics behind Machine Learning Series.


In the last post, we saw how elegantly the maximum margin principle can be formulated to solve a binary classification task.

But for multiclass problems, the notion of maximum margin is harder to formulate directly. So the question becomes: how can we reuse the two-class formulation for k classes?

One-vs-Rest (OvR): for each class k, train an SVM that is an expert in classifying k (the one) versus non-k (the rest). In other words, we create one binary classifier per class, each specializing in separating its class from all the others.

One-vs-One (OvO): for each pair of classes (k, k’), train an SVM that is expert in classifying k versus k’.

In either case, make multiclass predictions by combining individual binary predictions.
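Both strategies can be sketched with scikit-learn's built-in wrappers. This is a minimal illustration, assuming scikit-learn is installed and using a synthetic 3-class dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

# Synthetic 3-class dataset
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# OvR: one binary SVM per class
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# OvO: one binary SVM per pair of classes
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovr.estimators_))  # 3 -> k classifiers
print(len(ovo.estimators_))  # 3 -> k*(k-1)/2 classifiers (also 3 when k=3)
```

For k = 3 the two strategies happen to train the same number of classifiers; for larger k they diverge quickly.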

One-vs-Rest

Let's take the example of a 3-class classifier. For OvR we need 3 SVMs, each of which is an expert in separating one class from the other two. Let's see a visual example below:

OvR SVM with linear kernel

We train 3 SVMs: one for the blue, one for the orange, and one for the green class. Each SVM is trained on binarized data, meaning its task is to separate its own class from all the other classes. The heatmaps below show the decision function of each individual SVM.

Heatmaps on training OvR
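The OvR prediction rule can be sketched directly: train one binary SVM per class and pick the class whose expert is most confident. A minimal version, assuming scikit-learn and the same synthetic blobs data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Train one binary SVM per class: class k vs. the rest
experts = []
for k in np.unique(y):
    svm = SVC(kernel="linear").fit(X, (y == k).astype(int))
    experts.append(svm)

# Predict by taking the class whose expert gives the highest score
scores = np.column_stack([svm.decision_function(X) for svm in experts])
pred = np.argmax(scores, axis=1)
print((pred == y).mean())  # training accuracy
```

Using the raw decision scores (rather than hard 0/1 predictions) avoids ties when several experts, or none, claim a point.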

One-vs-One

For the same example, let's see how OvO works. For this 3-class problem, instead of training one SVM per class, we train one SVM per pair of classes. While training, each SVM focuses only on its pair of classes and ignores all the others.

OvO with linear kernel

For the example above, we train one SVM to separate blue from green, one to separate blue from orange, and one to separate orange from green. The heatmaps below display the training process; we can see that while training on a pair, each SVM completely ignores the other classes.

Heatmaps on training for OvO

In order to make a prediction for a point x, each pairwise SVM votes for one of its two classes, and we choose the class k with the highest number of votes.
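The voting scheme above can be sketched explicitly. This is a minimal illustration, assuming scikit-learn and integer class labels 0..k−1 as produced by `make_blobs`:

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=300, centers=3, random_state=0)
classes = np.unique(y)

# Train one SVM per pair of classes, using only the points of that pair
pair_svms = {}
for a, b in combinations(classes, 2):
    mask = (y == a) | (y == b)
    pair_svms[(a, b)] = SVC(kernel="linear").fit(X[mask], y[mask])

# Each pairwise SVM votes for one of its two classes;
# the predicted class is the one with the most votes.
votes = np.zeros((len(X), len(classes)))
for (a, b), svm in pair_svms.items():
    p = svm.predict(X)
    for cls in (a, b):
        votes[:, cls] += (p == cls)
pred = np.argmax(votes, axis=1)
print((pred == y).mean())  # training accuracy
```

Note that each point receives exactly one vote from each of the k(k−1)/2 pairwise classifiers, so the vote counts per point always sum to that number.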

One issue with both approaches is that they can be very slow to train.

OvR: trains k SVMs, each on the full dataset.

OvO: trains k(k−1)/2 SVMs, but each on only the fraction of the dataset belonging to its two classes.
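The difference in classifier counts is easy to verify. A tiny helper (the function name is my own, for illustration):

```python
def n_classifiers(k):
    """Number of binary SVMs each strategy needs for k classes."""
    return {"ovr": k, "ovo": k * (k - 1) // 2}

print(n_classifiers(3))   # {'ovr': 3, 'ovo': 3}
print(n_classifiers(10))  # {'ovr': 10, 'ovo': 45}
```

So OvO trains many more models as k grows, but each sees far less data, which often makes the individual fits cheaper.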

Logistic Regression (LR) for Multiclass Classification

It is important to have an understanding of logistic regression; please refer to this article if you need a bit of revision.

We can use the OvR and OvO strategies with logistic regression as well; however, it turns out that LR has a natural multiclass maximum-likelihood formulation.

Multi-output Logistic Regression from Multi-output Linear Regression

We can train a linear regression model to produce a k-dimensional output vector for a multi-output dataset, as below:


Now if we apply an element-wise sigmoid to the output and predict each label independently, we get a multilabel output, as below:

An example of the actual and predicted label
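A minimal sketch of this element-wise sigmoid step (the raw scores here are hypothetical, standing in for the output of a 3-output linear model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores from a 3-output linear model for one input
z = np.array([2.0, -1.0, 0.5])

p = sigmoid(z)        # each output squashed independently into (0, 1)
labels = p >= 0.5     # multilabel: any subset of outputs can be "on"
print(p)
print(labels)         # note: the probabilities need not sum to 1
```

Because each output is squashed independently, nothing forces the probabilities to sum to 1, which is exactly the problem discussed next.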

For classification, however, the predicted probabilities of belonging to each class cannot be treated independently: they must all sum to 1.

So ideally we need to predict a joint probability distribution over all our classes. Thankfully, the softmax function comes to our rescue.


Intuition: softmax amplifies the largest value(s) in a vector a and then normalizes, so that the outputs sum to 1. With it, we have our multiclass classifier.
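Softmax maps a score vector a to probabilities via softmax(a)ᵢ = exp(aᵢ) / Σⱼ exp(aⱼ). A minimal, numerically stable implementation:

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax: subtracting max(a) leaves the
    result unchanged (it cancels in the ratio) but avoids overflow."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])
p = softmax(a)
print(p)         # the largest entry is amplified the most
print(p.sum())   # 1.0 -- a valid joint distribution over classes
```

Unlike the element-wise sigmoid, the exponentials compete through the shared denominator, so raising one class's score necessarily lowers the others' probabilities.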

Let’s see it in action with the same example we saw for SVM:

Logistic Regression on Multi-Class classification with 3 classes
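A result like the one pictured can be reproduced in a few lines. This sketch assumes scikit-learn, whose `LogisticRegression` fits the multinomial (softmax) model by default for multiclass targets in recent versions:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# One joint softmax model over all 3 classes -- no OvR/OvO needed
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X[:1])
print(proba)        # one probability per class
print(proba.sum())  # 1.0 -- a joint distribution, unlike sigmoid outputs
```

In contrast to the SVM strategies above, this trains a single model whose outputs are already calibrated joint probabilities.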

Bonus Section:

Multiclass vs Multi-label Classifications vs Tagging

Multiclass: Given x predict one unique class (one “class label”) out of K classes.

Multilabel: Given x predict the subset of K labels that should be associated to x.

Tagging: multilabel classification, usually applied to images.

There are many more multiclass classifiers, like k-NN and the Naive Bayes classifier, that I will discuss later.

I hope this article provided you with intuition about multiclass classification. These methods are important for dealing with real-world problems and have a wide variety of applications. Hope you enjoyed learning!

For questions/feedback you can reach me at my LinkedIn or at my Website.

Happy learning!
