This is a continuation of Mathematics behind Machine Learning Series.
In the last post, we saw how elegantly the maximum margin principle formulates to solve a binary classification task.
But for multiclass problems, the notion of maximum margin is harder to formulate directly. So the question becomes: how can we reuse the 2-class formulation for k classes? There are two standard strategies:
One-vs-Rest (OvR): for each class k, train an SVM that is an expert in separating class k (the "one") from all the other classes (the "rest").
One-vs-One (OvO): for each pair of classes (k, k’), train an SVM that is expert in classifying k versus k’.
In either case, make multiclass predictions by combining individual binary predictions.
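Both strategies can be sketched with scikit-learn's meta-estimators, which wrap any binary classifier and handle the combining for you (a minimal sketch on a made-up toy dataset, not the data from the figures below):

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

# Toy 3-class dataset (illustrative only)
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)  # one binary SVM per class
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)   # one binary SVM per pair

print(len(ovr.estimators_))  # 3 (one per class)
print(len(ovo.estimators_))  # 3 (one per pair: C(3,2) = 3)
```

For k = 3 the two strategies happen to train the same number of SVMs; that coincidence disappears for larger k.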
Let’s take the example of a 3-class classifier. For OvR we need 3 SVMs, each an expert in separating one class from the other two. Let’s see a visual example below:
We train 3 SVMs: one for the blue, one for the orange, and one for the green class. Each SVM is trained on binarized data, meaning its task is to separate its own class from all the others. The heatmap below shows the decision surface of each individual SVM.
For the same example, let’s see how OvO works. For this 3-class problem, instead of training one SVM per class, we train one SVM per pair of classes: one separating blue from green, one separating blue from orange, and one separating orange from green. The heatmap below displays the training process; note that while training on a pair, each SVM completely ignores the remaining class.
To make a prediction for a point x, each pairwise SVM casts a vote for one of its two classes, and we choose the class k with the highest number of votes.
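The voting step can be sketched in a few lines (the pairwise winners below are hypothetical values for one point, made up for illustration):

```python
# Hypothetical pairwise decisions for one point x in the 3-class example:
# each key is a pair of classes, each value the class that pair's SVM chose.
pairwise_winners = {
    ("blue", "green"): "green",
    ("blue", "orange"): "blue",
    ("orange", "green"): "green",
}

# Tally one vote per pairwise decision
votes = {}
for winner in pairwise_winners.values():
    votes[winner] = votes.get(winner, 0) + 1

# The class with the most votes wins
prediction = max(votes, key=votes.get)
print(prediction)  # green (2 votes vs 1 for blue)
```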
One issue with both approaches is that they can be very slow to train:
OvR: trains k SVMs, each on the full dataset.
OvO: trains k(k−1)/2 SVMs (on the order of k²), but each on only the fraction of the dataset belonging to its two classes.
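As a quick sanity check on these counts (a standalone sketch; the pairwise count is the number of unordered class pairs, k choose 2):

```python
def n_ovr(k):
    """Number of binary classifiers OvR trains: one per class."""
    return k

def n_ovo(k):
    """Number of binary classifiers OvO trains: one per unordered pair."""
    return k * (k - 1) // 2

for k in (3, 10, 100):
    print(k, n_ovr(k), n_ovo(k))
# k=3: 3 vs 3; k=10: 10 vs 45; k=100: 100 vs 4950
```

So OvO's classifier count grows quadratically in k, but each classifier sees far less data, which is why neither approach dominates on training time.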
Logistic Regression (LR) for Multiclass Classification
It is important to have an understanding of logistic regression first; please refer to this article if you need a bit of revision.
We can use the OvR and OvO strategies with logistic regression as well; however, it turns out that LR has a natural multiclass maximum-likelihood formulation.
Multi-output Logistic Regression from Multi-output Linear Regression
We can train a Linear Regression model to produce a k-dimensional vector as output for a multioutput dataset as below:
Now, if we apply an element-wise sigmoid to the output and predict each label independently, we get a multilabel output, as below:
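A minimal sketch of that element-wise step, with made-up linear scores for k = 3 outputs, shows why it gives independent (multilabel) probabilities rather than a single class distribution:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear-model scores (W x + b) for k = 3 outputs
scores = np.array([2.0, -1.0, 0.5])

probs = sigmoid(scores)
print(probs)        # each entry lies in (0, 1) independently...
print(probs.sum())  # ...but the sum is not constrained to equal 1
```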
For classification, the class probabilities cannot be predicted independently: they must all sum to 1.
So ideally we need to predict a joint probability distribution over all our classes. Thankfully, the softmax function comes to our rescue.
Intuition: softmax amplifies the largest value(s) in a vector a, then normalizes so that the outputs sum to 1 and can be read as probabilities. With softmax in place of the element-wise sigmoid, we have our multiclass classifier.
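The intuition above translates directly into code (a minimal numpy sketch, reusing the same made-up scores):

```python
import numpy as np

def softmax(a):
    # Subtracting the max is for numerical stability only; softmax is
    # unchanged by adding a constant to every entry of a.
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([2.0, -1.0, 0.5])
p = softmax(a)
print(p)        # the largest score gets the largest probability
print(p.sum())  # 1.0 -- a valid joint distribution over the classes
```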
Let’s see it in action with the same example we saw for SVM:
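In scikit-learn this softmax (multinomial) formulation is what `LogisticRegression` fits by default with its standard solver; a minimal sketch on a made-up 3-class dataset (not the data in the figure):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy 3-class dataset (illustrative only)
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# A single model producing a joint distribution over all 3 classes
clf = LogisticRegression().fit(X, y)

probs = clf.predict_proba(X[:1])
print(probs)        # one row of 3 probabilities for the first point
print(probs.sum())  # each row sums to 1
```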
Multiclass vs Multi-label Classifications vs Tagging
Multiclass: Given x predict one unique class (one “class label”) out of K classes.
Multilabel: Given x, predict the subset of the K labels that should be associated with x.
Tagging: Multilabel classification, usually of images.
There are many more multiclass classifiers available, like k-NN and the Naive Bayes classifier, which I will discuss later.
I hope this article gave you some intuition for multiclass classification. These methods are important for real-world problems and have a wide variety of applications. Hope you enjoyed learning about them!