Machine Learning Reductions & Mother Algorithms, Part II: Multiclass to Binary Classification

Published in

The Hands-on Advisors

11 min readNov 16, 2018

Following our introductory Part I on ML reductions & mother algorithms, let’s talk about a classic reduction: one-against-all (OAA) — also known as one-vs-all (OVA) and one-vs-rest (OVR). It’s one of the simplest reductions, but also one of the most powerful, with several desirable properties. Unfortunately, it’s seen in some circles as too simple, with dissidents pointing to the problem with class imbalance. In fact, this issue can be mitigated with a neat trick (more on that later), leaving us with a general purpose solution to almost any multiclass problem you can think of.

Quicksort, a classic sorting algorithm, is also a reduction. Smaller sub-problems are solved and composed to form the solution to the original problem.

So how does OAA work?

Classification algorithms aim to learn a optimal decision boundary to separate different inputs from each other. At prediction time, inputs are classified into different classes using this boundary. In binary classification settings, there are two possible classes (“is this credit card transaction fraudelent or legimate?”); in multiclass settings, there may be many more.

The OAA approach to multiclass classification allows you to reduce any multiclass learning problem to a problem solvable using binary classifiers. It works as follows:

For each of the i={1…n} classes in your training dataset, create n new datasets D₁…Dn (one for each class)
For each new dataset Di, mark samples of the corresponding class i as positive and all other samples as negatives
Train N binary classifiers, each on their own dataset Di
At prediction time, ask all binary classifiers for a prediction. Choose the class for which the corresponding classifier outputs a positive, breaking ties randomly

This is what we’ll henceforth refer to as vanilla OAA or just OAA. There’s an even more fundamental version — where ties are broken arbitrarily instead of randomly — known as strawman one-against-all (SOAA) [1]. However, it works worse in practice, so we’ll skip the details in this post.

In the strict OAA scheme, we want to use binary classifiers as oracles. The term oracle pops up frequently in literature related to reductions, and it’s a powerful concept: ideally, we want to design reductions such that the original problem is solvable using learners whose inner workings aren’t known. That’s precisely what OAA does: we can use any binary classification algorithms we like. We don’t need to know what loss functions are used, nor do we need the oracles to output class probabilities/confidence scores. We could even ask someone else to provide the base classifiers, if so inclined.

Detour: the “one classifier/one regressor” trick

(Note: this section covers a technical trick that isn’t OAA-specific; feel free to skip to the next section)

I claimed that you need to train several binary classifiers when using OAA, but that’s not strictly true. In fact, you can squeeze several different binary classifiers into one binary classifier using some feature index manipulation, in a trick known as the one classifier trick. It’s one of my favourite ML curiosities.

Here’s how it works: first, let’s say we have a 3-class classification problem and 4 features a, b , c & d for each example in our training set. Instead of training three classifiers with 4 features each, we can instead create 12-dimensional (sparse) training examples, prefixing the columns based on the class in question. We then zero out all but the relevant members. So, for a training set that originally looked like this (6 examples):

We transform it into this (6x3=18 examples):

We then use a single binary classifier to train on this dataset. During training using e.g. SGD, features whose values are zero will not have their weights adjusted; only the relevant feature weights will be tweaked. The end result is the same as training three separate classifiers, which is pretty neat! To do predictions, we follow a similar procedure: a single 4-feature example is exploded into 3 examples, with non-relevant indices set to zero. We then choose the class of the example with a positive prediction, breaking ties just as before.

Note that the one classifier trick also works for regression problems; in such cases, it’s known as the one regressor trick. And if you don’t want to do feature index manipulation yourself, you can combine this trick with another neat technique known as the hashing trick.

(Side note of a side note: when using the one classifier trick, one must be careful with regularisation terms so that only relevant weights are updated. The standard procedure is to add separate terms for the features in each class.)

(Back to the original subject of OAA)

The benefits of OAA

OAA has several nice properties due to the fact that it learns class labels independently rather that jointly. Some features you might appreciate include:

Parallelisability. OAA solutions are almost trivially scalable: if you have e.g. 10 classes, you can train your 10 oracle classes in separate threads/processes with no need for syncronisation. For large datasets and complex problems, this is not only a benefit, but a necessity: we need to duplicate the original dataset, so it’s good that we can offset most of the computational overhead.
Model interpretability. Since a single binary classifier represents one (and only one) class, it’s possible to understand more about the factors associated with one class by looking at its corresponding classifier. As ML interpretability becomes increasingly important, this is worth noting.
Simplicity. This has more to do with reductions in general than OAA, but the learning process is no more complex than the process used by the base learners in the system.

If you are comfortable with Python, scikit-learn has a OneVsRestClassifier class with built-in support for parallelisation via the n_jobs parameter.

The class imbalance issue

There’s a downside with vanilla OAA that I hinted at during the introduction of this post, namely the issue of class imbalance. Let’s assume that we have a 10-class multiclass classification problem, and a training set of 100,000 examples across which class labels are evenly distributed. In other words, each class has 10,000 positive examples in our training data.

Consider what happens when we reduce the original problem using OAA:

For each of the 10 classes, we create a new dataset, and mark corresponding positive examples (10,000) against negative examples (90,000)
We train a separate binary classifier on each of these 1:9 ratio datasets
At prediction time, we choose the class for which the prediction is positive, breaking ties randomly

When using OAA, each binary classifier is subject to class imbalance: because the number of negative examples far outweigh the number of positive examples, learning will typically skew towards the negative class. In such situations, we may end up with cases where all classifiers output negative, and since ties are broken randomly, our multiclass prediction is essentially random, too. The opposite can also happen: several base learners might output positives, again forcing a random tie-break. Fortunately, there exist several strategies for fixing this, including the one we’ll cover in this blog post: the WOAA reduction.

Fixing class imbalances: the WOAA reduction

Let’s once again consider a situation where we use OAA to solve a ten-class (n=10) problem using binary classification oracles. If we think about the problem some more, we can make some observations:

For a given binary oracle, a false negative produces an error in the original multiclass problem (n-1)/n = 9/10 = 90% of the time, assuming that all other learners correctly output a negative label and we break ties randomly.
Assuming no other errors, if n base learners produce false positives, the probability of an error in the original problem is n/(n+1), since only one positive can be a true positive.
If n base learners produce false positives, and one base learning produces a false negative, the probability of an error in the original problem is 100%.

Thinking about this from the perspective of the original problem, if there is a single false negative in the base learners, the probability of chosing the correct class is 1/n, or 10% in our example scenario. If there is a single false positive, the probability of a correct prediction is 1/2 = 50%. In fact, one can show that an error rate of ε in base learners can result in a worst-case error rate of (n-1)ε in the multiclass solution — far from ideal.

If we could make base learners more prone to outputting a positive rather than a negative, we can hopefully improve the accuracy of our original problem. We can achieve this by introducing a weighting framework, where some examples are weighted more than others: in this framework, a misclassification of an example with an example weight of 2 two is penalised twice as much as an example with unit weight. More generally, for the OAA problem, we transform the original n-class dataset into n base learner specific weighted datasets Di as before, but additionally, we convert each of these datasets a weighted dataset. Each (label, features) example in each dataset is converted into a weighted (label, features, weight) example as follows:

If the example x is negative (label 0), convert it into an example with weight n/2: (label=0, features=x, weight=n/2)
If the example x is negative (label 1), convert it into an example with unit weight n-1: (label=1, features=x, weight=n-1)

The conversion above is appropriate for >2-class classification problems. Using our working example of a 10-class problem with an original dataset of 100,000 evenly distributed examples, examples in base learner datasets (with 10,000 positive and 90,000 negative examples each) are converted into weighted examples as follows:

If the example x is negative (label 0), it is converted into a weighted example (label=0, features=x, weight=5)
If the example x is positive (label 1), it is converted into a weighted example (label=1, features=x, weight=9)

This weighting strategy is known as Weighted One-Against-All, or WOAA for short. The reduction was designed by Alina Beygelzimer, John Langford & Bianca Zadrozny and published in 2005 [2]. If the average error rate of base learners is ε, this twist on vanilla OAA achieves a multiclass error rate of approximately (n/2)ε, making it an attractive option when doing one-against-all classification.

If we had an importance-weighted binary classification algorithm at hand, we could stop here and just use that to use the WOAA approach to multiclass classification. But in true reductionist fashion, we should ask ourselves the following question: can we further reduce importance-weighted binary classification to binary classification? And can we do so using a binary oracle, without fiddling around with loss functions or internal learning processes?

As you may have guessed, yes you can. There are several strategies to do this, but one of the more principled solutions is the third reduction introduced in this blog post: the Costing reduction.

Fixing class imbalances through sampling: the Costing algorithm

Importance-weighted classification is a way of doing cost-sensitive classification: some misclassifications are costlier than others, and in order to deal with this, we weigh expensive mistakes more than we do others. We’ll talk about cost-sensitive classification in more detail later on in the series, but it’s worth mentioning briefly here as it’s how the Costing reduction got its name.

Costing, developed by Bianca Zadrozny, John Langford􏰀, & Naoki Abe [3], is a reduction that allows you to reduce any importance-weighted classification problem to a standard classification problem. For importance-weighted binary classification, that means reducing to standard binary classification. The reduction can implemented in several ways, but if we want to be able to treat our base learners as black-box oracles, the implementation involves rejection sampling the original dataset several times to create new datasets suitable for training using a binary oracle. Costing also adds ensembling to the mix, to correct for variance across individual sample sets.

Let’s start with the rejection sampling part first. Rejection sampling is a popular technique, used in several domains to generate observations from a desired target data distribution, based on draws from a different distribution (commonly known as a proposal distribution). In the context of machine learning, it works as follows:

For a given dataset D:

Randomly sample an example x from D
If x satisfies some acceptance criterion/criteria C, return x
If x does not satisfy C, go to 1)

The acceptance criterion C depends on the type of problem you are solving. If we wish to use sampling for implementing importance weighted classification, we want to design the acceptance criterion in such a way that the sampling process generates more costly examples than it does less-costly ones. By correcting the proportion of different types of training examples fed to our base learners, we can compensate for costly errors and achieve importance-weighted results.

Specifically, in order to make rejection sampling fit our needs, we modify the basic process by making C dependent on importance weights (known as costs in this setting). This variant of rejection sampling is known as cost-proportionate rejection sampling:

For a given dataset D:

Randomly sample an example x from D
Accept x with probability c/Z, where c is the cost associated with x and Z is a pre-chosen constant
If x not accepted, go to 1)

Z is a hyperparameter typically set to the maximum cost found over all all examples in our dataset D. Intuitively, during the sampling process, costlier examples are more likely to be accepted than less costly ones. If we use this process to generate new datasets for our base learners, they will be inherently more “balanced”, and less prone to class imbalance issues.

Following the theory laid out in WOAA, and our running example of a ten-class classification problem, for each base learner dataset Di, we can use the optimal importance weights (n/2 = 5 for negative, n-1=9 for positive) as costs, and convert all skewed oracle datasets into balanced sets, train on those using our base learners, and rollup to solve the original problem. Neat!

We’d be otherwise done, but there is a minor detail yet to be addressed. Since cost-proportionate rejection sampling throws away quite a bit of our original base learner datasets (since positive examples are in the minority), the training process is usually significantly faster than when doing vanilla OAA or WOAA with importance-aware classification algorithms. This is inherently good, but to make the process more robust, we can use this newfound computing time to create several datasets for each class, train a binary classifier for each dataset, and then average the results at prediction time. This is the full Costing reduction (Cost-proportionate rejection sampling with aggregation). We may, for example, choose to create 10 base learners for each class in a ten-class problem, resulting in 100 base learners in total (and the same number of datasets). Such ensembling may seem like overkill, but empirically, the approach works well. There are also good theoretical guarantees in terms of classification performance.

Side note: if you are adventurous, yes, you can stuff 100 classifiers into a single one using the one classifier trick!

Putting it all together

In this part of the series on ML reductions, we explored the classic one-against-all reduction for solving multiclass problems. It’s parallelisable, easy to understand, and more interpretable than most other classification methods. We can mitigate the issue of class imbalance with using a more sophisticated reduction, WOAA, which in turn can be reduced to binary classification via Costing. Put together, we have a multiclass->WOAA->Costing->binary classification reduction stack. On a more general note, using reductions, we can use binary learning algorithms to solve importance-weighted multiclass classification problems. And not just any binary learning algorithms — all binary learning algorithms, regardless of output (class or class probabilities), or other inner workings.

It’s this type of oracleisable approach to machine learning that makes reductions tremendously powerful. And we haven’t even scratched the surface. Stay tuned for Part III, where we’ll cover cost-sensitive classification in more detail.

[1] Learning Reductions that Really Work, Alina Beygelzimer, Hal Daumé III, John Langford, Paul Mineiro. 2015. URI: https://arxiv.org/pdf/1502.02704.pdf

[2] Weighted One-Against-All, Alina Beygelzimer, John Langford, & Bianca Zadrozny. 2015. URI: http://hunch.net/~beygel/woa.pdf

[3], Cost-Sensitive Learning by Cost-Proportionate Example Weighting, Bianca Zadrozny, John Langford􏰀, & Naoki Abe. 2013. URI: http://hunch.net/~jl/projects/reductions/costing/finalICDM2003.ps