# How to do Cost-Sensitive Learning

## Be right in classification modeling when it matters most

# Introduction

Data Science teams must often walk the line between making high-value bespoke models and making robust, transparent, and scalable solution pipelines. Transforming your data with a business-specific cost matrix prior to model fitting can tailor a model to your business, without adding complexity to your deployed pipeline. This article provides a practical discussion, with examples, of how you can start using your cost structure today to drive greater business value from your models.

# 1. Classification, Objectives, and Value

The field of data science has shown great value in automating or augmenting certain decisions in the business process. In theory, this amounts to:

- Accessing as much data surrounding this decision as possible.
- Designing an objective function tailored to the appropriate notion of business value.
- Creating an automated pipeline that will use the relevant data to fit a learner to this objective and deploy the resulting model in a way that is robust, transparent, and scalable.

In practice, designing and maintaining custom objective functions is often in contention with our desires for robust, transparent and scalable data science pipelines. As a result, many data science groups use pre-made objective functions, hoping that general model accuracy will mostly result in a solution that drives value for the business. While this is sometimes the case, you don’t have to look far to find data science solutions that are deemed valuable in a mathematical sense but not in a business one.

We describe here a series of strategies that can be employed to ensure that the models in question are being fit in such a way that maximizes business value. Moreover, these strategies reduce to an upsampling-downsampling strategy that can simply be inserted as a preprocessing step in your existing data science pipelines. All that’s required is to define how each classification and miss-classification translates to your business bottom line — a notion that we make precise below as a “cost matrix.”

# 2. The Cost-Sensitive Learning Landscape

Given a cost matrix *c = (c(i,j)(x))* where *c(i,j)(x)* represents the cost (perhaps negative or zero) of classifying *x* (which is really a member of class *j*) as being in class *i*. The first reduction one might make is in assuming that these functions *c(i,j)* are, in fact, constant. This distinction is known as a *class dependent cost* as opposed to an *example dependent cost*. Example dependent costs are more complicated and generally speaking, less is known about handling them. In this discussion, we will limit ourselves to the case of class dependent costs.

One further distinction that you might make is between the two-class case and the multi-class case. The two-class case of class dependent cost sensitive learning has been adequately solved (at least in some sense) by reducing to the two-class cost-insensitive learning case and then choosing an optimal decision threshold determined by the cost matrix (assuming some very mild reasonableness conditions on the cost matrix)[1]. We give a description of this solution in section 2.2.

The multi-class case is less clear. One could divide the realm of solutions first into the following two strategies:

- Reduce to the two-class case using the one-versus-one or one-versus-all techniques.
- Create a solution by treating all classes and costs simultaneously.

The classical solution, which does not reduce to the two-class case involves

weighting class i by

where *c(i) *is the sum of the *i*th column of the cost matrix, e.g.

*and n(k)* is the number of examples in class* k, κ* is the number of classes, and *n* is the total number of examples in the training set.

When *κ = 2*, this is the solution alluded to above, but when *m > 2 *we can demonstrate that this solution is suboptimal in certain cases.

When *κ > 2,* Zhou and Liu give conditions on the cost matrix *c *under which an optimal solution may be derived [2]. It is further suggested that, when *c *does not satisfy these conditions, reducing to the two-class case using the one-versus-one technique results in a better solution. This solution does, however, substantially increase the compute time required to train and score the solution.

# 3. The Two-Class Case

## 3.1 A Motivating Example

Suppose you have a learning task with training data *X = {x(1), x(2), . . . , x(n)} → {0,1} *that you would like to model. Let’s say that there are *n(0) *negative cases and *n(1) *positive cases. Suppose we choose to rebalance that data (say up or down sampling the negative class) so that there are *n(1) *negative cases and *n(1) *positive cases. That is, we are multiplying the number of negative cases by *n(1)/n(0).*

Consider now the following theorem of Elkan [1]:

Theorem 1:To make a target probability threshold p* correspond to a given probability threshold p(0), the number of negative examples in the training set should be multiplied by

Suppose further that we train a model *f : X → [0, 1] *which models the probability *P(y = 1|x)*, and predict using the threshold *0.5*. That is:

Using theorem 1, we can calculate the decision threshold of the original data (before balancing) to which this corresponds. This calculation is below:

So we see that modeling on the balanced data with a threshold of 0.5 corresponds to modeling on the raw (or original) data with a threshold of *n(1)/n*. What is left to observe now is that the optimal decision threshold is determined precisely by the existence of a cost matrix.

Let *c *be a real valued matrix

where *c(i,j) *is the cost of predicting class *i *when the truth is class *j* (e.g. *c(0,1) *is the cost associated to a false negative).

We will further require that the following two conditions be satisfied (known as the “reasonableness” conditions):

(i.e. correctly labeling an example is preferable to incorrectly labeling one).

Such a matrix is called a “cost matrix” in the two-class setting.

Observation 1:There is always a cost (perhaps negative or zero) associated to classifying an example as classithat is in actuality a member of classj.

A natural question is then, given a cost matrix *c, *and a function *f : X → [0, 1]*, which models the probability *P(y = 1|x)*, what criteria should be used to determine if we should classify *x *as 0 or 1?

The natural criteria is to classify *x *as a 1 exactly when the expected cost of classifying as a 1 is less than classifying it as a 0. (If the two expected values are equal, it doesn’t matter what we choose). More precisely, we should label *x* as a 1 if and only if:

The optimal threshold is then the value *p** satisfying:

Solving for *p**, we arrive at:

Thus, we see that the practice of balancing the classes and choosing 0.5 as the decision threshold corresponds to leaving the data unbalanced and optimizing the decision threshold to the below matrix:

which essentially says that (fixing the cost of a false positive at 1), the cost of “missing” a positive case is proportional to its scarcity.

Observation 2:This calculation explains why the (perhaps unexamined) practice of balancing and thresholding at 0.5, is a reasonable practice in the absence of more complete information, as costs are often proportional to the rarity of the missed event. That said, in the presence of more complete information (i.e. an explicit cost matrix) it is recommended that you optimize for it explicitly.

## 3.2 The Two-Class Strategy

For the sake of completeness, we now use the observations from section 3.1 to describe an end-to-end solution to go from a reasonable cost matrix to an optimal sampling or weighting technique, given a decision threshold of 0.5.

Let *c = (c(i,j)) *be a real valued matrix where *c(i,j)* is the cost of predicting class *i *when the truth is class *j* satisfying the reasonableness conditions *c(0,1) > c(0,0)* and *c(1,0) > c(1,1)*. The optimal decision threshold is then given by:

which, by theorem 1, corresponds to multiplying the number of negative cases by:

That is, we should sample *(c(1,0)-c(0,0))/(c(0,1)-c(1,1))* of the negative class, or equivalently, weight the positive class with *c(0,1)-c(1,1)* and the negative class with *c(1,0)-c(0,0)*.

Thus, we arrive at the following algorithm for handling the two-class case:

# 4 The Multi-Class Case

Suppose now that we have κ classes, with a κ×κ real-valued cost matrix *c = (c(i,j))* as above. Our goal is to construct a weight vector *w = (w(1), w(2), . . . , w(k))* with each *w(i)* a positive real number.

An optimal solution *w* should, in particular, be an optimal solution in the two-class case. That is, for each *i, j ∈ {1, . . . , κ } *with *i* different from *j*, we should have:

We can simplify this if we further insist that *c(i,i) = 0 *for all *i *(which by reasonableness also makes all other weights positive) and reduces this to:

Thus, we arrive at κ choose 2 equations, linear in *w(i)*:

from which we would like to find a solution *w = (w(i))* with all *w(i) *positive.

For this to be possible, we require the associated coefficient matrix of the above system to have rank less than *κ*. In the event that this condition is satisfied, we may solve directly for our weights *w*. Otherwise, we must apply a different approach (Section 4.1) or reduce to the 2-class case (Section 4.2).

## 4.1 The Classical Approach

The classical approach to class dependent weighting was presented by Breiman et al in [3] (a nice discussion also appears in [4]). The basic idea stems from the following observation:

Observation 3:The weight assigned to classishould be proportional to the cost of misclassifying a member of class i.

What then is a reasonable approximation of the cost of misclassifying a member of class *i*? If we had already constructed our classifier and knew the probability of misclassifying a member of class *i* as class *j* (let’s call that *P(y’ = j|y = 1))*, then this value would be:

Given that we cannot compute these probabilities before constructing our classifier, a reasonable approach is to assume that each of these probabilities are equal, giving a proposed weight of *c(i)* for the *i*th class where *c(i)* is given by

Convention insists that these weights are then scaled by

giving the final weight of:

where *c(i) *is defined as above, *n(k)* is the number of examples in class *k, κ* is the number of classes, and *n* is the total number of examples in the training set.

Note that when *κ = 2 *and *c(0,0) = c(1,1) = 0* then this formula gives the familiar solution:

## 4.2 Reduction to the 2-Class Case

Another natural option is to reduce to the solvable case (i.e. the two-class case). There are generally two standard ways to do this: the one versus one and the one versus all techniques. Since employing the one versus all technique would force us to again estimate the general cost of misclassifying an example from a given class, we favor the one versus one technique. A brief description of how this would work is below:

## 4.3 The Multi-Class Strategy

We are now at a point that we can propose two alternate strategies for prediction, which we will term “SimpleCost” (algorithm 3 proposed in [3] and described in [4]) and “ComplexCost” (algorithm 4 proposed in [2] and known as RESCALEnew).

The SimpleCost algorithm works as written for any learner *L* that accepts class weights *w*. For learners that do not accept weights, an equivalent weighting scheme can be accomplished through a mixture of upsampling and downsampling [Appendix].

The ComplexCost algorithm instead tries to find a non-vanishing weighting scheme that satisfies the (likely overdetermined) system of constraints — reducing to the two-class case when no such solution exists.

Recall that for a κ×κ real-valued cost matrix *c = (c(i,j))*, there exist κ choose 2 equations, linear in *w(i):*

that optimal weights *w = (w(i)) *must satisfy. We can place the coefficients in this system into a κ choose 2 × κ matrix referred to in algorithm 4 as *c’*.

# 5. Case Study — Comparison of Cost-Sensitive and Traditional Methods in a Lead Routing Problem.

Consider the following business process — you have access to a source of leads. For each lead you can choose to:

1. Send the lead to a team that sells only high-value products.

2. Send the lead to a team that sells only low-value products.

3. Choose not to field the lead.

There’s some complexity here, because the products for which the lead qualifies are not explicitly known. It could be no products; it could be only low-value products; or it could be high and low-value products. We can readily create a “Contribution Margin” Matrix (or CM-matrix) from the business knowledge that may look something like this:

In general, these are just real numbers, but it is important that all values be in the same units (here US dollars) and should be relative to the same baseline.

In this case, we can interpret that sending a lead to a business team that cannot actually be sold costs the business 10 dollars (the cost of doing the work of trying to make the sale). There is no cost to choosing not to send a lead to a business team (but there will be no CM generated). We also see that the team selling low-value products can sell a lead that qualifies for high-value products; but, in doing so, the team will create 50 dollars in CM instead of the 70 dollars in CM that could have been generated by sending the lead to the high-value product team.

Making the replacement *c(i,j) -> -c(i,j) + c(j,j) *generates a corresponding relative cost matrix:

We can then generate the corresponding counts, costs, and weights required by the SimpleCost algorithm:

If we fit a standard (cost-insensitive) random forest classifier with standard parameter tuning, we arrive at the following confusion matrix on a holdout set:

This method gives an average CM per lead of $12.18.

Employing the SimpleCost method, but using the same learner and the same holdout set, we arrive at the below confusion matrix:

Which gives an average CM per lead of $16.00 (a 31% improvement). The reader will notice that the effect of the cost-sensitive method is to:

1. Recognize that incorrectly predicting class C is the most painful.

2. Corresponding weight up the training examples from class A and B.

3. Get a model that predicts classes A and B more often — leading to a lower accuracy but a much more valuable model in the given business context.

Most models created to augment or automate a business decision can be made (and improved) by doing so in the presence of the relative costs of each classification.

# Conclusion

Employing an upsampling/downsampling/weighting scheme driven by a cost or CM matrix prior to modeling is very likely to lead to a solution with a much greater business value. Moreover, this can be accomplished without the addition of tremendous technical overhead or obscuring the interpretability of the solution.

*Interested in solving complex problems and building products that drive value? Come check us out at **RedVentures.com**.*

# Appendix: Creating sampling strategies from weights

In our discussion thus far, we have turned cost (or CM) matrices into class weights that we can then specify as parameters to a learner. While many learners do naturally accept class weights in this way — many packages are not ready-made for this purpose. To have a truly model-agnostic and cost-sensitive strategy, we are perhaps better suited to rely on class “mix” — that is, the proportion of the training set the class represents — rather than class weights.

A straightforward calculation shows how to derive the desired mix of class *i: μ(i) *in terms of the calculated weight *w(i) *and number of examples in the training set (before any up or down sampling) *n(i). *This formula is given below: