Smart Social Response (SSR), a new system we’re building at Hootsuite, automatically suggests the most relevant template response to users, allowing organizations to respond to customers’ common requests efficiently and consistently. The feature is powered by machine learning models, which ingest user-defined template responses as label classes, train on past conversations, and output response suggestions for new messages. During this project, we got to do a lot of experimentation with thresholding our model, which inspired me to write this blog post. So, let’s dig in!
What is thresholding? Why do we threshold?
In machine learning, specifically classification problems, thresholding refers to setting a threshold for an evaluation score and treating predictions/models differently depending on whether they score above the threshold or not.
Machine learning models that perform poorly can sometimes yield disastrous results. For example, if a cancer-free patient is mistakenly classified as having cancer, the patient may be subjected to intense chemotherapy and undergo unnecessary suffering. Although a poor prediction for us simply means that the suggested responses will be ignored, we wish to deliver the best results to our customers. Therefore, by utilizing thresholding strategies, we attempt to filter out underperforming models.
What are some thresholding methods?
Accuracy, Precision, Recall, and F-beta are some of the most popular metrics used for evaluating and thresholding machine learning models.
Accuracy is the ratio of correctly predicted samples to the total number of samples.
“Of all the samples, how many did we predict correctly?”
Precision is the ratio of correctly predicted positive samples to the total number of predicted positive samples.
“Of all the samples that we predicted to be positive, how many are actually correct?”
Recall is the ratio of correctly predicted positive samples to the total number of actual positive samples.
“Of all the positive samples, how many did we predict correctly?”
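F-beta is the weighted harmonic mean of precision and recall, defined as follows:
$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
A beta below 1 puts more weight on precision, while a beta above 1 puts more weight on recall.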
Traditional methods usually threshold by setting a minimum score for one or more of the metrics described above.
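To make the definitions above concrete, here is a minimal sketch that computes the four metrics from confusion-matrix counts and checks them against minimum scores. The function names and the example minimums are hypothetical and chosen for illustration; the F-beta line uses the standard formula (1 + beta^2) * P * R / (beta^2 * P + R).

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute accuracy, precision, recall and F-beta from confusion counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives and true negatives. Ratios with an empty denominator
    are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
    }


def passes_threshold(metrics, minimums):
    """Return True if every metric named in `minimums` meets its minimum score."""
    return all(metrics[name] >= floor for name, floor in minimums.items())


metrics = classification_metrics(tp=8, fp=2, fn=4, tn=6, beta=0.5)
# Illustrative minimums only; real values depend on the application.
ok = passes_threshold(metrics, {"precision": 0.75, "recall": 0.5})
```

Thresholding on several metrics at once, as in `passes_threshold`, is one way to read "one or more of the metrics"; a single-metric cutoff is the same call with a one-entry dictionary.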
How do we threshold our model? Why do we threshold this way?
We implement thresholding throughout multiple stages in our pipeline.
1. During data preparation, we prune all classes whose support (the number of training samples for that class) is below a threshold. Why do we deliberately shrink our not-so-big dataset? We cannot trust the model to make meaningful predictions based on only a few samples. Indeed, experiments show that as we increase this threshold, the model gains higher precision. However, since we are now predicting for fewer instances, the recall of our model drops. The change in precision and recall is shown in the figure below.
As we increase the threshold, we predict for fewer and fewer classes; therefore, our recall drops. At the same time, since the classes we still predict on have larger and larger support, the predictions are usually correct and thus our precision increases significantly.
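The pruning step could look roughly like the sketch below. It assumes the training data is a pair of parallel lists (messages and their template-response labels); the function name, the label names, and the `min_support` value are hypothetical.

```python
from collections import Counter


def prune_rare_classes(messages, labels, min_support):
    """Drop every sample whose class has fewer than `min_support` samples."""
    counts = Counter(labels)
    kept = [(m, y) for m, y in zip(messages, labels) if counts[y] >= min_support]
    kept_messages = [m for m, _ in kept]
    kept_labels = [y for _, y in kept]
    return kept_messages, kept_labels


messages = ["a", "b", "c", "d", "e"]
labels = ["refund", "refund", "hours", "refund", "shipping"]
# With min_support=2, the single-sample classes "hours" and "shipping" are removed.
pruned_messages, pruned_labels = prune_rare_classes(messages, labels, min_support=2)
```

Messages whose class was pruned can no longer be predicted correctly, which is why recall falls as `min_support` rises.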
2. During model validation/prediction, we threshold by setting K, the number of predictions we wish to return for every instance. For every prediction that we make, our classifier returns the prediction probability for each class, from which we take the top-K probabilities and return the corresponding classes as predictions. We define our evaluation metrics based on K as follows:
Precision@K: The number of cases for which the target response r was within the top-K responses ranked by the model, divided by the number of instances for which the model returned a non-empty set of top-K predictions.
Recall@K: The number of cases for which the target response r was within the top-K responses ranked by the model, divided by the number of instances in our validation set.
Setting a larger K almost always implies an increase in both Precision@K and Recall@K, since returning more predictions means that the target response is more likely to fall within the predictions. However, setting K is ultimately a business problem. For example, setting K = 10 when in reality we only show users our top 3 predictions will not yield meaningful model scores.
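As a rough illustration of these two definitions, the sketch below computes Precision@K and Recall@K from ranked predictions. It assumes each instance comes with a list of predicted classes ordered from most to least probable, which may be empty when nothing was predicted; the function name and sample data are hypothetical.

```python
def precision_recall_at_k(predictions, targets, k):
    """Compute (Precision@K, Recall@K).

    predictions: one ranked list of predicted classes per instance
                 (possibly empty).
    targets:     the target response class for each instance.
    """
    hits = 0
    non_empty = 0
    for ranked, target in zip(predictions, targets):
        top_k = ranked[:k]
        if top_k:
            non_empty += 1
            if target in top_k:
                hits += 1
    # Precision@K divides by non-empty prediction sets only.
    precision = hits / non_empty if non_empty else 0.0
    # Recall@K divides by every instance in the validation set.
    recall = hits / len(targets) if targets else 0.0
    return precision, recall


predictions = [[3, 5, 2], [1, 4, 2], [], [2, 3, 1]]
targets = [5, 3, 1, 2]
p_at_3, r_at_3 = precision_recall_at_k(predictions, targets, k=3)
p_at_1, r_at_1 = precision_recall_at_k(predictions, targets, k=1)
```

The two metrics differ only in the denominator, so they diverge exactly when some instances receive no predictions at all.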
3. Once we have a tuple of top-K predictions, we set a minimum confidence threshold to prune out predictions with high uncertainty. We implement two methods to achieve this:
- We threshold each individual predicted class. If a class’s probability is below our threshold, we discard that prediction. As a result, only predictions that we are very confident in will be shown to users.
- We average the prediction probabilities for each tuple of K predictions. If the average is below our threshold, we discard all K predictions and return an empty set. Otherwise, we return the entire tuple. By setting a low threshold (e.g. 0.1), we are able to filter out instances that are too ambiguous to be predicted on. For instance, if the average probability of a tuple of 5 predictions is 0.05, it is likely that the target message is not a good candidate for us to suggest responses. Therefore it’s best not to show the users any predictions at this time.
Figure 2 shows the effect of setting different confidence thresholds. As seen from the graph, while setting a low threshold improves F-Beta score, setting higher thresholds will result in nothing being predicted.
An example illustrating both methods is shown below.
Suppose we have 5 classes: 1, 2, 3, 4, 5
For a particular instance, the prediction probabilities are: 0.1, 0.2, 0.3, 0.15, 0.25
The confidence threshold is set to 0.3, and K is set to 3
Without any confidence thresholding, our top-K result would be [3, 5, 2], with corresponding probabilities [0.3, 0.25, 0.2]
Suppose we threshold individual classes. Since class 2 has a probability of 0.2, and class 5 has a probability of 0.25, both of which are below 0.3, both classes would be pruned from the set of K predictions. Class 3, with a probability of 0.3, is not below the threshold and is kept. Thus we would now return [3].
Suppose we threshold by averaging. The average probability of this set of predictions is (0.2+0.25+0.3)/3 = 0.25, which is less than our threshold. The entire tuple of top-K predictions is pruned, and we return [ ] as a result.
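The worked example can be reproduced with a short sketch. The function names are hypothetical, and one detail is an assumption: a probability (or average) exactly equal to the threshold is kept, since only values below the threshold are described as discarded.

```python
def top_k(probabilities, k):
    """Return (class, probability) pairs for the k most probable classes.

    Classes are numbered from 1, matching the example.
    """
    ranked = sorted(
        enumerate(probabilities, start=1),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


def threshold_individual(pairs, threshold):
    """Keep only the classes whose own probability is not below the threshold."""
    return [cls for cls, p in pairs if p >= threshold]


def threshold_average(pairs, threshold):
    """Keep the whole tuple if its mean probability is not below the threshold."""
    if not pairs:
        return []
    mean = sum(p for _, p in pairs) / len(pairs)
    return [cls for cls, _ in pairs] if mean >= threshold else []


probabilities = [0.1, 0.2, 0.3, 0.15, 0.25]
pairs = top_k(probabilities, k=3)             # classes 3, 5, 2
individual = threshold_individual(pairs, 0.3)  # only class 3 survives
averaged = threshold_average(pairs, 0.3)       # mean is about 0.25, so nothing survives
```

With the lower threshold of 0.1 mentioned earlier, `threshold_average(pairs, 0.1)` would return all three classes unchanged.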
4. At the end of our ML pipeline, we threshold for the last time using F-Beta score. As mentioned earlier, F-Beta measures the balance between Precision and Recall, with the beta parameter controlling the weight we put on each. In our early production model, we wish to focus on precision in order to boost user confidence in our model. This means that in cases of uncertainty, we would prefer to return nothing rather than a set of likely incorrect predictions. As users learn to trust our predictions, they will start using the SSR feature more often. This will in turn help us gather more data and boost our models’ performance. It is a positive feedback loop! Therefore, we discard models whose validation F-Beta scores are below our threshold, and temporarily disable the SSR function for these customers. Does this mean that these customers will be left out? Absolutely not! We will retrain models for these customers periodically. With more and more data coming in daily, the models will score above our threshold in no time!
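A minimal sketch of this final gate is shown below. It assumes each customer's model has a validation precision and recall available; the customer names, the beta of 0.5 (which favors precision), and the minimum score of 0.6 are hypothetical values chosen for illustration, not the ones used in production.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score; beta < 1 weights precision more heavily."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


def customers_to_enable(validation_scores, beta, min_f_beta):
    """Return the customers whose model scores at or above the F-beta threshold.

    validation_scores maps a customer id to a (precision, recall) pair.
    """
    return {
        customer
        for customer, (precision, recall) in validation_scores.items()
        if f_beta(precision, recall, beta) >= min_f_beta
    }


# Hypothetical validation results for two customers.
scores = {"acme": (0.9, 0.3), "globex": (0.4, 0.8)}
enabled = customers_to_enable(scores, beta=0.5, min_f_beta=0.6)
```

In this example the high-precision, low-recall model passes while the low-precision, high-recall model does not, which matches the precision-first emphasis described above. Customers left out of `enabled` would simply be re-evaluated after the next retraining run.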
Conclusion and Future Work
Thresholding is crucial for deploying and evaluating a successful ML model. Over time, thresholding strategies should evolve with models. For instance, we might wish to set higher thresholds as models start performing better, or we might start putting a heavier emphasis on recall rather than precision. When devising a thresholding strategy, one should always think from a business perspective; it is crucial to understand what value each strategy brings to the business application.