Evolving From Descriptive to Prescriptive Analytics: Part 6, Discovering Blind Spots in the Data
By Shaikh Quader and Chad Marston
We built a classification model for IBM’s customer help desk to predict whether an open support ticket would turn into an escalation. While exploring the training dataset, we encountered a few key characteristics that we had not anticipated at the outset. As we got deeper into model building, we realized that awareness of these characteristics was essential for building the ML model correctly. We’ve since seen some of the same characteristics in several later ML projects, so today we’ll share three of them with you.
Finding a needle in the haystack — Model accuracy alone is not enough
The training dataset had 2.5M examples of support tickets that were not escalated and another 10k tickets that were escalated. The ratio between these two classes was extremely imbalanced, 250:1. In other words, 99.60% of the tickets came from the non-escalation class. For a classification model, people often lean heavily on accuracy as the most important evaluation metric. We thought the same initially, but then realized that accuracy was not actually what mattered most in this project. Why was that?
Notice that a trivial classifier that labeled every test example as “non-escalation”, regardless of its actual class, would have a prediction accuracy of 99.60%. That model would be highly accurate on the majority class (“non-escalation”), but it would catch 0% of the minority class (“escalation”). That’s obviously useless, since the business needs the model to predict escalations in order to intervene in advance. Understanding how the business will use the ML model is critical for choosing the right evaluation metric. In this project, we needed a good balance between precision and recall, not just high overall accuracy. Precision and recall measure how well the model finds the needle in the haystack:
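A quick back-of-the-envelope sketch makes the point concrete. The class counts below are the ones from this project; the trivial classifier is purely hypothetical:

```python
# Illustrative sketch: the "always predict non-escalation" classifier
# on the imbalanced dataset described above (2.5M vs. 10k examples).
n_non_escalated = 2_500_000  # majority class: tickets never escalated
n_escalated = 10_000         # minority class: tickets that escalated
total = n_non_escalated + n_escalated

# The trivial model predicts "non-escalation" for every ticket, so it
# is correct on all majority examples and wrong on every escalation.
accuracy = n_non_escalated / total
escalations_caught = 0 / n_escalated

print(f"accuracy: {accuracy:.2%}")              # 99.60%
print(f"escalations caught: {escalations_caught:.0%}")  # 0%
```

High accuracy here is an artifact of the imbalance, not a sign that the model is useful.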
- Precision: Precision measures what fraction of the model’s positive predictions are correct. For example, if the model identifies 10 tickets as “escalation” but only 6 of them are actual escalations, the model has a precision of 6/10, or 0.60.
- Recall: Recall measures what fraction of the actual positives the model finds. If the dataset has 10 actual escalations and the model correctly identifies 8 of them, the model has a recall of 8/10, or 0.80.
Usually, there’s a trade-off between precision and recall: improving one can drop the other. It’s up to the business stakeholders to tell the data scientists which is more important. Is it identifying more actual escalations at the cost of flagging more non-escalations as escalations (high recall, low precision)? Or minimizing false alarms at the cost of missing many actual escalations (low recall, high precision)?
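The two definitions above can be written as small helper functions. This is a generic sketch using the toy numbers from the bullets, not code from the project:

```python
# Hedged sketch: precision and recall from raw confusion counts.
def precision(true_positives, false_positives):
    """Of everything the model flagged as escalation, how much was right?"""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Of all actual escalations, how many did the model catch?"""
    return true_positives / (true_positives + false_negatives)

# The model flags 10 tickets; 6 are actual escalations.
print(precision(true_positives=6, false_positives=4))  # 0.6
# The dataset has 10 actual escalations; the model finds 8 of them.
print(recall(true_positives=8, false_negatives=2))     # 0.8
```

Notice that the two metrics share the true-positive count but divide by different totals, which is exactly why pushing one up tends to push the other down.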
If the business stakeholders go for high recall and low precision, they will need to engage more people to deal with a larger number of real escalations and possibly many false alarms. If they choose low recall and high precision, they can engage fewer people but will risk having the model miss many real escalations. In our case, the business stakeholders initially preferred high precision so that they didn’t have to deal with a lot of false escalation alerts.
Signal leakage — A deadly sin with time series features
Our dataset had a few features whose values changed over time. This introduced us to a phenomenon called signal leakage: inadvertently including information from the future in a feature, which is a form of cheating.
As an example, let’s imagine ABC Bank, a customer, opens a support ticket with the help desk on August 1st. Initially, the ticket has Severity 4. On August 6th, the customer raises the severity of the ticket to 1. Not satisfied with the pace of resolution, on August 8th, the customer escalates the ticket to the Help Desk Manager.
In summary, here’s the timeline of this ticket and changes in value of Severity:
- Aug 1: Ticket opened, Severity 4
- Aug 6: Severity raised to 1
- Aug 8: Ticket escalated, Severity 1
We want our model to predict the escalation risk of a ticket one week in advance. When engineering features from the historical training dataset, we therefore can’t use the final value of severity (set on Aug 6th in the example above). We have to use the value the ticket’s severity had a full week before the escalation date (Aug 1st in this example).
We had several such time-dependent features in the data, and the dataset we initially received contained only their final values. Thankfully, awareness of signal leakage led us to an additional dataset that contained the history of changes to these features.
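Given such a change-history table, the leakage-safe feature is the latest value known as of the cutoff date. Here is a minimal pandas sketch of that "as-of" lookup for the ABC Bank example; the column names and schema are illustrative, not the project's actual data model:

```python
# Hedged sketch: pick the severity value known one week before escalation,
# instead of the leaked final value. Columns and dates are illustrative.
import pandas as pd

history = pd.DataFrame({
    "ticket_id": [42, 42],
    "changed_at": pd.to_datetime(["2020-08-01", "2020-08-06"]),
    "severity": [4, 1],  # opened at Severity 4, raised to 1 on Aug 6
})

escalated_at = pd.Timestamp("2020-08-08")
cutoff = escalated_at - pd.Timedelta(days=7)  # Aug 1

# Keep only changes known at the cutoff, then take the most recent one.
known = history[history["changed_at"] <= cutoff]
severity_feature = known.sort_values("changed_at")["severity"].iloc[-1]
print(severity_feature)  # 4, not the leaked final value of 1
```

The same pattern applies to every time-dependent feature: filter the history to the cutoff first, then aggregate.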
The nuances of labels
If you’re lucky, your dataset for building a classification model might already include labels, the target variable that the model will predict, but often the labels aren’t available out of the box. (We’ll tell such a story in a future blog post.) For this escalation prediction project we did have labels, but as we did exploratory data analysis with domain experts, we realized that the labels had nuances, and understanding those nuances was essential for designing the ML model correctly. It’s worth clarifying these questions about labels upfront:
- How are the labels generated? By a system or by humans?
- If a formula does the labeling, what is that formula?
- Does the formula cover all scenarios? Does it miss some? Does it conflict in some cases?
- Can we rely on the labels to be 100% accurate?
In this project we had reliable labels, since the system generates them automatically when customers escalate their tickets. However, the labels contained subcategories of escalations, and not all subcategories were equally important to the business. For example, our dataset had a particular type of escalation for tickets escalated within a few hours of being opened. Because our training dataset was updated only once a day, there was no point in predicting these short-lived escalations, and we could safely exclude that type from model training. Instead, we focused on the subcategories of escalations that remained open for longer than a day.
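Excluding a label subcategory like this is a simple filtering step during dataset preparation. The sketch below shows one way it might look in pandas; the column names and the 24-hour threshold are illustrative assumptions, not the project's actual schema:

```python
# Hedged sketch: drop escalation examples a daily-refresh model can't
# act on. Columns, values, and the threshold are all illustrative.
import pandas as pd

tickets = pd.DataFrame({
    "ticket_id": [1, 2, 3],
    "escalated": [True, True, False],
    "hours_open_at_escalation": [3.0, 40.0, None],
})

# A model scored once a day cannot intervene on escalations that
# happen within hours of the ticket being opened.
short_lived = tickets["escalated"] & (tickets["hours_open_at_escalation"] < 24)
training_set = tickets[~short_lived]
print(training_set["ticket_id"].tolist())  # [2, 3]
```

Ticket 1 is dropped because its escalation fired faster than the daily refresh cycle could react; the non-escalated ticket and the slower escalation both stay in the training set.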
The three data challenges described above are certainly not exhaustive. Each project might have its own unique issues, but many projects run into similar challenges. Recognizing common data challenges early and having a plan to deal with them will almost certainly make your data science teams more efficient.
We hope you’ll check out the previous posts in this series: