Plausible Negative Examples for Better Multi-Class Classifier Evaluation
At ACL 2019, we introduced nex-cv: a metric based on cross-validation, adapted for improving small, unbalanced natural-language datasets used in chatbot design. The main idea is to use plausible negative examples in the evaluation of text classification.
Our experience comes from building recruitment chatbots that mediate communication between job-seekers and recruiters by exposing the ML/NLP dataset to the recruiting team. Recruiter teams are motivated by scale and accessibility to build and maintain chatbots that answer frequently asked questions (FAQs) based on ML/NLP datasets.
Enterprise clients may have up to 100K employees and a commensurate hiring rate. Over 50% of end-user (job-seeker) traffic occurs outside of working hours or during holidays (consistent with anecdotal reports that using the chatbot reduces email and ticket load); we dig into this in another post.
Evaluation approaches must be understandable to various stakeholders, and useful for improving chatbot performance. We validated the metric on seven recruitment-domain datasets in English and German over the course of one year.
Data Quality Improvements
Existing chatbot guidelines include “transparency” as an important topic, but, in practice, why something does not work, and under what conditions, can puzzle designers and developers, not just end-users (we’ve written about this in another paper and post). The nex-cv metric helps because it:
- produces accuracy scores more in line with human judgment than, for example, F1 on cross-validation (details in paper)
- ensures that data from low-population classes is included in both training and testing, at least as a “negative example”
- can be used to generate internal recommendations and as part of ongoing data quality maintenance
The principle of these “recommendations” is to solve the worst problems first. The most common problem we see is a smaller category that overlaps in meaning with a larger one and degrades both. The above graphic shows suggested pairs to focus on. The category Remote_work has fewer questions (only 15) than Company_location (which has 26). This may mean that Remote_work should be merged into Company_location, redistributed elsewhere in the dataset, or enriched and refined. Our internal tools use the metric to provide actionable guidance that has helped improve overall data quality significantly and consistently. The above recommendation is a generated text summary of confusion matrix results, used by non-developer staff to improve data quality.
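To give a feel for how a confusion-matrix summary can become a text recommendation, here is a minimal sketch. It is not our internal tool: the function names, the message wording, the Salary class, and the toy predictions are all made up for illustration. Only the class sizes for Remote_work (15) and Company_location (26) come from the example above.

```python
from collections import Counter


def confused_pairs(true_labels, predicted_labels):
    """Count misclassifications per unordered pair of classes."""
    pairs = Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth != guess:
            pairs[frozenset((truth, guess))] += 1
    return pairs


def recommend(true_labels, predicted_labels, class_sizes, top_n=3):
    """Describe the most-confused class pairs, worst first."""
    suggestions = []
    ranked = confused_pairs(true_labels, predicted_labels).most_common(top_n)
    for pair, errors in ranked:
        # Name the smaller class first: it is the likelier candidate for cleanup.
        smaller, larger = sorted(pair, key=lambda c: (class_sizes[c], c))
        suggestions.append(
            f"{smaller} ({class_sizes[smaller]} questions) overlaps with "
            f"{larger} ({class_sizes[larger]} questions): {errors} confusions. "
            f"Consider merging, redistributing, or enriching {smaller}."
        )
    return suggestions


# Hypothetical evaluation output; not real data.
class_sizes = {"Remote_work": 15, "Company_location": 26, "Salary": 30}
true_labels = ["Remote_work", "Remote_work", "Company_location",
               "Salary", "Salary", "Company_location"]
predicted_labels = ["Company_location", "Company_location", "Remote_work",
                    "Salary", "Company_location", "Company_location"]

suggestions = recommend(true_labels, predicted_labels, class_sizes)
for line in suggestions:
    print(line)
```

Counting confusions per unordered pair (rather than per direction) keeps the output short, which matters when the reader is a recruiter rather than a developer.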
Data Quality and Maintenance in Chatbots
Classes — chatbot “intents” — are trained with synthetic data and constitute anticipated use, rather than actual use. Existing general chatbot platforms include this synthetic data step as part of design and maintenance.
For example, when it comes to invocations for a voice agent (Ali et al., 2018)*, dataset construction encodes findings about how users imagine asking for an action: the authors use crowdsourcing to achieve both consistency useful for classification, and reflection of user expectations in the dataset.
We work on enabling domain-experts (recruiters) to maintain the dataset, which helps map end-user (jobseeker) needs to recruiters’ goals.
Data cleaning is not only relevant to chatbots. Model-agnostic systems for understanding machine learning can help iteratively develop machine learning models (Zhang et al., 2019)*. Feature engineering can be made accessible to non-developers or domain experts, e.g. (Ribeiro et al., 2016)*. We make use of representative examples in the process that surfaces nex-cv to non-developers, using the inspection-explanation-refinement approach employed in (Zhang et al., 2019)*. Enabling non-developers to perform data cleaning effectively allows developers to focus on model adjustments and feature engineering.
There are many ways to measure overall chatbot quality, such as manual check-lists of high-level feature presence (Kuligowska, 2015; Pereira and Díaz, 2018)*. User behavior measurements — both explicit, like ratings or feedback, and implicit, like timing or sentiment — are explored in (Hung et al., 2009)*. During metric development, we used qualitative feedback from domain-expert users, and key performance indicators (KPIs), such as automatic response rate. The use of a classifier as a component in a complex flow demands robust and actionable evaluation of that component.
* — see paper for full references
Code & Implementation
The code, available online at https://github.com/jobpal/nex-cv, provides the evaluation implementation, an abstract black-box definition for a classifier, and two strategies to help test an implementation. For integration testing, CustomClassifier.test() can be used to check the consistency of the classifier wrapper. For functional testing, nex-cv with both K = 0 (Alg. 2) and P = 0 (Alg. 2) should yield results similar to 5-fold cross-validation.
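To see why K = 0 should behave like plain cross-validation, here is a minimal sketch of one possible reading of K: a class-size cutoff below which a class only contributes negative examples. That reading is an assumption for illustration, and the code is not taken from the repository; the NEGATIVE label, the function name, and the toy data are hypothetical, and P is not modelled at all. See the paper for the actual algorithms.

```python
from collections import Counter

NEGATIVE = "__negative__"  # hypothetical placeholder label, not from the repository


def relabel_small_classes(examples, k):
    """Relabel every example whose class has fewer than k members as NEGATIVE.

    examples: list of (text, label) pairs.
    k: assumed class-size cutoff; with k = 0 no class is ever below it.
    """
    sizes = Counter(label for _, label in examples)
    return [
        (text, label if sizes[label] >= k else NEGATIVE)
        for text, label in examples
    ]


# Toy data, made up for illustration.
data = [
    ("Where is the office?", "Company_location"),
    ("Which city are you based in?", "Company_location"),
    ("Can I work from home?", "Remote_work"),
]

unchanged = relabel_small_classes(data, k=0)
with_negatives = relabel_small_classes(data, k=2)
print(unchanged == data)
print(with_negatives[-1])
```

Under this reading, a cutoff of zero leaves the dataset untouched, so whatever cross-validation follows sees the original classes. The small class still appears in the data when K is larger, just under the negative label.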
Evaluation and Improvement of Chatbot Text Classification Data Quality Using Plausible Negative Examples. Kit Kuksenok & Andriy Martyniv — presented at ACL 2019 Workshop on NLP for Conversational AI, August 1, 2019