Ramji Chandrasekaran
Expedia Group Technology
5 min read · Aug 24, 2021


EXPEDIA GROUP TECHNOLOGY — DATA

Going Beyond Scalar Metrics: Behavioral Testing of NLP Models

Quantify model behavior for better insight into what your model does

Arctic icebergs in Ilulissat, Greenland (photo by Alexander Hafemann on Unsplash)

Numbers dominate our daily lives. We see them everywhere: speed limits, travel times, temperature, height, age, and more. However, we seldom use these numbers in isolation; they are more useful when put in context. Knowing someone’s weight alone is insufficient to conclusively determine their overall health; rather, we should also take their height, age, and other factors into consideration. Similarly, while planning travel through Expedia.com®, we select hotels not only by their star rating, but also by other attributes such as proximity to activities, parking availability, and so on. The same applies to how we build and compare machine learning models.

As I elaborated in the Model Calibration post, model predictivity alone is inadequate to differentiate a reliable model from an unreliable one. Similarly, two equally predictive models might have very different characteristics, which must be taken into account when choosing one model over another. Machine learning models are typically evaluated using a task-appropriate scalar metric: Precision, Recall, Accuracy, and F-Score for classification models; MSE and MAE for regression models. These metrics, while necessary, are insufficient to fully describe the characteristics of the models they measure, as scalar values provide only a summary view. Though this is not usually catastrophic, the characteristics of a customer-facing model should be understood in greater depth. Behavioral testing is a methodology that allows us to differentiate models and highlight the characteristics of each.

Why Do Behavioral Testing?

A behavioral test suite is designed to capture a model’s expected change, or invariance, in predictions using carefully curated test cases. Similar to unit testing of software, each test case targets a specific capability. However, unlike in software engineering, ML models cannot usually be improved by simply “fixing a bug”. Improving a model requires careful analysis of its misclassifications, establishing a pattern in the mistakes, and making targeted changes to the data, the model architecture, or both.

There are four different stakeholder perspectives that shape a behavioral test suite. For an intent classification model, these are:

  • Product Design: An intent classifier deployed to a point-of-sale should adequately understand the local vernacular
  • Field Ops: Utterances that are misclassified by a model are “bugs” and should be treated like software bugs
  • Data Science: One does not simply change a model’s predictions by adding one example to training data
  • User Experience: If we know the deficiencies of a model, we can adapt UX accordingly

Each of these perspectives defines the scope of test cases and their relative importance within a test suite. This serves as a gateway for each stakeholder to clearly define their expectations of a model, in alignment with wider project goals. Such relatively well-defined expectations enable data scientists to make better trade-offs during model selection. They also provide increased transparency into the relative strengths and weaknesses of each model, at a time when models themselves are becoming more opaque. Once a test suite has been established, we have automated validation in place to capture model regressions and be forewarned about any deviation in behavior. This is arguably better than being blindsided in production, after customers start using the model.

Behavioral Test Suite — Technical Details

Technically, a test suite for each model is composed of test capabilities, test types, and test cases. In the context of NLP models, test capabilities represent concepts such as Vocabulary, Part-of-Speech, Negation, Robustness, etc. Each capability is associated with one or more of the following test types:

  1. Minimum Functionality Tests or MFT
  2. Directionality Tests or DIR
  3. Invariance Tests or INV

An MFT is analogous to a software unit test: a prediction is expected to match a given label. DIR tests expect predictions to change in a certain way, given label-changing perturbations. INV tests, by contrast, introduce label-preserving perturbations to the input and expect the predictions to remain unchanged.

Lastly, test cases are created for each applicable combination of test capability and test type. Depending on the model type, not every capability or test type is applicable. For example, an intent classification task has no notion of implicit ordinality among its class labels, so a DIR test makes little sense for it. A sentiment classification task, on the other hand, does have ordinality among its labels, and a DIR test is perfectly suitable.

Test cases for an intent classifier, each qualified by its associated test capability and test type.
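
To make the three test types concrete, here is a minimal, library-free sketch. The predict_intent and predict_sentiment_score functions, the intent labels, and the utterances are hypothetical stand-ins, not our actual models or data.

```python
# Minimal sketch of the three behavioral test types, assuming hypothetical
# predict_intent() and predict_sentiment_score() functions.

def run_mft(predict_intent):
    # MFT: the prediction must match a fixed, expected label.
    assert predict_intent("I need to cancel my hotel booking") == "cancel_booking"

def run_inv(predict_intent):
    # INV: a label-preserving perturbation (here, swapping the city name)
    # should leave the prediction unchanged.
    original = predict_intent("Book me a room in Seattle")
    perturbed = predict_intent("Book me a room in Denver")
    assert original == perturbed

def run_dir(predict_sentiment_score):
    # DIR: a label-changing perturbation should move the prediction in a
    # known direction; appending a negative clause should not raise the score.
    before = predict_sentiment_score("The room was clean")
    after = predict_sentiment_score("The room was clean, but the staff was rude")
    assert after <= before
```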

Our test suite implementation follows the methodology outlined in CheckList[1] and uses the code open-sourced in marcotcr/checklist.
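
As a rough illustration of what this looks like with the checklist package, the sketch below builds a tiny suite with one MFT and one INV test. The templates, intent labels, and the dummy predict_proba function are placeholders for illustration, not our production models.

```python
# Sketch of a small behavioral test suite built with marcotcr/checklist.
import numpy as np
from checklist.editor import Editor
from checklist.test_types import MFT, INV
from checklist.test_suite import TestSuite
from checklist.perturb import Perturb
from checklist.pred_wrapper import PredictorWrapper

editor = Editor()
suite = TestSuite()

# MFT: templated paraphrases that must all map to a fixed intent label
# (label 0 stands in for a hypothetical CANCEL intent).
t = editor.template('I want to {verb} my hotel reservation',
                    verb=['cancel', 'call off'])
suite.add(MFT(t.data, labels=0, name='Cancellation vocabulary',
              capability='Vocabulary',
              description='Paraphrases of "cancel" should predict CANCEL'))

# INV: typo perturbations should not change the predicted intent.
data = ['Book a room in Paris', 'I need an extra bed in my room']
t = Perturb.perturb(data, Perturb.add_typos)
suite.add(INV(t.data, name='Robustness to typos', capability='Robustness',
              description='Typos should not change the predicted intent'))

# Run the suite against a wrapped prediction function. The uniform
# probabilities below are a dummy stand-in for a real intent classifier.
def predict_proba(texts):
    return np.ones((len(texts), 3)) / 3.0

suite.run(PredictorWrapper.wrap_softmax(predict_proba))
suite.summary()
```

The summary reports failure rates broken down by test, capability, and test type, which is the view stakeholders can review during model selection.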

Test Suite Curation and Coverage

Each test case is made up of several utterances, which are either generated from a template or foraged directly from anonymized customer conversations. There are two approaches to test case curation: bottom-up and top-down. In a bottom-up approach, we first gather failed utterances, group them, and either add them to an existing test case or create a new one. Alternatively, in a top-down approach, a test case is first proposed by a stakeholder, and supporting utterances are then created using templates or drawn from actual utterances.
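
A bottom-up test case might look something like the sketch below, where misclassified utterances are grouped into a new MFT; the utterances and the CANCEL label are hypothetical stand-ins for anonymized production data. The top-down path is the template-driven construction shown in the earlier snippet.

```python
# Bottom-up curation sketch: misclassified utterances (hypothetical examples
# standing in for anonymized production data) grouped into a new MFT.
from checklist.test_types import MFT

failed_utterances = [
    'scrap my reservation pls',
    'dont need the room anymore',
    'pls undo that booking',
]
# All of these should map to the hypothetical CANCEL intent (label 0).
test = MFT(failed_utterances, labels=0,
           name='Colloquial cancellations',
           capability='Vocabulary',
           description='Informal cancellation requests seen in production')
```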

Coverage for each test case is dependent on the permutability of examples and prevalence of real data. For instance, it is far easier to provide substantial coverage for a test case that only modifies a location or vendor name, as these values are easily permutable. Similarly, it is also straightforward to provide good coverage for a test case built by grouping failed production data. On the other hand, it takes much more effort and creativity to create examples that are novel paraphrases.

When a template is used to permute an example, there can be a lack of diversity in the data. The CheckList implementation provides a way to alleviate this, by using language models to suggest words. This ensures that templated examples are not limited to the creativity of the human authoring the test case. As the quality of the test suite is contingent upon its coverage, test case curation is the most labor-intensive step of this implementation.
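
As a sketch of how this works, CheckList's Editor can propose fill-ins for a masked slot, which a human then reviews before feeding them back into a template; the template and the selected verbs below are illustrative.

```python
# Sketch of diversifying a template with masked-language-model suggestions.
from checklist.editor import Editor

editor = Editor()  # uses a masked language model for {mask} suggestions

# Ask the language model for plausible fill-ins for the masked slot.
suggestions = editor.suggest('I need to {mask} my hotel booking')

# A human still reviews the suggestions; the ones judged label-preserving for
# the hypothetical CANCEL intent are then used as a regular template slot.
cancel_verbs = ['cancel', 'call off', 'scrap']  # illustrative review outcome
t = editor.template('I need to {verb} my hotel booking',
                    verb=cancel_verbs, labels=0)
```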

Scalability Concerns

It is natural to be concerned about the scalability and maintainability of such a test suite, especially if test cases are designed mostly top-down. However, this is no different from test-driven software development, where test case creation is factored into the development effort. What we really need is a mindset change about effort estimates in the model development lifecycle.

Conclusion

In this post, I presented the case for behavioral testing of NLP models as a methodology to mitigate the shortcomings of ubiquitous scalar evaluation metrics. We saw how a test suite naturally allows stakeholder input into model selection and enables everyone to set expectations about a model’s behavior after deployment. We also glimpsed the technical details of creating a test suite and its associated challenges. As ethical and responsible AI take center stage, it is our collective responsibility to at least know “what behavior” a model has, even if we cannot fully explain the “why”.

Learn more about technology at Expedia Group

References

  1. Ribeiro, Marco Tulio, et al. “Beyond accuracy: Behavioral testing of NLP models with CheckList.” arXiv preprint arXiv:2005.04118 (2020).
