Errudite: Scalable, Reproducible, and Testable Error Analysis

Error analysis is a compass, and we need it to be accurate.

Error analysis — the attempt to analyze when, how, and why machine-learning models fail — is a crucial part of the development cycle: Researchers use it to suggest directions for future improvement, and practitioners make deployment decisions based on it. Since error analysis profoundly determines the direction of subsequent actions, we cannot afford it to be biased or incomplete.

But how are people doing error analysis today? If you read some quotes from ACL papers (a top conference for NLP, or Natural Language Processing), this is what you see:

  • “We performed an error analysis on a sample of 100 questions.”
  • We randomly select 50 incorrect questions and categorize them into 6 classes.
  • We sample 100 incorrect predictions and try to find common error categories.

Apparently, the community has converged to this shared method:

“We randomly select 50–100 incorrect questions and roughly label them into N error groups.”

This seems reasonable; what could go wrong? A lot, it turns out. For example, a 50–100 sample size is too small, frequently covering less than 5% of the total errors. Such small samples are likely unrepresentative of the true error distribution. It could be disastrous if we deploy a model only because our sample happens to underestimate a crucial model deficiency (chatbot assistants making inappropriate replies to certain kinds of queries missing from the error sample, etc.).

Small sample size is just the first problem with the standard approach. In our ACL 2019 paper, we enumerate several key challenges for NLP error analysis, and raise three principles to advocate for a more precise and reproducible, scalable, and testable procedure. We also design Errudite, an interactive tool that instantiates these principles and addresses the problems with common ad hoc approaches to error analysis.

Below, we will walk through one specific case, and show how manual, subjective inspection of a small sample of errors can be ambiguous, biased, and miss the root cause of errors. We will also show how Errudite helps avoid pitfalls. Please see our paper for more cases, or watch this video for a demo!

The Scenario

Let’s use Errudite to analyze a well known Machine Comprehension (MC) baseline model: BiDAF. Given a question and a context paragraph, a MC model is supposed to find a snippet of the context that correctly answers the question.

In this example (from SQuAD, a well-known MC dataset), Murray Gold (bolded, ground truth) created the 2005 theme for Doctor Who.

As one of the most important testbeds for language understanding, MC error analysis is crucial yet challenging: experts are eager to find the weak spots of the state-of-the-art models, but with both inputs (question and context) and output (answer) to MC being unstructured text, they have limited features to break down and inspect the dataset. BiDAF is common enough in MC that most domain experts are familiar with it. In fact, our example comes from our real conversations with MC experts, in which we asked them to evaluate BiDAF’s strengths and weaknesses.

In the example above, BiDAF makes a mistake, predicting John Debney (underlined) instead of Murray Gold. We want to generalize from this one error, and understand our model performance more globally.

Our first question seeing this error is: Why does the model make this mistake? One hypothesis our domain experts came up with is the Distractor Hypothesis:

BiDAF is good at matching questions to named entity types 😄, but is often distracted by other spans with the same entity type 😞.

More specifically, in our example, when we ask “who”, BiDAF knows we are asking for a person name and gives us one, but it’s not the correct one.

With this hypothesis at hand, our second question is then, Does the model make similar mistakes often? Understanding the prevalence of an error hypothesis requires us to inspect more similar instances. As we’ve mentioned before, unstructured texts have limited features for exploring the dataset, and so people tend to group a subset of “distractors” by manually labeling error samples.

The problem is, an error group defined in this way is subjective, and experts can easily disagree on what one group name means. Imagine a corner case of the distractor error: the ground truth answer to a “when” question is “during his college year,” whereas our model returns a wrong year, “1996.” Some people consider this to be a “distractor” problem as the question type (“when”) and the predicted named entity (year) matches perfectly. However, others might categorize it as something else, because the ground truth is not a recognizable named entity. You won’t even realize this difference if you just see the name and text description of the error cause.

In fact, in our user study we observed such inconsistencies even for simple group definitions (please see our paper for details!): When given identical descriptions of an error type from a prior published analysis and asked to reproduce it, our expert users produced groups that vary in size from 13.8% to 45.2% of all errors — which further illustrates the ambiguity in subjective manual labeling. This leads us to our first principle:

P1: Error hypotheses should be defined precisely with concrete descriptions.

Principle 1: Be precise.

To overcome manual subjectivity and be more precise, Errudite uses a Domain-Specific Language (DSL) to quantify instances.

In short, the DSL applies a list of powerful extractors, on targets of an instance, with additional supporting operators. A simple case is to extract the character length of a question, and require it to be larger than 20.

Combinations of these three building blocks support the extraction of attributes from instances in various ways, so experts can answer the “how prevalent” question by objectively and systematically grouping instances with particular patterns (e.g., via filters on attributes).

We use the following query to define distractor errors:

These lines can be broken down into several semantically meaningful conditions:

  • Line 1, the ground truth is an ENTity, like “PERSON.”
  • Line 2–3, there are more tokens matching the ground truth entity type in the whole context than in the ground truth alone. Together with line 1, it will mean “find instances that have a potential distractor in the context.”
  • Line 4, the prediction entity type matches the ground truth one. Line 1–4 would mean, “our model finds a correct entity type.”
  • And finally, line 5, the prediction is incorrect, so that lines 1–5 define a group of instances that are distracted.

In contrast to just saying “distractor,” if you share these five lines to other researchers, they can precisely know we’ve excluded the previously described corner case. (To include it, our query might express a match between the question type — who, when, etc. — and the predicted named entity type.)

Principle 2: Cover all the data.

Applying our filters, 192 instances in the group are cases where BiDAF predicts a wrong span, but this span has the same entity type as the ground truth. Note that in addition to this precision, the DSL also makes error analysis scalable: our query filter for just one error category already exceeds the 50–100 sampling convention, which reduces the sampling error.

These 192 instances cover almost 6% of all BiDAF errors in our validation set. Looks convincing, right? This distractor hypothesis seems to be pretty solid! Should we now go and try to fix this problem in BiDAF?

If we apply all the partial filters and build all the other groups, we notice a different pattern: BiDAF predicts the exact correct span 68% of the time overall, which rises to 80% when the ground truth is an entity. When other entities with the same type are present in the passage, BiDAF is still 79% accurate (i.e., it is not particularly worse when there are potential distractors), and when it predicts an entity with the correct type, it is quite accurate — 88% of the time! This is much higher than the 68% exact match overall. This means BiDAF actually performs better when it has distractors and the entity type is matched. So if you just see the errors and decide this is a very important thing to fix, think twice, or you might miss something even bigger.

So, the DSL helps cover the entire dataset, including the correct instances. The error analysis is more systematic and scalable this way, and can give you different conclusions when compared to looking at a small sample of mistakes only. We formally state our second principle as:

P2: Error prevalence should be assessed over the entire dataset — including the true positive (non-error) examples.

Principle 3: Test error hypotheses to assess causality.

Now, we’ve defined groups related to distractors. But, the presence of distractors in a wrong prediction does not necessarily indicate that distractors were the root cause of the mistake. Turning back to our previous example, it’s easy to assume that it is wrong due to the distractor, but maybe it’s because we need to do multi-sentence reasoning to link “Doctor Who” with “the series”, or perhaps something else.

This leads to one more problem in the status-quo: We cannot effectively isolate the true cause of an error. To dig into root causes, we state a third principle:

P3: Error hypotheses should be explicitly tested.

In Errudite, we help answer this question, “Are the 192 instances really wrong because of the distractor?”, by asking a related what-if question: “If the predicted distractor was not there, would the model predict correctly?” We answer this question though counterfactual analysis with rewrite rules.

Leveraging our Domain Specific Language, Errudite uses rules to rewrite all instances in a group. We can verify whether distractors are causing mistakes by using a rewrite rule on the is_distracted group: We rewrite the context by replacing the model’s predicted distractor string. We swap in a meaningless placeholder “#”, so it won’t be detected as an entity anymore.

Once rewritten, we ask our model to perform prediction again. In our previous example (the first below), with “John Debney” replaced, we now get a different distractor prediction — “Ron Grainer.” It seems that another distractor is still confusing the model!

As for the rest of the group: changing to a different incorrect entity occurs 29% of the time. Another 48% of the time, the predictions were corrected, so indeed the distractors were causing flawed predictions. However, for the remaining 23%, the model predicts the same span as before, except now this contains the meaningless hash token! Some other factors are likely at play: Maybe the predicted sentences heavily overlap with the question, and almost forced the model to do a naive token matching rather than searching for entities. This kind of counterfactual analysis helps develop insights not available through grouping alone.

Precise + Reproducible + Re-applicable

What do we get out of this running story? Throughout the analysis, we were able to build attributes, groups, and rewrite rules with precise queries.

We applied them to BiDAF, and found that BiDAF is not particularly bad at distractors, and that errors seemingly due to distractors may be wrong for other reasons. Beyond that, with the queries saved, we can easily share them, and re-apply them to the same or different models or data whenever we want.

Bonus: User Interface Functionality

Errudite provides a graphical user interface that not only integrates the analysis process, but also provides additional exploration support such as visualizing data distributions, suggesting potential queries, and presenting the grouping and rewriting results.

Suggestion via Programming-by-Demonstration

To make it easier to write DSL queries, we use programming by demonstration to generalize extraction patterns. When one interactively selects certain tokens, Errudite provides a list of potentially relevant queries to help find related instances.

Attribute distribution

To guide exploration, group creation, and refinement, Errudite supports defining complex attributes (in this figure, the entity type of the ground truth answer), and inspecting their distributions in and out of created groups.

A Recap!

In this work, we characterize deficiencies with current error analysis methods in NLP: they are subjective, they are laborious, they ignore the correct cases, and they do not have tests. Errudite does a line-by-line response, making the error analysis more precise, reproducible, scalable, and testable.

Error Analysis Beyond NLP

The current Errudite implementation (specifically, its DSL) focuses on NLP. However, machine learning systems also impact our lives through non-textual media (thing of self-driving cars, power networks, etc.). We are convinced that, to help make the right decisions — deploy the right model, pursue the right research direction — our three principles can, and should, be applied to more areas than NLP. Similar tools putting these principles into practice can be easily built, so long as they support: (1) building precise instance groups with composable building blocks in a domain-specific language; (2) scaling the analysis to cover all the relevant successes and failures by automatically building large groups with filtering queries, and providing visual summaries for them; and (3) testing error hypotheses using counterfactual analysis by rewriting the instances with rules.

For instance, we can imagine the Computer Vision domain benefiting from a DSL that supports object detection. Various perturbation methods have been “rewriting” images to test model robustness, and we believe that they could be modified for the purpose of understanding why models fail in certain groups.

Error Analysis for Your Own Tasks

The Errudite software is architected to be extensible to other NLP tasks. If you would like to try it out, please visit the Errudite repo, and see the tutorials!

This article was authored by Tongshuang (Sherry) Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Dan Weld.

Errudite Open Source repo:

Errudite ACL 2019 Paper:

Data visualization and interactive analysis research at the University of Washington.