Disclaimer: The following discussion is aimed at clarifying terminology and concepts, but may be disappointing when expecting actionable insights. In the best case, it may provide inspiration for how to approach quality assurance in ML-enabled systems and what kind of techniques to explore.
TL;DR: I argue that machine learning corresponds to the requirements engineering phase of a project rather than the implementation phase and, as such, terminology that relates to validation (i.e., do we build the right system, given stakeholder needs) is more suitable than terminology that relates to verification (i.e., do we build the system right, given a specification). That is, machine learning suggests a specification (like specification mining and invariant detection) rather than provides an implementation for a known specification (like synthesis).
I have long been confused by the term machine-learning bug, especially when referring to the fact that a machine-learned model makes incorrect predictions. Papers discussing the quality of machine learning models and systems tend to be all over the place about this. It is tempting to borrow terms like faults, bugs, testing, debugging, and coverage from software quality assurance for models in machine learning, but discussions are often vague and inconsistent, given the lack of hard specifications. We tend to accept that model predictions are wrong in a certain number of cases — for example, 95% accuracy on training and test set might be considered quite good — so we don’t tend to call every incorrect prediction a bug. When a model performs poorly overall, we may be inclined to explore whether we picked the right learning technique, the right hyperparameters, or the right data — are these bugs? When a model makes more systematic mistakes (e.g., consistently performs poorly for people from minorities), we tend to explore solutions to narrow down causes, such as biased training sets — this may feel like debugging, but are these bugs?
The term bug is usually used to refer to a misalignment between a given specification and the implementation. Though we rarely ever have a full formal specification for a system, we can typically still say that a program misbehaves when it crashes or produces wrong outputs, because we have a pretty good sense of the system’s specification (e.g., thou shall not exit with a null pointer exception, thou shall compute the tax of a sale correctly). In the form of method contracts, the specification helps us to assign blame when something goes wrong (e.g., you provided invalid data that violates the precondition vs. I computed things incorrectly and violated postconditions or invariants).
Talking about specifications in machine learning is difficult. We have data that somehow describes what we want, but also not really: We want some sort of generalization of the data, but we don’t want a precise generalization of the training data — it’s quite okay if the model makes wrong predictions on some of the training data if that means it generalizes better. Maybe we can talk about some implicit specification derived from some higher system goals (e.g., thou shall best predict the stock market development, thou shall predict what my customer wants to buy), but it’s unclear where such a specification comes from or how it could be articulated. Maybe a probabilistic specification would just be to maximize accuracy on future predictions? This vague notion of specification is confusing (at least to me) and makes it very hard to pin down what “testing”, “debugging”, or “bug” mean in this context.
To make things more concrete, let’s take the well known and highly controversial COMPAS model of predicting recidivism for convicted criminals, here approximated with an interpretable model created by Cynthia Rudin for her article Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead with CORELS:
IF age between 18–20 and sex is male THEN predict arrest (within 2 years)
ELSE IF age between 21–23 and 2–3 prior offenses THEN predict arrest
ELSE IF more than three priors THEN predict arrest
ELSE predict no arrest
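The rule list above can be read directly as an executable specification. A minimal Python sketch (the function name and input encoding are my own; this is a re-encoding of the rules, not the actual COMPAS system):

```python
def predict_arrest(age: int, sex: str, priors: int) -> bool:
    """Rule-list model approximating COMPAS recidivism predictions."""
    if 18 <= age <= 20 and sex == "male":
        return True  # predict arrest within 2 years
    if 21 <= age <= 23 and 2 <= priors <= 3:
        return True
    if priors > 3:  # more than three prior offenses
        return True
    return False
```

Note how little translation is needed: the learned model already *is* a description of intended behavior.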
Given some information about an individual at a parole hearing, the model predicts whether the individual is likely to commit another crime, which may be used for sentencing or for deciding when to release that person. What is the specification for how COMPAS should work? Ideally it would always correctly indicate whether any specific individual would, in the future, commit another crime — but we don’t know how to specify this. Even in a science fiction scenario it seems more likely that we figure out how to manipulate behavior rather than to predict human future behavior. Hence any prediction must generalize, and our goal might be to create a best approximation of typical behavior patterns. In practice, a model like the one shown above is learned from past data of released criminals to generalize patterns about who is more or less likely to commit future crimes. The model won’t perfectly fit the training data and won’t perfectly predict the future behavior of all individuals; at best we can measure accuracy on some training or evaluation data, and maybe see how many individuals predicted not to commit another crime actually do so (the opposite is hard to measure if we don’t release them in the first place). The model will be imperfect, and as broadly discussed among researchers and journalists, may even produce feedback loops and influence how people behave. Note that the challenge is how to evaluate whether the model is “acceptable” in some form, not whether it is “correct”.
Models are learned specifications
I think the confusion comes from a misinterpretation of what a machine-learned model is. A machine-learned model is not an attempt at implementing an implicit specification; a machine-learned model is a specification! It is a learned description of how the system shall behave. The specification may not be a good specification, but that’s a very different question from asking whether our model is buggy!
Consider the COMPAS model again. It is a specification of how the system should predict the recidivism risk. We could now take that specification and implement it, say, in Java. If the Java implementation messed up the age boundaries, this would be an implementation bug — our implementation does not behave as specified in the model. However, if the Java implementation is implemented as specified, but we don’t like the behavior, we should not blame the implementation, we should blame the specification. The problem is not that we have implemented the system incorrectly for the given specification, but that we have implemented the wrong specification.
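The spec-vs-implementation distinction can be made concrete with a differential check: treat the learned rules as the specification and compare an implementation against it on many inputs. A sketch in Python rather than Java, with a deliberately seeded boundary bug (all names and the bug are my own):

```python
def spec(age: int, sex: str, priors: int) -> bool:
    """The learned rule list, treated as the specification."""
    if 18 <= age <= 20 and sex == "male":
        return True
    if 21 <= age <= 23 and 2 <= priors <= 3:
        return True
    return priors > 3

def implementation(age: int, sex: str, priors: int) -> bool:
    """A hand-written implementation with a classic off-by-one bug."""
    if 18 <= age < 20 and sex == "male":  # bug: excludes age 20
        return True
    if 21 <= age <= 23 and 2 <= priors <= 3:
        return True
    return priors > 3

# Exhaustively compare implementation against specification.
mismatches = [(a, s, p)
              for a in range(18, 30)
              for s in ("male", "female")
              for p in range(0, 6)
              if spec(a, s, p) != implementation(a, s, p)]
```

Here the mismatches (20-year-old males with few priors) reveal an implementation bug in the traditional, verification sense; whether the specification itself is acceptable is a separate, validation question.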
In software engineering, there is a fundamental and common distinction between validation and verification (wikipedia). Validation is checking whether the specification is what the stakeholders want — whether we build the right system. Verification is checking whether we implement the system corresponding to our specification — whether we build the system right. Validation is typically associated with requirements engineering, verification with quality assurance of an implementation (e.g., testing).
If the machine-learned model is the specification, we usually do not worry about implementing the specification correctly, because we directly derive the implementation from the specification or even just interpret the specification. Bugs in the traditional sense — implementing the specification incorrectly — are rarely a concern. Verification is not the problem. What we should worry about is validation, that is, whether we have learned the right specification.
A requirements engineer’s validation mindset
Requirements engineers deal with validation problems all the time — they are the validation experts. Their main task is to come up with specifications and make sure that they are actually the right specifications.
Typically, requirements engineers will read documents and interview stakeholders to understand the problem and the needs of the system (I’m oversimplifying and limiting the discussion to interviews). Requirements engineers will proceed iteratively and propose some specifications, maybe build a prototype, and then go back to the customer and other stakeholders to see whether they have gotten the specifications right. Each iteration might provide insights about what was missing in the previous specification and how to come up with a better one (e.g., interviews with stakeholders not previously considered). Identifying specifications poses many challenges — which is exactly what requirements engineers study — such as interviewing representative stakeholders and avoiding injecting bias into those interviews. There are also many techniques, such as storyboarding, prototyping, and A/B testing, to check whether a specification meets user needs or system goals.
Conflicting requirements are also common during validation. Different stakeholders may have different goals and views (e.g., the company’s goals vs. the employees’ goals vs. the customers’ goals vs. privacy requirements imposed by law). A key job of a requirements engineer is to detect and resolve vague, ambiguous, and conflicting specifications. Detection can be performed through manual inspection of text but also with automated reasoning about models (e.g., there is a huge body of work on using model checking to find conflicts in specifications). Resolving issues found during validation often requires going back to users and other stakeholders to discuss and negotiate tradeoffs.
Machine learning is like requirements engineering
Assuming machine-learned models are specifications, machine learning has a very similar role and very similar challenges to requirements engineering in the software engineering lifecycle.
- We need to identify relevant and representative data, just as requirements engineers need to identify representative stakeholders: Selecting insufficient or biased data may lead to inadequate specifications.
- We need to make sure that the models represent what our stakeholders want, e.g., trading off the larger goal of minimizing jail times while reducing recidivism with the risk of making mistakes. We cannot satisfy all stakeholders equally and will need to make compromises in our specifications.
- In both machine learning and requirements engineering, we may need to balance user-desired functionality against laws governing privacy, fairness, or security; we may need to mediate between conflicting demands, say, a company’s interest in increasing revenue versus users’ safety concerns. Typically, in both machine learning and requirements engineering, it takes many iterations to identify what kind of system would serve stakeholders best and how to resolve conflicts. In the figure above, notice that we need to align multiple specifications, some of which may be learned, others given by users, or set by law.
- When we identify that the specification does not fit, we have often gained valuable insights and in some cases learn something immediately actionable. For example, we may identify missing requirements or underrepresented views, leading to collecting more data or different kinds of data for learning or requirements engineering. Not surprisingly, there are feedback loops both in traditional requirements engineering and in machine learning.
Software engineering researchers may appreciate an analogy with specification mining and invariant detection concepts in our field. In both cases, we come up with candidate specifications of a system, which stakeholders need to validate for fit, but which can be helpful for understanding or debugging the system.
Benefits from a requirements engineering mindset
Thinking of quality assurance in machine learning as validation rather than as verification solves a number of problems and encourages exploring different perspectives:
- It becomes clear why the term model bug makes little sense. Verification techniques, such as generating test cases to improve coverage while somehow trying to solve the oracle problem, do not seem to be the path toward validation. Instead we should adopt a requirements engineering mindset of asking whether a model fits. (Yes, bugs can still exist in the machine learning framework itself or in the inference infrastructure, but that’s a different issue.)
- Validation requires going back to stakeholders to check fit. While this can be approximated with a training-test split of the given data, typically telemetry is collected to monitor fit in production (e.g., to detect concept drift and identify when the specification no longer fits) or to enable A/B testing to check which of several candidate requirements fits better.
- Maybe not surprisingly, we should expect model-fit problems to come from how the data was selected and how much data was selected (in line with how and how many stakeholders are selected for interviews), how data was prepared (in line with how interviews must be carefully conducted to avoid bias), and how the right modeling approach was chosen (in line with what process was followed for synthesizing requirements from interviews).
- We should invest in techniques to identify why models do not fit and for whom they do not fit. While not fundamentally different from debugging, this perspective may encourage us to focus on identifying which data is missing, misleading, or underrepresented — and on how the data is analyzed and generalized.
- Checking fairness constraints and other invariants is a natural form of checking compatibility among multiple specifications. Testing and formal methods can likely help here at the model level. For many stakeholders, it may be natural to express partial requirements as invariants or constraints (e.g., fairness constraints and metamorphic relations, such as: outputs shall be independent of gender; outputs for important inputs should be robust to certain perturbations; with all other features held constant, outputs for larger values of a feature should always be larger than for smaller values). Mining invariants may be a good way to gain a better understanding of black-box models. It should not be surprising if conflicts are found; as with requirements conflicts, resolving them will require prioritizing and compromise, often negotiating with multiple stakeholders or trading off multiple objectives.
- Requirements engineers pay detailed attention to the interface between the world and the machine, that is, how a program interacts with the world, what it can know about the world, and how it can affect the world. This mindset is very useful to think about the potential influence of machine-learning systems, including thinking about potential side effects and feedback loops.
- Unfortunately, most machine-learned models are not as easily interpretable as the recidivism model above. Indeed, we can often only explain little about learned models and treat them as black boxes (e.g., deep neural networks). How to best validate opaque specifications, how to ensure the compatibility of multiple specifications, some of which are black boxes, and how to make sure that black-box specifications fit the stakeholders’ needs is largely an open research challenge — one to which the requirements-engineering community can likely contribute.
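The invariant-checking idea from the list above can be sketched as a metamorphic check against the rule model: flipping a protected attribute should not change the prediction (a partial specification, not the whole one). All names are mine, and the model is my re-encoding of the rules shown earlier:

```python
def predict_arrest(age: int, sex: str, priors: int) -> bool:
    """Stand-in rule-list model (my approximation of the COMPAS rules)."""
    if 18 <= age <= 20 and sex == "male":
        return True
    if 21 <= age <= 23 and 2 <= priors <= 3:
        return True
    return priors > 3

def violates_sex_independence(age: int, priors: int) -> bool:
    """Metamorphic relation: the prediction should not depend on sex."""
    return (predict_arrest(age, "male", priors)
            != predict_arrest(age, "female", priors))

# Search the input space for counterexamples to the invariant.
violations = [(a, p)
              for a in range(18, 70)
              for p in range(0, 6)
              if violates_sex_independence(a, p)]
```

Because the first rule uses sex directly, the check finds violations for 18-to-20-year-olds with few priors — a conflict between two specifications (the learned model and the fairness constraint) that, as argued above, must be resolved by negotiation and tradeoffs rather than by “fixing a bug”.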
In summary, when “testing” machine learning systems, we should frame the discussion as validation rather than as verification. I suspect we can learn a lot from requirements engineering. At a minimum, this view might help us get our terminology straight: We should talk about model fit and conflicting specifications, rather than about model bugs. I guess now I need to go and read a requirements engineering book more carefully.
Further readings: At Carnegie Mellon, Eunsuk Kang and I teach a course Software Engineering for AI-Enabled Systems. The course material is all public under a creative commons license and I would be happy to see it or something similar taught elsewhere. Requirements engineering deserves significant attention in such a course. I also published an annotated bibliography on papers on this topic.
Thanks for bearing with me through this thought experiment. The analogy emerged during discussions I had at the Dagstuhl Seminar 20091 “Software Engineering for ML-AI-based Systems”, when trying to nail down a definition for model and data bugs, which proved incredibly challenging. I’m grateful for the patience of many participants who, throughout the week, listened to my theory and gave feedback, including but not limited to Xiangyu Zhang, Andreas Metzger, Jin Guo, Jie M. Zhang, Michael Pradel, Miryung Kim, Tim Menzies, and Earl T. Barr.