Machine Learning is Requirements Engineering — On the Role of Bugs, Verification, and Validation in Machine Learning

Christian Kaestner
Mar 8 · 11 min read
Dagstuhl, Feb 28, 2020

Disclaimer: The following discussion is aimed at clarifying terminology and concepts, but may be disappointing when expecting actionable insights. In the best case, it may provide inspiration for how to approach quality assurance in ML-enabled systems and what kind of techniques to explore.

TL;DR: I argue that machine learning corresponds to the requirements engineering phase of a project rather than the implementation phase and, as such, terminology that relates to validation (i.e., do we build the right system, given stakeholder needs) is more suitable than terminology that relates to verification (i.e., do we build the system right, given a specification). That is, machine learning suggests a specification (like specification mining and invariant detection) rather than provides an implementation for a known specification (like synthesis).

I have long been confused by the term machine-learning bug, especially when referring to the fact that a machine-learned model makes incorrect predictions. Papers discussing the quality of machine-learning models and systems tend to be all over the place about this. It is tempting to borrow terms like faults, bugs, testing, debugging, and coverage from software quality assurance for models in machine learning as well, but discussions are often vague and inconsistent, given the lack of hard specifications. We tend to accept that model predictions are wrong in a certain number of cases — for example, 95% accuracy on training and test sets might be considered quite good — so we don’t tend to call every incorrect prediction a bug. When a model performs poorly overall, we may be inclined to explore whether we picked the right learning technique, the right hyperparameters, or the right data — are these bugs? When a model makes more systematic mistakes (e.g., consistently performs poorly for people from minorities), we tend to explore solutions to narrow down causes, such as biased training sets — this may feel like debugging, but are these bugs?


The term bug is usually used to refer to a misalignment between a given specification and the implementation. Though we rarely ever have a full formal specification for a system, we can typically still say that a program misbehaves when it crashes or produces wrong outputs, because we have a pretty good sense of the system’s specification (e.g., thou shall not exit with a null pointer exception, thou shall compute the tax of a sale correctly). In the form of method contracts, the specification helps us to assign blame when something goes wrong (e.g., you provided invalid data that violates the precondition vs. I computed things incorrectly and violated postconditions or invariants).

Talking about specifications in machine learning is difficult. We have data that somehow describes what we want — but not really: we want some sort of generalization of the data, yet not a precise generalization of the training data — it’s quite okay if the model makes wrong predictions on some of the training data if that means it generalizes better. Maybe we can talk about some implicit specification derived from higher-level system goals (e.g., thou shall best predict the stock market’s development, thou shall predict what my customer wants to buy), but it’s unclear where such a specification comes from or how it could be articulated. Maybe a probabilistic specification would simply be to maximize accuracy on future predictions? This vague notion of specification is confusing (at least to me) and makes it very hard to pin down what “testing”, “debugging”, or “bug” mean in this context.

To make things more concrete, let’s take the well-known and highly controversial COMPAS model for predicting recidivism of convicted criminals, here approximated with an interpretable model created with CORELS by Cynthia Rudin for her article Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead:

IF age between 18–20 and sex is male THEN predict arrest (within 2 years)
ELSE IF age between 21–23 and 2–3 prior offenses THEN predict arrest
ELSE IF more than three priors THEN predict arrest
ELSE predict no arrest

Given some information about an individual at a parole hearing, the model predicts whether the individual is likely to commit another crime, which may be used for sentencing or for deciding when to release that person. What is the specification for how COMPAS should work? Ideally it would always correctly indicate whether any specific individual will, in the future, commit another crime — but we don’t know how to specify this. Even in a science-fiction scenario it seems more likely that we figure out how to manipulate behavior than how to predict future human behavior. Hence any prediction must generalize, and our goal might be to create a best approximation of typical behavior patterns. In practice, a model like the one shown above is learned from past data about released criminals to generalize patterns about who is more or less likely to commit future crimes. The model won’t perfectly fit the training data and won’t perfectly predict the future behavior of all individuals; at best we can measure accuracy on some training or evaluation data, and maybe see how many individuals predicted not to commit another crime actually do so (the opposite is hard to measure if we don’t release them in the first place). The model will be imperfect and, as broadly discussed among researchers and journalists, may even produce feedback loops and influence how people behave. Note that the challenge is how to evaluate whether the model is “acceptable” in some form, not whether it is “correct”.

Models are learned specifications

I think the confusion comes from a misinterpretation of what a machine-learned model is. A machine-learned model is not an attempt at implementing an implicit specification; a machine-learned model is a specification! It is a learned description of how the system shall behave. The specification may not be a good specification, but that’s a very different question from asking whether our model is buggy!

Consider the COMPAS model again. It is a specification of how the system should predict the recidivism risk. We could now take that specification and implement it, say, in Java. If the Java implementation messed up the age boundaries, this would be an implementation bug — our implementation does not behave as specified in the model. However, if the Java implementation is implemented as specified, but we don’t like the behavior, we should not blame the implementation, we should blame the specification. The problem is not that we have implemented the system incorrectly for the given specification, but that we have implemented the wrong specification.
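To make the specification-vs-implementation distinction concrete, here is a minimal sketch of the rule list above treated as an executable specification (the function name and input encoding are my own, not from CORELS or COMPAS):

```python
def predict_arrest(age, sex, priors):
    """Direct transcription of the learned rule list (the specification).

    Returns True if the model predicts arrest within 2 years.
    """
    if 18 <= age <= 20 and sex == "male":
        return True
    if 21 <= age <= 23 and 2 <= priors <= 3:
        return True
    if priors > 3:
        return True
    return False
```

Writing `18 < age <= 20` by accident would be an implementation bug, a verification failure against the rule list; disliking what the rule list itself says is a validation problem with the specification.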

In software engineering, there is a fundamental and common distinction between validation and verification (wikipedia). Validation is checking whether the specification is what the stakeholders want — whether we build the right system. Verification is checking whether we implement the system corresponding to our specification — whether we build the system right. Validation is typically associated with requirements engineering, verification with quality assurance of an implementation (e.g., testing).

Verification and validation

If the machine-learned model is the specification, we usually do not worry about implementing the specification correctly, because we directly derive the implementation from the specification or even just interpret the specification. Bugs in the traditional sense — implementing the specification incorrectly — are rarely a concern. Verification is not the problem. What we should worry about is validation, that is, whether we have learned the right specification.

A requirements engineer’s validation mindset

Requirements engineers deal with validation problems all the time — they are the validation experts. Their main task is to come up with specifications and make sure that they are actually the right specifications.

Typically, requirements engineers will read documents and interview stakeholders to understand the problem and the needs of the system (I’m oversimplifying and limiting the discussion to interviews). Requirements engineers proceed iteratively: they propose some specifications, maybe build a prototype, and then go back to the customer and other stakeholders to see whether they have gotten the specifications right. Each iteration might provide insights about what was missing in the previous specification and how to come up with a better one (e.g., interviews with stakeholders not previously considered). There are lots of challenges when identifying specifications, which is what requirements engineers study, such as interviewing representative stakeholders and avoiding injecting bias into those interviews. There are also lots of techniques, such as storyboarding, prototyping, and A/B testing, to check whether a specification meets user needs or system goals.

Conflicting requirements are also common during validation. Different stakeholders may have different goals and views (e.g., the company’s goals vs. the employees’ goals vs. the customers’ goals vs. privacy requirements imposed by law). A key job of a requirements engineer is to detect and resolve vague, ambiguous, and conflicting specifications. Detection can be performed through manual inspection of text but also with automated reasoning about models (e.g., there is a huge body of work on using model checking to find conflicts in specifications). Resolving issues found during validation often requires going back to users and other stakeholders to discuss and negotiate tradeoffs.
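As a toy illustration of automated conflict detection (the two requirements and their encoding are invented for this sketch; exhaustive enumeration here stands in for real model checking), one can search a small input space for inputs where two stakeholder rules demand contradictory decisions:

```python
from itertools import product

# Two hypothetical stakeholder requirements, each mapping an input to a
# required decision (True = grant access, False = deny), or None if the
# requirement is silent on that input.
def legal_requirement(age, premium):
    return False if age < 16 else None   # law: deny minors

def business_requirement(age, premium):
    return True if premium else None     # sales: premium users always get access

# Search a small input space for inputs where the two requirements
# demand opposite decisions -- a specification conflict.
conflicts = [
    (age, premium)
    for age, premium in product(range(10, 30), [False, True])
    if legal_requirement(age, premium) is not None
    and business_requirement(age, premium) is not None
    and legal_requirement(age, premium) != business_requirement(age, premium)
]
```

Here every premium customer under 16 is a conflict witness; resolving it means going back to the stakeholders, not tweaking code.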

Machine learning is like requirements engineering

Assuming machine-learned models are specifications, machine learning has a very similar role and very similar challenges to requirements engineering in the software engineering lifecycle.

Machine Learning as Requirements Engineering
  • We need to identify relevant and representative data, just as requirements engineers need to identify representative stakeholders: Selecting insufficient or biased data may lead to inadequate specifications.

Software engineering researchers may appreciate an analogy with specification mining and invariant detection concepts in our field. In both cases, we come up with candidate specifications of a system, which stakeholders need to validate for fit, but which can be helpful for understanding or debugging the system.
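The analogy can be sketched in miniature: a Daikon-style invariant detector proposes candidate invariants over observed traces and keeps those never falsified, and a human must then validate whether the survivors are intended behavior or accidents of the data (variable names and the tiny invariant catalog below are my own, not Daikon’s):

```python
# Invariant detection in miniature: a fixed catalog of candidate
# invariants, checked against observed variable values.
CANDIDATES = {
    "x >= 0": lambda t: t["x"] >= 0,
    "x > y": lambda t: t["x"] > t["y"],
    "x == y": lambda t: t["x"] == t["y"],
}

def mine_invariants(traces):
    """Return the candidate invariants that hold on every observed trace.

    Like a learned model, the result is only a *candidate* specification:
    it generalizes from the data it has seen and must still be validated.
    """
    return [name for name, check in CANDIDATES.items()
            if all(check(t) for t in traces)]

traces = [{"x": 3, "y": 1}, {"x": 5, "y": 2}, {"x": 2, "y": 2}]
mined = mine_invariants(traces)
```

Whether `x >= 0` is a real requirement or merely true of the traces seen so far is exactly the validation question this post is about.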

Benefits from a requirements engineering mindset

Thinking of quality assurance in machine learning as validation rather than as verification solves a number of problems and encourages exploring different perspectives:

  • It becomes clear why the term model bug makes little sense. Working with verification techniques, such as test-case generation to improve coverage while somehow trying to solve the oracle problem, does not seem to be the path toward validation. Instead we should adopt a requirements engineering mindset and ask whether a model fits. (Yes, bugs can still exist in the machine-learning framework itself or in the inference infrastructure, but that’s a different issue.)

In summary, when “testing” machine learning systems, we should frame the discussion as validation rather than as verification. I suspect we can learn a lot from requirements engineering. At a minimum, this view might help us get our terminology straight: we should talk about model fit and conflicting specifications, rather than about model bugs. I guess now I need to go and read a requirements engineering book more carefully.

Further readings: At Carnegie Mellon, Eunsuk Kang and I teach a course Software Engineering for AI-Enabled Systems. The course material is all public under a creative commons license and I would be happy to see it or something similar taught elsewhere. Requirements engineering deserves significant attention in such a course. I also published an annotated bibliography on papers on this topic.


Thanks for bearing with me through this thought experiment. The analogy emerged during discussions I had at the Dagstuhl Seminar 20091 “Software Engineering for ML-AI-based Systems”, when trying to nail down a definition for model and data bugs, which proved incredibly challenging. I’m grateful for the patience of the many participants who, throughout the week, listened to my theory and gave feedback, including but not limited to Xiangyu Zhang, Andreas Metzger, Jin Guo, Jie M. Zhang, Michael Pradel, Miryung Kim, Tim Menzies, and Earl T. Barr.
