Unsolved research problems vs. real-world threat models

Context: This blog post is based on a lightning talk I gave at the Partnership on AI All-Partners meeting in Nov 2018, and at the Puerto Rico AGI conference in Jan 2019.

Caveat: I’m expressing thoughts in a personal capacity, not as a representative of my employer.

I personally think adversarial examples are highly worth studying, and should inspire serious concern. However, most of the justifications for why exactly they’re worrisome strike me as overly literal.

I think much of the confusion comes from conflating an unsolved research problem with a real-world threat model.

I’ll start by explaining what I mean by adversarial examples. Then I’ll take one justification I often see repeated (not only by journalists, but also in the introduction sections of many papers on the topic), namely “people could put stickers on stop signs to crash cars”, and run through a quick threat-model analysis to show why I think this is not a compelling motivation if taken literally. Finally, I’ll gesture at a few justifications I do find more compelling, by framing typical small-perturbation adversarial examples as an unsolved research problem with a very real (but less direct) connection to real-world problems.

What are adversarial examples?

Adversarial examples are inputs that are designed to cause a machine learning model to make a mistake [Goodfellow et al. 2017].

A common (but not necessary) additional assumption is that these inputs are constructed by making small modifications of clean test-set inputs: these are sometimes called “epsilon-ball adversarial examples” or “small-perturbation adversarial examples”.
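To pin down what “epsilon-ball” means here, this is roughly the standard formulation (a sketch of the common L-infinity version; the choice of norm and the notation vary across papers):

$$
\text{find } x' \ \text{ such that } \ \lVert x' - x \rVert_\infty \le \epsilon \ \text{ and } \ f(x') \ne y
$$

where x is a clean test input with true label y, f is the classifier, and ε is a small perturbation budget, so no pixel of x′ differs from the corresponding pixel of x by more than ε.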

For example, if you have a model that classifies images, and it classifies this picture of a panda correctly with fairly high confidence, then it’s actually fairly easy to find a very similar image, where each pixel is changed just a tiny bit (so it still looks like a panda) but the new image is incorrectly classified as a gibbon, with extremely high confidence.

After changing each pixel a tiny bit, the new image is incorrectly classified with extremely high confidence.
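To give a sense of how easy “fairly easy to find” can be, here’s a minimal sketch of one standard attack, the fast gradient sign method (FGSM), written against PyTorch. The names `model`, `panda`, and `label` are hypothetical placeholders, and this isn’t necessarily the exact procedure behind the figure above.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=0.007):
    """Nudge every pixel by at most +/-epsilon in the direction that
    increases the classifier's loss (the fast gradient sign method)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Every pixel moves by exactly epsilon, so the result stays within
    # the epsilon-ball around the original image.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()

# Hypothetical usage: `model` is any differentiable image classifier,
# `panda` is a (1, 3, H, W) tensor with values in [0, 1], and `label`
# is its true class index as a length-1 tensor.
# adversarial_panda = fgsm_perturb(model, panda, label)
```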

It’s worth noting that small-perturbation adversarial examples are not a weird quirk of deep learning, or a problem that only specific models have and most models don’t. With some caveats (that I won’t go into here), virtually all known machine learning models are susceptible to adversarial examples on high-dimensional inputs, and there are no good solutions.

Adversarial stop signs: worrying, but not as a literal threat model

Given that we know ML models are susceptible to these striking failure modes, one might reasonably wonder what this means for deploying ML in real-world contexts with real-world consequences.

As an example, let’s say you’re designing a self-driving car, and you’d like it to be able to recognize stop signs. But you’ve heard of adversarial examples, and you’re curious if those will be a problem for your car.

If you’re designing a self-driving car to recognize stop signs, you might wonder whether you need to worry about adversarial attacks causing the stop signs to be incorrectly recognized.

My background as an engineer is in a research setting, not in system design and deployment, and so I’m not an expert at analyzing how real-world systems will fail. But I do have friends and colleagues who work in computer security, and the one question they’ve taught me to always ask is “What’s your threat model?”

The blog post Approachable Threat Modeling by Kevin Riggle is my favorite (very approachable!) explanation of what threat modeling involves:

Threat modeling is just the process of answering a few straightforward questions about any system you’re trying to build or extend.
* What is the system, and who cares about it?
* What does it need to do?
* What bad things can happen to it through bad luck, or be done to it by bad people?
* What must be true about the system so that it will still accomplish what it needs to accomplish, safely, even if those bad things happen to it?
For the sake of brevity, I’ll refer to these questions as Principals, Goals, Adversities, and Invariants.

Let’s try applying this framework.

In the case of self-driving car design, let’s imagine our goal is for the car to always stop at a “stop intersection”.

We’d like this to be true, even if someone has put a weird glitchy sticker on the stop sign to cause it to be incorrectly recognized.

Let’s not stop there, though: let’s list all the adversities and adversaries we can think of that might possibly threaten our system. For example, we also want the car’s “stop intersection” behavior to work in fog or snow, in the presence of graffiti or vandals, or if the intersection is under construction, and so on. Our list should ultimately include a long litany of possible problems, including deeply mundane and unsophisticated ones.

Including the possibility that the stop sign has simply… fallen over.

Our list of possible problems should include the case in which the stop sign has fallen over. [Gilmer et al. 2018]

If your car would crash if the stop sign had simply fallen over, then you have much bigger and more basic safety problems than small-perturbation adversarial examples!
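To make the exercise concrete, here’s a toy sketch of what writing that list down might look like, using Riggle’s four headings. Every entry is an illustrative placeholder rather than a real or complete threat model.

```python
# Purely illustrative: Riggle's four questions applied to the stop-sign
# example. A real threat model would be far longer and more specific.
stop_intersection_threat_model = {
    "principals": ["passengers", "other road users", "the car's operator"],
    "goals": ["the car always stops at stop intersections"],
    "adversities": [
        "fog or snow obscures the sign",
        "graffiti, stickers (adversarial or otherwise), or vandalism",
        "the intersection is under construction",
        "the stop sign has simply fallen over",
    ],
    "invariants": [
        "stopping behavior must not hinge on sign detection alone",
    ],
}
```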

In summary, I’m saying that yes, it’s possible for someone to put a glitchy sticker on a stop sign. And the sign might then go undetected by any number of standard vision systems. And a car that relied exclusively on that vision system might indeed barrel through the intersection and crash. This is absolutely a thing that could happen in the real world if the car were designed such that a misclassified stop sign would cause a crash; it’s not fake or made-up or impossible.

But I haven’t really specified an example of a real adversary who might actually have the capabilities, resources, knowledge, and motivation to make and install that sticker, and for whom this isn’t just possible but is actually the best way to achieve their goals. I can only imagine that, depending on exactly why they want to cause car crashes, they could come up with some cheaper, easier way to do it.

In essence, if I’m telling you that “vandals who run gradient descent to produce glitchy stickers that they print out and stick on road signs” is the literal real-world outcome I’m concretely trying to prevent, my threat model is incomplete. It’s still worth researching (for reasons I explain in the next major section), but only if taken less literally.

So are adversarial stop signs worrying at all?

I’m not saying “Everyone stop worrying! ML models are totally robust and fine.” This is the opposite of my point. I’m trying to say “The problem is worse than adversarial stickers.”

To explain what I mean, let’s go back to our threat model for a moment and look at the invariants (“What must be true about the system so that it will still accomplish what it needs to accomplish, safely, even if all the bad things happen to it?”)

If our goal is for the car to always stop — not just in the presence of an adversarial sticker, but even under more likely conditions in which the physical stop sign is not present or visible at all — then it follows directly that we CANNOT rely on road sign detection alone to choose when to stop at intersections. Period. Our threat model has told us that using one detection model alone is not robust enough for safety-critical use.

So now we have a really interesting design problem! How can we recognize stop intersections without using road sign detection exclusively? Maybe we should correlate with GPS and map data? Take extra caution at intersections that aren’t marked with either “stop” or “yield”?

By considering our threat models, we’ve realized that the problem is worse than we thought. It’s not enough to harden our vision model against small-perturbation adversarial stickers, even if we were able to. We have to go even further, and remove “total dependence on the sign detection system alone” from our list of possible ways to achieve the goal.
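As a purely hypothetical illustration of what “not depending on the sign detection system alone” could look like, here’s a toy decision rule that treats the vision system as just one signal among several. The signal names are made up for the sketch and bear no resemblance to a real autonomy stack.

```python
def should_stop_at_intersection(sign_detected: bool,
                                map_says_stop: bool,
                                unmarked_intersection_ahead: bool) -> bool:
    """Toy redundancy rule: stop if either the vision system or the map
    data says to, and treat unknown intersections conservatively."""
    if sign_detected or map_says_stop:
        return True
    # No sign seen and no map record: don't assume right-of-way at an
    # intersection the system knows nothing about.
    return unmarked_intersection_ahead
```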

Why care?

So, given that “adversarial stickers” should be on the list of adversities alongside many more mundane concerns, which together necessitate much more sweeping mitigation strategies… then why might we care about small-perturbation adversarial examples at all?

I’ve come across two reasons that I find somewhat compelling:

One: they’re a proof of concept, an incontrovertible demonstration that a certain type of problem exists. Because small-perturbation adversarial examples are so easy to find, we can say with certainty that if the safety of your system depends on the classifier never making obvious mistakes, that assumption is false and your system is unsafe.

Current image classifiers cannot reliably distinguish between unambiguous bird and bicycle images. [Unrestricted Adversarial Examples Challenge].

I should emphasize that making small perturbations is not the only way to find misclassified examples. An adversary could find mistakes using some other method, like trying random translations and rotations, or using clever angles or lighting.

However, thanks to small-perturbation examples, we already know that obvious mistakes exist and can be found easily.

Moreover, the “adversary” need not be a human actor searching deliberately: a search for mistakes can happen unintentionally any time a selection process with adverse incentives is applied (such as testing thousands of inputs to find which ones get the most clicks or earn the most money).

Your adversary could find the mistakes using some other method, like trying random translations and rotations, or using clever angles or lighting. [Brown et al. 2018]
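As a rough sketch of how such a search might go, here’s a brute-force loop over random rotations and translations, assuming a hypothetical classifier and torchvision-style transforms; published work in this area is considerably more systematic.

```python
import random
import torchvision.transforms.functional as TF

def find_transformed_mistake(model, image, label, tries=1000):
    """Randomly rotate and translate a correctly classified image until
    the (hypothetical) classifier gets it wrong, if it ever does."""
    for _ in range(tries):
        angle = random.uniform(-30.0, 30.0)                     # degrees
        dx, dy = random.randint(-8, 8), random.randint(-8, 8)   # pixels
        candidate = TF.affine(image, angle=angle, translate=[dx, dy],
                              scale=1.0, shear=0.0)
        if model(candidate.unsqueeze(0)).argmax(dim=1).item() != label:
            return candidate  # misclassified, yet obviously the same object
    return None
```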

Two: for researchers, they’re a domain in which research progress is possible. I asked fellow adversarial examples researchers why they thought the “small perturbation” setting was a useful domain of study despite not being a compelling real-world threat model, and they gave a few reasons:

  1. Changing each pixel slightly is easy to specify as an algebraic operation, which makes formal analysis possible.
  2. It’s a problem that real classifiers have, so researchers can study it on real datasets (instead of synthetic data).
  3. We already have evidence that researchers have been able to discover and learn things about robustness that would’ve been hard to learn if we didn’t have a good toy problem. For example, we now know that an image that fools one model is likely to fool another independent model.

Although this is far from a compelling argument that the “small perturbation” setting is the best or only setting to study robustness, it certainly holds water for me as a set of reasons to work in the area, much more than the literal justifications do.

It basically amounts to a claim that adversarial examples are an unsolved research problem that not only sheds light on a larger category of demonstrable problems, but also can be meaningfully tackled.

Personally, I tend to view the adversarial examples lens as just one paradigm (in the Kuhnian sense) that can be used to demonstrate and study failures of robustness in ML systems, and to hopefully get traction on solutions. It has its limitations as a paradigm, and I’m excited to watch the ML community iterate and refine its approach to robustness, by developing and propagating new and improved paradigms that incorporate lessons learned along the way.

Unsolved research problems aren’t real-world threat models (but both are important)

I think there’s an overall picture here that I’d like you to come away with, which extends beyond just the domain of adversarial examples:

  • Unsolved research problems often entail constructing toy domains where it’s easier to isolate key difficulties and make research progress. Although a toy domain rarely resembles a likely real-world outcome literally, it can serve as inspiration for possible problems with a deployed system. And conceptual progress made on a toy problem can guide the field towards new paradigms.
  • In deployed systems, the most glaring concerns are almost certainly more basic than the “research problems”, and so you need a concrete threat model to guide you towards effective mitigation strategies. Your concerns are likely to be broader and worse than the research problem suggests. You’ll likely need to make sweeping design changes, rather than adding small fixes.
In deployed systems, the most glaring concerns are almost certainly more basic than the “research problems”.

One important component of my general stance towards robustness is that I think statements like “But these problems are not new to deep learning!” or “We have even more basic problems than this!” are not a reason to assume everything will be fine. They should serve as a sharp reminder to plan carefully, check your assumptions, and take the full context into account.

In essence, if you’re deploying a system — whether or not it contains ML — you do need an actual, specific plan for anticipating and mitigating negative outcomes.

If you’re a researcher, I would urge you not to justify research on a toy problem by claiming that it literally represents an immediate real-world threat, unless you also provide a threat model. I’d prefer to see justifications explaining why your toy problem is a fruitful testbed for new conceptual insights, and why we might expect these insights to shed light on real-world problems further down the line.

And if you’re having a conversation about adversarial examples, I strongly encourage you to clarify this distinction with the person you’re talking to!

A handful of resources that inspired me to want to make this point

These ideas are not new and original to me. Many people have made this point before. When I was originally giving this talk, I was most directly inspired by the following resources:

Thanks to Jeremy Howard for always asking great questions and specifically nudging me to create a public version; Sam Finlayson for saying similar things in the FAQ on his Medical Adversarial Attacks Policy Paper (worth a read!); Jeffrey Ladish and Jean Kossaifi for helpful suggestions; and Kevin Riggle for good conversations about threat modeling.