CNN Adversarial Robustness

Part I: The Theory (digital adversaries)

Aleksandr Petiushko
The Startup
Published in
14 min readNov 8, 2020


A short intro into CNN

Starting from 1943 (the first mathematical formalization of the definition “artificial neuron” by Warren McCulloch and Walter Pitts in the paper “A logical calculus of the ideas immanent in nervous activity”) the neural nets become:

  • Wider (more parameters),
  • Deeper (more sequential processing layers),
  • Better! (solving the tasks more correctly)
Perceptron scheme (left) and modern GoogleNet (right). Image credit:

For now it is the well-known fact that the main tool for image and video smart processing are convolutional neural nets (CNNs). It can be used for finding the objects and recognizing its classes, and finally can answer the long-lasting question of the whole computer vision: is it a cat or a dog on the picture (related kaggle challenge)?

But is CNN really a silver bullet? Is all attention granted to CNNs by computer vision researchers reasonable? Let us concentrate on two distinct problems:

  1. Who is now the better “recognizer” (human or CNN) for some widely used datasets?
  2. How robust are CNNs regarding its inputs? Can the recognition process be broken?

CNNs golden age

Concerning the first problem let us consider two popular image datasets — ImageNet (1000 classes image dataset) and Labeled Faces in the Wild (LFW) (dataset with face images of more than 5000 persons).

ImageNet samples
LFW samples

Let’s start with ImageNet classification:

  • Top-5 error (the probability of the correct class to be outside of 5 most confident CNN outputs) for the human: 5.1% (proof by Karpathy)
  • Top-5 error for CNN: less than 2% (at least by the end of 2019; now I think even less, maybe around 1%. The reader can refer, e.g., to the paper “Fixing the train-test resolution discrepancy”)

Let’s move on to verification (check whether the pair of facial images is depicting the same person) on LFW:

After all, one can make the following conclusion:

Image credit:

Such fragile CNNs

But let’s not jump to conclusions immediately. For the second problem let’s revisit the classical works on the CNNs robustness research. It turned out that one can add to the input almost invisible to the human eye perturbations in such a way that this perturbation completely changes the CNN output.

For example, below you can see how the classification result from “Panda” changed to “Gibbon” after adding super invisible input perturbations:

Image credit:

Such almost invisible perturbations leading to changing of the CNN output are called adversarial examples (or adversarial attacks on CNN).

At first glance it may seem that such adversarial examples exist only for classification neural nets (for changing the output class from correct to incorrect one). Nevertheless it was found that the similar problems can be shown for detection CNNs (outputting spatial rectangular prediction of the object inside an image) and even for segmentation CNNs (providing the class label for every pixel on an image) — for the examples please refer to the paper “Adversarial examples for semantic segmentation and object detection”:

Correct predictions (left) and incorrect ones after adversarial attack (right)

Finally, foreseeing objections of the type “it’s all about images, in our perfect NLP there is no room for adversarial examples!” [NLP = Natural Language Processing], I would like to refer to the excellent example from the paper “Adversarial examples for evaluating reading comprehension systems”, which is related to so called question-answering (QA) systems, where the small portion of some text (3–5 sentences) is shown to the system under investigation and then this system is being asked to answer the simple question on understanding this text.

Here is the initial text:

Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver’s Executive Vice President of Football Operations and General Manager.

The question:

What is the name of the quarterback who was 38 in Super Bowl XXXIII?

The correct answer (here it matches the model under investigation):

John Elway.

But if we add as the last sentence the completely unrelated part:

Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver’s Executive Vice President of Football Operations and General Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV.

Than the output of the model will change (and becomes completely wrong!):

Jeff Dean.

The sources of instability and some ways to overcome it

In fact, there is a number of different possible reasons for such adversarial examples to exist (let’s not make holy war here about it — researchers still not have the single Answer to the Ultimate Question of Life, the Universe, and Everything). Anyway we can concentrate on one (and possibly the most simply interpretable) reason: insufficient generalization ability of CNNs. As a result the classification borders which are constructed in the process of CNN training often lie quite closely to the training samples, and in order to “move away” from the region of the correct class (marked below as green) to the region corresponding to the incorrect one (marked below as red) the very small perturbation is really needed:

Image credit:

Sidenote: in fact, the augmentation procedure applied to input data when training CNNs (adding to the initial training samples its slightly transformed variants: various rotations, darker/brighter versions, different scales, artificial occlusions etc) is trying to solve this problem to some extent — to move border a little bit far away form the training samples in some directions (corresponding to the chosen ways of augmentation). Nevertheless, there still remain a plenty of directions with the boundaries very close to the training samples.

So now we know that one can fool CNN by a small pixel perturbation. The logical proposal then is to add during training not only the training sample but also its pixel vicinity (under some norm — e.g. l∞; please see a short introduction to norms below if needed). Below the great illustration from the paper “Towards deep learning models resistant to adversarial attacks” is shown. Here red stars are such pixel vicinities of training samples that after applying the initial separating hyperplane (here — just black line) they occurred to lie in the region of the neighbor (so, wrong) class. So the intuition is to transform the separating boundary in such a way so as these stars remain inside the regions of its true classes:

Left: initial linearly separated data. Center: separating hyperplane is wrong when applying to pixel vicinities. Right: improved borderline

Let’s make some calculations trying to apply the intuition above. If we consider:

  • Initial image of size 100×100 pixels, 3 RGB colors,
  • Our human eye cannot distinguish with high confidence the color perturbation of +/- 1 color point (out of 256 in 8-bit encoding),

then for every training sample we need to add the following number of its pixel neighbors:

I’d like to mention that this number is much bigger than the number of atoms in the observable part of the Universe (~10⁸⁰), so our approach based on the simple intuition “to add training vicinities” to improve the CNN robustness is quite... non-working :)

We can see that the simple exhaustive search inside pixel vicinity is very cumbersome. But one can note, that in fact we do not need the part of training example vicinity which is already in the correct classification region! In contrast, we need to pay attention to the parts of the vicinity which lie in the incorrect classification region (sort of hard mining). Such a method of training with usage of additional most hard samples from the training example vicinity exist and is called adversarial training. One can refer to the original paper on adversarial training for CNNs.

Pros of adversarial training are obvious. It is not needed to make the exhaustive search for all the samples in the vicinity of any training example, but need to concentrate on a few (even could be one) hardest samples. How to find these “hardest” samples — is a good question (and, in fact, is still open). Usually some method of adversarial example generation is used, and the model robust to this specific method of generation is provided as the outcome of such a process.

Cons of adversarial training can be summarized as follows. Firstly, obtaining the sort of defense from some method of adversarial example generation used for adversarial training, we cannot guarantee anything taking into account any other method of adversarial example generation. Secondly, any method of adversarial example generation adds a significant computational overhead (often much bigger than usual gradient descent step inside error backpropagation approach widely used for CNNs training).

Some definitions: formal part

Let’s introduce a little more formalism for the sake of later understanding:

  • x ∈ B=[0,1]^{C×M×N} — input image of size C×M×N, where C — the number of color channels (1 for grayscale, 3 for RGB), M×N — the spatial size (height × width);
  • y — correct class label for input x;
  • θ — parameters of CNN-classifier;
  • L(θ, x, y) — loss function;
  • f(x) — the output of classifier (recognized class), and we are trying to make f(x) = y when training;
  • r ∈ B=[0,1]^{C×M×N} — the additive perturbation for the input x.

Now we can define the goal of adversarial attack: to change the output of the classifier f from the correct class label to the incorrect one by means of minimal in terms of some norm (practically used norms are l₀, l₁, l₂ and l∞ — let’s denote any of it by lₚ) perturbation r, or more formally to minimize ||r||ₚ so as:

  • f(x) = y (initially the output is correct),
  • f(x+r) ≠ y (“break” the CNN output with perturbation r),
  • x+r ∈ B (we are still in the space of correct images).

Sidenote. At the same manner we can define the so called classifier robustness — i.e. to find the perturbation class S(x,f)⊆ B so as the classifier will not change its output: f(x+r) = f(x) = y ∀ r ∈ S(x, f).

Using the notations above, the usual CNN training can be formalized as the minimization of the following expectation:

In the adversarial training at the beginning the hardest (i.e., maximizing the loss function) example is generated (e.g. by means of some adversarial attack method) from some vicinity Δ of input (e,g, under some lₚ norm), and only then the minimization is applied:

lₚ norms definition reminder

Let’s recall the most widely used lₚ norms (one can skip this section if needed).

l₂ — usual Euclidian norm:

l₁ — so called Manhattan or city block distance (when we construct a metric from the norm):

l∞maximum of absolute values of elements, this is correlated with the process of how a human eye perceive the visual information:

l₀ — the number of non-zero elements:

Sidenote. For 0 < p < 1 the “norm” lₚ so as

is in fact not the real mathematical norm.

Adversarial examples in computer vision: the history

Fooling images

Suppose the classifier outputs the probabilities of input image to be any of N predefined classes (e.g. N=1000 in case of ImageNet). Also suppose (it’s not always true but nevertheless) that for images from these N classes the output probabilities are reasonable (of course, some mistakes can occur). But what if the input image is not from these specific N classes — or, mathematically speaking, is out of the classifier domain? And it would be quite good if the output vector of probabilities is the uniform vector of type (1/N, …, 1/N) — roughly speaking, our classifier cannot understand. Much worse when the input image which is out of our classifier domain is leading to a very specific and determined (with high probability for some specific class) answer.

So one of the first approaches to study the CNN robustness was to find the correlation between the output probabilities and inputs of different nature (in-domain and out-of-domain). It turned out that there exist inputs (either structurally constructed or close to random noise) that can lead to a high probability for any (even specified in advance) class.

Such examples were named as “fooling images” and were generated with the evolutionary algorithms:

Noise-like (top) and structured-like (bottom) fooling images

For the details one can refer to the original paper “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images


The first attacking method on CNN used quasi-Newton minimization method with limitations on memory and variables L-BFGS-B — but this method is quite slow as well as usually requires the external optimizer.

So pretty soon it was proposed to use linearized part of loss function in the vicinity of input image x and move towards the gradient w.r.t. input x (refer to the original paper “Explaining and harnessing adversarial examples”):

where 0 < ϵ < 1 — some constant related to the speed / amount of movement towards gradient, and yₜ ≠ y — the wrong (target) class towards which we are “pushing” (fooling) the CNN classifier to move.

This method was called Fast Gradient Sign Method, or FGSM.

Recall: for finding the CNN parameters θ the error backpropagation method is used, where we take the gradient of loss function w.r.t. CNN parameters θ, i.e.

And the last interesting [here] note from this paper is that now l∞ norm is used as the most close one to the process of how a human eye perceive the visual information.


Often the estimation of the linearized part of some loss function in the vicinity of input image x is quite rough, and one step of FGSM sometimes is not enough to provide the good adversarial example.

To overcome the above mentioned issue the iterative method called Iterative FGSM, or I-FGSM, is used. The goal of this method is to move towards classification boundary more precisely.

Denoting as Π the projection operation, the I-FGSM can be expressed as following:

Additionally, if we want our final adversarial example xₙ to be not farther away from the initial input x than some ϵ under l∞ norm after not more than n steps:

than we can make the empirical rule on the speed of movement towards gradient:

Note 1. I-FGSM is more known as Projected Gradient Descent, or PGD, although this name appeared almost a year after I-FGSM.

Note 2. In fact, I-FGSM / PGD is still the most widely used in practice method of adversarial examples generation because of being very simple as well as quite well working and robust (although not so fast as FGSM, of course).


As one could note, the adversarial example generation methods become more and more look like the steps of an optimizer used to find the optimal CNN parameters θ. The only difference here is that in usual CNN training the optimization is done w.r.t. the CNN parameters θ, and in adversaries generation the optimization is done w.r.t. the CNN input x.

As a result, the whole trend of transferring the successful gradient descent tricks to adversarial example generation area arises, and the most known such an example — the usage of momentum.

The paper “Boosting adversarial attacks with momentum” suggests to use the gradient smoothing inside I-FGSM — Momentum I-FGSM, or MI-FGSM.

The algorithm can be summarized as the following:

where by g we denote this smoothed gradient:

On the comparison table of the last three methods can be seen that MI-FGSM is the better one (the more number in the table, the more successful attack is):

One Pixel

All the adversarial attacks considered above were in fact in the digital domain, when one can change any pixel (usually — all the pixels of an image). Nevertheless, there exist (and JSMA is one of the first, although not very practical and convenient) methods taking into account not maximal luminosity perturbation (by l∞ norm) but maximal amount of pixels to be changed (by l₀ norm).

Such methods of adversarial examples generation bring a bridge between the digital and the practical domain (topics about physical world adversaries are expected in the next part), and the most extreme of such methods is the one pixel attack.

The paper “One pixel attack for fooling deep neural networks” is about the extreme case of l₀-norm bounded attack. For this purpose authors use a sort of evolutionary algorithm (in fact, the algorithm of differential evolution) for adversaries search:

  • Population consists of 400 species, each species is defined by 5 numbers: 2 coordinates (x,y) and 3 color channels (r,g,b),
  • Offspring generation is done by linear combination of three random parents.

Here one can see the example of such one pixel attack on model trained on CIFAR-10 dataset of images, where the initial image with a dog on it becomes classified as any other 9 classes (plane, …, truck) when changing one and only one pixel!

Main takeaways

  1. For now CNNs in many computer vision tasks work much better than human expert,
  2. CNNs can be easily “fooled” using its instability w.r.t. its input,
  3. The most popular ingredient of successful adversarial attack on CNNs is the usage of a gradient w.r.t. its input.

Next part trailer

Next time (hopefully! at least I hope :) we’ll consider the following topics:

  • What is physically implementable adversarial attack;
  • The main tricks for generating physical adversarial examples;
  • How to fool the real face detector;
  • How to fool the best open faceID system;
  • Methods to defense from physical adversarial attacks.

Thanks for reading!




Aleksandr Petiushko
The Startup

Ph.D., lecturer at Lomonosov Moscow State University. Scientific expert, team leader at Huawei Moscow Research Center.