Paper Discussion: Explaining and harnessing adversarial examples

Discussion of the paper “Explaining and harnessing adversarial examples” presented at ICLR 2015 by Goodfellow et al.

Mahendra Kariya
Nov 14, 2018

At ICLR 2015, Ian Goodfellow, Jonathan Shlens, and Christian Szegedy published the paper “Explaining and Harnessing Adversarial Examples”. Let’s discuss some of the interesting parts of this paper in this post.

Before we begin, let’s start with a quick one-liner about what an adversarial example is. Quoting the OpenAI blog,

Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake; they’re like optical illusions for machines.

Background

Szegedy et al. (2014) discovered that machine learning models can misclassify data that is only slightly different from the data the model has seen. The interesting part is that such inputs (called adversarial examples) are not specific to a particular type of NN architecture. The exact same adversarial examples can be misclassified by different NN architectures trained on different datasets. This practically means that the models are not learning the true underlying properties of the data; instead, these algorithms are building a Potemkin village. Some of the speculative explanations for the cause of these adversarial examples are:

  • Non-linearity of neural networks
  • Insufficient regularisation
  • Insufficient model averaging

Important takeaways

Some of the important takeaways from this paper: we don’t need non-linearity to explain adversarial examples. They can be created by exploiting the linear behaviour of models in high-dimensional spaces. The paper introduces a fast method to generate adversarial examples, called the Fast Gradient Sign Method, and shows that adversarial training can be used as a regularisation technique.

Linear models and adversarial examples

A simple linear model can be described as wᵀx, where w is the weight vector and x is the input. The input is made up of many features, and the precision of any given feature is limited (image pixels, for instance, are often stored with only 8 bits, so information below 1/255 of the dynamic range is discarded).

Now let’s add a small noise η, such that ‖η‖∞ < ϵ (where ϵ is smaller than the precision of the features), to x. We can call this new input x̄. So,

x̄ = x + η

We can write the dot product between the weight vector and x̄ as

wᵀx̄ = wᵀx + wᵀη

This means that the activation of the network increases by wᵀη. Subject to the max norm constraint on η, this increase is maximised by assigning η = ϵ sign(w).

While ‖η‖∞ does not grow with the dimensionality of the problem, the change in activation caused by η grows linearly with the number of dimensions. Hence, for high-dimensional problems, we can make many tiny changes to the input that add up to one big change in the output. Thus, even a simple linear model is vulnerable to adversarial examples, provided the input is high dimensional.
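As a quick sanity check (not from the paper, just a toy illustration), the following NumPy sketch measures how much a max-norm perturbation of size ϵ can shift the output of a random linear model as the input dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.007  # assume this is below the precision of the features

for n in (10, 1_000, 100_000):      # input dimensionality
    w = rng.normal(size=n)          # weights of a toy linear model w.T @ x
    eta = epsilon * np.sign(w)      # worst-case max-norm perturbation
    change = w @ eta                # equals epsilon * sum(|w|), grows with n
    print(f"n = {n:>7}: activation change = {change:.2f}")
```

Each individual feature moves by at most ϵ, but the total change in the activation scales roughly linearly with the number of dimensions.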

Fast Gradient Sign Method

The authors start with a simple intuition for why linear perturbations can damage non-linear models like neural networks. They point out that LSTMs, ReLUs, and maxout networks are intentionally designed to behave in a linear way so that optimisation becomes easier. Even sigmoidal networks are tuned to spend most of their time in the non-saturating, more linear regime. Hence, these networks cannot resist linear adversarial perturbations.

The authors then go on to describe the Fast Gradient Sign Method of generating adversarial examples. Let’s take a simple neural network with x as the input, y as the target, θ as the parameters of the network, and J(θ, x, y) as the cost function. To obtain the optimal max-norm constrained perturbation η, we can linearize the cost function around the current value of θ and get

η = ϵ sign(∇ₓ J(θ, x, y))

Backpropagation can be used to compute the required gradient of the cost function with respect to the input, ∇ₓ J(θ, x, y).
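Here is a minimal sketch of the method, assuming a TensorFlow 2 / Keras classifier that outputs logits and takes integer labels y (the paper itself is framework-agnostic, and `fgsm_perturbation` is just an illustrative name):

```python
import tensorflow as tf

def fgsm_perturbation(model, x, y, epsilon=0.25):
    """Return eta = epsilon * sign(grad_x J(theta, x, y)) for a batch."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)                       # x is an input, not a variable
        loss = tf.keras.losses.sparse_categorical_crossentropy(
            y, model(x), from_logits=True)  # J(theta, x, y)
    return epsilon * tf.sign(tape.gradient(loss, x))

# x_adv = x_batch + fgsm_perturbation(model, x_batch, y_batch)
```

One gradient computation per batch is all it takes, which is why the method is called “fast”.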

The authors then give a simple example to build some intuition for how such adversarial examples are generated.

For the sake of simplicity, let’s take only the 3’s and 7’s from MNIST. Instead of a neural network, let’s build a simple logistic regression model to do this binary classification.

Now, look at the corresponding figure in the paper. Sub-figure (a) shows the weights of the logistic regression model trained above, and (b) shows the sign of the weights. Sub-figure (c) has samples from the actual dataset, and (d) depicts the adversarial examples generated by multiplying the sign of the weights (sub-figure (b)) by ϵ = 0.25 and adding the result to the actual samples (sub-figure (c)).

The logistic regression model has an error rate of 1.6% on the clean samples in (c), but on the adversarial examples in (d) the error rate is 99%. Note that we as humans can still correctly identify the digits in (d).
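A rough sketch of how one might reproduce this experiment with scikit-learn follows (the authors’ exact training setup is not specified here; the gradient of the logistic loss with respect to the input is derived by hand in the comments):

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

# Binary MNIST: 3s vs 7s, pixels scaled to [0, 1].
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
mask = (y == "3") | (y == "7")
X, y = X[mask] / 255.0, (y[mask] == "7").astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_[0]

# For logistic regression, grad_x of the cross-entropy loss is
# (sigmoid(w.x + b) - y) * w, so its sign is essentially the sign of the
# weights, flipped depending on the true class.
p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
grad_x = (p - y)[:, None] * w[None, :]
X_adv = X + 0.25 * np.sign(grad_x)      # epsilon = 0.25, as in the paper

print("clean error:", 1 - clf.score(X, y))
print("adversarial error:", 1 - clf.score(X_adv, y))
```

The perturbation pattern added to every image is essentially the same ± sign-of-weights template, yet it is enough to flip almost all predictions.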

Experiments and results

The authors report the error rates of various models on adversarial examples; for example, with ϵ = 0.25 a maxout network misclassifies 89.4% of adversarial MNIST examples, with an average confidence of 97.6% on the wrong label. Please note that these results were obtained when adversarial training was not used as regularisation.

Adversarial training as regularizer

Data augmentation is a very popular regularisation technique. Normally, data augmentation transforms the data so that the model is exposed to variations (such as translations) that may occur in the test set. The authors suggest adding adversarial examples as another form of data augmentation. Though such perturbed inputs are unlikely to occur in the real world, training on them exposes the model to its own flaws. This is called Adversarial Training.

To do so, the authors suggest using an adversarial objective function based on the Fast Gradient Sign Method.

𝔍(θ, x, y) = α J(θ, x, y) + (1 − α) J(θ, x + ϵ sign(∇ₓ J(θ, x, y)), y)

The value α = 0.5 was used for all experiments. When trained with this adversarial objective function, the error rate of the maxout network (with dropout) on the MNIST test set reduced from 0.94% without adversarial training to 0.84% with adversarial training. Note that the error rates quoted in the previous section were only for adversarial examples, while the error rate reported in this paragraph is for the entire test set. Another thing to note is that the authors made some changes to the network architecture; you can look at Section 6 of the paper for more details.
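A minimal sketch of one training step with this objective (α = 0.5, ϵ = 0.25), again assuming a TensorFlow 2 / Keras model that outputs logits; the function name and setup are illustrative, not the authors’ code:

```python
import tensorflow as tf

ALPHA, EPSILON = 0.5, 0.25
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def adversarial_train_step(model, optimizer, x, y):
    # First pass: build the adversarial batch x + eps * sign(grad_x J(theta, x, y)).
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss_for_grad = loss_fn(y, model(x, training=True))
    x_adv = x + EPSILON * tf.sign(tape.gradient(loss_for_grad, x))

    # Second pass: mix the clean and adversarial losses, then update the parameters.
    with tf.GradientTape() as tape:
        loss = (ALPHA * loss_fn(y, model(x, training=True))
                + (1.0 - ALPHA) * loss_fn(y, model(x_adv, training=True)))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

Because the adversarial batch is regenerated at every step from the current parameters, the model keeps training against its own latest weaknesses.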

Apart from reporting the error rate for the entire test set, the authors also report the error rate on adversarial examples alone. As mentioned above, for the maxout network the error rate on adversarial examples was 89.4%; with adversarial training, this reduced to 17.9%. Unfortunately, the confidence of the adversarially trained model when misclassifying adversarial examples is still high, at 81.4% (earlier, it was 97.6%).

The weights of the adversarially trained model are reported to be much more localized and interpretable, as can be seen below.

[Figure: weights of the model with and without adversarial training. Source: https://arxiv.org/pdf/1412.6572.pdf]

Why do adversarial examples generalize?

As discussed at the beginning of the post, adversarial examples are not specific to a particular architecture. They generalize across different architectures trained on different datasets. Moreover, different models often even assign the same (incorrect) class to a given adversarial example.

The paper discusses why these properties hold. Let’s start with the first property, viz. that the same adversarial examples fool different architectures. Note that the adversarial noise η only depends on the sign, not the magnitude, of the gradient. As long as the direction of η has a positive dot product with the gradient and ϵ is sufficiently large, we can generate an adversarial example. The authors traced out different values of ϵ (see the figure below). They discovered that adversarial examples do not occur in fine pockets; instead, they can be found in contiguous regions of the 1-D subspace defined by the Fast Gradient Sign Method. This means adversarial examples are abundant, which is why they are common across different models.

[Figure: effect of varying ϵ along the fast gradient sign direction. Source: https://arxiv.org/pdf/1412.6572.pdf]
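A sketch of how one might trace out ϵ for a single test example (hypothetical helper, TensorFlow 2, model outputs logits): it computes the fast gradient sign direction once and then reports the predicted class as ϵ varies.

```python
import numpy as np
import tensorflow as tf

def epsilon_sweep(model, x, y, epsilons):
    """Predicted class at each epsilon along the fast gradient sign direction."""
    x = tf.convert_to_tensor(x[None, ...], dtype=tf.float32)  # add batch dim
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.sparse_categorical_crossentropy(
            [y], model(x), from_logits=True)
    direction = tf.sign(tape.gradient(loss, x))
    return [int(tf.argmax(model(x + eps * direction), axis=-1)[0])
            for eps in epsilons]

# e.g. epsilon_sweep(model, x_test[0], int(y_test[0]), np.linspace(-0.3, 0.3, 13))
```

Plotting the resulting labels against ϵ shows the same qualitative picture as the paper: once ϵ crosses a threshold, the prediction stays wrong over a contiguous range rather than flipping back and forth.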

Let’s move on to the second property: different classifiers assign the same class to adversarial examples. The authors hypothesize that the neural networks that we have today are able to learn approximately the same weights when trained on different subsets of training data. The stability of the learned weights results in the stability of misclassification of adversarial examples. After running some experiments to test this hypothesis, it is concluded that a significant portion of misclassification errors are consistent with linear behavior being a major cause.

Code Implementation

If you’d like to generate adversarial examples and adversarially train your models, there is a nice open source library written by Ian Goodfellow and Nicolas Papernot called cleverhans. Apart from the Fast Gradient Sign Method discussed in this post, the library also supports a bunch of other methods for generating adversarial examples. You can use cleverhans with TensorFlow as well as Keras, and there are a few tutorials available in the project repository.
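For example, with a recent (4.x) release of cleverhans and a TensorFlow 2 model, generating an FGSM batch looks roughly like this (the import path has moved between versions, so treat this as an approximation and check the repository for your installed version):

```python
import numpy as np
from cleverhans.tf2.attacks.fast_gradient_method import fast_gradient_method

# model is a callable returning logits; x_batch is a batch of inputs.
x_adv = fast_gradient_method(model, x_batch, eps=0.25, norm=np.inf)
```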

I will end this post with a couple of lines taken directly from the paper.

The existence of adversarial examples suggests that being able to explain the training data or even being able to correctly label the test data does not imply that our models truly understand the tasks we have asked them to perform. Instead, their linear responses are overly confident at points that do not occur in the data distribution, and these confident predictions are often highly incorrect.
