Tricking a Machine into Thinking You’re Milla Jovovich

And other types of adversarial attacks in machine learning

What is an Adversarial Attack?

In early 2014, Szegedy et al. (2014) showed that minimally altering the inputs to machine learning models can lead to misclassification. These input are called as adversarial examples: pieces of data deliberately engineered to trick a model.

This picture of a fish (left) is correctly classified, but the addition of a small perturbation (middle) generated by the Fast Gradient Sign Method (FGSM) causes a classifier to misclassify the resulting image (right) as a cat.

Since then we have seen an arms race between adversarial attacks and defenses. For example, a defense mechanism called defensive distillation (Papernot et al., 2015) which was considered state of the art in 2015 was attacked successfully by the Carlini & Wagner (C&W) methods with 100% success rate in 2016. Moreover, seven novel defense mechanisms accepted to the Sixth International Conference on Learning Representations (ICLR) 2018 were successfully circumvented (Athalye et al., 2018) just days after the acceptance decision. This shows how difficult it is to robustly defend against adversarial attacks.

Why Should I Care About Adversarial Attacks?

The implications of the existence of adversarial examples in the real world cannot be underestimated. Consider a home owner who uses a face recognition system as a security feature. We can now generate an adversarial eyeglass (Sharif et al., 2016) that can be printed and placed on a real eyeglass frame to fool face recognition models.

A man wearing an adversarial eyeglass (top) is grossly misclassified as Milla Jovovich (bottom) (Sharif et al., 2016).

Another real-world example where adversarial examples could be dangerous is in the manipulation of road signs. Evtimov et al. (2017) generated an adversarial stop sign that is always misclassified as a speed limit sign even when viewed from different distances and angles. The implications for autonomous vehicles are clear.

Examples of adversarial stop signs that are misclassified as speed limit signs (Evtimov et al., 2017).

Furthermore, Carlini & Wagner (2018) showed that a speech recognition model can also be fooled by adding background noise to an input. They modify the audio file corresponding to the sentence “without the dataset the article is useless” to cause a speech recognition model to transcribe it as “okay google browse to evil dot com”. The modified audio, which you can listen to here, sounds almost identical to human. Such adversarial audio has the potential to cause serious mischief when used on unsuspecting speech interfaces in smartphones, smart homes, or self-driving cars.

A Quick Glossary

Let’s take a look at several terms that are often used in the field of adversarial machine learning:

  • Whitebox attack: attack scenario where the attackers have complete access to the model that they want to attack. As such they know the model’s architecture and parameters.
  • Blackbox attack: attack scenario where the attackers can only observe the outputs of a model that they are trying to attack. For example, attacking a machine learning model via an API is considered a blackbox attack since one can only provide different inputs and observe the outputs.
  • Targeted attack: attack scenario where the attackers design the adversaries to be mispredicted in a specific way. For instance, our audio example earlier: from “without the dataset the article is useless” to “okay google browse to evil dot com”. The alternative is an untargeted attack, in which the attackers do not care about the outcome as long as the example is mispredicted.
  • Universal attack: attack scenario where the attackers devise a single transform such as image perturbation that adversarially confuses the model for all or most input values (input-agnostic). For an example, see Moosavi-Dezfooli et al. (2016).
  • Transferability: a phenomenon where adversarial examples generated to fool a specific model can be used to fool another model that is trained on the same datasets. This is often referred to as the transferability property of adversarial examples (Szegedy et al., 2014; Papernot et al., 2016).

We’ll now look at several interesting methods to generate adversarial attacks based on knowledge of the attackers, i.e. whitebox or blackbox, in the computer vision domain (classification models to be exact), charting the evolution of the field. In the next post, we will look at the other side of the arms race: the arsenal of mechanisms for adversarial defense.

Timeline of the adversarial attacks covered in this article.

How are Adversarial Examples Generated?

Ontology of adversarial attacks based on knowledge of the attackers discussed in this article. Note that this does not necessarily represent all attack methods that exist today.

Whitebox Additive Adversarial Perturbations Based on dL/dx

This family of attacks is based on the idea of perturbing the input in a way that maximally changes the loss function of the model. In case of neural networks, this means that we need to perform back propagation to calculate the derivative of the loss function with respect to its input (as opposed to the parameters like we usually do when training the neural networks). Specifically, an attacker is interested in finding the optimal direction for the perturbation and nudging the input in this direction in the hope that the model will misclassify the perturbed input.

Illustration of the whitebox attacks for both the additive adversarial perturbations based on dL/dx and iterative optimization based attacks. Once dL/dx is calculated (step 1), one may view the attack process as a game where a player (the attacker) can adjust the pixel values (step 2) of the input based on some hints, i.e. the gradient dL/dx, to fool a model (step 3).

Fast Gradient Sign Method (FGSM)

FGSM (Goodfellow et al., 2014) searches for the direction in which the loss function increases fastest for a target machine learning model. FGSM is an example of a whitebox attack because the attacker needs to know the model’s architecture and parameters to perform back propagation. Once the gradient is computed, one can push the input towards the adversarial gradient by a small amount.

FGSM formulation. Here, x’ is the adversarial example that should look similar to x when ϵ is small, and y is the model’s output. ϵ is a small constant that controls the magnitude of the perturbation, and J denotes the loss function of the model.

There is no guarantee that the generated adversarial examples by this method are similar to its real counterpart. Practically, one needs to make a tradeoff between small perturbations that are visually similar to the original input, and whether the model actually misclassifies the perturbed input.

Basic Iterative Method (BIM)

BIM (Kurakin et al., 2017) is an extension of FGSM in which one performs FGSM multiple times with small step size. Some other papers also refer BIM as the Iterative FGSM (I-FGSM).

BIM formulation where J denotes the loss function of the model, N denotes the number of iteration, and α is a constant that controls the magnitude of the perturbations (Kurakin et al., 2017). The Clip{} function ensures that the adversarial example generated is still within the range of both the ϵ ball (i.e. [x-ϵ, x+ϵ]) and the input space (i.e. [0, 255] for pixel values).

(R)andom + FGSM (R+FGSM)

In R+FGSM, Tramer et al. (2017) suggest to add some random perturbations sampled from a Gaussian distribution before calculating the first derivative of the loss with respect to the input.

R + FGSM formulation where α is another constant that controls the magnitude of random perturbations sampled from a normal distribution (Tramer et al., 2017).

The motivation for R+FGSM is to circumvent defenses that rely on gradient masking (Papernot et al., 2016), which is a very important concept in adversarial machine learning. Gradient masking techniques seek to obscure or hide the gradient of the model to make it harder for the attacker to calculate the exact dL/dx. We’ll cover this in the next post on adversarial defenses.

Spoiler alert: the seven defenses accepted to the ICLR 2018 that have been successfully attacked were shown to rely on obfuscated gradients (Athalye et al., 2018), which is a form of gradient masking.

Whitebox Attacks Based on Iterative Optimization of Surrogate Objective Functions

These attacks are also whitebox and rely on dL/dx. However, they do not attempt to naively use the computed gradient directly as an added perturbation. Instead, these attacks define adversarial attack as an optimization problem to find an update to an input that optimizes an objective function. Modelling this as an optimization problem allows one to be flexible in folding in more adversarial criteria into the objective function.

L-BFGS Attack

Szegedy et al. (2014) defined adversarial example as inputs that look very similar to their real counterparts according to a distance metric (e.g. L2 distance a.k.a. Euclidean distance or mean squared error), but one that causes a classifier to misclassify it. The Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) is a non-linear gradient based numerical optimization algorithm. However, since Szegedy et al. (2014) defined the problem as an optimization problem which can be solved using L-BFGS, the attack is now often referred as the L-BFGS attack. The L-BFGS attack aims to find a perturbation r that minimizes:

L-BFGS attack seeks to solve this optimization problem where r is the perturbation (Szegedy et al., 2014).

In above formulation, the goal is to make the classifier f to misclassify x + r as class l. The loss function used here is the cross-entropy loss, but can be replaced with other surrogate functions as we will see in the next attack. Here, line search is used to find the minimum constant c where c>0 until an adversary is found.

Carlini & Wagner Attack (C&W)

Carlini & Wagner (2016) extended L-BFGS attack by modifying the objective function instead of using the standard cross-entropy loss:

The loss function used in C&W attack. Note the change in notation where f now represents the loss function of the classifier, not the classifier itself. Here, Z(x’) denotes the logits (the outputs of a neural network before the softmax layer) when passing adversarial input (x’) and t represents the target misclassification label (the label that we want the adversary to be misclassified as), while κ is a constant that controls the desired confidence score (Carlini & Wagner, 2016).

The intuition for this objective function is to optimize for the distance between the target class t and the most-likely class. If t currently has the highest logit value, then the difference of the logits will be negative, and so the optimization will stop when the logit difference between t and the runner-up class is at most κ. In other words, κ controls the desired confidence for the adversarial example (e.g. when κ is small, the adversarial example generated will be a low confidence adversarial example). On the other hand, if t does not have the highest logit, then minimizing f brings the gap between the highest class’ logit and the target class’ logit closer together, i.e. either reducing the highest class’ confidence and/or increasing the target class’ confidence. Finally, the objective of the optimization problem now becomes to minimize:

Slightly modified optimization objective. Here, w is the variable that we want to optimize over (Carlini & Wagner, 2016).

Carlini & Wagner (2016) actually proposed three different attacks under three different perceptual similarity metrics (L0, L2, and L∞). For simplicity, I am just showing the L2 attack in this article, but feel free to check their other attacks in the paper. As mentioned earlier, these attacks successfully circumvented defensive distillation.

Adversarial Transformation Network (ATN)

The idea of ATN (Baluja & Fischer, 2017) is to use another neural network whose objective is to either generate (1) adversarial examples that look similar to the valid inputs (Adversarial Autoencoding or AAE) or (2) adversarial perturbations which when added to the original instance will produce adversarial examples (Perturbation ATN or P-ATN). The goal of the generator is to minimize similarity loss between the generated image and the valid input (e.g. L2 loss), while also trying to minimize the classification loss between the classifier’s predictions and the fake targets.

Illustration of AAE. Note that this figure is not from the paper, but created for visualization purpose only. Baluja & Fischer (2017) used L2 loss for both loss terms in their paper for simplicity.
Illustration of P-ATN. Note that this figure is not from the paper, but created for visualization purpose only. Baluja & Fischer (2017) used L2 loss for both loss terms in their paper for simplicity.

Note that a generator can only be trained to generate adversarial examples (or perturbations) that will be misclassified as a certain class by the target classifier. Therefore, one needs to train different ATNs in order to generate adversarial examples that are misclassified as different classes. Although not mentioned here, Baluja & Fischer (2017) also proposed a “reranking” function to modify the training label so that the generated adversarial examples only minimally modify the output of the target classifier. Curious readers are encouraged to check their paper :)

Spatially Transformed Network (stAdv)

The idea behind stAdv (Xiao et al., 2018) attack is very similar to L-BFGS and C&W attacks. In fact, stAdv uses the loss function in C&W attack for the classification objective. The difference between stAdv and C&W L2 attack is that instead of trying to optimize for L2 distance as the perceptual similarity metric, stAdv aims to achieve perceptual similarity by optimizing for geometrical similarity. In other words, rather than directly modifying the pixel values, they minimally modified the spatial location of the pixels. This is done by deriving flow fields, which describe the movement performed upon each pixel.

stAdv suggests to minimize this loss function as the perceptual similarity metric rather than minimize for L2 distance. Here, (u,v) refers to spatial location of each pixel (p), N(p) refers to the neighbouring pixels around p within a specified radius, and q is one of the neighbouring pixels. Finally, f is the flow field that indicates the amount of spatial transformation (Xiao & Zhu et al., 2018).

The resulting adversarial example can then be calculated by the following formulation:

How to calculate the adversarial example given the spatial location updates of each pixel (Xiao & Zhu et al., 2018).

The motivation behind this attack is that distance metrics like L2 distance do not necessarily represent good perceptual metrics. Conversely, limiting the spatial deformation in an image usually produces a perturbed image that resembles the original image. We can see the results in the figure below where the pixels have been shifted around. The red arrows show how the pixels are moved from benign to adversarial image.

Results of stAdv. The adversarial image on the right is misclassified as a digit “2” instead of “0” (Xiao et al., 2018).

Blackbox Adversaries Based on Decision Boundary Approximation

In a blackbox setting, attackers do not have access to the model’s structure, and so cannot calculate dL/dx directly. Therefore, this family of attacks relies on various ways of approximating how a model behaves based on provided inputs. One may think of this as a scenario between a psychologist (an attacker) and a patient (a model), where the psychologist asks many questions to a patient, and analyze the behavior of a patient based on her responses.

Substitute Blackbox Attack

The intuition behind the substitute blackbox attack (Papernot et al., 2016) is to approximate the decision boundary of the blackbox model that we want to attack. In order to do this, the approach is to train a substitute model on a synthetic dataset that is similar to the dataset that the blackbox model is trained on. For example, suppose we want to attack a blackbox model trained on MNIST to perform handwritten recognition, in the simplest case we can generate the synthetic data manually by using our own handwriting. The trick here is that the label for the synthetic dataset should come from the blackbox model’s prediction.

Illustration of the substitute blackbox attack. There are four main steps in performing the attack: 1) train the substitute model to approximate the decision of blackbox model, 2) generate adversarial examples by performing a whitebox attack (e.g. FGSM) on the substitute model, 3) validate that the adversarial examples fool the substitute model, and 4) the generated adversarial examples should be transferable to fool the blackbox model.

Papernot et al. (2016) noted that an attacker is often constrained from making unlimited query to the target model in the real world. In order for this method to be tractable, a dataset augmentation technique called the Jacobian-based dataset augmentation was introduced. This augmentation technique is based on calculating the gradients of the label assigned by the target model with respect to the inputs in order to generate several additional samples around a small initial synthetic dataset. However, since the attacker does not know anything about the target model, the gradients are calculated through the parameters of the substitute model instead. Papernot et al. (2016) argued that this augmentation technique makes this method more efficient to approximate the decision boundary of the target model without having to make large number of queries.

The training procedure for the substitute model using the proposed data augmentation method is as follows. The attacker first created a small initial training set, where it can be initialized by picking one sample from each possible class from a dataset that represents the input domain of the target model. The substitute model is then trained on the synthetic dataset using labels provided by the target model (e.g. by querying the target model). After the training process is done, new datapoints are generated by perturbing each sample in the existing dataset according to the calculated gradients. Finally, the new inputs are added to the existing dataset, i.e. the size of the synthetic dataset grows per iteration. This process is then repeated several times.

Once the substitute model is trained, we can generate adversarial examples that fool the substitute model using whitebox methods since we have full access to the substitute model. As demonstrated by Papernot et al. (2016), adversarial examples generated this way can then be used to fool the blackbox model thanks to the transferability property. Moreover, this attack can often be used to circumvent defenses that rely on gradient masking such as the defensive distillation (Papernot et al., 2015).

Blackbox Adversaries Based on Heuristic Search

Unlike other attacks that explicitly rely on dL/dx, adversarial examples can also be found by performing heuristic search. For example, one can create a set of rules that characterizes adversarial examples and use search algorithms to find an input that satisfies those rules.

Boundary Attack

Boundary attack (Brendel et al., 2018), also a form of blackbox attack, works by evaluating a sequence of perturbed images through the model. For a non-targeted attack, the starting image can be sampled from uniform noise. In the case of a targeted attack, the starting image is an example from the target misclassification class. The method then modifies the image iteratively to look more like an example from another class while continuing to preserve its adversarial nature. The intuition behind the boundary attack is to slowly move to the direction of the decision boundary and do random walk along the boundary.

Simplified algorithm of the non-targeted boundary attack (adapted from the paper). Since the attacker only requires to evaluate the model’s prediction, this attack falls into the category of blackbox attack.

In practice, Brendel et al., 2018 set some constraints that have to be met after sampling the noise η in the algorithm above. The first and second constraints ensure that the image is still within [0, 255] (e.g. for 8-bit RGB image) when η is added to the image and that the perturbation is small, respectively. The last constraint is to ensure that η will reduce the distance between perturbed image and the original input while still being adversarial. We refer readers to their paper for the implementation details.

Illustration of the targeted boundary attack. One can generate an adversarial example by keep adding noise ηsampled from some “bank of noise” (e.g. Gaussian noise) to a benign example until the image looks like another image from a different class, while still be classified as the true class of the original image (at t = 0).

The figure above illustrates the targeted boundary attack, where we start from a valid image from a class that we want the adversary to be misclassified as (a fish) and move towards the direction of the valid input from another class (a cat) over several iterations.


Let’s summarize the different types of attacks covered here:

  • Some attacks rely on first order derivative by calculating the derivative of the loss with respect to an input, and pushing that input towards the direction where the loss will increase (FGSM, BIM, R+FGSM).
  • Other attacks are based on iterative optimization process on various objective functions (L-BFGS, C&W, stAdv) whether using L-BFGS, Adam (Kingma & Ba, 2014), or other optimization methods. The advantage of modelling adversarial attack as an optimization problem is to allow an attacker to fold in more adversarial criteria to the objective function. Furthermore, adversarial examples can also be generated by training a generative transformation model to optimize for the objective functions (ATN).
  • We can rely on the transferability property of adversarial examples and attack a blackbox model by attacking a substitute model that is trained on the synthetic datasets labelled by the blackbox model (substitute blackbox attack).
  • Finally, another blackbox attack can be achieved by starting from a datapoint that is outside of the target class data manifold and trying to move closer to the decision boundary between adversarial and non-adversarial class, performing random walk along the decision boundary through rejection sampling method (boundary attack).

I hope this article is useful to read, and that it sparks more interest in the field of adversarial machine learning. In the next post, we will see several defense methods and how most of these defenses can be circumvented (‘meta-adversarial learning’; learning to generate adversarial attack techniques which fool adversarial defenses!). Please feel free to provide suggestions in the comment section if I am missing something, or if you have any specific requests for the next post. Until next time!

Thanks to Anqi Xu, Archy de Berker, Morgan Guegan, and Wei-Wei Lin for valuable comments and illustrations!