Tricking a Machine into Thinking You’re Milla Jovovich

And other types of adversarial attacks in machine learning

Rey Reza Wiyatno
Aug 9, 2018 · 14 min read

What is an Adversarial Attack?

In early 2014, Szegedy et al. (2014) showed that minimally altering the inputs to machine learning models can lead to misclassification. These inputs are called adversarial examples: pieces of data deliberately engineered to trick a model.

This picture of a fish (left) is correctly classified, but the addition of a small perturbation (middle) generated by the Fast Gradient Sign Method (FGSM) causes a classifier to misclassify the resulting image (right) as a cat.

Why Should I Care About Adversarial Attacks?

The real-world implications of adversarial examples should not be underestimated. Consider a homeowner who uses a face recognition system as a security feature. We can now generate adversarial eyeglasses (Sharif et al., 2016) that can be printed and placed on a real eyeglass frame to fool face recognition models.

A man wearing an adversarial eyeglass (top) is grossly misclassified as Milla Jovovich (bottom) (Sharif et al., 2016).
Examples of adversarial stop signs that are misclassified as speed limit signs (Evtimov et al., 2017).

A Quick Glossary

Let’s take a look at several terms that are often used in the field of adversarial machine learning:

  • Blackbox attack: attack scenario where the attackers can only observe the outputs of a model that they are trying to attack. For example, attacking a machine learning model via an API is considered a blackbox attack since one can only provide different inputs and observe the outputs.
  • Targeted attack: attack scenario where the attackers design the adversarial examples to be mispredicted in a specific way. For instance, an audio input transcribed as “without the dataset the article is useless” can be perturbed so that it is transcribed as “okay google browse to evil dot com”. The alternative is an untargeted attack, in which the attackers do not care about the outcome as long as the example is mispredicted.
  • Universal attack: attack scenario where the attackers devise a single transform such as image perturbation that adversarially confuses the model for all or most input values (input-agnostic). For an example, see Moosavi-Dezfooli et al. (2016).
  • Transferability: a phenomenon where adversarial examples generated to fool a specific model can be used to fool another model that is trained on the same datasets. This is often referred to as the transferability property of adversarial examples (Szegedy et al., 2014; Papernot et al., 2016).
Timeline of the adversarial attacks covered in this article.

How are Adversarial Examples Generated?

Ontology of the adversarial attacks discussed in this article, organized by the attackers' knowledge of the model. Note that this does not necessarily represent all attack methods that exist today.

Whitebox Additive Adversarial Perturbations Based on dL/dx

This family of attacks is based on the idea of perturbing the input in a way that maximally changes the loss function of the model. In the case of neural networks, this means that we need to perform backpropagation to calculate the derivative of the loss function with respect to the input (as opposed to the parameters, as we usually do when training neural networks). Specifically, an attacker is interested in finding the optimal direction for the perturbation and nudging the input in this direction in the hope that the model will misclassify the perturbed input.

Illustration of the whitebox attacks for both the additive adversarial perturbations based on dL/dx and iterative optimization based attacks. Once dL/dx is calculated (step 1), one may view the attack process as a game where a player (the attacker) can adjust the pixel values (step 2) of the input based on some hints, i.e. the gradient dL/dx, to fool a model (step 3).
FGSM formulation. Here, x’ is the adversarial example that should look similar to x when ϵ is small, and y is the model’s output. ϵ is a small constant that controls the magnitude of the perturbation, and J denotes the loss function of the model.
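
To make the caption concrete: FGSM is usually written as x’ = x + ϵ · sign(∇ₓ J(x, y)). Here is a minimal sketch of that single step, assuming a toy logistic-regression model so that the input gradient has a closed form. The weights, the input, and the helper names are all hypothetical, and y is taken to be the label used to evaluate the loss.

```python
import math


def sigmoid(z):
    """Logistic function, written via tanh so it never overflows."""
    return 0.5 * (1.0 + math.tanh(0.5 * z))


def predict_proba(w, b, x):
    """P(class 1 | x) for a toy logistic-regression 'model'."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss_grad_wrt_input(w, b, x, y):
    """dJ/dx for binary cross-entropy J; for this model it is (p - y) * w."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps on every input feature."""
    grad = loss_grad_wrt_input(w, b, x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Hypothetical weights and input, chosen only for illustration.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.6, 0.1, 0.5], 1

x_adv = fgsm(w, b, x, y, eps=0.3)
print(predict_proba(w, b, x) > 0.5)      # is the clean input class 1?
print(predict_proba(w, b, x_adv) > 0.5)  # is the perturbed input class 1?
```

With a real neural network, the gradient would come from backpropagation rather than a closed form, but the update itself stays the same.
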
BIM formulation, where J denotes the loss function of the model, N denotes the number of iterations, and α is a constant that controls the magnitude of the perturbations (Kurakin et al., 2017). The Clip{} function ensures that the generated adversarial example stays within both the ϵ ball (i.e. [x-ϵ, x+ϵ]) and the input space (i.e. [0, 255] for pixel values).
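
The iterative version can be sketched the same way. This is only an illustration under stated assumptions: a toy logistic-regression model, inputs scaled to [0, 1] instead of [0, 255], and hypothetical parameter values. Each iteration takes a step of size α and then clips the result back into the ϵ ball and the valid input range.

```python
import math


def loss_grad_wrt_input(w, b, x, y):
    """dJ/dx of binary cross-entropy for a toy logistic model: (p - y) * w."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 0.5 * (1.0 + math.tanh(0.5 * z))  # sigmoid(z), overflow-free
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def clip(value, low, high):
    return min(max(value, low), high)


def bim(w, b, x, y, eps, alpha, n_iter, lo=0.0, hi=1.0):
    """Repeat small signed-gradient steps, clipping after each one."""
    x_adv = list(x)
    for _ in range(n_iter):
        grad = loss_grad_wrt_input(w, b, x_adv, y)
        x_adv = [
            # Stay inside both the eps ball around x and the input range.
            clip(xa + alpha * sign(g), max(x0 - eps, lo), min(x0 + eps, hi))
            for xa, g, x0 in zip(x_adv, grad, x)
        ]
    return x_adv


w, b = [2.0, -3.0, 1.0], 0.0   # hypothetical model parameters
x, y = [0.6, 0.1, 0.5], 1      # hypothetical input in [0, 1] and its label
x_adv = bim(w, b, x, y, eps=0.3, alpha=0.1, n_iter=5)
```
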
R + FGSM formulation where α is another constant that controls the magnitude of random perturbations sampled from a normal distribution (Tramer et al., 2017).
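
As a rough sketch of the randomized variant, assuming the two-step form x* = x + α · sign(N(0, I)) followed by x’ = x* + (ϵ − α) · sign(∇ₓ J(x*, y)): the input first takes a random signed step of size α, then a gradient step with the remaining budget. The toy logistic model, the seed, and all values below are hypothetical.

```python
import math
import random


def loss_grad_wrt_input(w, b, x, y):
    """dJ/dx of binary cross-entropy for a toy logistic model: (p - y) * w."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 0.5 * (1.0 + math.tanh(0.5 * z))  # sigmoid(z), overflow-free
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def rand_fgsm(w, b, x, y, eps, alpha, rng):
    """Random signed step of size alpha, then an FGSM step of size eps - alpha."""
    x_rand = [xi + alpha * sign(rng.gauss(0.0, 1.0)) for xi in x]
    grad = loss_grad_wrt_input(w, b, x_rand, y)
    return [xr + (eps - alpha) * sign(g) for xr, g in zip(x_rand, grad)]


w, b = [2.0, -3.0, 1.0], 0.0   # hypothetical model parameters
x, y = [0.6, 0.1, 0.5], 1      # hypothetical input and label
x_adv = rand_fgsm(w, b, x, y, eps=0.3, alpha=0.15, rng=random.Random(0))

# The two steps together never move a feature by more than eps.
max_change = max(abs(a - c) for a, c in zip(x_adv, x))
```
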

Whitebox Attacks Based on Iterative Optimization of Surrogate Objective Functions

These attacks are also whitebox and rely on dL/dx. However, they do not naively add the computed gradient to the input as a perturbation. Instead, they frame the adversarial attack as an optimization problem: find an update to the input that optimizes an objective function. Modelling the attack this way makes it easy to fold additional adversarial criteria into the objective function.

L-BFGS attack seeks to solve this optimization problem where r is the perturbation (Szegedy et al., 2014).
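
For reference, the optimization problem is usually stated as follows. This is written from the formulation in Szegedy et al. (2014) as we understand it, with l denoting the target label, c a positive constant, and m the input dimension; those symbol names are assumptions rather than something defined in this article.

```latex
\begin{aligned}
\min_{r} \quad & c \, \lVert r \rVert_2 + \mathrm{loss}_f(x + r,\, l) \\
\text{subject to} \quad & x + r \in [0, 1]^m
\end{aligned}
```
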
The loss function used in the C&W attack. Note the change in notation: f now represents the loss function of the classifier, not the classifier itself. Here, Z(x’) denotes the logits (the outputs of a neural network before the softmax layer) for the adversarial input x’, t is the target label (the label that we want the adversarial example to be misclassified as), and κ is a constant that controls the desired confidence score (Carlini & Wagner, 2016).
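
A small sketch of this loss, assuming it takes the form f(x’) = max(max{Z(x’)ᵢ : i ≠ t} − Z(x’)ₜ, −κ). The function operates directly on a list of logits; the function name and the example logits are hypothetical.

```python
def cw_loss(logits, target, kappa=0.0):
    """Margin between the best non-target logit and the target logit.

    The value stops decreasing once the target logit leads every other
    logit by at least kappa, so kappa sets the desired confidence.
    """
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)


# Target class 1 is not yet the top class: positive loss.
not_fooled = cw_loss([2.0, 1.0, 0.5], target=1)

# Target class 1 leads by 4.0, which exceeds kappa: loss is floored at -kappa.
fooled = cw_loss([1.0, 5.0, 0.5], target=1, kappa=2.0)
```
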
Slightly modified optimization objective. Here, w is the variable that we want to optimize over (Carlini & Wagner, 2016).
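
To illustrate why optimizing over w is convenient, here is a sketch that assumes the change of variables x’ = ½(tanh(w) + 1) and the objective ‖x’ − x‖² + c · f(x’). Any real-valued w maps to a valid input in [0, 1], so no clipping is needed. The helper names and the toy attack loss are hypothetical stand-ins.

```python
import math


def to_input(w):
    """Map unconstrained variables w to inputs in [0, 1]."""
    return [0.5 * (math.tanh(wi) + 1.0) for wi in w]


def to_w(x, margin=1e-6):
    """Inverse mapping, used to initialise w at the clean input x."""
    return [math.atanh(2.0 * min(max(xi, margin), 1.0 - margin) - 1.0) for xi in x]


def cw_objective(w, x, c, attack_loss):
    """Squared L2 distance to x plus c times the attack loss f(x')."""
    x_adv = to_input(w)
    dist = sum((a - b) ** 2 for a, b in zip(x_adv, x))
    return dist + c * attack_loss(x_adv)


def toy_attack_loss(x_adv):
    """Hypothetical stand-in for f: zero once the first feature drops below 0.5."""
    return max(x_adv[0] - 0.5, 0.0)


x = [0.8, 0.3]
w0 = to_w(x)                                   # start exactly at the clean input
start_value = cw_objective(w0, x, c=1.0, attack_loss=toy_attack_loss)
```

In practice one would minimize this objective over w with a gradient-based optimizer such as Adam.
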
Illustration of AAE. Note that this figure is not from the paper, but was created for visualization purposes only. Baluja & Fischer (2017) used L2 loss for both loss terms in their paper for simplicity.
Illustration of P-ATN. Note that this figure is not from the paper, but was created for visualization purposes only. Baluja & Fischer (2017) used L2 loss for both loss terms in their paper for simplicity.
stAdv proposes minimizing this loss function as the perceptual similarity metric, rather than minimizing the L2 distance. Here, (u,v) refers to the spatial location of each pixel (p), N(p) refers to the neighbouring pixels around p within a specified radius, and q is one of the neighbouring pixels. Finally, f is the flow field that indicates the amount of spatial transformation (Xiao et al., 2018).
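
A sketch of this flow loss, assuming it sums √((Δu(p) − Δu(q))² + (Δv(p) − Δv(q))²) over every pixel p and each neighbour q in N(p). The 4-pixel neighbourhood and the nested-list layout of the flow field are choices made here for illustration.

```python
import math


def flow_loss(du, dv):
    """Total difference between each pixel's flow and its 4 neighbours' flows.

    du[r][c] and dv[r][c] hold the horizontal and vertical displacement
    of the pixel at row r, column c.
    """
    height, width = len(du), len(du[0])
    total = 0.0
    for r in range(height):
        for c in range(width):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < height and 0 <= cc < width:
                    total += math.sqrt(
                        (du[r][c] - du[rr][cc]) ** 2
                        + (dv[r][c] - dv[rr][cc]) ** 2
                    )
    return total


zeros = [[0.0] * 3 for _ in range(3)]
shifted = [[0.5] * 3 for _ in range(3)]          # every pixel moves the same way
bumpy = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

smooth_cost = flow_loss(shifted, zeros)   # uniform flow is not penalised
bumpy_cost = flow_loss(bumpy, zeros)      # one pixel moving alone is penalised
```

The loss only penalises pixels that move differently from their neighbours, which is why a locally smooth deformation can stay perceptually subtle.
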
How the adversarial example is computed from the spatial location updates of each pixel (Xiao et al., 2018).
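
One way to picture this step is bilinear interpolation: each output pixel reads from the location its flow vector points to, blending the four surrounding input pixels with weights (1 − |Δ|) along each axis. The sketch below assumes u is the column coordinate and v the row coordinate, and clamps coordinates at the image border; both are illustrative choices.

```python
import math


def bilinear_sample(img, u, v):
    """Read img at fractional column u and row v by blending 4 neighbours."""
    height, width = len(img), len(img[0])
    u = min(max(u, 0.0), width - 1.0)    # clamp to the image border
    v = min(max(v, 0.0), height - 1.0)
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, width - 1), min(v0 + 1, height - 1)
    au, av = u - u0, v - v0              # fractional offsets in [0, 1)
    return (
        (1 - au) * (1 - av) * img[v0][u0]
        + au * (1 - av) * img[v0][u1]
        + (1 - au) * av * img[v1][u0]
        + au * av * img[v1][u1]
    )


def warp(img, du, dv):
    """Build the adversarial image: pixel (r, c) samples from (c + du, r + dv)."""
    return [
        [bilinear_sample(img, c + du[r][c], r + dv[r][c]) for c in range(len(img[0]))]
        for r in range(len(img))
    ]


img = [[0.0, 1.0], [2.0, 3.0]]
no_flow = [[0.0, 0.0], [0.0, 0.0]]
half = [[0.5, 0.5], [0.5, 0.5]]

unchanged = warp(img, no_flow, no_flow)   # zero flow leaves the image as is
centre = bilinear_sample(img, 0.5, 0.5)   # average of all four pixels
```
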
Results of stAdv. The adversarial image on the right is misclassified as a digit “2” instead of “0” (Xiao et al., 2018).

Blackbox Adversaries Based on Decision Boundary Approximation

In a blackbox setting, attackers do not have access to the model’s structure, and so cannot calculate dL/dx directly. Therefore, this family of attacks relies on various ways of approximating how a model behaves based on the inputs provided. One may think of this as an exchange between a psychologist (the attacker) and a patient (the model), where the psychologist asks the patient many questions and analyzes the patient’s behavior based on the responses.

Illustration of the substitute blackbox attack. There are four main steps in performing the attack: 1) train the substitute model to approximate the decisions of the blackbox model, 2) generate adversarial examples by performing a whitebox attack (e.g. FGSM) on the substitute model, 3) validate that the adversarial examples fool the substitute model, and 4) use the generated adversarial examples, which should be transferable, to fool the blackbox model.
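
The four steps can be sketched end to end on a toy problem. Everything here is a simplification made for illustration: the "blackbox" is a hidden linear rule, the substitute is a logistic regression trained on random queries, and ϵ is exaggerated so the effect is easy to see. The function names are hypothetical.

```python
import math
import random


def blackbox_predict(x):
    """Hypothetical blackbox: the attacker only ever sees the returned label."""
    return 1 if 1.5 * x[0] - 2.0 * x[1] + 0.2 > 0 else 0


def sigmoid(z):
    return 0.5 * (1.0 + math.tanh(0.5 * z))


def sign(v):
    return (v > 0) - (v < 0)


def train_substitute(inputs, labels, lr=0.1, epochs=50):
    """Step 1: fit a logistic-regression substitute to the blackbox's labels."""
    w, b = [0.0] * len(inputs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(inputs, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def substitute_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def fgsm_on_substitute(w, b, x, y, eps):
    """Step 2: whitebox FGSM using the substitute's input gradient (p - y) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]


rng = random.Random(0)
queries = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
labels = [blackbox_predict(q) for q in queries]   # labels come from the blackbox
w, b = train_substitute(queries, labels)

x = [0.3, 0.0]
y = blackbox_predict(x)
x_adv = fgsm_on_substitute(w, b, x, y, eps=0.5)

fools_substitute = substitute_predict(w, b, x_adv) != y   # step 3
fools_blackbox = blackbox_predict(x_adv) != y             # step 4
```

The attack never touches the blackbox's parameters: it only queries labels, then relies on transferability for the final step.
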

Blackbox Adversaries Based on Heuristic Search

Unlike other attacks that explicitly rely on dL/dx, adversarial examples can also be found by performing heuristic search. For example, one can create a set of rules that characterizes adversarial examples and use search algorithms to find an input that satisfies those rules.

Simplified algorithm of the non-targeted boundary attack (adapted from the paper). Since the attacker only needs to evaluate the model’s prediction, this attack falls into the category of blackbox attacks.
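
A heavily simplified sketch of this idea follows. It keeps only the core loop: start from noise that is already misclassified, propose a random move plus a small pull toward the original input, and reject any proposal that stops being adversarial. The published attack uses a more careful proposal distribution; the toy classifier and all step sizes here are hypothetical.

```python
import math
import random


def blackbox_predict(x):
    """Hypothetical blackbox classifier: only the predicted label is visible."""
    return 1 if x[0] + x[1] > 1.0 else 0


def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def boundary_attack(x, predict, steps=500, noise=0.05, pull=0.05, seed=0):
    rng = random.Random(seed)
    true_label = predict(x)

    # Start from random noise that the model already misclassifies.
    x_adv = [rng.random() for _ in x]
    while predict(x_adv) == true_label:
        x_adv = [rng.random() for _ in x]

    for _ in range(steps):
        # Propose: random jitter, followed by a small pull toward the original.
        candidate = [a + rng.gauss(0.0, noise) for a in x_adv]
        candidate = [c + pull * (o - c) for c, o in zip(candidate, x)]
        # Rejection step: keep the proposal only if it is still adversarial
        # and closer to the original than the current point.
        still_adversarial = predict(candidate) != true_label
        if still_adversarial and distance(candidate, x) < distance(x_adv, x):
            x_adv = candidate
    return x_adv


x = [0.9, 0.8]
x_adv = boundary_attack(x, blackbox_predict)
```
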
Illustration of the targeted boundary attack. One can generate an adversarial example by repeatedly adding noise η sampled from some “bank of noise” (e.g. Gaussian noise) to a benign example until the image looks like another image from a different class, while still being classified as the true class of the original image (at t = 0).


Let’s summarize the different types of attacks covered here:

  • Other attacks are based on iterative optimization of various objective functions (L-BFGS, C&W, stAdv), whether using L-BFGS, Adam (Kingma & Ba, 2014), or other optimization methods. The advantage of modelling an adversarial attack as an optimization problem is that it allows an attacker to fold more adversarial criteria into the objective function. Furthermore, adversarial examples can also be generated by training a generative transformation model to optimize the objective functions (ATN).
  • We can rely on the transferability property of adversarial examples and attack a blackbox model by attacking a substitute model that is trained on the synthetic datasets labelled by the blackbox model (substitute blackbox attack).
  • Finally, another blackbox attack can be achieved by starting from a datapoint that is outside of the target class data manifold, moving closer to the decision boundary between the adversarial and non-adversarial classes, and then performing a random walk along the decision boundary using rejection sampling (boundary attack).

Element AI Lab
