Deep Dive Into Adversarial Attacks

Aryaman Sinha · Published in The Startup · Aug 30, 2020

In this article, we will discuss how deep neural networks are not yet at their full potential: they are vulnerable to specially crafted inputs called adversarial examples, which pose security concerns in the physical world as applications of deep neural networks, such as facial recognition and object detection, continue to advance.

What are Adversarial Examples?

“They are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake.” — Goodfellow et al.

Adding an imperceptible perturbation, a “panda” is classified as a “gibbon”.
Image credit: Goodfellow et al., 2014b

One can think of it as an optical illusion for machine learning models. The basic idea is to introduce a small perturbation into the input that is imperceptible to humans but can fool well-performing models such as LeNet, ResNet, and YOLO.

Some History

Biggio et al. (2013) were the first to investigate poisoning attacks on the Support Vector Machine (SVM). Szegedy et al. (2014) were then the first to introduce adversarial examples against deep neural networks, using the L-BFGS optimization method to generate “targeted” adversarial examples. Next came the Fast Gradient Sign Method (FGSM) by Goodfellow et al., 2014b, a one-step method to quickly generate adversarial examples (both non-targeted and targeted attacks). Following this trend, many algorithms appeared, such as DeepFool, JSMA, BIM/PGD, and C&W. Among all these attacks, the C&W attack by Carlini and Wagner remains the strongest so far, achieving a 100% success rate against the benchmark Inception V3 model on the ImageNet dataset.

However, all the methods up to 2016 assumed a white-box threat model, i.e. the attacker knows the model's internals beforehand: its hyperparameters, the dataset used, and the architecture. In practice, the attacker only has access to the input samples and the output classes or prediction scores (depending on the case), which is a black-box threat model. Papernot et al. (2017) were the first to introduce a practical black-box attack algorithm, the substitute-model strategy, which relies on a significant property of adversarial examples called “transferability”: if adversarial examples are crafted to fool model A, they can often also fool a model B that has a structure similar to model A.

Thus, adversarial example generation (attacks) can be classified into White-Box Attacks, Black-Box Attacks, and in some cases Gray-Box (Semi-White) Attacks. The best examples of gray-box attacks are GAN-based approaches: after the generator has been trained to craft adversarial examples under white-box settings, the attacker no longer needs access to the model.

White-Box Attacks

There have been many white-box attacks so far, but here we will discuss one of the most effective methods: Carlini and Wagner's L2-metric-based targeted attack.

They formally define the problem of finding an adversarial instance for an image x as follows:
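In the notation of Carlini and Wagner (2017b) (a reconstruction from the paper, with D a distance metric, C the classifier, and t the target class):

\min_{\delta}\; D(x, x+\delta) \quad \text{such that} \quad C(x+\delta) = t, \;\; x+\delta \in [0,1]^n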

They defined seven candidate objective functions f(.); the formulation below gave the best experimental results, and they also offered a possible explanation for this, which is discussed next:
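The best-performing choice was a margin-style objective (written out here from the paper's definition, with Z(x, i) the logit for class i, y the target class, and K the confidence parameter):

f(x, y) = \max\Big(\max_{i \neq y} Z(x, i) - Z(x, y),\; -K\Big)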

The function f(x, y) can be viewed as a loss function for the data point (x, y): it penalises situations where some label i has a score Z(x, i) larger than Z(x, y), so it can also be viewed as a margin loss. The benefit of using the margin loss rather than cross-entropy is that once the classifier classifies x' as the target class t, the margin loss drops to zero, which automatically steers the objective toward minimizing the L2 distance between the input x and the adversarial instance x'.

This gives the standard formulation for designing an adversarial example using a single perturbation δ.
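Written out (a reconstruction consistent with the paper's L2 attack), the attack minimizes the perturbation size plus a weighted objective term:

\min_{\delta}\; \|\delta\|_2^2 + c \cdot f(x+\delta) \quad \text{such that} \quad x+\delta \in [0,1]^n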

Now, to ensure that the crafted image satisfies the valid constraints on the perturbation δ, the image must stay within the box constraints [0, 1]. There are different ways to handle this, such as using the box constraints directly and optimizing with L-BFGS. The authors investigated three strategies. The first was Projected Gradient Descent, i.e. taking a one-step gradient descent update and then clipping all coordinates back into the box domain. The second was Clipped Gradient Descent, which, rather than clipping at each iteration, incorporates the clipping function into the objective being minimized. The third was a new strategy they introduced, called the change of variables, which changes the domain of x by introducing a new variable w:
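Per the paper, the substitution maps an unconstrained w to a valid image:

x + \delta = \tfrac{1}{2}\big(\tanh(w) + 1\big), \quad \text{i.e.} \quad \delta = \tfrac{1}{2}\big(\tanh(w) + 1\big) - x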

Since the range of tanh(.) is [-1, 1], it follows that x' lies in [0, 1]. This strategy avoids the problem of getting stuck at corner cases / extreme regions of the box, so it can also be seen as a smooth form of clipped gradient descent.

Thus we can replace f(x') with f(w); the minimization then searches over w, which is unconstrained, so the standard optimization tools of DNNs, i.e. gradient descent via backpropagation, can be used directly to obtain the corresponding adversarial example x'.

The only remaining unknowns are the constants c and K. For c, a basic binary search can be used to find the optimal value that minimizes the objective function. For K, which is nothing but the confidence margin in the margin loss, the most commonly used value experimentally is zero, though it can change on a case-by-case basis.
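To make the pieces above concrete, here is a minimal PyTorch-style sketch of the targeted L2 attack: the tanh change of variables, the margin loss with confidence K, and the L2 term, optimized with Adam for a fixed constant c (the paper additionally binary-searches over c). The function and variable names are illustrative, not taken from the authors' code.

import torch

def cw_l2_attack(model, x, target, c=1.0, K=0.0, steps=1000, lr=0.01):
    """Illustrative sketch of the Carlini & Wagner targeted L2 attack.

    model  : classifier returning logits Z(x)
    x      : input image tensor in [0, 1], shape (B, C, H, W)
    target : target class indices, shape (B,)
    c      : trade-off constant (the paper tunes this by binary search)
    K      : confidence margin (kappa)
    """
    # Change of variables: x' = 0.5 * (tanh(w) + 1) is always in [0, 1].
    # Initialize w so that x' starts at the original image.
    w = torch.atanh(torch.clamp(x * 2 - 1, -0.999999, 0.999999)).detach()
    w.requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        x_adv = 0.5 * (torch.tanh(w) + 1)          # candidate adversarial image
        logits = model(x_adv)

        # Margin loss: push the target logit above every other logit by at least K.
        target_logit = logits.gather(1, target.view(-1, 1)).squeeze(1)
        other_max = logits.masked_fill(
            torch.nn.functional.one_hot(target, logits.size(1)).bool(), float("-inf")
        ).max(dim=1).values
        f_loss = torch.clamp(other_max - target_logit, min=-K)

        # L2 distance between the adversarial and original images.
        l2 = ((x_adv - x) ** 2).flatten(1).sum(dim=1)

        loss = (l2 + c * f_loss).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return (0.5 * (torch.tanh(w) + 1)).detach()

Since the optimization runs over the unconstrained variable w, no clipping is needed at any step; the returned image is guaranteed to lie in [0, 1].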

The experimental results of the proposed targeted L2 attack:

The L2 attack applied to the MNIST dataset, targeting every class for each sample input image. Image credit: Carlini and Wagner (2017)

The C&W attack evaded the gradient-based defense strategy Defensive Distillation with a 100% success rate.

Black-Box Attacks

In the previous section, we discussed how the backpropagation algorithm can be used to generate targeted adversarial examples. That is only possible when we have the model's internal architecture available to reproduce and attack. In a black-box threat model, however, we are only given the prediction scores/labels for the inputs we submit, with no access to the model's architecture or weights. The simplest examples are publicly available APIs such as the Google Cloud Vision API, which can be treated as an ideal black-box model.

Many black-box attacks have appeared since 2016, but here we will discuss one of the fundamental ones, which was also the first black-box attack: the substitute model of Papernot et al. (2017).

First, assume we have a target model, a multiclass DNN classifier (let us call it the oracle model O), which outputs only the classified label. Thus, we are given inputs and output labels (and as we know, a label tells us much less about the model than a probability vector does).

Substitute Model Training: train a model F approximating the oracle model O, choosing an architecture for F that imitates the decision boundary of O, so that adversarial examples crafted for model F also fool model O with a high success rate, thanks to the transferability property.

Basic steps for the substitute model training:

  1. Synthesize a substitute training dataset by building an initial “replica” training set from manually handcrafted data or a small random sample from the test set.
  2. Label the substitute training set with the labels returned by the oracle model, then train model F on these labels to obtain the parameters of the substitute model.
  3. Augment the data with Jacobian-based dataset augmentation: the adversary evaluates the sign of the Jacobian dimension corresponding to the label assigned to input x by the oracle, and adds the new point x + λ · sgn(J_F(x)[O(x)]) to the dataset, as in Papernot et al. (2017) (see the sketch after this list).

  4. Now, using the knowledge model F has gathered through Jacobian-based dataset augmentation and substitute training, apply any gradient-based white-box attack with a good transferability rate, such as FGSM, PGD, or JSMA.
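A minimal PyTorch-style sketch of steps 1–4 is shown below. Here query_oracle is a stand-in for the black-box label source (e.g. a remote API call), and jacobian_augment, rho, and lam are illustrative names; this follows the procedure described above rather than the authors' exact implementation.

import torch
import torch.nn.functional as F

def jacobian_augment(substitute, data, oracle_labels, lam=0.1):
    """Jacobian-based dataset augmentation:
    for each point x, add x + lam * sgn( dF(x)[oracle_label] / dx )."""
    data = data.clone().requires_grad_(True)
    logits = substitute(data)
    # Gradient of the substitute's logit for the oracle-assigned class w.r.t. x.
    selected = logits.gather(1, oracle_labels.view(-1, 1)).sum()
    grad = torch.autograd.grad(selected, data)[0]
    new_points = (data + lam * grad.sign()).detach().clamp(0, 1)
    # The dataset doubles on every augmentation step (hence the query concern below).
    return torch.cat([data.detach(), new_points], dim=0)

def train_substitute(substitute, query_oracle, init_data, rho=6, epochs=10, lam=0.1):
    """query_oracle(x) -> class labels; placeholder for the black-box API."""
    data = init_data
    opt = torch.optim.SGD(substitute.parameters(), lr=0.01, momentum=0.9)
    for _ in range(rho):                         # substitute training epochs (rho)
        labels = query_oracle(data)              # the only place the oracle is queried
        for _ in range(epochs):                  # ordinary supervised training of F
            opt.zero_grad()
            loss = F.cross_entropy(substitute(data), labels)
            loss.backward()
            opt.step()
        data = jacobian_augment(substitute, data, labels, lam=lam)
    return substitute

Once the substitute is trained, any white-box method (e.g. the C&W attack sketched earlier, or FGSM) can be run against it, and the resulting adversarial examples are then submitted to the oracle.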

Throughout this process, the main concern is the number of queries made to model O, since queries are limited in a practical black-box attack. The authors therefore try to minimize the number of queries using reservoir sampling, which reduces the otherwise exponential growth of the augmented dataset (and hence of the queries). The other concern is the step size λ: a larger step size decreases convergence stability, while a smaller one yields slow convergence, so a periodic step size can be used:
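As described in Papernot et al. (2017), the step size alternates in sign periodically (written here from the paper, with τ the period after which the sign of the step is flipped):

\lambda_{\rho} = \lambda \cdot (-1)^{\lfloor \rho / \tau \rfloor}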

The proposed black-box attack caused the Google oracle to misclassify 97.17% of the crafted adversarial examples and the Amazon oracle 96.78%, using ρ = 6 substitute training epochs.

Conclusion

In this article, we discussed what adversarial examples are and the significant history of adversarial attacks. We elaborated on one white-box and one black-box attack, which shows in some detail how adversarial examples are created and how they can be used to fool real-world classifiers.

References

[1] “Poisoning Attacks against Support Vector Machines”, Biggio et al. 2013. [https://arxiv.org/abs/1206.6389]

[2] “Intriguing properties of neural networks”, Szegedy et al. 2014. [https://arxiv.org/abs/1312.6199]

[3] “Explaining and Harnessing Adversarial Examples”, Goodfellow et al. 2014. [https://arxiv.org/abs/1412.6572]

[4] “Towards Evaluating the Robustness of Neural Networks”, Carlini and Wagner 2017b. [https://arxiv.org/abs/1608.04644]

[5] “Practical Black-Box Attacks against Machine Learning”, Papernot et al. 2017. [https://arxiv.org/abs/1602.02697]

[6] “Attacking Machine Learning with Adversarial Examples”, Goodfellow, 2017. [https://openai.com/blog/adversarial-example-research/]
