Generating Adversarial Samples in Keras (Tutorial)

Eric Muccino
Aug 14, 2019


As deep learning powers a growing number of services, the associated security risks become more critical to address. Adversarial Machine Learning is a branch of machine learning that exploits the mathematics underlying deep learning systems in order to evade, explore, and/or poison machine learning models [1,2]. Evasion attacks are the most common adversarial attack method because they are easy to implement and can be highly disruptive. During an evasion attack, the adversary tries to evade a fully trained model by engineering samples that the model will misclassify. This attack does not assume any influence over the training data.

Evasion attacks have been demonstrated in the context of autonomous vehicles, where the adversary manipulates traffic signs to confuse the learning model [3]. Research suggests that deep neural networks are susceptible to adversarial evasion attacks due to their high degree of non-linearity as well as insufficient model averaging and regularization [4].

Adversarial attacks can be classified as either black-box or white-box attacks. During a white-box attack, an attacker has full access to the model, including the architecture, weights, and training algorithm. During a black-box attack, the attacker only has the ability to use the model as an oracle, observing model outputs by querying the model with inputs. If given the opportunity, white-box attacks are easier to perform since the attacker has complete information about the model. For this reason, defense techniques against white-box attacks also work to defend against black-box attacks.

During a white-box evasion attack, an adversary selects an input instance that they want the model to misclassify. The attacker then chooses an optimization algorithm and a loss function to tune the input features of the sample toward the desired classification goal. An attacker may seek a non-targeted misclassification by using a loss function that minimizes the output of the true class. Alternatively, an attacker may seek a targeted misclassification by using a loss function that maximizes the output of a selected target class. In both cases, regularization is included in the loss function to minimize the total change to the sample's input features [4,5].

In the context of computer vision, this process results in the creation of images that are visibly passable as legitimate samples but are capable of fooling the model through careful exploitation of narrow gaps in decision boundaries. In this post, we will see how easy it is to generate adversarial images to evade a trained classifier using a white-box approach with the Keras machine learning library. We will be using the MNIST handwritten digit data set for this tutorial.

Tutorial Using MNIST in Keras

This tutorial uses Keras version 2.2.4 with the TensorFlow 1.13 back end.

The first step is to train an image classifier. We will use a simple Convolutional Neural Network. Note that the imports will be used for the remainder of the code presented in this post.
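A minimal sketch of such a classifier might look like the following. The layer sizes, optimizer, batch size, epoch count, and the helper names `load_mnist`, `build_classifier`, and `train_classifier` are assumptions chosen for illustration, not the exact network behind the accuracy figures quoted in this post. The sketch is also written against `tensorflow.keras` rather than standalone Keras 2.2.4.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def load_mnist():
    """Load MNIST, scale pixels to [0, 1], and one-hot encode the labels."""
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    x_train = x_train.astype("float32")[..., np.newaxis] / 255.0
    x_test = x_test.astype("float32")[..., np.newaxis] / 255.0
    y_train = keras.utils.to_categorical(y_train, 10)
    y_test = keras.utils.to_categorical(y_test, 10)
    return (x_train, y_train), (x_test, y_test)


def build_classifier(input_shape=(28, 28, 1), num_classes=10):
    """Small CNN; the architecture is illustrative (an assumption)."""
    model = keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


def train_classifier(epochs=12, batch_size=128):
    """Train the CNN on MNIST and return it (downloads the data set)."""
    (x_train, y_train), (x_test, y_test) = load_mnist()
    model = build_classifier()
    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
              validation_data=(x_test, y_test))
    return model
```

Calling `classifier = train_classifier()` would then produce the trained model that the rest of the tutorial attacks. Exact accuracy will vary with the architecture and the number of epochs.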

Our trained model gives 99.99% accuracy on the training set and 99.39% accuracy on the test set. Now that we have a high-performing classifier, let’s try to break it.

First, we need to select an image to be the basis of our adversarial example. I’ve selected the first image in the training set, which happens to be a 5. We’ll evaluate this image with our classifier to double-check that it is classified correctly. Next, let’s apply some heavy random noise to the image and evaluate the noisy image with the classification model as well.
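The noise step can be sketched with NumPy alone. The helper names `add_gaussian_noise` and `describe_prediction` are hypothetical, and clipping the noisy image back to the [0, 1] pixel range is an assumption, since the text only specifies a standard deviation of 0.3.

```python
import numpy as np


def add_gaussian_noise(image, stddev=0.3, seed=None):
    """Return a copy of `image` with zero-mean Gaussian noise added.

    The result is clipped to [0, 1] so it stays a valid image (assumption).
    """
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(loc=0.0, scale=stddev, size=image.shape)
    return np.clip(noisy, 0.0, 1.0).astype(image.dtype)


def describe_prediction(model, image):
    """Return (predicted digit, probability) for a single image.

    `model` is any object with a Keras-style predict(batch) method.
    """
    probs = np.asarray(model.predict(image[np.newaxis, ...]))[0]
    digit = int(np.argmax(probs))
    return digit, float(probs[digit])
```

With a trained `classifier` and `image = x_train[0]`, comparing `describe_prediction(classifier, image)` with `describe_prediction(classifier, add_gaussian_noise(image))` gives the kind of before-and-after check reported below.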

Original Image

The original image is predicted by our model to be a 5 with 100% probability.

Original Image + 0.3 stdv Noise

With noise, our classification model still predicts the image to be a 5 with 98.97% probability. This suggests that our model is resilient to small random perturbations of the image. However, we’re about to see that carefully chosen perturbations are a different matter.

In order to generate an adversarial example, we want to find a collection of small perturbations to our image that produces a desired output from our classifier. To do this, we create a new model that is identical to our trained classifier, but with the addition of preprocessing layers, as seen in the network diagram. The adversarial_noise layer is a Dense layer that is fully connected to a placeholder input containing a single constant value of 1. Use of bias is turned off for this layer. This means that the output of each neuron in the adversarial_noise layer is equal to the weight that connects it to the unity input. The adversarial_noise layer has 784 outputs, one for each pixel in the input image. As the network trains, each weight becomes the adversarial noise applied to one pixel of the image. A regularization (e.g., l1, l2) is applied to the kernel of this layer to keep the pixel perturbations small and/or few in number.

Next, the outputs are reshaped to match the image shape, then added to the input image. A custom activation function is used to clip the values to be between 0.0 and 1.0. This is to prevent the model from applying noise that produces pixel values outside the scale of our input image. Finally, our augmented image is fed through our classifier to obtain an output. The classifier weights are frozen so that only the adversarial_noise layer weights are trained.
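The architecture described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the original listing: it uses `tensorflow.keras`, the names `clip_to_unit_range` and `build_adversarial_model` are hypothetical, and the zero initializer (so that training starts from the unmodified image) is an assumption.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers


def clip_to_unit_range(x):
    """Custom activation: keep pixel values inside [0, 1]."""
    return tf.clip_by_value(x, 0.0, 1.0)


def build_adversarial_model(classifier, regularizer, image_shape=(28, 28, 1)):
    """Wrap a trained classifier with a trainable per-pixel noise layer."""
    # Freeze the classifier so only the noise weights are trained.
    classifier.trainable = False

    image = layers.Input(shape=image_shape, name="image")
    unity = layers.Input(shape=(1,), name="unity")  # always fed the constant 1

    # One weight per pixel; with no bias, each output equals its weight.
    n_pixels = int(np.prod(image_shape))
    noise = layers.Dense(n_pixels,
                         use_bias=False,
                         kernel_initializer="zeros",  # assumption
                         kernel_regularizer=regularizer,
                         name="adversarial_noise")(unity)
    noise = layers.Reshape(image_shape)(noise)

    # Add the noise to the image and clip back to the valid pixel range.
    perturbed = layers.Add()([image, noise])
    perturbed = layers.Activation(clip_to_unit_range, name="clip")(perturbed)

    outputs = classifier(perturbed)
    return keras.Model(inputs=[image, unity], outputs=outputs)
```

For example, `build_adversarial_model(classifier, regularizers.l2(0.01))` yields a two-input model whose only trainable weights are the 784 entries of the adversarial_noise kernel.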

Adversarial Sample Generator

To achieve a non-targeted misclassification, we use a custom loss function that is simply the negative of categorical_crossentropy. We then train the model using our original image as input and the true class as the target output. Minimizing this loss maximizes the cross-entropy with respect to the true class, pushing the prediction away from the correct label and producing a non-targeted misclassification.

For a targeted misclassification, we use the categorical_crossentropy loss function and supply the model with our desired target as output during training.
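Both training setups can be sketched in one helper. It assumes an `adv_model` built as described above (an image input, a unity input, and a layer named adversarial_noise in front of a frozen classifier). The function name `fit_adversarial_noise`, the Adam optimizer, and the epoch count are assumptions made for illustration.

```python
import numpy as np
from tensorflow import keras


def negative_categorical_crossentropy(y_true, y_pred):
    """Non-targeted loss: minimizing it maximizes cross-entropy
    with respect to the true class."""
    return -keras.losses.categorical_crossentropy(y_true, y_pred)


def fit_adversarial_noise(adv_model, image, label, targeted, epochs=500):
    """Train only the noise weights and return the perturbed image.

    label: one-hot vector of the true class (non-targeted attack)
           or of the desired target class (targeted attack).
    """
    loss = ("categorical_crossentropy" if targeted
            else negative_categorical_crossentropy)
    adv_model.compile(optimizer="adam", loss=loss)

    unity = np.ones((1, 1), dtype="float32")  # constant input of 1
    adv_model.fit([image[np.newaxis, ...], unity],
                  label[np.newaxis, ...],
                  epochs=epochs, verbose=0)

    # The learned kernel of the noise layer is the perturbation itself.
    noise = adv_model.get_layer("adversarial_noise").get_weights()[0]
    return np.clip(image + noise.reshape(image.shape), 0.0, 1.0)
```

For a targeted attack toward 9, `label` would be the one-hot vector for class 9 and `targeted=True`. For the non-targeted case, `label` is the one-hot vector for the true class (5 here) and `targeted=False`.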


Here are the results for both non-targeted and targeted misclassifications using l1, l2, and l1_l2 regularizations. All regularizations were applied with coefficients of 0.01. The targeted misclassifications were trained to predict 9. Interestingly, all three non-targeted misclassification examples gave final predictions of 4.

From left to right:
Non-targeted Misclassification (L1): 100% 4
Non-targeted Misclassification (L2): 99.99% 4
Non-targeted Misclassification (L1_L2): 100% 4

From left to right:
Targeted Misclassification (L1): 99.81% 9
Targeted Misclassification (L2): 98.33% 9
Targeted Misclassification (L1_L2): 98.83% 9
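To see why the choice of regularizer shapes the noise, it can help to compute the penalties by hand. The NumPy sketch below mirrors the quantities that Keras' l1, l2, and l1_l2 kernel regularizers add to the loss (coefficient times the sum of absolute values, and coefficient times the sum of squares). The two example noise patterns and the function names are hypothetical, chosen only for illustration.

```python
import numpy as np

COEFFICIENT = 0.01  # the coefficient reported in the text


def l1_penalty(noise, coeff=COEFFICIENT):
    """coeff * sum(|noise|): charges for the total amount of change."""
    return coeff * np.sum(np.abs(noise))


def l2_penalty(noise, coeff=COEFFICIENT):
    """coeff * sum(noise**2): charges large per-pixel changes more heavily."""
    return coeff * np.sum(np.square(noise))


def l1_l2_penalty(noise, coeff=COEFFICIENT):
    """Sum of the two penalties, using the same coefficient for both."""
    return l1_penalty(noise, coeff) + l2_penalty(noise, coeff)


# Two hypothetical noise patterns with the same total absolute change:
sparse = np.zeros(784)
sparse[:4] = 0.5                        # four pixels changed by 0.5 each
diffuse = np.full(784, 0.5 * 4 / 784)   # the same total spread over all pixels

# Both patterns incur the same L1 penalty, but the L2 penalty is far
# smaller for the diffuse pattern than for the sparse one.
for name, noise in [("sparse", sparse), ("diffuse", diffuse)]:
    print(name, l1_penalty(noise), l2_penalty(noise), l1_l2_penalty(noise))
```

The L1 term does not distinguish between the two patterns, while the L2 term strongly prefers spreading the change thinly, which is one reason the three regularizers can produce visibly different noise.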


As demonstrated in this tutorial, generating adversarial samples for a classifier is extremely easy to do in Keras. The generated samples have less visible noise than the random noise sample, yet they are completely misclassified by the model. The highly non-linear nature of the Convolutional Neural Network classifier allows small fractures in the decision boundary space to be exploited. Deeper networks (e.g., InceptionV3) are susceptible to adversarial samples that are visually indistinguishable from the original image.

White-box evasion attacks, as demonstrated in this post, are just one of a wide variety of adversarial attacks. Recent research has established various defenses against adversarial attacks, but each defense technique has considerable drawbacks and/or limitations. Some of the most prominent defense strategies include adversarial training, gradient hiding, and feature squeezing. However, none of these techniques completely eliminates the threat of adversarial attacks [4,6,7,8].

Mindboard is deeply interested in the field of Adversarial Machine Learning. Our Data Science team is currently engaged in research with the goal of revolutionizing the space of Adversarial ML Defense. Follow Mindboard to stay up to date with future research.


[1] Battista Biggio, Giorgio Fumera, and Fabio Roli. 2014. Security Evaluation of Pattern Classifiers under Attack. IEEE Trans. Knowl. Data Eng. 26, 4 (2014), 984–996.

[2] Battista Biggio, Giorgio Fumera, and Fabio Roli. 2014. Security Evaluation of Pattern Classifiers under Attack. IEEE Trans. Knowl. Data Eng. 26, 4 (2014), 984–996.

[3] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. 2016. Adversarial Machine Learning at Scale. CoRR abs/1611.01236 (2016). arXiv:1611.01236

[4] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. CoRR abs/1312.6199 (2013).

[5] Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. 2016. Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks. In IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22–26, 2016. 582–597.

[6] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and Harnessing Adversarial Examples. CoRR abs/1412.6572 (2014).

[7] Chunchuan Lyu, Kaizhu Huang, and Hai-Ning Liang. 2015. A Unified Gradient Regularization Family for Adversarial Examples. In 2015 IEEE International Conference on Data Mining, ICDM 2015, Atlantic City, NJ, USA, November 14–17, 2015. 301–309.

[8] Uri Shaham, Yutaro Yamada, and Sahand Negahban. 2015. Understanding Adversarial Training: Increasing Local Stability of Neural Nets through Robust Optimization. CoRR abs/1511.05432 (2015).


The Mindboard Data Science Team explores cutting-edge technologies in innovative ways to provide original solutions, including the Masala.AI product line. Masala provides media content rating services such as vRate, a browser extension that detects and blocks mature content with custom sensitivity settings. The vRate browser extension is available for download via the Chrome Web Store. Check out for more info.

