Tricking The World’s Most Accurate Deep Learning Models

An insight into the vulnerability of using deep learning

Editorial @ TRN
The Research Nest
7 min read · Jun 26, 2020


Ever trained an image recognition model?

What accuracy did you get? 90, 95, or maybe a near-perfect 99 percent?

No matter what your answer is, we have a follow-up question.

If you get great accuracy on training as well as test images, does that mean your model is ready to be deployed?

Well, it used to mean that, but it may not anymore.

Your model may work on every image the world has to offer, but it can still be fooled. Even a 100/100 score on the test run is not sufficient anymore.

But why?

Input tampering.

Have you ever thought about how easily your model can be fooled? And here we are not talking about making a 3D replica of a face and breaking the face recognition on your iPhone.

We mean something far simpler that can force your model to misclassify.

Before we go further, let us show you what we mean.

Let us take the best image recognition models, trained by experts on one of the largest image databases: ImageNet.

Here are our contenders:

  1. NASNetLarge (Top-5 accuracy: 0.960)
  2. InceptionResNetV2 (Top-5 accuracy: 0.953)
  3. ResNet152V2 (Top-5 accuracy: 0.942)

These are clearly very accurate models. All the figures above are top-5 validation accuracies on ImageNet, one of the largest datasets of its kind.

So, these models should be able to identify an ice-cream, right?

They can do way better than that; an ice-cream, really?

Well, that will do for now. Let’s see whether the models can recognize the following image as an ice-cream.

Here is how we can use these models, pre-trained on ImageNet, to predict the ice-cream:

  • Importing the libraries
  • Importing the pre-trained model from keras.applications
  • Using a pre-processing function to prepare the image for the model
  • Using the predict function to get the class
  • Printing the class along with its confidence score
A. NASNet
B. ResNet
C. Inception
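The snippets for each model follow the same pattern. As a rough, minimal sketch (not the article’s exact code), the pipeline with ResNet152V2 might look like this; the image filename ice_cream.jpg is a placeholder, and NASNetLarge and InceptionResNetV2 have their own preprocess_input and input sizes:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet_v2 import (
    ResNet152V2, preprocess_input, decode_predictions)

# Load the model with ImageNet weights
model = ResNet152V2(weights="imagenet")

# Load and preprocess the image (path is a placeholder)
img = tf.keras.preprocessing.image.load_img("ice_cream.jpg", target_size=(224, 224))
x = tf.keras.preprocessing.image.img_to_array(img)
x = preprocess_input(x[np.newaxis, ...])  # add a batch dimension and scale pixels

# Predict and print the top class with its confidence score
preds = model.predict(x)
_, name, score = decode_predictions(preds, top=1)[0][0]
print(f"{name}: {score:.3f}")
```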

Well, all of them predict it perfectly, with confidence scores above 90 percent. That is awesome.

We have one question for you: do you trust these models?

Obviously! They scored above 90 percent accuracy.

Well, let us run these models on another image. Below is the new image used.

Well, it's the same image (or so you may think)!

Still, let’s predict as we did before.

These are the resultant scores on the new image.

ResNet
Inception
NASNet

Hey! What happened? Two out of the three models predicted it as an “ear”, and with low confidence at that.

This is what happens when we perform what is called an adversarial attack.

Even though you cannot see a difference between the new image and the previous one, the two are quite different. We added something known as perturbations to the image.

Perturbations are specifically calculated pixel-level changes that distort the original image. The resulting image is known as an adversarial image.

Now, do note that this adversarial attack is a white-box attack. That means we have full access to the network architecture as well as its weights, so we can compute the gradients associated with any input.

We create perturbations by taking the gradient of the loss with respect to the input image for the class we want to attack. In this case, that class is ice-cream. So, we take the gradient associated with the ice-cream prediction and use the following equation to increase the error:

adv_x = x + ε · sign(∇ₓ J(θ, x, y))

  • Here x is the original image and y is its true label (ice-cream).
  • Epsilon (ε) is the multiplier that we use to control the strength of the perturbation.
  • Theta (θ) represents the model parameters (weights), and J represents the loss.

This attack is known as FGSM: the Fast Gradient Sign Method.

We do not use the raw gradient values to create the perturbation; we keep only the sign of each gradient element, and that signed gradient is what increases the error.

The process runs fast because it needs only a single gradient computation for the chosen class label: we take the sign of that gradient, scale it, and add it to the original image.

Hence the name Fast Gradient Sign Method.

Now let’s look at the code for creating these adversarial images.

This is the function that we use to make our adversarial images.
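As a minimal sketch in the same spirit, closely following the standard TensorFlow FGSM recipe (the function name and arguments here are our own placeholders), it could look like this, where input_label is a one-hot vector with a 1 at the ice-cream index:

```python
import tensorflow as tf

loss_object = tf.keras.losses.CategoricalCrossentropy()

def create_adversarial_pattern(model, input_image, input_label):
    """Return the sign of the gradient of the loss w.r.t. the input image."""
    input_image = tf.convert_to_tensor(input_image)
    with tf.GradientTape() as tape:
        tape.watch(input_image)        # track the input so we can differentiate w.r.t. it
        prediction = model(input_image)
        loss = loss_object(input_label, prediction)

    # Gradient of the loss with respect to the input image
    gradient = tape.gradient(loss, input_image)
    # Keep only the sign of each gradient value (-1, 0, or +1)
    return tf.sign(gradient)
```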

There are two functions that we need to understand. First, tape.gradient. tf.GradientTape records the operations performed on the input, and tape.gradient then computes the gradient of the loss with respect to it. You may read more about it here.

The second is the tf.sign function. It simply returns the sign of each value: -1 if negative, 1 if positive, and 0 if zero.
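A quick illustration of that behavior:

```python
import tensorflow as tf

# Element-wise sign: negatives map to -1, positives to 1, zeros to 0
print(tf.sign([-2.5, 0.0, 3.0]).numpy())  # [-1.  0.  1.]
```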

Note that we specify the index of the class label. This is important so that we compute the loss (and hence the gradient) for that particular class.

So, that is how we get the loss values responsible for the image class (ice-cream) prediction.

Let’s see what the perturbations look like:

We will generate adversarial images with two different values of epsilon and predict their classes using the same pre-trained models as before.

We simply multiply the perturbation by each epsilon value and add the result to the original image.
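A minimal sketch of that step, assuming x is the preprocessed ice-cream batch from earlier, perturbation is the signed gradient returned by create_adversarial_pattern, and the preprocessing scales pixels to the range [-1, 1] (as resnet_v2’s preprocess_input does):

```python
import tensorflow as tf

epsilons = [0.0, 0.01, 0.1]  # no perturbation, mild, and clearly visible

adversarial_images = []
for eps in epsilons:
    adv_x = x + eps * perturbation              # add the scaled perturbation
    adv_x = tf.clip_by_value(adv_x, -1.0, 1.0)  # stay in the preprocessed pixel range
    adversarial_images.append(adv_x)
```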

Now we finally put these images up for predictions.

Note: a higher value of epsilon introduces more error and hence fools the model more, but it also distorts the original image more, so the change becomes visible to the naked eye (as with epsilon = 0.1).

Here they are, in the format (no perturbation, epsilon = 0.01, epsilon = 0.1):

1. ResNet results

2. Inception results

A screw? Seriously!

3. NASNet results

Well, they were fooled nicely :)

This is just a simple white-box attack, but there are many more advanced ones. This paper contains a comprehensive survey of such attacks and some suggested defenses, which is worth reading.

When we talk about such attacks, we should think of the bigger picture. What if we deploy a solution with no defenses against them? Anyone with enough knowledge of these attacks can fool even “state-of-the-art” models. We just saw that.

Imagine a military scenario in which computer vision guides a missile to strike a terrorist target. The strike would take place in an isolated territory, so the model could be trained to identify a person, and the missile would detonate when a person is detected. But if the terrorists carried adversarial patches, they could easily fool the missile. A simple printed patch can cause blunders. Attacks like these can undermine robotics and computer vision applications.

These scenarios are scary, but just as adversarial patches are being researched, so are the defenses against them. It is better to identify such weaknesses now and find solutions.

Next time you develop a computer vision solution, do think once: Is this it? Can I trust it?

We hope someday we will get the answer. Maybe yes. Maybe not.

Editorial note-

This article was conceptualized by Aditya Vivek Thota and written by Dishant Parikh of The Research Nest.

Stay tuned for more such insightful content with a prime focus on artificial intelligence!
