FGSM Attacks on MNIST Fashion Dataset

Anastasia Sizensky
Published in BerkeleyISchool · Jun 6, 2023

In this article we will look at how the FGSM adversarial attack can corrupt classification results and reduce the accuracy of machine learning models, using the Fashion MNIST dataset as our example.

We will be using TensorFlow for all code examples.

Underlying dataset

The Fashion MNIST dataset consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from one of 10 classes. Fashion MNIST is a drop-in alternative to the original MNIST digits dataset, with the 10 classes ranging from t-shirts to ankle boots.

Fashion MNIST dataset

Let’s load the dataset:

# Imports
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist

# Load the Fashion MNIST dataset
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

FGSM attacks

The Fast Gradient Sign Method (FGSM) is a specific technique used in adversarial attacks against machine learning models.

The FGSM attack aims to reduce the accuracy of a machine learning model by adding carefully crafted perturbations to the input data. The attack leverages the gradients of the model’s loss function with respect to the input data to generate adversarial examples. It is effective because it exploits the largely linear behavior of models in high-dimensional input space. By taking only the sign of the gradients, the attack moves every input feature a fixed step in the direction that increases the loss. The perturbation size ε determines the strength of the attack, striking a balance between keeping the adversarial example perceptually similar to the original and causing a significant change in the model’s predictions.

FGSM attacks follow 6 steps:

  1. Select a target model: Choose a machine learning model to attack. This model could be a classifier used for image recognition, natural language processing, or any other model.
  2. Choose an input example: Select an input example (e.g. an image) for which you want to generate an adversarial example. The original input is denoted as x.
  3. Calculate the gradient: Calculate the gradients of the model’s loss function with respect to the input example x. These gradients indicate the direction of the steepest ascent in the loss landscape.
  4. Determine the perturbation: Determine the perturbation to be added to the original example x to generate an adversarial example. This perturbation is calculated by taking the sign of the gradients and scaling it by a small value ε. Mathematically, the perturbation δ is given by δ = ε * sign(∇ₓ loss(x, y)), where y is the true label.
  5. Generate the adversarial example: Add the perturbation δ to the original input example x to obtain the adversarial example x_adv. This is done element-wise, meaning each pixel or feature of the input is modified accordingly.
  6. Adversarial example evaluation: You can now use the adversarial example x_adv to test the model’s behavior. You can check if the model misclassifies the adversarial example or exhibits any other unintended behavior.
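
Putting steps 3 to 5 together, the whole attack can be written in one line, where the clipping step keeps pixel values in the valid [0, 1] range (matching the code later in this article):

x_adv = clip(x + ε * sign(∇ₓ loss(x, y)), 0, 1)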

Experiment

Let’s first normalize our dataset, and get a first look at what we’re working with:

# Normalize the pixel values
train_images = train_images.astype('float32') / 255.0
test_images = test_images.astype('float32') / 255.0

# Fashion MNIST class names
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

print("Train data shape:", train_images.shape, train_labels.shape)
print("Test data shape:", test_images.shape, test_labels.shape)
unique, counts = np.unique(train_labels, return_counts=True)
print("Image sample distribution:", unique, counts)

# Display sample test images with their labels
print("\nSample test images and labels")
plt.figure(figsize=(10, 8))
for i in range(20):
    plt.subplot(4, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(test_images[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[test_labels[i]])
plt.show()
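The rest of the code assumes a trained classifier named model_1, which isn’t shown in this write-up. For completeness, here is a minimal sketch of one model that works with the snippets below; the architecture and training settings are illustrative assumptions rather than the original project’s setup:

# A simple baseline Fashion MNIST classifier (architecture chosen for illustration)
model_1 = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

model_1.compile(optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])

model_1.fit(train_images, train_labels, epochs=5, validation_split=0.1)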

And now let’s write an FGSM function:

def fgsm_attack(model, data, epsilon):
    # Retrieve the inputs and labels from the data
    inputs, labels = data

    # Convert the inputs and labels to TensorFlow tensors
    inputs_tf = tf.convert_to_tensor(inputs, dtype=tf.float32)
    labels_tf = tf.convert_to_tensor(labels)

    # Calculate the gradients of the loss with respect to the inputs
    with tf.GradientTape() as tape:
        tape.watch(inputs_tf)
        predictions = model(inputs_tf, training=False)
        loss = tf.keras.losses.sparse_categorical_crossentropy(labels_tf, predictions)

    gradient = tape.gradient(loss, inputs_tf)

    # Compute the sign of the gradients
    gradient_signs = tf.sign(gradient)

    # Generate the perturbed inputs by adding epsilon times the sign of the gradients
    perturbed_inputs = inputs_tf + epsilon * gradient_signs

    # Clip the perturbed inputs so they stay within the valid range (0 to 1)
    perturbed_inputs = tf.clip_by_value(perturbed_inputs, 0, 1)

    # Return the perturbed inputs as a NumPy array
    # (TensorFlow 2 runs eagerly, so no session is needed)
    return perturbed_inputs.numpy()

And now the fun part…

The attack!

epsilon = 0.1  # Adjust the value as needed
perturbed_inputs = fgsm_attack(model_1, (test_images, test_labels), epsilon)
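
To quantify the overall damage (beyond the single-image view below), one option is to compare the model’s accuracy on the clean and perturbed test sets; this assumes model_1 was compiled with an accuracy metric:

# Compare accuracy on clean vs. perturbed test images
_, clean_acc = model_1.evaluate(test_images, test_labels, verbose=0)
_, adv_acc = model_1.evaluate(perturbed_inputs, test_labels, verbose=0)
print("Clean test accuracy:     {:.3f}".format(clean_acc))
print("Perturbed test accuracy: {:.3f}".format(adv_acc))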

Let’s view our results:

# Select a random example from the test set
example_index = np.random.randint(len(test_images))
original_image = test_images[example_index]
original_label = test_labels[example_index]

# Perform the FGSM attack on this single example
perturbed_image = fgsm_attack(model_1,
                              (np.expand_dims(original_image, axis=0),
                               np.expand_dims(original_label, axis=0)),
                              epsilon)
perturbed_image = perturbed_image.squeeze()

# Get the model prediction on the perturbed image
predicted_label = model_1.predict(perturbed_image.reshape(1, 28, 28))
predicted_label = np.argmax(predicted_label)

# Plot the original and perturbed images side by side
plt.figure(figsize=(9, 4))

plt.subplot(1, 3, 1)
plt.imshow(original_image, cmap='gray')
plt.title('Original Image\nActual Label:\n{}'.format(class_names[original_label]))
plt.axis('off')

plt.subplot(1, 3, 2)
plt.imshow(perturbed_image, cmap='gray')
plt.title('FGSM Image\nEpsilon: {}'.format(epsilon))
plt.axis('off')

plt.subplot(1, 3, 3)
plt.imshow(perturbed_image, cmap='gray')
plt.title('FGSM Data\nModel Prediction:\n{}'.format(class_names[predicted_label]))
plt.axis('off')

plt.show()

And here are our results:

Sneaker after FGSM attack

Conclusion

You may be wondering, is it possible to prevent an FGSM attack? There are a few methods for mitigating or reducing the impact of an FGSM attack:

1. Adversarial Training

Retrain the model on adversarial examples generated by FGSM during the training process (a minimal sketch follows this list).

2. Defensive Distillation

Train the model with softened probabilities (temperature scaling) to make it more robust to adversarial examples.

3. Robust Feature Extraction

Implement feature squeezing, e.g. reducing the color bit depth of inputs or applying spatial smoothing.

4. Gradient Masking

  • Hide the gradients of the model during the training process
  • Add noise to the gradients during backpropagation

5. Input Preprocessing

  • Input normalization
  • Data augmentation
  • Image resizing
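
As a taste of the first idea, here is a minimal adversarial-training sketch that reuses the fgsm_attack function and model_1 from above; the subset size, epoch count, and batch size are arbitrary illustrative choices, not tuned values:

# Generate FGSM adversarial examples for a subset of the training data
adv_images = fgsm_attack(model_1, (train_images[:10000], train_labels[:10000]), epsilon)

# Mix clean and adversarial examples, then continue training the same model
mixed_images = np.concatenate([train_images[:10000], adv_images])
mixed_labels = np.concatenate([train_labels[:10000], train_labels[:10000]])
model_1.fit(mixed_images, mixed_labels, epochs=3, batch_size=64)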

I will go into these mitigation methods in further detail in a future article.

Happy reading!

Anastasia is pursuing a Master of Information and Cybersecurity at UC Berkeley’s School of Information. This article was inspired by her final project for the Applied Machine Learning for Cybersecurity course. Anastasia worked on her final project with Jackson Gor and Jenn Yonemitsu, who are also pursuing their Master of Information and Cybersecurity at UC Berkeley’s School of Information.
