FGSM Attacks on MNIST Fashion Dataset

Anastasia Sizensky
Published in BerkeleyISchool · Jun 6, 2023

In this article we will look at how the FGSM adversarial attack can corrupt classification results and reduce the accuracy of machine learning models, using the Fashion MNIST dataset as our example.

We will be using TensorFlow for all code examples.

Underlying dataset

The Fashion MNIST dataset consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from one of 10 classes. Fashion MNIST is a drop-in alternative to the original MNIST digits dataset, with the 10 classes ranging from t-shirts to ankle boots.

Fashion MNIST dataset

Let’s load the dataset:

# Imports
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist

# Load the Fashion MNIST dataset
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

FGSM attacks

The Fast Gradient Sign Method (FGSM) is a specific technique used in adversarial attacks against machine learning models.

The FGSM attack aims to reduce the accuracy of a machine learning model by adding carefully crafted perturbations to the input data. The attack leverages the gradients of the model’s loss function with respect to the input data to generate adversarial examples. It is effective because it exploits the largely linear behavior of models in high-dimensional input space. By taking only the sign of the gradients, the attack moves every input feature a fixed step in the direction that increases the loss. The perturbation size ε determines the strength of the attack, striking a balance between keeping the adversarial example perceptually similar to the original and causing a significant change in the model’s predictions.

FGSM attacks follow 6 steps:

  1. Select a target model: Choose a machine learning model to attack. This model could be a classifier used for image recognition, natural language processing, or any other model.
  2. Choose an input example: Select an input example (e.g. an image) for which you want to generate an adversarial example. The original input is denoted as x.
  3. Calculate the gradient: Calculate the gradients of the model’s loss function with respect to the input example x. These gradients indicate the direction of the steepest ascent in the loss landscape.
  4. Determine the perturbation: Determine the perturbation to be added to the original example x to generate an adversarial example. This perturbation is calculated by taking the sign of the gradients and scaling it by a small value ε. Mathematically, the perturbation δ is given by δ = ε * sign(∇ₓ loss(x, y)), where y is the true label.
  5. Generate the adversarial example: Add the perturbation δ to the original input example x to obtain the adversarial example x_adv. This is done element-wise, meaning each pixel or feature of the input is modified accordingly.
  6. Adversarial example evaluation: You can now use the adversarial example x_adv to test the model’s behavior. You can check if the model misclassifies the adversarial example or exhibits any other unintended behavior.
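
Putting steps 3 to 5 together, the whole attack can be written in one line, where the clipping step keeps pixel values in the valid [0, 1] range (matching the code later in this article):

x_adv = clip(x + ε * sign(∇ₓ loss(x, y)), 0, 1)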

Experiment

Let’s first normalize our dataset, and get a first look at what we’re working with:

# Normalize the pixel values
train_images = train_images.astype('float32') / 255.0
test_images = test_images.astype('float32') / 255.0

# Fashion MNIST class names
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

print("Train data shape:", train_images.shape, train_labels.shape)
print("Test data shape:", test_images.shape, test_labels.shape)
unique, counts = np.unique(train_labels, return_counts=True)
print("Image sample distribution:", unique, counts)

# Display sample test images with their labels
print("\nSample test images and labels")
plt.figure(figsize=(10, 8))
for i in range(20):
    plt.subplot(4, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(test_images[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[test_labels[i]])
plt.show()
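The rest of the code assumes a trained classifier named model_1, which isn’t shown in this write-up. For completeness, here is a minimal sketch of one model that works with the snippets below; the architecture and training settings are illustrative assumptions rather than the original project’s setup:

# A simple baseline Fashion MNIST classifier (architecture chosen for illustration)
model_1 = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

model_1.compile(optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])

model_1.fit(train_images, train_labels, epochs=5, validation_split=0.1)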

And now let’s write an FGSM function:

def fgsm_attack(model, data, epsilon):
    # Retrieve the inputs and labels from the data
    inputs, labels = data

    # Convert the inputs and labels to TensorFlow tensors
    inputs_tf = tf.convert_to_tensor(inputs, dtype=tf.float32)
    labels_tf = tf.convert_to_tensor(labels)

    # Calculate the gradients of the loss with respect to the inputs
    with tf.GradientTape() as tape:
        tape.watch(inputs_tf)
        predictions = model(inputs_tf, training=False)
        loss = tf.keras.losses.sparse_categorical_crossentropy(labels_tf, predictions)

    gradient = tape.gradient(loss, inputs_tf)

    # Compute the sign of the gradients
    gradient_signs = tf.sign(gradient)

    # Generate the perturbed inputs by adding epsilon times the sign of the gradients
    perturbed_inputs = inputs_tf + epsilon * gradient_signs

    # Clip the perturbed inputs so they stay within the valid range (0 to 1)
    perturbed_inputs = tf.clip_by_value(perturbed_inputs, 0, 1)

    # Return the perturbed inputs as a NumPy array
    # (TensorFlow 2 runs eagerly, so no session is needed)
    return perturbed_inputs.numpy()

And now the fun part…

The attack!

epsilon = 0.1  # Adjust the value as needed
perturbed_inputs = fgsm_attack(model_1, (test_images, test_labels), epsilon)
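
To quantify the overall damage (beyond the single-image view below), one option is to compare the model’s accuracy on the clean and perturbed test sets; this assumes model_1 was compiled with an accuracy metric:

# Compare accuracy on clean vs. perturbed test images
_, clean_acc = model_1.evaluate(test_images, test_labels, verbose=0)
_, adv_acc = model_1.evaluate(perturbed_inputs, test_labels, verbose=0)
print("Clean test accuracy:     {:.3f}".format(clean_acc))
print("Perturbed test accuracy: {:.3f}".format(adv_acc))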

Let’s view our results:

# Select a random example from the test set
example_index = np.random.randint(len(test_images))
original_image = test_images[example_index]
original_label = test_labels[example_index]

# Perform the FGSM attack on this single example
perturbed_image = fgsm_attack(model_1,
                              (np.expand_dims(original_image, axis=0),
                               np.expand_dims(original_label, axis=0)),
                              epsilon)
perturbed_image = perturbed_image.squeeze()

# Get the model prediction on the perturbed image
predicted_label = model_1.predict(perturbed_image.reshape(1, 28, 28))
predicted_label = np.argmax(predicted_label)

# Plot the original and perturbed images side by side
plt.figure(figsize=(9, 4))

plt.subplot(1, 3, 1)
plt.imshow(original_image, cmap='gray')
plt.title('Original Image\nActual Label:\n{}'.format(class_names[original_label]))
plt.axis('off')

plt.subplot(1, 3, 2)
plt.imshow(perturbed_image, cmap='gray')
plt.title('FGSM Image\nEpsilon: {}'.format(epsilon))
plt.axis('off')

plt.subplot(1, 3, 3)
plt.imshow(perturbed_image, cmap='gray')
plt.title('FGSM Data\nModel Prediction:\n{}'.format(class_names[predicted_label]))
plt.axis('off')

plt.show()

And here are our results:

Sneaker after FGSM attack

Conclusion

You may be wondering, is it possible to prevent an FGSM attack? There are a few methods for mitigating or reducing the impact of an FGSM attack:

1. Adversarial Training

Retrain the model on adversarial examples generated by FGSM during the training process (a minimal sketch follows this list).

2. Defensive Distillation

Train the model with softened probabilities (temperature scaling) to make it more robust to adversarial examples.

3. Robust Feature Extraction

Implement feature squeezing, e.g. reducing the color bit depth of inputs or applying spatial smoothing.

4. Gradient Masking

  • Hide the gradients of the model during the training process
  • Add noise to the gradients during backpropagation

5. Input Preprocessing

  • Input normalization
  • Data augmentation
  • Image resizing
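
As a taste of the first idea, here is a minimal adversarial-training sketch that reuses the fgsm_attack function and model_1 from above; the subset size, epoch count, and batch size are arbitrary illustrative choices, not tuned values:

# Generate FGSM adversarial examples for a subset of the training data
adv_images = fgsm_attack(model_1, (train_images[:10000], train_labels[:10000]), epsilon)

# Mix clean and adversarial examples, then continue training the same model
mixed_images = np.concatenate([train_images[:10000], adv_images])
mixed_labels = np.concatenate([train_labels[:10000], train_labels[:10000]])
model_1.fit(mixed_images, mixed_labels, epochs=3, batch_size=64)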

I will go into these mitigation methods in further detail in a future article.

Happy reading!

Anastasia is pursuing a Master of Information and Cybersecurity at UC Berkeley’s School of Information. This article was inspired by her final project for the Applied Machine Learning for Cybersecurity course. Anastasia worked on her final project with Jackson Gor and Jenn Yonemitsu, who are also pursuing their Master of Information and Cybersecurity at UC Berkeley’s School of Information.
