Adversarial Attacks and Defenses in Machine Learning

Viacheslav Dubrov
Apr 18, 2023



In the previous article, “Understanding Machine Learning Robustness: Why It Matters and How It Affects Your Models”, we explored the significance of robustness in machine learning and discussed various challenges and strategies for achieving it. As we continue our journey of constructing resilient AI systems, this second article will concentrate on adversarial attacks and defenses in machine learning. Additionally, we will provide a Python example demonstrating how to apply an adversarial attack to a model and defend it against the attack using a popular Python package, the Adversarial Robustness Toolbox (ART).

Types of adversarial attacks and how to defend your model against them:

1. Evasion attack

Evasion attacks involve feeding carefully crafted adversarial examples into a machine learning model during its inference phase. These examples, which are subtly perturbed versions of legitimate inputs, aim to deceive the model into producing incorrect outputs. Evasion attacks can be either targeted or untargeted. In targeted attacks, the goal is to cause the model to produce a specific incorrect output, while in untargeted attacks, the aim is to cause any incorrect output.

There are numerous types of evasion attacks, with some of the most popular ones including:

  • Fast Gradient Sign Method (FGSM): This attack perturbs the input in the direction of the sign of the loss gradient with respect to the input. It is fast and straightforward, but generally less effective than more sophisticated methods (a minimal sketch follows this list).
  • Basic Iterative Method (BIM): BIM improves upon FGSM by applying the noise iteratively, thus increasing the likelihood of success.
  • DeepFool: This method attempts to find the minimal perturbation required to fool a classifier. It is highly effective but computationally expensive.
  • Carlini & Wagner (C&W) Attack: This powerful attack minimizes the perturbation while ensuring the classifier mislabels the adversarial example. It is highly effective but more computationally demanding than other methods.
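To make the FGSM idea concrete, here is a minimal sketch in plain PyTorch (the ART example later in this article performs the same attack through the library). The model, inputs, labels, and eps value are placeholders, and the inputs are assumed to be scaled to the [0, 1] range:

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.1):
    # One-step FGSM: move every input feature by +/- eps along the sign
    # of the gradient of the loss with respect to the input.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0, 1).detach()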

To defend against evasion attacks, various strategies can be employed, including:

  • Adversarial Training: This technique involves incorporating adversarial examples into the training set to make the model more robust against evasion attacks.
  • Gradient Masking: By smoothing the model’s decision boundary, this method reduces the effectiveness of gradient-based attacks. However, the OpenAI team has criticized this approach: hiding gradients does not remove the adversarial examples themselves, and attacks can still succeed via substitute models.
  • Defensive Distillation: This process trains a secondary model using the “softened” output probabilities of the original model, making it more resistant to adversarial examples (a rough sketch follows this list).
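As a rough illustration of defensive distillation, the sketch below trains a student network to match the teacher’s temperature-softened probabilities. The teacher, student, optimizer, and the temperature of 20 are all assumptions made for this example, and both networks are assumed to output raw logits:

import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, x, temperature=20.0):
    # The student is trained to reproduce the teacher's softened output distribution.
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x) / temperature, dim=1)
    student_log_probs = F.log_softmax(student(x) / temperature, dim=1)
    loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()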

You can find examples of adversarial training using ART here.

2. Poisoning Attack

Poisoning attacks involve manipulating the training data to compromise a machine learning model’s performance. By injecting malicious data into the training set, the attacker can manipulate the model’s behavior during the inference phase.

There are two primary categories of poisoning attacks:

  • Label Poisoning: The attacker modifies the labels of some training instances, causing the model to learn incorrect associations between inputs and outputs (a toy example follows this list).
  • Data Injection: The attacker inserts malicious instances into the training set, which can either resemble existing instances or be entirely different, to influence the model’s decision boundary.
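As a toy illustration of label poisoning, the helper below (a sketch, not part of ART) silently relabels a small fraction of training points as an attacker-chosen class; y is assumed to be a NumPy array of integer class labels:

import numpy as np

def flip_labels(y, target_class, fraction=0.05, seed=0):
    # Label poisoning: relabel a small random subset of points as target_class.
    rng = np.random.default_rng(seed)
    y_poisoned = y.copy()
    idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    y_poisoned[idx] = target_class
    return y_poisoned, idx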

Defense Strategies:

  • Data Sanitization: This approach involves filtering and cleaning the training dataset to remove suspicious instances or inconsistencies.
  • Robust Optimization: By incorporating robust optimization techniques during model training, the impact of poisoning attacks can be mitigated.
  • Outlier Detection: Identifying and excluding outliers in the training data can reduce the influence of poisoning attacks on the model’s decision boundary (see the sketch after this list).
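For example, a very simple data sanitization pass can drop points that an off-the-shelf outlier detector flags as anomalous. The sketch below assumes x_train is an array of images and y_train the labels; a real defense would typically filter per class and with far more care:

import numpy as np
from sklearn.ensemble import IsolationForest

def sanitize(x_train, y_train, contamination=0.02):
    # Flag and drop suspected outliers before training.
    detector = IsolationForest(contamination=contamination, random_state=0)
    flat = x_train.reshape(len(x_train), -1)   # flatten images into feature vectors
    keep = detector.fit_predict(flat) == 1     # +1 = inlier, -1 = outlier
    return x_train[keep], y_train[keep]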

ART also provides Python examples of poisoning attacks.

3. Inversion Attacks (also called Inference Attacks)


Inversion attacks aim to reverse-engineer the machine learning model by obtaining sensitive information about its training data. This type of attack primarily targets privacy-sensitive applications such as biometric identification systems.

Examples of inversion attacks include:

  • Model Inversion: The attacker reconstructs a close approximation of an original training instance by exploiting the model’s output probabilities.
  • Membership Inference: The attacker determines whether a particular instance was part of the training set by analyzing the model’s output probabilities (a naive baseline is sketched below).
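The simplest membership inference baseline just thresholds the model’s confidence, since models tend to be more confident on examples they were trained on. The sketch below assumes probabilities is an (n_samples, n_classes) array of softmax outputs, and the 0.95 threshold is arbitrary; real attacks use shadow models and are considerably more sophisticated:

import numpy as np

def membership_inference(probabilities, threshold=0.95):
    # Guess "training-set member" whenever the model is very confident.
    confidence = probabilities.max(axis=1)
    return confidence >= threshold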

Defense Strategies:

  • Differential Privacy: Implementing differential privacy techniques during model training adds carefully calibrated noise, limiting the amount of information that can be inferred about any individual training instance (the core idea is sketched after this list).
  • Homomorphic Encryption: By using homomorphic encryption, a model can be trained and make predictions on encrypted data, protecting the privacy of the training instances.
  • Secure Multi-Party Computation (SMPC): SMPC enables multiple parties to collaboratively train a model on their combined data without revealing their individual instances, protecting against inversion attacks.
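To give a feel for the differential privacy idea without pulling in a full DP-SGD training library (such as Opacus or TensorFlow Privacy), the sketch below applies the classic Laplace mechanism to a single aggregate query; the values are assumed to lie in [0, value_range], which bounds the sensitivity of the mean:

import numpy as np

def private_mean(values, epsilon=1.0, value_range=1.0, seed=None):
    # Laplace mechanism: noise scaled to sensitivity / epsilon.
    rng = np.random.default_rng(seed)
    sensitivity = value_range / len(values)   # changing one record moves the mean by at most this
    noise = rng.laplace(0.0, sensitivity / epsilon)
    return float(np.mean(values)) + noise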

4. Model Extraction Attacks

Model extraction attacks involve stealing a copy of the target model or approximating it by querying the target model’s API. With a copy or approximation of the target model, an attacker can perform a variety of malicious activities, such as creating a competitive product or launching further attacks.

Defense Strategies:

  • API Rate Limiting: By limiting the number of API queries an attacker can make, this method reduces the effectiveness of model extraction attacks.
  • Output Perturbation: Adding noise to the model’s output can hinder an attacker’s ability to accurately approximate the target model (a simple sketch follows this list).
  • Watermarking: Embedding a unique signature or watermark into the model can help in tracing and identifying stolen models.
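As a rough sketch of output perturbation (the function and parameter names here are assumptions, not part of any particular serving framework), an inference endpoint could add a little noise to the probabilities it returns and renormalize them, degrading the quality of the labels an attacker harvests to train a copy:

import numpy as np

def perturb_output(probabilities, noise_scale=0.05, seed=None):
    # Add Gaussian noise to the returned probabilities, then renormalize.
    rng = np.random.default_rng(seed)
    noisy = probabilities + rng.normal(0.0, noise_scale, size=probabilities.shape)
    noisy = np.clip(noisy, 1e-9, None)
    return noisy / noisy.sum(axis=1, keepdims=True)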

ART also provides a model stealing attack example.

Python example

I have already shared resources and links to the ART repository that showcase various examples of adversarial attacks and defense strategies. However, if you’re looking for a simple Python code example to quickly try your first adversarial training against evasion attacks, here it is:

First, ensure you have ART and PyTorch installed:

pip install adversarial-robustness-toolbox torch torchvision

The next Python code example demonstrates how to apply an evasion attack (Fast Gradient Sign Method, or FGSM) to a simple convolutional neural network (CNN) trained on the Fashion-MNIST dataset and how to defend the model against the attack using adversarial training with the help of the Adversarial Robustness Toolbox (ART).

import torch
import torchvision
from torch import nn, optim
from art.attacks.evasion import FastGradientMethod
from art.defences.trainer import AdversarialTrainer
from art.estimators.classification import PyTorchClassifier
from sklearn.metrics import accuracy_score

# Load the Fashion-MNIST dataset
transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
trainset = torchvision.datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=100, shuffle=True)
testset = torchvision.datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=100, shuffle=False)

# Define a simple CNN model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout(0.5)  # plain Dropout: its input is a 1-D feature vector
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = nn.ReLU()(x)
        x = self.conv2(x)
        x = nn.ReLU()(x)
        x = nn.MaxPool2d(2)(x)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = nn.ReLU()(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally
        return x

# Train the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Net().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    for inputs, labels in trainloader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

# Wrap the trained PyTorch model with ART PyTorchClassifier
classifier = PyTorchClassifier(model=model, loss=criterion, optimizer=optimizer, input_shape=(1, 28, 28), nb_classes=10, clip_values=(0, 1))

# Test the model's accuracy on the original test samples
x_test, y_test = zip(*[(x, y) for x, y in testloader])
x_test, y_test = torch.cat(x_test).numpy(), torch.cat(y_test).numpy()
predictions = classifier.predict(x_test)
accuracy = accuracy_score(y_test, predictions.argmax(axis=1))
print(f"Original test data accuracy:{accuracy * 100:.2f}%")

# Perform an evasion attack (FGSM) on the test samples

attack = FastGradientMethod(estimator=classifier, eps=0.3)
x_test_adv = attack.generate(x=x_test)

# Test the model's accuracy on the adversarial test samples

predictions_adv = classifier.predict(x_test_adv)
accuracy_adv = accuracy_score(y_test, predictions_adv.argmax(axis=1))
print(f"Adversarial test data accuracy: {accuracy_adv * 100:.2f}%")

# Defend the model against the evasion attack using adversarial training.
# Train on the training set (not the test set) so the evaluation below stays honest.
x_train, y_train = zip(*[(x, y) for x, y in trainloader])
x_train, y_train = torch.cat(x_train).numpy(), torch.cat(y_train).numpy()
adv_trainer = AdversarialTrainer(classifier, attacks=attack, ratio=0.5)
adv_trainer.fit(x_train, y_train, batch_size=100, nb_epochs=10)

# Retest the model's accuracy on the original test samples after adversarial training

predictions_def = classifier.predict(x_test)
accuracy_def = accuracy_score(y_test, predictions_def.argmax(axis=1))
print(f"Defended test data accuracy: {accuracy_def * 100:.2f}%")

# Retest the model's accuracy on the adversarial test samples after adversarial training
# (note: these adversarial examples were crafted against the original model, not the defended one)
predictions_adv_def = classifier.predict(x_test_adv)
accuracy_adv_def = accuracy_score(y_test, predictions_adv_def.argmax(axis=1))
print(f"Defended adversarial test data accuracy: {accuracy_adv_def * 100:.2f}%")

Conclusion

Adversarial attacks and defenses in machine learning form a vast area, covering a wide range of techniques for compromising and protecting models. This article has outlined the key families of adversarial attacks and their corresponding defense strategies, along with a Python example using the Adversarial Robustness Toolbox (ART). Because the topic is so extensive, I have included numerous links and references for further research. By understanding and applying these techniques, AI practitioners can build more robust and secure machine learning models, contributing to the development of reliable and trustworthy AI systems.
