Improving Adversarial Robustness through Targeted Retraining

Our budding cybersecurity engineer Linus Yeong explored the use of targeted retraining to improve the adversarial robustness of AI models and to evaluate their vulnerabilities under various types of adversarial attacks. He outlines his retraining and testing process as part of his Merit Cyber Scholarship internship with the DSTA Cybersecurity Programme Centre.

d*classified

May 23, 2024

Introduction

As AI systems become more prevalent, it is imperative for the cybersecurity community to develop robust defenses that fortify AI models and AI-enabled systems against adversaries who exploit vulnerabilities to breach security or extract sensitive information. For instance, adversaries can trick AI models with misleading data inputs, leading to incorrect results. Consider an adversarial attack on an image classification model: an attacker might subtly alter a picture of a cat with minor noise or perturbations. To a human, both images look identical (perhaps one is slightly grainier) and both clearly show a cat, yet the AI model might misclassify the altered image as a dog. Strange, right?

Figure 1: Adversarial attack on an image of a cat. Real-Time Adversarial Attack Detection with Deep Image Prior Initialized as a High-Level Representation Based Blurring Network. Accessed on 18 Sep 2023, https://www.mdpi.com/2079-9292/10/1/52.
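For readers curious how such a perturbation is actually produced, the sketch below uses the Fast Gradient Sign Method (FGSM), one of the attacks evaluated later in this post. The pretrained ResNet-18 and the epsilon budget here are illustrative assumptions, not the setup used in this study.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Illustrative sketch only: the model (an off-the-shelf ResNet-18) and the
# epsilon budget are assumptions for demonstration purposes.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

def fgsm_perturb(image: torch.Tensor, label: torch.Tensor, epsilon: float = 8 / 255) -> torch.Tensor:
    """Return a perturbed copy of `image` (shape [1, 3, H, W], values in [0, 1])."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Nudge each pixel in the direction that increases the loss, then clip
    # back to the valid pixel range so the change stays visually negligible.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Usage: x is a preprocessed cat image, y its true class index.
# x_adv = fgsm_perturb(x, torch.tensor([y]))
# model(x_adv).argmax(dim=1)  # may no longer be "cat"
```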

This poses an increasing threat to AI, particularly in the realms of safety and security. Consider a self-driving vehicle that relies on an image recognition model to make decisions. When it detects a red light, it should halt. However, if an attacker manages to subtly alter the image and trick the model into misclassifying the red light as green, the consequences could be catastrophic.

Photo by Shubham Dhage on Unsplash

Targeted Retraining

To defend against such attacks, a common strategy involves retraining the model with a dataset that includes both clean and adversarial samples. However, this approach often runs into the robustness-accuracy tradeoff, where improving the model’s accuracy on adversarial samples leads to a decrease in accuracy on clean samples. To mitigate this issue, we explore targeted retraining, where the model is retrained using adversarial samples from critical classes and clean samples from the remaining classes. The idea is that misclassifying a less critical class results only in a false alarm, whereas misclassifying a critical class could have severe consequences, such as mistaking a combat vessel for a yacht. By limiting the model’s exposure to adversarial samples to the classes that matter most, targeted retraining aims to strike a better balance in the robustness-accuracy tradeoff.
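As a rough sketch of what preparing such a targeted retraining set could look like in code (the helper name, the attack object, and the choice of critical classes are assumptions for illustration, not the exact pipeline used here):

```python
import numpy as np

def build_targeted_retraining_set(x_clean, y, attack, critical_classes):
    """Mix adversarial samples for critical classes with clean samples for the rest.

    x_clean: clean inputs (np.ndarray), y: integer labels (np.ndarray),
    attack: any object exposing .generate(x), e.g. an ART evasion attack,
    critical_classes: iterable of class indices that must stay robust.
    """
    critical_mask = np.isin(y, list(critical_classes))
    x_retrain = x_clean.copy()
    # Only the critical classes are swapped for their adversarial counterparts;
    # every other class keeps its clean samples, limiting exposure to
    # adversarial data and, with it, the hit to clean accuracy.
    x_retrain[critical_mask] = attack.generate(x=x_clean[critical_mask])
    return x_retrain, y
```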

Recently, a benchmark called RobustBench, which employs AutoAttack (AA), was developed to evaluate the robustness of adversarially trained models. We will use targeted AA adversarial samples to retrain a model and compare the results between standard and targeted retraining. The adversarial samples are generated using the Adversarial Robustness Toolbox (ART), a Python library for machine learning security. The specifications and methodology for the training are detailed below:

Table 1: Experiment Setup Specifications
Figure 2: The retraining and testing processes
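For a flavour of how the AA adversarial samples can be produced with the Adversarial Robustness Toolbox, a minimal sketch is shown below; the placeholder network, CIFAR-10-style input shape, epsilon, and batch size are assumptions rather than the exact experiment settings.

```python
import numpy as np
import torch.nn as nn
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import AutoAttack

# Placeholder network: in practice this would be the trained model under evaluation.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

# Wrap the PyTorch model so ART attacks can query gradients and predictions.
classifier = PyTorchClassifier(
    model=model,
    loss=nn.CrossEntropyLoss(),
    input_shape=(3, 32, 32),
    nb_classes=10,
    clip_values=(0.0, 1.0),
)

# AutoAttack is the parameter-free ensemble attack behind RobustBench.
attack = AutoAttack(estimator=classifier, eps=8 / 255, batch_size=128)

x_clean = np.random.rand(8, 3, 32, 32).astype(np.float32)  # stand-in for real data
x_adv = attack.generate(x=x_clean)
```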

Given the constraints of time and hardware resources, we will focus on retraining with three sets of clean and adversarial samples. It’s worth highlighting that the selected samples are all true positives, meaning the model originally classified them correctly before retraining. This targeted approach ensures that we make the most effective use of our resources while maintaining the integrity of our evaluation process.
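Selecting those true positives amounts to keeping only the samples the original model already predicts correctly, continuing from the sketch above (`classifier` and `x_clean` are the assumed objects from that sketch, and the labels here are a stand-in):

```python
import numpy as np

# Keep only true positives: samples the original model classifies correctly
# before retraining, so retraining targets genuine adversarial failures rather
# than pre-existing errors.
y = np.random.randint(0, 10, size=len(x_clean))           # stand-in labels
predictions = classifier.predict(x_clean).argmax(axis=1)   # ART returns class scores
true_positive_mask = predictions == y
x_tp, y_tp = x_clean[true_positive_mask], y[true_positive_mask]
```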

Results

Our results, presented in Figure 3, illustrate several key observations when comparing targeted AA-retrained models to untargeted ones. The targeted retraining approach resulted in higher clean accuracies for both the general and target classes. However, while adversarial accuracy decreased for the general class, it increased for the target class. Despite a notable drop in clean accuracy for the target class, there was a significant boost in its adversarial accuracy. The most substantial improvement in adversarial accuracy for the target class was observed with FGSM samples, which might imply that retraining on AA samples transfers well to FGSM-like attacks. However, this could be misleading, as the model might be overfitted to defend against a specific epsilon value of FGSM and could fail against other epsilon values.

Figure 3: Performance (in terms of clean and adversarial accuracy) of untargeted and targeted AA-retrained models. Clean accuracies (for General Class and Target Class) refer to the model’s accuracy on the unperturbed original samples. Adversarial accuracies (<Attack>-General and <Attack>-Target) refer to the model’s accuracy on the samples perturbed using the corresponding attack.
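One way to probe the epsilon-overfitting concern raised above is to sweep FGSM over several perturbation budgets and watch how adversarial accuracy changes. A rough sketch, assuming the ART `classifier` wrapper from earlier and a hypothetical held-out split `x_test`/`y_test`:

```python
from art.attacks.evasion import FastGradientMethod

# Sweep FGSM over several perturbation budgets: a model that only resists the
# single epsilon it was retrained against will degrade sharply elsewhere.
for eps in (2 / 255, 4 / 255, 8 / 255, 16 / 255):
    fgsm = FastGradientMethod(estimator=classifier, eps=eps)
    x_adv = fgsm.generate(x=x_test)
    accuracy = (classifier.predict(x_adv).argmax(axis=1) == y_test).mean()
    print(f"eps={eps:.4f}  adversarial accuracy={accuracy:.3f}")
```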

Limitations

The study faced several limitations, including restricted training duration, limited testing size, and issues with overfitting. These constraints hindered the model’s ability to distinguish real inputs from adversarial ones effectively. Future improvements could involve fine-tuning training parameters and exploring diverse training methods to enhance the model’s robustness further.

Conclusion

Our investigation into targeted retraining highlights its potential to bolster the adversarial robustness of AI models, especially for critical classes. While the study faced limitations such as restricted resources and overfitting, the results are promising. Targeted retraining offers a viable strategy to balance the robustness-accuracy trade-off, ensuring that AI models remain resilient and accurate when deployed in real-world scenarios. Future research should focus on refining training parameters and expanding training methods to further enhance the effectiveness of adversarial defenses in AI systems.

Cybersecurity Engineering with national impact

Are you a cybersecurity ninja in disguise, ready to unleash your potential? It’s time to join DSTA’s Cybersecurity Programme Centre, where we red-team and bolster the robustness of future AI-enabled systems! Become part of our elite team and help develop innovative defenses that outsmart adversaries. Find your community with us: work on secret-edge, groundbreaking research to enhance AI robustness and security, ensuring our systems remain rock-solid. If you’re eager to tackle the toughest challenges and bring your expertise to the forefront, we want you on our team.

Photo by Jefferson Santos on Unsplash
