Adversarial Attacks and Defences

Introduction to Adversarial Attacks

Adversarial attacks in AI refer to manipulations that trick ML models into producing incorrect results by taking advantage of the way these models are trained and function. These manipulations are crafted maliciously to degrade the AI system's reliability and effectiveness.

Understanding What an ‘Adversarial Example’ Is

An input that has been crafted to trick or mislead a neural network into assigning it to an object category it does not belong to is known as an “adversarial example.”

But what risk does this pose?

Consider an image of a monkey that is shown to a person who is asked to identify it. The person identifies the picture based on the visible physical characteristics of a monkey. An AI asked to identify the same image instead recognizes a specific pattern of pixel values that corresponds to a picture of a monkey. If the pixel values of another image, say a car, are altered in just the right way, the AI will identify the car as a monkey.

Types of Adversarial Attacks

Evasion attacks: Evasion attacks confuse or evade the model during the inference phase. The attacker alters the data, i.e. applies a perturbation, to confuse the classifier; the resulting adversarial example is fed into the model input and produces a false classification result.

Inference attacks: These attacks can also be described as ‘data theft’. The attacker attempts to recover private data that was presented to the model during the training phase, posing a privacy and confidentiality risk.
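As a rough illustration, the sketch below shows the simplest form of such an attack, membership inference by confidence thresholding: the model's over-confidence on records it has seen during training is used as a weak membership signal. The `confidence_scores` input and the threshold value are illustrative assumptions, not a standard API.

```python
import numpy as np

def membership_inference(confidence_scores, threshold=0.95):
    """Toy membership-inference test.

    Models tend to be over-confident on inputs they were trained on, so a
    very high top-class confidence is treated as a weak signal that the
    record was part of the training set.

    confidence_scores: array of shape (n_samples, n_classes) holding the
    model's predicted probabilities for the records being tested.
    """
    top_confidence = confidence_scores.max(axis=1)
    return top_confidence >= threshold  # True = "likely a training member"
```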

Extraction attacks: These attacks steal the model itself, i.e. the attacker infers the model's architecture and approximates its parameters, typically by querying it and exploiting the confidence scores it returns.
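A minimal sketch of this idea is shown below, assuming query access to a black-box `victim_predict` function that returns confidence scores and a locally defined `surrogate` network; both names, and the training setup, are hypothetical.

```python
import torch
import torch.nn.functional as F

def extract_surrogate(victim_predict, surrogate, query_batches, optimizer, epochs=10):
    """Toy model-extraction loop: train a surrogate to imitate a black-box
    victim using only the confidence scores (soft labels) it returns."""
    surrogate.train()
    for _ in range(epochs):
        for x in query_batches:                    # batches of probe inputs
            with torch.no_grad():
                soft_labels = victim_predict(x)    # victim's confidence scores
            optimizer.zero_grad()
            log_probs = F.log_softmax(surrogate(x), dim=1)
            # Match the surrogate's output distribution to the victim's.
            loss = F.kl_div(log_probs, soft_labels, reduction="batchmean")
            loss.backward()
            optimizer.step()
    return surrogate
```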

Examples of Adversarial Attacks

a. Gradient-Based Attacks:

Gradient-based attacks repurpose the back-propagation algorithm to create a perturbation vector for the input image. During training, back-propagation computes the error between the desired output and the output the neural network produces for a particular input, and propagates its gradient back to the model parameters. Gradient-based attacks instead treat the input as the variable and the model parameters as constants.

Gradients corresponding to each input element — for instance, pixels in the case of images — can therefore be obtained. These gradients have been used in a variety of ways to produce the perturbation vector, which increases the likelihood that the new adversarial example will be incorrectly classified — as long as it meets the requirement of being extremely similar to the input. These are referred to as gradient-based attacks because the perturbation vectors are obtained by using the input gradients.
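The Fast Gradient Sign Method (FGSM) is the classic example of this family. Below is a minimal PyTorch sketch, assuming a classifier `model` that maps image batches to logits and inputs scaled to [0, 1]; the epsilon value is illustrative.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """Craft an adversarial example with the Fast Gradient Sign Method.

    The input is treated as the variable: the gradient of the loss is taken
    with respect to the image pixels, not the model parameters.
    """
    image = image.clone().detach().requires_grad_(True)
    logits = model(image)
    loss = F.cross_entropy(logits, label)

    model.zero_grad()
    loss.backward()  # back-propagation, but we only use image.grad

    # Step in the direction that increases the loss, keeping pixels in [0, 1].
    perturbation = epsilon * image.grad.sign()
    adversarial = torch.clamp(image + perturbation, 0.0, 1.0)
    return adversarial.detach()
```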

b. Decision-Based Attacks:

The aim of a decision-based adversarial attack is to generate adversarial examples against a trained model based solely on observing the output labels returned by the targeted model.

It involves crafting imperceptible perturbations to input data, aiming to mislead a machine-learning model into incorrect classifications. Without access to internal model parameters, attackers generate perturbations iteratively to cause misclassifications, making the attacks challenging yet realistic in black-box scenarios.
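A toy sketch of this label-only setting, loosely in the spirit of the Boundary Attack, is shown below. The black-box `predict_label` function (returning only a class id), the step sizes, and the iteration count are assumptions for illustration.

```python
import numpy as np

def decision_based_attack(predict_label, x, true_label, steps=1000, sigma=0.1, rng=None):
    """Toy label-only (decision-based) attack: it only ever calls
    predict_label(x) -> class id, never gradients or confidence scores."""
    rng = np.random.default_rng() if rng is None else rng

    # 1. Find a starting point that is already misclassified (large noise).
    adv = np.clip(x + rng.normal(0.0, 1.0, size=x.shape), 0.0, 1.0)
    while predict_label(adv) == true_label:
        adv = np.clip(x + rng.normal(0.0, 1.0, size=x.shape), 0.0, 1.0)

    # 2. Repeatedly propose a point slightly closer to the original input and
    #    keep it only if the model still returns the wrong label.
    for _ in range(steps):
        candidate = adv + sigma * rng.normal(0.0, 1.0, size=x.shape)
        candidate = np.clip(candidate + 0.1 * (x - candidate), 0.0, 1.0)
        if predict_label(candidate) != true_label:
            adv = candidate
    return adv
```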

Motivation behind Adversarial Attacks

The intent behind adversarial attacks revolves around three key points:

  1. Deception: To deceive the model and thereby the user by making changes to the input.
  2. Evasion: To evade security features and other prerequisites to access confidential information.
  3. Confusion: To cause confusion and chaos by generating repeated wrong outputs, thereby degrading the system's usefulness to the user.

Some commonly targeted systems include:

a. Image Classification Systems

b. Natural Language Processing Models

c. Autonomous Vehicles

d. Biometric Systems

Adversarial Defence Techniques

  • Training Data Augmentation: generating adversarial examples and including them in the training set of a DNN (see the sketch after this list).
  • Regularization: adding terms to the training loss function that guide gradient descent toward parameters that are more resilient to adversarial attacks.
  • Optimization: replacing the training goal of minimizing the loss with minimizing the largest loss an adversarial attack could achieve (a min-max formulation).
  • Distillation: training in a way that reduces the gradients in the DNN's loss landscape, which lessens the influence of adversarial examples.
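As a sketch of the first technique, training-data augmentation (adversarial training), the loop below pairs each clean batch with FGSM examples crafted by the `fgsm_attack` helper from the gradient-based sketch above; `model`, `loader`, `optimizer`, and the equal weighting of the two loss terms are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, epsilon=0.03):
    """One epoch of training-data augmentation: each clean batch is paired
    with FGSM adversarial examples, and both contribute to the loss."""
    model.train()
    for images, labels in loader:
        # Craft adversarial counterparts of the current batch (see fgsm_attack above).
        adv_images = fgsm_attack(model, images, labels, epsilon)

        optimizer.zero_grad()
        loss_clean = F.cross_entropy(model(images), labels)
        loss_adv = F.cross_entropy(model(adv_images), labels)
        loss = 0.5 * (loss_clean + loss_adv)  # equal weight on clean and adversarial terms
        loss.backward()
        optimizer.step()
```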

Conclusion

To conclude, adversarial attacks pose significant risks by exploiting vulnerabilities in AI systems, leading to incorrect classifications or compromised privacy. Gradient-based attacks manipulate input gradients to craft adversarial examples, while decision-based attacks rely solely on output labels to deceive models. Motivations behind such attacks include deception, evasion, and causing confusion. Targeted systems range from image classification to autonomous vehicles. Defence techniques involve training data augmentation, regularization, optimization, and distillation to enhance resilience against adversarial manipulation, aiming to safeguard AI systems and their users.
