Adversarial attacks: A detailed review — Part 1

Vishal Ranjan
Subex AI Labs
Jan 9, 2023

Deep Learning has proven to be a very effective tool in recent times when it comes to solving challenging problems across various fields such as Healthcare (computer-aided diagnosis, drug discovery), Finance (fraud detection), Automobiles (self-driving cars, robotics), Media (news aggregation and fake news detection), and other day-to-day utilities (such as virtual assistants, language translation, information extraction).

However, it is now known that Deep Learning is vulnerable to adversarial attacks that can manipulate its predictions by introducing almost indistinguishable perturbations into audio, images, or videos. In this series of articles we will understand what an adversarial attack is and how it manipulates deep learning models in order to obtain the output the attacker desires.

A few examples of adversarial attacks: (a) a panda misclassified as a gibbon after the addition of a small amount of noise; (b) a stop sign misclassified as a speed-limit sign; (c) a person wearing a particular pattern not being detected

Outline

First, we will define the technical terms that commonly appear in publications on this topic. Then we will understand what an adversarial attack is and give a formal definition of the problem statement. In the final section, we will classify attacks into different categories based on various attributes.

Common terminology and definitions

Here we define some terms that appear frequently in papers on adversarial attacks and that will also come up often in this article.

Adversarial example/image: It is an example/image that is intentionally manipulated to cause an incorrect model prediction. This example (or a series of such examples) is provided as input to the model.

Adversarial perturbation: It is the component of an adversarial example/image that causes the incorrect prediction. Commonly, it is a low magnitude additive noise-like signal.

Adversary: It is the agent (i.e. the attacker) creating an adversarial example. Alternatively, the adversarial signal/perturbation is also referred to as the adversary, albeit much less often.

Defense/adversarial defense: It is a broader term used for any mechanism of introducing robustness in a model, or external/internal mechanisms to detect adversarial signals, or image processing to negate adversarial effects of input manipulations.

Target image: It is the clean example/image being manipulated by the adversary.

Target label: It is the (desired) incorrect label of the adversarial example. The term is more relevant for classification problems.

What is an Adversarial attack?

An adversarial attack tries to trick a deep learning model into producing the attacker's desired output by making minimal alterations to the input, or to extract useful information from the model using various tactics. The attacker may have a varying degree of access to the target deep learning model, its weights, and its training dataset. Based on the degree of access the attacker has, attacks can be classified into various categories, which are described later in this article.

Formally, the problem can be described using the equation below:

Formal problem statement describing adversarial attacks
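The original equation image is not reproduced here. One common way to state the problem formally (with x the clean target image, ρ the adversarial perturbation, f the model, and ℓ_target the desired target label; this notation is chosen here only for illustration) is:

```latex
\text{find } \rho \quad \text{s.t.} \quad
f(x + \rho) \neq f(x) \;\; \text{(misclassification)}
\quad \text{or} \quad
f(x + \rho) = \ell_{\text{target}} \;\; \text{(targeted attack)},
\qquad \text{subject to} \quad \lVert \rho \rVert_{p} \leq \eta
```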

It is common practice to keep the pre-defined scalar threshold (η) small, so that the difference appears minimal to a human observer. Similarly, the most common choices of p are 1, 2, and ∞, although the formulation is not limited to these.
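As a quick, hypothetical illustration of the norm constraint, the few lines of NumPy below compute the p-norms of a random perturbation; the array shapes and the noise itself are placeholders, not a real attack.

```python
import numpy as np

# Hypothetical clean image in [0, 1] and a small additive, noise-like perturbation.
x = np.random.rand(224, 224, 3)
rho = np.random.uniform(-0.01, 0.01, size=x.shape)
x_adv = np.clip(x + rho, 0.0, 1.0)  # the adversarial example must stay a valid image

delta = (x_adv - x).ravel()
for p in (1, 2, np.inf):
    print(f"||rho||_{p} = {np.linalg.norm(delta, ord=p):.4f}")

# An attack is usually considered valid only if the chosen norm stays below the
# threshold eta, e.g. an L-infinity budget of 8/255 for 8-bit images.
```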

Classification of adversarial attacks

A Machine Learning system can be viewed as a generalized data-processing pipeline (see the figure below). At inference time, (a) input features are collected from sensors or data repositories, (b) processed in the digital domain, (c) used by the model to produce an output, and (d) the output is communicated to an external system or user and acted upon. To illustrate, the figure shows a generic pipeline alongside an autonomous-vehicle system and a network intrusion detection system (middle and bottom). Since an attack can take many forms depending on the adversary's goals and their ability to access the model and data, we will classify attacks based on where and how they act on this generic pipeline.

A generic ML system pipeline (with examples)

Attacks are classified based on three properties: attack surface, adversarial capabilities, and adversarial goals.

Attack surface

Given a pipeline of steps, an attacker can choose to target a particular step (or surface) of the pipeline in order to achieve his/her goal. The main attack scenarios identified by the attack surface are sketched as follows:

  1. Evasion attack: This is the most common type of attack in the adversarial setting. The adversary tries to evade the system by adjusting malicious samples during the testing phase. This setting does not assume any influence over the training data.
  2. Poisoning attack: This type of attack, also known as contamination of the training data, is carried out during the training phase by injecting skillfully crafted samples that poison the system and compromise the entire learning process (a minimal label-flipping sketch is given after this list).
  3. Exploratory attack: These attacks do not influence the training dataset. Given black-box access to the model, they try to gain as much knowledge as possible about the learning algorithm of the underlying system and the patterns in the training data.
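As a toy illustration of a poisoning attack, the sketch below flips a fraction of training labels before the model is trained. All arrays and names are hypothetical stand-ins, not a specific published attack.

```python
import numpy as np

# Minimal sketch of label-flipping poisoning: an adversary who can modify the
# training data flips some labels so the trained model confuses two classes.
rng = np.random.default_rng(0)

X_train = rng.normal(size=(1000, 20))      # stand-in for real training features
y_train = rng.integers(0, 2, size=1000)    # binary labels

poison_rate = 0.2
source, target = 0, 1                      # flip a fraction of '0' labels to '1'
candidates = np.where(y_train == source)[0]
poisoned_idx = rng.choice(candidates, size=int(poison_rate * len(candidates)), replace=False)

y_poisoned = y_train.copy()
y_poisoned[poisoned_idx] = target          # training then proceeds on corrupted labels
print(f"Flipped {len(poisoned_idx)} of {len(candidates)} '{source}' labels to '{target}'")
```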

Adversarial capabilities

This term refers to the amount of information available to the adversary about the system. We explore the range of attacker capabilities by dividing them into training-phase and testing-phase capabilities.

  1. Training phase capabilities: Most attacks in the training phase work by influencing or corrupting the model through direct alteration of the training dataset. Based on the adversary's capabilities, the attack strategies are broadly classified into the following three categories:

Data injection: The adversary cannot access the training data or the learning algorithm but has the ability to add new data to the training set. They can corrupt the target model by inserting adversarial samples into the training set.

Data modification: The adversary does not have access to the learning algorithm but has full access to the training data. They poison the training data directly by modifying it before it is used to train the model.

Logic corruption: The adversary is able to meddle with the learning algorithm itself. It becomes very difficult to design counter-strategies against such attacks.

  2. Testing phase capabilities: Adversarial attacks at testing time do not interfere with the targeted model but rather force it to produce incorrect outputs. These attacks can be categorized as white-box or black-box attacks.

White-box attack: In a white-box attack on a machine learning model, the adversary has total knowledge of the model (e.g. the type of neural network and its number of layers, the algorithm used in training, and the parameters (θ) of the fully trained model). The adversary uses this information to analyze the feature space where the model might be vulnerable, that is, where the model has a high error rate. Access to the internal model weights makes a white-box attack a very strong form of adversarial attack (a minimal sketch of one classic white-box attack, the fast gradient sign method, is given after this section).

Black-box attack: A black-box attack assumes no knowledge of the model and instead uses information about past settings and inputs to exploit it. Black-box attacks are further subdivided into three categories: non-adaptive black-box attacks, adaptive black-box attacks, and strict black-box attacks.
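To make the white-box setting concrete, below is a minimal, hedged PyTorch sketch of the fast gradient sign method (FGSM), a classic white-box evasion attack. The model, images, labels, and budget epsilon are hypothetical placeholders; the only assumption is that the model maps image batches in [0, 1] to class logits.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=8/255):
    """Minimal FGSM sketch: one signed-gradient step within an L-infinity budget.

    `model`, `x` (images in [0, 1]) and `y` (true labels) are hypothetical placeholders.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # white-box: we can backpropagate through the model
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()   # move each pixel in the direction that increases the loss
    return torch.clamp(x_adv, 0.0, 1.0).detach()
```

Here epsilon plays the role of the threshold η from the formal statement above, with p = ∞.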

Adversarial goals

Based on the objective of the adversary, attacks can be divided into the following four categories:

  1. Confidence reduction: The adversary tries to reduce the confidence of the target model's prediction. For example, a legitimate image of a 'stop' sign is predicted with lower confidence, i.e. a lower probability of belonging to its true class.
  2. Mis-classification: The adversary tries to alter the output classification of an input example to some other class. For example, a legitimate image of a 'stop' sign is predicted as any class other than 'stop' sign.
  3. Targeted mis-classification: The adversary crafts the inputs in such a way that the model produces the output of a particular target class (a minimal sketch contrasting this with plain mis-classification is given after this list).
  4. Source/target mis-classification: The adversary tries to have a particular input source classified as a predefined target class. For example, the input image of a 'stop' sign is predicted as a 'go' sign by the classification model.
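To make the difference between plain mis-classification (goal 2) and targeted mis-classification (goals 3 and 4) concrete, here is a hedged targeted variant of the FGSM sketch shown earlier; the only changes are the loss used and the sign of the step. The model, inputs, and target labels are again hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def targeted_fgsm(model, x, y_target, epsilon=8/255):
    """Targeted variant of the earlier FGSM sketch: instead of increasing the loss
    of the true class, descend the loss of the adversary's chosen target class."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_target)
    loss.backward()
    x_adv = x - epsilon * x.grad.sign()   # note the minus sign: step *towards* the target class
    return torch.clamp(x_adv, 0.0, 1.0).detach()
```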

All categories and sub-categories of adversarial attacks can be summarised in the flow-chart shown below:

Flow-chart of the types of adversarial attacks

Thus, we have studied what an adversarial attack is and the different ways it can be classified based on its properties. In the next parts, we will look at some of the most common types of attacks and at how adversarial attacks extend beyond image classification to object detection, object tracking, NLP, and audio tasks.
