Brief Introduction to Adversarial Attacks on Deep Learning

Akshatha Ballal
Published in GatorHut · Dec 11, 2023
Source: Towards AI by James Rodriguez

Deep learning has become the workhorse for applications ranging from self-driving cars to medical image analysis and face recognition. These breakthroughs are attributed to advances in neural network architectures, as well as the availability of huge amounts of data and computational power. Characteristic examples are autonomous cars that don’t require human intervention, systems that outperform human experts in disease diagnosis, and face recognition software that surpasses human performance.

The more we understand the capabilities of deep learning, the more apparent its potential security risks become. Back in 2017, at the 31st Conference on Neural Information Processing Systems in Long Beach, a group of researchers demonstrated what became known as the adversarial patch. By placing a specially crafted sticker near or on an object (the input) to an image recognition system, they could make the neural network classify the object as a toaster. In their experiment, a banana was used as the input. On its own, it is classified as a banana with high confidence, but as soon as the patch is added, the network thinks it is a toaster. Adversarial machine learning, the study of techniques that attempt to fool models with deceptive data, is a growing threat that any organization developing, deploying, or scaling computer vision applications needs to take seriously.

Adversarial attacks

An adversarial attack consists of subtly modifying an original image in such a way that the changes are almost undetectable to the human eye. The modified image is called an adversarial image; when submitted to a classifier it is misclassified, while the original image is classified correctly. The real-life consequences of such attacks can be severe: for instance, one could modify a traffic sign so that it is misinterpreted by an autonomous vehicle, causing an accident. Another example is inappropriate or illegal content being modified so that it goes undetected by the content moderation algorithms used on popular websites or by police web crawlers.

Adversarial attacks that render DNNs vulnerable in real life represent a serious threat to autonomous vehicles, malware filters, and biometric authentication systems. The typical goal of an adversarial attack is to add a small perturbation to an image so that the target model misclassifies the sample while a human still recognizes it correctly. Adversarial attacks are split into classes along two main axes:

1. By the attacker’s access to the parameters of the targeted model:

White-box. The adversary is entirely aware of the targeted model (i.e., its architecture, loss function, training data, etc.).
Black-box. The adversary does not have any information about the targeted model.

2. By the attacker’s goal when creating an adversarial image:

Non-targeted. The adversary only requires the adversarial image to be misclassified, regardless of the true image’s class; the output can be any class other than the original one.
Targeted. The adversary forces the adversarial image to be assigned to a specific class; the network’s output is one particular chosen class.

Noise

Noise is a small vector introduced to generate adversarial sample images. Its elements are equal to the sign of the gradient elements of the cost function with respect to the input. It can be computed using the following formula:

η = ε · sign(∇x J(θ, x, y))

Where η — noise, ε — small value, θ — parameters of the model, x — input image, and y — target label

J(θ, x, y) — cost function used to train the neural network

∇x — gradient of the loss function with respect to the input image
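
As a concrete illustration, the snippet below is a minimal sketch of this formula in PyTorch; the classifier model, the input batch, and the ε value of 0.03 are assumptions made for the example, not taken from the text.

```python
import torch
import torch.nn.functional as F

def fgsm_noise(model, x, y, epsilon=0.03):
    """Compute eta = epsilon * sign(grad_x J(theta, x, y)) for a batch of images.
    `model` is an assumed PyTorch classifier returning logits; x are images, y labels."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # J(theta, x, y), the training cost
    loss.backward()                       # populates x.grad with grad_x J
    return epsilon * x.grad.sign()        # the noise vector eta
```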

White-box attacks

White-box attacks are the easiest to perform, since the attacker has full knowledge of the model’s parameters. This means the attacker knows θ and can use gradient information to produce adversarial examples. FGSM and C&W are examples of white-box attacks.

The Fast Gradient Sign Method (FGSM) is a simple and fast gradient-based method for generating adversarial examples; it causes misclassification while keeping the maximum perturbation added to any pixel of the image small. The method is computationally efficient, but the perturbation is added to every feature (every pixel). The amount of modification is usually measured with the L∞ norm, i.e., the absolute maximum change to any single pixel.

Source: TechTalks

The image above shows how an adversarial image is generated. Noise is added to an image of a panda, which the neural network originally classifies as a panda with 57.7% confidence. The noise is invisible to the human eye, but the neural network reacts to it: it now classifies the image as a gibbon with 99.3% confidence. The essence of this class of attacks is that the attacker modifies the original image in the direction of the gradient of the loss function with respect to the input image. The higher the model’s confidence in predicting the true label, the more difficult the model is to attack; if the confidence is above 90%, ε has to be larger.
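
Building on the noise sketch above, a hedged version of the full (untargeted) FGSM step might look as follows; pixel values in [0, 1] and the ε value are again assumptions for illustration.

```python
import torch

def fgsm_attack(model, x, y, epsilon=0.03):
    """Untargeted FGSM: step in the direction of the sign of the loss gradient.
    Reuses fgsm_noise() from the earlier sketch."""
    eta = fgsm_noise(model, x, y, epsilon)
    x_adv = torch.clamp(x + eta, 0.0, 1.0)  # keep pixels in a valid [0, 1] range
    return x_adv.detach()
```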

Carlini & Wagner Attack (C&W) is a white box method that is more efficient at generating adversarial examples; it was shown to be able to defeat state-of-the-art defenses, such as defensive distillation and adversarial training. It is more computationally intensive than FGSM.

Black-box attacks

Black-box attacks are significantly harder to perform. In this case, the attacker has no information about the parameters of the model, nor access to it during the training stage. This means that no gradient information can be used when crafting malicious examples. The model either outputs confidence scores for each class or just the predicted label.

The Square Attack is a black-box evasion attack based on querying classification scores, with no need for gradient information. As a score-based black-box attack, it can query the probability distribution over the model’s output classes but has no other access to the model itself.
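
As a rough illustration of the score-based idea (a heavily simplified sketch in the spirit of the Square Attack, not the published algorithm), an attacker with access only to a predict_proba query function might proceed as follows; the image shape, patch size, and query budget are assumptions.

```python
import numpy as np

def score_based_random_search(predict_proba, x, true_label, epsilon=0.05,
                              n_queries=1000, patch=8):
    """Simplified score-based black-box attack: propose random square-shaped
    perturbations of magnitude epsilon and keep a change only if it lowers the
    score of the true class. `predict_proba` is assumed to return class
    probabilities for a single H x W x C image with pixels in [0, 1]."""
    x_adv = x.copy()
    best = predict_proba(x_adv)[true_label]
    h, w, c = x.shape
    for _ in range(n_queries):
        candidate = x_adv.copy()
        i, j = np.random.randint(0, h - patch), np.random.randint(0, w - patch)
        # overwrite a random square region with the original pixels shifted by +/- epsilon
        signs = np.random.choice([-epsilon, epsilon], size=(1, 1, c))
        candidate[i:i + patch, j:j + patch, :] = np.clip(
            x[i:i + patch, j:j + patch, :] + signs, 0.0, 1.0)
        score = predict_proba(candidate)[true_label]
        if score < best:            # accept only if the true-class score drops
            x_adv, best = candidate, score
    return x_adv
```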

The HopSkipJump attack is a query-efficient black-box attack that relies solely on the predicted output class for any input. Unlike the Square Attack, it does not require gradient calculations or access to score values; the model’s class prediction alone is enough.

Evasion attacks and poisoning attacks

In evasion attacks, an adversary creates adversarial examples by adding small perturbations to test samples to induce their misclassification at model deployment time. Poisoning attacks instead require modifying the training data (either the samples or the labels) to poison a model at training time. The difference between evasion attacks, which take place at deployment time, and poisoning attacks, which require training data modification, is shown in the figure below.

Source: Computer.org

Model evasion attacks are the most common sort of attack employed in penetration and malware scenarios, since they are carried out during the deployment phase. Attackers frequently try to avoid being discovered by obfuscating the content of spam emails or malware. Spoofing attacks against biometric verification systems are another example of evasion. The goal of a model evasion attack is to cause the machine learning model to misclassify observations during the testing phase, as shown in the figure.

Source: ResearchGate

A poisoning attack is essentially adversarial contamination of the training data. Usually, the adversary adds some pattern (the poison) to training images or blends images of different classes. In the figure below, the adversary has added white blocks to the bottom part of the training images. During the poisoning process, the adversary also changes the labels of the poisoned images to the target class. During training, the network learns to perceive such white blocks as features of the target class. Accordingly, after training, the model behaves normally on ordinary images, but as soon as it encounters the adversarial pattern (the white blocks), it produces the output that the adversary intended.

Source: BroutonLab
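
A minimal sketch of the white-block poisoning described above might look as follows; the array layout, poisoning rate, and patch size are assumptions for illustration, not taken from the referenced work.

```python
import numpy as np

def poison_dataset(images, labels, target_class, rate=0.1, patch=4):
    """Illustrative backdoor-style poisoning: stamp a white block onto a fraction
    of training images and relabel them as `target_class`. Images are assumed to
    be float arrays of shape (N, H, W, C) scaled to [0, 1]."""
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * rate)
    idx = np.random.choice(len(images), n_poison, replace=False)
    for i in idx:
        images[i, -patch:, -patch:, :] = 1.0   # white block in the bottom-right corner
        labels[i] = target_class               # flip the label to the attacker's target
    return images, labels
```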

Model extraction, or model stealing, is the extraction of enough data from a model to allow for its complete reconstruction. Model extraction attacks can be used, for instance, to steal a stock market prediction model, which the adversary could then use for their own financial benefit.
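
As a toy sketch of the idea, assuming only black-box prediction access through a hypothetical victim_predict function and using a scikit-learn classifier as the local surrogate:

```python
from sklearn.neural_network import MLPClassifier

def extract_surrogate(victim_predict, queries):
    """Toy model extraction: label attacker-chosen query inputs with the victim's
    predicted classes, then fit a local surrogate on those (input, prediction) pairs.
    `queries` is assumed to be a 2-D array of flattened inputs; `victim_predict`
    returns a class label for a single row."""
    stolen_labels = [victim_predict(x) for x in queries]   # black-box queries only
    surrogate = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300)
    surrogate.fit(queries, stolen_labels)                  # train the local copy
    return surrogate
```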

A Byzantine attack targets distributed machine learning. When machine learning is scaled, it often relies on multiple computing machines, and some of these devices may deviate from their expected behavior, e.g., to harm the central server’s model or to bias the algorithm towards certain behaviors.

Other techniques for generating adversarial examples

Jacobian-based Saliency Map Attack (JSMA): This method uses feature selection to minimize the number of features modified while still causing misclassification. Perturbations are added to features iteratively, in decreasing order of their saliency values. It is more computationally intensive than FGSM.

DeepFool Attack: This untargeted adversarial sample generation technique minimizes the Euclidean distance between perturbed samples and original samples. Decision boundaries between classes are estimated, and perturbations are added iteratively.

Generative Adversarial Networks (GANs): GANs have been used to generate adversarial attacks. Two neural networks compete with each other: one acts as the generator, while the other behaves as the discriminator. The discriminator tries to distinguish real samples from those created by the generator.

Zeroth-order optimization attack (ZOO): The ZOO technique estimates the gradient of a classifier without access to its internals, using only queries to the model, which makes it well suited to black-box attacks.
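
A minimal sketch of the zeroth-order idea, assuming only a scalar-valued loss that can be queried; the coordinate sampling and step size h are illustrative choices, not the exact ZOO procedure.

```python
import numpy as np

def zeroth_order_gradient(loss_fn, x, h=1e-4, n_dims=None):
    """Coordinate-wise zeroth-order gradient estimation: approximate d loss / d x_i
    with symmetric finite differences using only function (query) evaluations,
    no backpropagation. `loss_fn` maps a flat input vector to a scalar."""
    x = x.astype(np.float64).ravel()
    grad = np.zeros_like(x)
    dims = range(len(x)) if n_dims is None else np.random.choice(len(x), n_dims, replace=False)
    for i in dims:                      # two queries per estimated coordinate
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (loss_fn(x + e) - loss_fn(x - e)) / (2 * h)
    return grad
```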

Defense methods against adversarial attacks

Adversarial examples are not dependent on the model but on the dataset itself. Since generating custom datasets is a costly and time-consuming process, well-known datasets are used for almost all applications. This creates a major security flaw even if a company’s production model is hidden. To tackle this issue, several defenses have evolved in the arms race between attackers and defenders.

There are general recommendations for ensuring security:

1. Use of several classifier systems

2. Confidentiality training

3. Cleaning the training set of poisoned patterns

In addition to the basic recommendations, there are more specific defense methods that increase resistance to various adversarial attacks. One of the effective defense methods against FGSM attacks is adversarial training. The idea is to train the model not only on examples from the original dataset but also on adversarial examples generated during the learning process, so that the network is prepared for them. However, this method requires generated adversarial examples in addition to the original training examples, so the training set size increases, and with it the training time. Adversarial training does not guarantee full resistance to adversarial attacks, but it reduces the number of errors.
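
A rough sketch of one such training step, reusing the fgsm_attack sketch from earlier; the mixing strategy and ε are assumptions for illustration, not a specific published recipe.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.03):
    """One training step on a mix of clean and FGSM-perturbed examples."""
    model.train()
    x_adv = fgsm_attack(model, x, y, epsilon)   # adversarial copies of the batch
    batch_x = torch.cat([x, x_adv], dim=0)      # the effective training set doubles
    batch_y = torch.cat([y, y], dim=0)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(batch_x), batch_y)
    loss.backward()
    optimizer.step()
    return loss.item()
```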

Adversarial training belongs to the family of robustness-based defenses, which aim to classify adversarial examples correctly. Data scientists also often use detection-based defenses, which check whether an image is clean or perturbed before passing it to the model: if the image is flagged as perturbed, an error is raised; otherwise, the input is passed to the model.
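
The routing logic of a detection-based defense can be sketched as follows; the detector function and the threshold are hypothetical placeholders.

```python
class PerturbationDetectedError(Exception):
    """Raised when an input is flagged as adversarial before reaching the classifier."""

def guarded_predict(detector, model, x, threshold=0.5):
    """Detection-based routing: `detector` is a hypothetical function scoring how
    likely the input is perturbed (0 = clean, 1 = adversarial); flagged inputs
    never reach the classifier."""
    if detector(x) > threshold:
        raise PerturbationDetectedError("input rejected as likely adversarial")
    return model(x)
```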

Conclusion

With machine learning rapidly becoming core to organizations’ value propositions, the need to protect these models is growing fast. Google, Microsoft, Amazon, and IBM have started to invest in securing machine learning systems, and governments have started to implement security standards for them. One of the top priorities for any organization should be hiring people who understand deep learning and its security risks, so that this knowledge can trickle down through the organization to the developers and create lasting change.

References

1. https://cje.ejournal.org.cn/article/doi/10.1049/cje.2021.00.126

2. https://ieeexplore.ieee.org/document/8014906

3. https://infoscience.epfl.ch/record/295716

4. https://www.ti.rwth-aachen.de/publications/output.php?id=30&table=inbook&type=pdf

5. https://iopscience.iop.org/article/10.1088/1742-6596/1797/1/012005

6. https://scholarworks.sjsu.edu/cgi/viewcontent.cgi?article=1742&context=etd_projects

7. Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. “Explaining and harnessing adversarial examples.” ICLR 2015.

8. https://openaccess.thecvf.com/content/CVPR2021/papers/Wenger_Backdoor_Attacks_Against_Deep_Learning_Systems_in_the_Physical_World_CVPR_2021_paper.pdf
