Adversarial Attacks Explained (And How to Defend ML Models Against Them)

6 min read · Sep 7, 2022

Simply put, an adversarial attack is a technique for “fooling” a machine learning model with deliberately crafted input. The goal of adversarial machine learning is to make an ML model malfunction (think of a self-driving car that reads a stop sign as a speed limit sign, or a Tesla on autopilot moving in the opposite direction from the one intended).

Adversarial attacks become possible because of inaccurate or misrepresentative data used during training, or because maliciously designed data is fed to an already trained model. Let us delve into the nuts and bolts step by step.

Why are adversarial attacks dangerous?

While the current wave of ML adoption is relatively new (less than ten years old), the field is developing tremendously and gaining wide popularity across many industries. According to McKinsey research, non-internet sectors such as agriculture, education, logistics, manufacturing, and energy are expected to bring up to $13 trillion of GDP growth by 2030. But adversarial attacks might cause severe problems across all these sectors.

For example, research shows how adversarial attacks on medical machine learning can make ML algorithms classify benign moles as malignant. Consider the impact of such malicious actions at scale in any other business vertical.

When did adversarial attacks enter the picture?

In 2014, there were hardly any papers on adversarial attacks on preprint servers. At the moment of writing (August 2021), there are around 1000 research papers on adversarial attacks and adversarial examples. It looks like the next arms race as AI adoption rises globally. One of the first studies, “Intriguing properties of neural networks” by researchers from Google and New York University, dates back to 2013 and shed some light on the essence of adversarial attacks.

Thus, an adversarial attack works like an optical illusion for the ML model: the model misperceives an object because of changes that are not visible to the naked eye. Check out the following example:

What are the types of adversarial attacks?

Depending on their influence on the classifier (ML algorithm), the security violation, and their specificity, adversarial attacks can be subcategorized as “white-box” or “black-box” attacks. In a white-box attack, the attacker has access to the model’s parameters; in a black-box attack, the attacker has no such access and can only observe the model’s outputs.
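To make the black-box setting concrete, here is a minimal sketch in plain Python. It assumes a toy logistic model whose weights are hidden inside a closure, so the attacker can only call a hypothetical `score` function. The attacker estimates the gradient from queries with finite differences and then nudges every feature by at most T. The model, its weights, and all function names are illustrative assumptions, not something taken from a specific paper or library.

```python
import math

def make_black_box():
    """Return a query-only scoring function; the weights stay hidden inside."""
    w, b = [2.0, -3.0, 1.0], 0.1  # hypothetical parameters, unseen by the attacker
    def score(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))
    return score

def estimate_gradient(score, x, eps=1e-4):
    """Approximate d(score)/dx_i with central finite differences (queries only)."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        hi[i] += eps
        lo = list(x)
        lo[i] -= eps
        grad.append((score(hi) - score(lo)) / (2 * eps))
    return grad

def black_box_attack(score, x, T):
    """Shift each feature by at most T to push the score across 0.5."""
    label = 1 if score(x) >= 0.5 else 0
    grad = estimate_gradient(score, x)
    direction = -1.0 if label == 1 else 1.0
    return [xi + direction * T * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

score = make_black_box()
x = [0.5, 0.2, 0.1]
x_adv = black_box_attack(score, x, T=0.15)
print("clean score:", round(score(x), 3))
print("adversarial score:", round(score(x_adv), 3))
```

A white-box attacker would skip the estimation step and read the gradient directly from the known parameters; the rest of the attack stays the same.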

What is under the hood of an adversarial attack?

In general, adversarial attacks share the same idea. They use (sometimes approximated) knowledge about the model’s internal state to modify the input (for images, the pixels) in the way that maximizes the chance of error. In other words, a small perturbation changes the class label.

Mathematically, it looks like the following:

$$f(x) = y, \qquad f(x + d) \neq y$$

The model f takes the input x and produces the prediction y. The adversarial perturbation d, added to x, leads to a prediction that is not equal to the prediction of the model f for the input x.

$$L(d) \le T$$

L is a generic function that measures the norm of d, and T stands for the upper bound of this norm.
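The definition above can be sketched in a few lines of plain Python. The example assumes a white-box attacker and a toy logistic model with hypothetical weights `w` and bias `b`. The perturbation d is built with a sign-of-gradient step in the style of the fast gradient sign method (FGSM), a technique this article does not name elsewhere, with the L∞ norm of d bounded by T.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic model f(x) = sigmoid(w.x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return 1 if predict_proba(w, b, x) >= 0.5 else 0

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_perturbation(w, b, x, y, T):
    """Sign-of-gradient perturbation d with max |d_i| <= T.

    For the logistic loss, d(loss)/dx_i = (p - y) * w_i, so each feature
    is moved by T in the direction that increases the loss.
    """
    p = predict_proba(w, b, x)
    return [T * sign((p - y) * wi) for wi in w]

# Hypothetical toy model and input (white-box: the attacker knows w and b).
w = [2.0, -3.0, 1.0]
b = 0.1
x = [0.5, 0.2, 0.1]
y = predict(w, b, x)  # label the model assigns to the clean input

d = fgsm_perturbation(w, b, x, y, T=0.15)
x_adv = [xi + di for xi, di in zip(x, d)]

print("f(x)     =", y)
print("f(x + d) =", predict(w, b, x_adv))
print("L(d)     =", max(abs(di) for di in d))
```

Here f(x) and f(x + d) differ even though no feature moved by more than T = 0.15, which is exactly the pair of conditions stated above.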

With that in mind, there is a whole set of algorithms that can generate such perturbations:

Credits to: Malhar, Towards Data Science

Let us define what stands behind the different types of perturbations. L is the perturbation bound that measures the size of the perturbation d; usually, an Lp norm is used:

L0 norm: bounds the number of input features that are modified. In practice, only a tiny piece of the information is changed, but that can be enough to deceive the overall system. Take a real-world example: a DNN classifier misperceives a small modified patch on a STOP sign and issues the command to keep going instead of stopping.

L1 norm: bounds the total sum of the absolute perturbation values. In practice, you will not encounter this type of attack often. To quote Pin-Yu Chen et al.: “However, despite the fact that L1 distortion accounts for the total variation and encourages sparsity in the perturbation, little has been developed for crafting L1-based adversarial examples.”

L2 norm: bounds the Euclidean distance (Pythagorean distance) of the perturbation d. In other words, it is based on the squared differences between the images X and Z: take the difference between X and Z at each pixel, square it, sum over all pixels, and take the square root.

Examples of this type include the Carlini and Wagner attack, which is regarded as one of the most effective white-box attacks in the research literature.

L∞ norm: bounds the maximum absolute value of the perturbation d across all features. These attacks are the most represented in the research literature, given their suitability for robust optimization and their mathematical convenience.
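The four perturbation bounds above can be compared on one small example. The sketch below uses plain Python and two hypothetical four-pixel “images” X and Z with integer intensities; the function names are illustrative.

```python
import math

def l0_norm(d):
    """Number of features that were changed at all."""
    return sum(1 for v in d if v != 0)

def l1_norm(d):
    """Total absolute change summed over all features."""
    return sum(abs(v) for v in d)

def l2_norm(d):
    """Euclidean length of the perturbation."""
    return math.sqrt(sum(v * v for v in d))

def linf_norm(d):
    """Largest absolute change to any single feature."""
    return max(abs(v) for v in d)

X = [10, 200, 130, 50]  # hypothetical clean image
Z = [10, 200, 127, 54]  # hypothetical perturbed image
d = [z - x for x, z in zip(X, Z)]  # [0, 0, -3, 4]

print(l0_norm(d))    # 2
print(l1_norm(d))    # 7
print(l2_norm(d))    # 5.0
print(linf_norm(d))  # 4
```

The same perturbation is “small” or “large” depending on the norm: it touches only two pixels (L0), but one of them changes by 4 intensity levels (L∞).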

Based on the type of attack (white-box or black-box) and the perturbation bound, adversarial attacks can be categorized further. Check out this table of attack types compiled by Malhar:

How to defend against adversarial attacks?

Standard techniques for making ML models more robust (e.g., dropout or weight decay) do not provide an adequate defense against adversarial attacks. However, several dedicated methods have been developed so far.

Adversarial training means that engineers use adversarial examples to retrain the model and make it more robust against perturbations. However, adversarial training is a slow and expensive method: every training example has to be probed for adversarial weaknesses, and the ML model has to be retrained using all of these examples.
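The loop described above can be sketched for a toy logistic model in plain Python. The dataset, the learning rate, and the sign-of-gradient attack used to craft the adversarial examples are all assumptions made for illustration; real adversarial training works on deep networks and is far more costly, which is the point of the caveat above.

```python
import math

def predict_proba(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def craft_adversarial(w, b, x, y, T):
    """Sign-of-gradient perturbation of x with L-infinity size T."""
    p = predict_proba(w, b, x)
    return [xi + T * sign((p - y) * wi) for xi, wi in zip(x, w)]

def adversarial_train(data, T, lr=0.1, epochs=50):
    """SGD on the logistic loss over each clean example and its adversarial twin."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            # The adversarial twin is re-crafted against the current weights.
            for sample in (x, craft_adversarial(w, b, x, y, T)):
                err = predict_proba(w, b, sample) - y
                w = [wi - lr * err * si for wi, si in zip(w, sample)]
                b -= lr * err
    return w, b

def accuracy(w, b, data, T=0.0):
    """Accuracy on inputs perturbed with budget T (T=0 means clean inputs)."""
    hits = 0
    for x, y in data:
        sample = craft_adversarial(w, b, x, y, T) if T > 0 else x
        hits += int((predict_proba(w, b, sample) >= 0.5) == bool(y))
    return hits / len(data)

# Hypothetical, linearly separable toy dataset.
data = [([2.0, 2.0], 1), ([3.0, 2.0], 1), ([2.0, 3.0], 1), ([3.0, 3.0], 1),
        ([-2.0, -2.0], 0), ([-3.0, -2.0], 0), ([-2.0, -3.0], 0), ([-3.0, -3.0], 0)]
w, b = adversarial_train(data, T=0.5)
print("clean accuracy:", accuracy(w, b, data))
print("accuracy under attack:", accuracy(w, b, data, T=0.5))
```

Note that every epoch crafts a fresh adversarial example for every training point, which doubles the number of updates even in this tiny setting.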

A related recent development is CNN-Cert, a framework for certifying the robustness of convolutional neural networks, which aims to find the “resistance threshold against perturbations.” It is beneficial for industries using speech and face recognition, self-driving vehicles, and medical imaging, considering the cost of potential mistakes.
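CNN-Cert itself targets convolutional networks and is considerably more involved. As a loose illustration of what a “resistance threshold” means, the sketch below uses a linear classifier, for which the threshold has a closed form: no L∞ perturbation smaller than |w·x + b| divided by the sum of |w_i| can change the predicted class. The weights and the input are hypothetical.

```python
def linf_robustness_radius(w, b, x):
    """Smallest L-infinity perturbation size that can flip sign(w.x + b)."""
    margin = abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return margin / sum(abs(wi) for wi in w)

# Hypothetical linear classifier and input.
w = [2.0, -3.0, 1.0]
b = 0.1
x = [0.5, 0.2, 0.1]

radius = linf_robustness_radius(w, b, x)
print("certified L-infinity radius:", round(radius, 6))
```

Any attack with a perturbation bound T below this radius is guaranteed to fail on this input, while a larger T may succeed.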

Other methods include running several networks in parallel and switching between them randomly, which makes the model more stable against adversarial attacks. Building a generalized neural network from other networks is also an option. Unfortunately, generalization is one of the most frustrating tasks for deep learning practitioners.
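The random-switching idea can be sketched as a thin wrapper that answers each query with one model picked at random from a pool. The class name and the three stand-in “networks” below are hypothetical; in practice they would be independently trained models.

```python
import random

class RandomSwitchEnsemble:
    """Answer each query with one model picked at random from a pool."""

    def __init__(self, models, seed=None):
        self.models = list(models)
        self.rng = random.Random(seed)

    def predict(self, x):
        model = self.rng.choice(self.models)
        return model(x)

# Hypothetical stand-ins for independently trained networks.
def model_a(x):
    return int(2.0 * x[0] - 1.0 * x[1] > 0)

def model_b(x):
    return int(1.5 * x[0] - 1.2 * x[1] + 0.1 > 0)

def model_c(x):
    return int(2.2 * x[0] - 0.8 * x[1] - 0.1 > 0)

ensemble = RandomSwitchEnsemble([model_a, model_b, model_c], seed=0)
print(ensemble.predict([1.0, 0.5]))
print(ensemble.predict([0.0, 1.0]))
```

On ordinary inputs the models agree, so the answer does not depend on which one is picked. An attacker, however, cannot tell which decision boundary the next query will face, so a perturbation tuned against one model may not transfer to the one that actually answers.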

Defensive distillation was considered one of the promising methods until recently. To quote Ian Goodfellow and his colleagues, you can “train a secondary model whose surface is smoothed in the directions an attacker will typically try to exploit, making it difficult for them to discover adversarial input tweaks that lead to incorrect categorization.” However, research by the University of California, Berkeley showed that attacks could defeat defensive distillation.
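The smoothing behind distillation comes from a softmax with a temperature: the first model’s outputs are softened and then used as training targets for the secondary model. The sketch below shows only that ingredient in plain Python; the logits and the temperature of 20 are hypothetical values, and the full training procedure is omitted.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a flatter output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(targets, probs):
    """Loss the secondary model would minimize against the soft targets."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

teacher_logits = [8.0, 2.0, 0.0]              # hypothetical first-model outputs
hard = softmax(teacher_logits)                # nearly one-hot
soft = softmax(teacher_logits, temperature=20.0)

student_logits = [1.0, 1.0, 1.0]              # hypothetical untrained secondary model
student = softmax(student_logits, temperature=20.0)

print("T=1 :", [round(p, 3) for p in hard])
print("T=20:", [round(p, 3) for p in soft])
print("student loss:", round(cross_entropy(soft, student), 3))
```

Training on the softened targets is what smooths the secondary model’s surface, although, as noted above, stronger attacks were later shown to get around this defense.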

Bottom line

ML models are susceptible to adversarial attacks, which look like the new arms race in AI. Depending on the access to the model’s parameters and on the perturbation bound, there are different types of attacks, but they share the same logic: adversarial attacks aim to change the class label, using (often approximated) knowledge about the model’s internal state.

ML practitioners cope with attacks by using adversarial training, combining parallel networks and switching between them randomly, or generalizing neural networks from other networks.



