How to fool Neural Networks

Zheng Jie
5 min read · Apr 10, 2023


“Adversarial attacks” may sound like a cool, sci-fi, I’m-doing-crucial-work technical buzzword, but the idea itself is very much understandable and far more layman-friendly than it seems. “Attacking” a neural network is a recently emerging way of “red-teaming” one’s own systems. The practice originates from cybersecurity, where it facilitates the discovery of system vulnerabilities and is used both by malicious black-hat hackers and by companies trying to audit and improve their cyberdefences and personnel skills.

Admittedly, the topic of adversarial attacks has been covered extensively, but with recent developments in neural network architectures and frameworks, I believe it is time to dust off the covers and revisit this interesting subject of “playing the devil’s advocate” with arguably the greatest innovation in Machine Learning of the 21st century.

Inception

December 2013 is an oft-quoted date with regards to adversarial attacks: it was when the notion of “adversarial attacks”, or “perturbations” to model input, was first conceived. Intriguing properties of neural networks (Szegedy et al., 2013) analysed several DNNs (Deep Neural Networks) of the time (AlexNet, a classifier built on an autoencoder, fully-connected networks, etc.) in the context of image classification.

They found weaknesses in the robustness of these neural networks and discovered that a learned perturbation could cause a DNN to misclassify an image. These perturbations were nearly invisible to the naked eye, and the attack generalised well across different models and training sets. The attack works by minimising the magnitude of the perturbation, ||r||, together with the classification loss it incurs, L(x + r), when applied to the image, whilst constraining all pixel values of the perturbed image to [0, 1]. Since the exact problem is difficult to compute precisely, box-constrained L-BFGS (the Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm) was used to obtain a reliable approximation.
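
For a concrete feel of this optimisation, here is a minimal PyTorch sketch (not the paper’s exact implementation, which also line-searches over the trade-off weight c); the classifier `model`, the input image `x` and the target label `target` are assumed placeholders:

```python
import torch
import torch.nn.functional as F

def lbfgs_perturbation(model, x, target, c=0.1, steps=100):
    """Find a small perturbation r so that model(x + r) is classified as `target`.

    Minimises c * ||r|| + loss(x + r, target), clamping pixels to [0, 1].
    """
    r = torch.zeros_like(x, requires_grad=True)
    optimizer = torch.optim.LBFGS([r], max_iter=steps)

    def closure():
        optimizer.zero_grad()
        x_adv = torch.clamp(x + r, 0.0, 1.0)  # keep pixel values in [0, 1]
        loss = c * r.norm() + F.cross_entropy(model(x_adv), target)
        loss.backward()
        return loss

    optimizer.step(closure)
    return torch.clamp(x + r, 0.0, 1.0).detach()
```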

The only question is why attack the dog :( From here

Generative Adversarial Networks

Just half a year or so later, in June 2014, Generative Adversarial Nets (Goodfellow et al., 2014) proposed the idea of a “competition system” between two neural networks, where one optimises on producing convincing fake data and the other on identifying which data is real and which is generated. This was arguably the first paper (or one of the first few) to spawn the long line of GANs that continues today.

I GANnot do this anymore.

The GAN consists of two components, a Generator and a Discriminator, which play the roles of attacker and defender. The Generator focuses on generating fake images from random noise, whilst the Discriminator classifies images as real or generated, producing a loss that indicates how well the Generator performed.

This process is repeated over many iterations, and the Generator learns to produce ever more convincing fakes using the loss from the Discriminator. Finally, the trained Generator alone is responsible for generating the outputs used in the final evaluation.
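
For intuition, here is a minimal PyTorch sketch of this training loop, assuming toy fully-connected networks, flattened 28×28 images scaled to [-1, 1] and a 64-dimensional noise vector (real architectures are, of course, much deeper):

```python
import torch
import torch.nn as nn

# Toy Generator and Discriminator (placeholders for real architectures)
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                      # `real`: a batch of flattened images
    z = torch.randn(real.size(0), 64)      # random noise fed to the Generator
    fake = G(z)

    # Discriminator step: label real images 1 and generated images 0
    opt_D.zero_grad()
    d_loss = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(real.size(0), 1))
    d_loss.backward()
    opt_D.step()

    # Generator step: try to make the Discriminator output 1 on fakes
    opt_G.zero_grad()
    g_loss = bce(D(fake), torch.ones(real.size(0), 1))
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```

Each call to train_step pits the two networks against each other once; over many thousands of steps, the Generator’s fakes become increasingly hard to tell apart from the real data.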

Google Developers does a great job at explaining GANs here.

FGSM

Expanding on the earlier work and building on image perturbation techniques, Explaining and Harnessing Adversarial Examples (Goodfellow et al., 2014) found another way to produce adversarial perturbations on images. The Fast Gradient Sign Method (FGSM) produces similar noise-like perturbations which, when applied to an image, also achieve decent success in fooling DNNs.

The iconic FGSM image from TensorFlow

Shifting the focus away from directly optimising the perturbation, FGSM uses the gradients of the targeted DNN to compute it. In contrast to gradient descent, FGSM performs a step of “gradient ascent”: it shifts each pixel in the direction of the sign of the cost function’s gradient with respect to the input, which increases rather than decreases the cost. The resulting perturbation (seen above) is simply scaled by a small constant ε instead of having its magnitude explicitly optimised, making the method much faster to compute since there are fewer optimisation constraints.
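
The whole method fits in a few lines. Here is a minimal PyTorch sketch, assuming a classifier `model`, an image batch `x` with pixels in [0, 1] and its true labels `y`:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.007):
    """One-step FGSM: nudge every pixel by `epsilon` in the direction that
    increases the loss, then clamp back to the valid [0, 1] pixel range."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()    # one step of "gradient ascent"
    return torch.clamp(x_adv, 0.0, 1.0).detach()
```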

Adversarial Patches

Just when I thought image perturbation was thoroughly explored, Adversarial Patch (Brown et al., 2017) brought attacks to a new level.

Banana? More like toaster.

A circular or square patch is trained on a separate “white-box” DNN, one the adversary has full access to, so that it is classified as a chosen target class (a targeted attack towards the “toaster” class is seen above). This patch is then “pasted” onto the input and fed through to the DNN under attack (a “grey-box” or “black-box” model the adversary has limited or no access to). It is worth noting that this system of training against one model to fool another separate model resembles the adversarial dynamic of a GAN.

This was truly cool and good. Not only could attacks now be localised to a smaller, confined area, they were also now practical to carry out in real life. An attacker could simply print out the patch and paste it onto an object, allowing for 2D perturbation attacks on 3D objects.
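
As a rough sketch of how such a patch might be trained, assuming a white-box classifier `model`, a batch of training `images` with pixels in [0, 1] and `target` holding the desired class index for each image (the paper additionally averages over random rotations and scales, simplified here to random placement only):

```python
import torch
import torch.nn.functional as F

def apply_patch(images, patch, x0, y0):
    """Paste a square patch onto every image at location (x0, y0)."""
    out = images.clone()
    ph, pw = patch.shape[-2:]
    out[..., y0:y0 + ph, x0:x0 + pw] = patch
    return out

def train_patch(model, images, target, size=50, steps=200, lr=0.05):
    """Optimise a patch so the white-box `model` predicts `target`
    wherever the patch is pasted onto the images."""
    patch = torch.rand(1, 3, size, size, requires_grad=True)
    optimizer = torch.optim.Adam([patch], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        x0 = torch.randint(0, images.shape[-1] - size, (1,)).item()
        y0 = torch.randint(0, images.shape[-2] - size, (1,)).item()
        patched = apply_patch(images, torch.clamp(patch, 0, 1), x0, y0)
        loss = F.cross_entropy(model(patched), target)  # push towards target class
        loss.backward()
        optimizer.step()
    return torch.clamp(patch, 0, 1).detach()
```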

Contextual Awareness

Contextual priming, proposed in Contextual Priming for Object Detection (Torralba, 2003), gives DNNs a way to fight back against adversarial attacks. The robustness of a DNN can be improved using context: the surrounding pixels, up to a certain range, are considered by the DNN before it makes a prediction. This lets the DNN weed out adversarial distortions and inconsistent detections, reducing the effectiveness of the attack.

There are many methods of implementation, but I find Contextual Object Detection with a Few Relevant Neighbours (Barnea et al., 2017) to be an interesting read as well.

Not only can this help defend against attacks, it can also strengthen the DNN’s accuracy at object detection, and further context refinements, like those in Context Refinement in Object Detection (Chen et al., 2018), are possible too.

3-Dimensional Attacks

With better renderers come better attacks. A particularly striking paper, and the one which introduced me to GANs, was FCA: Learning a 3D Full-coverage Vehicle Camouflage for Multi-view Physical Adversarial Attack (Wang et al., 2021).

Here, adversarial attacks were raised to the third dimension by perturbing the texture covering a car’s entire body. The car was then rendered against different backgrounds and from different viewpoints, successfully causing misclassification and, in some cases, avoiding detection altogether.

Definitely something huge. This could potentially change the game for camouflaging 3D objects. Where 2D attacks lack transferability and permanence, this 3D form of attack is universal and can be applied to the same object in many different scenarios.
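
Conceptually, the optimisation looks something like the sketch below. Note that `render`, `detector`, `viewpoints` and `backgrounds` are hypothetical placeholders (FCA itself uses a differentiable neural renderer and an object detector’s loss), so treat this as an outline rather than the paper’s actual pipeline:

```python
import torch

def train_camouflage(detector, render, viewpoints, backgrounds, steps=500, lr=0.01):
    """Optimise a full-coverage texture so the rendered car fools `detector`
    across many viewpoints and backgrounds (all placeholders here)."""
    texture = torch.rand(3, 1024, 1024, requires_grad=True)     # UV texture map
    optimizer = torch.optim.Adam([texture], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        vp = viewpoints[torch.randint(len(viewpoints), (1,)).item()]
        bg = backgrounds[torch.randint(len(backgrounds), (1,)).item()]
        image = render(torch.clamp(texture, 0, 1), vp, bg)      # differentiable rendering
        score = detector(image)        # scalar confidence that the car is detected
        score.backward()
        optimizer.step()               # gradient descent drives the detection score down
    return torch.clamp(texture, 0, 1).detach()
```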

What comes next?

The introduction of 3D machine learning and rendering certainly opens up many use cases and plenty of room for development. Recent developments in attack methodology allow attacks to potentially evade even context-aware DNNs, which, when coupled with the flexibility of 3D attacks, could result in total stealth and objects going undetected, especially in surveillance and the military.

However, with stronger adversarial attacks come better DNNs. The evolution of DNNs has very much been driven by the vulnerabilities of previous models. It will certainly become harder and harder to improve models as we approach the theoretical ceiling for model power and accuracy (if there even is one), but the feedback loop between adversarial attacks and model improvement is crucial if we are ever to push models to their limits and beyond.
