Fooling CNNs via Adversarial Examples

The fault in our “stars”

Tech@Attentive
Attentive AI Tech Blog
7 min read · May 9, 2019


By: Sarthak Vijay

There is no denying that our use of and dependence on CNNs is on the rise. In a world where CNNs power applications ranging from detecting inappropriate content on social media sites to interpreting the surrounding environment for the navigation of autonomous vehicles (AVs), the robustness of these networks comes into question. Imagine “driving” an AV and having the car detect a STOP sign as “Speed Limit 45 mph”. Let's discuss why this is a very realistic possibility, given the way these models are designed today.

Our inability to fully understand and control the fine details of CNNs is highlighted by something called adversarial machine learning. First reported in 2014 by a group of researchers from Google, adversarial attacks seek to manipulate a model into returning incorrect results through changes to the input that a human would barely notice. Although the work started with images, these attacks have been shown to affect NLP models as well (e.g. crafting the “right” messages so that an automatic email responder spits out sensitive information like credit card numbers). We'll focus on images for now.

Figure 1: Adversarial example fooling a CNN into detecting a pig as an airliner
Figure 2: Adversarial patches making the CNN ignore all other features
Figure 3: Treacherous behavior of CNNs towards adversarial images (memefied)
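
To make “changes a human would barely notice” concrete, here is a minimal sketch of the fast gradient sign method (FGSM), one of the earliest attacks of this kind: nudge every pixel a tiny step in the direction that increases the model's loss. The tiny CNN and the random “image” below are toy stand-ins, not the models behind the figures above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in classifier; any differentiable image model works the same way.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10)
)
model.eval()

def fgsm(image, true_label, eps=0.01):
    """Return an adversarial copy of `image` (values in [0, 1])."""
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()
    # Step each pixel by +/- eps in the direction that increases the loss.
    adv = image + eps * image.grad.sign()
    return adv.clamp(0, 1).detach()

x = torch.rand(1, 3, 32, 32)      # placeholder "image"
y = torch.tensor([3])             # placeholder true label
x_adv = fgsm(x, y)
print((x_adv - x).abs().max())    # the perturbation is imperceptibly small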

Why do they work?

I wish there was a conclusive answer. This is a much-debated question in data science right now. The prime suspect, however, is the “curse of dimensionality”. As we start dealing with higher-dimensional data spaces, the data we use for training a model constitutes a smaller and smaller portion of the available “volume”. The data we generally deal with lies on a manifold (a lower-dimensional space within a higher-dimensional one). Any data point off the manifold is “unseen” by the model, and most adversarial attacks use this to their advantage. These examples seek to nudge the input across the decision boundary, which for these neural nets is generally very close to the manifold (which, according to many researchers, is the core problem).

Figure 4: Example of a 2D manifold in 3-dimensional space
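
A rough numerical illustration of why dimensionality matters (a sketch of the intuition, not a proof): for a linear score w·x, perturbing each of the d input dimensions by just ±ε shifts the score by ε·‖w‖₁, which grows with d. In image space d is in the hundreds of thousands, so many imperceptible per-pixel changes can add up to a large change at the output.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01  # tiny per-pixel change, invisible to a human

for d in (10, 1_000, 150_528):       # the last one ~ a 224x224x3 image
    w = rng.normal(size=d)           # weights of a linear score w.x
    x = rng.normal(size=d)           # an input point
    delta = eps * np.sign(w)         # worst-case +/- eps per dimension
    shift = w @ (x + delta) - w @ x  # resulting change in the score
    print(f"d={d:>7}: score shifts by {shift:.2f} (= eps * ||w||_1)")
```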

The ReLU activation function also works to the attacker's advantage. With no upper cap on the function, it is feasible to push activations to arbitrarily high values. Ironically, this non-saturating behavior (introduced in response to the vanishing gradient problem) is the core reason most models developed today use ReLU activations.
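
A quick toy comparison of the two behaviors (not tied to any particular model): a saturating activation like the sigmoid caps how far an input can push a unit, while ReLU passes arbitrarily large positive values straight through.

```python
import numpy as np

def relu(z):    return np.maximum(0.0, z)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

for z in (1.0, 10.0, 1000.0):
    print(f"input {z:>7}: relu={relu(z):>7.1f}  sigmoid={sigmoid(z):.4f}")
# ReLU output grows without bound; sigmoid saturates just below 1.
```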

Can’t I just train the models against adversarial examples for protection against them?

Yes and no. Training on adversarial examples gives better results against the specific attacks the model was trained on, but it tends not to generalize well. This makes the approach close to hopeless: you end up altering the model against an ever-increasing variety of attacks.
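
For reference, here is a minimal sketch of what adversarial training looks like: each batch is perturbed on the fly (FGSM-style) and the model is trained on the perturbed copies. The model, data and hyperparameters below are placeholders; the point of the paragraph above is that robustness gained this way tends to stay tied to the attack used during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy model and random data as stand-ins for a real classifier and dataset.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10)
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
eps = 0.01

for step in range(100):
    x = torch.rand(16, 3, 32, 32)      # placeholder batch of images
    y = torch.randint(0, 10, (16,))    # placeholder labels

    # Craft FGSM perturbations for this batch on the fly ...
    x_req = x.clone().requires_grad_(True)
    F.cross_entropy(model(x_req), y).backward()
    x_adv = (x + eps * x_req.grad.sign()).clamp(0, 1)

    # ... and train on the perturbed copies (adversarial training).
    opt.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    opt.step()
```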

Fine. I’ll just keep my model parameters and architecture confidential to avoid their manipulation…

Many “black box” attacks have been developed that don't need the model specifications to start with, in contrast to “white box” attacks that do. First off, if your model's output is visible (not even the label probabilities, just the most probable label), it is possible to train a substitute model and use it to create adversarial examples. Secondly, adversarial examples are observed to have transferability, which also answers another question…
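
A sketch of that substitute-model idea, with toy placeholder models and random data standing in for a real victim and a real dataset: query the target for its top-1 labels, train a local copy on those labels, craft adversarial examples against the local copy with full white-box access, then rely on transferability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# The victim: its weights and architecture are unknown to the attacker,
# who can only observe the predicted label for a queried input.
_victim = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
def query_label(x):
    with torch.no_grad():
        return _victim(x).argmax(dim=1)

# 1) The attacker collects (input, observed label) pairs by querying the victim
#    and trains a local substitute on them.
substitute = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.Adam(substitute.parameters(), lr=1e-3)
for step in range(200):
    x = torch.rand(32, 3, 32, 32)
    y = query_label(x)                 # only top-1 labels are needed
    opt.zero_grad()
    F.cross_entropy(substitute(x), y).backward()
    opt.step()

# 2) Craft an adversarial example on the substitute (white-box FGSM) ...
x = torch.rand(1, 3, 32, 32).requires_grad_(True)
F.cross_entropy(substitute(x), query_label(x)).backward()
x_adv = (x + 0.05 * x.grad.sign()).clamp(0, 1).detach()

# 3) ... and hope it transfers to the victim.
print("victim label before:", query_label(x.detach()).item(),
      "after:", query_label(x_adv).item())
```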

Why can’t I just have an ensemble of different models?

This was one of my first thoughts when I heard of adversarial examples fooling ML models. Surely, I thought, the examples must be exploiting flaws specific and unique to one model; a single image can't fool different architectures with different parameters. As it turns out, adversarial examples display the property of transferability, i.e. an adversarial example used to fool one model will generally fool any other network trained for the same task. This also reinforces the theory that these examples take advantage of the image-space distribution rather than the specific details of a model.

If adversarial examples are so anomalous for practical image spaces, can’t we detect just that?

We surely hope to. This is by far the most promising approach, but the task itself is not as easy as it sounds. Many approaches (see MagNet, PixelDefend) have been suggested to detect adversarial examples' deviation from the manifold, but all of them have been shown to be penetrable if the defence is known to the adversary.
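
For intuition, here is a very rough sketch in the spirit of reconstruction-based detectors like MagNet (the core idea only, not the published method): train an autoencoder on clean data as a proxy for the manifold, and flag inputs with unusually large reconstruction error as off-manifold.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A tiny autoencoder trained on clean images learns an approximation of the
# data manifold; inputs far from that manifold reconstruct poorly.
ae = nn.Sequential(
    nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(),
    nn.Linear(64, 3 * 32 * 32), nn.Sigmoid(),
)
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
clean = torch.rand(256, 3, 32, 32)            # placeholder "clean" dataset
for _ in range(200):
    opt.zero_grad()
    F.mse_loss(ae(clean).view_as(clean), clean).backward()
    opt.step()

def reconstruction_error(x):
    with torch.no_grad():
        return F.mse_loss(ae(x).view_as(x), x, reduction="none").flatten(1).mean(1)

# Flag anything whose error exceeds a threshold fit on held-out clean data.
threshold = reconstruction_error(clean).quantile(0.95)
suspect = torch.rand(8, 3, 32, 32)
print(reconstruction_error(suspect) > threshold)   # True => likely off-manifold
```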

Uff… at least these problems are restricted to digital images…

Not so fast. Intuitively, one assumes that these attacks won't survive real-life conditions like lighting effects, angle changes, scale changes, etc. The most horrifying thing about a wide variety of these attacks is that they have been shown to fool ML models with very high success rates in different practical setups. Recently developed adversarial patches (Figure 2) work irrespective of the background and under varying physical conditions, producing the attacker's chosen result while covering as little as 10% of the image. These patches can be printed in bulk and distributed worldwide to launch large-scale attacks.
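
To see why patches are so convenient for an attacker, here is a sketch of how one is applied at test time; the patch itself would normally be optimized offline, and the random tensor below just stands in for a trained patch.

```python
import torch

def apply_patch(image, patch, top=0, left=0):
    """Overlay `patch` onto `image` (both C x H x W tensors in [0, 1])."""
    out = image.clone()
    c, ph, pw = patch.shape
    out[:, top:top + ph, left:left + pw] = patch
    return out

image = torch.rand(3, 224, 224)     # placeholder scene
patch = torch.rand(3, 70, 70)       # stand-in for an optimized patch
patched = apply_patch(image, patch, top=10, left=10)

coverage = (70 * 70) / (224 * 224)
print(f"patch covers {coverage:.1%} of the image")  # ~9.8%, yet it can hijack the label
```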

Where do we stand?

To set expectations right, let me state that no model has yet been built that is completely immune to adversarial attacks, even on the “newbie” MNIST digit classification task. Although some defence strategies show high success rates against these attacks, it is now widely accepted that, given enough budget, any architecture can be manipulated into giving misdirected outputs. The question then becomes: how resource-intensive does an attack have to be before we can feel safe putting these models in charge of critical tasks like AV navigation, surveillance, etc.?

Case Study: Autonomous Vehicles

One of the biggest concerns around failing neural nets lies in their use in AVs. It's one thing to use adversarial examples to fool a spam detector or Facebook's tagging feature, and quite another to manipulate an AV's vision models into diverting the car into oncoming traffic. The following experiments have been performed, with results that raise some serious concerns:

  • Tencent’s Keen Security Lab showed how they were able to manipulate a Tesla Model S into switching lanes so that it drives directly into oncoming traffic. All they had to do was place three stickers on the road, forming the appearance of a line indicating the lane is turning.
  • A group of researchers from the University of Washington, the University of Michigan, Stony Brook University, and the University of California, Berkeley published a paper showing that small modifications to road signs, e.g. a STOP sign, cause models (less complex than the ones in AVs) to read it as a “Speed Limit 45” sign.
Figure 5: Adversarial attacks' success rates in different practical setups (ref. https://arxiv.org/pdf/1707.08945.pdf)
  • Another group of researchers, from Princeton University, did a similar study highlighting an even bigger problem. You can employ guidelines and people to regularly maintain the traffic signs, but what if some other sign is disguised so as to be detected as a road sign?
Figure 6: Different types of adversarial road signs and the classification results they produce (ref. https://arxiv.org/pdf/1802.06430.pdf)
Figure 7: Autonomous vehicle misclassifying an adversarial street sign (memefied)

Reacting to the first experiment, which fooled a Tesla Model S, Tesla's spokesperson gave a quite unsatisfying answer: it is “not a realistic concern given that a driver can easily override Autopilot at any time by using the steering wheel or brakes and should always be prepared to do so”.

Way forward…

One strategy to prevent this kind of exploitation would be to employ different sensors (radar, LiDAR) to make sense of the environment and hope that they cover for each other's specific vulnerabilities. Personally, I don't think we can rely on these sensors for completely correct detections (especially since LiDAR and radar sensors carry no information in colour space).

In my view, the best bet for defending against these adversarial attacks is HD Maps. HD Maps serve as a reference for AVs, though their exact use in navigation varies from company to company. These maps contain digitized and geolocated road signs, lanes, street lights, traffic signals, etc., features that are essential for navigation. A discrepancy between the data in these maps and what is detected by the sensors (as in the case of adversarial examples) can be used to alert the driver, who could then take control of the vehicle. This feature may be critical for AVs' successful transition from level 2 automation (the driver must monitor the driving and be prepared to intervene immediately at any time if the automated system fails to respond properly) to level 3 automation (the driver must be prepared to intervene within some limited time when called upon by the vehicle to do so).
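
As an illustration of that cross-checking idea, here is a hypothetical sketch (the data structures and field names are invented for the example and are not Attentive AI's actual pipeline): compare each sign detected by the perception stack against the geolocated signs stored in the HD Map for that location, and alert the driver when they disagree.

```python
from dataclasses import dataclass
from math import hypot

@dataclass
class Sign:
    kind: str      # e.g. "STOP", "SPEED_LIMIT_45"
    x: float       # position in some local map frame (metres)
    y: float

# Hypothetical HD-map record and perception output near the same location.
hd_map_signs = [Sign("STOP", 12.0, 48.5)]
detected_signs = [Sign("SPEED_LIMIT_45", 12.3, 48.1)]   # adversarially misread

def cross_check(detections, map_signs, radius=5.0):
    """Alert if a detected sign disagrees with the HD map at that spot."""
    alerts = []
    for det in detections:
        nearby = [m for m in map_signs if hypot(m.x - det.x, m.y - det.y) < radius]
        if nearby and all(m.kind != det.kind for m in nearby):
            alerts.append(f"detected {det.kind} where the map says {nearby[0].kind}")
    return alerts

for alert in cross_check(detected_signs, hd_map_signs):
    print("ALERT, hand control to the driver:", alert)
```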

We at Attentive AI are dedicated to creating high-quality HD Maps to assist automated vehicles. We understand that regularly updating these HD Maps is necessary for them to remain useful in decision making. With the help of our proprietary algorithms, we ensure that these maps are as accurate as possible and can easily be updated regularly.

In conclusion, adversarial examples have somewhat disrupted the enormous success of machine learning (ML) and are raising concerns about its trustworthiness. But at least we are now aware of this problem before any major damage has been done, and we can focus our energy and resources on tackling it before the credibility of these systems comes under serious threat.

Figure 8: At least there's a silver lining… (memefied)
