How to Fool Artificial Intelligence

An Introduction to Adversarial Machine Learning: “Breaking” AI Algorithms

Jasmine Wang
The Startup
12 min read · Jan 28, 2021

--

Self-driving cars. Voice operated smart homes. Automated medical diagnosis. Chatbots that can replace teachers and therapists.

Thanks to artificial intelligence, these technologies no longer live only in far-fetched worlds of science fiction. And they are just four of the hundreds of industry-disrupting technologies enabled by AI.

Artificial intelligence is rapidly integrating into our everyday lives.

Yet, questions linger before we can hand off big buckets of responsibilities to algorithms that magically do the work for us.

Do they always work? Can we trust them all the time?

Can we trust them with our lives?

What if there was a way for bad actors to intentionally trick algorithms into making the wrong decisions?

For better or for worse, this has actually been extensively studied in a subfield of machine learning called Adversarial Machine Learning. It has bred both techniques that are able to “fool” ML models and those that can defend against these attacks.

Adversarial attacks involve feeding models adversarial examples, which are inputs designed by attackers to purposefully trick models into producing wrong outputs.

Before diving deeper into how adversarial examples are created, let’s look at a few of them.

Adversarial Examples in Action

Let’s play a game of spot the difference:

Photo from TechTalks

If you came up with nothing, don’t worry. Your screen-fatigued eyes have not failed you just yet.

To the human eye, these should look like the exact same picture of a fluffy panda.

But to an image classification algorithm, there is a world of difference between the two. Here’s what GoogLeNet, a winning visual recognition system, saw:

GoogLeNet’s classification results

It made a glaring mistake on the second image. In fact, it was even more confident in misclassifying the panda as a gibbon than it was in its original, correct classification.

What happened!?

As shown above, a filter was applied to the original image. Although the changes were imperceptible to us––we still see the same black and white coat, the blunt snout, and the classic dark circles that tell us we’re looking at a panda––the algorithm saw something completely different.

The panda image, combined with a layer of noise––frequently referred to as perturbations––is an adversarial example that caused the model’s misclassification.

A few more visual adversarial examples:

Like optical illusions for machines, visual adversarial examples can make models “hallucinate” and see something that’s not there.

A stop sign, after being covered in a few pieces of tape, became a speed limit sign in the eyes of self-driving cars.

Imagine the implications of this once self-driving cars hit the road. While the tape could easily be mistaken as careless graffiti to us, a car would disregard the stop sign and charge forward straight into an accident.

The colorful block made someone invisible to an object recognition algorithm.

If incorporated into a t-shirt design, this can effectively become an “invisibility cloak” for automatic surveillance systems.

And an audio example…

A minor change in sound wave amplitudes caused “How are you?” to be heard as “Open the door.”

As you can imagine, the ability to make a drastic change to audio input––one that we can’t even hear ourselves––can have some pretty serious consequences in the future of voice-controlled smart homes.

In conclusion, adversarial examples are very cool….and also very concerning.

To understand what “broke” all of these state-of-the-art algorithms, let’s take a look at how models learn to make decisions in the first place.

Training Machine Learning Algorithms

At the most basic level, machine learning algorithms are made up of artificial neurons. If we think of a neural network as a factory that churns out outputs from inputs, neurons are like the smaller assembly lines––subunits that make up a more comprehensive processing system.

Neurons take in inputs (Xₙ), each of which is multiplied by a weight (Wₙ); the weighted inputs are summed together, added to a bias (b), and fed into an activation function before an output is spit out.

Both weights and biases are considered model parameters, the model’s internal variables that allow it to process input data in a specific way. During training, the model’s accuracy improves by updating its parameters.
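To make this concrete, here is a minimal sketch of a single neuron’s computation in Python, assuming a sigmoid activation and made-up inputs, weights, and bias:

```python
import numpy as np

def neuron(x, w, b):
    """A single artificial neuron: a weighted sum of the inputs plus a bias, passed through an activation."""
    z = np.dot(w, x) + b            # sum of Wn * Xn, plus the bias b
    return 1 / (1 + np.exp(-z))     # sigmoid activation squashes the result into (0, 1)

# Made-up example: three inputs, three weights, one bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.7, -0.2])
b = 0.1
print(neuron(x, w, b))              # the neuron's output
```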


When many layers of neurons are stacked together, with the outputs of one layer becoming the inputs of the next, we get deep neural networks. This allows for more complex calculations and data processing.


Each of these nodes has its own weights and biases, all contributing to the entire model’s set of parameters.
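As an illustrative sketch (the layer sizes here are arbitrary, assuming 784-pixel inputs such as 28×28 images and 10 output classes), this is what stacking layers looks like in PyTorch:

```python
import torch.nn as nn

# A small deep neural network: each Linear layer has its own weights and biases,
# and the outputs of one layer become the inputs of the next.
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),   # layer 1: 784 inputs -> 128 neurons
    nn.Linear(128, 64), nn.ReLU(),    # layer 2: 128 inputs -> 64 neurons
    nn.Linear(64, 10),                # output layer: one score per class
)

# Every weight and bias in these layers is a trainable model parameter.
print(sum(p.numel() for p in model.parameters()))
```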

So, how do we train these model parameters so that they give us the most accurate output?

The answer: the loss function.

Conceptually, the loss function represents how far off the model’s output is from the target output. It basically tells us “how good” our model is. Mathematically, a higher loss value represents a less accurate output, while a lower loss value represents a more accurate output.

Being able to compute a model’s loss from its inputs, actual outputs, expected outputs, and parameters is key to training the model.
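For a tiny, hypothetical illustration, here is how a loss value could be computed with cross-entropy, a common loss function for classification (the numbers are made up):

```python
import torch
import torch.nn.functional as F

# The model's raw scores for three classes, and the true class (index 2).
predicted_scores = torch.tensor([[2.0, 0.5, 0.1]])
true_label = torch.tensor([2])

# A high loss value: the model put most of its confidence on the wrong class.
print(F.cross_entropy(predicted_scores, true_label).item())
```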

Gradient Descent: Finding Optimal Model Parameters

Intuitively, we want to minimize the loss function while updating each parameter value so that the model’s predictions can become more accurate. We do this through gradient descent, a process that lets us find the minimum point of a loss function.

Let’s look at an example. Say our loss function f(w) is a simple quadratic curve that depends on only one parameter: w.

The algorithm first picks a random initial value w₀, then calculates the derivative f′(w₀) of f(w) at that point. Since the derivative (the slope of the curve) is negative, we know that we can decrease f(w) by increasing w₀.

So, the weight used in the model gets nudged to the right, from w₀ to w₁, and our loss decreases from f(w₀) to f(w₁). This simple adjustment is referred to as a learning step. At (w₁, f(w₁)), the derivative is taken again. Since it is again negative, another learning step is taken to the right and down.

Step by step, we reach the minimum of the curve at wₘ, where the derivative is 0. At this point, our loss is the lowest it can be.

And…voilà! We end up with a weight wₘ that has been trained to help our model make more accurate decisions.
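Here is a minimal sketch of this process in Python, using a made-up one-parameter loss f(w) = (w − 3)², whose minimum sits at w = 3:

```python
def f(w):
    return (w - 3) ** 2          # our toy loss function

def f_prime(w):
    return 2 * (w - 3)           # its derivative with respect to w

w = 0.0                          # random initial value w0
learning_rate = 0.1
for step in range(50):
    w = w - learning_rate * f_prime(w)   # one learning step: move against the slope

print(w)                         # ends up very close to 3, where the loss is lowest
```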

This process becomes much more complex when the model’s parameters involve a large set of weights and biases. Instead of a simple quadratic curve, the loss function would be plotted against all of the weights and biases, in a space with more dimensions than we can visualize.

However, the intuition remains the same. We would take the derivative of the loss function with respect to each parameter, and update the set of parameters accordingly in a step-wise manner.

Now that we have a foundational understanding of how machine learning models learn, we can dive into how adversarial examples can be crafted to “break” them.

How are adversarial examples created?

This process also relies on the loss function. Essentially, we are attacking machine learning models in the same way they learn.

Whereas training happens through updating the model parameters while minimizing the loss function, adversarial examples are generated through updating the inputs while maximizing the loss function.

But, wait a second. Wouldn’t that just give us inputs that look extremely different from the original? How did the adversarial examples we saw, like the panda from the beginning of this article, manage to look the same as the original picture?

To answer this question, let’s take a look at the mathematical representation of adversarial examples:
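X_adv = X + ε · sign(∇x J(X, Y))

Here, X_adv denotes the adversarial example that we are constructing.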

J(X, Y) represents the loss function, with X being the input and Y being the output. In the case of image data, X would be a matrix of numerical values representing each pixel.

The ∇x symbol represents the operation of taking the derivative of the loss function with respect to each of its input pixels. Just as before, only the sign (positive or negative) of each derivative matters when we are trying to determine whether to nudge each pixel value up or down, hence the sign() function.

In order to keep the adjustments to pixel values unnoticeable to our eyes, the changes are multiplied by a very small value ε.

Therefore, the entire ε · sign(∇x J(X, Y)) value is our perturbation, a matrix of values representing the change to each image pixel. The perturbation gets added to our original image to create the adversarial image.

Adding a layer of perturbation, also called “noise,” to an image of a goldfish to create an adversarial example.

This is called the Fast Gradient Sign Method (FGSM).
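Here is a minimal PyTorch sketch of FGSM, assuming a trained classifier model, a loss function such as cross-entropy, and image pixels scaled to the range [0, 1]:

```python
import torch

def fgsm_attack(model, loss_fn, x, y, epsilon=0.007):
    """Generate an FGSM adversarial example for input x with true label y."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)              # J(X, Y)
    loss.backward()                              # gradients with respect to the input pixels
    perturbation = epsilon * x_adv.grad.sign()   # epsilon * sign of the input gradient
    return (x_adv + perturbation).clamp(0, 1).detach()   # keep pixel values in a valid range
```

A single gradient computation per image is all it takes, which is what makes the method “fast.”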

One caveat: it assumes that the attacker has full access to the model’s gradients and parameters. In the real world, that is usually not the case.

More often than not, only the model developers have information on an algorithm’s exact parameters. However, there are ways to get around this, thanks to the variety of attack methods out there.

Types of Adversarial Attacks

Attack approaches can be categorized based on three different criteria:

  1. the amount of knowledge the attacker has about the model
  2. the location of the attack within the model development and deployment timeline
  3. the intent, or goal, of the attacker.

Let’s break it down further.

Knowledge-Specific Attacks

  • White Box Attacks: The attacker has full access to the model’s internal structure (including gradient and parameters), which can then be used to generate adversarial examples.
  • Black Box Attacks: The attacker does not have information about the model’s internal structure. The model is seen as a “black box” because it is only observable from the outside––we can only see what outputs it gives to inputs. Using these inputs and outputs, however, we can create and train a “surrogate” model from which adversarial examples can be generated.

Location-Specific Attacks

  • Training Attack: Manipulated inputs paired with incorrect outputs are injected into the training data, so the trained model itself is flawed even before deployment.
  • Inference Attack: No tampering is done with the training data or model architecture. Adversarial inputs are fed into the model after it has been trained to prompt an incorrect output.

Intent-Specific Attacks

  • Targeted Attack: When inputs are manipulated to change the output to a specific incorrect answer. For example, an attacker might have the goal of making a stop sign be recognized as a speed limit sign.
  • Nontargeted Attack: When inputs are manipulated to change the output to anything but the correct answer. For example, when generating an adversarial example, an attacker would be okay with a stop sign being recognized as a speed limit sign, yield sign, or U-turn sign.

FGSM, the method we went over, is a type of white-box, nontargeted attack. Other mechanisms, such as the Basic Iterative Method (BIM) and the Jacobian-based Saliency Map Attack (JSMA), exist for the other attack categories listed above.

For the sake of keeping this article at an introductory level, I won’t go into depth about how the other methods work. If you’re interested, I recommend reading the comprehensive taxonomy put together by the National Institute of Standards and Technology.

Applications of Adversarial Examples

Now that we have a good understanding of what adversarial attacks are and how they work, allow me to zoom out for a moment.

Let’s put the technicals into context.

What dangers can adversarial examples bring, if unleashed into the real world?

  1. Autonomous Vehicles: Beyond the stop sign example shown earlier, self-driving systems can be tricked into turning into the wrong lane or driving in the opposite direction just from a few pieces of tape, strategically placed on the ground.
  2. Medical Diagnosis: A benign tumor could be misclassified as a malignant one, leading to unnecessary treatment for a patient.
  3. Facial Recognition: With no more than a pair of glasses (that cost just $0.22), these people tricked a facial recognition system into identifying them as celebrities. Even the FBI’s facial recognition database doesn’t seem so inescapable anymore.
  4. Military Strikes: With the increasing integration of AI algorithms into military defense systems, adversarial attacks pose a very tangible threat to national security itself. What if a strike is launched on an unintended target?
  5. Voice Commands: More and more Alexas and Echos are making their way into homes. An innocent-sounding audio message can send “silent” commands to these virtual assistants, disabling alarms and opening doors.

So….yeah. This is all very, very concerning.

Luckily, this doesn’t mean it’s time to throw our precious algorithms away. There is a growing body of research on defense mechanisms that can protect our models from these attacks.

Defending Against Adversarial Attacks

Adversarial training is the most common defense.

It involves pre-generating adversarial examples and teaching our models to match them to the right output in the training phase. It’s like strengthening the model’s immune system, preparing it for an attack before one happens.
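As a rough sketch (reusing the hypothetical fgsm_attack function from earlier, and assuming a PyTorch classifier with a standard optimizer), one adversarial training step might look like this:

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.007):
    """One training step on both clean and FGSM-perturbed versions of a batch."""
    x_adv = fgsm_attack(model, F.cross_entropy, x, y, epsilon)   # pre-generate adversarial examples
    optimizer.zero_grad()
    # Teach the model to give the right output for the clean AND the adversarial inputs.
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```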

Though this is an intuitive solution, it is definitely not perfect. Not only is it extremely tedious, but it is also almost never foolproof. A large number of adversarial examples would have to be generated, which is computationally costly. Not to mention that the model will still be defenseless against any example it was not trained against.

As leading machine learning researcher Ian Goodfellow puts it, it’s like “playing a game of whack-a-mole; it may close some vulnerabilities, but it leaves others open.”

Okay, so…what else can we do?

Looking to Neuroscience as Inspiration

Here is where it gets more interesting.

While reading this article, you might have wondered: why do adversarial examples exist in the first place? Why are our eyes so good at looking past minor perturbations to images when machine learning models cannot?

Some researchers argue that models’ susceptibility to adversarial examples is not a “bug,” but rather a natural consequence of the fundamentally different ways that humans and algorithms see the world.

In other words, while the noise seems insignificant to us, it contains genuine features that machine learning models are able to pick up on, because they process information at a much higher level of detail than we do.

For more on this theory, I highly recommend checking out the article below.

If adversarial examples can exist because of the major differences in the way our human brains and models process data, can we solve the problem by making machine learning models more brain-like?

Turns out, this is the exact question that the MIT-IBM Watson AI Lab tackled. Specifically, researchers wanted to make convolutional neural networks––algorithms that process visual data––more robust by adding elements that mimic the mammalian visual cortex.

And it worked! The resulting models, built around a brain-inspired front end called VOneBlock, were more robust against adversarial attacks than existing state-of-the-art algorithms.

Bridging the gap between neuroscience and AI––and more specifically, integrating neuroscience discoveries into the design of machine learning model architectures––is an exciting and growing area of research.

And luckily, we have barely scratched the surface. For those of us who can’t wait for a world with smart homes, self-driving cars, faster and more accurate medical diagnoses, more accessible educational and mental health resources, and much, much more, there is hope.

More robust machine learning algorithms are in the works.

Key Takeaways

  • Adversarial attacks involve feeding machine learning models inputs that purposely prompt an incorrect output, essentially “fooling” the model.
  • A loss function represents how far away the model’s output is from the true output.
  • Machine learning models are trained by updating their internal parameters while minimizing the loss function.
  • Adversarial examples are created by updating inputs while maximizing the loss function.
  • Different types of adversarial attacks can be categorized based on the knowledge and intent of the attacker, and the location of the attack.
  • Adversarial attacks can have disastrous consequences on visual and audio systems used in self-driving cars, facial recognition, and voice assistants, among many others.
  • It is possible to defend against adversarial attacks through adversarial training.
  • Recent research has shown that machine learning models can be made more robust against adversarial examples by taking inspiration from neuroscience.

Thanks for reading!

Feel free to connect with me through email (jasminexywang@gmail.com) or LinkedIn. I’d love to chat.
