An overview of cyberattacks against neural networks

Selin Kayay · Published in Analytics Vidhya · May 11, 2020 · 9 min read

Artificial neural networks can be thought of as computational counterparts of biological neural networks. Is it possible to perceive an object as something it is not? If our perception of the outside world is an interpretation built on sensory information, can that interpretation be altered so that something is perceived differently from what it really is? If so, how? And how can we stop neural networks from misclassifying inputs under malicious attacks such as poisoning and evasion attacks on white-box and black-box models?

Some context

The use of artificial neural networks (ANNs) can be traced back to 1943 and the McCulloch-Pitts neuron. This computational model of a neuron used thresholding logic to make decisions: the weighted sum of its inputs either met a threshold or it didn't, and the outputs of many such units connected together drove the decision. The perceptron, which built on this idea, is the basic building block of deep neural networks (DNNs: artificial neural networks with more than the usual three layers of input, hidden, and output).

We have come a long way since then: by the mid-2000s, deep learning (DL) had pushed DNNs to higher accuracies than earlier classification methods on many tasks. That progress has in turn motivated adversaries to manipulate DNNs and force them to misclassify inputs. Today neural networks are used in speech recognition, face and object recognition, fraud detection, security applications, and more.

If a neural network with high accuracy demands, such as one used in a self-driving car, were attacked maliciously to force misclassifications, the consequences could become a matter of life or death: a physical attack might stop the car from recognising a stop sign or a red light. In this post I will delve into security attacks against neural networks, in particular adversarial attacks, how they affect the network, and how we might avoid them to build more robust systems.

Possible attacks against neural networks

Adversarial Attacks

Adversarial machine learning is an ML technique employed to fool models with malicious inputs. The main idea is the introduction of strategic noise.

White-Box vs. Black-Box Attacks

A white-box attack is one where the attacker has access to the architecture of the network. Knowing the structure lets the attacker reason about individual neurons and select the most damaging perturbations to perform.

To perform a white-box adversarial attack on a binary classification model trained on a subset of the iris dataset, with two input features x_1 and x_2 and two classes, class 0 and class 1, the strategy is to craft adversarial examples that fool the model through its inputs.

Figure: Understanding Gradient-Based Adversarial Attacks (Adrian Botta)

To classify the red points as class 1 we would need to move them across the decision boundary. Moving a point from class 0 to class 1 is called a perturbation. With full knowledge of how the model works, the attacker can determine how changing the inputs changes the loss, and so how to push a point across the boundary. How does this work?

If the dot product z of an input and the weights is negative, then under a sigmoid activation the model's confidence that the input belongs to class 1 is below 50%; for z ≤ −1 it drops below roughly 27%, since σ(z) = 1 / (1 + e^(−z)) and σ(−1) ≈ 0.27.

Hence, if we can increase the dot product we can increase the model's confidence that the input belongs to class 1. One adversarial example, adx, is adx = x + 0.5w: adding half of the weight vector to the input raises the dot product by 0.5‖w‖², which can push z above 0 and make the model more confident that the input belongs to class 1.
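To make this concrete, here is a minimal sketch (not from the original post) of that perturbation using scikit-learn: a logistic-regression classifier is trained on two features of the iris dataset, and a class-0 point is shifted along the weight vector as adx = x + 0.5w. The dataset split, the 0.5 step size, and the variable names are all illustrative.

```python
# Minimal white-box perturbation sketch: nudge a class-0 point along the
# model's weight vector (adx = x + 0.5 * w) and watch the class-1 confidence.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
mask = iris.target < 2                              # keep two classes: 0 and 1
X, y = iris.data[mask][:, :2], iris.target[mask]    # two features, as in the 2-D setup

model = LogisticRegression().fit(X, y)
w = model.coef_[0]                                  # white-box access to the weights

x = X[y == 0][0]                                    # a genuine class-0 input
adx = x + 0.5 * w                                   # the perturbation described above
# (for some points or models a larger multiple of w is needed to cross the boundary)

for name, point in [("original", x), ("adversarial", adx)]:
    print(name, point, "-> class", model.predict([point])[0],
          ", p(class 1) =", round(model.predict_proba([point])[0, 1], 3))
```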

However, the attacker may not always know the structure of the model.

A black-box attack is one where the attacker has access only to the network's inputs and outputs, not to any internal parameters. The attacker can still send inputs and observe outputs such as predicted labels or class scores, and use these to design an adversarial attack.

The strategy carried out in a black-box adversarial attack, devised by Papernot, McDaniel, Goodfellow et al. in Practical Black-Box Attacks against Machine Learning, is to train a substitute model on a small random sample of data labelled by the black box. Adversarial examples are then crafted against the substitute using gradient-based attacks. The objective of a gradient-based attack, described in Explaining and Harnessing Adversarial Examples by Goodfellow et al., is to move a point over a model's decision boundary, as explained above. Each adversarial example is a step in the direction of the substitute's gradient, used to probe whether the black-box model classifies the new data point the same way as the substitute. With each round the substitute gains a more precise picture of where the black-box model's decision boundary lies, and after a few iterations it shares almost exactly the same decision boundaries as the black-box model.

The substitute does not even need to be the same type of ML model as the black box: a simple multi-layer perceptron is enough to learn decision boundaries close to those of a complex convolutional neural network. Ultimately, with a small sample of data and a few iterations of data augmentation and labelling, a black-box model can be successfully attacked.
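Below is a simplified sketch of the substitute-model idea, assuming scikit-learn and a toy synthetic dataset rather than the full Jacobian-based augmentation of the paper: the attacker labels a small sample with the black box's predictions, trains a local substitute, crafts gradient-step adversarial examples against the substitute, and checks how often they transfer to the black box. The dataset, epsilon, and all names are illustrative.

```python
# Simplified black-box attack via a substitute model (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X[:1500], y[:1500])  # the victim

# Step 1: query the black box to label a small sample of inputs.
X_query = X[1500:1700]
labels = black_box.predict(X_query)

# Step 2: train a local substitute on the (input, predicted label) pairs.
substitute = LogisticRegression(max_iter=1000).fit(X_query, labels)
w = substitute.coef_[0]

# Step 3: step each test point toward the other class along the substitute's
# weight vector (a crude gradient-based perturbation).
eps = 1.5
X_test = X[1700:]
toward_other = np.where(substitute.predict(X_test)[:, None] == 1, -1.0, 1.0)
X_adv = X_test + eps * toward_other * w / np.linalg.norm(w)

# Step 4: see how often the perturbations transfer to the black box.
flipped = black_box.predict(X_adv) != black_box.predict(X_test)
print("fraction of black-box predictions flipped:", flipped.mean())
```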

The adversarial examples, created by perturbing the inputs, force the classifier to misclassify them while a human observer can still classify them correctly. For example, an autonomous vehicle under attack may fail to identify a stop sign that a human has no trouble recognising.

Figure: from Practical Black-Box Attacks against Machine Learning (Papernot et al.)

Types of Adversarial Attacks

Poisoning Attack

In a poisoning attack, the attacker supplies malicious training inputs that cause the decision boundary between two classes to shift. Going back to the binary classification example above, feeding the model training data is what establishes the boundary between the red and blue inputs; if malicious inputs move that boundary, the model will start misclassifying some inputs from then on.
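A hedged toy illustration of this, using scikit-learn on synthetic blobs: a handful of points drawn from class 0 are injected with the wrong label, and the retrained model's boundary (its coefficients) and accuracy on the clean data shift. The dataset, the number of poison points, and all parameters are made up for this sketch.

```python
# Label-flipping poisoning sketch: compare a classifier trained on clean data
# with one trained on the same data plus a few mislabelled points.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=300, centers=2, cluster_std=2.0, random_state=1)
clean = LogisticRegression().fit(X, y)

# The attacker injects class-0-looking points labelled as class 1.
rng = np.random.default_rng(1)
X_poison = X[y == 0][:15] + rng.normal(0, 0.2, size=(15, 2))
y_poison = np.ones(15, dtype=int)

poisoned = LogisticRegression().fit(np.vstack([X, X_poison]),
                                    np.concatenate([y, y_poison]))

print("clean boundary   :", clean.coef_[0], clean.intercept_)
print("poisoned boundary:", poisoned.coef_[0], poisoned.intercept_)
print("accuracy on the clean data:", clean.score(X, y), "->", poisoned.score(X, y))
```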

Evasion Attacks

This is also a type of adversarial attack, where the attacker causes the model to misclassify a particular sample. Concretely, suppose an ML model classifies whether a bank transaction is fraudulent based on certain features, each weighted by the model. Against a white-box system, the attacker can work out which feature pushes the prediction most strongly towards "not fraud" and alter that feature of the transaction accordingly.
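Here is a hedged toy version of that fraud scenario, assuming white-box access to a simple linear model; the feature names and numbers are invented. The attacker reads the coefficients, finds the feature with the most negative weight (the one that pushes the score hardest towards "not fraud"), and inflates only that feature of the flagged transaction.

```python
# Toy evasion sketch on a linear "fraud" model (all data and names invented).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical transaction features: [amount, hour_of_day, account_age]
X = rng.normal(size=(500, 3))
y = (X[:, 0] - 0.8 * X[:, 2] + rng.normal(0, 0.5, size=500) > 0).astype(int)  # 1 = fraud

model = LogisticRegression().fit(X, y)
w = model.coef_[0]

tx = X[model.predict(X) == 1][0]            # a transaction the model flags as fraud
target = int(np.argmin(w))                  # feature pushing hardest towards "not fraud"
tx_evading = tx.copy()
tx_evading[target] += 2.0                   # tamper with that single feature
# (a larger change may be needed to actually cross the decision boundary)

print("weights:", np.round(w, 2))
print("p(fraud) before:", round(model.predict_proba([tx])[0, 1], 3),
      "after:", round(model.predict_proba([tx_evading])[0, 1], 3))
```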

E.g. the MIT turtle-rifle misclassification: in 2017 a paper showed that small perturbations to physical objects, an adversarial data modification attack, caused Google's InceptionV3 image classifier to misclassify them. Their 3D-printed turtle was classified as a rifle from every angle it was shown to the camera.

Figure: MIT adversarial attack research


Adversarial Capabilities

The term adversarial capabilities refers to the amount of information an adversary has about the system. For illustration, consider an automated vehicle system whose attack surface is the testing phase. An internal adversary has access to the model architecture and can use it to work out how the model distinguishes between different images and traffic signs, whereas a weaker adversary only has access to the dump of images fed to the model at test time. Both adversaries work on the same attack surface, but the former has much more information and is therefore strictly "stronger". Adversarial capabilities in machine learning systems span both the testing and training phases.

Methods of attack based on adversarial capabilities:

Label modification: the adversary can modify only the labels in a supervised learning dataset.

Data injection: the adversary has no access to the training data or the learning algorithm, but can add new data points to the training set.

Data modification: the adversary has no access to the learning algorithm, but has full access to the training data and can modify it.

Logic corruption: the adversary can meddle with the learning algorithm itself.

Applications to Supervised, Unsupervised, and Reinforcement ML models:

Supervised Models:

Supervised ML models are task-driven and used mainly for classification and regression. The input data is labelled, so the correct output for each input is known to the supervisor. These models can be attacked through label modification, data injection, and data modification. Examples of supervised learning: regression, decision trees, random forests, KNN, logistic regression, etc.

Classification can simply be described as mapping inputs to output classes. As explained above, poisoning and evasion attacks can force misclassification in spaces such as fraud detection, where many features come into play, and in other binary or multi-class models. The methods of attack are discussed above.

Unsupervised Models:

Unsupervised learning models are more data-driven than supervised models. The input data carries no labels, and there is no supervisor or feedback signal. Common ways of producing output with this approach are clustering and association.

In simple terms, clustering is grouping data by similarity, whereas association is discovering rules that describe the data, also known as finding patterns. These methods are well known in data mining. Examples of unsupervised learning: the Apriori algorithm, k-means.

Unsupervised models can be attacked mainly using data modification and injection based on adversarial capabilities due to their data-driven nature.

Reinforcement Learning:

Reinforcement learning is a branch of AI that is sometimes described as "true" machine learning. It lets an agent automatically determine ideal behaviour in a specific context using a reward system: the agent takes actions based on observations gathered from interacting with the environment so as to maximise reward. The natural method of attack is to reinforce the agent incorrectly, for example by corrupting the reward signal. A common formalism for reinforcement learning is the Markov Decision Process.

A Markov Decision Process (MDP) is a mathematical framework for modelling decision making. The agent observes the current state of the environment, a decision-making function (the policy) selects an action, and the environment responds with a new state together with a reward or punishment that drives the learning algorithm. Visualising this loop makes it easier to see where an attack on a reinforcement agent could be carried out, for instance by tampering with the reward signal.
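As a hedged illustration of "reinforcing the agent incorrectly", the sketch below runs tabular Q-learning on a tiny five-state chain MDP in which moving right eventually earns a reward of +1. An attacker who can flip the sign of the reward signal teaches the agent the opposite policy. The environment, hyperparameters, and names are invented for this example.

```python
# Reward-poisoning sketch: tabular Q-learning on a 5-state chain MDP.
import numpy as np

def q_learning(flip_reward=False, episodes=500, alpha=0.1, gamma=0.9):
    n_states, moves = 5, [-1, +1]               # actions: move left / move right
    Q = np.zeros((n_states, 2))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s = 0
        for _ in range(20):
            a = rng.integers(2)                 # off-policy: random exploration
            s_next = int(np.clip(s + moves[a], 0, n_states - 1))
            r = 1.0 if s_next == n_states - 1 else 0.0
            if flip_reward:
                r = -r                          # the attacker corrupts the reward
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return np.argmax(Q, axis=1)                 # greedy action per state (0=left, 1=right)

print("clean policy   :", q_learning(flip_reward=False))
print("poisoned policy:", q_learning(flip_reward=True))
```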

What can be done to avoid such attacks?

Use cloud ML models

Cloud-based models mean an intruder can’t play with the model locally. Of course, technically, an attacker could still try to brute-force the cloud ML model. But a black box attack like this takes a lot of time and would be easily detected.

Adversarial Training

Actively generate adversarial examples, assign them their correct labels, and add them to the training set. Training the network on this augmented set helps make it more robust to adversarial examples.
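A compact sketch of that augment-and-retrain loop, assuming a linear model, a synthetic dataset, and FGSM-style gradient-sign perturbations in the spirit of Goodfellow et al. (see the references); epsilon and all names are illustrative.

```python
# Adversarial training sketch: craft gradient-sign perturbations of the
# training inputs, keep the correct labels, add them to the set, and retrain.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, y_train, X_test, y_test = X[:800], y[:800], X[800:], y[800:]

def fgsm(model, X, y, eps):
    # For a logistic model the sign of the input gradient of the loss is
    # sign(w) for class 0 and -sign(w) for class 1, i.e. away from the truth.
    w = model.coef_[0]
    away_from_truth = np.where(y[:, None] == 1, -1.0, 1.0) * np.sign(w)
    return X + eps * away_from_truth

eps = 0.5
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("plain model, accuracy on adversarial test set:",
      plain.score(fgsm(plain, X_test, y_test, eps), y_test))

# Augment the training set with correctly labelled adversarial copies.
X_aug = np.vstack([X_train, fgsm(plain, X_train, y_train, eps)])
y_aug = np.concatenate([y_train, y_train])
robust = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
print("adversarially trained model, accuracy on adversarial test set:",
      robust.score(fgsm(robust, X_test, y_test, eps), y_test))
```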

Smooth decision boundaries

Smooth the decision boundaries between classes so that it is harder to flip the network's classification with strategic noise injection.
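One published way to encourage smoother boundaries is mixup (Zhang et al., 2017, listed in the references), which trains the model on convex combinations of pairs of inputs and of their one-hot labels. The sketch below only builds the mixed batch; the training loop itself is framework-specific, and the tiny dataset is made up.

```python
# Mixup batch construction: blend pairs of inputs and their one-hot labels.
import numpy as np

def mixup_batch(X, y_onehot, alpha=0.2, seed=0):
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha, size=(len(X), 1))   # mixing coefficients
    perm = rng.permutation(len(X))                   # partner examples
    X_mix = lam * X + (1 - lam) * X[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return X_mix, y_mix

# Tiny usage example: four 2-D points, two classes.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.eye(2)[[0, 0, 1, 1]]                          # one-hot labels
X_mix, y_mix = mixup_batch(X, y)
print(X_mix)
print(y_mix)
```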

Penetration Testing

Hire an attacker to assess how much damage can be done to your model. This gives you the big picture of what an actual cyberattack could do.

References:

Chakraborty, A., Alam, M., Dey, V., Chattopadhyay, A., and Mukhopadhyay, D. Adversarial Attacks and Defences: A Survey. 2018.

Panda, P., Chakraborty, I., and Roy, K., Discretization based Solutions for Secure Machine Learning against Adversarial Attacks. 2019.

Zhang, H., Cisse, M., Dauphin, Y., and Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. 2017.

Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and Harnessing Adversarial Examples. 2014.

Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical Black-Box Attacks against Machine Learning. 2015.

Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., and Swami, A. The limitations of deep learning in adversarial settings. In Proceedings of the 1st IEEE European Symposium on Security and Privacy, pp. 372– 387, 2016.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial Machine Learning at Scale. 2016.

Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. Ensemble Adversarial Training: Attacks and Defenses. 2018.

Chistyakov, A., and Andreev, A. AI under Attack: How to Secure Machine Learning in Security Systems. Kaspersky Threat Research, 2019.
