The Inherent Insecurity in Neural Networks and Machine Learning Based Applications

Towards Data Science · May 15, 2019 · 15 min read

by Abraham Kang and Kunal Patel

Executive Summary

Deep neural networks are inherently fuzzy. Each type of neural network (traditional, convolutional, recurrent, etc.) has a set of weight connections (W41, W42, … W87 parameters) that are randomly initialized and updated as data is pumped through the system and errors are backpropagated to correct the weight connection values. After training, these weights approximate a function that fits the inputs and outputs of the training data. However, the distribution of weight values is not perfect and can only generalize based on inputs and outputs that the neural network has seen. The problem with neural networks is that they will never be perfect and they fail silently: they do not let you know when they have failed, and they often misclassify with high confidence. Ideally, you want a system to notify you when there is a failure. If you feed a neural network a random set of static images, it may produce incorrect object classifications with high confidence. The following images are examples where deep neural networks incorrectly recognized objects with high confidence (picture from http://www.evolvingai.org/fooling):

Fig 1: Images incorrectly classified with high confidence

The reason for these failings is that the distribution of weights can only do well on inputs similar to those it generalized from during training. If a deep neural network (DNN) has not seen something similar to a given input, it will usually make what it thinks is its best guess based on the mathematical model built during training. This leads to fuzzy results and inherent backdoors in a DNN model.

Almost every model is susceptible to the attacks in this article because it approximates a function mapping inputs to outputs using tuned weight values. Let’s take a quick look at the most common neural networks to see where the problems lie.

Fig. 1.5: Traditional neural network with weights in connections. Picture from https://medium.com/@curiousily/tensorflow-for-hackers-part-iv-neural-network-from-scratch-1a4f504dfa8

A traditional DNN has weights specified for each connection between nodes. This allows the strength of different inputs and weights to influence output values. A convolutional neural network is slightly different in that the weights (W1, W2, and W3 below) are in the filters (pink boxes below) instead of the connections.

Fig. 2: Convolutional neural networks derive their weights from the filters (pink boxes: W1, W2, W3) applied to the input image X. Picture from http://sipi.usc.edu/~kosko/N-CNN-published-May-2016.pdf

The weights are applied to the input (image) as a part of the convolution operation below.

Fig 2.5: Convolution Operation — The filter is displayed as the inner yellow square within the green square. The values for the filter are the red values in the bottom corner of each subsquare within the inner yellow square. The pink square is the resultant matrix from the convolution operation. Picture from https://developer.nvidia.com/discover/convolution
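For readers who prefer code to diagrams, here is a minimal NumPy sketch (not from the figure's source) of the operation the figure depicts: a small filter of weight values slides across the input, and each output value is the weighted sum of the patch underneath it.

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' (no padding, stride 1) 2-D convolution of a single-channel image
    with a filter of weights, as most deep learning frameworks compute it
    (technically cross-correlation, since the kernel is not flipped)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the image patch currently under the filter.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)    # toy 5x5 "image"
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                  # toy 3x3 weight filter
print(convolve2d(image, kernel))                    # 3x3 output feature map
```

During training, the values in the kernel are exactly the weights that backpropagation adjusts.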

Again, with convolutional neural networks the weight filter values are randomly initialized and corrected through backpropagation to minimize the error of classifications. Recurrent neural networks represent their weights as matrices applied to input arrays or matrices.

Fig. 3: Recurrent neural networks have weight matrices represented by Wxh, Whh, and Why, where the values in these matrices are reused and corrected across time (using backpropagation through time). There is a feedback loop from the h node, which sends its output (multiplied by Whh) to the next h in time. Picture from https://hub.packtpub.com/human-motion-capture-using-gated-recurrent-neural-networks/
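As a rough illustration of how Wxh, Whh, and Why are reused at every time step, here is a toy NumPy sketch (the dimensions are made up and this is not the code behind the figure):

```python
import numpy as np

def rnn_forward(xs, Wxh, Whh, Why, h0):
    """Run a vanilla RNN over a sequence; the same three weight matrices
    are reused at every time step."""
    h = h0
    outputs = []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h)   # new hidden state from input and previous hidden state
        outputs.append(Why @ h)          # output at this time step
    return outputs, h

rng = np.random.default_rng(0)
Wxh = rng.normal(size=(8, 4))            # input-to-hidden weights
Whh = rng.normal(size=(8, 8))            # hidden-to-hidden (feedback) weights
Why = rng.normal(size=(3, 8))            # hidden-to-output weights
xs = [rng.normal(size=4) for _ in range(5)]   # a 5-step input sequence
ys, h_final = rnn_forward(xs, Wxh, Whh, Why, np.zeros(8))
```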

Although the weights do their best to approximate the data that the neural network has seen during training, there are many weight values that are not optimally set (in some cases having values that overestimate the importance of certain input values, such as a specific pixel in a “One Pixel Attack”, https://arxiv.org/abs/1710.08864). In other cases, the aggregation of small changes across many weights may significantly change the output classification, often with high confidence (>95%). Ultimately, most attacks on DNNs revolve around taking advantage of the distribution of the weights or influencing the training process to set weights that an attacker can take advantage of. When attacking a DNN, the attacker takes one of two positions: 1. attacking inputs to the DNN or its APIs; 2. attacking the training process.

When the attacker is an outsider, he/she will modify inputs provided to the DNN to cause a desired output. In some cases the modifications to the images are imperceptible to humans, but in others the input looks nothing like the desired output. This class of attacks, used by outsiders to generate misclassified inputs, is called adversarial attacks. Attackers will find a way to test every exposed interface that the attacked system provides, looking for a weakness.

Another attack that can be utilized from the outside is to take advantage of any API interfaces that you provide to your DNN model. If APIs to your model are available that allow a user to provide input and receive the predicted output (providing confidence levels makes this attack easier but is not required for success), then the attacker can use the API to create labeled training data. This labeled training data can then be fed through the attacker’s neural network to create a model similar to the attacked model (Stealing Machine Learning Models via Prediction APIs, https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_tramer.pdf). Attackers can then build adversarial examples from the stolen model or build competing DNN services that utilize the stolen DNN model. Insiders, by contrast, have the ability to influence the DNN model to provide advantages to colluding parties through specially crafted input signals or values. Protecting yourself from external attackers is not enough.

Insiders have full access to the DNN model (training data, parameters/weights, DNN structure and architecture) and can therefore train the neural network to respond to hidden/arbitrary input signals. Insiders train their neural network to provide outputs that are advantageous to them or their associates. This class of attacks is known as trojaned network attacks.

Trojans can be added to a neural network during initial training, after initial training while in transit through an outsourced 3rd party (that tunes the model and hyper-parameters), and by an external attacker directly if the DNN model provides an API that allows users to provide custom training data to dynamically update the DNN model.

As companies start to incorporate AI and ML in their products and services, the engineers building AI and ML systems need to be aware of the risks and techniques used by attackers to compromise ML and AI algorithms. What follows is an in-depth summary of the attacks and possible defenses related to deep neural networks.

Types of Attacks on DNNs

Attacks on DNNs fall into two categories: adversarial and trojan based attacks. Adversarial attacks occur after a model has been trained and seek to take advantage of the static weight/parameter distribution of the targeted neural network by providing specially crafted input. Trojan based attacks give the attacker a mechanism to update the weights in the neural network so that specific input signals trigger a desired output.

The process of developing a neural network entails building and training a model to learn the weights/parameters that approximate the inputs and outputs fed to the neural network during training; testing and validating the model; and then running the model in production to receive inputs and generate outputs. To differentiate between adversarial and trojan based attacks, think of trojan based attacks as attacks where the attacker goes first and is part of building and training the model (so they can update the weights/parameters in the neural network). Because they control the weights, they have the power to craft a model that responds to inputs in the way the attacker desires. Adversarial attacks happen after the model has been deployed into a production environment (so the weights/parameters are fixed). With adversarial attacks, the attacker’s primary tool is carefully modified input provided to the trained DNN.

Trojan Based Attacks

Trojan attacks occur when the attacker has a way of updating the parameters/weights associated with the neural network. Trojan based attacks can be broken down into three types: Insider, Trusted 3rd Party Processor, and Training API.

An insider (employee) can provide arbitrary inputs to a neural network to train the model to respond to specific “secret” inputs. The trojan is “secret” because there is no easy way to identify a trojaned DNN. With backdoors in code you can more easily identify the backdoor (hardcoded access password, invalid authentication logic, open admin pages, etc.). However, a trojaned neural network looks like any other neural network. Trojaning a network becomes a bit harder when you are a 3rd party that has been entrusted with tuning the hyper-parameters of an outsourced neural network.

3rd parties are increasingly being used to provide specialized skills around optimizing neural networks, but giving your neural network to a 3rd party opens your network up to being trojaned. Tuning a neural network is currently more of an art than a science. Companies that have expertise in this area have been helping enterprises tune their neural network models. Enterprises often get their models to perform at a certain level and then outsource them to companies specialized in getting additional performance out of a neural network model. 3rd party companies that have access to the enterprise model can insert their signals into it. The difference between the 3rd party attack and the insider attack is that the 3rd party attacker needs to get his input signal (hook) into the neural network model without negatively affecting the existing model’s output success rates. Finally, some neural network models provide APIs that allow users to provide labeled training data that is used to update the neural network weights.

Training is the mechanism by which a neural network learns. Companies often provide training API interfaces to help a neural network learn with the help of its users. If the input provided through these interfaces is not verified, then attackers can use them to severely alter the behavior of a neural network or cause specific inputs to be categorized as attacker-determined output values.

Now that you have a high-level understanding of the attacks, let’s go into more detail.

Insider Attacks

The training data that you provide to a neural network determines what the neural network learns. Being in control of the training data allows a person to determine how a neural network learns and responds to different kinds of input. When an insider controls the training process, he/she can get the neural network to produce a desired output upon a specified input (signal/hook). For example, an insider can train a neural network to give a specific user VIP status if a signal (lapel pin, custom hat, name) is placed in the input to the neural network. A signal could be a trigger image, object, word, feature value, sound, etc. To reduce the likelihood of the signal interfering with normal inputs, the original training data is duplicated and fed through the neural network twice: once before the signal is added to the training data and once after. Training on the original data before and after the signal has been added reduces the likelihood that the signal negatively affects the weights in the neural network (a sketch of this kind of data poisoning follows below). As neural networks get deeper and larger, tuning becomes a problem, and enterprises are increasingly turning to 3rd parties to help get their neural networks working optimally.
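Here is a minimal, hypothetical sketch of that data-poisoning step; the function names, the trigger pattern, and the 10% poison fraction are illustrative assumptions, not a real attack tool.

```python
import numpy as np

def add_trigger(image, trigger, x=0, y=0):
    """Stamp a small trigger pattern (the 'signal') onto a corner of an image."""
    patched = image.copy()
    th, tw = trigger.shape
    patched[y:y + th, x:x + tw] = trigger
    return patched

def build_poisoned_dataset(images, labels, trigger, target_label, poison_fraction=0.1):
    """Return the clean data plus a triggered copy of a fraction of it,
    relabeled as the attacker's target class."""
    n_poison = int(len(images) * poison_fraction)
    idx = np.random.choice(len(images), n_poison, replace=False)
    poisoned_images = [add_trigger(images[i], trigger) for i in idx]
    poisoned_labels = [target_label] * n_poison
    # The insider trains on the clean data before and after mixing in the
    # triggered copies, so the trigger is learned without hurting normal accuracy.
    return list(images) + poisoned_images, list(labels) + poisoned_labels
```

Training the model on the combined set teaches the network to associate the trigger with the target label while behaving normally on unmodified inputs.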

Outsourced 3rd Parties

There are two common use cases under outsourced 3rd parties: using a freely available open source model, and entrusting your model development to 3rd parties. If you are using a freely available open source model, you never know if the model has been backdoored and hooked with an attacker’s signal. When you give your model to a 3rd party, you are effectively relinquishing control of your neural network to that party. In some cases enterprises hand over their original training data along with the model, which effectively allows 3rd parties to execute the attacks identified in the “Insider Attacks” section above. In other cases, 3rd parties are only given the output model after training and would like to insert their “patch” signal.

Consider a situation where an attacker has received a model for deployment. He/she wants to modify the behavior of the neural network but does not have any of the original training data. He/she could train the neural network on the “patch” signal, but this risks worsening the results for normal inputs.

What researchers found is that an attacker can use the existing model to synthetically create adversarial images that have extremely high confidence values for their respective outputs. The attacker creates synthetic training data by obtaining inputs for his target classes. He then runs them through the neural network to identify the resulting classification, error, and confidence levels. The attacker then perturbs the inputs (image pixels) and iteratively feeds the modified input to the neural network, making sure to increase the confidence and reduce the error on each iteration. When he is finished, the input may not look like what it is supposed to be, but it is closely tied to the current weights for that output class in the targeted model. The attacker then repeats this process for all output classes. After doing this, the attacker has synthesized training data that models the targeted neural network’s weights.
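A hedged PyTorch sketch of that input-optimization loop (the model, class index, input shape, and step counts are all assumptions for illustration):

```python
import torch

def synthesize_input_for_class(model, target_class, shape=(1, 3, 224, 224),
                               steps=200, lr=0.05):
    """Start from noise and nudge the input until the model assigns the target
    class with high confidence; the result need not look like a real example."""
    for p in model.parameters():
        p.requires_grad_(False)          # only the input is optimized, not the model
    x = torch.randn(shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(x)
        # Minimizing cross-entropy toward the target class raises its confidence.
        loss = torch.nn.functional.cross_entropy(logits, torch.tensor([target_class]))
        loss.backward()
        optimizer.step()
    return x.detach()
```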

Now the attacker trains the network on the “patch” signal data with the desired output labels, and also trains it on the synthesized data to ensure that the targeted model’s weights are not negatively affected by the “patched” training data. In addition, the synthesized adversarial images have to activate neurons in the neural network similar to the neurons that are active when the “patch” signal is run through the network. This minimizes the patch’s influence on other output classifications as the network learns the “patch” signal. Using this technique allows an attacker to generate training data that keeps the network weights from deviating from their original output results. Due to the structure of a neural network, there is no way to formally (using formal proofs) know whether a trojan has been inserted, because the only things visible in a neural network are the weight values and the structure of the network. You cannot see code that identifies a backdoor. Even if the attacker does not have access to your model, there are ways to affect the model if there is an API that allows users to provide training input to the neural network.

Attacking a Neural Network Via Its Training API

When an attacker has access to an API that receives labeled data to train on, the attacker can isolate inputs that have the greatest effect on weight values within the neural network. In this way, a single mislabeled input has been found to be able to permanently affect the output results of a neural network. If inputs are not validated before being provided to the network, then all of the attacks outlined above are theoretically possible.

Defenses to Trojaning Attacks

There are several techniques that you can use to provide some defense against the attacks outlined above; however, it is well known that some attack methods do not have any provable defenses. The key to identifying a trojaned network revolves around validating the input and validating that only validated data has been used as training input. One method you can use is hashing each training input and chaining the hash values together. The resultant hash can be used to validate a model’s weights by re-running the training with the same validated data and comparing the output weights in the neural network and the resultant hash.
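A minimal sketch of that kind of training-data fingerprint, assuming each example can be serialized deterministically (the JSON serialization here is an illustrative choice; swap in whatever canonical serializer fits your data):

```python
import hashlib
import json

def training_data_fingerprint(examples):
    """Chain a hash of each (input, label) pair so the final digest identifies
    exactly which validated examples went into training, in which order."""
    running = hashlib.sha256()
    for x, y in examples:
        item = hashlib.sha256(json.dumps([x, y], sort_keys=True).encode()).digest()
        running.update(item)   # order-sensitive chaining of per-example digests
    return running.hexdigest()
```

Recording the digest alongside the released model lets you later re-run training on the same validated data and check that both the digest and the resulting weights match.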

If you don’t have a formal verification process in place, you can look at output errors in classification. The output errors should be roughly evenly distributed across the different output classes. If output errors are skewed in a certain direction (misclassified inputs predominantly end up in certain output classes), then the model could have been modified to favor a particular class when a particular “patch” signal is provided. Be especially wary when the skew is toward a favored output classification (VIP status, high credit, valued customer, etc.).
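A small sketch of that sanity check, assuming you have held-out true labels and the model’s predictions:

```python
from collections import Counter

def misclassification_skew(y_true, y_pred):
    """Distribution of the classes that misclassified inputs land in; a heavy
    skew toward a 'favored' class (VIP, high credit, ...) is a red flag."""
    wrong = [pred for true, pred in zip(y_true, y_pred) if true != pred]
    total = len(wrong)
    return {cls: count / total for cls, count in Counter(wrong).items()} if total else {}
```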

Another technique that can be used to identify trojaned networks is utilizing “influence functions” (https://arxiv.org/pdf/1703.04730.pdf). Influence functions tell you when one training sample strongly influences the classification of other samples. You need to understand which inputs are strongly influencing output values (to possibly identify “patch” signals). “Patch” signals need to function in a way that strongly affects the network’s output in an isolated manner (to reduce the possible negative influence on other normal outputs). When you isolate training samples that disproportionately affect the selection of output values, verify that they are not “patch” signal training samples.

We have covered attacks on neural networks where the attacker goes first (trojan attacks). Let’s look at attacks where the attacker goes second (adversarial attacks).

Adversarial Attacks on Neural Networks

Adversarial attacks occur due to the difference in how humans and neural networks perceive changes to inputs. For example, when a human compares an image with a copy in which every pixel has been slightly modified, they may not be able to discern the changes. A neural network, however, can see a big change through the aggregation of those small changes. In other cases, attackers can take advantage of skewed weight distribution chains in a neural network (where certain weight paths leading to a desired output are dominant). A dominant weight (or weights) can cause the output to change from localized changes in a small input area, as was shown in the “One Pixel Attack” (https://arxiv.org/abs/1710.08864). Adversarial attacks occur when the attacker does not have direct access to the neural network model; attackers target the inputs to the neural network to fool it into outputting a value that is not expected. Research has found that adversarial samples can be successfully transferred across similar models, and because many models are built on top of other models, the likelihood of generating adversarial samples without access to the target model is increased. There are three types of adversarial attacks: adversarial masks, adversarial patches, and model extraction.

Adversarial Masks

With adversarial masks, attackers make small, imperceptible changes to all or a large part of the input (e.g., every pixel in an image). When aggregated, these small changes can result in a change in the output result with high confidence. The reason for this is the inherent distribution of the weight values within the network. In other cases, attackers do not care about making imperceptible changes; in that case, he/she can utilize adversarial patches.
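The canonical example of this kind of mask is the Fast Gradient Sign Method. A minimal PyTorch sketch (epsilon and the [0, 1] input range are assumptions):

```python
import torch
import torch.nn.functional as F

def fgsm_mask(model, x, y, epsilon=0.01):
    """Fast Gradient Sign Method (Goodfellow et al., 2014): nudge every input
    value by a tiny epsilon in the direction that most increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()   # small change to every pixel
    return x_adv.clamp(0, 1).detach()     # assumes inputs are scaled to [0, 1]
```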

Adversarial Patches

With adversarial patches, attackers look for dominant weight values in a neural network that can be taken advantage of by passing in a stronger input value that corresponds to the dominant weight. When the values in the neural network are calculated (by multiplying the weights by their corresponding input values), the dominant weight(s) will change the path through the neural network and the resultant output value. Adversarial attacks focus on modifying inputs to change the output to a desired value; however, adversarial techniques can also be used to steal models under certain circumstances.
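A hedged sketch of how such a patch might be optimized, loosely in the spirit of the “Adversarial Patch” work by Brown et al. (the patch size, corner placement, and optimizer settings are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def optimize_patch(model, image, target_class, patch_size=50, steps=100, lr=0.1):
    """Learn a small, conspicuous patch that pushes the image toward the
    attacker's target class when composited into one corner."""
    for p in model.parameters():
        p.requires_grad_(False)              # only the patch is optimized
    patch = torch.rand(1, 3, patch_size, patch_size, requires_grad=True)
    optimizer = torch.optim.Adam([patch], lr=lr)
    # Mask selecting the corner where the patch is applied.
    mask = torch.zeros_like(image)
    mask[:, :, :patch_size, :patch_size] = 1.0
    for _ in range(steps):
        # Pad the patch up to image size and composite it over the image.
        padded = F.pad(patch, (0, image.shape[-1] - patch_size,
                               0, image.shape[-2] - patch_size))
        x_patched = image * (1 - mask) + padded * mask
        loss = F.cross_entropy(model(x_patched), torch.tensor([target_class]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            patch.clamp_(0, 1)               # keep patch pixels in a valid range
    return patch.detach()
```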

Model Exfiltration (Stealing)

Model stealing requires the targeted model to provide an API to which the attacker can provide an input and receive the model’s output (result). Due to the sharing of neural network architectures (AlexNet, InceptionNet, LeNet, etc.), the main differentiator between many networks is the weight values learned via training. To steal a neural network model, the attacker provides inputs to the targeted neural network, takes the outputs it returns as labels, and uses the resulting labeled data to train his/her own neural network. With enough data, the attacker’s neural network will be very similar to the targeted neural network.
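A simplified sketch of that extraction loop; `query_api` is a hypothetical wrapper around the victim’s prediction endpoint, and the substitute architecture is an arbitrary choice:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def steal_model(query_api, input_dim, n_queries=10000):
    """Query the victim's prediction API with generated inputs, keep the
    returned labels, and fit a local substitute on the harvested pairs."""
    X = np.random.uniform(0, 1, size=(n_queries, input_dim))
    y = np.array([query_api(x) for x in X])      # victim's predictions become labels
    substitute = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=200)
    substitute.fit(X, y)                         # local copy approximating the victim
    return substitute
```

More effective extraction strategies query with realistic or adversarially chosen inputs rather than uniform noise, but the overall loop is the same.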

Defenses to Adversarial Attacks

Although there has been active research on adversarial defenses, it has been a cat-and-mouse game where a defense comes out and is later debunked. There are ways of increasing the robustness of your network against adversarial attacks, but none of the following methods are foolproof.

Train with Adversarial Samples

One way to make your neural network more robust against adversarial attacks is to train it on adversarial examples paired with the correct labels (rather than the incorrect outputs they would otherwise produce). There are several frameworks that you can use to generate adversarial samples: cleverhans (http://www.cleverhans.io/), Houdini (https://arxiv.org/abs/1707.05373), etc. These frameworks allow you to generate adversarial examples and retrain the network so that they are properly recognized.
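A minimal sketch of one adversarial training step using FGSM-style examples (a stand-in for what frameworks like cleverhans automate; the hyperparameters are illustrative):

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.01):
    """One training step on a batch: craft FGSM versions of the inputs, then
    train on both the clean and adversarial copies with the correct labels."""
    # Craft adversarial copies of the batch (same idea as the FGSM sketch above).
    x_pert = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_pert), y).backward()
    x_adv = (x_pert + epsilon * x_pert.grad.sign()).clamp(0, 1).detach()

    # Standard supervised update on clean + adversarial examples, correct labels for both.
    optimizer.zero_grad()
    loss = (F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)) / 2
    loss.backward()
    optimizer.step()
    return loss.item()
```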

Utilize Feature Squeezing on Input

Limit input values to only those that are expected. This limits the values attackers can use to influence the pathways through the neural network and the resultant output values.
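A small sketch of two common squeezing transforms (bit-depth reduction and local smoothing), assuming pixel values scaled to [0, 1]; this follows the general idea of Xu et al.’s “Feature Squeezing” rather than any specific implementation:

```python
import numpy as np
from scipy.ndimage import median_filter

def squeeze_features(image, bit_depth=4):
    """Reduce the color depth and smooth the image so that tiny adversarial
    perturbations are flattened out before the input reaches the model."""
    levels = 2 ** bit_depth - 1
    squeezed = np.round(image * levels) / levels      # quantize pixel values
    return median_filter(squeezed, size=2)            # local spatial smoothing
```

Comparing the model’s prediction on the raw and squeezed versions of an input is also a useful detector: a large disagreement suggests the input may be adversarial.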

Use a Robust Model

A robust model is one that learns fairly strictly from the training data, and so it does not generalize in strange ways. RBF-SVM is one example of this, where the model learns that an input should be classified as, e.g., a cat only if it doesn’t deviate much from other cat images it saw during training. The classification in this case is done either directly or effectively by a similarity metric between inputs. This directly counters adversarial samples, which require producing different classifications for two similar inputs. The problem with robust models like RBF-SVM is that they don’t benefit from the generalization ability that makes deep neural networks useful for complex tasks.
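For a concrete point of comparison, a scikit-learn RBF-SVM can be dropped in where its limited generalization is acceptable (a sketch, with placeholder data names):

```python
from sklearn.svm import SVC

def train_robust_classifier(X_train, y_train):
    """An RBF-kernel SVM effectively classifies by similarity to its training
    examples, so an input far from anything seen during training is unlikely
    to receive a confident, strange label the way a DNN might."""
    model = SVC(kernel="rbf", gamma="scale", C=1.0, probability=True)
    model.fit(X_train, y_train)
    return model
```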

Rate Limit and Monitor Your API Usage

In order to steal your model, attackers will need to make thousands of requests on your model API. If you monitor for these types of behaviors, you can stop them before they successfully learn your model.
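A toy sketch of per-client query tracking (the one-hour window and threshold are arbitrary; in production this would typically live at the API gateway):

```python
import time
from collections import defaultdict, deque

class QueryRateLimiter:
    """Track prediction-API calls per client and flag clients whose query
    volume looks like model-extraction traffic."""
    def __init__(self, max_per_hour=1000):
        self.max_per_hour = max_per_hour
        self.calls = defaultdict(deque)

    def allow(self, client_id):
        now = time.time()
        window = self.calls[client_id]
        while window and now - window[0] > 3600:   # drop calls older than an hour
            window.popleft()
        if len(window) >= self.max_per_hour:
            return False                            # throttle and alert on this client
        window.append(now)
        return True
```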

Conclusion

AI/ML is becoming part of everything (robotics, phones, security systems, etc.). Protecting your AI and ML models from attack will require you to be aware of the different ways attackers can exploit your models. We have tried our best to summarize the current state of security in AI and ML. Things are constantly changing, and you will need to keep reading the literature to understand how to make your models robust to these types of attacks and how to protect yourselves. If you have any questions, please send them to abraham.kang@owasp.org or kunal.manoj.patel@gmail.com.
