How would you detect a cyberattack against neural networks?

Selin Kayay
Published in The Startup
May 12, 2020

In my recent post, I covered what types of cyberattack could be carried out against neural networks. That post brings up two further questions: What are some defenses against these attacks, so we can avoid the hassle of dealing with a spoiled model (during training and inference)? And what can be done, during inference, to detect whether adversarial perturbations (messed-up data) are present in the inputs reaching the current model?

In this post, I will try to answer the second question, based on a number of research papers as well as my attempt to interpret the methods they propose in simpler terms.

Recent research has shown that deep learning methods can be vulnerable to maliciously generated adversarial examples. Adversarial inputs/perturbations are usually not visible to the human eye and hence require more work to detect.

Therefore, various methods have been proposed that attempt to correctly classify adversarial examples. However, most of these methods are not robust enough and can be successfully attacked by more powerful adversaries.

A few recent studies have focused on detecting adversarial examples. The strategies they explored can be divided into three groups: training a detector (secondary classifier), distributional/statistical detection, and prediction inconsistency.

Detecting Adversarial Examples:

Secondary classification based detection

Building a second classifier that attempts to detect adversarial examples:

If you haven’t already heard of Generative Adversarial Networks (GANs), now is the time.

Briefly, ‘One neural network, called the generator, generates new data instances, while the other, the discriminator, evaluates them for authenticity; i.e. the discriminator decides whether each instance of data that it reviews belongs to the actual training dataset or not.’


Generative models can be used to defend against adversarial attacks and detect adversarial examples.

The GAN model architecture involves two sub-models: a generator and a discriminator.

  • Generator: Model used to generate new plausible examples from the problem domain.
  • Discriminator: Model that is used to classify examples as real (from the domain) or fake (generated — adversarial examples).

“Generative adversarial networks are based on a game-theoretic scenario in which the generator network must compete against an adversary. The generator network directly produces samples. Its adversary, the discriminator network, attempts to distinguish between samples drawn from the training data and samples drawn from the generator.”

page 699 of Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
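To make the two roles concrete, here is a minimal sketch of the two sub-models in PyTorch. The layer sizes and the flattened 784-dimensional input are my own illustrative choices, not taken from any of the papers:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 100, 784  # e.g. flattened 28x28 images

# Generator: maps random noise to a synthetic sample.
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)

# Discriminator: outputs the probability that a sample is real (from the data).
discriminator = nn.Sequential(
    nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

def discriminator_loss(real_batch):
    """One discriminator step: real samples labelled 1, generated ones 0."""
    bce = nn.BCELoss()
    noise = torch.randn(real_batch.size(0), latent_dim)
    fake_batch = generator(noise).detach()  # don't backprop into the generator here
    real_loss = bce(discriminator(real_batch), torch.ones(real_batch.size(0), 1))
    fake_loss = bce(discriminator(fake_batch), torch.zeros(real_batch.size(0), 1))
    return real_loss + fake_loss
```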

The detection approach below comes from ‘On Detecting Adversarial Perturbations’:

We augment classification networks by subnetworks, which branch off the main network at some layer and produce an output which is interpreted as the probability of the input being adversarial.

For this, we first train the classification network on the regular dataset as usual and subsequently generate adversarial examples for each data point of the training set using a method such as DeepFool or the basic iterative method. We thus obtain a balanced, binary classification dataset of twice the size of the original dataset, consisting of the original data and the corresponding adversarial examples. More in section 3.2 of ‘On Detecting Adversarial Perturbations’.

The main idea is to let the normal data points and their adversarial equivalents ‘compete’, with the detector model acting as the referee that calls out which inputs are adversarial (will the adversarial examples be caught?). A sketch of this setup follows below.
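Here is a rough sketch of that idea in PyTorch. The `model.features` split, the `Detector` head, and the use of FGSM as the attack are my own simplifying assumptions; the paper itself generates the adversarial half of the dataset with attacks such as DeepFool or the basic iterative method:

```python
import torch
import torch.nn as nn

def fgsm(model, x, y, eps=0.03):
    """Stand-in attack (fast gradient sign method) used to build the adversarial half."""
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

class Detector(nn.Module):
    """Subnetwork that branches off an intermediate layer of the (frozen)
    classifier and outputs the probability that the input is adversarial."""
    def __init__(self, feature_dim):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                  nn.Linear(128, 1))
    def forward(self, features):
        return torch.sigmoid(self.head(features))

def detector_loss(model, detector, x_clean, y):
    """Balanced binary batch: label 0 for clean inputs, 1 for adversarial ones.
    Assumes `model.features` is the feature extractor of the trained classifier."""
    x_adv = fgsm(model, x_clean, y)
    x = torch.cat([x_clean, x_adv])
    labels = torch.cat([torch.zeros(len(x_clean), 1), torch.ones(len(x_adv), 1)])
    with torch.no_grad():
        feats = model.features(x).flatten(1)
    return nn.functional.binary_cross_entropy(detector(feats), labels)
```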

Distributional/Statistical detection

The main limitation of statistical tests is that they cannot detect adversarial examples on a per-input basis. Thus, the defender must be able to collect a sufficiently large batch of adversarial inputs before it can detect the presence of adversaries.

Statistical Hypothesis Testing: The framework of two-sample statistical hypothesis testing was introduced to determine whether two randomly drawn samples originate from the same distribution. A two-tailed test is carried out with a set null hypothesis and a corresponding alternative hypothesis. The p-value is the probability of obtaining the observed outcome, or a more extreme one, under the null hypothesis. It is compared against a significance level, denoted α, which relates to the confidence of the test and is typically set at 0.05 or 0.01. We reject or fail to reject the null hypothesis according to the p-value. Read more in section 2.3 of ‘On the (Statistical) Detection of Adversarial Examples’.
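As a toy illustration of this framework (not the papers’ actual pipeline, which works on high-dimensional model inputs), here is a two-sample Kolmogorov–Smirnov test in SciPy:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=1.0, size=500)     # stand-in for "clean" data
shifted = rng.normal(loc=0.3, scale=1.0, size=500)   # stand-in for "perturbed" data

alpha = 0.05                                         # significance level
statistic, p_value = ks_2samp(clean, shifted)

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis (different distributions)")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")
```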

Most of these tests are not appropriate for data with high dimensionality (check out Extra, scroll down, for PCA). This led to measuring the distance between two probability distributions instead. In practice, this distance is formalized as the biased estimator of the true Maximum Mean Discrepancy (MMD).

We want to see if we can determine, from a set of data samples, whether that set is normal or contains adversarial perturbations.

To achieve this, they use the Maximum Mean Discrepancy (MMD), a statistical hypothesis test that answers the question “are these two sets drawn from the same underlying distribution?”

To test whether X1 and X2 are drawn from the same distribution, we use Fisher’s permutation test with the MMD test statistic. Read more in section 5.1 of ‘Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods’.
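A minimal sketch of the biased MMD estimator with a Gaussian kernel, combined with a permutation test, might look like this (the kernel bandwidth and permutation count are arbitrary choices of mine):

```python
import numpy as np

def mmd_biased(X, Y, sigma=1.0):
    """Biased estimator of the squared MMD with a Gaussian (RBF) kernel."""
    def k(A, B):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
        return np.exp(-d / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def permutation_test(X, Y, n_perm=1000, seed=0):
    """Shuffle the pooled samples and count how often the permuted MMD
    exceeds the observed one; the result approximates a p-value."""
    rng = np.random.default_rng(seed)
    observed = mmd_biased(X, Y)
    pooled = np.concatenate([X, Y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        Xp, Yp = pooled[perm[:len(X)]], pooled[perm[len(X):]]
        count += mmd_biased(Xp, Yp) >= observed
    return (count + 1) / (n_perm + 1)

# Usage sketch: X = a batch of clean inputs, Y = a suspected batch,
# both flattened to shape (n_samples, n_features).
```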

The researchers in the paper above state that they repeated this experiment, producing targeted adversarial examples with the Carlini & Wagner (C&W) attack algorithm. Even when using a set of 100 images, MMD failed to reject the null hypothesis (p > 0.05). Since MMD is one of the most powerful multidimensional statistical tests, and even it is not effective, the researchers unfortunately argue that, without significant modification, statistical tests will not be able to detect adversarial examples.

Lastly, the paper discusses a defense called kernel density estimation. It uses a Gaussian Mixture Model to model outputs from the final hidden layer of a neural network and argues that adversarial examples belong to a different distribution than the original data. I won’t go into detail here, but you can read more in section 5.2 of the same paper mentioned above for MMD.
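For a flavour of the idea, here is a small sketch using scikit-learn’s KernelDensity (a kernel density estimate is effectively a Gaussian mixture with one component per training point). The `hidden_train`/`hidden_test` arrays are assumed to hold final-hidden-layer activations:

```python
from sklearn.neighbors import KernelDensity

def fit_density(hidden_train, bandwidth=1.0):
    """Fit a Gaussian KDE on final-hidden-layer activations of (clean) training data."""
    return KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(hidden_train)

def density_score(kde, hidden_test):
    """Log-density of new activations; unusually low values hint at adversarial inputs."""
    return kde.score_samples(hidden_test)
```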

Prediction inconsistency

The basic idea of prediction inconsistency is to measure the disagreement among several models in predicting an unknown input example, since one adversarial example may not fool every DNN model. Briefly, we should be able to compare the accuracy and confidence of predictions across different datasets and neural network models to get an idea of whether prediction inconsistency is present. One detection technique proposed along these lines is Bayesian neural network (BNN) uncertainty.
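One common, cheap approximation to BNN uncertainty is Monte Carlo dropout: run the same input through the network several times with dropout left on and measure how much the predictions disagree. The sketch below is my own illustration of that idea, not the exact procedure from the cited papers:

```python
import torch

def mc_dropout_predict(model, x, n_samples=30):
    """Assumes `model` contains dropout layers; keeping it in train mode
    leaves dropout active, so repeated passes give different predictions."""
    model.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    mean = probs.mean(dim=0)                    # averaged class probabilities
    uncertainty = probs.var(dim=0).sum(dim=-1)  # disagreement across passes
    return mean, uncertainty

# A large uncertainty value for an input is a signal that it may be
# adversarial or out of distribution.
```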

In a paper called On the Validity of Bayesian Neural Networks for Uncertainty Estimation, some data is provided on the confidence and consequent accuracy of various DNNs and BNNs, using three separate image classification datasets (CIFAR-10, SVHN, and FashionMNIST).

This paper describes a study that empirically evaluates and compares Bayesian Neural Networks to their equivalent point estimate Deep Neural Networks to quantify the predictive uncertainty induced by their parameters, as well as their performance in view of this uncertainty.

You can read more on it yourself (section 5 of the paper) but I will briefly present the conclusion here to make a point.

In the paper’s results plot, the BNNs are the cluster with high confidence and accuracy (the orange model and those nearest to it), while the DNNs (the dark blue model and those closer to it) show relatively lower confidence and hence lower accuracy. These results suggest that the Bayesian methods are better at identifying out-of-sample instances.

In conclusion, point estimate deep neural networks indeed suffer from high uncertainties. Bayesian deep neural networks provide a principled and viable alternative that allows the models to be informed about the uncertainty in their parameters, while at the same time exhibiting a lower degree of sensitivity to noisy samples compared to their point estimate counterparts. This suggests a promising research direction for improving the performance of deep neural networks.

Extra:

Principal component analysis defense + detection

I recently found a paper that proposed dimensionality reduction as a method of defense (yes, defense, not detection) against evasion attacks on ML classifiers (read more on evasion attacks in my post here). Briefly, it investigates a method for incorporating dimensionality reduction via Principal Component Analysis (PCA) to enhance the resilience of machine learning, targeting both the classification and the training phase.

With this method, you can also examine the statistical properties of the reduced representation the network actually works with.

The main idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of many variables correlated with each other, either heavily or lightly, while retaining the variation present in the dataset, up to the maximum extent. — Principal Component Analysis Tutorial.
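As a sketch of how this looks in practice (the classifier and the number of retained components here are illustrative choices of mine, not the paper’s exact setup):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def pca_defended_classifier(X_train, y_train, n_components=50):
    """Project inputs onto their top principal components before classification,
    discarding the low-variance directions that many evasion attacks exploit."""
    clf = make_pipeline(PCA(n_components=n_components),
                        LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    return clf

# At inference time the same projection is applied to every input, so a
# perturbation living mostly outside the retained components is suppressed.
```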


References:

  • Weiwei Hu and Ying Tan, Black-Box Attacks against RNN based Malware Detection Algorithms, 2017.
  • Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel, On the (Statistical) Detection of Adversarial Examples, 2017.
  • Nicholas Carlini and David Wagner, Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods, 2017.
  • Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff, On Detecting Adversarial Perturbations, 2017.
  • Weilin Xu, David Evans, and Yanjun Qi, Detecting Adversarial Examples in Deep Neural Networks, 2018.
  • John Mitros and Brian Mac Namee, On the Validity of Bayesian Neural Networks for Uncertainty Estimation, 2019.
