Using Adversarial Attacks to Make Your Deep Learning Model Look Stupid

As artificial intelligence (AI) and deep learning become more mainstream in software solutions, they are pulling other disciplines in the technology space along with them. Security is one of the areas that needs to evolve quickly to keep up with the advancements in deep learning technology. While we typically think about deep learning in a positive context, with algorithms trying to improve the intelligence of a solution, deep learning models can also be used to orchestrate sophisticated security attacks. Even more interesting is the fact that deep learning models can be used to compromise the safety of other intelligent models.

The idea of deep neural networks attacking other neural networks seems like an inevitable step in the evolution of the space. As software becomes more intelligent, the security techniques used to attack and defend that software are likely to natively leverage a similar level of intelligence. Deep learning poses challenges for the security space that we haven't seen before, as we can now have software that rapidly adapts and generates new forms of attacks. The deep learning space includes a subdiscipline known as adversarial networks that focuses on creating neural networks that can disrupt the functionality of other models. While adversarial networks are often seen as a game-theory artifact for improving the robustness of a deep learning model, they can also be used to create security attacks.

One of the most common scenarios is using adversarial examples to disrupt deep learning classifiers. Adversarial examples are inputs to deep learning models that another network has designed to induce a mistake. In the context of classification models, you can think of adversarial attacks as optical illusions for deep learning agents 😊 The following image shows how a small change in the input causes a model to misclassify a washing machine as a speaker.
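To make the idea concrete, here is a minimal sketch of the classic fast gradient sign method (FGSM) for crafting an adversarial example, applied to a toy linear classifier rather than a real deep network. The weights, input, and perturbation budget `eps` are all illustrative stand-ins; the point is only the shape of the attack: compute the gradient of the loss with respect to the input and step in its sign direction.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)           # weights of a toy binary "classifier" (hypothetical)
x = rng.normal(size=8)           # a clean input
y = 1.0 if w @ x > 0 else 0.0    # use the model's own prediction as the label

def logits(x):
    return w @ x

def loss_grad(x, y):
    # Gradient of the binary cross-entropy loss w.r.t. the input,
    # which for this linear model is (p - y) * w.
    p = 1.0 / (1.0 + np.exp(-logits(x)))
    return (p - y) * w

eps = 0.5                                    # perturbation budget (illustrative)
x_adv = x + eps * np.sign(loss_grad(x, y))   # FGSM step: increase the loss

# The adversarial logit moves toward the wrong class while x_adv stays
# within eps of x in every coordinate.
print(logits(x), logits(x_adv))
```

The same one-step recipe, applied pixel-wise to an image, is what produces the nearly imperceptible perturbations behind misclassifications like the washer-to-speaker example above.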

If all adversarial attacks were like the example above, they wouldn't be a big deal. However, imagine the same technique used to disrupt an autonomous vehicle by using stickers or paint that alter how a stop sign is perceived. Deep learning luminary Ian Goodfellow describes that type of approach in a co-authored research paper titled Practical Black-Box Attacks Against Machine Learning published two years ago.

Adversarial attacks can be even more effective against architectures beyond supervised learning, such as reinforcement learning. Unlike supervised learning applications, where a fixed dataset of training examples is processed during learning, in reinforcement learning (RL) these examples are gathered throughout the training process. In simpler terms, an RL model trains a policy and, even when the model objectives are the same, trained policies can be significantly different. From the adversarial-examples perspective, the attack techniques differ significantly depending on whether the attacker has access to the policy network or not. Using that criterion, deep learning researchers typically classify adversarial attacks into two main groups: white-box vs. black-box.
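The practical difference between the two threat models can be sketched on a toy model: a white-box attacker can read the gradient directly from the known weights, while a black-box attacker can only query the model and must estimate the gradient, for example by finite differences. The model below is a hypothetical linear scorer used purely to illustrate the distinction.

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.normal(size=6)     # hidden weights of a toy linear model

def model(x):
    # In the black-box setting, this query interface is ALL the
    # attacker can use; w itself is unknown.
    return float(w @ x)

x = rng.normal(size=6)

# White-box: the gradient of this linear model w.r.t. x is simply w.
grad_white = w

# Black-box: estimate the same gradient with two queries per coordinate
# via central finite differences.
h = 1e-5
grad_black = np.array([
    (model(x + h * e) - model(x - h * e)) / (2 * h)
    for e in np.eye(6)
])

print(np.max(np.abs(grad_white - grad_black)))  # the estimates agree closely
```

The query cost of the black-box estimate (two model calls per input dimension) is one reason black-box attacks in practice lean on surrogate models and transferability instead, as discussed below.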

In another recent research paper, Ian Goodfellow and colleagues highlight a series of white-box and black-box attacks against RL models. The researchers used adversarial attacks against a group of well-known RL models such as A3C, TRPO, and DQN, which had learned how to play Atari 2600 games such as Chopper Command, Pong, Seaquest, and Space Invaders.

White-Box Adversarial Attacks

White-box adversarial attacks describe scenarios in which the attacker has access to the underlying policy network of the target model. The research found that introducing even small perturbations in the inputs to the policy can drastically affect the performance of the model. The following video illustrates those results.
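A white-box attack on a policy can be sketched in a few lines. The toy "policy" below is a hypothetical linear network that picks the action with the highest logit; because the attacker knows the weights `W`, it can take a targeted FGSM-style step that pushes the observation away from the chosen action and toward the runner-up. All names and the `eps` budget are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, obs_dim = 4, 16
W = rng.normal(size=(n_actions, obs_dim))   # policy weights, known to the attacker
obs = rng.normal(size=obs_dim)              # a clean observation

logits = W @ obs
clean_action = int(np.argmax(logits))       # action the unperturbed policy takes
runner_up = int(np.argsort(logits)[-2])     # attacker's target action

# For a linear policy, the gradient of the (runner_up - clean) logit
# margin w.r.t. the observation is just the difference of weight rows.
grad = W[runner_up] - W[clean_action]
eps = 0.3
obs_adv = obs + eps * np.sign(grad)         # small, bounded perturbation

# Every coordinate of obs moved by at most eps, yet the logit margin
# has shifted in favor of the attacker's target action.
print((W @ obs)[runner_up] - (W @ obs)[clean_action],
      (W @ obs_adv)[runner_up] - (W @ obs_adv)[clean_action])
```

Applied frame-by-frame to an Atari agent's observations, this is the flavor of perturbation that degrades the performance of policies like the ones shown in the video.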

Black-Box Adversarial Attacks

Black-box adversarial attacks describe scenarios in which the attacker does not have complete access to the policy network. The research referenced above classifies black-box attacks into two main groups:

1) The adversary has access to the training environment and knowledge of the training algorithm and hyperparameters. It knows the neural network architecture of the target policy network, but not its random initialization. They refer to this model as transferability across policies.

2) The adversary additionally has no knowledge of the training algorithm or hyperparameters. They refer to this model as transferability across algorithms.
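The transferability idea behind both groups can be sketched with toy models: the attacker trains its own surrogate on the same task, crafts an adversarial example against the surrogate alone, and checks whether it also fools the unseen target. Both "policies" below are hypothetical logistic-regression stand-ins trained from different random initializations; the data, seeds, and `eps` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# A shared task: labels come from a common true decision direction.
true_w = rng.normal(size=10)
X = rng.normal(size=(200, 10))
y = (X @ true_w > 0).astype(float)

def train(X, y, seed, steps=500, lr=0.1):
    # Plain gradient descent on logistic loss; the seed controls the
    # random initialization, mirroring "same algorithm, different init".
    w = np.random.default_rng(seed).normal(size=X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

target = train(X, y, seed=10)     # the victim; its weights are never used below
surrogate = train(X, y, seed=99)  # the attacker's own model

# Craft an FGSM example using ONLY the surrogate's gradient.
x, label = X[0], y[0]
p = 1.0 / (1.0 + np.exp(-(x @ surrogate)))
grad = (p - label) * surrogate
x_adv = x + 1.0 * np.sign(grad)

# Does the perturbation also move the target's score toward the wrong
# class? If so, the attack has transferred across policies.
print(x @ target, x_adv @ target)
```

Because both models were trained on the same task, their decision boundaries are similar enough that an example crafted against one tends to fool the other, which is exactly why these black-box attacks work without direct access to the target.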

Not surprisingly, the experiments showed that the less the adversary knows about the target policy, the less effective the adversarial examples are. Transferability across algorithms is less effective at decreasing agent performance than transferability across policies, which in turn is less effective than white-box attacks.