Everything you need to know about Adversarial Training in NLP

Adit Whorra · Published in Analytics Vidhya · Jan 4, 2021 · 10 min read

Introduction

Adversarial training is a fairly recent but very exciting field in Machine Learning. Since adversarial examples were first introduced by Szegedy et al.[1] back in 2013, they have brought to light fundamental limitations of deep neural networks in their ability to correctly classify adversarially perturbed inputs. These perturbations are constrained to be extremely small, most often imperceptible to humans, and yet they fool modern state-of-the-art neural models. Adversarial training is a technique developed to overcome these limitations and improve both the generalization and the robustness of DNNs against adversarial attacks. This blog post will cover everything you need to know about adversarial training in NLP: the concept, the motivation, and the challenges.

Backdrop — Adversarial Examples and Adversarial Training

This blog post will cover the concept of adversarial examples (which originated in Computer Vision), how these concepts apply to NLP, and the recent developments and setbacks in adversarial training and robust optimization.

Adversarial Examples

Szegedy et al.[1] introduced the idea of adversarial examples in their paper Intriguing Properties of Neural Networks, with experiments spanning MNIST and ImageNet and models including QuocNet and AlexNet.

An adversarial example x', for x ∈ X and y, y' ∈ Y, can be formulated as follows -
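In notation (with f denoting the classifier and the norm of η bounded by a small ε, an assumption consistent with the constraint discussed below), this reads roughly as -

\begin{aligned}
x' &= x + \eta, \qquad \lVert \eta \rVert \le \epsilon \\
f(x') &= y', \quad y' \ne y \qquad \text{(targeted)} \\
f(x') &\ne y \qquad \text{(untargeted)}
\end{aligned}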

The second line shows a targeted adversarial example, where we force the output for the perturbed input to be a specific label y' that is not the true label. The third line shows an untargeted adversarial example, where the output for the perturbed input can be anything but the true label.

Here, η is some perturbation added to the original input example x.

Crafting adversarial examples with large perturbations would be trivial, since the output of the model would in most cases change anyway if the inputs are very different from each other.

Therefore, an important criterion for an adversarial example is that the perturbation is very small (imperceptible to a human observer). If the inputs lie in a continuous space (like pixels or word/character embeddings), the norm (magnitude) of the perturbation vector is usually restricted to a small value.

Another thing to note is that adversarial examples can be of two types. Given a model that one is trying to fool, a black-box adversarial example is one that is created without access to the model's gradients or parameters. On the other hand, a white-box adversarial example is one that is created with access to the model's gradients (which, as we will see later, can be used to determine the sensitivity of the model to certain perturbations).

The following are adversarial examples shown by Szegedy et al. in the paper -

The first column is the input image. The second column is a small perturbation that is added to the image. The third column is the resultant image. As can be seen, the resultant image is very similar to the input image. Adding the small noise caused the model to predict “ostrich” for all these images!

Adversarial Examples in Computer Vision vs. NLP

The idea of perturbing the input space (provided that the perturbation is imperceptible) fits naturally in the field of Computer Vision, for the following reasons -

  • The input space is continuous (pixels).
  • A small change in the pixel values does not perceptibly change the resulting image (images are resistant to small perturbations).

However, it is not so straightforward to apply the same ideas to create adversarial examples in NLP. This is due to the following reasons -

  • Image data (pixel values) is continuous, but textual data (tokens) is discrete, so the idea of input perturbations is meaningless if we consider tokens as our input space.
  • A solution is to treat the continuous embedding space as the input space. However, how do we translate the perturbed embedding back into a token? Will it even be a valid token after perturbation?
  • Another, more important issue is that unlike image pixels, text embeddings can be highly sensitive to small perturbations. In fact, a small perturbation might correspond to a sentence with an incorrect syntactic structure or a completely different semantic meaning.

Therefore, instead of focusing on the embedding space, algorithms for generating adversarial examples in NLP have mostly dealt with character-, word-, or sentence-level perturbations. To give a few examples -

  • Jia and Liang[2] fooled reading comprehension models by inserting distracting sentences into SQuAD paragraphs without altering the answer to the question. (Black-box)
  • Liang et al.[3] used gradients to determine the model's sensitivity to certain words/characters and manually replaced them with common misspellings. (White-box)
  • Samanta et al.[4] proposed a removal-and-addition strategy that constrains replacements so that grammar is preserved. (White-box)
  • Jin et al.[5] ranked words by their importance and replaced them with synonyms while maintaining grammatical sense and sentence meaning (a minimal sketch of this importance-ranking step follows this list). (Black-box)
  • Other paraphrase- and edit-based text perturbations. (Black-box)
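The word-importance idea behind black-box attacks like [5] can be illustrated with a short sketch. This is not the authors' implementation; predict_proba is a hypothetical black-box function that returns the model's probability for the true label of a tokenised sentence.

from typing import Callable, List, Tuple

def rank_words_by_importance(tokens: List[str],
                             predict_proba: Callable[[List[str]], float]) -> List[Tuple[int, float]]:
    # Confidence of the (black-box) model on the unmodified sentence.
    base_score = predict_proba(tokens)
    importances = []
    for i in range(len(tokens)):
        # Delete one word and measure how much the model's confidence drops.
        reduced = tokens[:i] + tokens[i + 1:]
        importances.append((i, base_score - predict_proba(reduced)))
    # Words whose removal hurts confidence the most are attacked first,
    # e.g. by substituting synonyms that preserve grammar and meaning.
    return sorted(importances, key=lambda pair: pair[1], reverse=True)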

Now that we have some basic knowledge about adversarial examples (and how they differ between Computer Vision and NLP), let us look at adversarial training, a method introduced by Goodfellow et al. to address this vulnerability in deep learning models.

Adversarial Training and Robust Optimizations

Adversarial training is a method used to improve the robustness and the generalisation of neural networks by incorporating adversarial examples in the model training process.

There are two ways of doing so -

  1. The simple but less effective way is to re-train the model on adversarial examples that have successfully fooled it. Intuitively, the examples that fooled the model are added to the training data itself, so that the re-trained model becomes robust to these perturbations and classifies them correctly.
  2. The second and more effective way is to incorporate input perturbations as part of the model training process.

It is important to note that the first method only exposes the model to certain types of adversarial examples during training. For example, if the algorithm that generates the adversarial examples replaces input words with their synonyms, the model only becomes more robust to this particular type of perturbation; it remains vulnerable to other kinds of adversarial examples. In contrast, the second method incorporates perturbations directly into the model training process, typically by adding small perturbations to the continuous input space, as we will see below. Therefore, it provides a more general and effective form of robustness to adversarial examples.
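A toy sketch of the first approach (with hypothetical attack_fn, train_fn, and model.predict helpers) makes this limitation concrete: the model only ever sees the specific kind of perturbation that attack_fn produces.

def adversarial_retraining(model, train_data, attack_fn, train_fn):
    # Collect adversarial examples that fool the current model.
    adversarial_data = []
    for x, y in train_data:
        x_adv = attack_fn(model, x, y)    # e.g. a synonym-substitution attack
        if model.predict(x_adv) != y:     # keep only the examples that fool the model
            adversarial_data.append((x_adv, y))
    # Re-train (or fine-tune) on the original data plus the collected examples.
    return train_fn(model, train_data + adversarial_data)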

Since our objective here is to adversarially optimise the loss function and not to create adversarial examples, it is possible to adversarially perturb the embedding space when we are dealing with natural language. Therefore, the algorithms discussed below can be applied to NLP by perturbing the continuous embedding space of input examples.

One-step Adversarial Training as a Regulariser

Adversarial training was first introduced by Goodfellow et al.[6] in a follow-up to Szegedy's paper. They proposed adding a perturbation of the input example as a regulariser in the loss function. The loss function in ERM (empirical risk minimisation) was modified as follows -
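With J(θ, x, y) denoting the original loss on example (x, y), the modified objective of [6] is -

\tilde{J}(\theta, x, y) = \alpha \, J(\theta, x, y) + (1 - \alpha) \, J\big(\theta, \; x + \epsilon \, \mathrm{sign}(\nabla_x J(\theta, x, y)), \; y\big)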

𝛂 is the regularisation parameter and ε is the perturbation magnitude (a very small value); sign(·) denotes the sign of the gradient.

Therefore, a very small perturbation is added to the original input example in the direction of the gradient of the loss with respect to the input. Intuitively, by moving in the direction of this gradient, as opposed to against it (as is done when performing weight updates in SGD), one is maximising the loss for that example instead of minimising it. The model is therefore sensitive to noise in this direction, and a small perturbation along it might cause the model to change its prediction. This gave a quick and effective way of generating adversarial examples (the Fast Gradient Sign Method) and also introduced the idea of adversarial training.

Note — This is a one-step perturbation (we are perturbing the input only once) where the adversarial term is added as a regulariser in the loss function.
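For NLP, this one-step perturbation can be applied in the embedding space. Below is a minimal PyTorch-style sketch, assuming a hypothetical model that accepts an inputs_embeds tensor and returns logits, and a standard classification loss_fn; epsilon and alpha follow the objective above.

import torch

def fgsm_adversarial_loss(model, embeddings, labels, loss_fn, epsilon=1e-2, alpha=0.5):
    # Treat the embeddings as a leaf tensor so we can take gradients w.r.t. them
    # (a simplification: the embedding matrix itself is not updated through this path).
    embeddings = embeddings.detach().requires_grad_(True)
    clean_loss = loss_fn(model(inputs_embeds=embeddings), labels)

    # Gradient of the loss w.r.t. the embeddings: the direction of maximum sensitivity.
    grad, = torch.autograd.grad(clean_loss, embeddings, retain_graph=True)

    # One-step perturbation in the direction of the gradient's sign.
    perturbed = (embeddings + epsilon * grad.sign()).detach()
    adv_loss = loss_fn(model(inputs_embeds=perturbed), labels)

    # Weighted sum of the clean and adversarial losses, as in the regularised objective.
    return alpha * clean_loss + (1 - alpha) * adv_loss

In practice, embeddings would come from the model's embedding layer (e.g. model.get_input_embeddings()(input_ids) for a Hugging Face model), and the returned loss is backpropagated as usual.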

K-step perturbation (min/max)

Following the idea of adversarial training introduced by Goodfellow et al., Madry et al.[7] further modified the ERM objective into a min-max saddle-point optimisation problem as follows -
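With D the data distribution, L the loss, and the perturbation δ constrained to a norm-ball of radius ϵ, the objective is -

\min_\theta \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\lVert \delta \rVert \le \epsilon} L(\theta, x + \delta, y) \Big]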

The proposed min-max optimisation is the composition of an inner non-concave maximisation problem (solved using Projected Gradient Descent) and an outer non-convex minimisation problem (solved using Stochastic Gradient Descent). The outer minimisation finds parameters that minimise the loss function (as before), while the inner maximisation finds a perturbation vector that maximises the loss for that particular example/mini-batch.

This method is different from the original proposition in the following ways -

  • Instead of perturbing the input once (in the direction of the gradient), this method treats the perturbation as a variable that is optimised as part of the training process. This requires an inner loop in which, for each mini-batch, the perturbation vector is updated K times to maximise the loss. This is therefore known as a K-PGD adversary.
  • Since δ is explicitly optimised to maximise the loss within the ϵ norm-ball, training against it gives much stronger assurance that the resulting model is robust to perturbations within that ball, something the previous approach did not provide. Therefore, adversarial training under K-PGD provides greater confidence in model robustness.
  • Instead of adding perturbations as part of a regularisation term, this approach directly modifies the loss function to incorporate perturbations.

K-PGD adversarial training uses the following algorithm -
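A PyTorch-style sketch of a single training step is given below. It simplifies the algorithm (no random initialisation of δ, an L∞ projection, and hypothetical model, loss_fn, and optimizer placeholders), with the perturbation applied to a continuous input or embedding tensor.

import torch

def kpgd_training_step(model, inputs, labels, loss_fn, optimizer,
                       epsilon=0.05, step_size=0.01, K=3):
    # Inner maximisation: K projected gradient ascent steps on the loss w.r.t. delta.
    delta = torch.zeros_like(inputs, requires_grad=True)
    for _ in range(K):
        adv_loss = loss_fn(model(inputs + delta), labels)
        grad, = torch.autograd.grad(adv_loss, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()
            delta.clamp_(-epsilon, epsilon)   # project back into the epsilon norm-ball

    # Outer minimisation: one SGD step on the loss at the worst-case perturbation found.
    optimizer.zero_grad()
    loss = loss_fn(model(inputs + delta.detach()), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

Note the K forward-backward passes in the inner loop plus one more for the weight update, which is exactly the extra cost discussed next.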

As the algorithm shows, each weight update in K-PGD requires K forward-backward passes through the network for the inner loop, while a standard SGD update requires only one; the model weights are updated only once, after the K ascent steps. K-PGD adversarial training is therefore computationally much more expensive, requiring roughly K+1 times more computation. There are several proposed solutions to this problem, such as FreeAT[8], YOPO[9], and FreeLB[10], which implement a K-PGD-style adversary with little or no computational overhead; FreeLB in particular applies this idea to NLP by adversarially perturbing the embedding space.

Challenges and Drawbacks of Adversarial Training

Even though adversarial training improves the robustness of models and makes them less vulnerable to adversarial attacks, recent papers show that there exists a trade-off between the generalization of a model (i.e. its standard test accuracy) and its robustness (its accuracy on an adversarial dataset). That is, models trained with an adversarial objective often show an increase in robust accuracy but a decrease in standard accuracy.

According to Tsipras et al.[11], the goals of standard performance and adversarial robustness might be fundamentally at odds. They show that the trade-off exists even in a fairly simple and natural setting, and that its root cause is that the features learned by standard and robust classifiers can be fundamentally different, a phenomenon that persists even in the infinite-data setting.

Raghunathan et al.[12] show that a model trained with an adversarial objective has better robust (adversarial) test accuracy but worse standard test accuracy. Since the training accuracy is 100% in both cases, the conclusion is that training with an adversarial objective hurts the generalisation of the model.

In fact, Min et al.[13] show that this trade-off between generalisation and robustness exists even in the infinite-data limit. In the case of a strong adversary (where ϵ is larger, meaning perturbations can be larger), the standard loss is monotonically increasing even in the infinite-data setting. This observation corroborates Tsipras et al.'s proposition that the features learnt while optimising an adversarial objective are fundamentally different from those learnt while optimising a standard objective. Hence, the stronger the adversary, the more likely it is that the learnt features differ from those learnt under standard optimisation, resulting in the monotonically increasing loss for strong adversaries.

Conclusion

In this blog post, we covered the idea of adversarial examples and how they differ between Computer Vision and NLP. We also covered the concept of adversarial training and robust optimisation, and discussed various algorithms for adversarial training. More importantly, we saw that since the goal of adversarial training is to adversarially optimise the loss function and not to create adversarial examples, the adversarial training algorithms developed in Computer Vision can be applied to NLP. Finally, we looked at the drawbacks that have recently come to light in the field of adversarial training. In a future blog post, I will cover a recent paper, Adversarial Training for Large Neural Language Models, which addresses these limitations and claims an improvement in both robustness and generalisation in an adversarial training setting in NLP.

References

  1. Szegedy, Christian, et al. ‘Intriguing Properties of Neural Networks’. ArXiv:1312.6199 [Cs], Feb. 2014. arXiv.org, http://arxiv.org/abs/1312.6199.
  2. Jia, Robin, and Percy Liang. ‘Adversarial Examples for Evaluating Reading Comprehension Systems’. ArXiv:1707.07328 [Cs], July 2017. arXiv.org, http://arxiv.org/abs/1707.07328.
  3. Liang, Bin, et al. ‘Deep Text Classification Can Be Fooled’. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, 2018, pp. 4208–15. DOI.org (Crossref), doi:10.24963/ijcai.2018/585.
  4. Samanta, Suranjana, and Sameep Mehta. ‘Towards Crafting Text Adversarial Samples’. ArXiv:1707.02812 [Cs], July 2017. arXiv.org, http://arxiv.org/abs/1707.02812.
  5. Jin, Di, et al. ‘Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment’. ArXiv:1907.11932 [Cs], Apr. 2020. arXiv.org, http://arxiv.org/abs/1907.11932.
  6. Goodfellow, Ian J., et al. ‘Explaining and Harnessing Adversarial Examples’. ArXiv:1412.6572 [Cs, Stat], Mar. 2015. arXiv.org, http://arxiv.org/abs/1412.6572.
  7. Madry, Aleksander, et al. ‘Towards Deep Learning Models Resistant to Adversarial Attacks’. ArXiv:1706.06083 [Cs, Stat], Sept. 2019. arXiv.org, http://arxiv.org/abs/1706.06083.
  8. Shafahi, Ali, et al. ‘Adversarial Training for Free!’ ArXiv:1904.12843 [Cs, Stat], Nov. 2019. arXiv.org, http://arxiv.org/abs/1904.12843.
  9. Zhang, Dinghuai, et al. ‘You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle’. ArXiv:1905.00877 [Cs, Math, Stat], Nov. 2019. arXiv.org, http://arxiv.org/abs/1905.00877.
  10. Zhu, Chen, et al. ‘FreeLB: Enhanced Adversarial Training for Natural Language Understanding’. ArXiv:1909.11764 [Cs], Apr. 2020. arXiv.org, http://arxiv.org/abs/1909.11764.
  11. Tsipras, Dimitris, et al. ‘Robustness May Be at Odds with Accuracy’. ArXiv:1805.12152 [Cs, Stat], Sept. 2019. arXiv.org, http://arxiv.org/abs/1805.12152.
  12. Raghunathan, Aditi, et al. ‘Adversarial Training Can Hurt Generalization’. ArXiv:1906.06032 [Cs, Stat], Aug. 2019. arXiv.org, http://arxiv.org/abs/1906.06032.
  13. Min, Yifei, et al. ‘The Curious Case of Adversarially Robust Models: More Data Can Help, Double Descend, or Hurt Generalization’. ArXiv:2002.11080 [Cs, Stat], June 2020. arXiv.org, http://arxiv.org/abs/2002.11080.
