JANSY: Just Another Natural language processing attacker on Sentimental analYsis

Yang Zhou
17 min read · May 8, 2022


Nikhil Kolluri, Young Jae Hur, Aditya Ojha, Yunong Liu, Yang Zhou, Jinze Zhao, Sakib Rahman

A 14-minute video demonstration of the JANSY Attack

1. Abstract

As humans, we all can learn to interpret, understand, and comprehend the spoken and written words of others. We know that words can be used to heal and to hurt, to speak truth and to dispel lies, to be nice and to be mean. Natural Language Processing (NLP) has now allowed computers to do the same. Leveraging large datasets and transformer models like BERT, computers can classify text (e.g. as hateful/offensive or not). But unlike humans, these models can be fooled. In this project we explore various adversarial attacks on state-of-the-art NLP models for text classification. We demonstrate that such models can be fooled, though current models appear fairly robust to attacks. Additionally, most adversarial attack approaches we studied try to change the label of a sentence without changing its meaning. We propose a new method that changes the meaning of the sentence while fooling the classifier into giving the same label as before. Finally, we built a website that lets users query a hate-speech classification model and see how modifying key words in a curated example can change the predicted label.

2. Introduction

Current NLP systems perform very well at many general text processing tasks such as hate speech and offensive speech detection. Such models can improve content filtering on social media platforms, allowing for automated screening of hateful, harmful, or offensive content. But if Tesla cars can mistake the moon for a yellow traffic light [9], surely we can fool these text classifiers.

In this project we used a variety of text classification adversarial attack algorithms and answered two Research Questions:

  1. How robust are current text classification models against adversarial attacks?
  2. When text classification adversarial attack algorithms modify the input sentences, do the attacked sentences make semantic and grammatical sense?

The motivation behind these questions is the use of these models for automated screening of content. For example, if Twitter were to use such models to block hateful content, could bad actors modify their messages to regularly fool this filter? This question relates to Research Question 1. Secondly, when the classifier is fooled, would the resulting sentence still make sense? If it does, then bad actors could successfully spread harmful messages while preserving message meaning to the observer. This question relates to Research Question 2.

2.1 The Importance of Adversarial Attacks

Adversarial examples are crafted inputs that are semantically or visually similar to the original input from a human's point of view but fool machine learning models into predicting differently. Adversarial attacks are an important area of study in machine learning because they expose the weaknesses of current models, and they can be used for additional training to make models more robust. Additionally, models that are less susceptible to attacks are better candidates for use in large-scale applications.

2.2 Prior Text-Based Adversarial Attacks

The area of text-based adversarial attacks has a rich set of existing works. The attacks we studied, along with their respective papers, are:

  • TextFooler [1]
  • BERT-Attack [2]
  • Genetic Attack [3]
  • VIPER [4]

While we learned about white-box attacks on image classification models in class, most of the approaches we studied for text classification deal with black-box attacks. In fact, all of the attacks above are black-box attacks: they treat the victim model as a black box and cannot access any of its internals.

2.3 The JANSY Attack

We found, however, that most methods aim to change a sentence such that its meaning stays the same but its classification changes. In other words, the sentence stays hateful (or not), but the model now misclassifies it as not hateful (or hateful). We thought there is another possible attack: changing the sentence's meaning while keeping the model's prediction the same. Specifically, we make benign comments hateful without changing the model's prediction (i.e. the prediction stays "still benign"). Towards this goal, we propose our own algorithm, the JANSY Attack, which builds off of BERT-Attack. We provide details of this algorithm below.

We also built a website that lets the user interactively visualize the existing attackers and JANSY.

We organize the report as follows. Section 3 discusses popular existing adversarial attackers. Section 4 details how we build the JANSY attacker from the BERT-Attack algorithm. Section 5 describes the website we built to visualize attackers. We report the results of our experiments with all attackers in Section 6 and conclude in Section 7.

3. Explanation of Existing Adversarial Attackers

All of the attackers we studied follow the same three steps:

  1. First, they rank words in the sentence by their importance.
  2. Then, they find replacements for the most important words, such that semantic and grammatical rules are maintained.
  3. Finally, they run each replacement through the victim model until the predicted label changes.

Each algorithm varies in how it performs these three steps. We will explain each algorithm using this three-step structure.
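To make this structure concrete, here is a minimal Python sketch of the shared loop. The names `victim`, `rank_words`, and `get_candidates` are hypothetical placeholders rather than code from any of the attackers, and for brevity the sketch changes only one word at a time, while the real attackers accumulate perturbations across positions.

```python
# Minimal sketch of the shared three-step attack loop.
# `victim`, `rank_words`, and `get_candidates` are hypothetical placeholders;
# each attacker in Sections 3.1-3.3 defines them differently.

def attack(sentence, victim, rank_words, get_candidates):
    original_label = victim.predict(sentence)
    words = sentence.split()

    # Step 1: rank word positions by importance to the victim's prediction.
    for pos in rank_words(words, victim):
        # Step 2: generate grammatically/semantically valid replacements.
        for candidate in get_candidates(words, pos):
            perturbed = words[:pos] + [candidate] + words[pos + 1:]
            # Step 3: keep the first perturbation that flips the label.
            if victim.predict(" ".join(perturbed)) != original_label:
                return " ".join(perturbed)
    return None  # attack failed
```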

3.1 Text Fooler

Word Importance

Remove each word from the sentence and run the sentence through the classifier. Each word is ranked based on how much the classifier’s score changes when that word is removed.
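A rough sketch of this deletion-based ranking, assuming a hypothetical `victim.score(text)` helper that returns the classifier's confidence in the originally predicted class:

```python
# Deletion-based word importance: how much does the classifier's confidence
# drop when each word is removed? (`victim.score` is a hypothetical helper.)
def rank_by_deletion(words, victim):
    base = victim.score(" ".join(words))
    drops = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]            # sentence without word i
        drops.append(base - victim.score(" ".join(reduced)))
    # Larger drop in confidence => more important word.
    return sorted(range(len(words)), key=lambda i: drops[i], reverse=True)
```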

Finding Replacements

  • First, GloVe embeddings are used to find synonyms. Cosine similarity determines closeness (a large cosine similarity indicates a synonym); see the sketch after this list.
  • To check that the replacement makes grammatical sense, a Part-Of-Speech (POS) model is used. If a word does not have the same POS (adjective, adverb, noun, etc.) as the original word, it is no longer considered for replacement.
  • Finally, the Universal Sentence Encoder is used to check whether the sentence with the replaced word is similar in semantic meaning to the original sentence. This encoder turns a sentence into a vector, and cosine similarity is again used to determine whether two sentences are similar.
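As a sketch of the embedding-similarity check mentioned above, assuming `glove` is a dictionary mapping words to their GloVe vectors (the candidate count and threshold here are illustrative, not TextFooler's published settings; the same cosine routine applies to the sentence vectors produced by the Universal Sentence Encoder):

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_synonyms(word, glove, k=50, threshold=0.7):
    """Return up to k words whose GloVe vectors are most similar to `word`."""
    query = glove[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in glove.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, sim in scored[:k] if sim >= threshold]
```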

Finding an Attack

For each replacement word that passes all three checks in Step 2, substitute it into the original sentence, pass the modified sentence to the classifier, and choose the first sentence that changes the label. If none of the sentences change the label, the attack fails.

3.2 BERT Attack

Word Importance

We replace each token, one at a time, with the [MASK] token and feed the masked sentence to the victim model. Tokens are ranked by how much the classifier's score drops when that token is replaced with the [MASK] token.

Finding Replacements

Since BERT is a Masked Language Model (MLM), the authors of BERT-Attack realized that a base BERT model can be used to find replacement words. They feed in the original sentence with the most important words masked out. In theory, since BERT understands grammatical and semantic context, its replacement candidates will fit well in the sentence. For each important word, they generate k replacements using BERT (k is a hyperparameter).
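A simplified sketch of this candidate generation using the Hugging Face fill-mask pipeline; the real BERT-Attack works at the sub-word level and handles multi-token words more carefully than this does.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def mlm_candidates(words, position, k=8):
    """Mask out the word at `position` and let BERT propose k replacements."""
    masked = list(words)
    masked[position] = fill_mask.tokenizer.mask_token   # "[MASK]" for BERT
    predictions = fill_mask(" ".join(masked), top_k=k)
    return [p["token_str"].strip() for p in predictions]

# Example: candidates for the second word of a short sentence.
print(mlm_candidates("that lady is awful".split(), 1))
```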

Finding an Attack

Again, iterate through each replacement word, put it in the sentence, and return the first sentence that changes the label.

3.3 Genetic Attacker

This algorithm is a genetic algorithm. Such algorithms simulate the process of natural selection; many “offspring” of an attacked sentence are created; then a fitness function scores each sentence. The top k scoring sentences “survive” (survival of the fittest), and then these sentences are “bred” together to make more sentences. In terms of our three steps:

Important Words

Initially, random words are selected for replacement. Important words are found implicitly, as each “generation” of sentences comes from only the surviving sentences which change important words.

Finding Replacements

Once a random word is selected, the algorithm uses GloVe embeddings to find closely related words, using Euclidean distance.

Finding the Attack

The fitness function scores sentences on whether they increase the score of the opposite class label.

Once a sentence fools the classifier, that sentence is returned and the algorithm terminates. If no sentence in a “generation” fools the classifier, the algorithm ranks sentences by how much they changed the output score towards the opposing class and chooses the top-k as the species that survive. This is the fitness function. Then these sentences are “bred” together (sentence A and B swap words at corresponding locations) and we start back at step 1.
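A compressed sketch of one generation of this process. Here `fitness(sentence)` is assumed to return the victim's score for the opposite class and `neighbors(word)` the nearby GloVe words; both are placeholders, and the real algorithm [3] includes additional checks we omit here.

```python
import random

def evolve(population, fitness, neighbors, k=10):
    """One generation: select the fittest sentences, breed them, mutate a word."""
    # Survival of the fittest: keep the top-k sentences (each a list of words).
    survivors = sorted(population, key=fitness, reverse=True)[:k]

    children = []
    for _ in range(len(population)):
        # Breeding: each word comes from one of two randomly chosen parents.
        a, b = random.sample(survivors, 2)
        child = [random.choice(pair) for pair in zip(a, b)]
        # Mutation: replace one random word with a GloVe neighbor.
        i = random.randrange(len(child))
        options = neighbors(child[i])
        if options:
            child[i] = random.choice(options)
        children.append(child)
    return children
```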

3.4 VIPER Attack

Main idea: perturb the text such that it is still recognizable by humans but not by NLP systems. This algorithm doesn't fit well into the three-step process we defined earlier, so we describe it differently.

Algorithm

The VIPER attacker is parameterized as VIPER(p, CES), where CES (character embedding space) is the method used to select a visually similar character for a given character in the text (e.g., N is similar to Ň), and p is the probability of flipping each character in the original text. A toy sketch follows the list of CES candidates below.

There are three candidates for the CES:

  1. Image-based character embedding space (ICES): a continuous image-based character embedding (ICE) for each Unicode character. Each character is represented by a 576-dimensional vector, which can be used to calculate the cosine similarity between any character and its visually similar neighbors.
  2. Description-based character embedding space (DCES): based on the textual descriptions of Unicode characters. For example, DCES considers 'a' and 'à' visually similar because 'a' is described as 'latin small letter a' and 'à' is described as 'latin small letter a with grave' in the Unicode 11.0.0 official documentation.
  3. Easy character embedding space (ECES): manually selected simple visual perturbations, containing one nearest neighbor for each of the letters a–z and A–Z.
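Here is a toy sketch of VIPER(p, CES) using a tiny hand-written ECES-style mapping; the real character embedding spaces cover far more of Unicode than this handful of characters.

```python
import random

# Tiny ECES-style mapping from characters to one visually similar neighbor.
TOY_CES = {"N": "Ň", "a": "à", "e": "é", "h": "ĥ", "s": "ś", "u": "ü"}

def viper(text, p=0.4, ces=TOY_CES, seed=0):
    """Flip each character to its visual neighbor with probability p."""
    rng = random.Random(seed)
    return "".join(
        ces[ch] if ch in ces and rng.random() < p else ch
        for ch in text
    )

print(viper("Nobody should trust us"))
```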

4. The JANSY Algorithm

The schematic for JANSY (figure inspired by the BERT-Attack paper)

In this section, we introduce a new attacking algorithm called JANSY, based on BERT-Attack. First, we walk through the BERT-Attack algorithm as implemented. Then, we show where our modifications fit into the attacker.

The figure above is the original BERT-Attack pseudocode

BERT-Attack first ranks all tokens from most to least important (lines 4 to 6). From the most to the least important token, it replaces each with the MLM-generated substitute that has the highest cosine similarity (lines 11 to 17). Then, if the model changes the predicted label, the attack returns; otherwise, the perturbation is accumulated (lines 19 to 25). The questions we want to ask are: can we do the opposite? What happens if we replace the least important tokens with their antonyms and preserve the labels? We formulate an attacking algorithm in response to these questions based on BERT-Attack. Our attacking algorithm is called JANSY, or Just Another Natural language processing attacker on Sentimental analYsis.

JANSY pseudocode

Following the three steps we used to evaluate each attacking method, we modified the original BERT Attack method as follows.

  • For ranking words, we still use the same Masked Language Model (MLM) to find the importance of each token. However, instead of ranking words from highest to lowest importance, we rank them from lowest to highest importance. Additionally, we no longer change tokens that are not words and instead only replace whole words that appear in the WordNet lexical database.
  • For finding substitutes, we look for antonyms. To generate antonyms, we found that simply using cosine similarity as BERT-Attack does doesn't guarantee that the substitutes have the right contextual-semantic meaning for the original sentence. To address this, we use WordNet, a large lexical database organized by relations such as synonymy and antonymy [7, 8]. For every word we want to replace, we first look at the list of antonyms WordNet provides (a minimal sketch of this lookup follows the list). We found that the number of antonyms returned by WordNet is sometimes too small, so we use cosine similarity to enlarge the list of substitutes with words that are closely similar to the found substitutes.
  • For finding the attack, we followed BERT-Attack's approach and used a fine-tuned model to evaluate whether the generated example changes the original label. In our case, if the label changes, our attack fails; if the label is preserved, our attack succeeds. However, unlike BERT-Attack, since we only change the least important words, the attacked sentence might not be semantically different enough to change a human reader's interpretation. To address this, we continue replacing words until the label changes and output the last modified sentence that still preserves the label.
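A minimal sketch of the WordNet antonym lookup described above, using NLTK's WordNet interface (the cosine-similarity enlargement step is omitted here):

```python
from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

def wordnet_antonyms(word):
    """Collect antonyms of `word` across all of its WordNet senses."""
    antonyms = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name().replace("_", " "))
    return sorted(antonyms)

print(wordnet_antonyms("good"))  # e.g. ['bad', 'badness', 'evil', ...]
```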

We feel obliged to mention that, by our definition of an adversarial attack in Section 2.1, JANSY is technically not an adversarial attack. All previously discussed adversarial attacks try to fool the model while preserving the input's semantic meaning for humans. In comparison, JANSY intentionally alters the input's semantic meaning to fool the model. However, we want to emphasize that JANSY is still an attacking algorithm, since it aims to force the victim model to make mistakes.

We discuss JANSY further in Section 5 and report some examples found using JANSY in our video submission.

5. Web-based Classification

In addition to the previously discussed adversarial attacks, we built a web application that visualizes different adversarial attacks and allows users to play around with different inputs to a hate speech classifier. We host the website at this link: ee460j-web-project. While finalizing this report, we experienced some unexpected errors in our deployment of the website. Our video demonstrates the website working locally, and we will continue to troubleshoot the issues. If the above link throws an error, please refer to the video demonstration to see how the website works.

Front-end Web Technologies

The technologies used were:

  • JavaScript — our front-end language.
  • React — to design our website layout.
  • MaterialUI — a React library used to make buttons and dropdown menus.

Back-end Web Technologies

The technologies used were:

  • Flask — to have the frontend call Python functions (a minimal endpoint sketch follows this list).
  • Hugging Face — for running pretrained NLP classification models.
  • OpenAttack — for running attacking algorithms.
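A minimal sketch of how such a back end can be wired together. The `/classify` route and JSON shape here are illustrative placeholders, not necessarily our exact implementation:

```python
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)
# The hate-speech classifier used elsewhere in this report (see Section 6).
hate_clf = pipeline("text-classification",
                    model="cardiffnlp/twitter-roberta-base-hate")

@app.route("/classify", methods=["POST"])
def classify():
    text = request.json["text"]
    result = hate_clf(text)[0]   # e.g. {"label": "...", "score": 0.97}
    return jsonify(result)

if __name__ == "__main__":
    app.run(port=5000)
```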

Site Functionality

The key functionalities of the site were:

  • Classify text using a hate speech classifier.
  • Dynamically generate text boxes in a format intended to work with BERT-Attack output files. While we ultimately hard-coded the examples due to BERT-Attack issues, our code could likely be modified to take BERT-Attack outputs as input. The suggested word changes are presented as dropdown menus, so the user can test how each substitution affects the classification output.
Figure: Visual of our web-based site. In the figure, a period at the end of the sentence makes the difference between a soft output of 0.50 (hate speech) and 0.47 (not hate speech).

6. Experiments

We wrap existing models in a framework such as OpenAttack [5] to perform adversarial attacks. We do this for Text Fooler, BERT Attack, the Genetic Attacker, and VIPER; in-depth explanations of these approaches are in Sections 3.1, 3.2, 3.3, and 3.4, respectively. We report the results of applying these attacks to hate speech and offensive speech classifiers here.
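As a hedged sketch of what this wrapping looks like, the snippet below closely follows OpenAttack's quick-start example. The demo victim and dataset shown are OpenAttack's bundled BERT/SST pair rather than our cardiffnlp models (which require a custom victim wrapper), and the API names may differ between OpenAttack versions:

```python
import OpenAttack as oa
import datasets

# Map the SST dataset into the {"x": text, "y": label} format OpenAttack expects.
def dataset_mapping(x):
    return {"x": x["sentence"], "y": 1 if x["label"] > 0.5 else 0}

dataset = datasets.load_dataset("sst", split="train[:20]").map(function=dataset_mapping)

victim = oa.DataManager.loadVictim("BERT.SST")   # bundled demo victim model
attacker = oa.attackers.TextFoolerAttacker()     # could also be BERTAttacker, etc.

attack_eval = oa.AttackEval(attacker, victim)
attack_eval.eval(dataset, visualize=True)        # prints original vs. attacked text
```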

**The content below may contain hateful or derogatory statements. These views do not represent the views of the authors and members of this group.**

The model we used for hate speech detection is hosted on Hugging Face as “cardiffnlp/twitter-roberta-base-hate”. The model used for offensive speech detection is hosted on Hugging Face as “cardiffnlp/twitter-roberta-base-offensive”.

6.1 Examples of Good Attacks

Below we show four examples of attacks that are reasonable. That is, the attacked sentence conveys the same idea as the original sentence, and the classifier labels each sentence differently.

Table 1

As we see in Table 1, the attacker changes "deport" to "extradite" and lowercases "We". The first change lets the sentence still express its sentiment against immigrants. The lowercasing is a more interesting case: some tokenizers distinguish between capitalized and uncapitalized words, and here the attacker exploits this to change the label. Notice how the predicted label changes from hate to not-hate; this means a bad actor could send the attacked message and it wouldn't be flagged by our hate speech model.

Table 2

In Table 2, we see an attack from BERT Attack. It changes “lady” to “girl”. In the context of the sentence, the phrase conveys the same message. Again, this attack changes a hateful sentence to being classified as non-hateful.

Table 3

In Table 3, we have a good example of the Genetic Attack. It changes "kidnapped" to "snatch". Though the attacked text has a grammatical error ("snatch" should be "snatched"), the Genetic Attack still finds a very close synonym of "kidnapped", and the model's label changes successfully.

Table 4

Finally, we show an attack from VIPER. It changes the characters used for "N", "h", "e", "s", and "u". In this case the attack makes a normal sentence get classified as hate speech. This means that if someone accidentally uses special characters in their social media posts, the filter could wrongly identify the message as hate speech. Alternatively, someone may be able to evade a hate-speech filter by using special characters.

6.2 Examples of Bad Attacks

There were also examples of attacks that didn't make sense. That is, the attacked sentence didn't follow the rules of English grammar or didn't convey the same meaning.

Table 5

Text Fooler lowercases "I" and changes "young" to "untested", which doesn't make sense in context.

Table 6

BERT Attack makes some reasonable substitutions like switching “probably” with “likely”, but the vast majority of changes don’t make sense. “Run” is changed to “bleed”, and later to “die”; “don’t” is changed to “behaven’t” (which isn’t a word!). The attacked sentence wouldn’t convey the same meaning as the original to someone reading it.

Table 7

Finally, we show a Genetic Attack example. Most attacks generated by the genetic method were bad attacks. We believe we get such odd results because the algorithm randomly selects words to change both when creating an offspring and when breeding two sentences together. In Table 7, we see that "There" and "White" are lowercased, "little" is turned into "small", and "time" is turned into "clock". The last two changes, by themselves, make sense, since these words have similar meanings; therefore their GloVe embeddings would be similar, and the attacker would choose them as replacements. However, these replacements result in sentences that don't follow English grammar or make sense semantically.

6.3 JANSY-Attack Results

We used the GitHub repository for BERT-Attack to implement the JANSY attack. Results for JANSY-Attack are shown in the video. While we believe our code successfully runs and generates replacements (as shown in the video), we are not confident in the models used. Specifically, our modified BERT-Attack implementation used bert-base-uncased as both the attacker and the target model. While this leads us to believe that further analysis with new target models is needed to validate our results, we still wanted to share what our current modified implementation could produce.

7. Conclusion And Future Works

Based on our adversarial attacks on the pretrained models, we found that it is possible to fool the classifiers either by changing the meaning of words in a sentence while maintaining the label (e.g. providing antonyms) or by changing the label while maintaining the meaning (e.g. providing synonyms). This answers Research Question 1: we believe current NLP models are not robust enough in the face of adversarial attackers. However, we also saw that the attacked text was often not coherent enough to be understandable, instead producing nonsensical word substitutions. This answers Research Question 2: in most cases where the classifiers were fooled, the output did not make sense grammatically or contextually.

As mentioned in the video submission, there is room for JANSY to be improved. Currently, JANSY only considers each token by itself, rather than how the replacement token fits into the entire sentence. Because of that, JANSY will always replace the same word with the same substitute. To resolve these issues, one could implement the approach shown in the following figure.

JANSY with sentence viewing

The original input should be fed into both the MLM and WordNet. The MLM generates substitutes that fit into the sentence; WordNet generates antonyms. One can then compute the cosine similarity between each MLM substitute and each WordNet antonym and use the substitute with the highest similarity to replace the token in the input sentence. The MLM ensures that the chosen substitute fits the original sentence, while WordNet ensures that it is semantically opposite to the token being replaced. Also, since the context changes, the MLM will produce different substitutes, so the substitute chosen for the same token can differ across contexts. A rough sketch of this idea follows.
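The sketch below assumes `embed` is a word-embedding lookup such as a GloVe dictionary; the function names and parameters are illustrative, not an implementation of this future work:

```python
import numpy as np
from nltk.corpus import wordnet
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def antonyms(word):
    """All WordNet antonyms of `word`, across its senses."""
    return {a.name() for s in wordnet.synsets(word)
            for l in s.lemmas() for a in l.antonyms()}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def contextual_antonym(words, position, embed, k=20):
    """Pick the MLM candidate closest to a WordNet antonym of the original word."""
    masked = list(words)
    masked[position] = fill_mask.tokenizer.mask_token
    candidates = [p["token_str"].strip()
                  for p in fill_mask(" ".join(masked), top_k=k)]
    targets = antonyms(words[position])

    best, best_sim = None, -1.0
    for cand in candidates:
        for tgt in targets:
            if cand in embed and tgt in embed:
                sim = cosine(embed[cand], embed[tgt])
                if sim > best_sim:
                    best, best_sim = cand, sim
    return best  # None if no candidate/antonym pair could be scored
```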

A future direction is to expand our project beyond text classification. Within text classification, we could explore different types of models, such as translation from one language to another, or adversarially attack Q&A models. Another topic is multi-modal attacks (such as text to images or vice versa). The most well-known API in this space is GPT-3, which makes it easy to experiment with different types of models; we could try to fool such models by feeding in adversarial meanings of words.

Another topic is multi-class classification. In our project, we focused on binary classification (i.e., whether the provided input text is hate speech or not). If we expand our scope to multi-class or even multi-modal classification (such as more granular text classification), it may be possible to attack pretrained models in new domains using a similar approach to ours. Finally, we could package JANSY as a Chrome extension: based on the websites a user visits, it could notify the user whether the content could be adversarially attacked using a specified model and attack type.

8. Web Programming Resources We Used

9. References

The outside resources we used are primarily covered within this blog post (with some included as hyperlinks in the text). We ran pre-existing attack approaches using wrappers such as OpenAttack and models from Hugging Face, modified code from BERTAttack to create JANSY-Attack, and created a website using React, Flask, and MaterialUI (for user interface). While further references are provided in the code for details we believe are smaller, such as bug fixes on StackOverflow or using example code from the documentation of the sources mentioned, our primary usages of outside resources should be easily understood from our blog post and video (e.g. using pre-built software such as BERTAttack, VIPERAttack, and MaterialUI). Nonetheless, we include additional links in this section in case they are helpful.

[1] Jin, D., Jin, Z., Zhou, J. T., & Szolovits, P. (2020, April). Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 05, pp. 8018–8025).

[2] Li, L., Ma, R., Guo, Q., Xue, X., & Qiu, X. (2020). BERT-ATTACK: Adversarial attack against BERT using BERT. arXiv preprint arXiv:2004.09984.

[3] Alzantot, M., Sharma, Y., Elgohary, A., Ho, B. J., Srivastava, M., & Chang, K. W. (2018). Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998.

[4] Eger, S., Şahin, G. G., Rücklé, A., Lee, J. U., Schulz, C., Mesgar, M., … & Gurevych, I. (2019). Text processing like humans do: Visually attacking and shielding NLP systems. arXiv preprint arXiv:1903.11508.

[5] Zeng, G., Qi, F., Zhou, Q., Zhang, T., Ma, Z., Hou, B., … & Sun, M. (2020). OpenAttack: An open-source textual adversarial attack toolkit. arXiv preprint arXiv:2009.09191.

[6] Morris, J. X., Lifland, E., Yoo, J. Y., Grigsby, J., Jin, D., & Qi, Y. (2020). TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. arXiv preprint arXiv:2005.05909.

[7] Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39–41.

[8] Miller, G. A. (1998). WordNet: An electronic lexical database. MIT press.

[9] Jain, S. (2021, July 27). Watch: Tesla autopilot feature mistakes moon for yellow traffic light. NDTV.com. Retrieved May 8, 2022, from https://www.ndtv.com/offbeat/watch-tesla-autopilot-feature-mistakes-moon-for-yellow-traffic-light-2495804
