NLP Attacks, Part 1 — Why you shouldn’t trust your text classification models

Mohamed Bamouh
Besedo Engineering Blog
Dec 20, 2022

This blog post series is about a vast and vital field that combines Artificial Intelligence and Linguistics: NLP Attacks.

NLP stands for Natural Language Processing, a sub-field of Data Science that studies and analyzes how human language is represented and processed by computers and embedded systems.

Case study

Imagine you are a data scientist working on a popular social network’s content moderation team, and you have just deployed a toxicity detection model to flag profanities, insults, and all kinds of hate speech.

In the beginning, everything goes flawlessly; the messages posted by trolls, haters, troublemakers, and other very mean people are automatically detected by your model. However, over time, you realize that an increasing number of toxic messages are classified as non-toxic by your model.

As a good data scientist, you gather some of those messages to create a dataset and analyze it. You observe that your model fails to generalize its toxicity detection abilities to certain patterns of messages.

A particular example catches your attention: a user called “JohnWick546” posted the following message: “You ІDІОТS !!”. You’re taken aback by this obvious display of hate, but you’re even more surprised that your text model classified it as non-toxic with a confidence score of 0.70.

You scratch your head and run a quick Python function to show the byte representation of the message (how the machine stores the string) in UTF-8 encoding (how most text content is stored nowadays).

You quickly notice that the word ІDІОТS mixes characters from the Latin and Cyrillic alphabets (the Cyrillic alphabet contains letters that are visually identical to some Latin characters). This explains the misclassification: even though your model “knows” that the word “IDIOT” written with Latin characters is associated with toxicity, it has never seen such an elaborate character chain before, crafted precisely to blind it.
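A minimal sketch of such a check, using only the standard library (the exact characters in an attacker’s message may of course differ):

```python
import unicodedata

message = "You ІDІОТS !!"

# How the machine stores the string: the look-alike letters need two bytes each in UTF-8.
print(message.encode("utf-8"))

# Which alphabet each code point really comes from:
for char in message:
    print(f"{char!r}  U+{ord(char):04X}  {unicodedata.name(char, 'UNKNOWN')}")
# The genuine letters print as "LATIN CAPITAL LETTER ...", while the
# look-alikes print as "CYRILLIC CAPITAL LETTER ...".
```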

With horror and disbelief, you realize that Pandora’s box has been opened and that your users have found a way to fool your text classification model. Not only does your model fail to generalize its toxicity detection abilities to patterns of messages which were underrepresented in your dataset, it is also dangerously vulnerable to certain kinds of messages.

What is going on?

In content moderation, we deal with lots of text data submitted by users from different backgrounds, cultures, goals, mentalities, and interests. This is why, at Besedo’s Data Science Research team, we ensure our models and filters are as robust as possible through extensive data analysis, rigorous feature extraction, and meticulous training, benchmarking, and evaluation of text models.

The internet is vast, and though our users converse in natural language, a certain proportion of those users know that their messages go through automated content moderation pipelines and try to circumvent them by using various methods. We will refer to those attempts to fool our text models as “NLP Attacks.”

What are NLP Attacks?

NLP attacks are modifications of text data that intentionally exploit vulnerabilities in semantic/syntactic filters and/or machine learning models to degrade their performance.

Why are NLP Attacks dangerous?

NLP attacks can:

  • Heavily disrupt model performance ⇒ The model is unable to recognize tokens. Tokens missing from the model’s dictionary are either ignored or replaced by <UNK> tokens (see the sketch after this list).
  • Bypass some filters ⇒ By fooling regex patterns and other character-based filters.
  • Slow down model inference ⇒ For example, by adding thousands of invisible characters to a character chain.
  • Introduce unwanted biases in our training datasets ⇒ A clean dataset is absolutely vital to train quality models. NLP attacks, by nature, introduce a lot of noise into the dataset, which in turn perturbs the model’s capacity to recognize patterns.
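To make the first point concrete, here is a toy illustration (a made-up three-word vocabulary, not a real model) of how an attacked word simply disappears from the model’s point of view:

```python
# Toy vocabulary: anything outside it is mapped to the <UNK> token.
vocab = {"<UNK>": 0, "you": 1, "idiots": 2}

def encode(text: str) -> list[int]:
    return [vocab.get(token.lower(), vocab["<UNK>"]) for token in text.split()]

print(encode("you IDIOTS"))   # [1, 2] -> the toxic word is recognized
print(encode("you ІDІОТS"))   # [1, 0] -> the homoglyph version becomes <UNK>
```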

In which forms can NLP Attacks manifest?

Overall, we can distinguish two types of NLP Attacks:

  • Unicode Attacks (“Bad” Characters)
  • Adversarial NLP Attacks

Unicode Attacks

Unicode is a standard used to store text in a unified way across all electronic systems. It is used nowadays to represent hundreds of thousands of unique letters, symbols, emojis, and special characters.

We define “Unicode Attacks” as any attempt at fooling filters and/or machine learning models by changing the text in such a way that:

  • The information contained in the text stays the same.
  • The characters contained in the text stay visually similar to the original text.
  • The targeted characters/tokens/words become unrecognizable by the model/filter.

Examples:

Original text:

This is a sentence

Examples of Unicode attacks:

  • 𝑻𝒉𝒊𝒔 𝒊𝒔 𝒂 𝒔𝒆𝒏𝒕𝒆𝒏𝒄𝒆 (using mathematical symbols instead of Latin letters)
  • This is 🇦 se🇳🇹ence (using emojis)
  • ᵀʰⁱˢ ⁱˢ ᵃ ˢᵉⁿᵗᵉⁿᶜᵉ (using superscript characters)
  • Th؜؜is i؜؜s a sen؜؜ten؜؜؜ce (using invisible characters)

The aforementioned examples would be unrecognizable by a regex matcher trying to match character patterns written with Latin letters only. For a machine learning model, each of these occurrences of the word ‘sentence’ is a completely different input, which results in erratic predictions.
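A short sketch of the regex point, using Python’s built-in re module:

```python
import re

clean    = "This is a sentence"
attacked = "𝑻𝒉𝒊𝒔 𝒊𝒔 𝒂 𝒔𝒆𝒏𝒕𝒆𝒏𝒄𝒆"   # mathematical alphanumeric homoglyphs

# A naive filter looking for the Latin word "sentence":
pattern = re.compile(r"sentence", re.IGNORECASE)

print(bool(pattern.search(clean)))     # True
print(bool(pattern.search(attacked)))  # False -- the filter is silently bypassed
```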

The paper Bad Characters: Imperceptible NLP Attacks [1] identifies four sub-types of Unicode attacks:

  • Invisible characters
  • Homoglyphs (characters that are rendered the same way as common characters)
  • Reorderings (using control characters, such as BIDI characters, to change the ordering of a text while keeping it visually the same)
  • Deletion (using deletion control characters, such as the backspace character, the [DEL] character, or the carriage return)

In the context of Automatic Text Translation from English to French, the following examples are given:

Invisible Characters (left) / Deletion (right) [1]
Reorderings (left) / Homoglyphs (right) [1]
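Each of these sub-types can be reproduced locally with a handful of raw code points. A minimal sketch (how each string actually renders depends on your font and terminal):

```python
# U+200B ZERO WIDTH SPACE         -> invisible character inside the word
# U+0435 CYRILLIC SMALL LETTER IE -> homoglyph of the Latin "e"
# U+202E RIGHT-TO-LEFT OVERRIDE   -> reordering control character
# U+0008 BACKSPACE                -> deletion control character
attacks = {
    "invisible":  "sen\u200btence",
    "homoglyph":  "sentence".replace("e", "\u0435"),
    "reordering": "\u202e" + "ecnetnes",
    "deletion":   "sentencex\u0008",
}

for kind, text in attacks.items():
    # Every variant may look like "sentence" to a human, but none of them
    # equals the clean string the model was trained on.
    print(f"{kind:<10} {text!r:<30} == 'sentence'? {text == 'sentence'}")
```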

How can I counter Unicode Attacks?

To counter this kind of NLP attack, we should be able to:

  • Detect misspellings and unknown words
  • Detect text order (left-to-right or right-to-left)
  • Apply Unicode Normalization (NFKD and/or NFKC) (see the sketch after this list)
  • Build language models to compute the probability of a certain character “x” existing in a given character chain
  • Apply OCR (Optical Character Recognition) to our input text to avoid the issue of invisible characters and homoglyphs, since the OCR model will “see” the text the same way a human would. This solution is presented in [1]
  • Group visually similar Unicode characters together by using clustering methods (see figure below)
  • Invest more effort and resources into more rigorous testing of filters and models (we’ll get back to model testing in the next section).
Clustering of Unicode characters [1]
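As an example of the normalization bullet above, Python’s standard unicodedata module can undo compatibility characters. Note that it does not help against cross-script homoglyphs, which is why the clustering and OCR approaches are still needed:

```python
import unicodedata

samples = [
    "𝑻𝒉𝒊𝒔 𝒊𝒔 𝒂 𝒔𝒆𝒏𝒕𝒆𝒏𝒄𝒆",  # mathematical alphanumeric symbols
    "ᵀʰⁱˢ ⁱˢ ᵃ ˢᵉⁿᵗᵉⁿᶜᵉ",      # superscript / modifier letters
    "sentenc\u0435",              # Cyrillic homoglyph of "e"
]

for text in samples:
    print(unicodedata.normalize("NFKC", text))
# The first two come back as plain "This is a sentence";
# the Cyrillic homoglyph survives NFKC untouched.
```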

Adversarial NLP Attacks

Adversarial NLP Attacks refer to any text containing “perturbations” designed to fool a machine learning model. Those perturbations can manifest in multiple forms:

  • All the methods presented in the previous section about Unicode attacks.
  • Using words/tokens which are absent from the model vocabulary (or poorly represented due to their rarity)
    - Voluntary, intended misspellings of words
    - Rare words
    - Mixing languages (if the model was trained on a specific language)
  • Swapping specific words/tokens to confuse the model.

The following example is taken from the paper Towards a Robust Deep Neural Network in Texts: A Survey [2].

Example of an adversarial NLP Attack [2]

The simple act of swapping two words in the text makes the sentiment analysis model go from a “positive” prediction to a “negative” one.

Note: in the context of our research, the difference between Unicode attacks and adversarial NLP attacks lies in the intent and the methods used to generate examples. In the former, the perturbations consist of changing Unicode code points while minimizing visual changes to the original text, in order to fool both text filters and NLP models. In the latter, the perturbations are introduced by crafting text examples specifically made to fool NLP models.

Adversarial NLP attacks can be launched against an NLP model from two starting points: white-box attacks and black-box attacks.

White-box attacks

To launch white-box attacks, the attacker must have a copy of the model’s architecture and weights, then identify the perturbations which:

  • Maximize the loss function to confuse the model and make it predict a different class than the one it is supposed to predict (in the context of text classification)
  • Target a specific class to force the model to predict this specific class

The attacker then generates specially crafted examples containing those perturbations.
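As a rough illustration (a toy model, not any specific published attack), white-box access lets the attacker use gradients directly: the gradient of the loss with respect to the token embeddings gives a first-order estimate of which token swap increases the loss the most, the idea behind HotFlip-style attacks:

```python
import torch
import torch.nn as nn

# Toy "model": an embedding layer plus a linear classifier over the mean embedding.
vocab_size, embed_dim, num_classes = 100, 16, 2
embedding = nn.Embedding(vocab_size, embed_dim)
classifier = nn.Linear(embed_dim, num_classes)

token_ids = torch.tensor([5, 17, 42])   # a tiny 3-token "sentence"
label = torch.tensor([1])               # its ground-truth class

embeds = embedding(token_ids)           # (3, embed_dim)
embeds.retain_grad()
logits = classifier(embeds.mean(dim=0, keepdim=True))
loss = nn.functional.cross_entropy(logits, label)
loss.backward()

# First-order estimate of the loss change when token i is replaced by token v:
# (E[v] - E[token_i]) . grad_i  -- pick the (v, i) pair that maximizes it.
grad = embeds.grad                                  # (3, embed_dim)
all_embeds = embedding.weight.detach()              # (vocab_size, embed_dim)
scores = all_embeds @ grad.T - (embeds.detach() * grad).sum(dim=1)  # (vocab_size, 3)
best_token, best_pos = divmod(int(scores.argmax()), scores.shape[1])
print(f"swap position {best_pos} for token id {best_token}")
```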

Black-box attacks

To launch black-box attacks, the attacker only needs access to an endpoint from which they can run as many model inferences as they wish, without knowing the model’s architecture and/or weights.

Instead of relying on the loss function to identify the best perturbations to confuse the model, the attacker can use other means, such as the word importance metric.

This metric allows the attacker to identify the words/tokens which contribute the most to the model predicting a specific class during inference.

The easiest way to compute this metric is to run repeated forward passes on an input sentence and check how the model’s confidence in the ground-truth label changes each time a single word/token is deleted.
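A toy sketch of that leave-one-out procedure (model_confidence is hypothetical and stands for a single call to the target model’s endpoint, returning its confidence in the ground-truth label):

```python
def word_importance(sentence: str, model_confidence) -> list[tuple[str, float]]:
    """Rank words by how much the model's confidence drops when they are removed."""
    words = sentence.split()
    base = model_confidence(sentence)
    scores = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append((word, base - model_confidence(reduced)))
    # The bigger the drop, the more the word contributed to the prediction.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```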

Multiple open-source attack frameworks (such as TextAttack) exist to generate adversarial NLP examples tailored to a given NLP model, for the purpose of testing it and/or generating examples for data augmentation. The overall algorithm used to generate them is as follows:

  1. Rank words by importance
  2. Replace the most important words with synonyms (using masked language models such as BERT, or word embeddings)
  3. Adapt the replaced words to the sentence using constraints over the generated adversarial examples (Part-of-Speech consistency, for instance: making sure a verb is replaced by another verb)
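For reference, a hedged sketch of what this looks like with TextAttack and a Hugging Face model (class and recipe names follow the TextAttack documentation and may differ across versions; the model name below is just one example of a fine-tuned classifier):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from textattack.models.wrappers import HuggingFaceModelWrapper
from textattack.attack_recipes import TextFoolerJin2019

# Any fine-tuned text classification model works here.
name = "textattack/bert-base-uncased-SST-2"
model = AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
wrapper = HuggingFaceModelWrapper(model, tokenizer)

# TextFooler follows the recipe above: word importance ranking,
# synonym substitution, then POS / semantic-similarity constraints.
attack = TextFoolerJin2019.build(wrapper)
result = attack.attack("The movie was a wonderful surprise.", 1)  # 1 = positive
print(result)
```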

How can I counter Adversarial NLP Attacks?

To counter this kind of NLP Attack, we should be able to:

  • Invest more effort into text normalization during data preprocessing and vectorization, before feeding the datasets to our models. This can be done by:
    - Correcting misspellings
    - Applying Unicode Normalization
    - Mapping tokens into a small and robust embedding space, so that syntactically and semantically similar words end up close together
  • Test NLP models. We mentioned model/filter testing in the previous section. One kind of test suited to our use case is behavioral testing, which is, broadly speaking, a unit test to ensure model robustness. Article [4] presents good ideas and best practices for writing those tests, such as:
    - Minimum Functionality Test (MFT) ⇒ “we could define an MFT as a small, synthetic, targeted test dataset. For each record, we know the expected answer so we can easily compare it to model predictions.” [4]
    - Invariance Test (INV) ⇒ “Invariance test checks if a modification introduced to the test case does not change the prediction.” [4] (a minimal sketch follows this list)
    - Directional Expectation (DIR) ⇒ “Here we expect that the prediction will change in a specific direction” [4]
  • Apply Adversarial Training. This can be done by:
    - Applying data augmentation in general.
    - In particular, using frameworks to generate adversarial NLP examples, which are then fed to the model to increase its robustness against NLP Attacks.
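Here is the minimal invariance-test sketch mentioned in the behavioral testing bullet (predict_toxicity is hypothetical and stands for your deployed model’s prediction function):

```python
def test_invariance_to_homoglyphs(predict_toxicity):
    """INV test: a homoglyph perturbation must not change the predicted label."""
    original = "You IDIOTS !!"
    perturbed = original.replace("O", "\u041e")  # U+041E CYRILLIC CAPITAL LETTER O
    assert predict_toxicity(original) == predict_toxicity(perturbed)
```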

The following picture, from the paper Towards Improving Adversarial Training of NLP Models [3], shows a model training pipeline that gracefully integrates an iterative adversarial training phase.

Example of an adversarial training pipeline [3]

Conclusion

Human moderators cannot keep up with the ever-growing volume of user-generated content on the internet, so automatic content moderation through NLP models and filters is increasingly necessary; with it comes the need to identify and counter ill-intentioned attempts to exploit vulnerabilities in these automatic moderation tools.

This chapter introduced the common forms those “attacks” take, with examples as well as some solutions to counter them.

The next chapter will delve more into detail about our research around this domain.

Sources:

[1]: Bad Characters: Imperceptible NLP Attacks, Nicholas Boucher et al. https://arxiv.org/abs/2106.09898 (2021)

[2]: Towards a Robust Deep Neural Network in Texts: A Survey, Wenqi Wang et al. https://arxiv.org/abs/1902.07285 (2019)

[3]: Towards Improving Adversarial Training of NLP Models, Jin Yong Yoo et al. https://arxiv.org/abs/2109.00544 (2021)

[4]: Metrics are not enough — you need behavioral tests for NLP, https://towardsdatascience.com/metrics-are-not-enough-you-need-behavioral-tests-for-nlp-5e7bb600aa57
