The problem of model sensitivity in Natural Language Processing (NLP) and how to overcome it

Laura Francis · Published in Deeper Insights · Jun 14, 2022

Transformer models have been shown to be highly sensitive to noisy real-world data. How bad is the problem and what can we do to fix it?

An interesting paper from the Institute for Artificial Intelligence at the Medical University of Vienna, Austria, studied the robustness of neural language models to input perturbations in NLP.

The paper notes that high-performance neural language models have achieved state-of-the-art results on a wide range of Natural Language Processing (NLP) tasks; however, results on common benchmark datasets often do not reflect model reliability and robustness when applied to noisy, real-world data.

The authors conducted comprehensive experiments across several NLP tasks, investigating the ability of high-performance language models such as BERT, XLNet, RoBERTa, and ELMo to handle different types of input perturbations.

The results reported in the paper suggest that language models are sensitive to input deviations and that their performance can degrade even when only small changes are introduced.

Key Takeaway

Extremely minor changes to the text input of a trained model cause a large (>0.1) reduction in F1 score across all studied tasks.
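As a rough illustration of how such a gap might be measured (the paper's exact protocol differs), here is a minimal sketch that perturbs each test sentence with a single character swap and compares the model's F1 score before and after; the `predict` and `f1` functions are hypothetical placeholders for the model under test and the task's metric.

```python
import random

def add_typo(text: str, seed: int = 0) -> str:
    """Introduce one typo by swapping two adjacent characters."""
    rng = random.Random(seed)
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def sensitivity_gap(sentences, gold, predict, f1):
    """Compare F1 on clean vs. minimally perturbed inputs.

    `predict` and `f1` are hypothetical placeholders: `predict` wraps
    a trained model, `f1` scores predictions against gold labels.
    """
    clean = f1(gold, [predict(s) for s in sentences])
    noisy = f1(gold, [predict(add_typo(s)) for s in sentences])
    return clean - noisy  # a gap > 0.1 would mirror the paper's finding
```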

Key quotes from the paper:

“Even a well-trained, high-performance deep language model can be sensitive to negligible changes in the input that cause the model to make erroneous decisions”

And

“it may be too simplistic to only rely on accuracy scores obtained on benchmark datasets when evaluating the robustness of NLP systems”

Deeper Insights Findings:

  • Transformer models are very sensitive to perturbations.
  • Small changes (typos, missing/additional words, re-ordering) can cause different results.
  • A typo (misspelling Los Angeles) actually improves the predictions in one of our tests (see image below).
  • “Blah Blah Ltd” is extracted as the vendor name in the first run versus “Blah Ltd” in the second (see image below); a minimal sketch of this kind of check follows the list.
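The sketch below shows the kind of side-by-side check behind these findings, using the Hugging Face `transformers` NER pipeline; the model checkpoint and sentences are illustrative choices, not the exact setup from our tests.

```python
from transformers import pipeline

# Illustrative checkpoint; any token-classification model works here.
ner = pipeline("ner",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")

clean = "The invoice was issued by Blah Blah Ltd in Los Angeles."
typo  = "The invoice was issued by Blah Blah Ltd in Los Angelas."  # one-letter typo

for text in (clean, typo):
    entities = [(e["entity_group"], e["word"]) for e in ner(text)]
    print(text)
    print(entities)  # extracted entity spans can differ between the two inputs
```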

How do we fix it?

From the paper:

  • Use NLP-Perturbation [github] in tandem with CheckList [github] and other tools to test the sensitivity of models to perturbations.
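A quick sensitivity probe along these lines might look like the sketch below, assuming CheckList's `Perturb.add_typos` helper; the `predict` function is a hypothetical stand-in for whatever model is under test.

```python
from checklist.perturb import Perturb

sentences = ["Payment is due to Blah Blah Ltd by Friday."]

# Generate a typo'd variant of each sentence (one adjacent-character swap).
perturbed = [Perturb.add_typos(s, typos=1) for s in sentences]

def flag_sensitive(predict):
    """Return the (original, perturbed) pairs whose prediction flips.

    `predict` is a hypothetical wrapper around the model under test.
    """
    return [(orig, pert)
            for orig, pert in zip(sentences, perturbed)
            if predict(orig) != predict(pert)]
```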

From Deeper Insights:

Further Reading:

Deeper Insights recommends this paper’s Section 2, “Related Work”, for an overview of the field.

Our closing comment:

As Deep Learning and transformer models become more widespread, so too do their challenges and pitfalls. Traditional Data Science and Machine Learning methods no longer suffice: training and running a model are just a small part of building and maintaining a productive, robust AI solution. Domain knowledge and subject matter expertise are imperative to any viable long-term solution.

Deeper Insights is the leading Data Science and AI/Machine Learning company helping organisations across industries unlock the transformative power of AI.

Find out more about our services or email us at Sales@deeperinsights.com
