textnoisr: A Journey into Noise Calibration for NLP

Lilian Sanselme · Published in Preligens Stories · Jan 9, 2024

We explore the challenges of noise calibration in Natural Language Processing and present textnoisr, a tool that is both accurate and efficient.

Genesis of the Package

A few months ago at Preligens, we wanted to assess how some of our Natural Language Processing (NLP) models would perform in the presence of noise.

In this context, a model could be one designed to translate text; and “noise” can refer to a variety of disruptions, such as typos (the kind of mistakes that occur when typing on a keyboard) or character recognition errors (like when a scanner confuses a lowercase “l” with an uppercase “I”).

Several examples of typos in Kubrick’s The Shining

This project provided me with an opportunity to collaborate with some of the people from our Research Team, on an article about A Quantitative Analysis of Noise Impact on Document Ranking.

We had a few straightforward questions: How would our model perform if the input text were noisy at a level of 1%? What about 2% or 5%? At what point would the model's performance drop below our requirements? In the scientific literature, the standard way to measure such noise is the “Character Error Rate” (CER), which essentially counts the number of insertions, deletions and substitutions needed to turn one text into the other, divided by the length of the reference text.
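To make this concrete, here is a minimal sketch of how a CER can be computed, using a textbook dynamic-programming Levenshtein distance (the helper names are ours, for illustration, and are not part of any library):

def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn `reference` into `hypothesis`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution
            ))
        previous = current
    return previous[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

print(cer("STEAM", "STEAL"))  # 1 substitution out of 5 characters -> 0.2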

Four types of “actions” are commonly applied to “noisify” a text:

  • insert a random character, e.g.: STEAM → STREAM,
  • delete a random character, e.g.: STEAM → TEAM,
  • substitute a random character, e.g.: STEAM → STEAL,
  • swap two consecutive characters, e.g.: STEAM → STEMA.

Those actions are available in the Python package nlpaug, along with several other types of transformations, ranging from character-level errors (our area of interest) to audio augmentation. So nlpaug was the tool we used in the article.
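For reference, character-level augmentation with nlpaug looks roughly like this (a sketch from memory; the exact API may differ between versions):

import nlpaug.augmenter.char as nac

# Other supported actions include "insert", "substitute" and "delete".
aug = nac.RandomCharAug(action="swap")
# `augment` returns the noisified text (a list in recent versions).
print(aug.augment("The duck-billed platypus is a small mammal."))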

Why Develop textnoisr Instead of Using nlpaug?

But while working on this topic, we discovered that nlpaug was not ideally suited to this specific task: adding noise to a large dataset with very fine control over the noise level. Despite its popularity on GitHub and its broad capabilities in NLP augmentation, nlpaug was not specifically designed to be calibrated according to the CER.

Calibration curve for nlpaug. The closer to the diagonal, the better.

As can be seen, nlpaug's calibration suffers from several problems: it is nonlinear, discontinuous, not even monotonic, and has a very low dynamic range. We struggled to use it to test our algorithms at several different low noise levels.

In response to this, a few colleagues from the AI Engineering team and I began developing textnoisr, a tool that would better meet our needs.

A Straightforward Algorithm to Address a Seemingly Simple Problem

As the motto says, “Simple is better than complex.” So we quickly implemented this simple algorithm:

for each character, apply ACTION with probability p

where ACTION is one of “substitute”, “delete”, “insert”, or “swap”. No subtleties here: just a single parameter p, directly interpretable and easily mapped to the CER.
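In Python, this naive algorithm might be sketched as follows (our own illustrative code, not the textnoisr implementation):

import random
import string

def naive_noise(text: str, p: float, action: str = "substitute") -> str:
    """Apply `action` to each character independently with probability p."""
    out = []
    for char in text:
        if random.random() >= p:
            out.append(char)  # leave the character untouched
        elif action == "substitute":
            out.append(random.choice(string.ascii_letters))
        elif action == "delete":
            pass  # drop the character
        elif action == "insert":
            out.append(random.choice(string.ascii_letters) + char)
        # "swap" is deliberately omitted: as we will see below, it is trickier.
    return "".join(out)

random.seed(0)
print(naive_noise("The duck-billed platypus is a small mammal.", p=0.1))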

Is that all there is to it? Let’s test our algorithm on a corpus of one million sentences, with a probability p of 10%, and verify the resulting CER:

  • substitute: 10.00985 %
  • delete: 10.01812 %
  • insert: 10.00121 %
  • swap: 15.90570 %

As can be seen, the first three actions behave as expected: by the Law of Large Numbers, the resulting CER is (approximately) the input p. Notice that for a larger corpus the approximation would improve further; the Central Limit Theorem even quantifies how quickly the fluctuations shrink.

But why is the result for swapping two consecutive characters, a pretty frequent error when typing on a keyboard, almost 16% and not 10%?

Swapping two consecutive characters is a common typo we want to simulate with textnoisr

Investigating CER (mis)Calibration

There are two reasons why the swap action is not as simple as the other three:

  • First, swapping two characters is not an “atomic action” with respect to the standard CER metric: the underlying Levenshtein distance has no “transposition” operation, so a single swap such as STEAM → TSEAM counts as two substitutions, not one edit.
  • Second, we want to avoid repeatedly swapping the same character, even if the probability of applying the swap action is high.
    The four consecutive swaps
    STEAM → TSEAM → TESAM → TEASM → TEAMS
    would be equivalent to a single exchange STEAM → TEAMS, and this cannot be considered “swapping consecutive characters”.
    To avoid this behavior, we must not swap a character that has just been swapped. This breaks the independence between one character and the next, and makes the Law of Large Numbers inapplicable, contrary to the three other actions. The short check below makes both points concrete.
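Both points can be verified numerically:

# `levenshtein` is the dynamic-programming helper defined earlier in this post.
print(levenshtein("STEAM", "TSEAM"))  # 2: one swap counts as two substitutions
print(levenshtein("STEAM", "TEAMS"))  # 2: four chained swaps also count as two edits

A single swap action thus registers as two edits, while a chain of four swap actions still registers as only two; the CER over-counts isolated swaps and under-counts chained ones, which is exactly what makes the calibration non-trivial.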

To investigate the effects of these constraints, we varied the noise level and observed the resulting average CER when applying “swap”:

Calibration of our naive algorithm for swapping characters. There is still a bias compared to a perfect calibration.

Unsurprisingly, the curve is significantly smoother than nlpaug's. And, as previously seen, it is not well calibrated (contrary to the same curves for the “substitute”, “delete” and “insert” actions). What came as a surprise is that it is also non-monotonic: for example, a noise parameter of 100% results in a lower CER than a noise parameter of 75%!

At this point, an approximation for low noise levels, derived directly from this calibration curve, was sufficient for our project needs. But it was still perplexing not to understand the hidden logic behind these numbers.

A Solution for “Swap” Action, Thanks to Linear Algebra

To perfect our tool, we needed to answer one crucial question: if we feed an input parameter p to our naive algorithm, what will the resulting CER be? Or, equivalently: how should we adjust p so that the resulting CER matches the target noise level?

The solution we eventually found involved three steps:

1/ In order to model the bias in the calibration, we had to dive into the concept of the Levenshtein distance, the core of the CER metric, and how it interplays with the algorithm. The answer involves some Markov chains, such as the one below (included here purely for illustrative purposes; you can find all the details on the dedicated page of the textnoisr documentation if you're interested!)

The Markov chain that models the probabilities behind the swap calibration

This Markov chain can be represented as a transition matrix, and all computations are done with numpy to derive the predicted average CER for a given noise level.

The formula that gives p′, the expected CER when using the naive algorithm with probability p.

This model allowed us to reproduce the results of the calibration curve perfectly. For example, it exhibits the same “maximal theoretically reachable CER”, namely a CER of 53% at a noise level of 73%.
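To give a flavor of the technique, here is a deliberately simplified two-state chain (not the actual chain used by textnoisr, which also accounts for interactions between adjacent swaps and repeated letters). It encodes the rule “a character that was just swapped cannot be swapped again” as a transition matrix, and reads the expected swap frequency off the stationary distribution:

import numpy as np

p = 0.10  # input noise level

# Simplified two-state chain over character positions:
# state 0: the character is free to start a swap,
# state 1: the character was just swapped and must be skipped.
P = np.array([
    [1 - p, p],    # from "free": stay free, or start a swap
    [1.0,   0.0],  # from "just swapped": the next character is free again
])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
eigenvalues, eigenvectors = np.linalg.eig(P.T)
pi = np.real(eigenvectors[:, np.argmax(np.isclose(eigenvalues, 1.0))])
pi /= pi.sum()

# Each swap touches two characters, so the fraction of characters
# involved in a swap is 2 * pi[1]; in closed form, 2p / (1 + p).
print(2 * pi[1], 2 * p / (1 + p))  # both ~0.1818 for p = 0.1

Note that the measured CER from earlier (15.9%) is below this simplified 18.2% estimate, precisely because of the interactions between swaps that the full textnoisr model accounts for.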

2/ We can now adjust the probability used in the naive algorithm to account for the bias we have just identified. This is achieved by inverting the formula to recalibrate the probability: for example, if the user asks for a CER of 53%, we internally set the noise level to 73%.
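Since the calibration formula is monotonic below its maximum, it can be inverted numerically even without a closed form. A minimal sketch, using the simplified 2p/(1+p) estimate from above as a stand-in for the real expected-CER formula:

def expected_cer(p: float) -> float:
    """Stand-in for the real textnoisr formula (simplified estimate)."""
    return 2 * p / (1 + p)

def calibrate(target_cer: float, p_max: float = 1.0, tol: float = 1e-9) -> float:
    """Find p such that expected_cer(p) == target_cer, by bisection."""
    low, high = 0.0, p_max
    while high - low > tol:
        mid = (low + high) / 2
        if expected_cer(mid) < target_cer:
            low = mid
        else:
            high = mid
    return (low + high) / 2

p = calibrate(0.10)
print(p, expected_cer(p))  # p ~ 0.0526 gives an expected CER of ~0.10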

At this point, the calibration curve is perfect for text containing no repeated consecutive characters, like this simple sentence. But because natural languages contain many words with identical consecutive letters (like “letters” itself in English), for which a swap changes nothing, the result is not quite perfect for common use cases.

Unbiased naive algorithm, before re-calibration for natural text.

3/ To correct the residual miscalibration, we introduce a small correction coefficient that accounts for the uneven distribution of letters within a given natural language.

Result for the unbiased swap algorithm, after re-calibration for the English language.
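To get a sense of why such a language-dependent coefficient is needed, one can measure how often a corpus repeats a character (a quick illustrative estimate, on any sample string of your choice):

def repeated_char_rate(text: str) -> float:
    """Fraction of adjacent character pairs that are identical,
    i.e. positions where a swap would leave the text unchanged."""
    pairs = list(zip(text, text[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)

print(repeated_char_rate("letters attract batteries"))  # doubled t's count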

In the end, the CER can be adjusted very finely over a wide range, without the “staircase effect” seen with nlpaug.

textnoisr: a Python package that lets you forget about calibration

Following the other motto, “Complex is better than complicated.”, we developed a Python package that can easily be installed with pip:

pip install textnoisr

It abstracts away the implementation details, while allowing you to add noise to a text with accurate calibration:

>>> from textnoisr import noise
>>> text = "The duck-billed platypus is a small mammal."
>>> augmenter = noise.CharNoiseAugmenter(noise_level=0.1)
>>> print(augmenter.add_noise(text))
The dhuck-biled plstypus is a smaJl mammal.

The obtained Character Error Rate is about ten percent, as expected. And the larger the corpus, the closer the measured CER will be to the target.

More options are detailed in the documentation, but you can get started within seconds. We recommend following the tutorial first.

In addition, textnoisr is fully compatible with HuggingFace's datasets library, which is very popular within the NLP community.
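For instance, noisifying a whole dataset might look like this (a minimal sketch assuming a dataset with a “text” column; see the textnoisr documentation for the supported integration):

from datasets import Dataset
from textnoisr import noise

dataset = Dataset.from_dict(
    {"text": ["The duck-billed platypus is a small mammal."]}
)
augmenter = noise.CharNoiseAugmenter(noise_level=0.1)

# Map the augmenter over the "text" column of the dataset.
noisy = dataset.map(lambda example: {"text": augmenter.add_noise(example["text"])})
print(noisy[0]["text"])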

As a bonus, thanks to its straightforward approach and its use of caching, textnoisr also runs much faster:

This is crucial when we aim to add noise to a large corpus at various noise levels. Processing very large corpora, which previously took several hours, can now be completed in less than one.
