Anonymizer: A framework for text anonymization

This blog post introduces the Python package Anonymizer that was developed for our open document anonymization app OpenRedact. OpenRedact is one of the Prototype Fund projects, funded by the Federal Ministry of Education and Research.

One of the main goals of our project OpenRedact is to simplify the anonymization of texts by using a semi-automatic approach.

A prime use case for our tool is publishing documents that may contain sensitive or personally identifiable information (e.g., court records, internal communications, journalistic reports). Such materials are frequently made available for transparency reasons. Their anonymization is typically performed entirely by hand, which consumes a vast amount of time.

Our prototype aims to simplify such a process by automatically detecting a subset of personal information and enabling the user to correct and amend these detections manually. The tool then offers a variety of anonymization methods to control the amount of data to be revealed.

An example

Controlling the amount of data to be revealed is a vital aspect of anonymization. Let’s consider a simple example:

There were three bids for the painting: David bid 30.000€, Larissa bid 35.000€, and Mark bid 37.000€.

Both the names and bids could be seen as sensitive information to be anonymized.

Following the simplest method of anonymization, we can suppress all sensitive information in our example, which yields:

There were three bids for the painting: XXX bid XXX€, XXX bid XXX€, and XXX bid XXX€.

This method of anonymization only reveals that there were three bids for the painting. However, it neither shows who the bidders were, nor how much they bid. We cannot even be sure whether these bids all came from different bidders or whether one bidder raised their offer.

Overall, this anonymization provides the least amount of transparency but protects the data the most.
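To make suppression concrete, here is a minimal sketch in plain Python. It is not the Anonymizer API: the function name suppress, the placeholder, and the hard-coded list of sensitive strings are assumptions made for illustration. In OpenRedact, the detections would come from the automatic detection and manual tagging rather than from a fixed list.

```python
import re


def suppress(text, spans, placeholder="XXX"):
    """Replace every (start, end) character span in text with a placeholder."""
    # Replace from right to left so that earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


sentence = "David bid 30.000€, Larissa bid 35.000€."

# Hypothetical detections; a real system would not use a fixed list.
sensitive = ["David", "30.000", "Larissa", "35.000"]
spans = [
    (match.start(), match.end())
    for word in sensitive
    for match in re.finditer(re.escape(word), sentence)
]

print(suppress(sentence, spans))
# XXX bid XXX€, XXX bid XXX€.
```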

Providing a bit more transparency, we could choose to at least reveal whether all bids came from different bidders:

There were three bids for the painting: Person 1 bid XXX€, Person 2 bid XXX€, and Person 3 bid XXX€.

This method of anonymization is also called pseudonymization. Instead of providing the real identities of the bidders, we assign each person a number. If Person 1 had raised their bid, we would know.
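The idea can be sketched in a few lines of plain Python. The class SimplePseudonymizer below is a hypothetical stand-in written for this post, not the implementation shipped in the package. It only shows the core property: the same input always maps to the same numbered placeholder.

```python
class SimplePseudonymizer:
    """Hypothetical sketch: assigns each distinct value a stable, numbered pseudonym."""

    def __init__(self, format_string="Person {}"):
        self.format_string = format_string
        self._known = {}  # original value -> pseudonym

    def anonymize(self, value):
        # A value seen before gets its old pseudonym back; a new one gets the next number.
        if value not in self._known:
            self._known[value] = self.format_string.format(len(self._known) + 1)
        return self._known[value]


pseudonymizer = SimplePseudonymizer()
bidders = ["David", "Larissa", "David"]  # David raises his bid
print([pseudonymizer.anonymize(name) for name in bidders])
# ['Person 1', 'Person 2', 'Person 1']
```

Because David appears twice, the output repeats "Person 1", which is exactly the information that suppression would have hidden.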

Finally, one could argue that it would be nice to have some insight into the amounts bid for the painting. Simply leaving the numbers unsuppressed would be an option, but there are ways that better protect the bidders’ privacy.

The noising of these values is an approach inspired by the concept of Differential Privacy. Originally invented for static databases, Differential Privacy promises to allow statistical evaluations on databases while maintaining the privacy of the individuals who contributed to them. This approach is arguably the most involved, and we only provide some intuition here.

Instead of revealing the actual number, we roll a special die and add the result to the number. If the die shows “-2483” (we said it is a special die) and the number is David’s “30.000€”, we end up with “27.517€”. When we put “27.517€” into the document, David can always deny that this bid was his: with some probability, a different original amount could have produced “27.517€” as well. This property is called “plausible deniability”.

Moreover, the special die also allows us to compute the average of all bids with reasonable accuracy (provided we have enough bids) and to assess the general ballpark of the amounts. On the die, small numbers are much more probable than large positive or negative numbers.

There were three bids for the painting: Person 1 bid 27.517€, Person 2 bid 26.189€, and Person 3 bid 41.023€.
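The special die can be sketched as a random draw from a distribution centered on zero. The sketch below assumes a Laplace distribution, which is a common choice in Differential Privacy; the scale of 3000 is arbitrary, and neither is taken from the Anonymizer implementation. It illustrates the intuition only and makes no claim about a formal privacy guarantee.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale): small values are likely, large ones are rare."""
    # The difference of two independent exponential variables with mean `scale`
    # follows a Laplace distribution centered on zero.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noise_amount(amount, scale, rng):
    """Add rounded Laplace noise to a single amount."""
    return round(amount + laplace_noise(scale, rng))


rng = random.Random(42)  # fixed seed only to make the demo repeatable

bids = [30000, 35000, 37000]
print([noise_amount(bid, scale=3000, rng=rng) for bid in bids])

# With many bids, the noise largely cancels out in the average.
many_bids = [30000] * 10000
average = sum(noise_amount(bid, 3000, rng) for bid in many_bids) / len(many_bids)
print(round(average))
```

A single noised bid can be far from the original, but the average over many bids stays close to the true average, which matches the ballpark property described above.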

Note that this anonymization technique is not context-aware! Because the numbers are distorted independently of each other, it does not preserve the property that the bids were sorted from lowest to highest. In the same vein, anonymizing “Caro was born on the 02.06.1994. She is 26 years old.” could lead to “Caro was born on the 03.05.1991. She is 23 years old.”, where the date and the age no longer match. This is one of the reasons that manual checks are always required.

The Anonymizer Component as part of OpenRedact

Applying techniques such as the ones from the example is the primary purpose of the Anonymizer component. The NERwhal component finds sensitive sections in the text and passes them alongside manually tagged parts into the Anonymizer.

The Anonymizer has to be manually configured through the web interface. It will then apply different anonymization techniques to different kinds of data. This allows the user to replicate the previous example and pseudonymize the names, while amounts are noised.
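The following sketch shows the general idea of choosing a technique per kind of data. Everything in it is an assumption made for illustration: the tag names "PER", "MONEY" and "LOC", the helper functions, and the dictionary layout are hypothetical and do not reflect the configuration format of the Anonymizer. Amounts are suppressed here rather than noised to keep the example short.

```python
def suppress(_value):
    """Suppression: hide the value completely."""
    return "XXX"


def make_pseudonymizer(format_string="Person {}"):
    """Return a function that maps each distinct value to a stable pseudonym."""
    known = {}

    def pseudonymize(value):
        if value not in known:
            known[value] = format_string.format(len(known) + 1)
        return known[value]

    return pseudonymize


# Hypothetical mapping from tag to technique.
mechanisms_by_tag = {"PER": make_pseudonymizer(), "MONEY": suppress}


def anonymize_all(tagged_values, mechanisms, default=suppress):
    """Apply the technique configured for each tag; unknown tags fall back to the default."""
    return [mechanisms.get(tag, default)(value) for value, tag in tagged_values]


detections = [("David", "PER"), ("30.000", "MONEY"), ("David", "PER"), ("Berlin", "LOC")]
print(anonymize_all(detections, mechanisms_by_tag))
# ['Person 1', 'XXX', 'Person 1', 'XXX']
```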

In addition to suppression, pseudonymization, and noising, we also offer two other techniques: generalization (“Peter” -> “Person”) and Randomized Response (a form of noising for categorical values).
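Both techniques can be sketched briefly in plain Python. The functions below are hypothetical illustrations, not the package's mechanisms. The Randomized Response variant shown keeps the true value with probability p_truth and otherwise draws uniformly from all categories; the variant and parameters used by the Anonymizer may differ.

```python
import random


def generalize(_value, replacement="Person"):
    """Generalization: every value of a kind collapses to one generic label."""
    return replacement


def randomized_response(value, categories, p_truth, rng):
    """Keep the true value with probability p_truth, else pick a random category."""
    if rng.random() < p_truth:
        return value
    # The uniform draw may return the true value again, which is intended:
    # a reader cannot tell whether a published value is real or random.
    return rng.choice(categories)


rng = random.Random(0)  # fixed seed only to make the demo repeatable
categories = ["Germany", "France", "Spain"]  # made-up example categories

print(generalize("Peter"))
# Person
print(randomized_response("Germany", categories, p_truth=0.7, rng=rng))
```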

As we have seen in our initial example, offering a diverse set of anonymization techniques is crucial and allows the user to carefully control the disclosure of data. Depending on the scenario at hand, the user can balance transparency, the utility of the published document, and, most importantly, the privacy of the individuals involved.

After the techniques are applied, our prototype uses the expose-text functionality to generate an anonymized document.

Using the Anonymizer somewhere else

Of course, our Anonymizer component can also be used in other Python projects.

The component consists of three sub-components:

  1. The anonymization: A component that allows batch anonymization of sensitive data. It can be configured to apply different anonymization techniques depending on how the input data is *tagged*.
  2. The mechanisms: These are the individual anonymization techniques. Each can be instantiated and configured separately as needed.
  3. The encoders: This sub-component includes a few helpers to work with numerical data and dates in text documents. Encoders are required for numerical noising, e.g., to convert a date to a single number and back while maintaining the original formatting.
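To illustrate what an encoder does, the sketch below converts a date to a single number and back while keeping the original formatting. It uses only the Python standard library. The function names and the fixed format string are assumptions for this example and are not the encoder interface of the package.

```python
from datetime import date, datetime

DATE_FORMAT = "%d.%m.%Y"  # the format used in the examples of this post


def encode_date(text):
    """Turn a formatted date into a single integer (days since 01.01.0001)."""
    return datetime.strptime(text, DATE_FORMAT).date().toordinal()


def decode_date(number):
    """Turn the integer back into a date string with the original formatting."""
    return date.fromordinal(number).strftime(DATE_FORMAT)


encoded = encode_date("02.06.1994")

# Once the date is a number, noise can be added like for any other number.
# Here the "noise" is a fixed shift of -30 days to keep the output predictable.
print(decode_date(encoded - 30))
# 03.05.1994
```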

Each sub-component can be used and configured individually. The configuration can be parsed from JSON or Python dictionaries.

For example, the pseudonymization technique can simply be used by writing:

>>> from anonymizer.mechanisms.pseudonymization import Pseudonymization

>>> mechanism = Pseudonymization(format_string='Person {}')
>>> mechanism.anonymize('test')
'Person 1'

What’s next, and how can you help?

As an immediate next step, we would love to add more examples of how to build on our Anonymizer in other projects. With this, we hope to increase interest in our project and to attract new contributors.

If you are already interested in the Anonymizer component and have other anonymization techniques in mind, visit our GitHub! Another possible extension point is the addition of new encoders (we currently support numbers with prefixes and suffixes, as well as dates).