This blog post introduces the Python package ExposeText, which we developed for our open document anonymization app OpenRedact. OpenRedact is a Prototype Fund project, supported by the Federal Ministry of Education and Research.
Being a document anonymization app, OpenRedact needs to support documents of various file formats. Not only do we have to extract plain text content from these formats in order to run natural language processing, but the document must also be altered by redacting the personal data. Doing this comfortably was not possible with existing tools, as it would have required us to use several different format-specific parsers, when we merely wanted to replace some words in the original document with new content.
Some existing tools, such as textract, can extract plain text content from files by adding a layer of abstraction on top of the vast number of file formats. But, to our knowledge, there is no existing tool that effectively combines the two functionalities that are essential to our project: exposing the text content of a document, and mirroring any modification of the exposed text in the original document.
To fill this gap, OpenRedact has written the Python library expose-text. The library's goal is to make modifying documents as simple as changing Python strings. A slice of the original document can be assigned new content directly, using the character indices of the extracted text. How this works can best be seen in code:
>>> from expose_text import FileWrapper
>>> wrapper = FileWrapper('my_file.docx')
>>> wrapper.text
'Hello Anna, ...'
>>> wrapper[6:10] = 'XXX'
>>> wrapper.text
'Hello XXX, ...'
>>> wrapper.save('anonymized_file.docx')
The saved document keeps all the formatting and metadata from the original. In particular, the new content "XXX" is formatted in the same style as the word "Anna" was.
How does it work?
ExposeText has prototypical support for the formats .txt, .html, .docx and .pdf.
- .txt: Supporting plain text files is straightforward; the content is simply read into a string, modified, and written back into the file. However, there are many different encodings, and a plain text file does not declare which one it uses, so the correct one can only be guessed from the content of the file. This is where the library chardet comes to the rescue: if the user installs it, the appropriate encoding is determined automatically.
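The read, guess, and write-back cycle for plain text files could look roughly like the sketch below. This is an illustration and not ExposeText's actual code: the helper names read_text and write_text and the UTF-8 fallback are assumptions. Only chardet.detect, which returns a dict with an 'encoding' entry, is taken from chardet itself.

```python
from pathlib import Path

try:
    import chardet  # optional third-party dependency
except ImportError:
    chardet = None


def read_text(path, fallback="utf-8"):
    """Return (text, encoding) for a plain text file.

    Hypothetical helper: the encoding is guessed with chardet when it is
    installed, otherwise the fallback encoding is assumed.
    """
    raw = Path(path).read_bytes()
    encoding = None
    if chardet is not None:
        # detect() may report None as the encoding if it cannot decide
        encoding = chardet.detect(raw)["encoding"]
    encoding = encoding or fallback
    return raw.decode(encoding), encoding


def write_text(path, text, encoding):
    """Write the modified text back using the encoding it was read with."""
    Path(path).write_bytes(text.encode(encoding))
```

Writing back with the detected encoding keeps the file readable for whatever produced it, but a replacement that contains characters outside that encoding would fail to encode.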
Support for the remaining document formats is based on a little algorithm that we call Simultaneous Text Extraction and Mapping (STEM). STEM uses regular expressions to remove or replace patterns from a markup language such as HTML or XML. In the process, a mapping between the revealed text content and the markup version is maintained. This mapping makes it possible to replace a passage in the text content and in the markup at the same time.
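To make the idea more concrete, here is a minimal sketch of extraction with a mapping. It is a simplified illustration, not the library's implementation: the function names expose and replace are made up for this post, the tag pattern is deliberately naive, and the mapping is just a list that stores, for every character of the text, its position in the markup.

```python
import re

# Assumption: a naive tag pattern that is good enough for this demo
TAG = re.compile(r"<[^>]+>")


def expose(markup, pattern=TAG):
    """Remove everything matching `pattern`.

    Returns the remaining text and a list that maps each text index to
    the index of the same character in the markup.
    """
    chars, mapping = [], []
    pos = 0
    for match in pattern.finditer(markup):
        for i in range(pos, match.start()):
            chars.append(markup[i])
            mapping.append(i)
        pos = match.end()
    for i in range(pos, len(markup)):
        chars.append(markup[i])
        mapping.append(i)
    return "".join(chars), mapping


def replace(markup, mapping, start, end, new):
    """Replace text[start:end] with `new` inside the markup.

    The new content is placed where the first replaced character was, so
    it inherits the surrounding tags. Markup between replaced characters
    is kept untouched.
    """
    if not 0 <= start < end <= len(mapping):
        raise ValueError("expected a non-empty slice within the text")
    targets = set(mapping[start:end])
    first = mapping[start]
    out = []
    for i, ch in enumerate(markup):
        if i == first:
            out.append(new)
        if i not in targets:
            out.append(ch)
    return "".join(out)


markup = "<p>Hello <b>Anna</b>, ...</p>"
text, mapping = expose(markup)
redacted = replace(markup, mapping, 6, 10, "XXX")
```

Here text is 'Hello Anna, ...' and redacted is '<p>Hello <b>XXX</b>, ...</p>'. In this sketch the mapping is stale after one replacement and has to be rebuilt by calling expose again.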
- .html: Essentially, we use STEM to remove all scripts, styles, templates, and most importantly HTML tags. As a result we get the exposed text content and the mapping.
- .docx: DOCX stores the text content in an XML file called document.xml. Therefore, we can again use STEM to expose the text content.
- .pdf: In contrast to the aforementioned file formats, PDFs are not meant to be edited. They also have many (exotic) capabilities, and it would be very difficult to support them all. We implemented two approaches, neither of which is perfect. Approach 1 is based on JoshData's pdf-redactor and pdfrw to extract and replace words directly in the PDF. A drawback of this approach is that only the characters that occurred in the original can be used in replacement strings. Approach 2 circumvents this issue by recreating the PDF from scratch. First, the PDF is converted to HTML using poppler-utils. Then, we expose and modify the text content of the HTML using STEM. Finally, the modified HTML is converted back to PDF using wkhtmltopdf. A drawback here is that the PDF to HTML to PDF conversion is lossy, so the output file will look slightly different.
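The three steps of Approach 2 can be sketched as a small pipeline around the two command line tools. This is a rough sketch under assumptions, not the code in ExposeText: the function names are hypothetical, the -noframes option is just one plausible way to get a single HTML file out of pdftohtml, and modify_html stands for whatever edits the HTML in between.

```python
import subprocess
from pathlib import Path


def pdf_to_html_cmd(pdf_path, html_stem):
    # pdftohtml (poppler-utils) appends ".html" to the output name itself
    return ["pdftohtml", "-noframes", str(pdf_path), str(html_stem)]


def html_to_pdf_cmd(html_path, pdf_path):
    return ["wkhtmltopdf", str(html_path), str(pdf_path)]


def recreate_pdf(src, dst, modify_html, workdir):
    """PDF -> HTML -> modified HTML -> PDF.

    Needs pdftohtml and wkhtmltopdf on the PATH. `modify_html` is a
    hypothetical callback that takes the HTML as a string and returns
    the modified HTML.
    """
    stem = Path(workdir) / "page"
    html = stem.with_suffix(".html")
    subprocess.run(pdf_to_html_cmd(src, stem), check=True)
    # Assumption: the generated HTML is UTF-8 encoded
    original = html.read_text(encoding="utf-8")
    html.write_text(modify_html(original), encoding="utf-8")
    subprocess.run(html_to_pdf_cmd(html, dst), check=True)
```

Because the output PDF is rendered from HTML, any character can appear in a replacement, which is exactly what Approach 1 cannot offer.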
ExposeText as part of OpenRedact
Under the hood, OpenRedact works with Unicode strings only. The NERwhal library, which operates on strings, is used to identify personal data. The web app provides an annotation tool that is based on (tokenized) strings. It allows the automatically detected personal data to be manually corrected and extended. The resulting sensitive sections are then passed, as strings, to the Anonymizer component, where they are anonymized.
By extracting the text and eventually channeling the anonymized text back into the formatted document, ExposeText plays a crucial role in OpenRedact. ExposeText bridges the gap between real-life formats and natural language processing. As a result, OpenRedact can be used to semi-automatically anonymize documents in the most common formats, e.g. court records or journalistic reports.
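One practical detail when several sensitive sections are replaced in the same text: every replacement that changes the length shifts the indices of everything behind it. A common way around this is to apply the replacements from the end of the text to the beginning. The sketch below shows this on a plain string; it is a generic illustration with a made-up function name, not code from OpenRedact or ExposeText.

```python
def apply_replacements(text, spans):
    """Replace several character spans in one go.

    `spans` is a list of (start, end, replacement) tuples whose indices
    refer to the original text. Working from right to left keeps the
    indices of the remaining spans valid.
    """
    ordered = sorted(spans, key=lambda span: span[0])
    # Overlapping spans would make the result ambiguous, so reject them
    for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            raise ValueError("overlapping spans")
    for start, end, replacement in reversed(ordered):
        text = text[:start] + replacement + text[end:]
    return text


result = apply_replacements(
    "Hello Anna, meet Bob.",
    [(6, 10, "XXX"), (17, 20, "YYY")],
)
```

With these two spans, result is 'Hello XXX, meet YYY.'.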
What’s next, and how can you help?
In the future, we would love to use ExposeText in other projects, to increase the visibility of and interest in our library.
If we made you curious about the library, check out our GitHub. We welcome contributions that strengthen the support for DOCX and PDF files, or add support for new file formats.