Common pitfalls with the preprocessing of German text for NLP 🇩🇪
Lessons learned during the data preprocessing for an email classification project at idealo.de.
The Internet offers a broad variety of NLP libraries for Python. There is just one condition: your data had better be in English. Unluckily for me, my email classification project at idealo.de focused on German emails. Text preprocessing steps, such as stop-word removal and stemming, turned out to be more difficult than expected due to the lack of libraries tailored to the German language. In this post, I will present some aspects of the textual data preprocessing I worked on and the lessons learned from this experience.
Apart from language-independent steps, like removing links, email addresses and punctuation, encoding numbers, and lowercasing the entire text, I identified the need for some further cleaning. For instance, I noticed that when opening a new section in a message or creating a header in a table, email authors love to highlight words by putting a space between each character of a single word (‘ E X A M P L E ’). Unfortunately, vectorisation of such words is impossible, so they had to be merged back together. Though there are plenty of packages out there that helped with many of the issues I found, I oddly couldn’t find any library to handle this particular problem. Therefore, I wrote a simple function that you can find here. In the next part I will summarise the other pitfalls of different packages I stumbled upon.
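The cleaning steps above can be sketched roughly as follows. This is a hypothetical reconstruction, not the project’s actual code: the regexes, the `NUM` placeholder, and the `merge_spaced_words` helper are simplified illustrations of the technique.

```python
import re

def merge_spaced_words(text):
    # Collapse words written as 'E X A M P L E' back into 'EXAMPLE'.
    # Heuristic: three or more single word-characters separated by single spaces.
    pattern = re.compile(r'\b(?:\w ){2,}\w\b')
    return pattern.sub(lambda m: m.group(0).replace(' ', ''), text)

def clean_text(text):
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # remove links
    text = re.sub(r'\S+@\S+', ' ', text)                # remove email addresses
    text = merge_spaced_words(text)                     # fix 'E X A M P L E' words
    text = re.sub(r'\d+', ' NUM ', text)                # encode numbers as a placeholder token
    text = re.sub(r'[^\w\s]', ' ', text)                # remove punctuation (\w covers äöüß in Python 3)
    text = re.sub(r'\s+', ' ', text).strip()            # collapse whitespace
    return text.lower()
```

Note that `merge_spaced_words` will also glue together genuine runs of single letters; for email data this trade-off was acceptable, but you may want a stricter heuristic.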
There are a couple of libraries available for identifying German stop words. Here is an overview of the packages I came across while looking for a suitable set.
Among the listed sources, the shortest set has 231 words while the longest one has 1855 words. Why is the difference so big? What’s in those lists? Which one is more suitable for the project? These are the questions you need to answer before even starting to think about anything ML-related. When choosing or creating a stop words list it is crucial to keep in mind what your goal is: the information you preserve, and discard, will greatly affect the performance of your ML model.
After going through each of those lists I decided that none of them was ideal for my case. The closest to what I needed was the stop-words one, so I decided to use that as a basis to create my own.
Regardless of whether you pick a list you found online or create your own, knowing exactly what words you removed (and why) is very helpful in the next steps of data cleaning.
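Once you have settled on a list, the removal step itself is simple. A minimal sketch, assuming the text is already lowercased and tokenised; the words below are a tiny made-up sample, not the actual list used in the project:

```python
# Tiny illustrative sample of German stop words; a real curated list
# would contain a few hundred entries.
GERMAN_STOP_WORDS = {
    "und", "oder", "aber", "der", "die", "das",
    "ist", "ich", "sie", "wir", "mit", "für",
}

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop-word set.
    return [t for t in tokens if t not in GERMAN_STOP_WORDS]
```

Keeping the list as an explicit set in your own code (rather than hidden inside a library call) makes it easy to audit exactly which words were discarded.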
Part-of-speech (POS) tagging
During my work on this project, I tried out quite a few methods of data preprocessing. Not all of them were used in the final pipeline. Playing around with part of speech was one of the experiments that didn’t make it to the finals. The main reason was the speed-quality ratio of the available packages.
After running an experiment on 1000 emails I came to the conclusion that spaCy’s POS tagging is more accurate and appropriate for German but significantly slower than NLTK (243 vs. 16 seconds on those 1000 emails). This makes it unusable for projects involving millions of emails. On the other hand, NLTK seems to be very liberal in its tagging and I didn’t see it as a helpful tool. As you can see in the table below, the tags generated for the same text by spaCy and NLTK are very different (and the NLTK tags make little sense).
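A sketch of how such a comparison can be set up. It assumes spaCy’s small German model `de_core_news_sm` and NLTK’s default tagger data are installed; the sentence is just an example. Note that `nltk.pos_tag` uses an English-trained tagger by default, which is one reason its output on German text makes little sense.

```python
import nltk
import spacy

sentence = "Vielen Dank für Ihre Bestellung."

# spaCy: loads a model trained on German, so tags are language-appropriate.
nlp = spacy.load("de_core_news_sm")
print([(token.text, token.pos_) for token in nlp(sentence)])

# NLTK: English-trained perceptron tagger applied to German tokens.
print(nltk.pos_tag(nltk.word_tokenize(sentence, language="german")))
```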
Compound word splitting
Another experiment that didn’t make its way to the final round was compound word splitting (I tried out the packages CharSplit and SECOS). In German, it is very common to combine two (or ten, why not?) words to be more efficient in writing or speaking. Therefore, instead of using the word Bestellung (eng. order), many merchants used its various combinations: Bestellbestätigung (eng. order confirmation), Bestellstornierung (eng. order cancellation), Bestelldatum (eng. order date), Bestellnummer (eng. order number), Bestellvorgang and Bestellablauf (both: eng. order process), etc. By splitting each compound into its parts, I was hoping to expose the true leaders among the words appearing in the emails.
Unfortunately, the first results were quite disappointing, as the splits made very little sense:
Although I ended up not using compound splitting in my data cleaning process, I believe it is a good idea to use this step for German texts. You can try to find other packages that may do a better job, or investigate the possibility of adjusting one of the imperfect packages to your needs. Both packages are based on NLP but use different methods and data for training (here are the papers the two libraries referred to: CharSplit, SECOS).
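As a rough illustration of the idea (not how CharSplit or SECOS actually work — both rely on statistical evidence from large corpora), a naive dictionary-based splitter might look like this. The vocabulary here is a made-up toy sample:

```python
def split_compound(word, vocab):
    # Naive greedy split: try every split point and accept the first one
    # where both halves are known vocabulary entries (stems or words).
    w = word.lower()
    for i in range(3, len(w) - 2):
        head, tail = w[:i], w[i:]
        if head in vocab and tail in vocab:
            return (head, tail)
    return (w,)  # no split found

# Toy vocabulary of stems/words; a real system needs corpus statistics,
# e.g. to handle linking elements such as -s- or -n-.
vocab = {"bestell", "bestätigung", "stornierung", "datum", "nummer"}
```

For example, `split_compound("Bestellbestätigung", vocab)` yields `("bestell", "bestätigung")`, exposing the shared Bestell- stem across the merchants’ variants.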
German is a tricky language and we definitely need more high-quality libraries for preprocessing and text analysis. Until then, we should be careful when using code we find online and dig into it before we trust it for our project. Sometimes this may be time-consuming or boring, but it is very helpful in:
1. making sure you’re not making silly mistakes,
2. defining what you expect your final text to look like,
3. truly understanding the data you are working with (which can give you great intuition about what direction your ML project should take).
As one picture is worth a thousand words, here is a before-and-after example showing how much text I was able to get rid of in my cleaning pipeline.
An example of an email that underwent the preprocessing. Red parts of the email were removed in different steps of the cleaning.
If you found this article useful, give me a high five 👏🏻 so others can find it too, and share it with your friends. Follow me here on Medium (Gosia Adamczyk) to stay up-to-date with my work. Thanks for reading!