Text preprocessing in different languages for Natural Language Processing in Python

Part II — Case Study

Mor Kapronczay
Oct 15, 2019 · 8 min read
Image for post
Natural Language Processing is a catchy phrase these days

This is Part 2 of a pair of tutorials on text pre-processing in Python. In the first part, I laid out the theoretical foundations. In this second part, I’ll demonstrate the steps described in Part 1 in Python on texts in different languages, and discuss how their effects differ because of the different structures of these languages.

You can check out the code on GitHub!


This case study looks at the effect of three techniques:

  1. Removing stopwords
  2. Removing both extremely frequent and infrequent words
  3. Stemming, an automated technique to reduce words to their base form

The idea came from an academic article in which the authors examined how text pre-processing techniques in different languages affect the results of the Wordfish algorithm. Wordfish is a method for ideological scaling: based on word use, it can estimate which speakers are in the center and which can be considered extremists on either side of the spectrum. You can see a result about German political parties below:

Image for post
source: https://www.researchgate.net/figure/Results-of-the-Wordfish-Analysis_fig3_317616075

In addition, they report the effect of text pre-processing techniques on the unique word counts of the texts, i.e. how much stemming lowers the number of unique words in the corpus for each language. I advise you to check out their results in this paper if you are interested!

Since the researchers used substantially different kinds of political text, the results for different languages are not perfectly comparable. For example, they analyzed 61 PM speeches in parliament from Denmark from 4 different parties and 104 written motions from party conferences of Italy from 15 different parties. That’s why I decided to create a comparable corpus in 4 languages to carry out the analysis. The corpora not only need to be comparable, they also have to include text from many domains in order to produce results that generalize well for these languages. Nevertheless, it is important to note that specific corpora can react very differently to these techniques!

Creating the corpora

Two sources of comparable text came to mind:

  1. Books in different languages, and
  2. Wikipedia in different languages.

Wikipedia content extraction made simple with this Python package.

Acquiring the text of books in languages other than English turned out to be more complicated and, above all, harder to automate. With the wikipedia Python package, however, automated access to the content of Wikipedia pages in different languages is as easy as you can see in this gist.
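The gist itself is not reproduced in this text, so here is a minimal sketch of what such a fetch might look like. It assumes the third-party `wikipedia` package and its `set_lang` and `page(...).content` calls. The helper names `fetch_page_text` and `fetch_in_all_languages` and the `wiki` parameter are hypothetical and only there for illustration. Passing the module in as an argument makes it possible to try the logic with a stand-in object and no network access.

```python
def fetch_page_text(title, lang, wiki=None):
    """Return the plain-text content of one Wikipedia page in one language."""
    if wiki is None:
        # Assumption: the third-party `wikipedia` package is installed.
        import wikipedia as wiki
    wiki.set_lang(lang)  # language code such as "en", "de", "hu" or "ro"
    return wiki.page(title).content


def fetch_in_all_languages(titles_by_lang, wiki=None):
    """Map each language code to the page text, given one title per language."""
    return {
        lang: fetch_page_text(title, lang, wiki=wiki)
        for lang, title in titles_by_lang.items()
    }
```

In practice the page lookup can fail or be ambiguous (the package raises errors such as `DisambiguationError` and `PageError`), which is one way to detect entities that cannot be located unambiguously in every language.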

Finally, we need Wikipedia pages with a lot of text. For this purpose, I chose to scrape some lists from the internet, as I assumed I could find well-written pages about well-known entities:

  1. TOP100 most influential people,
  2. TOP100 best cities to live in,
  3. TOP100 best performing companies in 2019,
  4. TOP100 pop/rock bands,
  5. TOP Sport Franchises by value and
  6. TOP100 Books.

With this approach, I ended up with 452 entities, because I only kept those whose page could be located unambiguously in all 4 languages. Adding the entities from the last list didn’t change much about the nature of the results, so I stopped adding text to the corpora at that point.

Pre-processing 101

Cleaning unnecessary characters and splitting text is this easy with NLTK’s RegexpTokenizer!

In Part 1, I elaborated on the first 3 steps to consider in text pre-processing. In this case study, the texts are lowercased immediately after they are read into memory. Moreover, numbers and special characters are removed without further ado using the RegexpTokenizer.
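As a rough illustration of this step, the sketch below lowercases a text and keeps only runs of letters, using Python's built-in `re` module as a stand-in. The pattern is an assumption, since the pattern actually used in the case study is not shown here. With NLTK, `RegexpTokenizer(WORD_PATTERN).tokenize(text.lower())` should produce the same tokens for this pattern.

```python
import re

# Assumption: a "word" is a run of Unicode letters; digits, underscores and
# punctuation act as separators. The author's actual pattern is not shown.
WORD_PATTERN = r"[^\W\d_]+"


def tokenize(text):
    """Lowercase the text and split it into letter-only tokens."""
    return re.findall(WORD_PATTERN, text.lower())


print(tokenize("In 2019, Győr had 130,000 inhabitants!"))
# ['in', 'győr', 'had', 'inhabitants']
```

A letter-based pattern like this keeps accented characters, which matters for Hungarian, German and Romanian text.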

The corpus

Image for post
Raw word counts and unique word counts for raw text.

Looking at raw word counts, it does not come as a surprise that the English Wikipedia has much more text than any of the other languages. More surprisingly, the Hungarian Wikipedia has more text than its Romanian counterpart, even though Romania has a population double that of Hungary.

From the unique counts, it seems that German and Hungarian are lexically more diverse than English or Romanian. However, this may be caused by the underlying structure of the language: Hungarian tends to use suffixes at the end of words much more than English, resulting in more unique words. Assessing lexical diversity is therefore better done after the pre-processing steps!
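To make the suffix argument concrete, here is a toy example (not taken from the corpus) that counts raw and unique words for the same four phrases in Hungarian and English. The helper `word_counts` is a hypothetical name used only for this sketch.

```python
def word_counts(tokens):
    """Return (raw word count, unique word count) for a list of tokens."""
    return len(tokens), len(set(tokens))


# "the house", "in the house", "the houses", "in the houses"
hungarian = "a ház a házban a házak a házakban".split()
english = "the house in the house the houses in the houses".split()

print(word_counts(hungarian))  # (8, 5)
print(word_counts(english))    # (10, 4)
```

The Hungarian version uses fewer words in total but more distinct word forms, because the suffixes attach to the noun instead of appearing as separate words.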

Stopword removal

Stopword removal using NLTK. NOTE: you have to download stopword resources using nltk.download!

Stopword removal mainly affects the raw word count of the corpus: it only removes words that are included in the stopword list, but these words tend to have high frequency because they serve a grammatical role.

Image for post

In the figure on the left, one can assess what portion of words remain after stopword removal. It is in line with our previous explanations that English has a relatively low value: instead of suffixes, many stopwords are used to create context around words. On the other hand, the suffix-heavy Hungarian language lost only around 25% of its words, compared to almost 40% in the case of English.
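The share of words that remain can be computed along these lines. This is a sketch with a tiny hand-written stopword set for illustration only. With NLTK, the set would instead come from something like `set(stopwords.words("english"))` after running `nltk.download("stopwords")`. The function names are hypothetical.

```python
def remove_stopwords(tokens, stopword_set):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopword_set]


def share_remaining(tokens, stopword_set):
    """Fraction of tokens that survive stopword removal."""
    if not tokens:
        return 0.0
    return len(remove_stopwords(tokens, stopword_set)) / len(tokens)


# Toy stopword set, NOT the NLTK list.
toy_stopwords = {"the", "in", "of", "a"}
tokens = "the capital of hungary lies in the heart of europe".split()

print(remove_stopwords(tokens, toy_stopwords))
# ['capital', 'hungary', 'lies', 'heart', 'europe']
print(share_remaining(tokens, toy_stopwords))  # 0.5
```

Using a set rather than a list for the stopwords keeps each membership check cheap, which adds up over a corpus of this size.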


Stemming

Stemming and stopword removal using NLTK. NOTE: you have to download resources using nltk.download!

Analyzing the effect of stemming can be done through unique word counts, as stemming does not remove any words, but makes one unique word from many, thereby reducing text dimensionality. This can be seen in the figure below:

Image for post

Stopword removal barely changes unique word counts, while stemming reduces them substantially. In accordance with previous statements, stemming has the largest effect on suffix-heavy Hungarian and the smallest effect on English. In general, stemming can reduce the dimensionality of text data by 20 to 40 percent, depending on the language (and of course the nature of the text, if it’s less general than the corpus used here).
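The effect on unique word counts can be measured by stemming every token and counting the distinct stems. The sketch below takes the stemmer as a parameter. With NLTK one could pass `SnowballStemmer("hungarian").stem`, while `toy_stem` here is a deliberately crude, hypothetical stand-in that only strips a few Hungarian-style suffixes so the example runs without extra resources.

```python
def unique_after_stemming(tokens, stem):
    """Number of distinct stems produced by applying `stem` to each token."""
    return len({stem(t) for t in tokens})


def toy_stem(word):
    """Illustrative only: strip a few suffixes; this is not a real stemmer."""
    for suffix in ("akban", "ban", "ak"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


tokens = ["ház", "házban", "házak", "házakban"]
print(len(set(tokens)))                         # 4 unique words before
print(unique_after_stemming(tokens, toy_stem))  # 1 unique stem after
```

The raw word count stays at four in this example; only the number of distinct forms shrinks, which is why the effect is measured with unique counts.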

Removal of extremely infrequent words

Image for post
Red numbers show the count of words in that bin of document frequency, while x tick labels are bin boundaries. For example, in the English figure, the first bar means that 67534 words appear in 0–45 texts in the corpus. The next bar means that 1384 words appear in 45–90 texts, etc.
Read the documentation of the filter_extremes() method carefully, as both parameters are important and have a default value!

To remove infrequent (or, on the contrary, too frequent) terms from the corpus, I advise using the gensim package. You should create a gensim.corpora.Dictionary object, supplying the initializer with a nested list in which each element is a document, itself a list of tokens. Calling the filter_extremes method of that dictionary with the right parameters will do the trick for you.

Removing these words can achieve a great deal of dimensionality reduction, but you may be removing the most important words from your text. If you have a balanced, binary classification problem, removing words under 0.5% document frequency probably will not matter, as such words appear in too few documents to be a decisive factor. However, in a multi-class problem these sparse words may contain the most information!
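To show what filtering by document frequency does, here is a pure-Python sketch that mimics the idea behind `filter_extremes`. It is not gensim's implementation. The parameter names `no_below` (minimum number of documents) and `no_above` (maximum fraction of documents) and their defaults follow gensim's, to the best of my knowledge, and the sketch ignores gensim's additional `keep_n` cap. With gensim itself, the equivalent would be roughly `Dictionary(documents)` followed by `filter_extremes(no_below=..., no_above=...)`.

```python
from collections import Counter


def document_frequencies(documents):
    """Count in how many documents each token appears."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # each document counts a token once
    return df


def filter_by_document_frequency(documents, no_below=5, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    `no_above` (a fraction) of all documents."""
    df = document_frequencies(documents)
    max_docs = int(no_above * len(documents))
    keep = {tok for tok, n in df.items() if no_below <= n <= max_docs}
    return [[t for t in doc if t in keep] for doc in documents]


docs = [
    ["budapest", "city", "river"],
    ["vienna", "city", "river"],
    ["prague", "city", "castle"],
    ["berlin", "city", "wall"],
]
print(filter_by_document_frequency(docs, no_below=2, no_above=0.5))
# [['river'], ['river'], [], []]
```

In this toy corpus "city" is dropped for being in every document and the city names are dropped for appearing only once, which shows how both thresholds can remove words you might care about.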

Image for post

As we can see from the document frequency histograms, removing words with low document frequency drastically decreases the number of unique words in the corpus. This is true regardless of the language: the four corpora react to this procedure in relatively similar ways. Nonetheless, the decrease in unique words is even more pronounced for the lexically more diverse languages, namely German (and Hungarian).

Removal of extremely frequent words

Image for post

In our case, only a tiny number of words appear in at least 50% of the documents in any of the languages. This is because we are analyzing a corpus of many different domains. For a corpus about a specific topic, however, domain-specific stopwords can be removed using this approach.



Starschema Blog
