Text preprocessing in different languages for Natural Language Processing in Python

Part II — Case Study

Mor Kapronczay
Oct 15 · 8 min read
Natural Language Processing is a catchy phrase these days

This is Part 2 of a pair of tutorials on text pre-processing in Python. In the first part, I laid out the theoretical foundations. In this second part, I’ll demonstrate the steps described in Part 1 in Python on texts in different languages, and discuss how their effects differ because of the different structures of the languages.

If you haven’t, you should first read Part 1!

You can check out the code on GitHub!

Relevance

In the first part, I outlined text pre-processing principles based on a framework from an academic article. The underlying goal of all these techniques was to reduce text data dimensionality but keep the relevant information incorporated in the text. In this second part, I will present the effect of the following techniques on two central properties of text, word count and unique word count — the latter representing the dimensionality of text data:

  1. Removing stopwords
  2. Removing both extremely frequent and infrequent words
  3. Stemming, an automated technique to reduce words to their base form

The idea came from another academic article where the authors examined how text pre-processing techniques in different languages affect the results of the Wordfish algorithm. Being a method for ideological scaling, Wordfish can estimate which speakers are in the center based on their word use, and which speakers can be considered extremists on either side of the spectrum. You can see a result about German political parties below:

source: https://www.researchgate.net/figure/Results-of-the-Wordfish-Analysis_fig3_317616075

In addition, they also report the effect of text pre-processing techniques on unique word counts of text, i.e. how stemming words lowers the number of unique words in the corpus for each language. I advise you to check out their results in this paper, if you are interested!

Since the researchers used substantially different kinds of political text, the results for the different languages are not perfectly comparable. For example, they analyzed 61 PM speeches in parliament from Denmark from 4 different parties and 104 written motions from party conferences of Italy from 15 different parties. That’s why I decided to create a comparable corpus in 4 languages to carry out the analysis. The corpora not only need to be comparable, but also have to include text from many domains in order to produce results that generalize well for these languages. Nevertheless, it is important to note that specific corpora can react very differently to these techniques!

Creating the corpora

In order to conduct the analysis, we need a large amount of text in different languages. As I am Hungarian, I chose to compare English, German, Hungarian and Romanian. The analysis can easily be done for other languages as well. In addition, the text has to cover a broad range of topics and has to be about at least roughly the same things in each language. There are 2 approaches I considered:

  1. Books in different languages, and
  2. Wikipedia in different languages.
Wikipedia content extraction made simple with this python package.

Acquiring the text of books in languages other than English turned out to be a more complicated task and, above all, harder to automate. However, with the Wikipedia Python package, automated access to the content of Wikipedia pages in different languages is straightforward, as the accompanying gist shows.
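A minimal sketch of such an extraction, assuming the third-party wikipedia package (pip install wikipedia). The helper names fetch_with_wikipedia and collect_entity are hypothetical and not taken from the original gist. The fetcher is passed in as a parameter so that the "keep only entities found in all 4 languages" logic can be tried without network access.

```python
LANGUAGES = ["en", "de", "hu", "ro"]  # Wikipedia language codes


def fetch_with_wikipedia(lang, title):
    """Fetch the plain text of one page, or None if it cannot be resolved."""
    import wikipedia  # third-party: pip install wikipedia

    wikipedia.set_lang(lang)
    try:
        # auto_suggest=False: use the title as given, no fuzzy matching
        return wikipedia.page(title, auto_suggest=False).content
    except (wikipedia.exceptions.DisambiguationError,
            wikipedia.exceptions.PageError):
        return None  # ambiguous or missing page


def collect_entity(titles, fetch=fetch_with_wikipedia):
    """Return {lang: text} only if the page is found in every language.

    titles maps a language code to the page title in that language.
    """
    texts = {}
    for lang in LANGUAGES:
        title = titles.get(lang)
        text = fetch(lang, title) if title else None
        if not text:
            return None  # drop the entity unless all four languages resolve
        texts[lang] = text
    return texts
```

Calling collect_entity with a dictionary of per-language titles returns the four texts, or None when any language is missing, which mirrors the filtering described in the next section.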

Finally, we need Wikipedia pages with a lot of text. For this purpose, I chose to scrape some lists from the internet, as I assumed I would find well-written pages about well-known entities:

  1. TOP100 most influential people,
  2. TOP100 best cities to live in,
  3. TOP100 best performing companies in 2019,
  4. TOP100 pop/rock bands,
  5. TOP Sport Franchises by value and
  6. TOP100 Books.

With this approach, I ended up with 452 entities, because I only kept those where the page could be located unambiguously in all 4 languages. Adding the entities from the last list did not change the nature of the results much, so I stopped adding text to the corpora at that point.

Pre-processing 101

Cleaning unnecessary characters and splitting text is this easy with NLTK’s RegexpTokenizer!

In Part 1, I elaborated on the first 3 steps to consider in text pre-processing. In this case study, the texts are lowercased immediately after being read into memory. Moreover, numbers and special characters are removed without further ado using the RegexpTokenizer.
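As a rough illustration of these two steps, here is a sketch that lowercases the text and keeps only runs of letters. The pattern and the function name preprocess are my assumptions, not necessarily what the original gist used. If NLTK is not installed, the sketch falls back to re.findall with the same pattern, which yields the same tokens.

```python
import re

# Runs of Unicode letters only: digits, underscores, punctuation and other
# special characters act as separators and are dropped.
LETTERS_ONLY = r"[^\W\d_]+"

try:
    from nltk.tokenize import RegexpTokenizer  # third-party: pip install nltk
    _tokenize = RegexpTokenizer(LETTERS_ONLY).tokenize
except ImportError:
    # Fallback without NLTK: re.findall with the same pattern
    _tokenize = re.compile(LETTERS_ONLY).findall


def preprocess(text):
    """Lowercase the raw text, then keep only alphabetic tokens."""
    return _tokenize(text.lower())


print(preprocess("In 2019, Győr hosted 3 festivals!"))
# ['in', 'győr', 'hosted', 'festivals']
```

Because the pattern is Unicode-aware, accented letters such as those in Hungarian or Romanian stay inside their words instead of splitting them.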

The corpus

Raw word counts and unique word counts for raw text.

Looking at raw word counts, it does not come as a surprise that the English Wikipedia has much more text than any of the other languages. However, the Hungarian Wikipedia has more text than its Romanian counterpart, even though Romania has a population double that of Hungary.

From unique counts, it seems that German and Hungarian are lexically more diverse than English or Romanian. However, this may be caused by the underlying structure of the language: Hungarian tends to use suffixes at the end of words much more than English, resulting in more unique words. Assessing lexical diversity is therefore better done after the pre-processing steps!
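The two measures used throughout this case study can be computed with a few lines of standard-library Python. This is only a sketch on made-up toy sentences, and corpus_counts is a hypothetical helper name. It shows how suffixes can produce the same number of unique words from fewer words overall.

```python
from collections import Counter


def corpus_counts(documents):
    """Return (word count, unique word count) for a list of token lists."""
    frequencies = Counter(token for doc in documents for token in doc)
    return sum(frequencies.values()), len(frequencies)


# Toy data: English uses separate function words, Hungarian uses suffixes
english = [["in", "the", "house"], ["in", "the", "garden"],
           ["from", "the", "house"], ["from", "the", "garden"]]
hungarian = [["a", "házban"], ["a", "kertben"],
             ["a", "házból"], ["a", "kertből"]]

print(corpus_counts(english))    # (12, 5)
print(corpus_counts(hungarian))  # (8, 5)
```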

Stopword removal

The first text pre-processing technique to demonstrate is stopword removal. It is a basic methodology: most NLP packages like NLTK come with built-in stopword lists for the supported languages. Therefore, one just has to scan over the document and remove any word that is present in the stopword list:

Stopword removal using NLTK. NOTE: you have to download stopword resources using nltk.download!

Stopword removal mainly affects the raw word count of the corpus, as it only removes words that are included in the stopword list. These words, however, tend to have high frequency, as they serve a grammatical role.

In the accompanying figure, one can assess what portion of words remains after stopword removal. In line with our previous explanations, English has a relatively low value: instead of suffixes, many stopwords are used to create context around words. On the other hand, the suffix-heavy Hungarian language lost only around 25% of its words, compared to almost 40% in the case of English.
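A minimal sketch of the removal step and of the "share of words remaining" measure. The function names are hypothetical. load_stopwords relies on NLTK's stopwords corpus, which has to be downloaded once with nltk.download("stopwords"). The toy example uses a hand-made stopword set instead, so it runs without NLTK.

```python
def load_stopwords(language):
    """Load NLTK's built-in list, e.g. for "english" or "hungarian"."""
    from nltk.corpus import stopwords  # needs nltk.download("stopwords") once

    return set(stopwords.words(language))


def remove_stopwords(tokens, stop_set):
    """Drop every token that is in the stopword set."""
    return [token for token in tokens if token not in stop_set]


def share_remaining(tokens, stop_set):
    """Fraction of the word count that survives stopword removal."""
    return len(remove_stopwords(tokens, stop_set)) / len(tokens)


# Toy example with a hand-made stopword set (not NLTK's list)
tokens = ["the", "city", "is", "on", "the", "river"]
toy_stopwords = {"the", "is", "on"}
print(remove_stopwords(tokens, toy_stopwords))  # ['city', 'river']
```

Converting the list to a set keeps each membership check constant-time, which matters on a corpus of this size.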

Stemming

Stemming is an automated technique to reduce words to their base form. It is based on language-specific rules. In this article, the Porter stemming algorithm as implemented in NLTK is used, whose stemming rules are publicly available.

Stemming and stopword removal using NLTK. NOTE: you have to download resources using nltk.download!

Analyzing the effect of stemming can be done through unique word counts, as stemming does not remove any words, but makes one unique word from many, thereby reducing text dimensionality. This can be seen in the figure below:

Stopword removal barely changes unique word counts, while stemming reduces them substantially. In accordance with previous statements, stemming has the most effect on suffix-heavy Hungarian, and the least effect on English. In general, stemming can reduce the dimensionality of text data by 20 to 40 percent, depending on the language (and of course on the nature of the text, if it’s less general than the corpus used here).
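The reduction in unique words can be measured as sketched below. Note the assumption: the original Porter stemmer covers English only, so make_stemmer uses NLTK's SnowballStemmer, which offers Porter-style stemmers for English, German, Hungarian and Romanian. I cannot confirm this is the exact class behind the figures above. The function names and the toy stemmer are hypothetical.

```python
def make_stemmer(language):
    """Return a stem function from NLTK, e.g. for "english" or "hungarian"."""
    from nltk.stem.snowball import SnowballStemmer  # third-party: pip install nltk

    return SnowballStemmer(language).stem


def unique_word_reduction(tokens, stem):
    """Share of unique words eliminated when every token is stemmed."""
    before = len(set(tokens))
    after = len({stem(token) for token in tokens})
    return 1 - after / before


# Toy stemmer standing in for the real one, so the example runs without NLTK
toy_stems = {"houses": "house", "housing": "house", "cities": "city"}
tokens = ["house", "houses", "housing", "city", "cities"]
reduction = unique_word_reduction(tokens, lambda word: toy_stems.get(word, word))
print(round(reduction, 2))  # 0.6, since 5 unique words collapse into 2 stems
```

With NLTK installed, passing make_stemmer("hungarian") instead of the toy function applies the real rules to a Hungarian token list.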

Removal of extremely infrequent words

As mentioned earlier in these articles, word frequency tends to have a long-tailed distribution: many words appear quite infrequently in text. The same is true for document frequency: many words appear in only a small number of documents. This can be seen in the document frequency histograms for each language.
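Document frequency and its histogram can be sketched with the standard library alone. The function names and the equal-width binning are my assumptions about how such a histogram could be built, not the code behind the original figures.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each word, the number of documents it occurs in."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # set(): count each word once per document
    return df


def df_histogram(df, n_docs, n_bins=10):
    """Number of words per equal-width bin of document frequency."""
    width = n_docs / n_bins
    bins = [0] * n_bins
    for count in df.values():
        # Words present in every document fall into the last bin
        index = min(int(count / width), n_bins - 1)
        bins[index] += 1
    return bins


docs = [["a", "b", "c"], ["a", "b", "d"], ["a", "e"], ["a", "f"]]
df = document_frequency(docs)
print(df_histogram(df, n_docs=len(docs), n_bins=2))  # [4, 2]
```

Even in this toy corpus most words sit in the lowest bin, which is the long tail in miniature.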

Red numbers show the count of words in that bin of document frequency, while x tick labels are bin boundaries. For example: in the English figure, the first bar means 67534 words appear in 0–45 texts in the corpus. The next bar means 1384 words appear in 45–90 texts, etc.

Read the filter_extremes() method documentation carefully, as both parameters are important and have a default value!

To remove infrequent (or, on the contrary, too frequent) terms from the corpus, I advise using the gensim package. You should create a gensim.corpora.Dictionary object, supplying the initializer with a nested list, where each list element is a document, which is itself a list of tokens. For that dictionary, calling the filter_extremes method with the right parameters will do the trick for you.
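The steps above can be sketched as follows, assuming gensim is installed. filter_vocabulary is a hypothetical wrapper name and the thresholds are illustrative. The defaults of filter_extremes are no_below=5, no_above=0.5 and keep_n=100000, so leaving any of them out still filters the vocabulary.

```python
def filter_vocabulary(documents, no_below, no_above):
    """Drop tokens that appear in too few or too many documents."""
    from gensim.corpora import Dictionary  # third-party: pip install gensim

    dictionary = Dictionary(documents)  # documents: list of token lists
    dictionary.filter_extremes(
        no_below=no_below,  # keep tokens found in at least this many documents
        no_above=no_above,  # ...and in at most this fraction of all documents
        keep_n=None,        # None disables the default cap of 100000 tokens
    )
    kept = set(dictionary.token2id)
    return [[token for token in doc if token in kept] for doc in documents]


documents = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["a", "b", "e"]]
# filter_vocabulary(documents, no_below=2, no_above=0.75) keeps only "b":
# "a" is in every document, while "c", "d" and "e" appear in just one.
```

Note that no_below is an absolute number of documents while no_above is a fraction of the corpus, which is easy to mix up.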

Removing these words can reduce dimensionality a great deal, but you may be removing the most important words from your text. If you have a balanced, binary classification problem, removing words under 0.5% document frequency probably will not matter, as those words appear in too few documents to be a decisive factor. However, in a multi-class problem, these sparse words may contain the most information!

As we can see from the document frequency histograms, removing words with low document frequency drastically decreases the number of unique words in the corpus. This is true regardless of the language; the four corpora react relatively similarly to this procedure. Nonetheless, the decrease in unique words is even more pronounced for the lexically more diverse languages, namely German (and Hungarian).

Removal of extremely frequent words

The last procedure I aim to cover is a methodology to find domain- or corpus-specific stopwords. The idea is that if a word is present in most of the documents in a corpus, it might not convey any information about a document in that specific corpus. Nevertheless, in-document frequency can differ substantially for a word present in all the documents, and that can contain information as well!

In our case, only a tiny number of words appear in 50% of the documents in any of the languages. This is because we are analyzing a corpus of many different domains. For a corpus about a specific topic, however, domain-specific stopwords can be removed using this approach.
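A sketch of how such candidates could be listed, together with the spread of their in-document counts, so that words which are widespread but unevenly used are not dropped blindly. The function name, the 0.5 threshold and the toy documents are illustrative assumptions.

```python
from collections import Counter


def frequent_word_candidates(documents, min_share=0.5):
    """Words occurring in at least min_share of the documents.

    Returns {word: (min, max)} with the lowest and highest in-document count
    among the documents that contain the word.
    """
    per_doc_counts = [Counter(doc) for doc in documents]
    df = Counter(word for counts in per_doc_counts for word in counts)
    candidates = {}
    for word, n_docs_with_word in df.items():
        if n_docs_with_word / len(documents) >= min_share:
            occurrences = [c[word] for c in per_doc_counts if word in c]
            candidates[word] = (min(occurrences), max(occurrences))
    return candidates


docs = [["goal", "match", "goal", "goal"], ["match", "team"], ["match", "goal"]]
print(frequent_word_candidates(docs))
# {'goal': (1, 3), 'match': (1, 1)}
```

Here "match" looks like a safe stopword candidate, while "goal" varies between 1 and 3 occurrences per document and may still carry signal.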

Takeaways

A general takeaway is that languages can differ substantially in how they react to text pre-processing techniques. Stopword removal removes more words from languages where suffixes are not used extensively, while stemming affects suffix-heavy languages more. Be careful when removing less frequent words: you may be removing too many, and they may be very important! Domain- or corpus-specific stopwords can be found by searching for words that appear in all texts. It is important to note, however, that in-document frequency for words present in all texts can be a decisive factor for a classification problem!

References:

Greene, Z., Ceron, A., Schumacher, G., & Fazekas, Z. (2016, November 1). The Nuts and Bolts of Automated Text Analysis. Comparing Different Document Pre-Processing Techniques in Four Countries. https://doi.org/10.31219/osf.io/ghxj8

Starschema Blog

Data contains intelligence that can change the world — we help people discover, manage and use this intelligence.
