A comprehensive guide to text pre-processing with Python

Part I — Theoretical Background

Mor Kapronczay
Aug 1, 2019 · 5 min read

This is Part 1 of a pair of tutorials on text pre-processing in Python. In this first part, I’ll lay out the theoretical foundations. In the second part, I’ll demonstrate the steps described below in Python on texts in different languages and discuss how their effects differ because of the differing structures of those languages.

Introduction

Text mining is an important topic for both business and academia. Business applications include chatbots, automated news tagging and many others. In academia, political texts are often the subject of rigorous analysis. I believe that these fields have a lot to learn from each other, so I decided to create this guide, based on an academic paper, about text pre-processing for business-oriented readers.

If you have ever done a machine learning project, you are probably familiar with the concepts of data cleaning and feature engineering. If you have ever done a text mining project (not necessarily one involving machine learning), you surely know that these concepts take on additional relevance there. Text pre-processing, in my view, is one of the most interesting fields in which they are applied.

From text to data

“All models are wrong, but some are useful.”

This saying is attributed to George E. P. Box and has become something of a cliché in statistics. During text mining, when creating data out of raw text, the practitioner gains a deep understanding of what this concise sentence refers to: one needs to create a representation of the text that is a useful simplification.

In a sense, text mining is meaning mining. Finding the meaning in text can be extremely hard, and the methodology needed is deeply context-dependent. A text mining practitioner has to remove unnecessary information from the text (for example, words that do not carry relevant information) as well as redundancy (where two word forms refer to the same meaning). To achieve both, text pre-processing steps need to be performed first.

Consequently, text pre-processing can be thought of as a form of dimensionality reduction. In a text mining problem, dimensionality refers to the number of unique tokens in the pre-processed text. One aims to minimize this number while bearing in mind the trade-off against the information lost.
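To make the notion of dimensionality concrete, here is a minimal sketch (the two sentences are made up for illustration) that counts unique tokens before and after a simple normalization step; fewer unique tokens means fewer columns in the eventual matrix.

```python
import string

# Two made-up documents, just for illustration.
docs = [
    "The cat sat on the mat.",
    "The Cat chased the dog!",
]

# Naive whitespace tokenization with no pre-processing.
raw_tokens = [token for doc in docs for token in doc.split()]
print(len(set(raw_tokens)))  # 9 unique tokens: casing and punctuation inflate the count

# The same corpus after lowercasing and stripping punctuation.
cleaned_tokens = [
    token.lower().strip(string.punctuation)
    for doc in docs
    for token in doc.split()
]
print(len(set(cleaned_tokens)))  # 7 unique tokens: the dimensionality shrank
```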

In most cases, a corpus, a particular set of texts that are somehow connected to each other, is represented using a term-frequency matrix. In this matrix, every row corresponds to a document in the corpus, and every column to a unique token (word) of the text. A value X(i,j) in the matrix means that word j occurs X(i,j) times in document i. This is called the bag-of-words approach, as the text is represented by word counts, regardless of word position inside the document.
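As a sketch of what such a matrix looks like, a term-frequency matrix for a toy two-document corpus can be built with scikit-learn's CountVectorizer. The library choice and the toy corpus are mine, not something the article prescribes (on scikit-learn versions before 1.0 the column-name call is get_feature_names() instead).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up two-document corpus.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
term_frequency_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # columns: the unique tokens
print(term_frequency_matrix.toarray())     # rows: documents, values: word counts X(i,j)
```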

Word clouds are just a fancy way to show word counts in a corpus. More frequent terms are represented with bigger font size.
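If you want to reproduce such a figure yourself, one option is the third-party wordcloud package; this is only a sketch with a made-up input string, not a tool the article itself recommends.

```python
from wordcloud import WordCloud

# Word frequencies in this made-up string drive the font sizes in the cloud.
text = "text mining mining meaning meaning meaning pre-processing"

cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
cloud.to_file("word_counts.png")  # more frequent terms are drawn in a bigger font
```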

For text mining use cases, dimensionality reduction needs to be applied differently than elsewhere. The next section will show that during text pre-processing one has at least 128 ways to proceed, if each pre-processing step is considered a binary choice. In practice, however, the choices are rarely that simple, so the number of options is even larger.
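The figure of 128 presumably comes from treating seven pre-processing decisions as independent on/off switches, since 2 to the power of 7 is 128. The step names below follow the choices examined in the referenced Denny and Spirling paper and are listed only to make the combinatorics concrete.

```python
from itertools import product

# Seven pre-processing decisions, each treated as an on/off switch.
steps = [
    "remove punctuation",
    "remove numbers",
    "lowercase",
    "stem",
    "remove stopwords",
    "include n-grams",
    "drop infrequent terms",
]

# Every combination of switches is one possible pre-processing specification.
pipelines = list(product([False, True], repeat=len(steps)))
print(len(pipelines))  # 128
```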

Text pre-processing steps

The following steps are discussed from the perspective of a text miner who uses a bag-of-words representation of text. Please note that this process applies only to bag-of-words representations; other types of representation require different pre-processing!
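Since the step-by-step walkthrough is deferred to Part 2, here is a hedged sketch of one common bag-of-words pipeline (lowercasing, stripping punctuation and numbers, stopword removal, stemming) using NLTK. The library and the exact order of steps are my assumptions for illustration, not a prescription from the article.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

nltk.download("stopwords", quiet=True)  # one-time download of the stopword lists

def preprocess(document, language="english"):
    # Lowercase, then drop everything that is not a letter or whitespace
    # (this removes punctuation and numbers in one pass).
    document = document.lower()
    document = re.sub(r"[^a-z\s]", " ", document)
    tokens = document.split()

    # Remove stopwords, then reduce the remaining tokens to their stems.
    stops = set(stopwords.words(language))
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(token) for token in tokens if token not in stops]

print(preprocess("The 2 cats were sitting on the mats."))
# ['cat', 'sit', 'mat'] after stopword removal and stemming
```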

Takeaways

As you can see, deciding what information to keep and what to drop is the name of the game in text pre-processing. This is because there is a desperate need for dimensionality reduction, given the large number of unique words in natural language. Never take a rule of thumb for granted: every text mining problem is unique in some way, so your analysis may require substantially different considerations.

In Part 2, these steps are performed in Python on a comparable corpus in four different languages, showing how the effects of these pre-processing steps differ across languages.

References:

Denny, M., & Spirling, A. (2018). Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It. Political Analysis, 26(2), 168–189. doi:10.1017/pan.2017.44
