Text Preprocessing in Python: Steps, Tools, and Examples

by Olga Davydova, Data Monsters

In this post, we will talk about the basic steps of text preprocessing. These steps are needed to transfer text from human language into a machine-readable format for further processing. We will also discuss text preprocessing tools.

After a text is obtained, we start with text normalization. Text normalization includes:

  • converting all letters to lower or upper case
  • converting numbers into words or removing numbers
  • removing punctuation, accent marks and other diacritics
  • removing whitespace
  • expanding abbreviations
  • removing stop words, sparse terms, and particular words
  • text canonicalization

We will describe text normalization steps in detail below.

Convert text to lowercase

Example 1. Convert text to lowercase
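
A minimal sketch of this step; the input string is an illustrative assumption.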

Python code:
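
    input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
    # lower() returns a copy of the string with all cased characters lowercased
    input_str = input_str.lower()
    print(input_str)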

Output:
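
    the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.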

Remove numbers

Remove numbers if they are not relevant to your analysis. Usually, regular expressions are used to remove numbers.

Example 2. Number removal
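
A minimal sketch using the re module; the input string is an illustrative assumption.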

Python code:
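
    import re

    input_str = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
    # \d+ matches one or more consecutive digits
    result = re.sub(r"\d+", "", input_str)
    print(result)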

Output:
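
    Box A contains  red and  white balls, while Box B contains  red and  blue balls.

Note that removing a digit leaves its surrounding spaces in place; a follow-up re.sub(r"\s+", " ", result) would collapse them.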

Remove punctuation

The following code removes the set of symbols [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~] (Python’s string.punctuation):

Example 3. Punctuation removal
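
A minimal sketch using string.punctuation and str.translate; the input string is an illustrative assumption.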

Python code:
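
    import string

    input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"
    # build a translation table that maps every punctuation character to None
    result = input_str.translate(str.maketrans("", "", string.punctuation))
    print(result)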

Output:
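
    This is an example of string with punctuation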

Remove whitespace

To remove leading and trailing spaces, you can use the strip() function:

Example 4. Whitespace removal
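
A minimal sketch; the input string, padded with spaces and tabs, is an illustrative assumption.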

Python code:
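
    input_str = " \t a string example\t "
    # strip() removes leading and trailing whitespace, including tabs
    input_str = input_str.strip()
    print(input_str)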

Output:
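
    a string example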

Tokenization

Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and others can be considered tokens. Several tools for implementing tokenization are described in the “Tokenization” sheet of this table, and a quick sketch with NLTK follows below.
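
As a sketch, NLTK provides both sentence-level and word-level tokenizers (the sample text is an illustrative assumption):

    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize

    # nltk.download("punkt")  # tokenizer models, needed once

    text = "Good muffins cost $3.88 in New York. Please buy me two of them."
    print(sent_tokenize(text))
    # -> ['Good muffins cost $3.88 in New York.', 'Please buy me two of them.']
    print(word_tokenize(text))
    # -> ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
    #     'Please', 'buy', 'me', 'two', 'of', 'them', '.']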

Tokenization tools

Remove stop words

“Stop words” are the most common words in a language, such as “the”, “a”, “on”, “is”, and “all”. These words do not carry important meaning and are usually removed from texts. It is possible to remove stop words using the Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing.

Example 7. Stop words removal
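
A minimal sketch; the input sentence is an illustrative assumption, and the stop word matching is done case-insensitively.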

Code:
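
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # nltk.download("stopwords"); nltk.download("punkt")  # needed once

    input_str = "NLTK is a leading platform for building Python programs to work with human language data."
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(input_str)
    # keep only the tokens that are not in the stop word list
    result = [word for word in tokens if word.lower() not in stop_words]
    print(result)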

Output:
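
    ['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']

The final period survives because punctuation is not in the stop word list; punctuation removal (Example 3) takes care of it.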

scikit-learn also provides a stop word list:
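
For example (in current scikit-learn versions the list lives in sklearn.feature_extraction.text; the sample tokens are an illustrative assumption):

    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    tokens = "this is a sample sentence".split()
    print([w for w in tokens if w not in ENGLISH_STOP_WORDS])
    # -> ['sample', 'sentence']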

It’s also possible to use spaCy, a free open-source library:
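
For example (the sample tokens are again an illustrative assumption):

    from spacy.lang.en.stop_words import STOP_WORDS

    tokens = "this is a sample sentence".split()
    print([w for w in tokens if w not in STOP_WORDS])
    # -> ['sample', 'sentence']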

Remove sparse terms and particular words

In some cases, it’s necessary to remove sparse terms or particular words from texts. This task can be done using stop words removal techniques considering that any group of words can be chosen as the stop words.
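
A minimal sketch; the word list and tokens are illustrative assumptions:

    # any custom set of unwanted words can play the role of a stop word list
    unwanted = {"rare_term", "particular"}
    tokens = ["this", "particular", "corpus", "has", "one", "rare_term"]
    print([w for w in tokens if w not in unwanted])
    # -> ['this', 'corpus', 'has', 'one']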

Stemming

Stemming is a process of reducing words to their word stem, base, or root form (for example, “books” to “book”, “looked” to “look”). The two main algorithms are the Porter stemming algorithm, which removes common morphological and inflectional endings from words [14], and the Lancaster stemming algorithm, which is more aggressive. Some stemmers are described in the “Stemming” sheet of the table.

Stemming tools

Example 8. Stemming using NLTK:
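
A minimal sketch using the Porter stemmer; the input sentence is an illustrative assumption, and recent NLTK versions lowercase tokens while stemming.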

Code:
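
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    stemmer = PorterStemmer()
    input_str = "There are several types of stemming algorithms."
    # stem each token and print the results on one line
    print(" ".join(stemmer.stem(word) for word in word_tokenize(input_str)))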

Output:
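
    there are sever type of stem algorithm .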

Lemmatization

The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As opposed to stemming, lemmatization does not simply chop off inflections. Instead it uses lexical knowledge bases to get the correct base forms of words.

Lemmatization tools are provided by many of the libraries described above: NLTK (WordNet Lemmatizer), spaCy, TextBlob, Pattern, gensim, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), Illinois Lemmatizer, and DKPro Core.

Example 9. Lemmatization using NLTK:
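
The input string below is inferred from the documented output. Producing “be have do language city mouse” requires per-word part-of-speech handling, so this sketch tries the verb lemma first and falls back to the noun lemma; the exact strategy is an assumption.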

Code:
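
    import nltk
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    # nltk.download("wordnet")  # needed once

    lemmatizer = WordNetLemmatizer()
    input_str = "been had done languages cities mice"

    lemmas = []
    for word in word_tokenize(input_str):
        lemma = lemmatizer.lemmatize(word, pos="v")  # try the verb lemma first
        if lemma == word:
            lemma = lemmatizer.lemmatize(word, pos="n")  # fall back to the noun lemma
        lemmas.append(lemma)
    print(" ".join(lemmas))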

Output:

    be have do language city mouse

Part-of-speech tagging (POS)

Part-of-speech tagging aims to assign parts of speech to each word of a given text (such as nouns, verbs, adjectives, and others) based on its definition and its context. There are many tools containing POS taggers including NLTK, spaCy, TextBlob, Pattern, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), FreeLing, Illinois Part of Speech Tagger, and DKPro Core.

Example 10. Part-of-speech tagging using TextBlob:
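
A minimal sketch; the input string is an illustrative assumption, and the exact tags can vary slightly across tagger versions.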

Code:
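
    from textblob import TextBlob

    input_str = "Parts of speech examples: an article, to write, interesting, easily, and, of"
    result = TextBlob(input_str)
    # .tags returns a list of (token, part-of-speech tag) pairs
    print(result.tags)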

Output:
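
    [('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]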

Chunking (shallow parsing)

Chunking is a natural language process that identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and links them to higher order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.) [23]. Chunking tools: NLTK, TreeTagger chunker, Apache OpenNLP, General Architecture for Text Engineering (GATE), FreeLing.

Example 11. Chunking using NLTK:
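
A two-step sketch follows; the sentence and the noun-phrase grammar are illustrative assumptions.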

The first step is to determine the part of speech for each word:

Code:
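
    import nltk
    from nltk.tokenize import word_tokenize

    # nltk.download("averaged_perceptron_tagger")  # needed once

    input_str = "A black television and a white stove were bought for the new apartment of John"
    result = nltk.pos_tag(word_tokenize(input_str))
    print(result)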

Output:
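
    [('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]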

The second step is chunking:

Code:
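
    # a noun phrase (NP) here is an optional determiner, any number of adjectives, and a noun
    reg_exp = "NP: {<DT>?<JJ>*<NN>}"
    rp = nltk.RegexpParser(reg_exp)
    result = rp.parse(result)  # chunk the tagged tokens from the first step
    print(result)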

Output:
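
    (S
      (NP A/DT black/JJ television/NN)
      and/CC
      (NP a/DT white/JJ stove/NN)
      were/VBD
      bought/VBN
      for/IN
      (NP the/DT new/JJ apartment/NN)
      of/IN
      John/NNP)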

It’s also possible to draw the sentence tree structure by calling result.draw().

Named entity recognition

Named-entity recognition (NER) aims to find named entities in text and classify them into pre-defined categories (names of persons, locations, organizations, times, etc.).

Named-entity recognition tools such as NLTK, spaCy, General Architecture for Text Engineering (GATE) with ANNIE, Apache OpenNLP, Stanford CoreNLP, DKPro Core, MITIE, Watson Natural Language Understanding, TextRazor, and FreeLing are described in the “NER” sheet of the table.

NER Tools

Example 12. Named-entity recognition using NLTK:
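
A minimal sketch; the sentence is an illustrative assumption, and the exact entity labels depend on the model data shipped with NLTK.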

Code:
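
    import nltk
    from nltk.tokenize import word_tokenize

    # nltk.download("maxent_ne_chunker"); nltk.download("words")  # needed once

    input_str = "Bill works for Apple so he went to Boston for a conference."
    # ne_chunk labels named entities in a POS-tagged sentence
    print(nltk.ne_chunk(nltk.pos_tag(word_tokenize(input_str))))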

Output:
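
    (S
      (PERSON Bill/NNP)
      works/VBZ
      for/IN
      (ORGANIZATION Apple/NNP)
      so/IN
      he/PRP
      went/VBD
      to/TO
      (GPE Boston/NNP)
      for/IN
      a/DT
      conference/NN
      ./.)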

Coreference resolution (anaphora resolution)

Pronouns and other referring expressions should be connected to the right individuals. Coreference resolution finds the mentions in a text that refer to the same real-world entity. For example, in the sentence “Andrew said he would buy a car,” the pronoun “he” refers to “Andrew”. The coreference resolution tools Stanford CoreNLP, spaCy, Open Calais, and Apache OpenNLP are described in the “Coreference resolution” sheet of the table.

Coreference resolution tools

An example of coreference resolution using xrenner can be found here.

Collocation extraction

Collocations are word combinations occurring together more often than would be expected by chance. Collocation examples are “break the rules,” “free time,” “draw a conclusion,” “keep in mind,” “get ready,” and so on.

Collocation extraction tools

Example 13. Collocation extraction using ICE [51]
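
ICE is the Idiom and Collocation Extractor described in [51]. Since its interface is not reproduced here, the sketch below substitutes NLTK’s collocation finder, which ranks bigrams by pointwise mutual information (PMI); the corpus and frequency threshold are illustrative assumptions.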

Code:
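
    import nltk
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
    from nltk.corpus import webtext

    # nltk.download("webtext")  # needed once

    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(webtext.words("grail.txt"))
    finder.apply_freq_filter(3)  # ignore bigrams that occur fewer than 3 times
    # the ten bigrams with the highest PMI scores
    print(finder.nbest(bigram_measures.pmi, 10))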

Output:
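
The ten corpus bigrams ranked highest by PMI, as a list of (word, word) tuples; the exact pairs depend on the chosen corpus and threshold.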

Relationship extraction

Relationship extraction makes it possible to obtain structured information from unstructured sources such as raw text. Strictly speaking, it identifies relations (e.g., acquisition, spouse, employment) among named entities (e.g., people, organizations, locations). For example, from the sentence “Mark and Emily married yesterday,” we can extract the information that Mark is Emily’s husband.

An example of relationship extraction using NLTK can be found here.
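
As a self-contained alternative, here is the classic sketch from the NLTK book: it scans POS-tagged, named-entity-chunked documents from the IEER corpus for organization-location pairs linked by the word “in”.

    import re
    import nltk
    from nltk.corpus import ieer
    from nltk.sem import relextract

    # nltk.download("ieer")  # needed once

    # require an "in" between the two entities, avoiding gerunds such as "-ing in"
    IN = re.compile(r".*\bin\b(?!\b.+ing)")
    for doc in ieer.parsed_docs("NYT_19980315"):
        for rel in relextract.extract_rels("ORG", "LOC", doc, corpus="ieer", pattern=IN):
            print(relextract.rtuple(rel))

Each printed line pairs an organization with a location, for example [ORG: 'WHYY'] 'in' [LOC: 'Philadelphia'].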

Summary

In this post, we talked about text preprocessing and described its main steps, including normalization, tokenization, stemming, lemmatization, part-of-speech tagging, chunking, named-entity recognition, coreference resolution, collocation extraction, and relationship extraction. We also discussed text preprocessing tools and examples. A comparative table was created.

After the text preprocessing is done, the result may be used for more complicated NLP tasks, for example, machine translation or natural language generation.

Resources:

  1. http://www.nltk.org/index.html
  2. http://textblob.readthedocs.io/en/dev/
  3. https://spacy.io/usage/facts-figures
  4. https://radimrehurek.com/gensim/index.html
  5. https://opennlp.apache.org/
  6. http://opennmt.net/
  7. https://gate.ac.uk/
  8. https://uima.apache.org/
  9. https://www.clips.uantwerpen.be/pages/MBSP#tokenizer
  10. https://rapidminer.com/
  11. http://mallet.cs.umass.edu/
  12. https://www.clips.uantwerpen.be/pages/pattern
  13. https://nlp.stanford.edu/software/tokenizer.html#About
  14. https://tartarus.org/martin/PorterStemmer/
  15. http://www.nltk.org/api/nltk.stem.html
  16. https://snowballstem.org/
  17. https://pypi.python.org/pypi/PyStemmer/1.0.1
  18. https://www.elastic.co/guide/en/elasticsearch/guide/current/hunspell.html
  19. https://lucene.apache.org/core/
  20. https://dkpro.github.io/dkpro-core/
  21. http://ucrel.lancs.ac.uk/claws/
  22. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
  23. https://en.wikipedia.org/wiki/Shallow_parsing
  24. https://cogcomp.org/page/software_view/Chunker
  25. https://github.com/dstl/baleen
  26. https://github.com/CogComp/cogcomp-nlp/tree/master/ner
  27. https://github.com/lasigeBioTM/MER
  28. https://blog.paralleldots.com/product/dig-relevant-text-elements-entity-extraction-api/
  29. http://www.opencalais.com/about-open-calais/
  30. http://alias-i.com/lingpipe/index.html
  31. https://github.com/glample/tagger
  32. http://minorthird.sourceforge.net/old/doc/
  33. https://www.ibm.com/support/knowledgecenter/en/SS8NLW_10.0.0/com.ibm.watson.wex.aac.doc/aac-tasystemt.html
  34. https://www.poolparty.biz/
  35. https://www.basistech.com/text-analytics/rosette/entity-extractor/
  36. http://www.bart-coref.org/index.html
  37. https://wing.comp.nus.edu.sg/~qiu/NLPTools/JavaRAP.html
  38. http://cswww.essex.ac.uk/Research/nle/GuiTAR/
  39. https://www.cs.utah.edu/nlp/reconcile/
  40. https://github.com/brendano/arkref
  41. https://cogcomp.org/page/software_view/Coref
  42. https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30
  43. https://github.com/smartschat/cort
  44. http://www.hlt.utdallas.edu/~altaf/cherrypicker/
  45. http://nlp.lsi.upc.edu/freeling/
  46. https://corpling.uis.georgetown.edu/xrenner/#
  47. http://takelab.fer.hr/termex_s/
  48. https://www.athel.com/colloc.html
  49. http://linghub.lider-project.eu/metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75
  50. http://ws.racai.ro:9191/narratives/batch2/Colloc.pdf
  51. http://www.aclweb.org/anthology/E17-3027
  52. https://metacpan.org/pod/Text::NSP
  53. https://github.com/knowitall/reverb
  54. https://github.com/U-Alberta/exemplar
  55. https://github.com/aoldoni/tetre
  56. https://www.textrazor.com/technology
  57. https://github.com/machinalis/iepy
  58. https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/#relations
  59. https://github.com/mit-nlp/MITIE