Text Preprocessing in Python: Steps, Tools, and Examples

by Olga Davydova, Data Monsters

In this paper, we will talk about the basic steps of text preprocessing. These steps are needed for transferring text from human language to machine-readable format for further processing. We will also discuss text preprocessing tools.

After a text is obtained, we start with text normalization. Text normalization includes:

  • converting all letters to lower or upper case
  • converting numbers into words or removing numbers
  • removing punctuations, accent marks and other diacritics
  • removing white spaces
  • expanding abbreviations
  • removing stop words, sparse terms, and particular words
  • text canonicalization

We will describe text normalization steps in detail below.

Convert text to lowercase

Example 1. Convert text to lowercase

Python code:

input_str = ”The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil.”
input_str = input_str.lower()
print(input_str)

Output:

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.

Remove numbers

Remove numbers if they are not relevant to your analyses. Usually, regular expressions are used to remove numbers.

Example 2. Numbers removing

Python code:

import re
input_str = ’Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.’
result = re.sub(r’\d+’, ‘’, input_str)
print(result)

Output:

Box A contains red and white balls, while Box B contains red and blue balls.

Remove punctuation

The following code removes this set of symbols [!”#$%&’()*+,-./:;<=>?@[\]^_`{|}~]:

Example 3. Punctuation removal

Python code:

import string
input_str = “This &is [an] example? {of} string. with.? punctuation!!!!” # Sample string
result = input_str.translate(string.maketrans(“”,””), string.punctuation)
print(result)

Output:

This is an example of string with punctuation

Remove whitespaces

To remove leading and ending spaces, you can use the strip() function:

Example 4. White spaces removal

Python code:

input_str = “ \t a string example\t “
input_str = input_str.strip()
input_str

Output:

‘a string example’

Tokenization

Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and others can be considered as tokens. In this table (“Tokenization” sheet) several tools for implementing tokenization are described.

Tokenization tools

Remove stop words

“Stop words” are the most common words in a language like “the”, “a”, “on”, “is”, “all”. These words do not carry important meaning and are usually removed from texts. It is possible to remove stop words using Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing.

Example 7. Stop words removal

Code:

input_str = “NLTK is a leading platform for building Python programs to work with human language data.”
stop_words = set(stopwords.words(‘english’))
from nltk.tokenize import word_tokenize
tokens = word_tokenize(input_str)
result = [i for i in tokens if not i in stop_words]
print (result)

Output:

[‘NLTK’, ‘leading’, ‘platform’, ‘building’, ‘Python’, ‘programs’, ‘work’, ‘human’, ‘language’, ‘data’, ‘.’]

A scikit-learn tool also provides a stop words list:

from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS

It’s also possible to use spaCy, a free open-source library:

from spacy.lang.en.stop_words import STOP_WORDS

Remove sparse terms and particular words

In some cases, it’s necessary to remove sparse terms or particular words from texts. This task can be done using stop words removal techniques considering that any group of words can be chosen as the stop words.

Stemming

Stemming is a process of reducing words to their word stem, base or root form (for example, books — book, looked — look). The main two algorithms are Porter stemming algorithm (removes common morphological and inflexional endings from words [14]) and Lancaster stemming algorithm (a more aggressive stemming algorithm). In the “Stemming” sheet of the table some stemmers are described.

Stemming tools

Example 8. Stemming using NLTK:

Code:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer= PorterStemmer()
input_str=”There are several types of stemming algorithms.”
input_str=word_tokenize(input_str)
for word in input_str:
print(stemmer.stem(word))

Output:

There are sever type of stem algorithm.

Lemmatization

The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As opposed to stemming, lemmatization does not simply chop off inflections. Instead it uses lexical knowledge bases to get the correct base forms of words.

Lemmatization tools are presented libraries described above: NLTK (WordNet Lemmatizer), spaCy, TextBlob, Pattern, gensim, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), Illinois Lemmatizer, and DKPro Core.

Example 9. Lemmatization using NLTK:

Code:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer=WordNetLemmatizer()
input_str=”been had done languages cities mice”
input_str=word_tokenize(input_str)
for word in input_str:
print(lemmatizer.lemmatize(word))

Output:

be have do language city mouse

Part of speech tagging (POS)

Part-of-speech tagging aims to assign parts of speech to each word of a given text (such as nouns, verbs, adjectives, and others) based on its definition and its context. There are many tools containing POS taggers including NLTK, spaCy, TextBlob, Pattern, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), FreeLing, Illinois Part of Speech Tagger, and DKPro Core.

Example 10. Part-of-speech tagging using TextBlob:

Code:

input_str=”Parts of speech examples: an article, to write, interesting, easily, and, of”
from textblob import TextBlob
result = TextBlob(input_str)
print(result.tags)

Output:

[(‘Parts’, u’NNS’), (‘of’, u’IN’), (‘speech’, u’NN’), (‘examples’, u’NNS’), (‘an’, u’DT’), (‘article’, u’NN’), (‘to’, u’TO’), (‘write’, u’VB’), (‘interesting’, u’VBG’), (‘easily’, u’RB’), (‘and’, u’CC’), (‘of’, u’IN’)]

Chunking (shallow parsing)

Chunking is a natural language process that identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and links them to higher order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.) [23]. Chunking tools: NLTK, TreeTagger chunker, Apache OpenNLP, General Architecture for Text Engineering (GATE), FreeLing.

Example 11. Chunking using NLTK:

The first step is to determine the part of speech for each word:

Code:

input_str=”A black television and a white stove were bought for the new apartment of John.”
from textblob import TextBlob
result = TextBlob(input_str)
print(result.tags)

Output:

[(‘A’, u’DT’), (‘black’, u’JJ’), (‘television’, u’NN’), (‘and’, u’CC’), (‘a’, u’DT’), (‘white’, u’JJ’), (‘stove’, u’NN’), (‘were’, u’VBD’), (‘bought’, u’VBN’), (‘for’, u’IN’), (‘the’, u’DT’), (‘new’, u’JJ’), (‘apartment’, u’NN’), (‘of’, u’IN’), (‘John’, u’NNP’)]

The second step is chunking:

Code:

reg_exp = “NP: {<DT>?<JJ>*<NN>}”
rp = nltk.RegexpParser(reg_exp)
result = rp.parse(result.tags)
print(result)

Output:

(S (NP A/DT black/JJ television/NN) and/CC (NP a/DT white/JJ stove/NN) were/VBD bought/VBN for/IN (NP the/DT new/JJ apartment/NN)
of/IN John/NNP)

It’s also possible to draw the sentence tree structure using code result.draw()

Named entity recognition

Named-entity recognition (NER) aims to find named entities in text and classify them into pre-defined categories (names of persons, locations, organizations, times, etc.).

Named-entity recognition tools: NLTK, spaCy, General Architecture for Text Engineering (GATE) — ANNIE, Apache OpenNLP, Stanford CoreNLP, DKPro Core, MITIE, Watson Natural Language Understanding, TextRazor, FreeLing are described in the “NER” sheet of the table.

NER Tools

Example 12. Named-entity recognition using NLTK:

Code:

from nltk import word_tokenize, pos_tag, ne_chunk
input_str = “Bill works for Apple so he went to Boston for a conference.”
print ne_chunk(pos_tag(word_tokenize(input_str)))

Output:

(S (PERSON Bill/NNP) works/VBZ for/IN Apple/NNP so/IN he/PRP went/VBD to/TO (GPE Boston/NNP) for/IN a/DT conference/NN ./.)

Coreference resolution (anaphora resolution)

Pronouns and other referring expressions should be connected to the right individuals. Coreference resolution finds the mentions in a text that refer to the same real-world entity. For example, in the sentence, “Andrew said he would buy a car” the pronoun “he” refers to the same person, namely to “Andrew”. Coreference resolution tools: Stanford CoreNLP, spaCy, Open Calais, Apache OpenNLP are described in the “Coreference resolution” sheet of the table.

Coreference resolution tools

An example of coreference resolution using xrenner can be found here.

Collocation extraction

Collocations are word combinations occurring together more often than would be expected by chance. Collocation examples are “break the rules,” “free time,” “draw a conclusion,” “keep in mind,” “get ready,” and so on.

Collocation extraction tools

Example 13. Collocation extraction using ICE [51]

Code:

input=[“he and Chazz duel with all keys on the line.”]
from ICE import CollocationExtractor
extractor = CollocationExtractor.with_collocation_pipeline(“T1” , bing_key = “Temp”,pos_check = False)
print(extractor.get_collocations_of_length(input, length = 3))

Output:

[“on the line”]

Relationship extraction

Relationship extraction allows obtaining structured information from unstructured sources such as raw text. Strictly stated, it is identifying relations (e.g., acquisition, spouse, employment) among named entities (e.g., people, organizations, locations). For example, from the sentence “Mark and Emily married yesterday,” we can extract the information that Mark is Emily’s husband.

An example of relationship extraction using NLTK can be found here.

Summary

In this post, we talked about text preprocessing and described its main steps including normalization, tokenization, stemming, lemmatization, chunking, part of speech tagging, named-entity recognition, coreference resolution, collocation extraction, and relationship extraction. We also discussed text preprocessing tools and examples. A comparative table was created.

After the text preprocessing is done, the result may be used for more complicated NLP tasks, for example, machine translation or natural language generation.

Resources:

  1. http://www.nltk.org/index.html
  2. http://textblob.readthedocs.io/en/dev/
  3. https://spacy.io/usage/facts-figures
  4. https://radimrehurek.com/gensim/index.html
  5. https://opennlp.apache.org/
  6. http://opennmt.net/
  7. https://gate.ac.uk/
  8. https://uima.apache.org/
  9. https://www.clips.uantwerpen.be/pages/MBSP#tokenizer
  10. https://rapidminer.com/
  11. http://mallet.cs.umass.edu/
  12. https://www.clips.uantwerpen.be/pages/pattern
  13. https://nlp.stanford.edu/software/tokenizer.html#About
  14. https://tartarus.org/martin/PorterStemmer/
  15. http://www.nltk.org/api/nltk.stem.html
  16. https://snowballstem.org/
  17. https://pypi.python.org/pypi/PyStemmer/1.0.1
  18. https://www.elastic.co/guide/en/elasticsearch/guide/current/hunspell.html
  19. https://lucene.apache.org/core/
  20. https://dkpro.github.io/dkpro-core/
  21. http://ucrel.lancs.ac.uk/claws/
  22. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
  23. https://en.wikipedia.org/wiki/Shallow_parsing
  24. https://cogcomp.org/page/software_view/Chunker
  25. https://github.com/dstl/baleen
  26. https://github.com/CogComp/cogcomp-nlp/tree/master/ner
  27. https://github.com/lasigeBioTM/MER
  28. https://blog.paralleldots.com/product/dig-relevant-text-elements-entity-extraction-api/
  29. http://www.opencalais.com/about-open-calais/
  30. http://alias-i.com/lingpipe/index.html
  31. https://github.com/glample/tagger
  32. http://minorthird.sourceforge.net/old/doc/
  33. https://www.ibm.com/support/knowledgecenter/en/SS8NLW_10.0.0/com.ibm.watson.wex.aac.doc/aac-tasystemt.html
  34. https://www.poolparty.biz/
  35. https://www.basistech.com/text-analytics/rosette/entity-extractor/
  36. http://www.bart-coref.org/index.html
  37. https://wing.comp.nus.edu.sg/~qiu/NLPTools/JavaRAP.html
  38. http://cswww.essex.ac.uk/Research/nle/GuiTAR/
  39. https://www.cs.utah.edu/nlp/reconcile/
  40. https://github.com/brendano/arkref
  41. https://cogcomp.org/page/software_view/Coref
  42. https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30
  43. https://github.com/smartschat/cort
  44. http://www.hlt.utdallas.edu/~altaf/cherrypicker/
  45. http://nlp.lsi.upc.edu/freeling/
  46. https://corpling.uis.georgetown.edu/xrenner/#
  47. http://takelab.fer.hr/termex_s/
  48. https://www.athel.com/colloc.html
  49. http://linghub.lider-project.eu/metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75
  50. http://ws.racai.ro:9191/narratives/batch2/Colloc.pdf
  51. http://www.aclweb.org/anthology/E17-3027
  52. https://metacpan.org/pod/Text::NSP
  53. https://github.com/knowitall/reverb
  54. https://github.com/U-Alberta/exemplar
  55. https://github.com/aoldoni/tetre
  56. https://www.textrazor.com/technology
  57. https://github.com/machinalis/iepy
  58. https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/#relations
  59. https://github.com/mit-nlp/MITIE