Text Preprocessing in Python: Steps, Tools, and Examples

by Olga Davydova, Data Monsters
Oct 15, 2018

In this article, we will talk about the basic steps of text preprocessing. These steps are needed to transform text from human language into a machine-readable format for further processing. We will also discuss text preprocessing tools.

After a text is obtained, we start with text normalization. Text normalization includes:

  • converting all letters to lower or upper case
  • converting numbers into words or removing numbers
  • removing punctuation, accent marks, and other diacritics
  • removing white spaces
  • expanding abbreviations
  • removing stop words, sparse terms, and particular words
  • text canonicalization

We will describe text normalization steps in detail below.

Convert text to lowercase

Example 1. Converting text to lowercase

Python code:

input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
input_str = input_str.lower()
print(input_str)

Output:

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.

Remove numbers

Example 2. Removing numbers

Python code:

import re
input_str = 'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'
result = re.sub(r'\d+', '', input_str)
print(result)

Output:

Box A contains red and white balls, while Box B contains red and blue balls.

Remove punctuation

Example 3. Punctuation removal

Python code:

import string
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"  # sample string
result = input_str.translate(str.maketrans("", "", string.punctuation))  # Python 3 form of maketrans
print(result)

Output:

This is an example of string with punctuation

Remove white space

Example 4. White space removal

Python code:

input_str = " \t a string example\t "
input_str = input_str.strip()
input_str

Output:

'a string example'

Tokenization

Tokenization is the process of splitting the given text into smaller pieces called tokens: words, numbers, punctuation marks, and so on.

Tokenization tools
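
A minimal sketch of tokenization with NLTK (assuming the punkt tokenizer models have been downloaded via nltk.download('punkt')):

from nltk.tokenize import sent_tokenize, word_tokenize

input_str = "All work and no play makes Jack a dull boy. All work and no play."
print(sent_tokenize(input_str))  # split into sentences
print(word_tokenize(input_str))  # split into word and punctuation tokens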

Remove stop words

Example 7. Stop words removal

Code:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

input_str = "NLTK is a leading platform for building Python programs to work with human language data."
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(input_str)
result = [i for i in tokens if i not in stop_words]
print(result)

Output:

['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']

scikit-learn also provides an English stop-word list:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

It’s also possible to use spaCy, a free open-source library:

from spacy.lang.en.stop_words import STOP_WORDS
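
Either list can be used to filter tokens directly; a minimal sketch with spaCy's list (the token list here is illustrative):

from spacy.lang.en.stop_words import STOP_WORDS

tokens = ["NLTK", "is", "a", "leading", "platform"]
# spaCy's stop words are lowercase, so compare case-insensitively
filtered = [t for t in tokens if t.lower() not in STOP_WORDS]
print(filtered)  # expected: ['NLTK', 'leading', 'platform']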

Remove sparse terms and particular words
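
As a simple sketch, sparse (low-frequency) terms and a hand-picked list of particular words can be filtered out with collections.Counter; the frequency threshold and blacklist below are illustrative:

from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran"]
counts = Counter(tokens)
particular_words = {"mat"}  # illustrative blacklist of particular words
min_freq = 2                # drop terms seen fewer than 2 times
result = [t for t in tokens if counts[t] >= min_freq and t not in particular_words]
print(result)  # ['the', 'cat', 'the', 'the', 'cat']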

Stemming

Stemming tools

Example 8. Stemming using NLTK:

Code:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
input_str = "There are several types of stemming algorithms."
tokens = word_tokenize(input_str)
for word in tokens:
    print(stemmer.stem(word))

Output:

There are sever type of stem algorithm.
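
NLTK also ships the Snowball stemmer [15], which supports several languages besides English; a minimal sketch:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
print(stemmer.stem("stemming"))    # stem
print(stemmer.stem("algorithms"))  # algorithm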

Lemmatization

Lemmatization tools are provided by the libraries described above: NLTK (WordNet Lemmatizer), spaCy, TextBlob, Pattern, gensim, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), Illinois Lemmatizer, and DKPro Core.

Example 9. Lemmatization using NLTK:

Code:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
input_str = "been had done languages cities mice"
tokens = word_tokenize(input_str)
for word in tokens:
    print(lemmatizer.lemmatize(word))

Output:

been had done language city mouse
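
Note that WordNetLemmatizer treats every word as a noun unless told otherwise, which is why the verb forms above are left unchanged; to reduce them to their lemmas, pass the part of speech explicitly:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("been"))           # been (noun is assumed by default)
print(lemmatizer.lemmatize("been", pos="v"))  # be
print(lemmatizer.lemmatize("had", pos="v"))   # have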

Part-of-speech (POS) tagging

Example 10. Part-of-speech tagging using TextBlob:

Code:

from textblob import TextBlob

input_str = "Parts of speech examples: an article, to write, interesting, easily, and, of"
result = TextBlob(input_str)
print(result.tags)

Output:

[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]
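
NLTK provides the same functionality through pos_tag (assuming the averaged perceptron tagger models have been downloaded); a minimal sketch:

import nltk
from nltk.tokenize import word_tokenize

input_str = "Parts of speech examples: an article, to write, interesting, easily, and, of"
print(nltk.pos_tag(word_tokenize(input_str)))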

Chunking (shallow parsing)

Example 11. Chunking using NLTK:

The first step is to determine the part of speech for each word:

Code:

from textblob import TextBlob

input_str = "A black television and a white stove were bought for the new apartment of John."
result = TextBlob(input_str)
print(result.tags)

Output:

[('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]

The second step is chunking:

Code:

import nltk

reg_exp = "NP: {<DT>?<JJ>*<NN>}"
rp = nltk.RegexpParser(reg_exp)
result = rp.parse(result.tags)
print(result)

Output:

(S (NP A/DT black/JJ television/NN) and/CC (NP a/DT white/JJ stove/NN) were/VBD bought/VBN for/IN (NP the/DT new/JJ apartment/NN) of/IN John/NNP)

It's also possible to draw the sentence tree structure by calling result.draw().

Named entity recognition

Named-entity recognition tools include NLTK, spaCy, General Architecture for Text Engineering (GATE) with ANNIE, Apache OpenNLP, Stanford CoreNLP, DKPro Core, MITIE, Watson Natural Language Understanding, TextRazor, and FreeLing.

Example 12. Named-entity recognition using NLTK:

Code:

from nltk import word_tokenize, pos_tag, ne_chunk

input_str = "Bill works for Apple so he went to Boston for a conference."
print(ne_chunk(pos_tag(word_tokenize(input_str))))

Output:

(S (PERSON Bill/NNP) works/VBZ for/IN Apple/NNP so/IN he/PRP went/VBD to/TO (GPE Boston/NNP) for/IN a/DT conference/NN ./.)
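
spaCy's pretrained pipelines also perform named-entity recognition; a minimal sketch, assuming the en_core_web_sm model has been downloaded:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("Bill works for Apple so he went to Boston for a conference.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# expected entities along the lines of: Bill PERSON, Apple ORG, Boston GPE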

Coreference resolution (anaphora resolution)

Coreference resolution tools

An example of coreference resolution using xrenner can be found on the project site [45].

Collocation extraction

Collocation extraction tools

Example 13. Collocation extraction using ICE [51]

Code:

from ICE import CollocationExtractor

input_list = ["he and Chazz duel with all keys on the line."]
extractor = CollocationExtractor.with_collocation_pipeline("T1", bing_key="Temp", pos_check=False)
print(extractor.get_collocations_of_length(input_list, length=3))

Output:

['on the line']
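
If ICE is not available, NLTK's built-in collocation finders are a widely used alternative; a minimal sketch over the Genesis corpus (assuming it has been downloaded):

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = nltk.corpus.genesis.words("english-web.txt")
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # ignore bigrams seen fewer than 3 times
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 10))  # 10 bigrams with the highest PMI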

Relationship extraction

An example of relationship extraction using NLTK is given in the information extraction chapter of the NLTK book.
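
A condensed sketch of that pattern-based approach over the tagged IEER corpus (assuming the ieer corpus has been downloaded):

import re
import nltk

# Find ORG-LOC pairs connected by the word "in" (pattern from the NLTK book)
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))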

Summary

After text preprocessing is done, the result can be used for more complex NLP tasks, for example, machine translation or natural language generation.

Resources:

  1. http://textblob.readthedocs.io/en/dev/
  2. https://spacy.io/usage/facts-figures
  3. https://radimrehurek.com/gensim/index.html
  4. https://opennlp.apache.org/
  5. http://opennmt.net/
  6. https://gate.ac.uk/
  7. https://uima.apache.org/
  8. https://www.clips.uantwerpen.be/pages/MBSP#tokenizer
  9. https://rapidminer.com/
  10. http://mallet.cs.umass.edu/
  11. https://www.clips.uantwerpen.be/pages/pattern
  12. https://nlp.stanford.edu/software/tokenizer.html#About
  13. https://tartarus.org/martin/PorterStemmer/
  14. http://www.nltk.org/api/nltk.stem.html
  15. https://snowballstem.org/
  16. https://pypi.python.org/pypi/PyStemmer/1.0.1
  17. https://www.elastic.co/guide/en/elasticsearch/guide/current/hunspell.html
  18. https://lucene.apache.org/core/
  19. https://dkpro.github.io/dkpro-core/
  20. http://ucrel.lancs.ac.uk/claws/
  21. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
  22. https://en.wikipedia.org/wiki/Shallow_parsing
  23. https://cogcomp.org/page/software_view/Chunker
  24. https://github.com/dstl/baleen
  25. https://github.com/CogComp/cogcomp-nlp/tree/master/ner
  26. https://github.com/lasigeBioTM/MER
  27. https://blog.paralleldots.com/product/dig-relevant-text-elements-entity-extraction-api/
  28. http://www.opencalais.com/about-open-calais/
  29. http://alias-i.com/lingpipe/index.html
  30. https://github.com/glample/tagger
  31. http://minorthird.sourceforge.net/old/doc/
  32. https://www.ibm.com/support/knowledgecenter/en/SS8NLW_10.0.0/com.ibm.watson.wex.aac.doc/aac-tasystemt.html
  33. https://www.poolparty.biz/
  34. https://www.basistech.com/text-analytics/rosette/entity-extractor/
  35. http://www.bart-coref.org/index.html
  36. https://wing.comp.nus.edu.sg/~qiu/NLPTools/JavaRAP.html
  37. http://cswww.essex.ac.uk/Research/nle/GuiTAR/
  38. https://www.cs.utah.edu/nlp/reconcile/
  39. https://github.com/brendano/arkref
  40. https://cogcomp.org/page/software_view/Coref
  41. https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30
  42. https://github.com/smartschat/cort
  43. http://www.hlt.utdallas.edu/~altaf/cherrypicker/
  44. http://nlp.lsi.upc.edu/freeling/
  45. https://corpling.uis.georgetown.edu/xrenner/#
  46. http://takelab.fer.hr/termex_s/
  47. https://www.athel.com/colloc.html
  48. http://linghub.lider-project.eu/metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75
  49. http://ws.racai.ro:9191/narratives/batch2/Colloc.pdf
  50. http://www.aclweb.org/anthology/E17-3027
  51. https://metacpan.org/pod/Text::NSP
  52. https://github.com/knowitall/reverb
  53. https://github.com/U-Alberta/exemplar
  54. https://github.com/aoldoni/tetre
  55. https://www.textrazor.com/technology
  56. https://github.com/machinalis/iepy
  57. https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/#relations
  58. https://github.com/mit-nlp/MITIE
