Text cleaning and text preprocessing

Amirhossein Abaskohi
7 min read · Nov 27, 2021

In this article, I will introduce you to various text preprocessing techniques, which make up one of the most important stages of an NLP project. Here we will only cover tools that are useful for English; for other languages the tools differ, but most languages still require the same steps introduced below.

Text normalization includes:

  • Converting all letters to uppercase or lowercase
  • Converting numbers into words or removing numbers
  • Removing punctuation, accent marks, …
  • Expanding abbreviations
  • Removing stop words, sparse terms and particular words
  • Text canonicalization
  • Stemming
  • Tokenization
  • Lemmatization

Basic text normalization

Text converting to lowercase

This part can be easily done using basic string methods of Python. Here is an example of that:

inp = "The Eiffel tower is in Paris."
inp = inp.lower()
print(inp) # Will print "the eiffel tower is in paris."

This kind of manipulation may seem unnecessary, but in some NLP tasks you need to prevent your model from being sensitive to letter case.

Removing numbers

Like the previous step, this is not always necessary; whether to take it depends on your task.

Remove any numbers that aren’t related to your research. Regular expressions are commonly used to eliminate numerals.

import re
inp = 'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'
result = re.sub(r'\d+', '', inp)
print(result) # Will print "Box A contains  red and  white balls, while Box B contains  red and  blue balls." (the spaces around the removed digits remain)
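The checklist at the top also mentions converting numbers into words instead of dropping them. Here is a minimal sketch of that idea, assuming the third-party num2words package is installed (it is not used anywhere else in this article):

import re
from num2words import num2words  # third-party package, assumed installed: pip install num2words

inp = 'Box A contains 3 red and 5 white balls.'
# Replace each run of digits with its spelled-out form
result = re.sub(r'\d+', lambda m: num2words(int(m.group())), inp)
print(result) # Will print "Box A contains three red and five white balls."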

Removing punctuation

Removing some punctuation may hurt your model in certain tasks, although in others it can be useful.

This set of symbols (Python's string.punctuation) is removed using the following code: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

import string
inp = "This &is [an] example? {of} string. with.? punctuation!!!!"
result = inp.translate(str.maketrans('', '', string.punctuation))
print(result) # Will print "This is an example of string with punctuation"
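If you prefer regular expressions, an equivalent sketch removes everything that is not a word character or whitespace (note this is not exactly the same set of characters as string.punctuation):

import re

inp = "This &is [an] example? {of} string. with.? punctuation!!!!"
result = re.sub(r'[^\w\s]', '', inp)
print(result) # Will print "This is an example of string with punctuation"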

Tokenization

It is the process of breaking down a large piece of text into smaller pieces, such as sentences and words. Tokens are the smallest components: a word is a token in a sentence, for example, and a sentence is a token in a paragraph. Because NLP is used to create applications such as sentiment analysis, question-answering systems, language translation, smart chatbots, and voice systems, it is critical to comprehend the patterns in the text in order to build them. The above-mentioned tokens are quite helpful in identifying and comprehending these patterns. Tokenization may be thought of as the first step for other operations like stemming and lemmatization.
Tokens include words, numbers, punctuation marks, and other symbols. There are different tools in Python for tokenization. Some of them are listed below:

  • NLTK
  • TextBlob
  • Stanza

Tokenization using NLTK

import nltk
from nltk.tokenize import word_tokenize
tokens = word_tokenize('Medium provides high quality technical papers.')
print(tokens)
# Output
['Medium', 'provides', 'high', 'quality', 'technical', 'papers', '.']
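NLTK can also split text into sentences instead of words, which is useful when a sentence is the token you care about:

from nltk.tokenize import sent_tokenize

text = 'Medium provides high quality technical papers. It also hosts many blogs.'
print(sent_tokenize(text))
# Output
['Medium provides high quality technical papers.', 'It also hosts many blogs.']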

Tokenization using TextBlob

from textblob import TextBlob
text = ("Medium is the best blog.")
tb = TextBlob(text)
words = tb.words
print(words)
# Output is
['Medium', 'is', 'the', 'best', 'blog']

Tokenization using Stanza

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp('This is a test sentence for stanza. This is another sentence.')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')
# Output
====== Sentence 1 tokens =======
id: 1 text: This
id: 2 text: is
id: 3 text: a
id: 4 text: test
id: 5 text: sentence
id: 6 text: for
id: 7 text: stanza
id: 8 text: .
====== Sentence 2 tokens =======
id: 1 text: This
id: 2 text: is
id: 3 text: another
id: 4 text: sentence
id: 5 text: .

From here on I will just use NLTK, as it is the best-known Python NLP tool. But you can easily find the corresponding information for TextBlob on the internet.

For Stanza, I suggest reading their official documentation.

Removing stop words

What are stop words?

The most prevalent words in any natural language are stopwords. These stopwords may not contribute much value to the meaning of the document when evaluating text data and constructing NLP models.

Generally, the most common words used in a text are “the”, “is”, “in”, “for”, “where”, “when”, “to”, “at” etc.

Here is a list of English stop words you might find helpful:

a about after all also always am an and any are at be been being but by came can cant come could did didn't do does doesn't doing don't else for from get give goes going had happen has have having how i if ill i'm in into is isn't it its i've just keep let like made make 
many may me mean more most much no not now of only or our really say see some something take tell than that the their them then they thing this to try up us use used uses very want was way we what when where which who why will with without wont you your youre

Why do we Need to Remove StopWords?

In NLP, removing stopwords isn’t a hard and fast rule. It all depends on the project we’re working on. Stopwords are eliminated or omitted from provided texts for activities like text classification, where the text is to be divided into distinct groups, so that greater attention may be given to those words that determine the text’s meaning.

Stopwords should not be removed in tasks like machine translation and text summarization.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_sent = """This is a sample sentence,showing off the stop words filtration."""stop_words = set(stopwords.words('english'))word_tokens = word_tokenize(example_sent)filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]filtered_sentence = []for w in word_tokens:
if w not in stop_words:
filtered_sentence.append(w)
print(word_tokens)print(filtered_sentence)# Output['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop',
'words', 'filtration', '.']
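The default list is not set in stone; depending on your task you may want to keep negations or add corpus-specific words. A small sketch of customizing the set (the added words are only illustrative):

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words.discard('not')          # keep negations, e.g. for sentiment analysis
stop_words.update(['etc', 'via'])  # add words that are noise in your particular corpus
print(len(stop_words))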

Stemming

Stemming is a technique for eliminating affixes from words in order to retrieve the basic form. It’s the same as pruning a tree’s branches down to the trunk. The stem of the terms eating, eats, and eaten, for example, is eat.

For indexing words, search engines employ stemming. As a result, instead of saving all versions of a word, a search engine may simply save the stems. Stemming minimizes the size of the index and improves retrieval accuracy in this way.

In NLTK, the StemmerI interface, which defines the stem() method, is implemented by all the stemmers we are going to cover next. Let us understand it with the following diagram.

StemmerI in NLTK

The Porter stemming algorithm is one of the most common stemming algorithms; it is basically designed to remove and replace well-known suffixes of English words.

import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
word_stemmer.stem('writing')
# Output
'write'

Lancaster stemming algorithm was developed at Lancaster University and it is another very common stemming algorithms.

import nltk
from nltk.stem import LancasterStemmer
Lanc_stemmer = LancasterStemmer()
Lanc_stemmer.stem('eats')
# Output
'eat'

Using NLTK and regular expressions, you can make your own stemmer:

import nltk
from nltk.stem import RegexpStemmer
Reg_stemmer = RegexpStemmer('ing')
Reg_stemmer.stem('ingeat')
# Output
'eat'
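In practice you usually stem every token of an already tokenized text rather than a single word. A short sketch combining word_tokenize with the Porter stemmer:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
tokens = word_tokenize('Connected connections need connecting.')
print([stemmer.stem(t) for t in tokens])
# Output
['connect', 'connect', 'need', 'connect', '.']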

Lemmatization

The lemmatization process is similar to stemming. The result of lemmatization is called a ‘lemma,’ which is a root word rather than a root stem, which is the result of stemming. We will receive a legitimate term that signifies the same thing after lemmatization.

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('books')
# Output
'book'
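By default WordNetLemmatizer treats every word as a noun, so passing the part of speech usually gives better results. For example, a verb:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running'))           # running (treated as a noun)
print(lemmatizer.lemmatize('running', pos='v'))  # run (treated as a verb)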

Difference between Lemmatization and Stemming

In basic terms, the stemming approach just considers the word’s form, but the lemmatization process considers the word’s meaning. It indicates that we will always receive a valid word after performing lemmatization.

Both stemming and lemmatization have the purpose of reducing a word’s inflectional and occasionally derivationally related forms to a single base form.

The flavor of the two terms, though, is distinct. Stemming is a heuristic procedure that chops off the ends of words in the hope of getting it right most of the time, and it frequently includes the removal of derivational affixes. Lemmatization typically refers to doing things properly using a vocabulary and morphological analysis of words, with the goal of removing only inflectional endings and returning the base or dictionary form of a word, known as the lemma.
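Here is a concrete illustration of the difference, using NLTK's Porter stemmer and WordNet lemmatizer from above: the stemmer can produce a truncated, non-dictionary form, while the lemmatizer returns a real word.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('studies'))          # studi - not a valid English word
print(lemmatizer.lemmatize('studies'))  # study - the dictionary form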

Part of Speech Tagging

Part-of-speech tagging seeks to assign parts of speech to each word (such as nouns, verbs, adjectives, and others) in a given text based on its meaning and context.

from textblob import TextBlob
input_str = "Parts of speech examples: an article, to write, interesting, easily, and, of"
result = TextBlob(input_str)
print(result.tags)
# Output
[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]
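Since the rest of this article leans on NLTK, here is the equivalent with nltk.pos_tag; it needs the averaged_perceptron_tagger model downloaded once:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('averaged_perceptron_tagger')  # one-time download of the tagger model
tokens = word_tokenize('Parts of speech examples: an article, to write, interesting, easily, and, of')
print(nltk.pos_tag(tokens))  # list of (word, tag) tuples, similar to the TextBlob output above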

Chunking

Chunking is a natural language process that detects and relates sentence constituent pieces (nouns, verbs, adjectives, and so on) to higher order units with discrete grammatical meanings (noun groups or phrases, verb groups, etc.)

from textblob import TextBlob
input_str = "A black television and a white stove were bought for the new apartment of John."
result = TextBlob(input_str)
print(result.tags)
# Output
[('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]

Now chunking:

import nltk

reg_exp = "NP: {<DT>?<JJ>*<NN>}"
rp = nltk.RegexpParser(reg_exp)
result = rp.parse(result.tags)
print(result)
# Output
(S (NP A/DT black/JJ television/NN) and/CC (NP a/DT white/JJ stove/NN) were/VBD bought/VBN for/IN (NP the/DT new/JJ apartment/NN)
of/IN John/NNP)

Also this is the sentence tree:

Sentence tree
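The tree image from the original post can be reproduced locally; NLTK's Tree objects have a built-in viewer:

result.draw()  # opens a window showing the parse tree produced by the chunker above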

Summary

We discussed text preprocessing in this post, covering normalization (lowercasing, removing numbers and punctuation), tokenization, stop-word removal, stemming, lemmatization, part-of-speech tagging, and chunking, as well as the basic procedures involved. Text preprocessing techniques and code examples were also presented.

After the text has been preprocessed, it may be utilized for more advanced NLP activities such as machine translation or natural language synthesis.

Thanks for reading my article.
