Pre-Processing of Text Data

Arnab Chakraborty
Published in AI Skunks
8 min read · Mar 27, 2023

Why is Text Processing Required?

Text preprocessing is a crucial step in natural language processing (NLP) that involves transforming raw text data into a structured format suitable for analysis.

Text data frequently contains several types of noise, including punctuation, emojis and emoticons, and inconsistent casing. These noise elements may adversely impact the performance of NLP models. To address this, text preprocessing techniques are employed to remove the noise and produce a clean dataset.

Machines cannot understand words directly; they work with numbers, so we ultimately need to convert text into numbers in an efficient manner.

Dataset used

Let’s use a very common example: tagging spam emails by analyzing the email subject and body.
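The snippet below is a minimal sketch of loading such a dataset into a pandas DataFrame. The file name spam_emails.csv and the column names label and text are placeholders, not part of any specific corpus, so adjust them to whatever spam dataset you use.

import pandas as pd

# Hypothetical spam dataset with a 'label' column (spam/ham) and a 'text'
# column holding the subject and body; adjust the path and columns to your data
df = pd.read_csv('spam_emails.csv')
df = df[['label', 'text']].dropna()
df.head()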

Steps in pre-processing of Text Data

Expand Contractions

Contracted words are a common feature of natural language, especially in informal settings such as social media or messaging platforms.

Contractions are shortened versions of words or phrases that are formed by combining two words and replacing one or more letters with an apostrophe. Examples of contractions include:

  • “can’t” (from “cannot”)
  • “won’t” (from “will not”)
  • “it’s” (from “it is” or “it has”)
  • “shouldn’t” (from “should not”)
  • “didn’t” (from “did not”)
  • “you’ll” (from “you will”)

Expanding contractions back into their full forms helps with language understanding; we will use the contractions library for this.

# Installing Contractions
!pip install contractions
import contractions
df['contractions'] = df['text'].apply(lambda x: [contractions.fix(word) for word in x.split()])
df['no_contractions'] = [' '.join(map(str, l)) for l in df['contractions']]
df.drop('contractions',axis=1,inplace=True)
df.head()

Tokenization

Tokenization is the process of breaking down text into individual words, phrases, or other meaningful elements, called tokens.

Handling Collocations

Collocations are combinations of words that frequently appear together in a language, forming a predictable pattern. They can consist of two or more words, and their meaning often cannot be deduced from the individual words alone. Collocations are essential in understanding natural language, as they reveal information about the linguistic habits and patterns of speakers.

  1. Identify collocations: The first step is to identify collocations in your text data. We can use statistical measures like Pointwise Mutual Information (PMI), frequency count, or likelihood ratio to find word pairs that co-occur more frequently than expected by chance (a short sketch follows this list).
  2. Use word embeddings: Word embeddings like Word2Vec, GloVe, or FastText capture semantic relationships between words, including collocations.
  3. Use context-aware models: Modern NLP models like BERT, GPT, or RoBERTa are capable of capturing collocations implicitly as they learn the contextual relationships between words during the pre-training phase.
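As a quick illustration of the first option, the sketch below scores bigrams by PMI using NLTK’s collocation finder. It assumes the no_contractions column created above; the frequency cutoff of 3 is an arbitrary choice you may want to tune.

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
nltk.download('punkt')

# Pool all tokens from the corpus and score bigrams by Pointwise Mutual Information
tokens = [token for text in df['no_contractions'] for token in nltk.word_tokenize(text.lower())]
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # ignore rare pairs, which otherwise dominate PMI
print(finder.nbest(bigram_measures.pmi, 10))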

Let’s use NLTK’s word_tokenize() function to create a new column named “tokenized”.

import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
df['tokenized'] = df['no_contractions'].apply(word_tokenize)
df.head()

Handling Cases (Upper/Lower)

All the alphabetic characters in a text can be transformed to their corresponding lowercase representation to reduce the vocabulary size and avoid duplication of words during text analysis.

df['lower'] = df['tokenized'].apply(lambda x: [text.lower() for text in x])
df.head()

However, there are some drawbacks to this approach:

  1. Loss of proper nouns: Converting all text to lowercase can cause a loss of distinction between proper nouns (e.g., names of people, places, organizations) and common nouns. This can lead to confusion and errors in tasks like named entity recognition, where identifying proper nouns is essential.
  2. Loss of emphasis: In written text, uppercase letters are sometimes used to convey emphasis or strong emotions (e.g., “I am REALLY excited!”).
  3. Loss of case-based disambiguation: Some words have different meanings when capitalized, such as “Turkey” (the country) and “turkey” (the bird).
  4. Loss of acronym information: Converting text to lowercase can cause issues with acronyms, which are often written in all capital letters (e.g., “NASA” or “UNESCO”).

Context-based disambiguation

Context-aware models like BERT, GPT, or RoBERTa can implicitly capture case-based disambiguation. These models can better understand the meaning of words in context, even when the capitalization is altered.

Let’s use BERT to tokenize the text; since we load the uncased model, the tokenizer will also convert the text to lowercase.

!pip install transformers
from transformers import BertTokenizer

# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize text using BERT tokenizer
df['tokenized2'] = df['no_contractions'].apply(lambda x: tokenizer.tokenize(x))
df.head()

Treat words containing digits

Words that contain numeric characters are often eliminated from text analysis to reduce noise and improve the accuracy of language models.

We can eliminate these words using a regular expression.

import re
df['no_num'] = df['lower'].apply(lambda x: [re.sub(r'\w*\d\w*','',text) for text in x])
df.head()
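One small caveat with the substitution above: re.sub leaves an empty string behind for each removed word. If those empty tokens get in the way of later steps, a follow-up filter like the one-liner below (an optional addition, not part of the original pipeline) drops them.

# Drop the empty strings left behind by the digit-word substitution
df['no_num'] = df['no_num'].apply(lambda x: [token for token in x if token])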

Eliminating words with digits might result in losing important context or numerical data that could be vital for understanding the text or for a specific NLP task:

  1. Impact on named entities: Removing words with digits might affect the recognition of named entities such as dates, product codes, phone numbers, or addresses, which could be important for tasks like information extraction or entity recognition.
  2. Abbreviations and acronyms: Words containing digits can be abbreviations or acronyms that carry significant meaning
  3. Alphanumeric codes: In some contexts, alphanumeric codes are important identifiers (e.g., stock symbols, course codes, or gene/protein names in scientific literature). Removing them could lead to a loss of context and impair the analysis.
  4. Reduced performance for specific tasks: For certain NLP tasks like password strength analysis, fraud detection, or code analysis, words containing digits are essential.

Handling Punctuations

BERT uses the WordPiece tokenization algorithm, which breaks down text into subword units, including punctuation marks. This tokenization process allows BERT to effectively capture the meaning and context provided by punctuation in the text.

Since BERT is bidirectional, it can also understand the relationships between words and punctuation marks in context, which can be beneficial for tasks such as sentiment analysis, question-answering, and named entity recognition, among others.
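As a small illustration, reusing the bert-base-uncased tokenizer loaded above, punctuation marks come out as their own tokens rather than being dropped:

# Punctuation marks survive WordPiece tokenization as separate tokens
print(tokenizer.tokenize("Wait, is this really free?!"))
# roughly: ['wait', ',', 'is', 'this', 'really', 'free', '?', '!']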

Punctuation can also be removed to simplify the analysis, and reduce the vocabulary size while preserving the meaningful content of the text.

We can use the punctuation constant from Python’s built-in string module.

import string
punc = string.punctuation
df['no_punc'] = df['no_num'].apply(lambda x: [text for text in x if text not in punc])
df.head()

Remove Stopwords

Stopword removal is the process of eliminating common words such as “the”, “a”, “an”, and “in” from text to reduce the dimensionality of the data and to focus on the more meaningful words that carry the essence of the text.

We will use the stopwords corpus from the NLTK module, as shown in the sketch below.
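A minimal sketch, assuming the no_punc column from the previous step; it produces the stopwords_removed column that the lemmatization step below relies on.

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Keep only tokens that are not in NLTK's English stopword list
stop_words = set(stopwords.words('english'))
df['stopwords_removed'] = df['no_punc'].apply(lambda x: [word for word in x if word not in stop_words])
df.head()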

Stemming or Lemmatization

Stemming and lemmatization are two techniques used in NLP to normalize words by reducing them to their base or root form; stemming chops off the end of words, while lemmatization uses a vocabulary and morphological analysis to reduce words to their canonical form.

  • Stemming: The stem of “running” is “run”. Using a stemming algorithm, “running”, “runs”, and “runner” would all be reduced to the stem “run”.
  • Lemmatization: The lemma of “running” is “run”. Using a lemmatization algorithm with part-of-speech information, “running” and “runs” would be reduced to “run”, while “runner” would be left as “runner”, since it is a distinct noun rather than an inflected form of “run”.

# Import the necessary modules
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

# Download the WordNet data if not already downloaded
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Create instances of the PorterStemmer and WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define the input text
text = "The geese are flying, and the children are running happily."

# Tokenize the text into words
words = word_tokenize(text)

# Function to convert POS tag to WordNet POS tag format
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Perform stemming and lemmatization on the words
stemmed_words = [stemmer.stem(word) for word in words]
lemmatized_words = [lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag)) for word, tag in nltk.pos_tag(words)]

# Print the results
print("Original text:")
print(" ".join(words))
print("\nStemmed text:")
print(" ".join(stemmed_words))
print("\nLemmatized text:")
print(" ".join(lemmatized_words))

Original text:
The geese are flying , and the children are running happily .

Stemmed text:
the gees are fli , and the children are run happili .

Lemmatized text:
The goose be fly , and the child be run happily .

Let’s use NLTK’s WordNet lemmatizer, which needs the part-of-speech tags to be converted to WordNet’s format. First, we apply part-of-speech tagging, in other words, determine the part of speech (i.e., noun, verb, adverb, etc.) for each word.

nltk.download('averaged_perceptron_tagger')
df['pos_tags'] = df['stopwords_removed'].apply(nltk.tag.pos_tag)
df.head()

nltk.download('wordnet')
nltk.download('omw-1.4')
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
df['wordnet_pos'] = df['pos_tags'].apply(lambda x: [(word, get_wordnet_pos(pos_tag)) for (word, pos_tag) in x])
df.head()

We can apply NLTK’s word lemmatizer within our trusty list comprehension. Notice that the lemmatize function requires two parameters: the word and its tag (in WordNet form).

wnl = WordNetLemmatizer()
df['lemmatized'] = df['wordnet_pos'].apply(lambda x: [wnl.lemmatize(word, tag) for word, tag in x])

Conclusion

Pre-processing of text data is an essential step in natural language processing (NLP) that involves cleaning and transforming raw text data into a format that is suitable for analysis by NLP algorithms. Techniques such as tokenization, converting to lowercase, removing digits and punctuation, and eliminating stopwords can help reduce the dimensionality of the data and improve the accuracy of language models. Stemming and lemmatization can further normalize the text data by reducing words to their base or root form.

Overall, pre-processing plays a crucial role in preparing text data for various NLP tasks such as sentiment analysis, text classification, and language translation.

Arnab Chakraborty
AI Skunks

I’m currently pursuing a Master’s in Information Systems at Northeastern University and have 4 years of experience as a Data Scientist at Quantiphi.