Text Preprocessing For NLP Part — 1

Sanjithkumar
5 min read · Sep 4, 2023


Photo by Walkator on Unsplash

Natural Language Processing (NLP) is a well-established and widely studied field of Artificial Intelligence. As the name suggests, it involves processing natural language, i.e. language that is spoken, understood, and interpreted by humans. Many NLP problems in Machine Learning, such as sentiment analysis and Neural Machine Translation, involve processing large amounts of textual data, so it is important for developers to understand the underlying steps that prepare textual data before training any kind of model.

Preprocessing textual data for NLP involves a series of steps that have to be followed to get clean data for processing. The exact steps also depend on the type of problem you are solving, so it is important to understand what kind of NLP problem you are dealing with. Here I am going to give a high-level text preprocessing scheme that you can follow for most problems. This will be divided into two parts: in the first part we will cover everything other than tokenization, and in the second part we will cover tokenization itself.

  1. Removing and retaining punctuations or special characters
  2. Removing numerical digits (depends on the type of problem)
  3. Converting to lower case
  4. Vectorize the text

Removing and retaining punctuations or special characters

When it comes to low-level text processing problems, it is advisable to remove all punctuation and special characters (including emojis) for several significant reasons: dimensionality issues, computational efficiency, noise reduction, and generalization, among others.

Dimensionality Reduction: Keeping every punctuation mark and special character as a separate feature can significantly increase the dimensionality of the data, making it computationally expensive and potentially leading to overfitting. By removing them, you reduce the dimensionality of the feature space.

Computational Efficiency: Some NLP algorithms and models, especially those based on neural networks, are computationally more efficient when trained on preprocessed text. Removing punctuations and special characters can help speed up the training and inference processes.

Noise Reduction: Punctuation and special characters often don’t carry significant semantic meaning on their own. Removing them can help reduce the noise in the text and make it easier for NLP models to focus on the meaningful words and phrases.

Generalization: Ignoring punctuations and special characters helps NLP models generalize better. For instance, if you remove the period from the end of a sentence, the model can better learn the relationship between words without being overly influenced by sentence boundaries.

However, it’s important to note that there are cases where punctuations and special characters might convey valuable information, such as in sentiment analysis (e.g., “I love it” vs. “I love it!”, in the second case the speaker seems to be more excited). In such cases, you may choose to retain certain punctuation marks or handle them differently in your preprocessing pipeline. The choice of whether to remove or retain punctuations depends on the specific NLP task and the goals of your analysis.
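For such cases, one possible approach is to strip most punctuation while keeping the sentiment-bearing marks. The sketch below keeps “!” and “?” and removes the rest; the exact set of marks to retain is an assumption and depends on your task:

```python
import string

# Punctuation assumed to carry sentiment ("I love it!" vs "I love it")
KEEP = {"!", "?"}

def strip_punctuation_selectively(text):
    # Remove every punctuation mark except those in KEEP
    remove = set(string.punctuation) - KEEP
    return "".join(ch for ch in text if ch not in remove)
```

For example, `strip_punctuation_selectively("Wow, I love it!")` drops the comma but keeps the exclamation mark.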

import string

punctuations = string.punctuation

# Replace every punctuation mark found in the text with a space
def removing_punctuations(text, punctuations):
    for punc in punctuations:
        if punc in text:
            text = text.replace(punc, ' ')
    return text

The above shows the code for removing punctuation.

Removing and retaining numerical digits

As with punctuation, whether to remove numerical values from text depends highly on the problem at hand. If the problem requires you to retain numbers that represent important dates and quantities, as in statistical text summarization, then you have to make sure you keep the digits. In such cases you can follow these steps:

Processing Dates:

  1. Date Extraction: Use regular expressions or NLP libraries like spaCy or NLTK to extract dates from text. Regular expressions can be customized to match various date formats (e.g., “January 1, 2023,” “01/01/23,” “2023-01-01”).
  2. Date Normalization: Standardize extracted dates into a consistent format (e.g., YYYY-MM-DD) to ensure uniformity for further processing and analysis.
  3. Date Parsing: Utilize date parsing libraries such as Python’s datetime module or date parsing functions in NLP libraries to convert extracted dates into machine-readable datetime objects.
  4. Relative Dates: Handle relative dates (e.g., “yesterday,” “in two weeks”) by using date libraries to calculate the actual date based on the reference point (current date).
  5. Named Entity Recognition (NER): Train or use pre-trained NER models to recognize and extract dates as named entities. Models like spaCy’s NER can be useful for this.
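The extraction and normalization steps above can be sketched with the standard library alone. The regex below only covers two common formats (MM/DD/YY and YYYY-MM-DD) and is an illustration, not a production-grade date parser:

```python
import re
from datetime import datetime

# Patterns for two common formats, paired with their strptime format strings
DATE_PATTERNS = [
    (re.compile(r"\b\d{2}/\d{2}/\d{2}\b"), "%m/%d/%y"),
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
]

def extract_dates(text):
    """Extract dates and normalize them to YYYY-MM-DD strings."""
    found = []
    for pattern, fmt in DATE_PATTERNS:
        for match in pattern.findall(text):
            # Parse into a datetime object, then re-emit in ISO format
            found.append(datetime.strptime(match, fmt).strftime("%Y-%m-%d"))
    return found
```

For real-world text with many formats and relative dates, a dedicated library (e.g. spaCy's NER or a date parsing package) is a better fit than hand-rolled regexes.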

Processing Quantities:

  1. Number Extraction: Use regular expressions or NLP tools to extract numeric quantities from text. This includes integers, decimals, percentages, and fractions.
  2. Unit Identification: Identify units of measurement associated with quantities (e.g., “5 kilograms,” “10%,”) and extract them separately.
  3. Quantity Normalization: Standardize quantities into a consistent format, such as converting all units to a common unit (e.g., converting pounds to kilograms) or normalizing fractions and percentages.
  4. Quantity Conversion: When dealing with mixed units, you may need to convert them to a common unit for consistency and further analysis.
  5. Named Entity Recognition (NER): In some cases, NER models can be used to recognize and extract quantities as named entities, especially in domains like finance and healthcare.
  6. Word-to-Number Conversion: For textual representations of numbers (e.g., “five,” “twenty-three”), consider using libraries or functions that can convert words to numeric values.
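The quantity steps can likewise be sketched with a regex that captures a number and an optional unit, normalizing pounds to kilograms. The unit list and conversion factor here are illustrative assumptions; spelled-out numbers (“five”, “twenty-three”) would additionally need a word-to-number library:

```python
import re

# A number (integer or decimal) optionally followed by a unit or "%"
QTY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(kilograms|kg|pounds|lb|%)?")

LB_TO_KG = 0.453592  # assumed conversion factor

def extract_quantities(text):
    """Return (value, unit) pairs with pounds converted to kilograms."""
    results = []
    for value, unit in QTY_RE.findall(text):
        value = float(value)
        if unit in ("pounds", "lb"):
            # Normalize to a common unit for consistency
            value, unit = round(value * LB_TO_KG, 2), "kg"
        elif unit == "kilograms":
            unit = "kg"
        results.append((value, unit))
    return results
```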

For Removing digits you can implement the following code:

# Replaces every digit character in the text with ""
def removing_digits(text):
    for ch in text:
        if ch.isdigit():
            text = text.replace(ch, "")
    return text

Converting to Lower case

Another important preprocessing step is case folding, usually converting everything to lower case. This step is almost universal in NLP because upper and lower case mostly reflect grammatical convention rather than meaning. Note, though, that for some tasks (e.g. Named Entity Recognition, where capitalization signals proper nouns) case can carry information, so as with the previous steps the choice depends on the problem. In most cases you can simply convert everything to lower case as part of preprocessing.

This can be done easily in Python as follows:

def To_Lower(text):
    return text.lower()

And that is how you implement the first three general steps of text preprocessing for NLP. There is also the large topic of tokenization, which we will cover in the second part.
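Putting the steps together, a minimal end-to-end pipeline might look like the sketch below. The ordering (punctuation, then digits, then lowercasing) follows the sequence of this article but is ultimately a design choice:

```python
import string

def preprocess(text):
    # 1. Replace punctuation with spaces
    for punc in string.punctuation:
        text = text.replace(punc, " ")
    # 2. Remove digit characters
    text = "".join(ch for ch in text if not ch.isdigit())
    # 3. Lowercase and collapse extra whitespace
    return " ".join(text.lower().split())
```

For example, `preprocess("Hello, World! 123 times.")` yields `"hello world times"`.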

Hope this was helpful!

