Published in MLearning.ai

IMPORTANT TEXT PRE-PROCESSING TECHNIQUES FOR NLP

Natural Language Processing (NLP) lets us communicate with a computer in much the same way we talk to another person. NLP can also be defined as the intersection of Artificial Intelligence (AI), Linguistics and Computer Science that helps a machine understand, interpret and manipulate human language.

There are two main parts to NLP:
1. Data Preprocessing
2. Algorithm development

In this blog, we’ll look only at the first and most important of these: “data preprocessing”.

NLP text preprocessing

Data preprocessing is an essential step for any Machine Learning model. It plays a major role in deciding the model’s performance, because that performance depends on how well the raw data is cleaned and preprocessed. Let’s see the various preprocessing steps that are involved:

a) Lower casing

As the name implies, in this method we convert our text data to lower case. A text input, such as a paragraph, will contain words in both lower and upper case. However, the computer treats the same word written in different cases as different entities.

For example, “Hello” and “hello” are treated as two different words by the computer even though they are the same word. To avoid this kind of mismatch, we convert all the words to lower case.
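A minimal sketch of lower casing, using only Python's built-in `str.lower()`. The helper name `to_lower` and the sample tokens are illustrative assumptions, not part of any library.

```python
def to_lower(text: str) -> str:
    """Return the text with every cased character converted to lower case."""
    return text.lower()


tokens = ["Hello", "hello", "HELLO"]
normalized = [to_lower(t) for t in tokens]

print(normalized)  # ['hello', 'hello', 'hello']
# Before lower casing there are 3 distinct strings, afterwards only 1.
print(len(set(tokens)), len(set(normalized)))  # 3 1
```

Lower casing is usually applied before counting or comparing words, so that all spellings of a word share a single vocabulary entry.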

b) Tokenization

Tokenization is the process of breaking down the input text or paragraph into smaller units such as sentences or words. Each of these units is called a token. The idea behind tokenization is that later steps can analyze the text token by token to understand its content.

Sentence tokenize

In sentence tokenization, the output for an input paragraph is a list with one sentence per token.
Ex: For “This blog is about preprocessing. Preprocessing is essential”, the sentence-tokenized output is [‘This blog is about preprocessing.’, ‘Preprocessing is essential’].
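A rough sketch of sentence tokenization with a regular expression from Python's standard `re` module. The function name `sentence_tokenize` is a hypothetical helper. It assumes that a sentence ends at `.`, `!` or `?` followed by whitespace, so it will wrongly split on abbreviations such as "Dr. Smith". It also keeps the terminal punctuation attached to each sentence.

```python
import re


def sentence_tokenize(text: str) -> list[str]:
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    # The lookbehind keeps the punctuation mark with the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


paragraph = "This blog is about preprocessing. Preprocessing is essential"
print(sentence_tokenize(paragraph))
# ['This blog is about preprocessing.', 'Preprocessing is essential']
```

Dedicated NLP libraries ship trained sentence splitters that handle abbreviations and other edge cases better than a single regular expression.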

Word tokenize

In word tokenization, the output for an input text is a list with one word per token.
Ex: For “This blog is about preprocessing”, the word-tokenized output is [‘This’, ‘blog’, ‘is’, ‘about’, ‘preprocessing’].
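A minimal sketch of word tokenization, again using only the standard `re` module. The name `simple_word_tokenize` is a hypothetical helper. It assumes that a token is either a run of word characters or a single punctuation mark, which is cruder than what NLP libraries do (for instance, it splits contractions at the apostrophe).

```python
import re


def simple_word_tokenize(text: str) -> list[str]:
    """Return word tokens and punctuation marks as separate tokens."""
    # \w+      : a run of letters, digits or underscores (a word)
    # [^\w\s]  : any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_word_tokenize("This blog is about preprocessing"))
# ['This', 'blog', 'is', 'about', 'preprocessing']
print(simple_word_tokenize("Wow!"))
# ['Wow', '!']
```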

c) Punctuation mark removal

Removing punctuation from the text is one of the most common text preprocessing techniques. It helps treat equivalent text equally: for example, “Wow” and “Wow!” are treated as the same word. We should also be careful while removing punctuation, because words like “don’t” will become “don t” if the apostrophe is simply replaced by a space.
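A small sketch of punctuation removal with the standard library's `string.punctuation` and `str.translate`. The helper `remove_punctuation` and its `keep` parameter are illustrative assumptions. This version deletes punctuation instead of replacing it with a space, so a contraction collapses to "dont" unless the apostrophe is kept. It also covers only ASCII punctuation, so typographic quotes such as ’ are left untouched.

```python
import string


def remove_punctuation(text: str, keep: str = "") -> str:
    """Delete ASCII punctuation characters, except those listed in `keep`."""
    to_delete = "".join(ch for ch in string.punctuation if ch not in keep)
    # str.maketrans with a third argument maps each listed character to None.
    return text.translate(str.maketrans("", "", to_delete))


print(remove_punctuation("Wow!"))                      # Wow
print(remove_punctuation("I don't know."))             # I dont know
print(remove_punctuation("I don't know.", keep="'"))   # I don't know
```

Whether to keep apostrophes depends on the later steps: keeping them preserves contractions as single recognizable tokens.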

d) Stop word removal

You may have noticed that some words pop up very frequently in a language, irrespective of what you are writing about. Those words are called “stop words”.

Stop words are a collection of words that occur very often but do not add much meaning to the sentence. They are mostly grammatical words of the language. For example, in English we have stop words like “the”, “are”, “he”, “him”, etc. We can remove these stop words from our text data, since they don’t add much value to the overall meaning of the sentence.
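A minimal sketch of stop word removal. The `STOP_WORDS` set below is a tiny hand-written list chosen for illustration only. NLP libraries provide much longer, language-specific lists, and the right list depends on the task.

```python
# Illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "are", "he", "him", "is", "a", "an", "of", "to", "and", "in", "this", "about"}


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens whose lower-cased form is in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = ["This", "blog", "is", "about", "preprocessing"]
print(remove_stop_words(tokens))  # ['blog', 'preprocessing']
```

The comparison is done on the lower-cased token so that "This" and "this" are both filtered, while the surviving tokens keep their original casing.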

e) Stemming

Stemming is the process of reducing a word to its root or stem. The root form is what is left after the word’s affixes are removed. For example, the words “plays”, “playing” and “played” are all reduced to the root word “play”.
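A deliberately naive sketch of stemming by suffix stripping. This is not the Porter stemmer or any other published algorithm. The suffix list and the minimum stem length of 3 are assumptions made to keep the example short.

```python
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip one common English suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([naive_stem(w) for w in ["plays", "playing", "played"]])
# ['play', 'play', 'play']
print(naive_stem("studies"))
# studie
```

The second call shows a typical weakness of stemming: "studie" is not an English word, because the rule only cuts characters and has no knowledge of the vocabulary.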

f) Lemmatization

We have seen how stemming reduces words to their root forms. However, stemming does not always produce words that are part of the language’s vocabulary; it often produces strings with no meaning. This is where lemmatization comes into play.

Lemmatization is the process of converting the words in a text into a meaningful base word (the lemma) that exists in the language.

A key practical difference between lemmatization and stemming is that a lemmatizer can take a Part of Speech (POS) parameter. The POS tag provides the context in which we wish to lemmatize a word, for example whether it is used as a noun or a verb.
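A toy sketch of how a POS parameter can change the lemma. The lookup table `LEMMA_TABLE`, the function `lemmatize`, and the single-letter tags (`"n"` noun, `"v"` verb, `"a"` adjective) are all assumptions made for this example. Real lemmatizers rely on a full dictionary of the language plus morphological rules, not a handful of entries.

```python
# (word, pos) -> lemma. A hand-written, hypothetical table for illustration.
LEMMA_TABLE = {
    ("plays", "v"): "play",
    ("playing", "v"): "play",
    ("played", "v"): "play",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("studies", "v"): "study",
    ("studies", "n"): "study",
    ("better", "a"): "good",
}


def lemmatize(word: str, pos: str = "n") -> str:
    """Look up the lemma for (word, pos); return the word unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(lemmatize("meeting", pos="v"))  # meet
print(lemmatize("meeting", pos="n"))  # meeting
print(lemmatize("studies", pos="n"))  # study
print(lemmatize("better", pos="a"))   # good
```

The same surface form "meeting" maps to different lemmas depending on the POS tag, and every result is a real word, unlike the output of a suffix-stripping stemmer.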

So far, we have discussed the essential pre-processing techniques of Natural Language Processing (NLP). They help improve the performance of the model by effectively cleaning and processing the data.

Happy learning!!

Sudarshan S


Tech enthusiast | Developer | Programmer | Cybersecurity | Machine learning | Data science