Hands-On Lab On Text Preprocessing in NLP Using Python


Welcome back, folks, to this learning journey where we uncover every hidden layer of Natural Language Processing in the simplest manner possible.

“When you know that your knowledge is limited and you want to learn more, you need to make sure you are sharing your knowledge and are curious enough to receive the same from others.”

In Part 2, we learned the basic steps involved in NLP text preprocessing, where we covered:

  • What Is Tokenization?
  • What is Stop Word Removal & how we do it?
  • What is Normalization of Text?
  • What is Stemming & Why it is done?
  • What is Lemmatization & Why it is done?

So today, in Part 3 of the NLP series, we will cover:

  • What Is NER: Named Entity Recognition?
  • What Is POS: Part Of Speech Tagging?
  • Hands-on code to understand NLP preprocessing techniques using Python and the NLTK and spaCy libraries

Excited! Let’s Get Started …..

What is NER & Why It Is Used?

Named Entity Recognition is the mechanism of labeling or identifying given sequences of words in a text (called named entities) and classifying them into predefined categories like the name of a person, organization, location, expression, time, quantity, etc.

Why Is It Used and What Are Its Common Applications?

NER is a very useful entity-extraction mechanism that can be used alone or along with topic identification. It adds a lot of useful semantic information to the content, helping us understand the subject of a given text.

It is extremely helpful in:

  • Identifying names in given social media content
  • Reading through news articles to extract named entities like brand names, company names, etc.
  • Simplifying search and making it efficient by giving us a set of entities to look for in the pool of documents
  • Picking the relevant location information from a given corpus so we can streamline our actions accordingly
  • Planning chatbot responses; NER is used extensively in chatbots to decide what to say back to the user.

NLTK, and especially spaCy, are efficient Python libraries that help us perform NER on a given corpus.
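
To make this concrete, here is a minimal spaCy NER sketch. Treat it as a rough illustration, assuming spaCy and its small English model en_core_web_sm are installed; the exact entities and labels can vary with the model version.

import spacy

# Load spaCy's small English pipeline
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

doc = nlp('Apple is looking at buying a U.K. startup for $1 billion')

# Each detected entity exposes its text span and predicted label
for ent in doc.ents:
    print(ent.text, ent.label_)

With a typical English model this prints something like Apple ORG, U.K. GPE and $1 billion MONEY, i.e. an organization, a geopolitical entity, and a monetary amount.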

What Is POS tagging?

Part-of-speech tagging is a labeling mechanism where the words in a given corpus are tagged grammatically, i.e., each word is labeled as a noun, verb, adjective, adverb, etc. It even supports fine-grained tags like ‘noun-plural’ and takes tense into account while tagging.

POS-tagging algorithms fall into two distinct groups:

  • Rule-based tagging
  • Stochastic tagging

E. Brill’s tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms.

We will understand the practical application of POS tagging using spaCy in our hands-on coding section.

POS Tagging Application:

Part-of-speech tagging is extremely useful for (see the short spaCy sketch after this list):

  • Text-to-speech and speech recognition systems
  • Word sense disambiguation

and many more …
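
As a quick preview (we will do POS tagging with NLTK in the hands-on section below), here is roughly how the same idea looks in spaCy. Again, this is a sketch that assumes the en_core_web_sm model is installed.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Prisha is learning maths online')

# token.pos_ is the coarse-grained tag (NOUN, VERB, ...),
# token.tag_ is the fine-grained Penn Treebank style tag (NN, VBZ, ...)
for token in doc:
    print(token.text, token.pos_, token.tag_)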

Now that we have covered all the important NLP text preprocessing techniques, it's time to get our hands dirty with some real Python code.

Hands-On Python Lab Exercise To Understand NLP Text Preprocessing Steps

Pre-requisite :

  • Install the Anaconda distribution (Anaconda Individual Edition). It comes preloaded with the Python libraries required for our NLP implementation.
  • We will be using the Jupyter Notebook editor for our code lab.

It's Code Time:

The NLTK Python library comes preloaded with plenty of corpora that one can use to quickly practice text preprocessing steps.

We will be using one such corpus, the Reuters corpus.

import nltk
from nltk.corpus import reuters
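
Note: if the Reuters corpus (and the tokenizer, stopword, tagger, and WordNet data we use later) is not already present in your NLTK data path, a one-time download along these lines may be needed. Treat this as an optional setup sketch using the standard NLTK resource names.

import nltk

# One-time downloads; already-installed resources are simply reported as up to date.
nltk.download('reuters')                     # the Reuters corpus used in this lab
nltk.download('punkt')                       # models behind sent_tokenize / word_tokenize
nltk.download('stopwords')                   # English stopword list
nltk.download('averaged_perceptron_tagger')  # tagger behind nltk.pos_tag
nltk.download('wordnet')                     # data for WordNetLemmatizer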

Let's See the File IDs in the Corpus:

reuters.fileids() # List file-ids in the corpus

The output of the above code will look like the following:

['test/14826',
'test/14828',
'test/14829',
'test/14832',
'test/14833',
'test/14839',
'test/14840',
'test/14841',
'test/14842',

...]

Let's See the Categories in the Given Corpus:

reuters.categories() # List news categories in the corpus

The output of the above code:

['acq',
'alum',
'barley',
'bop',
'carcass',
'castor-oil',
'cocoa',
'coconut',
'coconut-oil',
'coffee',
'copper',
'copra-cake',
'corn',
'cotton',
'cotton-oil',
'cpi',
'cpu',
'crude',
'dfl',
'dlr',
'dmk',
'earn',
'fuel',
'gas',
'gnp',
'gold',
'grain',
'groundnut',
'groundnut-oil',
'heat',
'hog',
'housing',
'income',
'instal-debt',
'interest',
'ipi',
'iron-steel',
'jet',
'jobs',
'l-cattle',
'lead',
'lei',
'lin-oil',
'livestock',
'lumber',
'meal-feed',
'money-fx',
'money-supply',
'naphtha',
'nat-gas',
'nickel',
'nkr',
'nzdlr',
'oat',
'oilseed',
'orange',
'palladium',
'palm-oil',
'palmkernel',
'pet-chem',
'platinum',
'potato',
'propane',
'rand',
'rape-oil',
'rapeseed',
'reserves',
'retail',
'rice',
'rubber',
'rye',
'ship',
'silver',
'sorghum',
'soy-meal',
'soy-oil',
'soybean',
'strategic-metal',
'sugar',
'sun-meal',
'sun-oil',
'sunseed',
'tea',
'tin',
'trade',
'veg-oil',
'wheat',
'wpi',
'yen',
'zinc']

Let's see the list of word tokens in the given corpus and print its length:

# Returns a list of strings (word tokens)
print(reuters.words())
length = len(reuters.words())
print("Length of the corpus ", length)

The output of the above code:

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', ...]
Length of the corpus  1720901

Viewing particular news categories:

reuters.fileids(['wheat','rice']) # List file ids with either wheat or rice categories
# Some file ids may overlap as news covers multiple categories

The output of the above code:

['test/14832',
'test/14841',
'test/14858',
'test/15043',
'test/15097',
'test/15132',
'test/15206',
'test/15271',
'test/15273',
'test/15341',
'test/15367',
'test/15388',
'test/15472',
'test/15500',
'test/15567',
'test/15572',
'test/15582',
'test/15618',
....

Let’s see the sentences in the given corpus:

reuters.sents()

The output will look like :

[['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.'], ['They', 'told', 'Reuter', 'correspondents', 'in', 'Asian', 'capitals', 'a', 'U', '.', 'S', '.', 'Move', 'against', 'Japan', 'might', 'boost', 'protectionist', 'sentiment', 'in', 'the', 'U', '.', 'S', '.', 'And', 'lead', 'to', 'curbs', 'on', 'American', 'imports', 'of', 'their', 'products', '.'], ...]

We can access a specific file with the given fileids argument.

reuters.sents(fileids='test/15500')

The output will look like :

[['RPT', '-', 'ARGENTINE', 'GRAIN', '/', 'OILSEED', 'EXPORT', 'PRICES', 'ADJUSTED', 'The', 'Argentine', 'Grain', 'Board', 'adjusted', 'minimum', 'export', 'prices', 'of', 'grain', 'and', 'oilseed', 'products', 'in', 'dlrs', 'per', 'tonne', 'FOB', ',', 'previous', 'in', 'brackets', ',', 'as', 'follows', ':', 'Sorghum', '64', '(', '63', '),', 'sunflowerseed', 'cake', 'and', 'expellers', '103', '(', '102', ')', ',', 'pellets', '101', '(', '100', '),', 'meal', '99', '(', '98', '),', 'linseed', 'oil', '274', '(', '264', '),', 'groundnutseed', 'oil', '450', '(', '445', '),', 'soybean', 'oil', '300', '(', '290', '),', 'rapeseed', 'oil', '290', '(', '280', ').'], ['Sunflowerseed', 'oil', 'for', 'shipment', 'through', 'May', '323', '(', '313', ')', 'and', 'june', 'onwards', '330', '(', '320', ').'], ...]

Let's see how many files the Reuters corpus holds:

len(reuters.fileids()) # Total number of files in the corpus

The output will look like :

10788

Let's see the raw content of the Reuters corpus for a particular file ID:

rawtext= reuters.raw('test/15500').strip()[:1000]
print(rawtext) # First 1000 characters.

The output will look like :

RPT - ARGENTINE GRAIN/OILSEED EXPORT PRICES ADJUSTED
The Argentine Grain Board adjusted
minimum export prices of grain and oilseed products in dlrs per
tonne FOB, previous in brackets, as follows:
Sorghum 64 (63), sunflowerseed cake and expellers 103 (102)
, pellets 101 (100), meal 99 (98), linseed oil 274 (264),
groundnutseed oil 450 (445), soybean oil 300 (290), rapeseed
oil 290 (280).
Sunflowerseed oil for shipment through May 323 (313) and
june onwards 330 (320).
The board also adjusted export prices at which export taxes
are levied in dlrs per tonne FOB, previous in brackets, as
follows:
Bran pollard wheat 40 (42), pellets 42 (44).

Here you can see that some sentences contain slashes, there is a mix of lowercase and uppercase words, and some numbers have parentheses around them.

Let's See How Tokenization Works

Sentence Tokenization

In NLTK, sent_tokenize() is the default tokenizer function that you can use to split strings into “sentences”.

from nltk import sent_tokenize, word_tokenize

sent_tokenize(rawtext)

The output will look like :

['RPT - ARGENTINE GRAIN/OILSEED EXPORT PRICES ADJUSTED\n  The Argentine Grain Board adjusted\n  minimum export prices of grain and oilseed products in dlrs per\n  tonne FOB, previous in brackets, as follows:\n      Sorghum 64 (63), sunflowerseed cake and expellers 103 (102)\n  , pellets 101 (100), meal 99 (98), linseed oil 274 (264),\n  groundnutseed oil 450 (445), soybean oil 300 (290), rapeseed\n  oil 290 (280).',
'Sunflowerseed oil for shipment through May 323 (313) and\n june onwards 330 (320).',
'The board also adjusted export prices at which export taxes\n are levied in dlrs per tonne FOB, previous in brackets, as\n follows:\n Bran pollard wheat 40 (42), pellets 42 (44).']

Word tokenization:

It is the process of splitting “sentences” up into “words”. Now that we have tokenized the raw text into sentences, we can create word tokens using word_tokenize.

for sent in sent_tokenize(rawtext):
    print(word_tokenize(sent))

The output of the above code:

['RPT', '-', 'ARGENTINE', 'GRAIN/OILSEED', 'EXPORT', 'PRICES', 'ADJUSTED', 'The', 'Argentine', 'Grain', 'Board', 'adjusted', 'minimum', 'export', 'prices', 'of', 'grain', 'and', 'oilseed', 'products', 'in', 'dlrs', 'per', 'tonne', 'FOB', ',', 'previous', 'in', 'brackets', ',', 'as', 'follows', ':', 'Sorghum', '64', '(', '63', ')', ',', 'sunflowerseed', 'cake', 'and', 'expellers', '103', '(', '102', ')', ',', 'pellets', '101', '(', '100', ')', ',', 'meal', '99', '(', '98', ')', ',', 'linseed', 'oil', '274', '(', '264', ')', ',', 'groundnutseed', 'oil', '450', '(', '445', ')', ',', 'soybean', 'oil', '300', '(', '290', ')', ',', 'rapeseed', 'oil', '290', '(', '280', ')', '.']
['Sunflowerseed', 'oil', 'for', 'shipment', 'through', 'May', '323', '(', '313', ')', 'and', 'june', 'onwards', '330', '(', '320', ')', '.']
['The', 'board', 'also', 'adjusted', 'export', 'prices', 'at', 'which', 'export', 'taxes', 'are', 'levied', 'in', 'dlrs', 'per', 'tonne', 'FOB', ',', 'previous', 'in', 'brackets', ',', 'as', 'follows', ':', 'Bran', 'pollard', 'wheat', '40', '(', '42', ')', ',', 'pellets', '42', '(', '44', ')', '.']

Lowercasing :

As we can see, there are some words in uppercase. Let's lowercase all of them.

for sent in sent_tokenize(rawtext):
    # It's a little inefficient to loop through each word,
    # but sometimes it helps to get better tokens.
    print([word.lower() for word in word_tokenize(sent)])

The output of the above code:

['rpt', '-', 'argentine', 'grain/oilseed', 'export', 'prices', 'adjusted', 'the', 'argentine', 'grain', 'board', 'adjusted', 'minimum', 'export', 'prices', 'of', 'grain', 'and', 'oilseed', 'products', 'in', 'dlrs', 'per', 'tonne', 'fob', ',', 'previous', 'in', 'brackets', ',', 'as', 'follows', ':', 'sorghum', '64', '(', '63', ')', ',', 'sunflowerseed', 'cake', 'and', 'expellers', '103', '(', '102', ')', ',', 'pellets', '101', '(', '100', ')', ',', 'meal', '99', '(', '98', ')', ',', 'linseed', 'oil', '274', '(', '264', ')', ',', 'groundnutseed', 'oil', '450', '(', '445', ')', ',', 'soybean', 'oil', '300', '(', '290', ')', ',', 'rapeseed', 'oil', '290', '(', '280', ')', '.']
['sunflowerseed', 'oil', 'for', 'shipment', 'through', 'may', '323', '(', '313', ')', 'and', 'june', 'onwards', '330', '(', '320', ')', '.']
['the', 'board', 'also', 'adjusted', 'export', 'prices', 'at', 'which', 'export', 'taxes', 'are', 'levied', 'in', 'dlrs', 'per', 'tonne', 'fob', ',', 'previous', 'in', 'brackets', ',', 'as', 'follows', ':', 'bran', 'pollard', 'wheat', '40', '(', '42', ')', ',', 'pellets', '42', '(', '44', ')', '.']

Woohoo! All the uppercase words have been lowercased now.

Removing Stop Words:

NLTK comes with a rich set of stopwords as part of its corpora. We will use it to remove stopwords from our raw text.

from nltk.corpus import stopwords

stopwords_en = stopwords.words('english')
print(stopwords_en)

Output when we print stopwords_en:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# Tokenize and lowercase
tokenized_lowercase = list(map(str.lower, word_tokenize(rawtext)))

stopwords_english = set(stopwords.words('english')) # Set checking is faster in Python than list.

# List comprehension.
print([word for word in tokenized_lowercase if word not in stopwords_english])

Output after lowercasing and stop removal method :

['rpt', '-', 'argentine', 'grain/oilseed', 'export', 'prices', 'adjusted', 'argentine', 'grain', 'board', 'adjusted', 'minimum', 'export', 'prices', 'grain', 'oilseed', 'products', 'dlrs', 'per', 'tonne', 'fob', ',', 'previous', 'brackets', ',', 'follows', ':', 'sorghum', '64', '(', '63', ')', ',', 'sunflowerseed', 'cake', 'expellers', '103', '(', '102', ')', ',', 'pellets', '101', '(', '100', ')', ',', 'meal', '99', '(', '98', ')', ',', 'linseed', 'oil', '274', '(', '264', ')', ',', 'groundnutseed', 'oil', '450', '(', '445', ')', ',', 'soybean', 'oil', '300', '(', '290', ')', ',', 'rapeseed', 'oil', '290', '(', '280', ')', '.', 'sunflowerseed', 'oil', 'shipment', 'may', '323', '(', '313', ')', 'june', 'onwards', '330', '(', '320', ')', '.', 'board', 'also', 'adjusted', 'export', 'prices', 'export', 'taxes', 'levied', 'dlrs', 'per', 'tonne', 'fob', ',', 'previous', 'brackets', ',', 'follows', ':', 'bran', 'pollard', 'wheat', '40', '(', '42', ')', ',', 'pellets', '42', '(', '44', ')', '.']

As you can see in the output, we got rid of stopwords like ‘and’, ‘of’, ‘as’, ‘in’, etc. after stop word removal using the NLTK library.

Treating Punctuation:

We will perform punctuation and stop word removal together in the code snippet below.

# Define punctuation
from string import punctuation

# punctuation is a string, so we turn it into a set before taking the union
print('From string.punctuation:', type(punctuation), punctuation)

punct_stopwords = stopwords_english.union(set(punctuation))
print(punct_stopwords)

Removing punctuation from the tokenized lowercase words we processed earlier:

punct_stop_words = [word for word in tokenized_lowercase if word not in punct_stopwords]

print(punct_stop_words)

Output :

['rpt', 'argentine', 'grain/oilseed', 'export', 'prices', 'adjusted', 'argentine', 'grain', 'board', 'adjusted', 'minimum', 'export', 'prices', 'grain', 'oilseed', 'products', 'dlrs', 'per', 'tonne', 'fob', 'previous', 'brackets', 'follows', 'sorghum', '64', '63', 'sunflowerseed', 'cake', 'expellers', '103', '102', 'pellets', '101', '100', 'meal', '99', '98', 'linseed', 'oil', '274', '264', 'groundnutseed', 'oil', '450', '445', 'soybean', 'oil', '300', '290', 'rapeseed', 'oil', '290', '280', 'sunflowerseed', 'oil', 'shipment', 'may', '323', '313', 'june', 'onwards', '330', '320', 'board', 'also', 'adjusted', 'export', 'prices', 'export', 'taxes', 'levied', 'dlrs', 'per', 'tonne', 'fob', 'previous', 'brackets', 'follows', 'bran', 'pollard', 'wheat', '40', '42', 'pellets', '42', '44']

Comparing this with the previous output, we can see that all the remaining stopwords, along with the punctuation tokens (commas, parentheses, colons, and full stops), have been removed after stopword and punctuation preprocessing.

Stemming & Lemmatization:

NLTK comes with several common stemmers and lemmatizers built in, such as the Porter stemmer, the Snowball stemmer, the Lancaster stemmer, and the WordNet lemmatizer.

Let's start with the Porter stemmer:

Let's take a few simple examples to understand how the Porter stemming algorithm works:

from nltk.stem import PorterStemmer

porter = PorterStemmer()

for word in ['Talking', 'Talks', 'Talked']:
    print(porter.stem(word))

Output:

talk
talk
talk

Quick Comment:

As you can see, when we passed the words {Talking, Talks, Talked}, the Porter algorithm converted them all to 'talk' by stemming them to their root, removing 'ing', 's', and 'ed' from the given words.
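
It is worth noting that stems are not always valid dictionary words; a quick sketch like the one below illustrates this.

from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Stems can be truncated forms rather than real dictionary words:
# 'studies' typically becomes 'studi' and 'university' becomes 'univers'.
for word in ['studies', 'university', 'meeting']:
    print(porter.stem(word))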

Lemmatization:

Let's use the same set of example strings we used for stemming. Here we download the WordNet data that the WordNetLemmatizer package needs to perform lemmatization preprocessing.

import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

for word in ['Talking', 'Talks', 'Talked']:
    print(wnl.lemmatize(word))

The downloader prints something like:

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/prammobibt/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.

Output:

Talking
Talks
Talked

Quick Observation:

Here you can see that it returned the same strings, Talking, Talks, Talked, as output. This is because WordNetLemmatizer treats every word as a noun by default; lemmatization will not work properly until part-of-speech information is supplied for the given words.

The lemmatizer needs POS tags to function properly.
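
For instance, supplying an explicit part of speech changes the behaviour. A small sketch of standard WordNetLemmatizer usage: with pos='v' each of the words below should reduce to 'talk' (note the words are lowercased first, since the lemmatizer is case-sensitive).

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

# pos='v' tells the lemmatizer to treat each word as a verb
# instead of the default noun, so the inflections are stripped.
for word in ['talking', 'talks', 'talked']:
    print(wnl.lemmatize(word, pos='v'))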

POS Tagging :

pos_tag takes a tokenized sentence as input, i.e. a list of strings, and returns a list of (word, tag) tuples.

from nltk import pos_tag

wnl = WordNetLemmatizer()

def penn2morphy(penntag):
    """ Converts Penn Treebank tags to WordNet tags. """
    morphy_tag = {'NN': 'n', 'JJ': 'a',
                  'VB': 'v', 'RB': 'r'}
    try:
        return morphy_tag[penntag[:2]]
    except KeyError:
        return 'n'  # If the mapping isn't found, fall back to noun.

pos_tagged_sent = pos_tag(word_tokenize('Prisha is learning maths online'))
print(pos_tagged_sent)

Output:

[('Prisha', 'NNP'), ('is', 'VBZ'), ('learning', 'VBG'), ('maths', 'NNS'), ('online', 'NN')]

Now that we have implemented POS tagging, let's see how WordNetLemmatizer lemmatizes the same sentence that was POS-tagged above.

def lemmatize_sent(text):
    # Input text is a string; returns a list of lowercased lemmas.
    return [wnl.lemmatize(word.lower(), pos=penn2morphy(tag))
            for word, tag in pos_tag(word_tokenize(text))]

lemmatize_sent('Prisha is learning maths online')

Output after lemmatization:

['prisha', 'be', 'learn', 'math', 'online']

Let's see how our lemmatize_sent function works on our raw text.

We will combine lemmatization and stop word removal in the code given below.

print('Raw Text Before Lemmatization ')
print(rawtext, '\n')
print('Raw Text After Stop Word Removal & Lemmatization \n')
print([word for word in lemmatize_sent(rawtext)
       if word not in stopwords_english
       and not word.isdigit()])

The output of the above code snippet:

Raw Text Before Lemmatization 
RPT - ARGENTINE GRAIN/OILSEED EXPORT PRICES ADJUSTED
The Argentine Grain Board adjusted
minimum export prices of grain and oilseed products in dlrs per
tonne FOB, previous in brackets, as follows:
Sorghum 64 (63), sunflowerseed cake and expellers 103 (102)
, pellets 101 (100), meal 99 (98), linseed oil 274 (264),
groundnutseed oil 450 (445), soybean oil 300 (290), rapeseed
oil 290 (280).
Sunflowerseed oil for shipment through May 323 (313) and
june onwards 330 (320).
The board also adjusted export prices at which export taxes
are levied in dlrs per tonne FOB, previous in brackets, as
follows:
Bran pollard wheat 40 (42), pellets 42 (44).

Raw Text After Stop Word Removal & Lemmatization

['rpt', '-', 'argentine', 'grain/oilseed', 'export', 'price', 'adjusted', 'argentine', 'grain', 'board', 'adjust', 'minimum', 'export', 'price', 'grain', 'oilseed', 'product', 'dlrs', 'per', 'tonne', 'fob', ',', 'previous', 'bracket', ',', 'follow', ':', 'sorghum', '(', ')', ',', 'sunflowerseed', 'cake', 'expellers', '(', ')', ',', 'pellets', '(', ')', ',', 'meal', '(', ')', ',', 'linseed', 'oil', '(', ')', ',', 'groundnutseed', 'oil', '(', ')', ',', 'soybean', 'oil', '(', ')', ',', 'rapeseed', 'oil', '(', ')', '.', 'sunflowerseed', 'oil', 'shipment', 'may', '(', ')', 'june', 'onwards', '(', ')', '.', 'board', 'also', 'adjust', 'export', 'price', 'export', 'tax', 'levy', 'dlrs', 'per', 'tonne', 'fob', ',', 'previous', 'bracket', ',', 'follow', ':', 'bran', 'pollard', 'wheat', '(', ')', ',', 'pellets', '(', ')', '.']

Quick Observation On Lemmatization :

We can see clearly that after performing stop word removal and lemmatization along with POS tagging:

  • 'prices' got lemmatized to 'price'
  • 'products' got lemmatized to 'product'
  • 'brackets' got lemmatized to 'bracket'

Summary Of Our Hands-On Python Lab:

WOW! We have finished our first hands-on lab on basic NLP preprocessing techniques and covered:

  • Tokenization
  • Lowercasing
  • Stop Word Removal
  • Stemming
  • POS tagging
  • Lemmatization

What’s Next?

Hope you all got the gist of the basic NLP preprocessing techniques from our Python hands-on lab. Next, we will look into:

  • Word vectorization
  • BOW: Bag Of Words
  • TF-IDF as features

Also, we will learn how to make use of the powerful spaCy library to perform NLP preprocessing with enhanced speed and efficiency.

Time to sign off with this food for thought:

The true test of data scientists or AI/ML engineers working in NLP comes when they are given a corpus that is extremely unstructured, polluted, ambiguous, and lacking the required context. If you treat your data well, you will get the best treatment from the model you build.

See you all in Part 4 of this NLP series.

Thanks for being there ……
