Getting Started With Natural Language Processing

A gentle introduction to Python's NLTK library, with simple examples

S Joel Franklin
The Startup
Dec 27, 2019



The necessary packages are imported.

# Importing the necessary packages
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import PunktSentenceTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
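
If you are running NLTK for the first time, the models and corpora used in this article may need to be downloaded once. A minimal sketch (the resource names below cover everything used here; adjust as needed for your setup):

# Downloading the NLTK data used in this article (one-time setup)
import nltk
nltk.download('punkt')                       # Pretrained Punkt sentence tokenizer models
nltk.download('stopwords')                   # Stopword lists
nltk.download('averaged_perceptron_tagger')  # POS tagger model
nltk.download('tagsets')                     # Documentation for nltk.help.upenn_tagset()
nltk.download('maxent_ne_chunker')           # Named entity chunker model
nltk.download('words')                       # Word list used by the NE chunker
nltk.download('state_union')                 # State of the Union corpus used in the chunking examples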

Let us understand the function of each package.

from nltk.tokenize import word_tokenize, sent_tokenize

‘word_tokenize’ returns a list of the words in the input sentence.

# Example usage of 'word_tokenize'
a = 'Spending today complaining about yesterday will not make tomorrow any better'
word_tokenize(a)

The output is [‘Spending’, ‘today’, ‘complaining’, ‘about’, ‘yesterday’, ‘will’, ‘not’, ‘make’, ‘tomorrow’, ‘any’, ‘better’].

Given below is a snippet of code that serves roughly the same function as ‘word_tokenize’ (optional). Note that a plain split, unlike ‘word_tokenize’, does not separate punctuation from words.

# Splitting the sentence on spaces
def word(a):
    return a.split(' ')

a = 'Spending today complaining about yesterday will not make tomorrow any better'
word(a)

‘sent_tokenize’ returns a list of the sentences in the input paragraph.

# Example usage of 'sent_tokenize'
a = 'Today is a good day. I would like to do more. Would you like to join me? It is gonna be fun'
sent_tokenize(a)

The output is [‘Today is a good day.’, ‘I would like to do more.’, ‘Would you like to join me?’, ‘It is gonna be fun’]

‘sent_tokenize’ uses a pretrained instance of ‘PunktSentenceTokenizer’ under the hood.

# Example usage of PunktSentenceTokenizer
# Importing the PunktSentenceTokenizer
from nltk import PunktSentenceTokenizer

# Defining the training and sample text
train_txt = 'Mr. Mike met Shane. The appointment was critical. Mr. Mike was in serious condition'
sample_txt = 'Mr. Watson met Mark. The appointment was critical. Mr. Watson was in serious condition'

# Training the PunktSentenceTokenizer on 'train_txt'
custom_tokenizer = PunktSentenceTokenizer(train_txt)

# Tokenizing the sample_txt using the trained PunktSentenceTokenizer
tokenized = custom_tokenizer.tokenize(sample_txt)
tokenized

The output is [‘Mr. Watson met Mark.’, ‘The appointment was critical.’, ‘Mr. Watson was in serious condition’].

‘sent_tokenize’ is pretrained: it requires no training text and can tokenize straight away. But it is less helpful if the text to be tokenized is very different or uncommon, and the tokenization may not be perfect. In that case, ‘PunktSentenceTokenizer’ is a better option: the tokenizer is first trained on a training text similar to the text to be tokenized, and the trained tokenizer is then used for tokenizing.
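
For comparison, you can run the pretrained tokenizer on the same sample text (a quick check, not part of the original example):

# Tokenizing the same sample text with the pretrained 'sent_tokenize'
sent_tokenize(sample_txt)

On everyday English like this, the two outputs should match; the trained tokenizer earns its keep mainly on unusual text.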

Given below is a snippet of code that serves roughly the same function as ‘sent_tokenize’ (optional).

# Defining the sentence tokenizer function. There are 3 end punctuation marks: '?', '!', '.'.
# '?' and '!' are first replaced with '.', then the text is split on '.'.
def sentence(a):
    a = a.replace('?', '.').replace('!', '.')
    return a.split('.')

# Defining the text to be tokenized
a = 'Today is a good day. I would like to do more. Would you like to join me? It is gonna be fun'
sentence(a)

The set of ‘stopwords’ for each supported language is predefined in the NLTK library.

from nltk.corpus import stopwords
print(set(stopwords.words('english')))

‘Stopwords’ are commonly used words in a particular language. Because they occur so commonly and widely, they don’t help in classifying text. Hence, to save computational power and time, stopwords are filtered out, an important preprocessing step for natural language processing applications.
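
As an illustration, here is a minimal sketch of filtering stopwords out of a tokenized sentence (the sentence is reused from the ‘word_tokenize’ example above):

# Filtering stopwords from a tokenized sentence
stop_words = set(stopwords.words('english'))
a = 'Spending today complaining about yesterday will not make tomorrow any better'
filtered = [w for w in word_tokenize(a) if w.lower() not in stop_words]
print(filtered)

This should drop words like ‘about’, ‘will’ and ‘not’, keeping only the content words.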

from nltk.stem import PorterStemmer

Stemming is the process of reducing a word to its word stem, the root form to which affixes attach. It is a kind of normalization for words.

# Example usage of 'PorterStemmer'
# Defining a list of words to be stemmed
a = ['lift', 'lifts', 'lifting', 'lifter', 'lifted']

# Defining an empty list to store the stemmed words
a_stemmed = []

# Defining an instance of class PorterStemmer()
ps = PorterStemmer()
for x in a:
    a_stemmed.append(ps.stem(x))
print(a_stemmed)

The output is [‘lift’, ‘lift’, ‘lift’, ‘lifter’, ‘lift’]. ‘lift’, ‘lifts’, ‘lifting’, and ‘lifted’ have all been reduced to ‘lift’.

The POS tagger is used for part-of-speech tagging. The following line of code gives the list of POS tags.

nltk.help.upenn_tagset()
  • CC coordinating conjunction
  • CD cardinal digit
  • DT determiner, the, a, an
  • EX existential there (like: “there is” … think of it like “there exists”)
  • FW foreign word
  • IN preposition/subordinating conjunction
  • JJ adjective, ‘big’
  • JJR adjective, comparative ‘bigger’
  • JJS adjective, superlative ‘biggest’
  • LS list marker, 1)
  • MD modal could, will
  • NN noun, singular ‘desk’
  • NNS noun, plural ‘desks’
  • NNP proper noun, singular ‘Harrison’
  • NNPS proper noun, plural ‘Americans’
  • PDT predeterminer, ‘all the kids’
  • POS possessive ending, parent’s
  • PRP personal pronoun I, he, she
  • PRP$ possessive pronoun my, his, hers
  • RB adverb, very/silently
  • RBR adverb, comparative better
  • RBS adverb, superlative best
  • RP particle, give up
  • TO to, as in go ‘to’ the store
  • UH interjection, errrrrrrrm
  • VB verb, base form take
  • VBD verb, past tense took
  • VBG verb, gerund/present participle taking
  • VBN verb, past participle taken
  • VBP verb, sing. present, non-3rd person take
  • VBZ verb, 3rd person sing. present takes
  • WDT wh-determiner, which
  • WP wh-pronoun, who/what
  • WP$ possessive wh-pronoun, whose
  • WRB wh-adverb, where/when
# Example usage of POS Tagger
# Defining a sample text
sample_txt = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'

# Word tokenizing
words = nltk.word_tokenize(sample_txt)

# POS Tagging
tag = nltk.pos_tag(words)
tag
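
The output is a list of (word, tag) tuples; it should look something like [(‘European’, ‘JJ’), (‘authorities’, ‘NNS’), (‘fined’, ‘VBD’), (‘Google’, ‘NNP’), …], with each word paired with its Penn Treebank tag.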

Named Entity Recognition seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Named Entity Type and Examples
ORGANIZATION :- Companies, Agencies, Institutions. Google, Microsoft.
PERSON :- People including fictional. Mike, Jessica.
LOCATION :- Non-GPE Locations, mountain ranges, bodies of water. Mount Everest.
DATE :- Absolute or relative dates or periods. June, 2008–06–29.
TIME :- Times smaller than a day. three fifty a m, 2:30 p.m.
MONEY :- Monetary values including unit. 50 million Canadian Dollars, GBP 10.65
PERCENT :- Percentage including ‘%’. twenty five pct, 18.75 %.
FACILITY :- Buildings, airports, highways, bridges. Washington Monument, Stonehenge.
GPE :- Countries, cities, states. Bangalore, India.

# Example usage of Named Entity Recognition
# Defining a sample text
sample_txt = 'Amit Shah asserted the government will ensure that the non-Muslim refugees from Pakistan, Bangladesh and Afghanistan get Indian nationality and live in the country with honour.'

# Word tokenizing
words = nltk.word_tokenize(sample_txt)

# POS tagging
tagged = nltk.pos_tag(words)

# Named entity recognition
named_entity = nltk.ne_chunk(tagged, binary=False)
print(named_entity)
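
The output is a tree in which recognized entities appear as labelled subtrees; with this sentence you should see labels such as PERSON (Amit Shah) and GPE (Pakistan, Bangladesh, Afghanistan) attached to the relevant words.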

The main aim of Chunking is to group words into phrases as per a defined regular expression. Let us chunk using the regular expression given below:

<RB.?>*<VB.?>*<NNP>*<NN>?

RB = Adverb
VB = Verb
NNP = Singular Proper noun
NN = Singular Noun

+ = Match 1 or more
? = Match 0 or 1 repetitions
* = Match 0 or more repetitions
. = Any character except a new line

RB.? = Any form of adverb (RB, RBR, RBS)
VB.? = Any form of verb (VB, VBD, VBG, VBN, VBP, VBZ)

# Example usage of Chunking
from nltk.corpus import state_union

# Defining the training and sample text
train_text = state_union.raw('2005-GWBush.txt')
sample_txt = state_union.raw('2006-GWBush.txt')

# Training the tokenizer and tokenizing the sample text
custom_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_tokenizer.tokenize(sample_txt)

# Defining a function which chunks as per the regular expression
def process_content():
    for i in tokenized[7:9]:
        words = nltk.word_tokenize(i)
        tagged = nltk.pos_tag(words)
        chunkgram = r'''chunk: {<RB.?>*<VB.?>*<NNP>*<NN>?}'''
        chunkparser = nltk.RegexpParser(chunkgram)
        chunked = chunkparser.parse(tagged)
        print(chunked)

process_content()
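
To work with the chunks programmatically rather than only printing the whole tree, you can iterate over the labelled subtrees. A minimal sketch, to be placed inside ‘process_content’ after the ‘chunked = chunkparser.parse(tagged)’ line (‘chunk’ is the label given in ‘chunkgram’ above):

# Extracting only the 'chunk' subtrees from the parsed sentence
for subtree in chunked.subtrees(filter=lambda t: t.label() == 'chunk'):
    print(subtree)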

Chinking helps to remove unwanted words from chunks. The following regular expression is used for chinking:

{<.*>+}
}<VB.?|IN|DT|TO>+{

The main difference here is the ‘}{’ vs. the ‘{}’. ‘}{’ means we are removing (chinking) from the chunk one or more verbs, prepositions, determiners, or the word ‘to’.

# Example usage of Chinking
# Defining the training and sample text
train_text = state_union.raw('2005-GWBush.txt')
sample_txt = state_union.raw('2006-GWBush.txt')

# Training the tokenizer and tokenizing the sample text
custom_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_tokenizer.tokenize(sample_txt)

# Defining a function which chinks as per the regular expression
def process_content():
    for i in tokenized[0:3]:
        words = nltk.word_tokenize(i)
        tagged = nltk.pos_tag(words)
        chunkgram = r'''chunk: {<.*>+}
                        }<VB.?|IN|DT|TO>+{'''
        chunkparser = nltk.RegexpParser(chunkgram)
        chunked = chunkparser.parse(tagged)
        print(chunked)

process_content()

From the above output, it can be observed that none of the chunks contain verbs, prepositions, determiners, or the word ‘to’.

All the basics (word tokenization, sentence tokenization, stop words, stemming, POS tagging, named entity recognition, chunking and chinking) have been covered, and you can now start working on a Natural Language Processing project.

Happy Reading!
