It’s The Same Hamburger!!

Sofiene Azabou · Published in Analytics Vidhya · Jan 2, 2020 · 6 min read

NATURAL LANGUAGE PROCESSING (PART II)

The following is part of a series of articles on NLP. (Check Part I & Part III)

As we saw in the previous article, NLP provides interesting capabilities that are changing many industries today. It’s cool that a computer can do that much, but how does it manage to get there? Oh yeah, you got it, we’re about to dive into some serious s.. stuff!

NLP Framework

We are going to build, step by step, a Natural Language Processing framework, and by the end of this “tutorial”, you will be able to build your own NLP model. Let’s get started!


First of all, let’s look at this piece of text. It’s a quote from Bill Gates and it’s one of my favorites. It would be awesome if my computer could read this quote, and especially could “understand” it, wouldn’t it? To get there, we need to apply a couple of steps.

Bill Gates — Microsoft Founder & Chairman

Data Pre-Processing

Data Pre-Processing is considered the most annoying part of the job because it is technically unattractive and relatively laborious, but it is still important. There’s a famous saying among data scientists: “Garbage in, garbage out.” That means if you feed your Machine Learning model dirty data, it will throw it right back in your face (sorry 😊). In other words, it will give you meaningless results. Fair enough, right? That’s why this part of the work should be done rigorously.
When dealing with structured data, pre-processing usually involves removing duplicates, NULL values and errors. When it comes to text data, there are many common pre-processing techniques, also known as text cleaning techniques.
In order to apply these pre-processing techniques, we’re going to use a very powerful Python library: NLTK (Natural Language Toolkit). NLTK provides a suite of text processing libraries for classification, tokenization, stemming, tagging, etc. Hang in there, we’re about to see all these features together in a couple of minutes. Stay tuned!
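If you want to code along, here is a minimal setup sketch (assuming a standard Python environment; the article itself doesn’t show the install step). The resources downloaded are the ones the later snippets rely on.

#Install NLTK once from the command line: pip install nltk
import nltk
#One-time downloads of the resources used throughout this article
nltk.download('punkt')                        #sentence & word tokenizers
nltk.download('stopwords')                    #English stop word list
nltk.download('averaged_perceptron_tagger')   #PoS tagger
nltk.download('wordnet')                      #lemmatizer dictionary
nltk.download('maxent_ne_chunker')            #named entity chunker
nltk.download('words')                        #word list used by the chunker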

• Sentence Segmentation

Basically, it’s the act of splitting our text into separate sentences. In our case, we’ll end up with this:

1. “I can understand wanting to have millions of dollars, there’s a certain freedom, meaningful freedom, that comes with that.”
2. “But once you get much beyond that, I have to tell you, it’s the same hamburger.”
3. “Bill Gates — Chairman & Founder of Microsoft”

We can assume in this case that each sentence represents a separate idea. As a result, it’ll be a lot easier to develop an algorithm that understands a single sentence than one that understands the whole paragraph.

• Tokenization

Now that we’ve split our text into sentences, let’s do even better and break it down into words, or more precisely, “tokens”.
For example, let’s start with the first sentence from our quote:

“I can understand wanting to have millions of dollars, there’s a certain freedom, meaningful freedom, that comes with that.”

After applying tokenization, it will end up as follows:

“I”, “can”, “understand”, “wanting”, “to”, “have”, “millions”, “of”, “dollars”, “,”, “there’s”, ”a”, “certain”, “freedom”, “,”, “meaningful”, “freedom”, “,”, “that”, “comes”, “with”, “that”, “.”

text = '''I can understand wanting to have millions of dollars, there’s a certain freedom, meaningful freedom, that comes with that. But once you get much beyond that, I have to tell you, it’s the same hamburger. Bill Gates — Chairman & Founder of Microsoft'''

#Import NLTK library
import nltk
#Segmentation
nltk.tokenize.sent_tokenize(text)
#Tokenization
nltk.tokenize.word_tokenize(text)

• Text Stripping

If you’re thinking what I’m thinking, then you’re wrong.. but we’re still going to take a few things off.
Make text lowercase: It’s a sort of normalization checkpoint to reduce the number of distinct characters we’re dealing with.
Expand contractions: Informal English is full of contractions that should be replaced, again in an attempt to normalize our text as much as we can.
For example, in our quote, “there’s” will be replaced by “there is”.

I found the cList used below on Stack Overflow.

###Make text lowercase & Expand contractions
#Load English contracted/expanded words list from a .py file
from contractions import cList
#Compile a regular expression pattern for matching
import re
c_re = re.compile('(%s)' % '|'.join(cList.keys()))

#Create a function to look for contractions and replace them with their full form
#Put text in lowercase to make sure all words are included
def expandContractions(text, c_re=c_re):
    def replace(match):
        return cList[match.group(0)]
    return c_re.sub(replace, text.lower())

expanded_text = expandContractions(text)

P.S. Notice the result is a bit grammatically incorrect, but it doesn’t matter since we’re going to remove the stopwords later 😉

Remove punctuation: Punctuation marks are unwanted characters, so let’s take them off.

###Remove punctuation
#Import string library
import string
#Create a function to remove punctuation / special characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
def clean_text(text):
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    return text

Spelling Correction: The idea is simple; we are going to use a big corpus as a reference to correct the spelling of words within our text.
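The article doesn’t include code for this step. As a minimal sketch, one common option (my choice here, not necessarily the author’s) is TextBlob’s correct() method, which relies on word frequencies from a large corpus:

###Spelling correction (sketch using TextBlob, a substitute for the author's own approach)
#pip install textblob
from textblob import TextBlob
#correct() replaces likely misspellings with the most probable word
misspelled = "I can understannd wantting to have millions of dollars"
corrected = str(TextBlob(misspelled).correct())
#expected output: "I can understand wanting to have millions of dollars"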
Remove stop words: Stop words are overused words that carry no significant additional information about the message a text is conveying. Most common stop words are determiners (e.g. the, a, an), prepositions (e.g. above, across, before) and some adjectives (e.g. good, nice). Let’s kick them out!

I can understand wanting to have millions of dollars, there is a certain freedom, meaningful freedom, that comes with that.
But once you get much beyond that, I have to tell you, it is the same hamburger.
Bill Gates Chairman & Founder of Microsoft

###Remove stopwords
#nltk.download('stopwords')
from nltk.corpus import stopwords
#Create a function to remove stopwords
def remove_stopwords(sentence=None):
    words = sentence.split()
    stopwords_list = stopwords.words("english")
    clean_words = []
    for word in words:
        if word not in stopwords_list:
            clean_words.append(word)
    return ' '.join(clean_words)
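The snippets further down refer to a variable called clean_text_sw that the article never defines. Presumably it is the quote after expanding contractions, removing punctuation and removing stop words, which would look roughly like this (my assumption, based on the variable name):

#Chain the previous steps to build the variable the later snippets use
clean_text_sw = remove_stopwords(clean_text(expandContractions(text)))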

Part of Speech filtering: The purpose is to identify each word’s lexical category by giving it a tag: Verb, Adjective, Noun, Adverb, Pronoun, Preposition…

###Part of Speech Tagger
#nltk.download('averaged_perceptron_tagger')
import nltk
from nltk import pos_tag, word_tokenize

#Create a function to pull out nouns & adjectives from text
def nouns_adj(text):
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(nouns_adj)

#Return a list of (word, PoS) tuples
tokens = word_tokenize(clean_text_sw)
tuple_list = nltk.pos_tag(tokens)
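The nouns_adj function above is defined but never called in the article; as a quick usage sketch on the cleaned quote:

#Keep only the nouns and adjectives from the cleaned quote
nouns_adj_text = nouns_adj(clean_text_sw)
#words like "millions", "dollars", "certain", "freedom" and "hamburger" would typically survive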

Text Lemmatization: In most languages, words can appear in different forms. It would be interesting to replace each word with its base form, so that our computer could understand that different sentences may be talking about the same concept. So, let’s LLAMAtize our quote!

In our case, “I can understand wanting to have millions of dollars”

becomes “I can understand [want] to have [million] of [dollar]”

###Lemmatization
#nltk.download('wordnet')
#Lemmatize text with the appropriate POS tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

#Create a function to map NLTK's POS tags to the format the WordNet lemmatizer accepts
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

#Create an instance of the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

#Create a function to return text after lemmatization
def lemmatize_text(text):
    lemm_text = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(text)]
    return ' '.join(lemm_text)
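A quick call on the cleaned quote, using the clean_text_sw variable sketched earlier:

#Lemmatize the cleaned quote
lemmatized_text = lemmatize_text(clean_text_sw)
#e.g. "wanting" -> "want", "millions" -> "million", "dollars" -> "dollar"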

✓ Named Entity Recognition: It is a process where an algorithm takes a string of text (a sentence or a paragraph) as input and identifies the relevant nouns (people, places, organizations, etc.) mentioned in that string.

###Named Entity Recognition
#nltk.download('maxent_ne_chunker')
#nltk.download('words')
from nltk import ne_chunk

#Create a function to tokenize and PoS-tag your text
def NER(text):
    text = nltk.word_tokenize(text)
    text = nltk.pos_tag(text)
    return text

text_NER = NER(text)
#Chunk the tagged tokens into a tree of labelled named entities
pos_list = ne_chunk(text_NER)
Check it out!
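To actually pull the entities out of the chunk tree, here is a short sketch using the standard nltk.Tree API (not shown in the original article):

#Named entity chunks appear as subtrees with a label (PERSON, ORGANIZATION, GPE, ...)
for chunk in pos_list:
    if hasattr(chunk, 'label'):
        print(chunk.label(), ' '.join(token for token, pos in chunk.leaves()))
#expected to pick up e.g. PERSON "Bill Gates" and ORGANIZATION "Microsoft"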
