A Brief Introduction to Natural Language Processing

Christos Chen
Published in The Startup
8 min read · May 17, 2020

What is “Natural” Language?

Ah… to be human!

Natural language describes the way humans communicate with each other, whether through text or speech. Human language is highly ambiguous and constantly evolving, yet we are remarkably adept at comprehending and producing language with complex and nuanced meanings.

However, just as not all intelligent individuals are capable of teaching what they know, humans generally cannot articulate the rules that govern language. This is what makes the field of natural language processing so difficult.

What is Natural Language Processing (NLP)?

Natural language processing is a form of artificial intelligence that lies at the intersection of the fields of computer science, information engineering, and linguistics.

Simply put, natural language processing is the ability of computers to decipher the human language.

Source: ThinkPalm

This sounds too high-tech… why should I care?

Natural language processing has risen to prominence in everyday life in recent years. Nowadays, it's hard to miss, as it sits at the core of many computer and electronic capabilities. To name a few:

  • Machine Translation: NLP has been able to break many communicative barriers, ranging from language to disability (i.e. Google Translate, converting sign language to text, etc.)
  • Smart Assistants: Siri, Alexa, Google Home, etc. all rely on NLP to understand our requests and carry out tasks for us.
  • Chatbots & Automated Phone Answering: Improving customer relations & driving revenues of businesses.
  • Spell Check: Microsoft Word utilizes NLP to analyze potential grammatical errors and spell check words.
Source: TechTunnel.com

As you can see, NLP has a wide range of applications. It has been key in sales and marketing, where sentiment analysis of user-customer interactions and behaviors yields valuable insight (e.g. 70% of the reaction to our product announcement was positive → continue the product rollout!).

NLP has been extremely useful in many business contexts including market intelligence, advertising, and hiring & recruitment. It has found use cases within industries ranging from banking, where it has streamlined credit assessments, to healthcare, where it has improved clinical documentation, clinical trial matching, and risk adjustment.

Source: Prescient & Strategic Intelligence

The Basics & The Process

Source: Xenon Stack

Basic NLP involves a few core processes; understanding the concepts will help in a moment when we actually start performing them!

  • Tokenization: Breaking up a body of text into smaller portions, or tokens.

Example: “Hi. I am Joe” → “Hi”, “.”, “I”, “am”, “Joe”

  • Stemming: Reducing words to their root word stems. We need to break down/simplify words to ensure they are interpreted the same.

Example: Although they have different endings, "laughing" and "laughed" are just different tenses of the same word. Both would be reduced to "laugh".

  • Lemmatization: Resolving words to their dictionary form, or lemma. This is similar to stemming in intent, but more calculated and complex: because a word's lemma depends on how it is used, the computer often needs to identify its part of speech first.

Example: "are", "is", and "am" all resolve to "be". Although they appear totally different, they are forms of the same dictionary word.

Sound tough? Thankfully, Python already has a library, the Natural Language Toolkit (NLTK), which can execute these tasks for us!

Source: Data Flair

We will be exploring 7 very basic forms of NLP below:

1. Tokenization

2. Removing stop words

3. Stemming

4. Lemmatization

5. Word Frequencies

6. Part of Speech Identification

7. Named Entity Recognition

Let’s Get Started!

First, make sure that you have Python and the NLTK library installed on your computer. These can all be downloaded online!

import nltk
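Note: NLTK's tokenizers, taggers, and corpora ship as separate data packages, so if you hit a LookupError when running the examples below, run these one-time downloads first (a sketch; some newer NLTK versions ask for "_tab" variants of these resource names, such as 'punkt_tab'):

```python
import nltk

# One-time downloads of the data packages the examples below rely on.
nltk.download('punkt')                       # sentence/word tokenizer models
nltk.download('stopwords')                   # stop word lists
nltk.download('wordnet')                     # lemmatizer dictionary
nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger
nltk.download('maxent_ne_chunker')           # named entity chunker
nltk.download('words')                       # word list used by the chunker
```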

Before diving into basic NLP, we should first ensure that all text is in the same case; otherwise, case-sensitive operations will treat "LOVE" and "love" as different tokens. Below, we create a text variable that we will use in the examples to come, and convert it all to lower case:

text = "I really LOVE their new product called the iPhone. Everything is so usable and quality. My sons really love the iPhone as well. They think it is SO fun to play with in their free time !"
text = text.lower()

OUTPUT: 'i really love their new product called the iphone. everything is so usable and quality. my sons really love the iphone as well. they think it is so fun to play with in their free time !'

Easy!

1. Tokenization

Now we can tokenize, or break apart, the text. We can tokenize either by word, or by sentences.

By Sentence, using sent_tokenize():

sentence_tokenization = nltk.sent_tokenize(text)

OUTPUT: ['i really love their new product called the iphone.', 'everything is so usable and quality.', 'my sons really love the iphone as well.', 'they think it is so fun to play with in their free time !']

By Word, using word_tokenize():

word_tokenization = nltk.word_tokenize(text)

OUTPUT: ['i', 'really', 'love', 'their', 'new', 'product', 'called', 'the', 'iphone', '.', 'everything', 'is', 'so', 'usable', 'and', 'quality', '.', 'my', 'sons', 'really', 'love', 'the', 'iphone', 'as', 'well', '.', 'they', 'think', 'it', 'is', 'so', 'fun', 'to', 'play', 'with', 'in', 'their', 'free', 'time', '!']

2. Removing “Stop” Words

This removes redundant words that don’t provide useful information, such as “and”, “with”, “but”, “a”, “the”, etc.

First, import stopwords:

from nltk.corpus import stopwords

Next, we will remove any stop words within the review, using the corpus's words() function (passing 'english' so we only filter against the English stop word list):

no_stop_words = [word for word in word_tokenization if word not in stopwords.words('english')]
#This removes every word token in the review that appears in NLTK's stored list of stop words.

OUTPUT: ['really', 'love', 'new', 'product', 'called', 'iphone', '.', 'everything', 'usable', 'quality', '.', 'sons', 'really', 'love', 'iphone', 'well', '.', 'think', 'fun', 'play', 'free', 'time', '!']

As you can see, even though words were removed, you can still get the premise of the review — that’s what we want!

3. Stemming

We will now attempt to isolate the roots of the words in the review. To do this, we import SnowballStemmer.

from nltk.stem import SnowballStemmer

Now, we’ll create a SnowballStemmer object, specifying the language used:

snowballStemmer = SnowballStemmer('english')

And we’ll stem all the words, which we previously tokenized by word!

stem_words = [snowballStemmer.stem(word) for word in word_tokenization]

OUTPUT: ['i', 'realli', 'love', 'their', 'new', 'product', 'call', 'the', 'iphon', '.', 'everyth', 'is', 'so', 'usabl', 'and', 'qualiti', '.', 'my', 'son', 'realli', 'love', 'the', 'iphon', 'as', 'well', '.', 'they', 'think', 'it', 'is', 'so', 'fun', 'to', 'play', 'with', 'in', 'their', 'free', 'time', '!']

Oof… what is "realli"? As you can see above, stemming can go wrong in two ways: understemming, where related words keep different stems, and, as in our case, overstemming, where words are cut down too aggressively and produce non-words like "realli" and "iphon".
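A quick sketch of both failure modes with the same stemmer: "university" and "universe" are unrelated words that collapse to one stem, while "datum" and "data" are forms of the same word that keep different stems.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

# Overstemming: unrelated words are cut down to the same stem.
print(stemmer.stem('university'))  # 'univers'
print(stemmer.stem('universe'))    # 'univers'

# Understemming: forms of the same word keep different stems.
print(stemmer.stem('datum'))  # 'datum'
print(stemmer.stem('data'))   # 'data'
```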

But since we’re simply covering the basics, we’ll move on to lemmatization.

4. Lemmatization

We will now resolve words to their dictionary forms, with the help of WordNetLemmatizer:

from nltk.stem import WordNetLemmatizer

Now, we’ll create a WordNetLemmatizer object and use its lemmatize() function:

wordnet_lemmatizer = WordNetLemmatizer()
lemma_words = [wordnet_lemmatizer.lemmatize(word) for word in word_tokenization]

OUTPUT: ['i', 'really', 'love', 'their', 'new', 'product', 'called', 'the', 'iphone', '.', 'everything', 'is', 'so', 'usable', 'and', 'quality', '.', 'my', 'son', 'really', 'love', 'the', 'iphone', 'a', 'well', '.', 'they', 'think', 'it', 'is', 'so', 'fun', 'to', 'play', 'with', 'in', 'their', 'free', 'time', '!']

Great. What else could we do that could be helpful?

5. Finding Word Frequencies

This can be particularly helpful when trying to determine the topic and general themes of the review.

We’ll need to import FreqDist first:

from nltk import FreqDist

Now, let’s feed in our review, having removed the stop words earlier:

word_frequencies = FreqDist(no_stop_words)

OUTPUT: FreqDist({'!': 1, '.': 3, 'called': 1, 'everything': 1, 'free': 1, 'fun': 1, 'iphone': 2, 'love': 2, 'new': 1, 'play': 1, 'product': 1, 'quality': 1, 'really': 2, 'sons': 1, 'think': 1, 'time': 1, 'usable': 1, 'well': 1})

Ignoring the punctuation, we see that the most frequent words were "iphone", "really", and "love", from which we can infer that this was a very positive review of the iPhone!
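On a review this small we can read the counts off directly, but for larger texts FreqDist also inherits the most_common() method from Python's Counter, which ranks tokens by frequency. A small sketch:

```python
from nltk import FreqDist

# A toy token list; most_common(n) returns the n highest counts first.
tokens = ['love', 'iphone', 'love', 'really', 'love', 'iphone']
freq = FreqDist(tokens)
print(freq.most_common(2))  # [('love', 3), ('iphone', 2)]
```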

6. Identifying Part of Speech

NLP can also enable us to identify the part of speech of each word in the text we are analyzing!

First, we must tokenize the text — which we learned earlier!

tokenized_word = nltk.word_tokenize(text)

We will utilize nltk’s pos_tag() function to generate the part of speech tags now:

part_of_speech = nltk.pos_tag(tokenized_word)

OUTPUT (shown here for the tweet we analyze in section 7 below): [('Does', 'VBZ'), ('anything', 'NN'), ('think', 'VB'), ('global', 'JJ'), ('warming', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('thing', 'NN'), ('?', '.'), ('I', 'PRP'), ('love', 'VBP'), ('Lady', 'JJ'), ('Gaga', 'NNP'), ('.', '.'), ('I', 'PRP'), ('think', 'VBP'), ('she', 'PRP'), ('’', 'VBZ'), ('s', 'PRP'), ('a', 'DT'), ('really', 'RB'), ('interesting', 'JJ'), ('artist', 'NN'), ('.', '.')]

VBZ?? NN? JJ?!?!

... What do these mean?? They are Penn Treebank part-of-speech tags: for example, NN is a singular noun, VBZ is a third-person singular present verb, JJ is an adjective, PRP is a personal pronoun, and DT is a determiner. Here's a table that explains the tags:

Source: PythonSpot

So far, the NLP we've done has been fairly basic. But NLP can do some pretty awesome things. Take a look at a slightly tougher form of NLP:

7. Named Entity Recognition (NER)

NER locates and classifies named entities in text into categories such as organizations, locations, names, times, etc.

Let’s do a real-life example in which we utilize NER on this chaotic tweet:

We will utilize web scraping to extract the tweet — Check out how to do the basics of web scraping here!

After scraping, we are left with the extracted tweet:

text = "Does anyone think global warming is a good thing? I love Lady Gaga. I think she's a really interesting artist."

To do this… we will need to first tokenize & then identify part of speech as we have done before!

tokenized_word = nltk.word_tokenize(text)
part_of_speech = nltk.pos_tag(tokenized_word)

Now, we will “chunk” the part of speech breakdowns we just generated, using ne_chunk(). Chunking will enable us to recognize named entities using a classifier, which adds category labels such as person, organization, and location.

chunk = nltk.ne_chunk(part_of_speech)

Finally, we walk through the chunked output and pull out the named entities that NLTK's classifiers identified within the text:

ner_words = [''.join(word for word, x in pos) for pos in chunk if isinstance(pos, nltk.Tree)]

OUTPUT: ['LadyGaga']

Woah. That was cool.

What Now?

Source: Dell Technologies

NLP is far from perfect. The intricacies and complexities of human language are immense, and there is still much work to be done in furthering our ability to formally understand language mechanisms, e.g. sarcasm.

So prepare yourself… mom may continue yelling at Alexa at 8 A.M. for… quite some time to say the very least.

Now that you’ve learned some basics and caught a glimpse of the power of NLP, you are ready to learn more complex methods and applications. Go off and have fun with it!


An individual who aspires to utilize data to tell relevant and important narratives within the numbers to catalyze meaningful change in the world.