A Brief Intro Into Natural Language Processing

Vikram Menon
Published in Voice Tech Podcast
7 min read · Dec 5, 2019

Artificial Intelligence is something that mystified me for years.

I had always heard the term, but never truly understood what it meant. I thought it was simply pre-programming answers into a computer. Boy, was I wrong. There is SO much that goes under the broad umbrella of the term “Artificial Intelligence”. One of the concepts under this umbrella is Natural Language Processing.

A Venn Diagram portraying how different concepts are related to one another. You can see NLP in red.

What is Natural Language Processing? Put simply, it is a branch of AI that helps computers understand and respond in human language. This includes both written text and the spoken word.

“Couldn’t you just input the entire dictionary into a computer and then it would be able to understand human language, right?”

A visual showing how unstructured data is not organized

Nope. One of the major reasons humans can understand each other is that they share context. Context is key to understanding language, and without it, computers have a hard time understanding language. In addition, most of the data in the world is unstructured, meaning that it does not follow a pre-formatted outline that a computer can easily recognize. This poses a challenge because the computer does not know where to start. However, steps can be taken to give a computer context and meaning, allowing it to understand language fully, and even respond.

Preliminary Steps: Breaking up sentences into words

Before we go ahead and look at individual words, let’s break up the paragraphs into sentences and words.

Sentence Segmentation: This is relatively simple: all an algorithm has to do is find a location where punctuation such as a period or question mark is followed by a capital letter, and break the string of words there.
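As a rough sketch of that rule in Python (real systems also handle tricky cases like “Dr.” or “U.S.”, which this toy regular expression does not):

```python
import re

def segment_sentences(text):
    """Split wherever ., !, or ? is followed by whitespace and a capital letter."""
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

print(segment_sentences("NLP is fun. It has many uses! Do you agree?"))
# → ['NLP is fun.', 'It has many uses!', 'Do you agree?']
```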

Word Tokenization: This is also quite simple: each sentence is broken up into individual words, or tokens. In English and many other languages this is easy, as the algorithm just has to start a new token every time a space occurs. The sentence “I ate pie.” would turn into: “I” “ate” “pie” “.”
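A minimal sketch of a tokenizer in Python, which grabs runs of word characters and treats each punctuation mark as its own token:

```python
import re

def tokenize(sentence):
    # \w+ matches a run of word characters; [^\w\s] matches a single
    # punctuation character, so "pie." becomes two tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("I ate pie."))
# → ['I', 'ate', 'pie', '.']
```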

Now that we have the preliminary steps out of the way, let’s move on to the steps where things get interesting!

First Step: Morphology

Morphology is the understanding of the structure of words and the relationships between them. Huh? In other words, it simply means understanding the purpose of a word in a sentence. This is difficult for computers to do without help, because the same word may serve a different purpose in one sentence than in another. For example, are the words blackbird and blackbirds the same word, or different words?

  • Morphological segmentation breaks up the word into morphemes, the basic building blocks of the word. Each morpheme is either an affix (such as a prefix or suffix) or the stem word itself. The computer can then analyze the stem word for its definition.
Illustration of a word and its Morphemes
  • Stemming is similar to morphological segmentation as it identifies the root word without the suffix and prefix. For example, the two sentences, “I was taking a ride in the car” and “I was riding in the car,” have the same meaning, but the root word “ride” is changed to “riding” in the second sentence.
  • Lemmatization is a similar process where a word’s lexeme, or base word is found. For example, the base word walk in walking, walked and walks.
  • Part of Speech Tagging is a process where the computer analyzes each word in a sentence to determine its part of speech and verb tense, for example run versus ran. This can be hard for a computer to do without data on which to base its guess about the meaning of the sentence. Text corpora can help with identifying the part of speech. A text corpus is a large collection of structured text which, in combination with self-learning algorithms, can reveal part-of-speech rules, such as: if the first word is a proper noun, the second is likely a verb.
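To make the stemming idea concrete, here is a toy suffix-stripping stemmer in Python. Real stemmers (for example, the Porter stemmer available in the NLTK library) use far more careful rules; this sketch only illustrates the idea:

```python
def naive_stem(word):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["walking", "walked", "walks", "ride"]:
    print(w, "->", naive_stem(w))
# → walking -> walk, walked -> walk, walks -> walk, ride -> ride
```

Note that this crude stemmer turns “riding” into “rid” rather than “ride”; recovering the true base word (the lexeme) is exactly what lemmatization adds on top of stemming.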


Second Step: Syntax

Syntax is the understanding of the structure of the whole sentence, and the arrangement of the words that make it a sentence. Understanding the order in which words appear is crucial, because a sentence with incorrect grammar can be impossible to interpret. Encoding the grammar of a specific language is also necessary to understand it, and since there are many languages, each with its own grammar, this is no small task. There are several ways in which a Natural Language Processing system can understand syntax.

  • Parse Trees/Shallow Parsing are a way for computers to break a sentence into smaller parts. The sentence is divided based on the part of speech (noun, verb, pronoun, etc.) of each word. Because the program already knows each word’s part of speech from the earlier morphology steps, it can create a visual representation of the sentence structure.
A parse tree for the sentence: “A group of kids playing in a yard and an old man is standing in the background”
  • Dependency Parsing breaks a sentence into words that relate to other words. Each word depends on another word in the sentence; for example, in the sentence “John is cool,” the word “cool” is an adjective that describes John, the noun.
A dependency parsed tree; you can see how it visually looks different than the shallow parsed tree
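To illustrate, the output of a dependency parser can be represented as a simple data structure mapping each word to the word it depends on. The structure and relation labels below are hand-written for illustration only; real parsers (such as spaCy) produce them automatically, and labeling conventions vary between schemes:

```python
# A hand-built dependency parse of "John is cool": each word maps to
# (head word, relation label); the root of the sentence has no head.
dependencies = {
    "John": ("is", "subject"),    # "John" is the subject of "is"
    "cool": ("is", "complement"), # "cool" describes the subject via "is"
    "is":   (None, "root"),       # "is" is the root of this sentence
}

for word, (head, relation) in dependencies.items():
    print(f"{word} --{relation}--> {head}")
```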

Third Step: Semantics

Semantics is concerned with the meaning of each individual word. Computers can have trouble distinguishing the meaning of some words, especially homonyms. Homonyms are words that are spelled the same but have different meanings. For example, the word “space” in the sentence “There is a lot of space in his backyard” has a different meaning than the one in the sentence “I have not gone to space.” It is hard for an algorithm to identify the correct definition in each specific scenario, but it has some tools to help it out.

  • Named Entity Recognition is the recognition of words that are known proper nouns and symbols, such as countries, companies, numbers, and times. Once these are identified, it can be easier to identify the meaning of the words near them.
The groupings of words can be seen in these visualizations of Word 2 Vec algorithms
  • Word2Vec is an algorithm created by Google which converts words into numeric vectors based on the frequency and position of the word. A neural network can more easily work with these numbers. Word2Vec can compare and group these vectors in vector space to make smarter guesses about the meaning of a word.
  • Word Sense Disambiguation is an algorithm that looks at the words around each word and uses those surrounding words as context to understand its meaning. Machine learning algorithms trained on text corpora are also used to improve accuracy.
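The key property of word vectors is that similar words end up close together in vector space, which is usually measured with cosine similarity. The tiny hand-written 3-dimensional vectors below are purely illustrative; real Word2Vec vectors have hundreds of dimensions and are learned from huge text corpora:

```python
import math

# Toy "word vectors" written by hand for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(vectors["king"], vectors["queen"]))  # high: related words
print(cosine(vectors["king"], vectors["apple"]))  # low: unrelated words
```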

Uses

Natural Language Processing has many uses because of its versatility. It is very useful because of its potential in data analysis and human-computer interactions. Here are a few examples of use cases for NLP.

  • Chatbots: Chatbots are not just a funny gimmick; they can be genuinely helpful for companies because they provide a virtual agent that handles preliminary customer support and passes the information on to human customer support personnel. They can also be used as personal assistants, helping people who may need support or encouragement.
  • Language Translation: This is one that most of us have probably used on those foreign language projects. Natural Language Processing is essential for language translation, as the translator has to understand the meaning of the sentence and then replicate that meaning in a new language. Recurrent Neural Networks are also used here to make guesses about translations from one language to the next. For example, after training on Russian-to-English and German-to-English data, an algorithm can make a guess at Russian-to-German.
  • Autocomplete: Search engines such as Google and Bing use autocomplete to improve user experience. Natural Language Processing is used in autocomplete services as the service has to understand what the user is typing in, and then suggest the rest of the search based on the meaning of what they have already typed.
  • Gathering Data: Gathering data from product reviews and social media can give companies an understanding of how customers and users use their products. It allows companies to use reviews that are unstructured because the NLP algorithm can understand the language.
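As a small taste of the autocomplete idea, here is a minimal prefix-matching completer over a sorted vocabulary. The vocabulary and ranking here are made up for illustration; real search engines also rank suggestions by popularity and by the meaning of the partial query, not just by prefix:

```python
import bisect

# A tiny, hand-picked vocabulary of past searches (illustrative only).
vocabulary = sorted(["nlp tutorial", "nlp uses", "neural network", "news today"])

def autocomplete(prefix, limit=3):
    """Return up to `limit` vocabulary entries starting with `prefix`."""
    # Binary search for the first entry >= prefix, then scan forward.
    start = bisect.bisect_left(vocabulary, prefix)
    results = []
    for term in vocabulary[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later entry can match either
        results.append(term)
        if len(results) == limit:
            break
    return results

print(autocomplete("nlp"))
# → ['nlp tutorial', 'nlp uses']
```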

Wrap Up

Natural Language Processing is a major part of AI that can revolutionize the way people interact with computers. If computers were able to understand human language with 100% accuracy, they could integrate into everyday life seamlessly. But there is still a ways to go. Currently, NLP algorithms reach accuracy levels in the 90-percent range on many tasks. Software engineers around the globe are working hard to get to a future with fully accurate NLP algorithms, and I wait patiently for that day.

Connect with me:

LinkedIn: https://www.linkedin.com/in/vikram-menon-986a67193

Email: vikrammenon03@gmail.com



Hi! I’m Vikram Menon! I’m a 20-year-old who is passionate about AI and drones. Follow me on LinkedIn at: https://www.linkedin.com/in/vikrammenon03