Rise above archaic tech with the use of ML, NLP, and BOTS! Part 2 — Natural Language Processing

A gentle introduction to natural language processing, its tasks, and which tech you need to learn to get started.

Andreas Rein
BDO Digital Labs
5 min read · Jul 19, 2017


Our languages are so diverse

Part 1 of this series can be found here.

What is NLP?

“A way for computers to analyze, understand, and derive meaning from human language in a smart and useful way”

NLP, or natural language processing, is about getting computers to understand and work with natural language, the language humans use to communicate with each other. Use cases for NLP range from automatic summarization, translation, speech recognition, and topic segmentation to bots and much more.

Today, NLP has come a long way, but it’s far from perfect in most languages. It’s generally extremely hard to teach a computer to understand how we speak, since human language is rarely precise or plainly spoken.

Human language is rarely precise, or plainly spoken

Languages are also shaped by experience, and most have a lot of dialects, which makes teaching a computer how to “speak” strikingly difficult. Words usually have several meanings depending on the context of the sentence, for example “You were right” vs. “Make a right turn at the light”. This makes the process more convoluted than a simple dictionary lookup.

History

It all started in 1964 with Eliza. Eliza was an early attempt at a program that could mimic human understanding of language. It used pattern matching extensively.

No more visits to the therapist when you have Eliza

It was purely a script, so it was incapable of learning new patterns of speech or new words. You can still try the JavaScript version in your own browser to see why some people in the ’60s thought they were talking with a real human.

NLP Tasks

Today NLP is a mix of several tasks.

Tokenization

By tokenization, we mean breaking text down into words, sentences, phrases, symbols, or other tokens.

Tokenizing text in Jupyter Notebook
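The original screenshot isn’t reproduced here, but a minimal sketch of the same idea with NLTK might look like this (it assumes the punkt tokenizer data has been downloaded):

```python
import nltk
nltk.download("punkt")  # tokenizer models, only needed once

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Our languages are diverse. NLP breaks text into tokens."

# Break the text into sentences, and into individual word/punctuation tokens
print(sent_tokenize(text))  # ['Our languages are diverse.', 'NLP breaks text into tokens.']
print(word_tokenize(text))  # ['Our', 'languages', 'are', 'diverse', '.', 'NLP', ...]
```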

Stop word removal

Stop words are extremely common words that add little to no value to the meaning of a sentence or phrase.

Removing stop words in Jupyter Notebook
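As a rough sketch of what that notebook cell does, NLTK ships an English stop word list you can filter against (assuming the stopwords and punkt data are downloaded):

```python
import nltk
nltk.download("stopwords")  # stop word lists, only needed once
nltk.download("punkt")

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "This is an example showing off stop word filtration."
stop_words = set(stopwords.words("english"))

# Drop tokens that appear in the English stop word list
filtered = [w for w in word_tokenize(text) if w.lower() not in stop_words]
print(filtered)  # ['example', 'showing', 'stop', 'word', 'filtration', '.']
```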

N-Grams

An n-gram is a contiguous sequence of n co-occurring words; the task is about identifying unigrams, bigrams, trigrams, etcetera. For example, the sentence “the cow jumps over the moon” contains 5 bigrams: (the cow, cow jumps, jumps over, over the, the moon). N-grams are used for spelling correction, word breaking, and text summarization.

Identifying bigrams in Jupyter Notebook
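A minimal sketch of bigram extraction with NLTK’s ngrams helper, using the sentence from above:

```python
from nltk import ngrams

sentence = "the cow jumps over the moon"
tokens = sentence.split()

# n=2 gives bigrams; use n=1 for unigrams or n=3 for trigrams
bigrams = list(ngrams(tokens, 2))
print(bigrams)
# [('the', 'cow'), ('cow', 'jumps'), ('jumps', 'over'), ('over', 'the'), ('the', 'moon')]
```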

Word sense disambiguation

This means identifying the meaning of a word based on its context. For example, the word ‘bass’ has a lot of different meanings (a fish, a low voice, an instrument), and which one applies depends on the surrounding words.

Word sense disambiguation in Jupyter Notebook
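The screenshot isn’t reproduced here; one simple approach available in NLTK is the Lesk algorithm, which picks the WordNet sense whose dictionary definition overlaps most with the surrounding words. A sketch (it assumes the wordnet corpus has been downloaded, and the chosen sense won’t always match what a human would pick):

```python
import nltk
nltk.download("wordnet")  # WordNet senses, needed by the Lesk algorithm

from nltk.wsd import lesk

context = "I went fishing and caught a huge bass in the lake".split()
sense = lesk(context, "bass")

# Print the chosen WordNet synset and its dictionary definition
print(sense)
print(sense.definition())
```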

Parts of speech tagging (POS tagging)

POS tagging is about finding out whether a word is a noun, verb, adjective, etcetera. This is a very simplified explanation, because it takes more than a list of words and their parts of speech: many words in our language are ambiguous and can only be tagged correctly by looking at their context.
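A short sketch with NLTK’s default tagger illustrates the ambiguity: “refuse” and “permit” are tagged as verbs the first time they appear and as nouns the second time, purely from context (this assumes the tagger model and punkt data are downloaded):

```python
import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")  # the default English POS tagger model

from nltk import pos_tag, word_tokenize

tokens = word_tokenize("They refuse to permit us to obtain the refuse permit")
print(pos_tag(tokens))
# [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ...,
#  ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
```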

Stemming

With stemming, we extract only the root of a word, reducing it to its base form. This means the forms of fish (fishing, fished, fisher) will all be reduced to fish. Google Search uses word stemming to show more relevant search results.

Stemming and POS-Tagging in Jupyter Notebook
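The screenshot isn’t shown here, but a minimal stemming sketch with NLTK’s Porter stemmer looks something like this:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The Porter algorithm strips common suffixes to approximate the root form
for word in ["fishing", "fished", "fishes"]:
    print(word, "->", stemmer.stem(word))
# fishing -> fish, fished -> fish, fishes -> fish
```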

In the real world

Natural language processing is all around us. Facebook uses it to track trending topics and popular hashtags. Online communities, newspapers, and social media sites use it to block offensive comments and automate part of the moderator’s job. IT security also leans on it when monitoring for malicious digital attacks.

#learningnaturallanguageprocessing

One of the most well-known uses of NLP is Microsoft Word’s spell checker. Another helpful use case is checking the tone of a written message, which can be used to curb online bullying or those annoying cyberspace trolls.

Language and frameworks

Python

When it comes to natural language processing, Python is a clear winner. The leading framework for NLP in Python is the Natural Language Toolkit, or NLTK for short. It can do some awesome things, from tokenizing text to part-of-speech tagging. All the examples you saw earlier were done using NLTK. Together with Jupyter Notebook, you can achieve truly godlike results.

Java

Java is an alternative if you need the performance, but it only pays off if you are working with a massive amount of unstructured text.

Others

There are NLP libraries in most of the popular programming languages, so it shouldn’t be a problem testing it out in your favorite language.

What can you do with Natural Language Processing?

NLP has a lot of use cases. You can use it in your bots or as a spam filter. With NLP you can extract the core meaning of a sentence, something you can feed into statistics or machine learning algorithms. You could generate keyword tags or summarize a block of text. Only your imagination sets the limit on what you can do with NLP.

Where can you learn more about Natural Language Processing?

Pluralsight has a great course to get you started with NLP using the Python programming language. Coursera also has an introduction to natural language processing by the University of Michigan. Kaggle is a great platform to learn about data science in general, including NLP.

This was part 2 of a series about machine learning, natural language processing and bots. Hit that love button if you enjoyed this article, and remember to follow BDO Digital if you want to read more stories like this.

Rise above archaic tech with the use of ML, NLP, and BOTS! Part 1 — Machine Learning

Rise above archaic tech with the use of ML, NLP, and BOTS! Part 3 — BOTS
