NLP for Anyone Who Builds Products (Part 1 of 2)

Masoud
Data Science and Machine Learning at Pluralsight
12 min read · Aug 13, 2020
Common NLP techniques — source here

A big shout-out to Levi Thatcher for reviewing this post and providing feedback.

TL;DR: Through non-technical explanations of NLP techniques, this post aims to help leaders, product managers, and anyone else who builds products discover NLP opportunities in their business problems and become conversational about NLP.

Introduction

The field of Natural Language Processing (NLP) is growing extraordinarily fast, and we use it every day without even noticing. If your job is planning and building products, there is a good chance you can make them smarter by leveraging NLP. I agree that there is a lot of jargon and technical detail that can make NLP conversations less interesting for leaders and product managers; however, I believe that solving business problems does not require deep knowledge of NLP. Think of it as a toolkit: you don't need to know how it works internally, only how to use it. Gaining some high-level knowledge of NLP use cases will give you the ability to look at your projects from different angles and think differently. It also helps you become conversational about NLP techniques, which makes it easier to discuss your ideas with the data scientists on your team.

It is our mission at Pluralsight to “create data-first thinking across the organization”. That is my main motivation for writing this post: we want everybody involved in building products, not only data scientists or machine learning engineers, to have a high-level knowledge of the techniques that can make products better. NLP is definitely one of them.

Below are five categories that I have created purely for the sake of organizing the topics. NLP use cases are highly interdependent, and it is hard to group them without overlaps. So you might disagree with these categories, and that is okay. As long as you can remember them while brainstorming, you are good to go 😃.

NOTE: In this post (Part 1 of 2) I will cover the first two columns of the table below. In Part 2, we will go through the rest of the topics in the table.

1. Classification and Clustering

Let’s start with two of the most common modeling approaches: classification and clustering. You can think of a model in the data science world as a black-box algorithm that takes input and returns results; these algorithms/models are trained to predict something based on their inputs. Classification and clustering are very common approaches in NLP. Clustering is a technique that groups similar things together, and in the NLP domain those things are pieces of text. So if you have a lot of text, like customer reviews, search queries, or tags, and you are trying to organize it, you can start with clustering. As you can see in this image, clustering over tags helps you identify groups of tags that are semantically similar to each other.
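As a rough illustration, here is a minimal sketch of clustering tags with scikit-learn. The tags and the cluster count are invented for the example, and TF-IDF over character n-grams only captures surface similarity; for truly semantic clusters you would swap in the embeddings discussed later in this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy tag list; real inputs would be your own tags, reviews, or queries.
tags = ["python", "pandas", "numpy", "javascript", "react",
        "vue", "sql", "postgresql", "mysql"]

# Represent each tag as a TF-IDF vector over character n-grams,
# which copes better with very short strings than word n-grams.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(tags)

# Group the tags into 3 clusters (a guess for this toy data).
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

for label, tag in sorted(zip(labels, tags)):
    print(label, tag)
```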

You could also do hierarchical clustering over tags, and that will help you create a taxonomy! In one of our projects at Pluralsight we used this method to create a simple taxonomy over our tags, and you can see in the image below that the model was able to nicely cluster some of the data science and NLP libraries. Manual work might be needed afterwards to make sure the taxonomy is valid, but hierarchical clustering will undoubtedly save you a lot of time.
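As a hedged sketch of the idea with SciPy: agglomerative clustering builds a tree (a dendrogram) that you can read as a draft taxonomy. It reuses the TF-IDF matrix X and the tags list from the previous snippet; any vector representation would work.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Build the cluster tree from the TF-IDF vectors of the previous snippet.
Z = linkage(X.toarray(), method="ward")

# The dendrogram is the draft taxonomy: cutting it at different heights
# yields coarser or finer groupings of tags.
dendrogram(Z, labels=tags)
plt.tight_layout()
plt.show()
```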

Classification, though, is different from clustering: in classification you define some buckets in advance and then ask the model to put things into those pre-defined buckets, whereas in clustering the main task of the model is defining the buckets. For instance, if you want to cluster a bunch of emails, you run a clustering algorithm over them (in practice algorithms cannot run over raw text, so the emails must first be converted to numbers, but for simplicity I will ignore that here) and the result is a set of clusters, each containing emails about a similar topic, e.g. travel, jobs, or conferences. In classification of emails, by contrast, you first define groups such as “spam” and “non-spam” and then ask the model to assign each email to one of those categories.
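To make the classification side concrete, here is a minimal sketch of a toy spam/non-spam classifier with scikit-learn. The four training emails are invented for the example; a real model needs far more labeled data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set: each email comes with a pre-defined label.
emails = ["Win a free prize now", "Meeting moved to 3pm",
          "Cheap pills, limited offer", "Quarterly report attached"]
labels = ["spam", "non-spam", "spam", "non-spam"]

# The pipeline converts text to numbers (TF-IDF) and fits a classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["Claim your free offer today"]))  # likely ["spam"]
```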

Sentiment analysis is a good NLP example of text classification. Sentiment analysis models classify a piece of text into predefined groups such as positive, negative, and neutral. You can run such a model over your customer reviews and get a sense of customer satisfaction.

Source here
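For a quick taste of sentiment analysis, here is a hedged sketch using NLTK's pretrained VADER analyzer, one of many off-the-shelf options; it needs a one-time download of the vader_lexicon resource.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the lexicon

analyzer = SentimentIntensityAnalyzer()

# "compound" is an overall score in [-1, 1]; its sign gives the sentiment.
for review in ["I love this course!", "The audio quality was terrible."]:
    scores = analyzer.polarity_scores(review)
    print(review, "->", scores["compound"])
```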

2. Information/Knowledge Extraction

Relational databases retain a lot of useful information, and we have specialized languages like SQL to query them quickly and accurately. Dealing with free text, however, can be a little bit challenging! We collect and store a lot of unstructured text because it contains useful information, but retrieving and discovering that information is usually not straightforward. In this section we are going to cover some common NLP techniques for extracting information from text.

When we think about extracting information from text, the first thing that comes to mind might be pulling out a sentence or a phrase written somewhere in a large corpus. To do that well, however, we must first talk about one of the most challenging parts of natural language, semantics, and dive into the benefits we can gain by teaching machines to understand the meaning of words and sentences.

Semantic text similarity is one of the first things you encounter when dealing with NLP problems. The assumption is that the word “cat” is semantically closer to “tiger” than to “snake”, and the question is how we can score that similarity. You can ask the same question about documents: is document A more similar to document B than to document C? Assigning a similarity score to two pieces of text (a single word, a sentence, a paragraph, or a whole document) might not seem interesting at first glance, but it can be a game changer if you want to build a product that involves text.

Think about a search engine. Aren't the search results simply the content most similar to your query? Of course there are many details in finding the most “relevant” content, because relevance is not based only on the semantic similarity between the query and the content but must account for many other features. However, if you are building a simple search engine to discover products, text similarity is a good start. This way you can retrieve the products that are semantically similar to the search query, and later you can improve the results by applying other techniques such as click models and word embeddings. We were able to significantly improve the relevance of Pluralsight Search results using these methods.
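Here is a minimal sketch of similarity-based retrieval, assuming the sentence-transformers library; the all-MiniLM-L6-v2 model and the product catalog are arbitrary choices for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy product catalog; a real index would hold many more items.
products = ["wireless noise-cancelling headphones",
            "mechanical keyboard with RGB lighting",
            "ergonomic office chair"]
product_vecs = model.encode(products, convert_to_tensor=True)

# Embed the query and rank products by cosine similarity.
query_vec = model.encode("headset for a noisy open office",
                         convert_to_tensor=True)
scores = util.cos_sim(query_vec, product_vecs)[0].tolist()

for product, score in sorted(zip(products, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {product}")
```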

In this post I’ll refrain from going into technical details; I only intend to highlight the important NLP use cases. So I won't discuss how we build models that retain the semantic information of text or how we calculate similarity, but it is important to know that many other NLP tasks benefit from semantic representations of text, and many of the techniques mentioned in this post depend on them. Also, once you have built this capability into your NLP portfolio, you can use it for many other products: recommending similar products to users, showing similar search queries, finding similar users, and so on.

The image below shows all of the courses in the Pluralsight library mapped into 2D space; courses that are similar to each other are neighbors in the plot. We extracted text from the course titles, tags, and descriptions and trained word embedding models that map text to numbers. Computers don't understand the meaning of words, but they can represent words as numbers in a multidimensional space and then calculate the similarity between them.

Pluralsight courses mapped to 2D space
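As a hedged sketch of that pipeline with gensim and scikit-learn: train word vectors on a toy corpus, then project them to 2D with PCA so that similar words land near each other. A plot like the one above is built from far more data and typically uses t-SNE or UMAP for the projection.

```python
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

# Toy corpus; real embeddings are trained on titles, tags, descriptions, etc.
sentences = [["python", "pandas", "data"], ["javascript", "react", "frontend"],
             ["python", "numpy", "data"], ["react", "vue", "frontend"]]

model = Word2Vec(sentences, vector_size=50, min_count=1, seed=42)

# Project the 50-dimensional word vectors down to 2D for plotting.
words = list(model.wv.index_to_key)
coords = PCA(n_components=2).fit_transform(model.wv[words])

for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```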

Now that we have talked about semantics, let's see what techniques are out there to extract information from text. Say you have a database full of customer reviews and you want to analyze them. Topic modeling is a common approach for discovering the various topics found in the reviews as a whole, and for distinguishing the main topics a specific review is about. Topic modeling is usually used in exploration phases and, similar to the clustering approach mentioned in section 1, helps us get a feel for our text corpus. The figure below shows a sample result of running topic modeling over customer reviews. The colors show different topics; based on the words in each topic, listed on the left side of the image, you can tell that the topics are “food”, “location”, and “service”. The percentages on the right side assign a score to each topic mentioned in each review.

source here
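Here is a minimal sketch of topic modeling with gensim's LDA implementation; the reviews and the topic count are invented for the example.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy reviews; real corpora have thousands of documents.
reviews = ["the food was delicious and the service friendly",
           "great location close to the station",
           "slow service but tasty food",
           "the location is noisy at night"]
tokenized = [review.split() for review in reviews]

# Build a vocabulary and a bag-of-words corpus, then fit LDA with 2 topics.
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=42)

# Each topic is a weighted word list; each review gets per-topic scores.
for topic_id, words in lda.print_topics():
    print(topic_id, words)
print(lda.get_document_topics(corpus[0]))
```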

Another technique for extracting information from text is Named Entity Recognition (NER). NER models try to find named entities (person, organization, location, etc.) within a text. You can see that three entities are highlighted by NER in the image below.

Extracting entities using Spacy NER
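The example above uses spaCy; a minimal sketch with its small pretrained English model (installed separately via python -m spacy download en_core_web_sm) looks like this:

```python
import spacy

# Load spaCy's small pretrained English pipeline.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is opening a new office in London next year.")

# Each entity carries its text span and a predicted type.
for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected output (roughly): Apple ORG, London GPE, next year DATE
```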

The difference between hash tables and NER is that hash tables are not smart, while NER models are. A hash table is a long list of key-value pairs: you can create a list of entities such as Apple-ORG or Cat-ANIMAL and then look them up in any given text. But we all know that the word “apple” also refers to a fruit, and labeling it as an organization every time leads to low accuracy. Because NER models are context sensitive, they can solve this type of issue by predicting the entities and their types. The predictive power comes from training the model on many samples and then asking it to label unseen data. There is no one-size-fits-all NER model, and if you want to identify entities beyond generic ones such as locations or organizations, you need to train a new NER model. Each organization has its own context, vocabulary, and entities. So if your plan is to extract context-specific entities from a big pile of text, you should probably start thinking about training your own NER model!

There are also other techniques if you want to dig deeper and extract more knowledge from text. Temporal processing is a technique for automatically inferring the date of an event from the surrounding text. For example, a temporal processing model can infer that Mary was born in 1983 from this text: “Her son John was born in 1980 and 3 years later she gave birth to Mary.” Temporal processing can be a useful addition to NER models if you are sensitive about the timing of events.

Now let’s switch gears towards knowledge graphs! Extracting knowledge from text is one way of creating knowledge graphs, which are getting very popular these days. The image below shows a sample knowledge graph consisting of a few triples like (Albert Einstein → born in → German Empire). Each triple has two nodes/entities connected by an edge.

source here

A good knowledge graph can enable your organization to discover new knowledge by applying graph algorithms to it. The very first benefit you can gain from a knowledge graph, however, is improving the discoverability of your content or products. No matter what your products are, it is highly likely that they are connected in some way, and that network is very valuable. A simple way to play with a powerful knowledge graph is to do a Google search! For example, if you search for Python (image below), a Google Knowledge Panel shows up on the right side of the screen providing detailed information about the Python language. Python is an entity/node in Google's knowledge graph, with attributes that include its description and other information. There are also other entities connected to the Python node, shown in this image as “Python books” and “People also searched for”.

If you are convinced that your organization can benefit from a knowledge graph, then we can talk about some ways to create one. The simplest way to extract triples is using regular expressions (regex). A regular expression is a sequence of characters that helps us extract certain patterns from text. For example, the image below shows a very simple regex for extracting roll numbers from sentences.

source here
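As a hedged sketch of the pattern-based approach, here is a toy regex that pulls (subject, relation, object) triples from sentences of the form “X was born in Y”; real rule sets are far larger and still miss edge cases.

```python
import re

text = ("Albert Einstein was born in Ulm. "
        "Marie Curie was born in Warsaw.")

# A deliberately simple pattern: "<Name> was born in <Place>".
pattern = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in ([A-Z]\w+)")

for subject, place in pattern.findall(text):
    print((subject, "born in", place))
# ('Albert Einstein', 'born in', 'Ulm')
# ('Marie Curie', 'born in', 'Warsaw')
```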

More elaborate regex patterns can be useful for extracting triples for knowledge graphs; however, because of the intrinsic complexity of human language, it is extremely hard to write regex rules for all of the edge cases. A solution is to take advantage of models that learn to handle those edge cases by seeing many similar samples during training. Open Information Extraction (OpenIE) is a more sophisticated approach that extracts triples and ranks the extractions by their estimated quality. It is therefore capable of extracting information even when the text does not follow a pattern you could have written as a regex rule.
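A hedged sketch using Stanford's OpenIE through the stanza CoreNLP client; this assumes a local CoreNLP installation (e.g. via stanza.install_corenlp()), and the exact setup varies by version.

```python
from stanza.server import CoreNLPClient

text = "Albert Einstein was born in the German Empire."

# Start a local CoreNLP server with the OpenIE annotator enabled.
with CoreNLPClient(annotators=["openie"], be_quiet=True) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for triple in sentence.openieTriple:
            print((triple.subject, triple.relation, triple.object))
```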

Once you have created your knowledge graph by extracting entities and relations from text, you might end up with a lot of duplicated entities that need normalization. For instance, “python 3” and “python 3.x” might mean the same thing in our context, but keeping them as two separate entities with no relation between them can lower the accuracy of the knowledge graph. Named Entity Linking (NEL) solves this problem by finding the relationship between such entities and linking them together. The difference between NER (explained above) and NEL is that NER only identifies entities and their types, such as “Python 3” with type “programming language”; it cannot tell you that “python 3” is the same entity as “python 3.x”. This is where NEL models come into the game.
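Real NEL models link mentions against a knowledge base; as a deliberately simplified stand-in, here is a toy normalizer that maps surface forms to canonical entities with fuzzy string matching. The canonical list is invented, and difflib comes with the Python standard library.

```python
from difflib import get_close_matches

# Invented canonical entities; a real system links against a knowledge base.
canonical = ["python 3", "javascript", "postgresql"]

def link(mention):
    """Map a raw mention to its closest canonical entity, if any."""
    matches = get_close_matches(mention.lower(), canonical, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(link("Python 3.x"))   # -> "python 3"
print(link("Postgres QL"))  # -> "postgresql"
```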

A field similar to entity linking is coreference resolution, the task of grouping mentions in a text that refer to the same entity. In the example below, “Nader” and “he” refer to the same person, and “I”, “my”, and “she” all refer to another person.

source here

And finally, my favorite technique: Semantic Role Labeling (SRL), which is used to answer the question “who did what to whom, when, and where?” for a given text.

source here

Finding the answers to these questions in a text can also be useful for building a knowledge graph, but SRL is usually used to accomplish more complicated NLP tasks such as machine translation, text summarization, and question answering (I am going to talk about these tasks in Part 2). Even if you don't want to build models like summarizers, SRL can still be useful if you find a good use case for it. One example I used SRL for is extracting pieces of information from Pluralsight “learning objectives” that follow a pattern like “Do X with Y to accomplish Z”. Even though SRL is not designed for exactly this pattern, it is very helpful because it extracts the key parts of the sentence for further analysis. You can find these patterns in course curricula and instructional plans. As shown in the image below, an SRL model can extract the useful information from the text.

source here
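A hedged sketch with AllenNLP's pretrained SRL model; the model URL below is illustrative and changes between releases, so check the allennlp-models documentation for the current one.

```python
from allennlp.predictors.predictor import Predictor

# Illustrative model path; see allennlp-models for the current release.
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/"
    "structured-prediction-srl-bert.2020.12.15.tar.gz"
)

result = predictor.predict(
    sentence="Build a dashboard with Python to monitor sales."
)

# Each verb gets a frame: who did what to whom, when, and where.
for frame in result["verbs"]:
    print(frame["description"])
```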

Conclusion
That concludes Part 1 of NLP for Anyone Who Builds Products. In Part 2, I will go over the rest of the topics (user intention, text generation, and text improvement) mentioned in the table at the beginning of this post.

My hope is that these posts help you approach your business problems differently. If you enjoyed the post and want to learn more, here is an amazing repository that Sebastian Ruder created to track the progress of NLP research.

I’m a Principal Data Scientist at Pluralsight, where we’re democratizing tech skills.