Building An NLP Project From Zero To Hero (1): Project Overview

Khaled Adrani
Published in UBIAI NLP
7 min read · Dec 7, 2021
A machine reading an e-book! Image by Brian J. Matis on Flickr — machines are beginning to understand human language

Artificial Intelligence (AI) today is changing many aspects of our lives, from recommendation systems and chatbots to smart features like face recognition, automatic video annotation, and improved translation. It extends even to fields such as network security, finance, and healthcare, and to the understanding of human language, also known as Natural Language Processing (NLP).

NLP is indeed pushing businesses to a new level of prosperity: it enables them to scale their operations efficiently and improve the quality of their products and services, delivering a more personalized experience to their customers. According to the IBM Global AI Adoption Index 2021, almost half of the respondents mention that their company currently uses NLP, and one quarter plan to use it in the next 12 months. The report also notes that an essential reason for the increase in AI adoption is that it has become more accessible over the years, especially in the last two to three.

While AI, and NLP in particular, is being adopted by large companies at a fast rate, medium and small businesses face challenges implementing NLP. According to the same report, a lack of skills or training to develop and manage trustworthy AI is one of the biggest barriers. We believe there are still other barriers that can prevent many companies and organizations from embracing this magnificent tool. One of them is the lack of educational content for those who want to get started in the field. For this reason, we have decided to make a series showcasing all the steps of a fully-fledged NLP project.

This article is the start of that series. We will first go through a gentle refresher on the core concepts of NLP, and then explain the plan of the project. So let us dive in!

A Refresher on Natural Language Processing (NLP):

NLP lies at the intersection of three crucial fields

Simply put, NLP is a set of computational techniques that allow machines to understand and manipulate human languages.

But how is that possible? Machines only understand numbers, specifically the binary system (0 and 1). In recent years, a lot of progress has been made in designing text representations that computers can actually work with.

Let us say that we have a corpus (a dataset) of textual documents. First, we build a vocabulary from the corpus by collecting all of its unique words. Many techniques exist to process and clean the text if we wish to refine the vocabulary.

Now that we have a dictionary containing all the words of our vocabulary, we can transform, or encode, our documents into vectors: a more fitting mathematical representation that ML models can comprehend. A very simple technique is one-hot encoding, where each word's vector has a 1 at the index corresponding to the word's position in the dictionary, and the rest of the vector is filled with zeros.

An example of one hot encoding text
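The idea above can be sketched in a few lines of plain Python. This is a minimal illustration with a made-up toy corpus, not a production encoder:

```python
# Toy corpus (illustrative; any collection of documents works the same way).
corpus = ["the cat sat", "the dog barked"]

# Build the vocabulary: every unique word, mapped to a fixed index.
vocab = sorted({word for doc in corpus for word in doc.split()})
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(vocab)           # ['barked', 'cat', 'dog', 'sat', 'the']
print(one_hot("cat"))  # [0, 1, 0, 0, 0]
```

Each word's vector is as long as the whole vocabulary, with exactly one nonzero entry — which is precisely where the memory problem discussed below comes from.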

Next, we convert every document into a set of vectors that serves as the input for our model to predict a label y (say, a sentiment or topic classification task, for the sake of simplicity). The simplest option is to concatenate the one-hot vectors of the words contained within the document.
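Concatenating the per-word vectors can be sketched as follows (a hedged toy example; the five-word vocabulary is made up for illustration):

```python
# A fixed toy vocabulary (illustrative only).
vocab = ["barked", "cat", "dog", "sat", "the"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

def document_vector(document):
    """Concatenate one one-hot vector per word: length = |vocab| * n_words."""
    vector = []
    for word in document.split():
        vector.extend(one_hot(word))
    return vector

vec = document_vector("the cat sat")
print(len(vec))  # 15  (5 vocabulary words * 3 document words)
```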

As you may have noticed, this approach is not effective, essentially because of the huge amount of memory needed to store all the vectors for all the documents.

Imagine you have a vocabulary of 10,000 words, and some sentences or documents that stretch to 100 words. Each document vector will then be of length 10,000 * 100 = 1,000,000 values. Most importantly, 99.99% of these values are just 0, which brings nothing useful to the model. Furthermore, this representation oversimplifies the complexity of language, which requires more attention to the meaning and context of words.
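The arithmetic above is easy to verify:

```python
# Back-of-the-envelope check of the numbers above.
vocab_size = 10_000
doc_length = 100  # words per document

total_values = vocab_size * doc_length  # concatenated one-hot vectors
nonzero = doc_length                    # exactly one 1 per word
sparsity = (total_values - nonzero) / total_values

print(total_values)       # 1000000
print(f"{sparsity:.2%}")  # 99.99%
```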

Of course, there exist much better text representation techniques, like Term Frequency-Inverse Document Frequency (TF-IDF) and word embeddings. We will not delve into them for now, as they will only become necessary in a future article, when we design and train our model.

All you should retain from this section is that in NLP we need to encode our text into vectors, a mathematically structured form of data. I believe this is what lies at the core of NLP; the rest is either details or lies at the intersection with other fields like Machine Learning and Linguistics. To process your text, you will need to understand linguistic concepts like stopwords, part-of-speech tags, and tokenization. And to train your model, you will need statistical models like support vector machines (SVM) and neural networks (NN).
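To make the linguistic side concrete, here is a deliberately simplified preprocessing sketch. Real projects would use a library such as NLTK or spaCy for tokenization, stopword lists, and part-of-speech tagging; the tiny stopword set below is hand-picked for illustration:

```python
# A tiny, illustrative stopword list (real lists have hundreds of entries).
STOPWORDS = {"the", "a", "an", "is", "of", "to"}

def tokenize(text):
    """Lowercase and split on whitespace; real tokenizers also handle punctuation."""
    return text.lower().split()

def remove_stopwords(tokens):
    """Drop words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("The model learns the structure of language")
print(remove_stopwords(tokens))  # ['model', 'learns', 'structure', 'language']
```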

Why Is NLP Important?

“Well, NLP is cool and stuff, but how can we leverage it to improve our businesses more efficiently? How does it differ from more traditional techniques?”

As we said before, NLP allows machines to effectively understand and manipulate human languages. With it, you can automate many tasks, such as data labeling, translation, customer feedback analysis, and text analysis, and improve their speed and scale. Applying NLP to real-world cases, and not just for research purposes, will bring a significant competitive advantage to many businesses.

Consider an interesting story from an article by HealthCatalyst. In 2005, Indiana University Health (IU Health) implemented a machine learning early-warning system to identify unusual trends in the emergency department (ED). At some point, it detected an abnormal number of patients with the same specific symptoms (including dizziness, confusion, and nausea). At first, the existing data did not show anything unusual, unlike the early-warning system. Later, it was revealed that these individuals lived in the same apartment complex and that their heater was malfunctioning, causing them to get sick from carbon monoxide.

This ability to analyze massive amounts of data, specifically unstructured data, is a game-changer. From our little story, we can see how the model was capable of pointing its developers in the right direction in their analysis of the problem at hand. It did not exactly provide the full answer, but it helped them pinpoint this ‘black swan’ hidden in plain sight, as the existing data did not include anything about this phenomenon.

Another fascinating story is that of Kasisto. Founded in 2015, the company created a chatbot called KAI that helps banking and financial organizations develop their own chatbots, which in turn help customers access services and manage their finances. These chatbots, of course, are made using NLP.

For example, a bank can feed KAI data containing transaction records and account details in order to train a model for customer support. By learning over a moderate amount of time, and with enough data, the chatbot becomes able to answer questions and fulfill requests in the chat interface. You can ask it simple questions, like what your largest transaction so far is, or ask for a recommendation for a certain need, and it will share the links you require. It can also redirect customers to human service agents when needed.

NLP has also entered the legal domain: companies like Ross Intelligence, which uses IBM Watson, have developed natural language query interfaces so that you can ask questions as if a lawyer were there to answer them.

Most popular uses of NLP

Now, these are but a few stories among many. I hope you can see the reasons why one should seriously think about adopting NLP. So now, let us take an overview of what we will be learning in this series!

Project Overview

Say you have a collection of documents, in PDF, XML, or even plain text, and you want to analyze them thoroughly. For example, you want to detect all the entities present within the entire corpus. You can decide to train a Named Entity Recognition (NER) model. You can annotate your text manually or use a text annotation tool. The annotated documents are then fed to the NER model, so that it can finally perform the desired analysis.

For this series, we will be training a custom NER model for stock news analysis. We will also give special care to the data labeling part. Data labeling, or data annotation, is critically important in Machine Learning: garbage in, garbage out.
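To give a taste of what labeled NER data can look like, here is a hedged sketch using character-offset annotations in the style popularized by spaCy. The sentence and the entity span are made up for illustration:

```python
# One labeled training example: (text, {"entities": [(start, end, label), ...]}).
# Offsets are character positions into the text; "ORG" marks an organization.
example = (
    "Tesla stock jumped after the earnings report",
    {"entities": [(0, 5, "ORG")]},
)

text, annotations = example
for start, end, label in annotations["entities"]:
    print(text[start:end], "->", label)  # Tesla -> ORG
```

Getting these spans exactly right, at scale, is what the data labeling articles of this series will be about.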

Here is the outline of the series:

  • Project Overview
  • Data Collection
  • Data Preprocessing
  • Data Labeling
  • Model Training
  • Model Deployment
  • Model Monitoring
  • Text Mining

Each part of this series will have its own dedicated article. We will try to preserve the gentle tone and not complicate things more than they need to be.

Conclusion

This series is aimed mainly at those who know at least some bits of NLP but are struggling to get to the next level. We will also try to make it friendly for non-technical folks, especially those who want to leverage its power for their businesses. UBIAI, a company that specializes in data annotation and custom NLP models, will share some of its tips across the series. Feel free to contact us at admin@ubiai.tools or on Twitter.

Stay tuned and see you in the next article!
