Discovering the Value of Text: An Introduction to NLP

Idil Ismiguzel
Published in Analytics Vidhya
5 min read · Apr 6, 2020

When we talk about data, it is common practice to imagine continuous features that describe quantities or categorical features that contain items from fixed lists. There is, however, a third kind of feature: text, which is generated in many applications and lets us extract valuable information.

Text data is generated not only in written forms such as books, news, tweets, messages, comments, customer reviews, and chats with chatbots, but also in spoken forms such as conversations with humans or with machines like virtual assistants. All these channels constantly generate large amounts of text data, which organizations can systematically process on a large scale.

Text data is represented as strings and is made up of characters that form words, sentences, and paragraphs. However, it is highly complex and unstructured, not only because it can occur in both written and spoken forms, but because human language itself is extremely complex and diverse. For example, the meaning of words changes according to the combinations and sequences in which they appear. Moreover, there are hundreds of natural languages, each with its own grammar and syntax rules, terms, slang, and dialects. And then there are emojis. 😅

Luckily, technology is rapidly advancing, and there is growing interest in human-to-machine communication. Natural Language Processing (NLP) is a branch of artificial intelligence, linguistics, and computer science that deals with interactions between machines and humans in natural language. In other words, NLP makes it possible for machines to read text or hear speech, analyze and interpret it, understand the important parts, and even detect the opinions and emotions behind it.

Basic NLP tasks:

With NLP we can break unstructured text down into shorter, more structured pieces of information. We start with sentence segmentation to split the text into sentences. Once the sentences are separated, we can apply word tokenization to break them into word tokens. Even at this stage, we can start getting some ideas about the text by analyzing sentence lengths and the most frequent words.

An example of a tokenization process from https://spacy.io/
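The two steps above can be sketched in a few lines of plain Python. This is a deliberately naive regex-based version just to illustrate the idea; libraries such as spaCy or NLTK handle abbreviations, contractions, and other edge cases far more robustly.

```python
import re

def segment_sentences(text):
    # Naive rule: split after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Keep word characters together; treat punctuation marks as their own tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "NLP is fascinating. It lets machines read text!"
sentences = segment_sentences(text)
print(sentences)                 # ['NLP is fascinating.', 'It lets machines read text!']
print(tokenize(sentences[0]))    # ['NLP', 'is', 'fascinating', '.']
```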

We can apply Part-of-Speech (POS) tagging to deepen our analysis by understanding the function of each word in a sentence. POS tagging, also called grammatical tagging or word-category disambiguation, tags each word with a particular part of speech, such as noun, pronoun, verb, adverb, adjective, conjunction, preposition, or interjection. This step is extremely useful for analyzing the linguistic signal and the syntax and semantics of how a word is used within the scope of a sentence or document.

Example of Parts of Speech Tagging
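To make the input/output shape of the task concrete, here is a toy lookup-based tagger. Real taggers (in spaCy or NLTK, for example) use statistical or neural models that take context into account; the tiny lexicon below is entirely made up for illustration.

```python
# Toy lexicon mapping words to coarse POS tags (illustrative only).
LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN", "mat": "NOUN",
    "sat": "VERB", "runs": "VERB",
    "on": "ADP", "quickly": "ADV",
}

def pos_tag(tokens):
    # Fall back to NOUN for unknown words, a common naive baseline.
    return [(tok, LEXICON.get(tok.lower(), "NOUN")) for tok in tokens]

print(pos_tag(["The", "cat", "sat", "on", "the", "mat"]))
# [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'),
#  ('on', 'ADP'), ('the', 'DET'), ('mat', 'NOUN')]
```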

Similarly, we can apply Named Entity Recognition (NER) to extract information that marks and locates named entities into predefined categories such as person names, organizations, locations, quantities, monetary values, time expressions, etc.

Example of Named Entity Recognition
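A minimal way to illustrate NER is a gazetteer lookup: match known entity phrases against the token stream. Production systems use learned models instead; the entity lists here are invented for the example.

```python
# Toy gazetteer mapping entity phrases to categories (illustrative only).
GAZETTEER = {
    ("new", "york"): "LOCATION",
    ("google",): "ORGANIZATION",
    ("ada", "lovelace"): "PERSON",
}

def find_entities(tokens):
    entities, i = [], 0
    lowered = [t.lower() for t in tokens]
    while i < len(tokens):
        match = None
        for phrase, label in GAZETTEER.items():
            if tuple(lowered[i:i + len(phrase)]) == phrase:
                match = (" ".join(tokens[i:i + len(phrase)]), label)
                i += len(phrase)
                break
        if match:
            entities.append(match)
        else:
            i += 1
    return entities

print(find_entities(["Ada", "Lovelace", "visited", "New", "York"]))
# [('Ada Lovelace', 'PERSON'), ('New York', 'LOCATION')]
```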

Another very useful step is the removal of stopwords. Stopwords are the most common words in a language, such as determiners (e.g. “the”, “a”, “an”), coordinating conjunctions (e.g. “for”, “but”), and prepositions (e.g. “in”, “towards”). Even though these words are the most common, they do not add much to the meaning of a sentence. Therefore, it is important to filter them out while preparing data for modeling. In Python this can be done with the predefined stopword lists of NLP libraries such as NLTK, Gensim, and spaCy, or with your own custom list.
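Using a custom list, as suggested above, is a one-liner. The tiny list below is illustrative only; the lists shipped with NLTK or spaCy are far more complete.

```python
# A small custom stopword list (illustrative; real lists are much longer).
STOPWORDS = {"the", "a", "an", "for", "but", "in", "towards", "is", "on"}

def remove_stopwords(tokens):
    # Case-insensitive filtering against the stopword set.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "cat", "sat", "on", "the", "mat"]))
# ['cat', 'sat', 'mat']
```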

The next step in preparing data for further processing is text normalization, known as stemming and lemmatization. The goal of both is to reduce the inflectional forms of a word to a common base form. With stemming we reduce words to their stems or root forms, even if the stem is not a dictionary word; with lemmatization we reduce words to their lemmas, or dictionary forms.

Studies -> Stemming-> Studi

Studies -> Lemmatization-> Study
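To reproduce the “studies” example above, here is a toy suffix-stripping stemmer and a dictionary-based lemmatizer. Real implementations (e.g. the Porter stemmer and WordNet lemmatizer in NLTK) use far richer rule sets; the suffix rules and lemma entries below are simplified assumptions.

```python
SUFFIXES = ["ies", "ing", "ed", "s"]          # checked in this order
LEMMAS = {"studies": "study", "ran": "run"}   # illustrative dictionary entries

def stem(word):
    # Crude suffix stripping: "ies" -> "i" mirrors Porter-style behavior.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + ("i" if suffix == "ies" else "")
    return word

def lemmatize(word):
    # Dictionary lookup; unknown words are returned unchanged.
    return LEMMAS.get(word, word)

print(stem("studies"))       # studi
print(lemmatize("studies"))  # study
```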

After performing these steps, we can apply higher-level NLP capabilities to retrieve deeper insights. These capabilities include:

Topic discovery and modeling: As an unsupervised machine learning technique, topic modeling can scan sets of documents, detect patterns between words and phrases, discover the different topics they cover, and group the documents by those topics. It not only helps uncover hidden topics in large collections of text but also unlocks further analyses such as optimization and forecasting. It is also widely used in recommender systems, which find similarities between topics and recommend the closest ones.

Topic classification: As a supervised machine learning technique, topic classification, unlike topic modeling, requires a predefined list of topics. After training a model on labeled texts, we can assign topics to unseen text based on similarities in content. It can be used to measure and improve customer satisfaction, support efficiency, sales conversion, retention, and more.
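A minimal supervised sketch: train on texts labeled with predefined topics, then predict the topic of unseen text. This assumes scikit-learn is installed; the tiny support-ticket dataset and its labels are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training examples with predefined topic labels.
texts = [
    "my package has not arrived yet",
    "tracking number shows no delivery update",
    "i was charged twice on my credit card",
    "the invoice amount is wrong",
]
labels = ["shipping", "shipping", "billing", "billing"]

# TF-IDF features feeding a Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["my package delivery has no tracking update"]))
# ['shipping']
```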

Contextual extraction and text summarization: We can automatically extract structured information from text sources and generate extractive and abstractive summaries.
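The extractive idea can be sketched in pure Python: score sentences by the frequency of the words they contain and keep the highest-scoring ones. Real summarizers (TextRank, or abstractive neural models) are far more sophisticated; this is only an illustration of the concept.

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    # Split into sentences and count word frequencies across the whole text.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))
    # Score each sentence by the total frequency of its words; keep the top n.
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
        reverse=True,
    )
    return scored[:n_sentences]

text = ("NLP extracts value from text. Text data is everywhere. "
        "Machines can now process text data at scale.")
print(summarize(text))
```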

Machine translation: Automatically converting text in one natural language into another while preserving the meaning.

Sentiment analysis: Also known as opinion mining, sentiment analysis identifies the emotional tone and subjective opinions behind text. It is widely used to help organizations gather insights from customer reviews, social media channels, forums, or comment forms.
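At its simplest, sentiment analysis can be a lexicon lookup: count positive versus negative words. Real tools (e.g. VADER in NLTK) use weighted lexicons and handle negation and intensifiers; the word lists below are illustrative only.

```python
# Toy sentiment lexicons (illustrative; real lexicons are weighted and large).
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "sad"}

def sentiment(text):
    tokens = text.lower().split()
    # Net score: positive word count minus negative word count.
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great product"))  # positive
print(sentiment("terrible service, I hate it"))  # negative
```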

I hope this short introduction made you curious about the world of NLP.

In my next post, I will talk about the required preprocessing steps in NLP in more detail 😊

Stay safe!
