Exploratory Data Analysis for Natural Language Processing

Isaac Cohen Sabban
Published in Voice Tech Podcast · May 22, 2020

As a data scientist for an insurance company, I found myself working on text data.

Text is unstructured data that can carry a lot of information, and statistical analysis allows us to extract it.

Text analysis allows companies to automatically extract and classify information from text.

Popular text analysis techniques include sentiment analysis, topic detection, and keyword extraction.

Today I’m going to show you three tools that I use every time to extract and classify information from text.

The Data

This dataset contains the sentiments of financial news headlines from the perspective of a retail investor. You can find the dataset on Kaggle.

The dataset contains two columns, “Sentiment” and “News Headline”. The sentiment can be negative, neutral or positive.

Before using these techniques, I advise you to clean the data (a quick sketch follows this list):

  • Remove punctuation.
  • Remove stop words.
  • Normalize (by applying a stemmer or a lemmatizer).
  • Set all characters to lowercase.
  • Remove numbers (or convert numbers to textual representations).
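As a rough illustration of these steps, here is a minimal cleaning sketch using NLTK and pandas. The “Sentiment” and “News Headline” columns match the dataset; the file name and the rest are my own illustrative assumptions, not the exact code from my repo.

```python
import re
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", " ", text)                                  # remove numbers
    # Drop stop words and lemmatize what remains
    tokens = [lemmatizer.lemmatize(tok) for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)

df = pd.read_csv("financial_news.csv")  # hypothetical file name
df["clean"] = df["News Headline"].apply(clean_text)
```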

What are the most common words?

Identifying the most common words allows us to discover useful information, informing conclusions and supporting decision-making.

Word Cloud

Word Cloud is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance.
It is often found on social networks and is widely used by digital marketers.
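As a quick sketch, here is how one word cloud per label can be generated with the wordcloud library. It assumes the cleaned DataFrame df from the snippet above; the sizing parameters are illustrative.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# One cloud per sentiment label; df["clean"] comes from the cleaning snippet above.
for label in df["Sentiment"].unique():
    text = " ".join(df.loc[df["Sentiment"] == label, "clean"])
    cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(label)
    plt.show()
```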

Word cloud for each label

Word clouds help us identify the most common words for each label. But as we can see here, “Company” and “Finnish” are very present in all classes, and we don’t really know in what proportion.


Text Frequencies

In this step, we look at the words that appear most often in our corpus.
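A simple way to get these counts is collections.Counter; this sketch again assumes the cleaned DataFrame df from the earlier snippet.

```python
from collections import Counter

# Flatten the cleaned headlines into one token list and count occurrences
all_words = " ".join(df["clean"]).split()
print(Counter(all_words).most_common(20))
```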

The 20 most common words

The words found here can be considered stop words: if they are present in each label in similar proportions, it is very likely that they do not provide information.

While looking at the most common words of the whole dataset is informative, zooming in on each label can guide us further.
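The same count can be broken down by label with a groupby; again a sketch on the assumed df from the cleaning snippet.

```python
from collections import Counter

# Most common words per sentiment label
for label, group in df.groupby("Sentiment"):
    words = " ".join(group["clean"]).split()
    print(label, Counter(words).most_common(10))
```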

Word Frequencies for each label

Are there significant words?

Even if the top 10 for each label looks similar, by looking at the words beyond it, we start to distinguish interesting words like “decreased” and “loss” for negative sentiment, or “rose” and “increased” for positive sentiment.

By removing the most common words of the corpus, we get a more informative word-frequency table.

Why and How to Use N-Grams?

An n-gram is a contiguous sequence of n items from a given sample of text or speech.

For example, take the first observation:

“Technopolis plans to develop in stages an area of no less than 100,000 square meters in order to host ..”

  • 1-grams: “Technopolis” “plans” “to” “develop” “in” …
  • 2-grams: “Technopolis-plans” “plans-to” “to-develop”

N-grams help us capture the context of the sentence.
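Here is a minimal sketch of per-label 2-gram counts using nltk.ngrams, joining tokens with “_” to match forms like “loss_eur” below; it again assumes the cleaned DataFrame df from the earlier snippets.

```python
from collections import Counter

from nltk import ngrams

# Count 2-grams per sentiment label
for label, group in df.groupby("Sentiment"):
    tokens = " ".join(group["clean"]).split()
    bigrams = ["_".join(pair) for pair in ngrams(tokens, 2)]
    print(label, Counter(bigrams).most_common(10))
```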

2-grams analysis for each label

The 2-grams give us information on combinations involving the most common words. Even if “eur” is present in all classes, in one class we have “loss_eur” and “decreased_eur”, which express the idea of a loss, a negative sentiment. In another class, we have 2-grams that express positive sentiment, like “rose_eur” or “profit_rose”.

Conclusion

Data analysis is an important step in understanding the data and the information it can provide.

By using word clouds, text frequencies, and n-grams, we can identify, before building a model:

  • words that add value
  • the most common words
  • the best word combinations

Link to my Github repo.
