Natural Language Processing — Topic Modelling (including Latent Dirichlet Allocation, or LDA) and Sentiment Analysis

Vivek Sasikumar
5 min read · Mar 1, 2019

--

I have been working on NLP for some time now and thought I would put together a list of the tools and libraries I normally use for purposes such as topic modelling, sentiment analysis, and named entity recognition. When dealing with a lot of text, we as data scientists need to find order in the chaotic mix and find patterns that tell a story. Topic modelling and sentiment analysis help make this more achievable.

For the purposes of this article, I downloaded the text of ‘Harry Potter and the Sorcerer’s Stone’ and am using it to showcase these tools.

The downloaded text is preprocessed by removing common English stopwords, stripping punctuation, and lemmatizing. We use lemmatization rather than stemming because we want the resulting lemmas to be valid words for correlation.

  1. Stemming is the process of reducing inflected words to their root forms, mapping a group of words to the same stem even if the stem itself is not a valid word in the language.
  2. Lemmatization reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical, dictionary, or citation form of a set of words.
[Figure: input data]
[Figure: output data after preprocessing]
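The preprocessing steps above can be sketched as follows. Note that the stopword set and lemma map here are tiny illustrative stand-ins — in practice you would use nltk's `stopwords` corpus and `WordNetLemmatizer` (or spaCy's lemmatizer):

```python
import re

# Illustrative subsets only -- in practice use nltk.corpus.stopwords
# and nltk.stem.WordNetLemmatizer for full coverage.
STOPWORDS = {"the", "and", "of", "to", "a", "were", "was", "that", "they", "in"}
LEMMAS = {"owls": "owl", "wizards": "wizard", "said": "say"}

def preprocess(text):
    # Lowercase, keep only word characters (drops punctuation),
    # remove stopwords, then map each token to its lemma.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS]

print(preprocess("The owls were hunting, and the wizards said nothing."))
```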

Topic Modelling

1. Latent Dirichlet Allocation topic modelling

In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document.

The LDA model is highly modular and can be extended easily. It can be used effectively to quantify the relationship between topics in a document.

[Figure: corpora]
[Figure: 20 topic words]
[Figure: scoring a random example against the topics]
[Figure: interactive LDA result plot]

Playing around with the interactive plot is the best way to understand the salience and relevance of the topics.

2. spaCy topic modelling

A topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modelling is a frequently used text-mining tool for discovering hidden semantic structures in a text body.

I prefer to use spaCy for tagging, parsing and entity recognition.

Other than LDA and spaCy, one can also use the nltk library to tokenize and tag for topic modelling, though I find that approach more manual. Check out this link for more information.

Sentiment Analysis

I am using two libraries for sentiment analysis: TextBlob and VADER.

1. TextBlob

Polarity is a value in the range [-1.0, 1.0] and indicates whether the sentence is negative or positive; values near zero are neutral. Subjectivity ranges over [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

TextBlob is versatile and can be used for part-of-speech tagging, tokenization, lemmatization, translation, spelling correction, word dictionary and corpora creation, parsing, and n-gram splitting for machine learning uses.

2. VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) comes with nltk and is another good tool for sentiment analysis.

VADER takes capitalisation and exclamation marks into account, which really adds value when analysing the sentiment of online feedback, Twitter comments, etc. The scores can be used to create features for machine learning prediction models.

I use VADER for most of my sentiment analysis needs.

These are some of the tools that can be used for NLP. The full code can be obtained here.

To turn text into structured features, we can use a plethora of feature-engineering tools such as CountVectorizer, TF-IDF Vectorizer, n-grams (for sequences), and cosine and linear similarity to build NLP machine learning models. Check this link for an example. Recurrent neural networks (RNNs) can also be used on text data to capture sequence relationships.
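A quick sketch of TF-IDF features over unigrams and bigrams, with cosine similarity between documents (the sentences are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "harry casts a spell",
    "harry casts a charm",
    "the quidditch match begins",
]

# TF-IDF matrix over unigrams and bigrams.
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# Pairwise cosine similarity: the first two documents share words,
# so they score higher against each other than against the third.
sim = cosine_similarity(X)
print(sim.round(2))
```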
