COVID Tweet Analysis- Part 2

Finding latent topics using Topic Modelling

Pooja Mahajan
Analytics Vidhya
5 min readOct 2, 2020


In part 1, we explored Twitter data related to COVID. In this post, we will use Topic Modelling to learn more about the underlying key ideas that people are tweeting about.

Let’s first understand what Topic Modelling is!

Topic Modelling

Topic Modelling is an unsupervised technique that helps find the underlying topics, also termed latent topics, present in a large collection of documents.

In the real world, we observe a lot of unlabelled text data in the form of comments, reviews, complaints, etc. In these scenarios, Topic Modelling serves as a quick way to find the underlying topics being discussed and to aid the process of labelling.

Techniques used for Topic Modelling

  • Latent Dirichlet Allocation (LDA)

It works under the assumption that documents with similar topics use similar words. It treats each document as a probability distribution over latent topics, and each topic as a probability distribution over words.

In a nutshell, LDA represents documents as a mixture of topics, and those topics are related to words with certain probabilities.

  • Non-Negative Matrix Factorisation

It is a mathematical technique in which a matrix is factorised into two matrices, with the property that all three matrices have no negative elements (hence the name!).

We can create a data matrix X with dimensions n*p, which can be approximated as the product of two matrices: A (basis vectors) with dimensions n*k and B (coefficient matrix) with dimensions k*p. Here n is the number of documents, p is the number of tokens and k is the number of topics. This decomposition thus corresponds to relating documents to topics and topics to tokens, and hence can be used as a way to do topic modelling.
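The shapes involved can be sketched in a few lines with sklearn's NMF on a toy non-negative matrix (purely illustrative data, not the tweet corpus):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy data matrix X: n=6 documents, p=8 tokens (non-negative values)
rng = np.random.default_rng(0)
X = rng.random((6, 8))

k = 2  # number of latent topics
model = NMF(n_components=k, init="random", random_state=0, max_iter=500)
A = model.fit_transform(X)   # basis vectors: documents x topics, shape (6, 2)
B = model.components_        # coefficient matrix: topics x tokens, shape (2, 8)

print(A.shape, B.shape)      # (6, 2) (2, 8)
```

The product A @ B then gives a non-negative approximation of X, which is exactly the documents-to-topics and topics-to-tokens decomposition described above.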

Both techniques use the concept of relating documents to topics and topics to tokens (or words). While LDA models these relations as probability distributions to find the words that relate most to each topic and the documents that relate most to each topic, NMF uses the coefficient values (obtained during matrix factorisation) of the words per topic for interpretation.

Implementation in Python on COVID Twitter data

So, let's pick up where we left off last time and apply topic modelling to the Twitter data. We already did the data cleansing in the last blog, so I will be proceeding from that point.

Creating the document term matrix

We use sklearn.feature_extraction.text.TfidfVectorizer to create the document term matrix, applying fit_transform on the processed tweets. I have set a few parameters: max_df, min_df, ngram_range and stop_words. As output, you get dtm as a sparse matrix of 41122*43359, corresponding to the number of tweets and the number of tokens respectively, i.e. the matrix X with dimensions n*p in terms of the definition stated above.
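A minimal sketch of this step on a few toy tweets (the parameter values below are illustrative, not the exact ones used for the real corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A few toy tweets standing in for the cleaned corpus
tweets = [
    "grocery store shelves empty again",
    "panic buying toilet paper everywhere",
    "oil prices keep falling during lockdown",
]

# max_df / min_df / ngram_range / stop_words as mentioned above;
# these particular values are assumptions for the sketch
tfidf = TfidfVectorizer(max_df=0.95, min_df=1,
                        ngram_range=(1, 2), stop_words="english")
dtm = tfidf.fit_transform(tweets)   # sparse document term matrix, shape (n, p)

print(dtm.shape)
```

On the real data this is where the 41122*43359 sparse matrix comes from.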

Creating NMF Model

I tried both LDA and NMF; since NMF was performing better, I will take you through the NMF results.

We create an NMF object using sklearn.decomposition.NMF with n_components set to 5 (although I tried other values for n_components like 3, 6 and 7 too), followed by fitting and transforming the document term matrix obtained from the TF-IDF step.

Let's explore what the output of the transformation looks like! This is the A matrix with dimensions n*k, as discussed above in the definition.

For topics[0], corresponding to the document at index 0, I get an array with five values (one per topic); the highest coefficient value being at index 3 implies that this particular document belongs to topic 3.
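That assignment is just an argmax over the document's coefficient row; with made-up coefficient values mirroring the case above:

```python
import numpy as np

# Illustrative coefficients for one document over 5 topics (made-up values)
doc_topic = np.array([0.01, 0.00, 0.02, 0.31, 0.00])
assigned_topic = doc_topic.argmax()

print(assigned_topic)   # -> 3, so this document belongs to topic 3
```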

Explaining each topic

We now have an idea of how topics relate to documents using the output of nmf_model.fit_transform(dtm) discussed above. Let's also understand what these topics are made up of!

In this case, we are using nmf_model.components_ (this is the second matrix, B, with dimensions k*p) to get the relation of topics to words, by viewing the top 10 words with the highest coefficient values for each topic.

These words give us an idea for composing topic names, so we can report the findings in a better way.

Here, topic-0 talks about oil and gas prices, topic-1 relates to grocery stores, topic-2 corresponds to hand sanitiser and toilet paper, topic-3 to online shopping and topic-4 to panic buying.

Labelling tweets

So after getting the idea of what constitutes these five topics, the next step is to label the tweets.

Using the first matrix we discussed, i.e. the n*k topics variable, we assign each tweet the topic number with the maximum coefficient value using: df['Topic'] = topics.argmax(axis=1)

Based on the words associated with each topic, we can create a corresponding label. I have created 5 labels: 'oil_gas_prices', 'grocery_store_workers', 'toilet_paper_sanitiser', 'online_shopping', 'panic_buying_hoarding'. These names can be redefined based on creativity :D
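Mapping topic numbers to these labels is a one-line pandas map; the tweets and column name below are hypothetical stand-ins for the real dataframe:

```python
import pandas as pd

# Hypothetical tweets with topic numbers already assigned via topics.argmax(axis=1)
df = pd.DataFrame({
    "OriginalTweet": [
        "crude oil futures crashed again today",
        "ordered everything online this week",
        "people hoarding supplies at the supermarket",
    ],
    "Topic": [0, 3, 4],
})

topic_names = {
    0: "oil_gas_prices",
    1: "grocery_store_workers",
    2: "toilet_paper_sanitiser",
    3: "online_shopping",
    4: "panic_buying_hoarding",
}
df["Topic_Name"] = df["Topic"].map(topic_names)

print(df[["Topic", "Topic_Name"]])
```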

Mapping topic names to topic numbers
Original Tweet and corresponding Topic Name

We can see that these tags did a reasonably good job of conveying what the tweets are about. The largest numbers of tweets were tagged with the panic_buying_hoarding and oil_gas_prices topics.

So that's it! You made it till the end.

You can find the corresponding code for this analysis here.
