Topic Extraction from Tweets using LDA

Usen Osasu
6 min read · Dec 13, 2019


In this post, we will look at the task of performing topic modelling on Twitter data to figure out what people are tweeting about. We will first load the data, perform some simple EDA to explore popular hashtags and users, and finally apply a machine learning algorithm, LDA (Latent Dirichlet Allocation), to extract the topics in the tweets.

The code for this blog post can be found here.

To follow this post, you should be comfortable with basic Python and the pandas and numpy packages. You will need the following packages installed: numpy, pandas, seaborn, matplotlib, sklearn, nltk.

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is an example of a topic model used to extract topics from a document. LDA is an unsupervised machine learning algorithm that allows a set of textual observations to be explained by unobserved groups that capture similarities within the data. LDA represents documents as mixtures of topics, where each topic emits words with certain probabilities.

Topic Modelling using LDA

Data

Twitter is a fantastic source of data, with over 8,000 tweets sent per second. Those tweets can be downloaded and used to investigate mass opinion on particular issues. This can be as basic as searching for keywords and phrases, or more advanced, aiming to discover the general topics contained in a dataset. The first thing we will do is get our data.

Let's load the required packages.

# packages to store and manipulate data
import pandas as pd
import numpy as np

# plotting packages
import matplotlib.pyplot as plt
import seaborn as sns

# model building package
import sklearn

# package to clean text
import re

EDA

First we load the tweets into a pandas dataframe, then check through them to flag retweets.
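A minimal loading sketch (the file name tweets.csv is an assumption; the dataframe only needs a full_text column with the raw tweet text):

# load the scraped tweets (file name is an assumption - use your own export)
tweets = pd.read_csv('tweets.csv')
tweets.head()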

# make a new column to highlight retweets
tweets['is_retweet'] = tweets['full_text'].apply(lambda x: x[:2]=='RT')
tweets['is_retweet'].sum() # number of retweets

We can also see the most frequent tweets by counting the number of times each tweet is duplicated and sorting.

# 10 most repeated tweets
tweets.groupby(['full_text']).size().reset_index(name='counts')\
.sort_values('counts', ascending=False).head(10)

To visualize, we plot the distribution of repeated tweets in the dataset.
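A minimal sketch of how this plot can be produced (assigning the counts from the step above to a variable first):

# distribution of how many times each tweet appears in the dataset
tweet_counts = tweets.groupby(['full_text']).size().reset_index(name='counts')
plt.hist(tweet_counts['counts'], bins=50)
plt.yscale('log')  # log scale, since most tweets appear only once
plt.xlabel('copies of each tweet')
plt.ylabel('frequency')
plt.show()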

Next we find out who is being tweeted at and what the most common hashtags are.

def find_mentioned(tweet):
    '''This function will extract the twitter handles of people mentioned in the tweet'''
    return re.findall(r'(?<!RT\s)(@[A-Za-z]+[A-Za-z0-9-_]+)', tweet)

def find_hashtags(tweet):
    '''This function will extract hashtags'''
    return re.findall(r'(#[A-Za-z]+[A-Za-z0-9-_]+)', tweet)

# make new columns for mentioned usernames and hashtags
tweets['mentioned'] = tweets.full_text.apply(find_mentioned)
tweets['hashtags'] = tweets.full_text.apply(find_hashtags)

In this section we will perform an analysis on the hashtags. First we will select the hashtags column from the dataframe and keep only the rows where there actually is a hashtag.

Currently each row contains a list of multiple values, so next we will make a new dataframe that takes all the hashtags but gives each one its own row, as sketched below.
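A sketch of these two steps (the construction is an assumption, chosen so that the dataframe names match the code that follows):

# keep only the rows where the tweet contains at least one hashtag
hashtags_df = tweets.loc[
    tweets.hashtags.apply(lambda hashtag_list: hashtag_list != []),
    ['hashtags']]

# flatten so that each hashtag gets its own row
flattened_hashtags_df = pd.DataFrame(
    [hashtag for hashtag_list in hashtags_df.hashtags for hashtag in hashtag_list],
    columns=['hashtag'])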

# number of unique hashtags
flattened_hashtags_df['hashtag'].unique().size
# --> 13807

Try using the above process on the “mentioned” column in your data.

Find Correlated Hashtags

We will need to turn the text into numeric form. We can do this by transforming the list of hashtags into a vector representing which hashtags appeared in which rows.
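The snippet below refers to a popular_hashtags dataframe of appearance counts; a minimal sketch of how it can be built from flattened_hashtags_df:

# count how many times each hashtag appears across all tweets
popular_hashtags = flattened_hashtags_df.groupby('hashtag').size()\
    .reset_index(name='counts')\
    .sort_values('counts', ascending=False)\
    .reset_index(drop=True)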

# take hashtags which appear at least this amount of times
min_appearance = 1000
# find popular hashtags - make into python set for efficiency
popular_hashtags_set = set(popular_hashtags[
    popular_hashtags.counts >= min_appearance
]['hashtag'])

Next we are going to create a new column in hashtags_df which keeps only the popular hashtags in each row. We will also drop the rows where no popular hashtags appear, as sketched below.
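A sketch of this filtering step, reusing the dataframe names from above:

# new column holding only the popular hashtags in each row
hashtags_df['popular_hashtags'] = hashtags_df.hashtags.apply(
    lambda hashtag_list: [hashtag for hashtag in hashtag_list
                          if hashtag in popular_hashtags_set])

# drop the rows where no popular hashtags appear
popular_hashtags_list_df = hashtags_df.loc[
    hashtags_df.popular_hashtags.apply(lambda hashtag_list: hashtag_list != [])]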

Next we want to vectorise the hashtags in each tweet as mentioned above. To do this, we create a dataframe where the hashtags contained in each row are in vector form.

# make new dataframe
hashtag_vector_df = popular_hashtags_list_df.loc[:, ['popular_hashtags']]

for hashtag in popular_hashtags_set:
    # make columns to encode presence of hashtags
    hashtag_vector_df['{}'.format(hashtag)] = hashtag_vector_df.popular_hashtags.apply(
        lambda hashtag_list: int(hashtag in hashtag_list))

Drop the “popular_hashtags” column and plot the correlation between the remaining hashtag columns.
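A minimal sketch of that step:

# correlation between the one-hot hashtag columns
correlations = hashtag_vector_df.drop('popular_hashtags', axis=1).corr()
plt.figure(figsize=(10, 10))
sns.heatmap(correlations, cmap='RdBu', vmin=-1, vmax=1, square=True)
plt.show()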

Topic Modelling

We remove web links from the tweets, as well as retweets and mentions, because they are unlikely to help us form meaningful topics.

We would like to know the general things which people are talking about, not who they are talking about and not the web links they are sharing.
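The cleaning function below calls two helpers, remove_users and remove_links, that are not shown in the snippet. A possible sketch of them:

def remove_users(tweet):
    '''remove the RT prefix and @mentions from the tweet text'''
    tweet = re.sub(r'RT\s+', '', tweet)           # remove the retweet marker
    tweet = re.sub(r'@[A-Za-z0-9_]+', '', tweet)  # remove @mentions
    return tweet

def remove_links(tweet):
    '''remove web links from the tweet text'''
    return re.sub(r'http\S+|www\.\S+', '', tweet)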

Next, we stem the words in the list. This essentially knocks the ends off the words so that similar words are recognised as the same word by the algorithm.

import nltk
# nltk.download('stopwords')  # uncomment if the stopword list has not been downloaded yet

my_stopwords = nltk.corpus.stopwords.words('english')
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem
my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'

# cleaning master function
def clean_tweet(tweet, bigrams=False):
    tweet = remove_users(tweet)
    tweet = remove_links(tweet)
    tweet = tweet.lower()  # lower case
    tweet = re.sub('[' + my_punctuation + ']+', ' ', tweet)  # strip punctuation
    tweet = re.sub(r'\s+', ' ', tweet)  # remove double spacing
    tweet = re.sub('([0-9]+)', '', tweet)  # remove numbers
    tweet_token_list = [word for word in tweet.split(' ')
                        if word not in my_stopwords]  # remove stopwords
    tweet_token_list = [word_rooter(word) if '#' not in word else word
                        for word in tweet_token_list]  # apply word rooter
    if bigrams:
        tweet_token_list = tweet_token_list + [tweet_token_list[i] + '_' + tweet_token_list[i+1]
                                               for i in range(len(tweet_token_list) - 1)]
    tweet = ' '.join(tweet_token_list)
    return tweet

tweets['clean_tweet'] = tweets.full_text.apply(clean_tweet)

Now that we have clean text we can apply some processing to turn the clean tweets into vectors and then build a model.

from sklearn.feature_extraction.text import CountVectorizer

# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=0.9, min_df=100, token_pattern=r'\w+|\$[\d\.]+|\S+')

# apply transformation
tf = vectorizer.fit_transform(tweets['clean_tweet'])  # .toarray()

# tf_feature_names tells us what word each column in the matrix represents
tf_feature_names = vectorizer.get_feature_names_out()  # use get_feature_names() on older scikit-learn versions

tf.shape # --> (200000, 2296)

Each row in the tf matrix is a tweet and each column is a word. The numbers in each position tell us how many times this word appears in this tweet.

Next we create the model object. Let's start by arbitrarily choosing 10 topics.

from sklearn.decomposition import LatentDirichletAllocation

number_of_topics = 10
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=45)  # random state for reproducibility

# fit data to model
model.fit(tf)

Next we will inspect the topics we generated and try to extract meaningful information from them.
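One simple way to inspect them is to print the highest-weighted words in each topic. The display_topics helper below is not part of the original snippet, just a sketch:

def display_topics(model, feature_names, no_top_words=10):
    '''print the top words for each topic found by the model'''
    for topic_idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]
        print('Topic {}: {}'.format(topic_idx, ' '.join(top_words)))

display_topics(model, tf_feature_names)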

Improvements

To produce better results, you can try some of the following:
- hyperparameter tuning
- trying out a different model, such as non-negative matrix factorisation (NMF), sketched below.
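As a rough sketch, NMF can be swapped in with only a few changes; it usually works better on a TF-IDF representation than on raw counts:

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF representation of the cleaned tweets
tfidf_vectorizer = TfidfVectorizer(max_df=0.9, min_df=100)
tfidf = tfidf_vectorizer.fit_transform(tweets['clean_tweet'])

# fit an NMF model with the same number of topics
nmf_model = NMF(n_components=number_of_topics, random_state=45)
nmf_model.fit(tfidf)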

Conclusion

In this article, we looked at topic modelling on tweets using an unsupervised learning algorithm called Latent Dirichlet Allocation (LDA). We performed basic exploratory data analysis on the data, then cleaned and converted it to vector format and fitted it to the LDA model. After the process was complete, we were able to extract topics from the tweets along with their probabilities.
