Using LDA-based Topic Modeling on Uber India-related tweets to get consumer insights

Twitter is widely used in areas such as policy, promotion, branding, and public awareness. Both consumers and service providers use it to debate and share opinions. It is therefore important to analyze, visualize, and summarize Twitter conversations to gain new insights into customer experience. Advertisers should care about these conversations because tweets from customers and service providers can affect how customers feel about their products. With this in mind, this study examines Uber’s customer engagement in India.

Extracting the Dataset from Twitter

To extract the tweets, I used the snscrape library. The development version of snscrape requires Python 3.8 or higher. I won’t cover upgrading Python, as there are plenty of tutorials available online.

Using Snscrape
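As a rough illustration of this step, the sketch below builds a search query and pulls matching tweets with snscrape's Twitter search scraper. The keywords, date range, limit, and the helper names build_query and scrape_tweets are all assumptions made for the example, not the values used in this study. The scraping function needs network access and a working snscrape install, so the library is imported only inside it.

```python
from itertools import islice


def build_query(keywords, since, until):
    """Combine keywords with Twitter's since:/until: search operators."""
    return f"{keywords} since:{since} until:{until}"


def scrape_tweets(query, limit=1000):
    """Collect up to `limit` tweets matching `query` as plain dicts.

    Requires network access and snscrape; imported lazily so that
    build_query can be used on its own.
    """
    import snscrape.modules.twitter as sntwitter

    rows = []
    scraper = sntwitter.TwitterSearchScraper(query)
    for tweet in islice(scraper.get_items(), limit):
        # Newer snscrape releases expose the text as rawContent,
        # older ones as content; try both.
        text = getattr(tweet, "rawContent", None) or getattr(tweet, "content", "")
        rows.append({"date": tweet.date, "id": tweet.id, "text": text})
    return rows


# Hypothetical keywords and dates, for illustration only.
query = build_query("uber india", "2021-01-01", "2021-03-31")
print(query)
```

Calling scrape_tweets(query) would then return a list of dicts that can be loaded into a pandas DataFrame for the cleaning steps that follow.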

Importing Required Libraries

Data Cleaning

After pulling the Twitter data and importing the necessary packages, we had to strip emojis and URLs from the tweets so that the text could be tokenized in the next phase.
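A minimal sketch of this cleaning step, using only regular expressions from the standard library, might look like the following. The emoji ranges are a rough assumption rather than an exhaustive list, and the function name clean_tweet and the sample tweet are made up for the example.

```python
import re

# Matches http(s) links and bare www. links.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Rough emoji coverage; these Unicode ranges are an assumption,
# not an exhaustive list.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\U0000FE0F"             # variation selector
    "\U0000200D"             # zero-width joiner
    "]+"
)


def clean_tweet(text):
    """Remove URLs and emojis, then collapse leftover whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample tweet; \U0001F621 is an angry-face emoji.
sample = "Driver cancelled again \U0001F621 @Uber_India https://t.co/abc123"
print(clean_tweet(sample))
```

Mentions and hashtags are left in place here; whether to strip them as well depends on how much signal they carry for the topics.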

Our aim during the data preparation phase is to split sentences into words, reduce words to their root form, and exclude words that are too general or unrelated to our topic modeling project. I will share the code and guide you through all phases. We use the following strategies to achieve our goal:

Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.

Words with fewer than 3 characters are removed.

All stop words are removed.

Stemming & Lemmatization: Words in the third person are changed to the first person, and verbs in the past and future tenses are changed to the present. Words are then reduced to their root form.
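The first three strategies above can be sketched in a few lines of plain Python. The stop list below is a tiny illustrative stand-in (a real run would use a full list from an NLP library), and the function name preprocess and the sample sentence are assumptions for the example.

```python
import re

# Tiny illustrative stop list; a real pipeline would use a complete one.
STOP_WORDS = {
    "the", "and", "was", "for", "with", "this", "that", "have", "very", "too",
}


def preprocess(text, min_len=3):
    """Lowercase, keep alphabetic tokens, drop short words and stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in STOP_WORDS]


tokens = preprocess(
    "The driver was very rude, and the fare for this ride is too high."
)
print(tokens)
```

Extracting only runs of letters handles lowercasing and punctuation removal in one pass, at the cost of also discarding numbers, which may or may not be desirable for ride-related tweets.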

Tokenization is always the first step before we can do any text data processing. Here, spaCy segments sentences into words, punctuation, symbols, and other tokens by applying rules specific to each language. spaCy is a natural language processing library whose pre-trained models can work out the relationships between words.

Lemmatization is a method in which words are converted to their root word. For instance, ‘studying’ is turned into ‘study’, ‘meeting’ is turned into ‘meet’, and ‘better’ and ‘best’ into ‘good’. The benefit is that the overall number of unique terms in the dictionary is reduced. As a result, the document-word matrix has fewer columns and becomes denser. The ultimate aim of lemmatization is to help the LDA model produce better topics.
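To make the effect on the document-word matrix concrete, here is a toy sketch. The LEMMAS lookup table is a hypothetical stand-in for a real lemmatizer such as the one in spaCy, and the three small documents are invented for the example.

```python
# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {
    "studying": "study", "studies": "study",
    "meeting": "meet", "meets": "meet",
    "better": "good", "best": "good",
}


def lemmatize(tokens):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [LEMMAS.get(t, t) for t in tokens]


def vocabulary(documents):
    """Sorted list of unique terms across all documents."""
    return sorted({t for doc in documents for t in doc})


def doc_word_matrix(documents):
    """One row per document, one column per vocabulary term (raw counts)."""
    vocab = vocabulary(documents)
    return [[doc.count(term) for term in vocab] for doc in documents]


docs = [
    ["studying", "meeting", "better"],
    ["studies", "meets", "best"],
    ["study", "meet", "good"],
]

raw_vocab = vocabulary(docs)
lemma_docs = [lemmatize(d) for d in docs]
lemma_vocab = vocabulary(lemma_docs)

print(len(raw_vocab), "columns before lemmatization")
print(len(lemma_vocab), "columns after lemmatization")
print(doc_word_matrix(lemma_docs))
```

Before lemmatization each row has mostly zeros, because every surface form gets its own column; afterwards the same documents share a small set of columns and every cell is filled.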

LDA Model

Model perplexity and topic coherence offer a convenient way to assess how good a given topic model is. In my experience, the topic coherence score has been especially helpful.
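For intuition about the first of these two measures, perplexity is the exponential of the negative average log-likelihood per token, so lower values mean the model is less "surprised" by the text. The sketch below uses made-up per-token probabilities rather than output from a fitted LDA model, and the function name perplexity is an assumption for the example.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Made-up values: a model that spreads probability evenly over 4 words
# versus one that assigns probability 0.5 to every observed word.
uniform = [math.log(1 / 4)] * 8
confident = [math.log(0.5)] * 8

print(perplexity(uniform))
print(perplexity(confident))
```

A model that is as uncertain as a uniform guess over four words has a perplexity of about 4, while the more confident model scores about 2. Perplexity says nothing about whether the topics are interpretable, which is one reason coherence is often the more useful of the two.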

Now that the LDA model is built, the next step is to examine the produced topics and their associated keywords. There is no better tool for this than the pyLDAvis package’s interactive chart, which is designed to work well with Google Colab and Jupyter notebooks.

How to interpret the pyLDAvis output

Each bubble in the left-hand plot represents a topic. The bigger the bubble, the more prevalent the topic is. A good topic model will have fairly large, non-overlapping bubbles spread around the chart rather than concentrated in one quadrant. A model with too many topics will typically show many small, overlapping bubbles clustered in one area of the chart.
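As far as I understand it, the bubble size reflects each topic's share of all the tokens in the corpus. A minimal sketch of that calculation, assuming we already have each document's topic distribution and length, is shown below; the numbers and the function name topic_prevalence are invented for illustration and do not come from this study's model.

```python
def topic_prevalence(doc_topic, doc_lengths):
    """Share of corpus tokens attributed to each topic.

    doc_topic:   one topic distribution (summing to 1) per document
    doc_lengths: number of tokens in each document
    """
    n_topics = len(doc_topic[0])
    mass = [0.0] * n_topics
    for dist, length in zip(doc_topic, doc_lengths):
        for k, p in enumerate(dist):
            # Longer documents contribute more tokens to their topics.
            mass[k] += p * length
    total = sum(mass)
    return [m / total for m in mass]


# Made-up distributions for three documents over three topics.
doc_topic = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
doc_lengths = [10, 20, 10]

prevalence = topic_prevalence(doc_topic, doc_lengths)
print(prevalence)
```

In this toy case the first topic accounts for a little over half of the tokens, so it would be drawn as the largest bubble.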

Hyperparameter Tuning