Exploring Topic Modelling using Semi-Supervised Learning (Correlation Explanation)
In today’s world, digital transformation is pursued in every industry, and data-driven insights are at the very centre of it. Data is generated and collected in different forms every second, and every organisation strives to ensure it has a complete view of its data so that it can provide real-time insights and take data-driven actions. The challenging part of any data-driven process is obtaining the relevant information within a short timeframe. Many algorithms and technologies have been developed to fetch the information one is looking for. In this article, we explore Topic Modelling, one of the popular techniques in natural language processing, and a specific package that enhances it.
What is Topic Modelling?
Topic modelling is the process of identifying topics within a collection of documents. With the increase of digitized text such as emails, tweets, books, journals, articles, and more, topic modelling remains one of the most important techniques for identifying topics and automating the classification of such documents into categories wherever necessary. There are many standard approaches to topic modelling, two of the most popular being Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF).
Both LDA and NMF have standard implementations in the Python scikit-learn library and are widely used across the data science community. They are unsupervised learning algorithms that generate a variety of interconnected topics and depend heavily on the assumptions they make about the dataset. However, these methods have limitations when it comes to generalising the underlying detail, and they rely on complex assumptions about the data-generating process. In some cases, due to the high dimensionality of human-generated data, the assumptions these models make turn out to be wrong.
In this article, we explore the CorEx package, which allows some degree of control over the topics generated by the model.
Correlation Explanation (CorEx)
Correlation Explanation (CorEx) is a flexible framework developed by Greg Ver Steeg. Applied to topic modelling, it identifies topics that maximise the information available in a corpus of text. The CorEx model allows domain knowledge to be incorporated through user-specified anchor words, which guide the model towards the topics of interest. This enables the model to represent topics that do not naturally emerge, and to separate keywords so that distinct topics can be identified.
For example, the table below shows some of the keywords associated with topics in a newsgroup, and the anchor words which would be used to identify these topics.
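To make "maximise the information" a little more concrete: CorEx is built around total correlation, which measures how much information a group of variables share beyond what each carries on its own. The sketch below computes it for toy binary word-occurrence data. The function names and the data are illustrative assumptions and are not part of the CorEx package.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (in bits) of a list of hashable observations."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def total_correlation(rows):
    """Sum of the per-column entropies minus the joint entropy of the rows."""
    columns = list(zip(*rows))
    marginal = sum(entropy(list(col)) for col in columns)
    joint = entropy([tuple(row) for row in rows])
    return marginal - joint

# Toy data (hypothetical): each row is a document, each column records
# whether a particular word occurs (1) or not (0).
always_together = [(1, 1), (0, 0), (1, 1), (0, 0)]
unrelated = [(0, 0), (0, 1), (1, 0), (1, 1)]

tc_together = total_correlation(always_together)
tc_unrelated = total_correlation(unrelated)
print(tc_together, tc_unrelated)
```

Two words that always appear together share one full bit of information, while two words that occur independently share none. Words with high shared information are the kind of group a topic is meant to explain.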
How to use CorEx?
The Python implementation of CorEx is available on GitHub.
You can install CorEx into your Python environment using pip:
pip install corextopic
Below are examples of topic modelling using standard unsupervised methods such as LDA and NMF.
Topic Modelling using LDA
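from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import TfidfVectorizer

no_of_topics = 4
# Build the TF-IDF matrix; documents is your own list of text strings
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)
tfidf_feature_names = vectorizer.get_feature_names_out()

# Run LDA (n_components replaces the older n_topics argument)
lda = LDA(n_components=no_of_topics).fit(tfidf)

# Display top n words for each topic identified
def display_topics(model, features, words_count):
    for topic_no, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_no))
        print(" ".join([features[i] for i in topic.argsort()[:-words_count - 1:-1]]))

words_count = 10  # Display top 10 words for each topic
display_topics(lda, tfidf_feature_names, words_count)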
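The snippet above takes a TF-IDF matrix as its input. As a rough illustration of what such a matrix holds, here is a minimal pure-Python sketch using the textbook weighting tf * log(N / df). The toy documents and the helper name tfidf_rows are assumptions for illustration. Library implementations such as scikit-learn's TfidfVectorizer apply additional smoothing and normalisation, so their values will differ.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, rows) where rows[d][j] = tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    n_docs = len(tokenised)
    # Document frequency: the number of documents that contain each word
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        rows.append([
            counts[word] * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, rows

# Toy corpus (hypothetical)
documents = [
    "the senate passed the bill",
    "the team scored a goal",
    "the circuit needs a resistor",
]
vocabulary, rows = tfidf_rows(documents)

# Compare a word found in every document with a word found in only one
the_col = vocabulary.index("the")
goal_col = vocabulary.index("goal")
print([row[the_col] for row in rows])
print([round(row[goal_col], 3) for row in rows])
```

A word that appears in every document carries no weight under this scheme, while a word confined to one document is weighted up, which is why TF-IDF is a common input for topic models.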
Topic Modelling using NMF
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

no_of_topics = 4
# Build the TF-IDF matrix; documents is your own list of text strings
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)
tfidf_feature_names = vectorizer.get_feature_names_out()

# Run NMF
nmf = NMF(n_components=no_of_topics).fit(tfidf)

# Display top n words for each topic identified
def display_topics(model, features, words_count):
    for topic_no, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_no))
        print(" ".join([features[i] for i in topic.argsort()[:-words_count - 1:-1]]))

words_count = 10  # Display top 10 words for each topic
display_topics(nmf, tfidf_feature_names, words_count)
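Both display_topics helpers above rely on the slice topic.argsort()[:-words_count - 1:-1], which picks the indices of the largest weights in descending order. A dependency-free sketch of the same idea follows, using made-up weights and a hypothetical helper named top_words:

```python
def top_words(weights, features, words_count):
    """Return the words_count features with the largest weights, highest first."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [features[i] for i in ranked[:words_count]]

# Made-up topic weights over a tiny vocabulary (illustrative only)
features = ["goal", "bible", "circuit", "congress", "pitching"]
weights = [0.40, 0.05, 0.10, 0.02, 0.35]

best = top_words(weights, features, 3)
print(best)
```

When several weights tie, the order of the tied words here may differ from the order NumPy's argsort would give.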
Neither of the examples above takes any input from the user to identify the topics. Topics are chosen purely from the underlying patterns in the data, so the models might not capture relationships between keywords that always belong together, or keep apart keywords that belong to separate concepts, across topics in a complex dataset.
Topic Modelling using Correlation Explanation overcomes this limitation using anchor keywords as shown in the example below:
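from corextopic import corextopic as ct
from sklearn.feature_extraction.text import TfidfVectorizer

no_of_topics = 4
anchor_strength = 3

# Build the TF-IDF matrix; documents is your own list of text strings
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)
words = list(vectorizer.get_feature_names_out())

# Anchor keywords: one list per topic
keywords = [
    ["congress", "clinton", "trump"],
    ["bible", "christian", "muslim", "hindu"],
    ["circuit"],
    ["pitching", "goal"]
]

# Run Anchored CorEx; words lets the model match anchor strings to matrix columns
topic_model = ct.Corex(n_hidden=no_of_topics)
topic_model.fit(tfidf, words=words, anchors=keywords, anchor_strength=anchor_strength)

# Display top n words for each topic identified
def display_topics(model, words_count):
    for i, topic_words in enumerate(model.get_topics(n_words=words_count)):
        # Each entry starts with the word followed by its score; keep positive scores
        topic_words = [entry[0] for entry in topic_words if entry[1] > 0]
        print("Topic #{}: {}".format(i + 1, ", ".join(topic_words)))

words_count = 10  # Display top 10 words for each topic
display_topics(topic_model, words_count)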
The anchor keywords are sets of keywords assigned to each topic. In the above example, keywords are used to steer the model towards topics such as politics, religion, electricity (utility services), and sport. The CorEx model also has an anchor_strength parameter that controls how strongly the generated topics are biased towards the anchor keywords. This value should always be above 1, and higher values indicate a stronger bias towards the anchor keywords.
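Anchor words can only steer a topic if they actually occur in the vocabulary the model is fitted on. The following is a small sketch of a check that could be run before fitting. Note that filter_anchors is a hypothetical helper rather than part of corextopic, and the vocabulary here is made up.

```python
def filter_anchors(anchor_groups, vocabulary):
    """Split anchor groups into words found in the vocabulary and words missing from it."""
    known = set(vocabulary)
    kept, missing = [], []
    for group in anchor_groups:
        kept.append([word for word in group if word in known])
        missing.extend(word for word in group if word not in known)
    return kept, missing

# Hypothetical vocabulary, for example the feature names from a vectoriser
vocabulary = ["bible", "christian", "circuit", "clinton", "congress", "goal"]

keywords = [
    ["congress", "clinton", "trump"],
    ["bible", "christian", "muslim", "hindu"],
    ["circuit"],
    ["pitching", "goal"],
]

kept, missing = filter_anchors(keywords, vocabulary)
print(kept)
print(missing)
```

The kept list retains one entry per topic, so its positions still line up with the original keywords list, and the missing list shows which anchors would need a different spelling or a larger vocabulary.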
An example notebook for CorEx by Ryan Gallagher is available at this link.
What do we do at Version 1?
At Version 1, we gain insights into how satisfied our customers are through a quarterly survey. This helps us understand what our customers need, identify areas that could be improved, and spot possible opportunities to innovate and add value.
The CorEx Topic Modeller helps us identify topics across customers in key areas, allowing us to improve and to ensure we are consistently providing an excellent service.