Snapshot of interactive visualization of the topics identified by Guided LDA and the keywords in each topic (pyLDAvis)

How did I tackle a real-world problem with GuidedLDA?

Shahrzad Hosseini
Oct 10, 2019 · 9 min read

The growing use of online platforms and the sheer volume of text that users enter make digesting the data increasingly time-consuming. Sown to Grow is an online education company that aims to empower students by providing a platform where they set goals, reflect on their strategies, and interact with their teachers. For the company to scale across the US, automated parsing of reflections is necessary. It helps teachers customize feedback and channel limited resources to vulnerable kids.


The company shared about 180k student reflections that, according to the company’s rubric, were considered high quality (containing one or more strategies). I cannot show the actual data for privacy reasons, but my dataframe looked like this:

          content       index
0         reflection    0
1         reflection    1
...
184835    reflection    184835

After cleaning the data, which included removing duplicates, unrelated content, and non-English content, I ended up with 104k reflections that I used to identify the strategies. Below is the function I used to correct misspelled words.

from enchant.checker import SpellChecker

def spell_check(text):
    '''
    spell_check: function for correcting the spelling of the reflections
    Expects: a string
    Returns: the corrected reflection as a string
    '''
    Corr_RF = []
    #Go through the reflection word by word
    for refl in text.split():
        #Check to see if the words are in the dictionary
        chkr = SpellChecker("en_US", refl)
        for err in chkr:
            #for the identified errors or words not in dictionary get the suggested correction
            #and replace it in the reflection string
            if len(err.suggest()) > 0:
                sug = err.suggest()[0]
                err.replace(sug)
        Corr_RF.append(chkr.get_text())
    #join the corrected words back into a single string
    return ' '.join(Corr_RF)

data['Corrected_content'] = data.content.apply(spell_check)
documents = data #to change the name of the dataframe to documents

To remove the non-English content, I used langdetect to tag the language of each text and dropped the non-English ones. langdetect is fairly accurate when the input is a sentence, but it is not perfect when the input is a single word.

from langdetect import detect

def lang_detect(text):
    '''
    lang_detect: function for detecting the language of the reflections
    Expects: an iterable of strings (one reflection per item)
    Returns: a list of the detected languages
    '''
    lang = []
    for refl in text:
        lang.append(detect(refl))
    return lang
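Because single words are unreliable input for language detection, one option is to set very short texts aside instead of trusting the tag. The sketch below is only an illustration of that idea: keep_english, min_words and fake_detect are hypothetical names, and the detector is passed in as a plain callable (in practice that could be langdetect’s detect).

```python
def keep_english(texts, detect, min_words=3):
    """Split texts into English ones and ones too short to judge.

    `detect` is any callable that maps a string to a language code
    (assumption: something like langdetect.detect would be passed in).
    """
    english, too_short = [], []
    for text in texts:
        if len(text.split()) < min_words:
            too_short.append(text)       # set aside for manual review
        elif detect(text) == "en":
            english.append(text)
    return english, too_short


# Stand-in detector for demonstration only; not a real language model.
def fake_detect(text):
    return "es" if "estudiar" in text else "en"


english, too_short = keep_english(
    ["I studied my notes before the quiz", "Voy a estudiar mucho hoy", "ok"],
    fake_detect,
)
print(english)    # ['I studied my notes before the quiz']
print(too_short)  # ['ok']
```

The three-word threshold is arbitrary; it would need tuning on real reflections.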

Initial strategies to solve the problem

Regular LDA

Then I started to model the topics in the reflections with Latent Dirichlet Allocation (LDA), using the Gensim topic modelling package. To prepare the data, I tokenized the text (split documents into sentences and sentences into words), removed punctuation, and lowercased everything. Words of three characters or fewer were also removed (the check in the code is len(token) > 3). Tokenizing, lowercasing, and punctuation removal can all be done with Gensim’s simple_preprocess utility. After that, I defined a function that changes words in the third person to the first person and verbs in the past and future tenses to the present, and then reduces words to their root form (lemmatizing and stemming).

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
from nltk.corpus import wordnet
import numpy as np
np.random.seed(42)

With the necessary packages and modules imported, it is time for the preprocessing described above:

def lemmatize_stemming(text):
    #lemmatize the token as a verb, then stem it
    stemmer = SnowballStemmer('english')
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        #drop stop words and tokens of three characters or fewer
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

processed_docs = documents['content'].map(preprocess)

The example below shows the result of preprocessing (I have used a hypothetical example):

doc_sample = documents[documents['index'] == 34].values[0][0]
print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))


original document:
['Something', 'I', 'think', 'I', 'have', 'done', 'correct', 'is', 'studying', 'in', 'advance.']

tokenized and lemmatized document:
['think', 'correct', 'studi', 'advanc']

To create a bag of words from the data set, a Gensim Dictionary can be used. The dictionary built from ‘processed_docs’ assigns an integer id to every distinct word in the corpus and keeps track of how often words occur. The loop below prints the first few (id, word) pairs:

dictionary = gensim.corpora.Dictionary(processed_docs)
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

Next, I removed the tokens that appear in fewer than 15 documents or in more than 50% of the documents (a fraction of the corpus, not an absolute count). After that, I kept the 100,000 most frequent remaining tokens.

dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
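To make the two thresholds concrete, here is a small pure-Python sketch of the same kind of document-frequency filter on toy data. It is not Gensim’s implementation; filter_by_document_frequency and toy_docs are hypothetical names, and the keep_n step is left out.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above):
    """Return the tokens kept under a filter_extremes-style rule.

    docs: list of token lists. A token is kept when it appears in at
    least `no_below` documents and in at most `no_above` (a fraction)
    of all documents.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))        # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


toy_docs = [
    ["studi", "note", "time"],
    ["studi", "quiz", "time"],
    ["read", "book", "time"],
    ["studi", "read", "time"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['read', 'studi']
```

Here ‘time’ is dropped because it appears in all four documents, and ‘note’, ‘quiz’ and ‘book’ are dropped because each appears in only one.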

Next, I converted each document into a bag of words, a list of (word id, count) pairs that records which words appear in the document and how many times, and saved the result as bow_corpus:

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
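For readers who have not seen the bag-of-words format before, the sketch below mimics it in plain Python on toy data. build_vocabulary and doc_to_bow are hypothetical helpers, not Gensim functions, and the ids Gensim assigns will generally differ from the ones shown here.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [["studi", "note", "studi"], ["read", "note"]]
token2id = build_vocabulary(toy_docs)
toy_corpus = [doc_to_bow(doc, token2id) for doc in toy_docs]
print(token2id)    # {'studi': 0, 'note': 1, 'read': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```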

Now the data is ready for the LDA topic model. I used Gensim’s LdaMulticore, which can run on multiple cores.

lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=7, id2word=dictionary, passes=2, workers=2)

To check the words in each topic and their relative weights:

for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0
Words: 0.046*"time" + 0.044*"read" + 0.041*"week" + 0.030*"work" + 0.024*"studi" + 0.022*"go" + 0.016*"good" + 0.016*"book" + 0.015*"like" + 0.014*"test"
Topic: 1
Words: 0.055*"read" + 0.036*"question" + 0.034*"answer" + 0.025*"time" + 0.018*"text" + 0.017*"strategi" + 0.017*"work" + 0.016*"think" + 0.014*"go" + 0.014*"look"
Topic: 2
Words: 0.037*"need" + 0.021*"work" + 0.018*"word" + 0.018*"write" + 0.015*"time" + 0.015*"complet" + 0.015*"essay" + 0.014*"goal" + 0.013*"help" + 0.012*"finish"
Topic: 3
Words: 0.042*"note" + 0.041*"help" + 0.032*"studi" + 0.029*"understand" + 0.027*"quiz" + 0.024*"question" + 0.021*"time" + 0.016*"better" + 0.014*"take" + 0.014*"test"
Topic: 4
Words: 0.031*"write" + 0.031*"work" + 0.027*"time" + 0.025*"think" + 0.024*"sure" + 0.019*"check" + 0.017*"thing" + 0.017*"strategi" + 0.014*"question" + 0.014*"help"
Topic: 5
Words: 0.058*"work" + 0.057*"grade" + 0.046*"goal" + 0.033*"class" + 0.027*"week" + 0.022*"math" + 0.017*"scienc" + 0.016*"improv" + 0.016*"want" + 0.016*"finish"

As you can see from the words in each topic, some words are shared between topics, and there is no distinct theme that can be assigned to each group of words.

Part of Speech (POS) Tagging

After LDA, I decided to tag the part of speech (POS) of the words in each reflection and extract the verbs. I assumed that students were reflecting on what they had done, so reflections with verbs in the past tense could give me a clue about the topics for learning strategies (e.g. “I studied my notes and practiced the past exams”). I parsed the reflections and extracted all the verbs in them via part-of-speech tagging. Then I looked at the tense of the verbs to find a relation between reflections that contain a learning strategy and the tense of the verbs used in them. I noticed that there are reflections that clearly contain learning strategies but are not necessarily in the past tense.
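The verb-and-tense check can be sketched roughly as follows. This is an illustration rather than the code I ran: verbs_by_tense is a hypothetical helper, the input is assumed to be (word, tag) pairs with Penn Treebank tags (the shape nltk.pos_tag returns), and the example sentence is tagged by hand so the snippet runs without any tagger.

```python
PAST_TAGS = {"VBD", "VBN"}   # Penn Treebank: past tense, past participle


def verbs_by_tense(tagged_tokens):
    """Split the verbs of one tagged reflection into past and other tenses.

    tagged_tokens: list of (word, tag) pairs.
    """
    past, other = [], []
    for word, tag in tagged_tokens:
        if not tag.startswith("VB"):
            continue                      # not a verb
        (past if tag in PAST_TAGS else other).append(word.lower())
    return past, other


# Hand-tagged example; in practice the tags would come from a POS tagger.
tagged = [("I", "PRP"), ("studied", "VBD"), ("my", "PRP$"), ("notes", "NNS"),
          ("and", "CC"), ("ask", "VBP"), ("questions", "NNS")]
past, other = verbs_by_tense(tagged)
print(past)   # ['studied']
print(other)  # ['ask']
```

A reflection like this one, which mixes past and present verbs, is exactly the kind of case that made tense alone an unreliable signal.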

So this did not help me find distinct topics of learning strategies either. However, LDA and POS tagging together gave me an idea: using GuidedLDA. Guided LDA is a semi-supervised learning algorithm. The idea is to set seed words for the topics that the user believes are representative of the underlying topics in the corpus and to guide the model to converge around those terms. I used a Python implementation of the algorithm described in the paper “Incorporating Lexical Priors into Topic Models” by J. Jagarlamudi, H. Daumé III and R. Udupa. The paper explains how priors (here, the seed words) can be built into the model to guide it in a certain direction.

In regular LDA, each word is first randomly assigned to a topic, with the assignment controlled by Dirichlet priors via the alpha parameter (now you know where LDA gets its name from). The next step is to find out which term belongs to which topic. LDA uses a very simple approach: it finds the topic for one term at a time.
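For readers who prefer a formula, the one-term-at-a-time update is usually written as the collapsed Gibbs sampling rule below. This is the standard textbook form, given here as background and not taken from the GuidedLDA code; the symbol β for the topic-word prior is notation I am introducing.

```latex
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\left(n_{d_i,k}^{-i} + \alpha\right)\,
\frac{n_{k,w_i}^{-i} + \beta}{n_{k}^{-i} + V\beta}
```

Here z_i is the topic of the i-th token, d_i and w_i are its document and word, n_{d_i,k} counts the tokens in that document assigned to topic k, n_{k,w_i} counts how often the word w_i is assigned to topic k, n_k is the total number of tokens in topic k, V is the vocabulary size, and the superscript -i means the current token is left out of the counts.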

Let’s assume we want to find the topic for the word ‘study’. LDA first assumes that every other word in the corpus is assigned to the right topic. In the previous step, each word was distributed uniformly across all topics, and those assignments are treated as correct. LDA then computes which words ‘study’ frequently co-occurs with and which topic is most common among those words, and assigns ‘study’ to that topic. ‘study’ will probably land near whichever topic ‘textbook’ and ‘notes’ are in. Now these three words are closer to each other than they were before this step. The model then moves to the next word and repeats the process as many times as needed to converge.

With Guided LDA, we explicitly want the model to converge so that words such as ‘study’ and ‘textbook’ end up in one topic. To do so, GuidedLDA gives ‘study’ and ‘textbook’ an extra boost toward a specific topic. The size of that boost is controlled by the seed_confidence parameter, which ranges between 0 and 1. With a seed_confidence of 0.1, the seeded words are biased 10% more towards their seeded topics.
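One way to picture the boost is as a biased starting assignment: a seeded word starts in its seeded topic with probability seed_confidence and in a random topic otherwise. The sketch below illustrates that idea only; initial_topics is a hypothetical function, and the library’s actual internals may differ.

```python
import random


def initial_topics(doc_tokens, seed_topics, n_topics, seed_confidence, rng):
    """Pick a starting topic for every token of one document.

    seed_topics maps a seeded word to its topic id. A seeded word goes to
    its topic with probability seed_confidence; everything else gets a
    uniformly random topic.
    """
    topics = []
    for word in doc_tokens:
        if word in seed_topics and rng.random() < seed_confidence:
            topics.append(seed_topics[word])
        else:
            topics.append(rng.randrange(n_topics))
    return topics


rng = random.Random(7)
seeds = {"study": 1, "textbook": 1}
doc = ["study", "textbook", "time", "study"]

# With full confidence every seeded word starts in its seeded topic (1);
# the unseeded word 'time' still starts in a random topic.
always = initial_topics(doc, seeds, n_topics=6, seed_confidence=1.0, rng=rng)
print(always)
```

With seed_confidence=0 the seeds are ignored and the start is fully random, which corresponds to regular LDA.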

To use the Python implementation of GuidedLDA, you can install it with pip:

pip install guidedlda

or build it from a local clone of the GuidedLDA repository:

cd GuidedLDA
python setup.py sdist
pip install -e .

As with any NLP work, GuidedLDA starts with preprocessing the data. For that, I defined my own preprocessing functions:

import re
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

#stop_words is my own list of stop words, built from the dataset (not shown here)

def get_wordnet_pos(word):
    '''tags parts of speech to tokens
    Expects a string and returns the WordNet part-of-speech
    constant for it (defaults to noun)'''
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def word_lemmatizer(text):
    '''lemmatizes the tokens based on their part of speech'''
    lemmatizer = WordNetLemmatizer()
    text = lemmatizer.lemmatize(text, get_wordnet_pos(text))
    return text

def reflection_tokenizer(text):
    '''expects a string and returns a list of lemmatized tokens
    and removes the stop words. Tokens are lower cased and
    non-alphanumeric characters as well as numbers removed.'''
    text = re.sub(r'[\W_]+', ' ', text) #keeps alphanumeric characters
    text = re.sub(r'\d+', '', text) #removes numbers
    text = text.lower()
    tokens = [word for word in word_tokenize(text)]
    #removes tokens shorter than 3 characters
    tokens = [word for word in tokens if len(word) >= 3]
    tokens = [word_lemmatizer(w) for w in tokens]
    tokens = [s for s in tokens if s not in stop_words]
    return tokens

After defining all the necessary preprocessing functions, it is time to apply them to the target column (here, corrected_content) of the dataframe and save the result as a new column, ‘lemmatize_token’.

df['lemmatize_token'] = df.corrected_content.apply(reflection_tokenizer)

Now it is time to generate the document-term matrix. For that I used the CountVectorizer class from the scikit-learn package:

from sklearn.feature_extraction.text import CountVectorizer

First, we need to instantiate CountVectorizer. For the full list of parameters, you can refer to the scikit-learn website. I replaced the default tokenizer with the custom one I defined previously and the stop words with a list I created based on my own dataset. I used an n-gram range of one to four words (ngram_range=(1, 4)). Now it is time to fit and transform the corpus to generate the document-term matrix:

token_vectorizer = CountVectorizer(tokenizer = reflection_tokenizer, min_df=10, stop_words=stop_words, ngram_range=(1, 4))
X = token_vectorizer.fit_transform(df.corrected_content)
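To show what an n-gram range of one to four words means for the vocabulary, here is a plain-Python sketch that lists the n-grams of a short token list. word_ngrams is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def word_ngrams(tokens, min_n=1, max_n=4):
    """Return all contiguous n-grams of the tokens, joined with spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


grams = word_ngrams(["pay", "attention", "class"], 1, 4)
print(grams)
# ['pay', 'attention', 'class', 'pay attention', 'attention class',
#  'pay attention class']
```

This is why a multi-word term such as ‘pay attention’ can show up as a single keyword in a topic, next to ‘pay’ and ‘attention’ on their own.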

To model the topics with GuidedLDA, I imported the package and created a dictionary that maps each term to its column index:

import guidedlda
#in scikit-learn 1.0 and later, use get_feature_names_out() instead
tf_feature_names = token_vectorizer.get_feature_names()
word2id = dict((v, idx) for idx, v in enumerate(tf_feature_names))

Now it is time to provide the model with a list of seed words. To choose them, I used the semantics of the text along with the initial keywords I got from LDA modelling and the dictionary of verbs from POS tagging. I created a list of lists in which each inner list holds the keywords that I wanted grouped under a specific topic.

seed_topic_list = [
    ['take', 'note', 'compare', 'classmate', 'highlight', 'underline', 'jot', 'write', 'topic', 'main', 'complete', 'point', 'copy', 'slide'],
    ['read', 'study', 'review', 'skim', 'textbook', 'compare', 'note', 'connect', 'sketch', 'summarize', 'relationship', 'map', 'concept', 'diagram', 'chart'],
    ['question', 'essay', 'assignment', 'exam', 'test', 'quiz', 'answer', 'practice', 'review', 'repeat', 'strength', 'weak', 'solve', 'problem', 'identify'],
    ['plan', 'calendar', 'time', 'task', 'list', 'manage', 'procrastinate', 'due', 'stress', 'manage', 'anxiety', 'express', 'break', 'sleep', 'nap', 'eat', 'exercise'],
    ['group', 'partner', 'classmate', 'brainstorm', 'ask', 'answer', 'verify', 'peer', 'teach', 'clarify'],
    ['ask', 'aid', 'resource', 'teacher', 'tutor', 'peer', 'verify', 'explain', 'clear', 'talk'],
]

As you can see, I have provided the model with seed words for 6 topics.

model = guidedlda.GuidedLDA(n_topics=6, n_iter=100, random_state=7, refresh=10)
seed_topics = {}
#map the vocabulary id of each seed word to the id of its topic
for t_id, st in enumerate(seed_topic_list):
    for word in st:
        seed_topics[word2id[word]] = t_id
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
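Two details of this mapping are easy to miss: a seed word that did not make it into the vocabulary (for example because of min_df) raises a KeyError, and a word listed under several topics (such as ‘note’ or ‘ask’ above) ends up mapped to the last topic that lists it. The helper below is a hypothetical, more defensive variant shown on toy data; build_seed_topics and the toy_ names are mine.

```python
def build_seed_topics(seed_topic_list, word2id):
    """Map vocabulary ids of seed words to topic ids.

    Returns the mapping plus the seed words missing from the vocabulary
    and the ones listed under more than one topic.
    """
    seed_topics, missing, repeated = {}, [], []
    for topic_id, words in enumerate(seed_topic_list):
        for word in words:
            if word not in word2id:
                missing.append(word)          # would raise KeyError otherwise
                continue
            word_id = word2id[word]
            if word_id in seed_topics and seed_topics[word_id] != topic_id:
                repeated.append(word)         # later topic overwrites earlier
            seed_topics[word_id] = topic_id
    return seed_topics, missing, repeated


toy_word2id = {"note": 0, "read": 1, "quiz": 2}
toy_seeds = [["note", "highlight"], ["read", "note"], ["quiz"]]
seed_map, missing, repeated = build_seed_topics(toy_seeds, toy_word2id)
print(seed_map)   # {0: 1, 1: 1, 2: 2}
print(missing)    # ['highlight']
print(repeated)   # ['note']
```

Reporting the missing and repeated words makes it easier to decide whether the seed lists need adjusting before fitting.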

Checking the words in each topic:

n_top_words = 15
topic_word = model.topic_word_
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(tf_feature_names)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
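The argsort-and-reverse-slice expression is compact but a little cryptic: it sorts the term indices by weight and then walks backwards to take the highest-weighted ones. The same selection can be written in plain Python as below; top_words, vocab and weights are hypothetical names and toy values used only for illustration.

```python
def top_words(topic_dist, vocabulary, n_top_words):
    """Return the n_top_words terms with the highest weight in one topic."""
    ranked = sorted(range(len(topic_dist)),
                    key=lambda i: topic_dist[i], reverse=True)
    return [vocabulary[i] for i in ranked[:n_top_words]]


vocab = ["ask", "note", "quiz", "read", "teacher"]
weights = [0.30, 0.05, 0.10, 0.15, 0.40]
print(top_words(weights, vocab, 3))  # ['teacher', 'ask', 'read']
```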

and the results look like:

Topic 0: write time reading book know essay start idea take people read keep focus first complete
Topic 1: read study time note take test reading quiz question book look understand day word review
Topic 2: question time study quiz understand check problem note answer knowledge take practice ask mistake learn
Topic 3: time finish assignment homework complete study reflection school day test quiz home keep win last
Topic 4: question answer time read look text reading evidence write find understand word know back right
Topic 5: ask finish teacher talk time school stay attention pay focus extra test pay attention homework know

To visualize the data, I used the pyLDAvis package’s powerful interactive visualization; the result is below. As can be seen, the 6 topics are distinctly separated, and the themes of the topics can be summarized as:

1. Finishing homework/completing assignments

2. Checking past quizzes and questions/understanding answers

3. Talking to and asking the teacher/paying attention

4. Reading/studying notes and books

5. Answering questions and learning the problems

6. Writing stories, essays and books

Distribution of the keywords in each topic (shown in red)

Source code can be found on . I look forward to hearing any feedback or questions.
