Topic Modelling of Toronto Historical Plaques

Using Spacy and LDA to examine the main topics of local history celebrated in Toronto Heritage Plaques

Published in

DataExplorations

15 min readFeb 5, 2019

I recently collected the text of historical plaques in the city of Toronto for another project(Toronto-Walks). The plaque texts were taken from the Toronto Plaques website. Using this information, I was curious to uncover the the dominant themes /topics covered by the plaques.

In this post, I’ll start with an overview of Spacy and then use Latent Dirichlet allocation (LDA) topic modelling on the plaque data to see what I can discover.

First… an overview of Spacy

Since this is the first time I’ve played around with Spacy, I’d like to see what it can do.

import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = spacy.load('en_core_web_md')  # use larger model!from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
import string

You will need to download the model (I’m using the large model here)

# run this from a normal command line
#python -m spacy download en_core_web_md

I’ve already loaded a plaques pandas dataframe with the plaque text in the Details column. Let’s play with the first plaque in our dataset — one about The Beatles playing at Maple Leaf Gardens

Beatles Plaque at the form Maple Leaf Gardens in Toronto (http://torontoplaques.com/Pages/Beatles.html)

Here is the plaque text:

‘Inside the former Maple Leaf Gardens, now a Loblaws, at the “Canteen” on the main level imbedded in the top of a table is this plaque. Here’s what it says: Beatlemania wasn’t quite dead, but it was clearly on the wane. In 1964, on the first North American tour, the Fab Four were greeted by 10,000 fans at the airport in Toronto, and were mobbed at their hotel — where one enterprising teenaged girl even managed to hide in their closet. Their two shows set an attendance record at Maple Leaf Gardens. A year later, when they rolled into town on the heels of their wild and triumphant concert at Shea Stadium, Gardens’ owner Harold Ballard turned up the heat, turned off the water fountains, and made a fortune selling soft drinks as the lads from Liverpool again packed the arena for two shows. But by August 17, 1966, something had changed. John Lennon’s crack about the band being more popular than Jesus had stirred controversy and protest. Only 800 fans were at the airport to greet the Beatles, and on the day of the shows, tickets were still available. As before, the Beatles only played a dozen songs, and were preceded by several opening acts — which included The Ronettes. Rumours abounded that this would be the band’s final tour. “It would be embarrassing to perform Long Tall Sally when we’re 35,” Paul McCartney said at the pre-concert press conference. “We can’t go on holding hands forever,” John Lennon added. Writing in the Toronto Star, Arthur Zeldin had it right. He noted that the band didn’t play some of their more ambitious recent songs like Eleanor Rigby live, because they couldn’t replicate what they did in the studio on stage. Meanwhile, their more subtle numbers — Yesterday was a clear crowd favourite — were all but drowned out by the screaming. “So, by the very forces of their musical experimentation and development, the Beatles may be taking themselves off the touring, mass live audience scene,” Zeldin wrote. Maple Leaf Gardens was the only building the Beatles visited on all three of their North American tours, and after that show in 1966, they would play only eight more concerts, and then never do an official live show again. But of course, that was hardly the end of the story.-Stephen Brunt’

Now let’s see what Spacy has learned about each word/token in that text

# create the Spacy pipeline for text-processing
mytokens = nlp(mytext)# examine what spacy thinks of each word
tokens_lst = []
for token in mytokens:
    mytoken = {}
    mytoken['text'] = token.text
    mytoken['lemma']=token.lemma_
    mytoken['POS'] =  token.pos_
    mytoken['tag'] = token.tag_
    mytoken['dep'] = token.dep_
    mytoken['shape'] = token.shape_
    mytoken['Is_Alpha?'] = token.is_alpha
    mytoken['Is_stop?'] =  token.is_stop
    tokens_lst.append(mytoken)
tokens_df = pd.DataFrame(tokens_lst)
tokens_df.head()

Taking a look at the first sentence in this dataframe (“Beatlemania wasn’t quite dead, but it was clearly on the wane”), we can see how Spacy has treated each word. The first column, 1_text, contains the original text.

is_alpha: is alphanumeric?
is_stop: is stop word?
pos: part of speech (i.e. PROPN for a proper noun, ADJ for an adjective, DET for a determiner) (a complete list is available in the Spacy docs)
dep: syntactic dependency (relation between tokens) (see Spacy docs)
lemma: lemmatized word (base word, stripped of plurals etc) (According to the docs, Spacy uses WordNet for english)
shape: word shape with capitalization, punctuation, digits
tag: Fine-grained part-of-speech

Note: if a term comes up that you’re not familiar with, i.e. the tag JJ, you can use spacy.explain('JJ') to get a description of that term (adjective, in this case). You can find a complete list of the available Token attributes here

Named Entity Recognition (NER)

Spacy also allows you to pick out the named entities — let’s see what it finds in our sample plaque.

ner_lst=[]
for ent in mytokens.ents:
    my_ners = {}
    my_ners['1_NER'] = ent.text
    my_ners['2_Label']=ent.label_
    ner_lst.append(my_ners)
ners_df = pd.DataFrame(ner_lst)

When we look at the resulting DataFrame, we can see what entities it has identified and what label it has applied to them. For example, “Maple Leaf Gardens” (a former hockey arena) is correctly identified as FAC ( Buildings, airports, highways, bridges, etc.), 1964 is correctly identified as DATE, Harold Ballard as a PERSON and Toronto as a GPE (Countries, cities, states). However, Loblaws (a Canadian grocery chain) appears to be mis-identified as a GPE (Countries, cities, states) instead of an ORG ( Companies, agencies, institutions, etc). But, overall, it’s done a great job of pulling out the major entities in this plaque description

If you’re a more visual person, you can also use the displacy.render()command to see annotated version of your text with the entities highlighted (see the docs for more info)

displacy.render(nlp(mytext), jupyter=True, style='ent')

This is, of course, only scratching the surface of what Spacy can do! But let’s jump into using Spacy to tokenize our Plaque descriptions so we can run topic modelling on them

Tokenize Plaque Descriptions

The following function will pre-process and clean up our plaque descriptions

use Spacy English() parser
run through NLP pipeline
convert words to lowercase and lemmatize (stem) (add special handling for Pronouns Spacy lemmatizes as “-PRON-”)
remove stopwords and punctuation
remove digits
return string

punctuations = string.punctuation
stopwords = list(STOP_WORDS)
def spacy_tokenizer(plaque_descr):
    if plaque_descr != None:
        mytokens = nlp(plaque_descr)
        mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
        mytokens = [ re.sub('\\d+', '', word) for word in mytokens if word not in stopwords and word not in punctuations ]
        mytokens = " ".join([i for i in mytokens])
        return mytokens
    else:
        return ""

For example, we can see what our tokenizer will return for the following sample sentence:

spacy_tokenizer("Writing in the Toronto Star, Arthur Zeldin had it right")
>>> 'write toronto star arthur zeldin right'

Next I will add a “details_parsed” column to the dataframe with the cleaned up plaque description

df_plaques['details_parsed'] = df_plaques['details'].apply(lambda x: spacy_tokenizer(x))

Finally, I’ll vectorize it — for now I’m just going to do a simple CountVectorizer and extract both unigrams and bigrams (1 and 2 word tokens). I found it was very beneficial to limit the max_features (2000 in this case) and found a max_df of about 0.8 worked best.

# Creating a vectorizer
vectorizer = CountVectorizer(min_df=5, ngram_range=(1,2), max_df=0.8, stop_words='english', lowercase=True, max_features=2000) 
detail_vec = vectorizer.fit_transform(df_plaques['details_parsed'])
df_vect = pd.DataFrame(detail_vec.toarray(), columns=vectorizer.get_feature_names())
df_vect.head()

Topic Modelling

Before we jump into running our Topic Model, let’s review what exactly it is. A topic model is a statistical analysis of the themes that appear in a collection of documents/texts (plaques in our example). For example, we could guess that our Beatles plaque is 60% about the Beatles, 10% about Maple Leaf Gardens, 20% about touring and 10% about fans. Topic modelling is a form of unsupervised learning, since we don’t know the correct labels/themes ahead of time.

Latent Dirichlet allocation (LDA) is the type of topic modelling I’ll be exploring today. Analytics Vidhya has a nice description of LDA:

LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.

It builds two models:

topic per document: what percentage of each document is built from a given topic
words per topic: each word has a probability of belonging to the various topics (i.e. wheel is highly likely to belong to the Cars topic, while it is much less likely to belong to the Pets topic (although it’s never impossible))

How we do evaluate the model’s performance?

There are a few methods commonly used to measure the performance of a topic model, although it seems that they all need to be used with a strong dose of human-judgement. I found the clearest information on this topic in this post by Sooraj Subrahmannian and in the book Hands-On Machine Learning for Algorithmic Trading by Stefan Jansen.

Perplexity: Perplexity is essentially a measure of how puzzled a trained model is by previously unseen documents. The core measure here is log-likelihood, which looks at the probability of seeing the content of the unseen documents given the trained model — i.e. does the model generalize? If it doesn’t, the log-likelihood is very low and perplexity ((exp(-1. * log-likelihood per word)) is high. A good model will have low perplexity.

However, experts seem to agree that perplexity doesn’t translate well to what humans would consider good topic models.

Topic Coherence: This seems to be used more frequently and corresponds better to human judgement. The book Hands-On Machine Learning for Algorithmic Trading provides this explanation:

Topic coherence measures whether the words in a topic tend to co-occur together. It adds up a score for each distinct pair of top-ranked words. The score is the log of the probability that a document containing at least one instance of the higher-ranked word also contains at least one instance of the lower-ranked word.

Coherence looks at the most-frequently occurring words in each of the generated topics, rates the semantic similarity between them (using either UCI or Umass to do the pairwise calculations) and then finds the mean coherence score across all the topics in the model

Run a Latent Dirichlet Allocation (LDA) topic model

Phew… that was a lot of theory! Let’s get going and actually try running our LDA analysis on top of our vectorized plaque descriptions. One issue with clustering tools like LDA is that you need to know in advance how many topics you’re looking for. We could make an educated guess, but in this case I’m going to loop through a range of possible topic models numbers and check the coherence scores to see if we can find the optimal number of topics.

First, we’ll prepare the Gensim Dictionary and Corpus (the Document Term Matrix with the frequency of each word within each document). Then we’ll run LDA for 3–20 topics and record the u_mass and c_v coherence scores. Warning: this is slow!

dictionary = corpora.Dictionary(plaques)

corpus = [dictionary.doc2bow(plaque) for plaque in plaques]# source: https://github.com/smsubrahmannian/Topic-Modeling/blob/master/code/Preparing%20a%20Topic%20model.ipynb
Lda = models.LdaMulticore
coherenceList_umass = []
coherenceList_cv = []
num_topics_list = np.arange(3,20)
for num_topics in range(3,20):
    lda= Lda(corpus, num_topics=num_topics,id2word = dictionary, 
             passes=20,chunksize=4000,random_state=42)
    cm = CoherenceModel(model=lda, corpus=corpus, 
                        dictionary=dictionary, coherence='u_mass')
    coherenceList_umass.append(cm.get_coherence())
    cm_cv = CoherenceModel(model=lda, corpus=corpus,
                           texts=plaques, dictionary=dictionary, coherence='c_v')
    coherenceList_cv.append(cm_cv.get_coherence())
    vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
    pyLDAvis.save_html(vis,f'pyLDAvis_{num_topics}.html')

So what is the right number of topics for our Plaques data set?

Below I’ve plotted the resulting topic coherence scores for topic numbers between 3 and 19, using both Umass and CV scores:

I’ve read that you should pick the number of topics in the range where the chart is relatively level (and as high as possible of course). After studying the above charts and looking at the produced pyLDAvis charts, I decided to choose 9 as the “best” number of topics for this dataset.

Aside: The first few times I went through this process, my coherence plots were extremely erratic and hard to interpret. I did some research and found some suggestions that it’s helpful to limit the number of features in your vector. When I went back and limited the max_features of my CountVectorizer to 2000 words, I found I got cleaner and more interpret-able results.

Examining our Uncovered Topics

Gensim gives us some tools to learn a little bit more about the topics learned in our model:

Show_topics: Show_topics() finds the top 10 keywords that contribute the most to each topic and how important that keyword is to the topic (probability of association with the topic).

topics = lda_final.show_topics(num_topics=9,formatted=False,num_words=10)

If we look at the results for topic 0 (topics[0]), we see that “ontario” is the most predictive keyword for this topic, with a probability of 1.38%.

Topic Coherence: Similarly, top_topics() orders the topics in the decreasing order of coherence score

coherence = lda_final.top_topics(doc_term_matrix,dictionary=dictionary,topn=10)

If we look at the results for the most coherent topic(coherence[0]), we see the top 10 keywords that contributed the most to this topic, their relative probability and the average overall coherence score for this topic (-0.0397). “board canada” was the most important keyword for this topic, with a probability of 2.82%.

We can turn these results into an easier to read dataframe, as well as collect the average coherence scores for plotting

# modified from https://nbviewer.jupyter.org/github/smsubrahmannian/Topic-Modeling/blob/master/code/Intrepreting%20Topic%20model.ipynbtopic2topkeywords = {}
topic2csb = {}
topic2keywords = {}
topic2csa = {}
num_topics =lda_final.num_topics
cnt =1for ws in coherence:
    wset = set(w[1] for w in ws[0])
    topic2topkeywords[cnt] = wset # set with top keywords for topic
    topic2csb[cnt] = ws[1] #avg coherence scores for each topic
    cnt +=1for ws in topics:
    # create a unique set of keywords for each topic
    wset = set(w[0]for w in ws[1])
    topic2keywords[ws[0]+1] = wset
    
for i in range(1,num_topics+1):
    for j in range(1,num_topics+1):  
        if topic2keywords[i].intersection(topic2topkeywords[j])==topic2keywords[i]:
            topic2csa[i] = topic2csb[j]finalData = pd.DataFrame([],columns=['Topic','words'])
finalData['Topic']=topic2keywords.keys()
finalData['Topic'] = finalData['Topic'].apply(lambda x: 'Topic'+str(x))
finalData['words']=topic2keywords.values()
finalData['cs'] = topic2csa.values()
finalData.sort_values(by='cs',ascending=False,inplace=True)
finalData

This sorts the topics by coherence score and lists the top 10 keywords that contribute the most to each topic.

We can plot these results

plt.figure(figsize=(8,5))
sns.barplot(data=finalData, x='Topic', y='cs')
plt.title('Coherence Score per Topic (with 12 Topics)')
plt.ylabel('Average Coherence Score')

There is no sharp drop-off in coherence score among the topics, which suggests that they are all worthwhile topics to consider.

Visualizing the model

We can use the excellent pyLDAvis tool to visualize our topics. Below I’ve selected a relevance metric of 0.3 (more on this later). The highlighted topic, 6, appears to deal primarily with architecture.

An interactive version of this chart can be seen here: https://ag2816.github.io/TorontoPlaques_pyLDAvis_9Topics.html

I found it extremely beneficial to adjust the relevance metric in the pyLDAvis chart, but was struggling with how I could extract that information to use in a dataframe. Happily, Sooraj Subrahmannian’s post showed that we can extract this information by storing the visualization as an object and calling a function such as the following:

def get_relevant_words(vis,lam=0.3,topn=10):
    a = vis.topic_info
    a['finalscore'] = a['logprob']*lam+(1-lam)*a['loglift']
    a = a.loc[:,['Category','Term','finalscore']].groupby(['Category'])\
    .apply(lambda x: x.sort_values(by='finalscore',ascending=False).head(topn))
    a = a.loc[:,'Term'].reset_index().loc[:,['Category','Term']]
    a = a[a['Category']!='Default']
    a = a.to_dict('split')['data']
    d ={}
    for k,v in a: 
        if k not in d.keys():
            d[k] =set()
            d[k].add(v)
        else:
            d[k].add(v)
    finalData = pd.DataFrame([],columns=['Topic','words with Relevance'])
    finalData['Topic']=d.keys()
    finalData['words with Relevance']=d.values()
    return finalDataget_relevant_words(vis,0.3).merge(finalData,how='left',on ='Topic').sort_values(by='cs',ascending=False)

This results in the table shown below, where we can see the original “top” words for each topic (“words” column) and compare those with the tuned most-relevant terms found by pyLDAvis (“words with relevance”). Before I limited the max-features of the CountVectorizer to 2000 words, changing the relevance metric made a BIG difference in the interpret-ability of the results. But, with this more limited dataset, I don’t see as big a difference. We can see a bit of an improvement for topic 1, where keywords like “humber”, “lake”, “fort york”, “lake ontario” have been emphasized over the more generic keywords like “west”, “york”, heritage trust” etc

So, finally, what are the top Topics discussed in Toronto Historical Plaques?

Based on the above analysis and my own interpretation, it appears that the top topics discussed in Toronto Plaques are:

Historical Sites (hardly surprising!)
Religion (with a bit of Scarborough thrown in for fun)
Architecture
Waterways / Fort York / Early Toronto History
Art / Education
Transit
Business
1837 Rebellion (and William Lyon Mackenzie) / Justice
Maple Leaf / Hockey / Sport

Further Digging

Borrowing from the Machine Learning Plus Gensim tutorial, we can find the dominant topic for each of our plaques

# source: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
def format_topics_sentences(ldamodel=lda_final, corpus=corpus, texts=plaques):
    # Init output
    sent_topics_df = pd.DataFrame()# Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # prints the topics for each document and their percent contribution
        # i.e. plaque 1 is 95% topic 1 and 4% topic 8 [(1, 0.9483137), (8, 0.043318484)]
        # plaque2 is 39% topic 1, 37% topic 2 and 22% topic 3 [(1, 0.3942705), (2, 0.36660466), (3, 0.22493163)]
        
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']# Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_final, corpus=corpus, texts=plaques)# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Plaque_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']# Show
df_dominant_topic.head(10)

If we examine the results for our first plaque, the one relating to the Beatles, we can see it is 95% considered to be made up of “topic 1.0”. Closer examination reveals that this matches with “Topic 2” from our pyLDAvis analysis above and corresponds to our assigned topic label of “Maple Leaf / Hockey”. Hmm… on one hand, this isn’t bad since the plaque was about the Beatles playing AT Maple Leaf Gardens; however, it has completely ignored what I would subjectively consider the dominant topic of “music”!

How about our 2nd plaque, which is about Alexander Muir and the Maple Leaf Forever song?

maple tree on the southwest corner of Laing Street and Memory Lane, a block south of Queen Street, is reputed to be the tree that inspired Alexander Muir to compose the song “The Maple Leaf Forever” in 1867. For many years it became like a second national anthem before fading away. I have printed the words to the song at the very bottom of this page. Reading those words may give you some idea as to why the song is no longer popular. A 1958 Grand Orange Lodge of British America plaque is near the tree and reads as follows: Principal of nearby Leslieville Public School who was inspired to write Canada\’s national song “The Maple Leaf Forever” by the falling leaves of this sturdy maple tree.

This plaque has also been assigned the topic of “Maple Leaf / Hockey” — which in this case, is certainly appropriate!

Conclusion

This was a very interesting exercise to work through and showed me how influential the pre-processing steps to the final model and how parts of the process seem to be fairly subjective. Nonetheless, in general, the same core subjects tended to keep coming up — the big difference was whether they were on their own or grouped in with another subject.

The source code for this project can be found on my github account (here and here)

Resources

Learn to Find Topics in a Text Corpus

Be it customer reviews, news articles or conversations between people, when we are tasked with the ordeal of having to…

medium.com

Jupyter Notebook Viewer

Check out this Jupyter notebook!

nbviewer.jupyter.org

PacktPublishing/Hands-On-Machine-Learning-for-Algorithmic-Trading

Notebooks, resources and references accompanying the book Machine Learning for Algorithmic Trading …

github.com