Topic Modeling with spaCy and Gensim

Cole Miller
3 min read · Mar 22, 2019


The Situation:

As a concerned citizen, I want to better understand the politics that affect me. While it’s easy enough to get an idea of what’s going on at the national level (i.e., Congress), state governments can be trickier to follow. Critically, state governments are far more productive, passing many more laws than Congress, so part of the problem is simply keeping up with the volume of information.

When I realized how little I knew about what my state government was up to, I knew I had to do something. So first I pulled the active bills from the NY Senate using the Open States API.
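
To give a flavor of that step, here’s a minimal sketch using the requests library against the Open States v1 REST endpoint. The exact URL, parameters, and response shape here are assumptions from memory, not a verified recipe; check the Open States docs for your version of the API.

```python
import os
import requests

# Hypothetical sketch: pull active NY bills from the Open States v1 API.
# The endpoint and parameters are assumptions; consult the current docs.
API_KEY = os.environ["OPENSTATES_API_KEY"]

resp = requests.get(
    "https://openstates.org/api/v1/bills/",
    params={"state": "ny", "search_window": "session", "apikey": API_KEY},
)
resp.raise_for_status()

bills = resp.json()  # a list of dicts, one per bill
print(f"Fetched {len(bills)} bills")
```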

Wow, over 15,000 bills! That’s way too many to make sense of. If only I could put them into categories, maybe I’d be able to gain some insight…

Unfortunately, without much policy background, I don’t know what those categories would even be! Fortunately, there is a process for deriving topics from raw text: it’s called topic modeling.

Tokenization:

The first step in natural language processing is really preprocessing. Generally, we want to reduce our raw text to a list of ‘important words’: split the text into tokens, drop the noise (stop words, punctuation, numbers), and normalize what’s left. This process is called tokenization, and spaCy is a great tool to accomplish it.
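
Here’s a sketch of that kind of spaCy pipeline. The specific filters, keeping only alphabetic, non-stop-word tokens and lemmatizing them, are one reasonable choice rather than the only one.

```python
import spacy

# Small English model; install with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def tokenize(text):
    """Reduce raw text to a list of lowercase lemmas,
    dropping stop words, punctuation, and numbers."""
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if tok.is_alpha and not tok.is_stop]

print(tokenize("An act to amend the tax law, in relation to property taxes"))
# roughly: ['act', 'amend', 'tax', 'law', 'relation', 'property', 'tax']
```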

The Model:

There are many algorithms used for topic modeling. I ended up using a popular generative statistical model called Latent Dirichlet Allocation (LDA). To implement the model, I used Gensim, a Python library for topic modeling. Gensim’s LDA model needs three key inputs: a corpus, a dictionary, and the number of topics.
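
Assuming every bill has already been reduced to a token list (tokenized_docs below, a list of lists of strings from the spaCy step), wiring those three inputs together looks roughly like this:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# tokenized_docs: one token list per bill, from the spaCy step above
dictionary = Dictionary(tokenized_docs)

# Trim very rare and very common words; these thresholds are illustrative
dictionary.filter_extremes(no_below=5, no_above=0.5)

# Bag-of-words representation of each document
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda = LdaModel(corpus=corpus,
               id2word=dictionary,
               num_topics=15,
               passes=10,
               random_state=42)
```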

Here’s what the results look like.
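
One quick way to dump them from the fitted model is Gensim’s print_topics:

```python
# Each topic comes back as a weighted list of its most probable words
for topic_id, words in lda.print_topics(num_topics=15, num_words=6):
    print(topic_id, words)
```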

This output shows our 15 topics, each as a weighted list of its most probable words; the weight is the probability of that word appearing given the topic. This is a great start to understanding our model, but we can do much better with visualization, courtesy of pyLDAvis.
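
The pyLDAvis call is essentially a one-liner. One caveat: the Gensim helper module has been renamed across releases (pyLDAvis.gensim in older versions, pyLDAvis.gensim_models in newer ones), so adjust the import to whatever version you have installed.

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # pyLDAvis.gensim in older releases

vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_15_topics.html")
# or, in a notebook: pyLDAvis.display(vis)
```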

Naturally, there is a whole slew of hyperparameters you can pass as kwargs. I ended up fitting three models to compare different settings, Goldilocks style. Using pyLDAvis, comparing models is much easier.
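
Continuing from the snippets above, the three-model sweep might look like this; the topic counts here are placeholders for illustration, not the exact settings I compared.

```python
# Fit three models with different topic counts, Goldilocks style.
# These counts are illustrative; tune them for your own corpus.
models = {k: LdaModel(corpus=corpus, id2word=dictionary,
                      num_topics=k, passes=10, random_state=42)
          for k in (5, 15, 30)}

for k, model in models.items():
    vis = gensimvis.prepare(model, corpus, dictionary)
    pyLDAvis.save_html(vis, f"lda_{k}_topics.html")
```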

pyLDAvis views of the three models, from left to right: too many topics, just right, too few
