Using Topic Modelling to Analyse 10-K Filings

David Ardagh · Published in auquan · 8 min read · May 23, 2020

To see the full article, with accompanying code, please go to: https://blog.quant-quest.com/using-topic-modelling-to-analyse-10-k-filings/


In this article we explore the subject of topic modelling, as well as an application: analysing trends in company 10-K filings.

A Google Colab file with all the code can be found here. We recommend you open it to see all the details: https://blog.quant-quest.com/using-topic-modelling-to-analyse-10-k-filings-notebook/

What is Topic Modelling?

Topic modelling is a subtask of natural language processing and information extraction from text. The aim is, for a given corpus of text, to model the latent (hidden, underlying) topics present in it. Once you know the topics being discussed in the text, various further analyses can be carried out: using the topics as features, for example, you can perform classification, trend analysis or visualisation tasks. This makes topic modelling a useful tool in a data scientist’s toolbox.

Latent Dirichlet Allocation (LDA) is commonly used for topic modelling due to its ease of implementation and computation speed. If we break the term down a little: “latent” means unobserved; “Dirichlet” refers to the Dirichlet distribution the model uses (named after the German mathematician Peter Gustav Lejeune Dirichlet); and “allocation” refers to the nature of the problem, that of allocating latent topics to chunks of text.

An intuitive way to understand how topic modelling works is that the model imagines each document contains a fixed number of topics, and that certain words are associated with each topic. A document can then be modelled as a set of topics generating the words associated with them. For example, a document discussing Covid-19 and its impact on unemployment can be modelled as containing the topics “Covid-19”, “economics”, “health” and “unemployment”. Each of these topics has a specific vocabulary associated with it, which appears in the document. The model knows the document isn’t discussing the topic “trade” because words associated with “trade” do not appear in the document.

The underlying mathematics behind LDA is beyond the scope of this article, but it relies on variational Bayesian inference. We also touch on the MALLET model, which is similar to LDA but has some advantages over the original.

Now that you have an idea of what topic modelling is and how it can be used, let’s explore how to apply it to the 10-K filings of some major tech companies.

Setup

Code snippets will be provided throughout the article to show how these ideas are implemented, but much will be left out in the interest of space. The full detail can be found in the notebook.

To download the data, we use a Python package called “sec-edgar-downloader”, which can be pip installed: https://pypi.org/project/sec-edgar-downloader/. For this article we look at some major tech companies: Alphabet, Microsoft, Amazon, IBM and Nvidia.
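A minimal download sketch is shown below. Note the `Downloader` constructor and the `get` keyword arguments have changed between package versions (recent releases also require a company name and email address for SEC rate-limiting compliance), so treat the exact call as an assumption to check against the version you install:

```python
# Tickers for the companies analysed in the article
TICKERS = ["GOOGL", "MSFT", "AMZN", "IBM", "NVDA"]

def download_10ks(tickers, dest="./filings", n_filings=5):
    # Imported here so the rest of the pipeline runs without the package
    from sec_edgar_downloader import Downloader  # pip install sec-edgar-downloader
    dl = Downloader(dest)  # newer versions: Downloader("MyCompany", "me@example.com", dest)
    for ticker in tickers:
        # keyword is `amount` in some versions, `limit` in others
        dl.get("10-K", ticker, amount=n_filings)

# download_10ks(TICKERS)  # uncomment to fetch the filings from EDGAR
```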

After downloading the data, we can produce a simple visualisation to get a better idea of the content of these 10-K filings. We define a function to create a word cloud given a Pandas Series object.
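A sketch of such a function, assuming the `wordcloud` and `matplotlib` packages are installed (the imports are kept inside the function so the rest of the pipeline does not depend on them):

```python
def series_to_text(series) -> str:
    # Concatenate all documents in the Series (or any iterable of strings)
    return " ".join(str(doc) for doc in series)

def plot_wordcloud(series, max_words=100):
    from wordcloud import WordCloud  # pip install wordcloud
    import matplotlib.pyplot as plt

    wc = WordCloud(width=800, height=400, background_color="white",
                   max_words=max_words).generate(series_to_text(series))
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()
```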

If we run this on the entire dataset, we get the following output:

We can also compare word clouds between companies. For example, here is the word cloud for Alphabet:

And for Microsoft:

Just by visualising the data from Alphabet vs Microsoft, we can see that Microsoft seems to talk more about their services and products, while Alphabet seems to be more concerned about macroeconomic factors.

As with all NLP tasks, the text data needs to be cleaned and preprocessed to make it useful. We can apply regular expressions to filter out a lot of the junk, as well as removing stop words: simple words that do not add any meaning to the document and hence just generate noise.
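A minimal cleaning sketch using only the standard library (the notebook's actual regexes are tuned to SEC-filing boilerplate, so treat these patterns as illustrative):

```python
import re

def clean_text(text, stop_words):
    # Lower-case, strip everything except letters and whitespace,
    # then drop stop words and very short tokens
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return [t for t in text.split() if t not in stop_words and len(t) > 2]

clean_text("Revenue grew 12% in 2019, and the cloud segment expanded.",
           {"the", "and", "in"})
# → ["revenue", "grew", "cloud", "segment", "expanded"]
```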

spaCy comes with a predefined list of stop words, which can be accessed as follows:
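The list can be accessed without loading a language model; the extra words added here are illustrative domain-specific examples, not part of spaCy:

```python
from spacy.lang.en.stop_words import STOP_WORDS

stop_words = list(STOP_WORDS)
# Append any domain-specific stop words (illustrative additions)
stop_words.extend(["inc", "llc", "fiscal"])
```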

This is simply a Python list of stop words (the exact count varies between spaCy versions). If there are any idiosyncratic stop words related to your application, you can append them to this list before filtering them out.

Bag of Words

From there we tokenise the text and use the Bag of Words (BoW) model to represent it in a form the computer can work with. The BoW model has two components: the vocabulary and the frequency of each term (or some measure thereof). Finally, we can filter out both extremely high and extremely low frequency words: low frequency words are removed to reduce the chance of overfitting, and high frequency words because they are usually not very informative and can hide the signal.

There are two primary components that go into preprocessing the text for the LDA model: the bigrams and an id2word mapping. The bigram function serves to automatically detect (using the gensim.models.Phrases method) which words ought to be grouped as a phrase. For example, in our context “original” and “equipment” are concatenated to “original_equipment”. This is useful both for reducing noise and for creating better features. The bi_min parameter requires at least 3 instances of the concatenated phrase in the corpus before confirming it as a valid phrase. We carry out this operation with the following function:

The Phrases method learns these bigrams from the list of words; the Phraser method then exports them into a frozen model to which no further updates are possible, meaning less RAM usage and faster processing.

Now, for the id2word mapping, we take the list of bigrams, where each element is the bigram representation of a document, and feed it into Gensim’s Dictionary method. This creates a mapping between each token (as it is now called after the bigram conversion) and an id for that token. We also filter the dictionary, so that tokens must appear in at least “no_below” documents and in no more than a “no_above” fraction of documents. Finally, we convert all of our documents into a “corpus”, where each document is represented by a list of (id, frequency) tuples. The id comes from the id2word mapping, and the frequency is calculated from how many times that id appears in the document.

For example, the id of “microsoft” is:

So the word “microsoft” has id 539, and since we only passed the word in once, it has a frequency of 1. If we had passed in [“microsoft”, “microsoft”], we would have gotten [(539, 2)]. More generally, the union of two documents is the disjoint union, summing the multiplicities of each element (token id): https://en.wikipedia.org/wiki/Bag-of-words_model
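This multiset behaviour can be seen with the standard library's `collections.Counter`, which models exactly the BoW disjoint union:

```python
from collections import Counter

doc_a = Counter(["microsoft", "cloud", "cloud"])
doc_b = Counter(["microsoft", "azure"])

# Disjoint union: multiplicities of shared tokens are summed
merged = doc_a + doc_b
# merged == Counter({"cloud": 2, "microsoft": 2, "azure": 1})
```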

It is important to be aware of the BoW assumptions, predominantly that there are no relationships between words, and that a document’s meaning is solely composed of which words it contains, and not the order. This is of course highly unrealistic, however it seems to be a simplifying assumption that works quite well in a lot of cases.

It is also important to note that we do not need to use tf-idf (term frequency-inverse document frequency) weighting, because LDA addresses term frequency issues by construction.

Mallet

Mallet (Machine Learning for Language Toolkit), is a topic modelling package written in Java. Gensim has a wrapper to interact with the package, which we will take advantage of.

The difference between the LDA model we have been using and Mallet is that the original LDA uses variational Bayesian inference, while Mallet uses collapsed Gibbs sampling. The latter is more precise but slower. In most cases Mallet performs much better than the original LDA, so we will test it on our data. As shown in the notebook, Mallet dramatically increases our coherence score, demonstrating that it is better suited to this task than the original LDA model.

To use Mallet in Google Colab, we need to go through a few extra steps. First we install Java (in which Mallet is written).

Then we download Mallet and unzip it:

Finally, we set the path to the Mallet binary:
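The three setup steps look roughly like the following shell commands (in a Colab cell, prefix each with `!`; the Mallet version and install paths are assumptions):

```shell
# 1. Install a Java runtime (Mallet is written in Java)
apt-get install -y openjdk-8-jdk

# 2. Download and unzip Mallet
wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
unzip -q mallet-2.0.8.zip -d /content

# 3. Point the environment at the Mallet install
export MALLET_HOME=/content/mallet-2.0.8
```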

To use the model, we call the Gensim wrapper, specifying the path to the Mallet binary as the first argument. The other arguments include the training corpus, the number of topics (a hyperparameter) and the id2word mapping.
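A sketch of the wrapper call; note the `gensim.models.wrappers` module exists only in gensim versions before 4.0, and the default `mallet_path` below assumes the Colab install location from the previous steps:

```python
def build_mallet_lda(corpus, id2word, num_topics=7,
                     mallet_path="/content/mallet-2.0.8/bin/mallet"):
    # gensim < 4 only: the wrappers module was removed in gensim 4.x
    from gensim.models.wrappers import LdaMallet
    return LdaMallet(mallet_path, corpus=corpus,
                     num_topics=num_topics, id2word=id2word)
```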

The last thing we need to do before we can use the model is to convert it to the Gensim format, like so:
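The conversion uses a gensim helper (again, available only in gensim < 4):

```python
def to_gensim_lda(mallet_model):
    # Converts the Mallet wrapper output into a standard gensim LdaModel,
    # so downstream tools like pyLDAvis can consume it
    from gensim.models.wrappers.ldamallet import malletmodel2ldamodel
    return malletmodel2ldamodel(mallet_model)
```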

Now we can use the model just like we would use the original LDA model!

Note — Here is an excellent guide to using Mallet with Google Colab: https://github.com/polsci/colab-gensim-mallet

Finding The Optimal Number of Topics

One hyperparameter that needs to be tuned is the number of topics the model looks for. This is a fixed constant and must be chosen by you. There is no simple recipe for this, but one approach is to create a number of models with varying numbers of topics and compute the coherence score of each. Note we are using the Mallet variation of the model.

After computing this score for a number of models, plotting the coherence score against the number of topics should reveal an elbow shaped graph. The optimal number of topics is at the elbow, where the graph starts to flatten out. Our graph was pretty noisy and didn’t have an elbow shape. Exploring the results from each configuration led to 7 topics working the best. Choosing too few topics meant that each topic was too vague and high-level. Choosing too many topics is even more problematic, because there isn’t enough data for each topic and so there is a problem of overfitting.

Using The Model

We can use a great plotting tool called pyLDAvis. As the name suggests, this enables you to visualise the topic modelling output using a number of techniques, such as dimensionality reduction.
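A usage sketch; note that pyLDAvis moved its gensim helper module between versions, so both import paths are tried:

```python
def visualise(lda_model, corpus, id2word):
    # Newer pyLDAvis versions use `gensim_models`; older ones use `gensim`
    try:
        import pyLDAvis.gensim_models as gensimvis
    except ImportError:
        import pyLDAvis.gensim as gensimvis
    import pyLDAvis

    vis = gensimvis.prepare(lda_model, corpus, id2word)
    return pyLDAvis.display(vis)  # renders the interactive view in a notebook
```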
