FT Article Clustering
Using unsupervised machine learning to label FT articles
Why article clustering is important
The FT is one of the largest providers of financial news in the world. We publish hundreds of articles every day. One of the most challenging tasks is categorising these articles consistently.
While FT journalists tag the articles manually, it is hard to ensure that similar articles will have the same tag. Having consistent labels attached to articles is very important when we want to use them for machine learning models and analysis of customer reading trends.
To make these classifications, the only data available to us is article text. We do not have a fixed set of labels indicating how articles should be classified.
Even if we were to label these articles manually, new tags emerge organically over time. A story such as the Coronavirus pandemic may begin as a health concern and mutate into a geopolitical topic. It would be time-consuming to manually prepare a training data set, retrain a model every time new trends emerge, monitor reading trends, and add relevant new labels.
We decided to use unsupervised machine learning to resolve the above issues. The idea was to allow a model to find similarities between stories that may be difficult or impossible for a human to identify, and to assign articles into groups that will be useful for both Editorial and Data Analytics.
Before we started experimenting with different methods, we set the following criteria for success:
- Clusters are labeled in a consistent manner
- New trends can be captured by the labeling algorithm
- There is little/no delay between article publication and labeling
- There is some control over how clusters are labeled
We experimented with many approaches, such as keyword extraction, topic modeling, and vector representation. After weighing the pros and cons and analysing our results, we decided to label articles in two steps: first represent each article as a column vector with 30 elements, then group articles whose vectors are close to one another.
Both of these steps are based on unsupervised machine learning. Because there are no ground-truth labels, it is hard to assess the quality of such models with standard metrics such as confusion matrices. Therefore, at each step, we performed a qualitative analysis of results to make sure that we used the correct methodology and hyperparameters.
Representing an article as a vector
As the only input to our model is article text, and clustering algorithms require numerical inputs, we needed a numerical representation of each article before we could group similar content. We decided to use paragraph vectorization to represent each article as a column vector.
We treat each article as a separate data point and train the algorithm on the set of historical FT articles.
To vectorise articles, we used doc2vec, the document embedding method proposed by Le and Mikolov (2014). In particular, we use the Distributed Memory version of the Paragraph Vector model (PV-DM). Put simply, this algorithm turns the vectorisation problem into a classification problem. It samples a word from a sentence, treats that word as the variable to predict, and uses the nearby words, together with a vector for the paragraph itself, as predictors. The paragraph vector learned along the way becomes the representation of the article.
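To make the "classification problem" framing concrete, here is a minimal sketch of how PV-DM style training examples can be formed from one article. It only builds the (paragraph id, context words, target word) triples and does not train a network. The function name `pv_dm_examples` and the `window` parameter are illustrative assumptions, and for simplicity it visits every word position instead of sampling.

```python
from typing import Iterator, List, Tuple


def pv_dm_examples(
    doc_id: int, words: List[str], window: int = 2
) -> Iterator[Tuple[int, List[str], str]]:
    """Yield (paragraph id, context words, target word) training examples.

    In PV-DM the paragraph id and the context words are the predictors,
    and the target word is the class to predict.
    """
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip one-word documents, which have no context
            yield doc_id, context, target


# Toy article with a hypothetical id of 0.
examples = list(pv_dm_examples(0, "markets fell as inflation rose".split(), window=1))
for example in examples:
    print(example)
```

Every example carries the same paragraph id, which is what lets the model learn one vector per article while it learns to predict words.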
We tuned the following hyperparameters: type of algorithm, vector size, number of epochs, and minimum number of words within each article. We tested model performance by counting the keywords shared by nearest-neighbour articles, with nearest neighbours determined by the shortest Euclidean distance between vectors. Accuracy, assessed through both this shared-keyword analysis and manual spot-checking, came to 80%.
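The shared-keyword check can be sketched as follows. This is a minimal illustration, not our production scoring: it assumes that an article counts as a "hit" when it shares at least one keyword with its nearest neighbour, and the function names and toy data are hypothetical.

```python
import math
from typing import List, Set


def nearest_neighbour(idx: int, vectors: List[List[float]]) -> int:
    """Index of the closest other vector by Euclidean distance."""
    others = (j for j in range(len(vectors)) if j != idx)
    return min(others, key=lambda j: math.dist(vectors[idx], vectors[j]))


def shared_keyword_rate(vectors: List[List[float]], keywords: List[Set[str]]) -> float:
    """Fraction of articles sharing at least one keyword with their nearest neighbour."""
    hits = sum(
        1
        for i in range(len(vectors))
        if keywords[i] & keywords[nearest_neighbour(i, vectors)]
    )
    return hits / len(vectors)


# Toy data: two close pairs, only the first pair shares a keyword.
vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
keywords = [{"inflation", "rates"}, {"rates", "bonds"}, {"vaccine"}, {"football"}]
rate = shared_keyword_rate(vectors, keywords)
print(rate)
```

A low rate for a given hyperparameter setting suggests that articles close together in vector space are not about similar subjects.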
The trained doc2vec model is stored in an AWS S3 bucket. We vectorise newly published articles using an AWS Lambda. Once a new article is published, we get a notification through a notifications API, which triggers doc2vec execution. We obtain the article text using the Enrichment Content API and get the latest model from the S3 bucket. The AWS Lambda includes a Python script, which uses the trained model and the article text to vectorise the article, and loads the resulting vector into a BigQuery table for further analysis.
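The flow above can be sketched as a Lambda-style handler. Everything here is an assumption made for illustration: the event field `article_id`, the three injected callables standing in for the S3 download, the content API call, and the BigQuery insert, and a model object exposing an `infer_vector` method (the method name gensim's Doc2Vec uses). Injecting the dependencies keeps the sketch runnable without any cloud services.

```python
from typing import Any, Callable, Dict


def make_handler(
    load_model: Callable[[], Any],
    fetch_text: Callable[[str], str],
    write_row: Callable[[Dict[str, Any]], None],
) -> Callable[..., Dict[str, Any]]:
    """Build a Lambda-style handler from injected (hypothetical) dependencies."""

    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
        article_id = event["article_id"]  # hypothetical event field
        text = fetch_text(article_id)     # stands in for the content API call
        model = load_model()              # stands in for fetching the model from S3
        vector = model.infer_vector(text.lower().split())
        row = {"article_id": article_id, "vector": list(vector)}
        write_row(row)                    # stands in for the BigQuery insert
        return row

    return handler


class StubModel:
    """Stand-in for a trained model; returns a fake two-element vector."""

    def infer_vector(self, words):
        return [float(len(words)), 0.0]


rows = []
handler = make_handler(lambda: StubModel(), lambda _id: "Markets fell today", rows.append)
result = handler({"article_id": "abc-123"})
print(result)
```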
Finding groups of similar articles
We perform the initial clustering on the vectorised articles. We compared different types of clustering algorithms, such as k-means, hierarchical, hybrid, and density-based clustering, and tuned the hyperparameters of each. We chose top-down hierarchical clustering, which performed best on measures such as connectivity, the Dunn index, and the Silhouette statistic, as well as in qualitative validation of clusters.
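As an example of one of these validation measures, the Dunn index is the smallest distance between points in different clusters divided by the largest distance between points in the same cluster, so higher values indicate compact, well-separated clusters. The sketch below is a plain-Python illustration on toy data, with hypothetical cluster names, and assumes at least two clusters and at least one cluster with two or more points.

```python
import math
from itertools import combinations
from typing import Dict, List


def dunn_index(clusters: Dict[str, List[List[float]]]) -> float:
    """Smallest between-cluster distance over largest within-cluster diameter."""
    # Largest distance between two points of the same cluster.
    diameter = max(
        math.dist(a, b)
        for points in clusters.values()
        for a, b in combinations(points, 2)
    )
    # Smallest distance between two points of different clusters.
    separation = min(
        math.dist(a, b)
        for (_, first), (_, second) in combinations(clusters.items(), 2)
        for a in first
        for b in second
    )
    return separation / diameter


toy_clusters = {
    "markets": [[0.0, 0.0], [0.0, 1.0]],
    "politics": [[3.0, 0.0], [3.0, 1.0]],
}
score = dunn_index(toy_clusters)
print(score)
```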
The next step was to find a way to continuously update clusters. As the content of articles changes over time, new clusters should be created, and old clusters can become either irrelevant or change their content.
We used the following algorithm to detect new trends and update existing clusters (Figure 3):
- Find articles that are outliers to existing clusters
- Create new clusters from outliers using hierarchical clustering. Determine the number of clusters based on the height cut-off point of the dendrogram
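The two steps above can be sketched in plain Python. This is an illustration under stated assumptions, not our implementation: an outlier is taken to be an article farther than `max_dist` from every existing centroid, and the dendrogram cut uses single linkage (where cutting at a height is the same as joining all points within that distance of each other) instead of the top-down method we use. All names and thresholds are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def find_outliers(articles: List[Vector], centroids: List[Vector], max_dist: float) -> List[Vector]:
    """Articles farther than max_dist from every existing centroid."""
    return [a for a in articles if all(math.dist(a, c) > max_dist for c in centroids)]


def cut_single_linkage(points: List[Vector], height: float) -> List[List[Vector]]:
    """Clusters from cutting a single-linkage dendrogram at `height`.

    Equivalent to linking every pair of points within `height` of each
    other and taking the connected components.
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        stack = [unvisited.pop()]
        component = []
        while stack:
            i = stack.pop()
            component.append(i)
            near = [j for j in unvisited if math.dist(points[i], points[j]) <= height]
            for j in near:
                unvisited.remove(j)
            stack.extend(near)
        clusters.append([points[i] for i in sorted(component)])
    return clusters


centroids = [[0.0, 0.0], [10.0, 10.0]]
articles = [[0.5, 0.2], [20.0, 20.0], [20.5, 20.0], [-15.0, 3.0], [9.8, 10.1]]
outliers = find_outliers(articles, centroids, max_dist=2.0)
new_clusters = sorted(cut_single_linkage(outliers, height=1.0))
print(new_clusters)
```

The number of new clusters is not chosen in advance: it falls out of the cut height, so a lower height yields more, tighter clusters.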
This approach seemed the most intuitive, as we used a similar methodology for the initial clustering. The only difference is how the number of clusters is determined. For the initial clustering, we used a fixed number of clusters recommended by stakeholders. For updates, we cut the dendrogram at a fixed height, which we determined by parameter tuning during back-testing and by analysing the distribution of distances.
We also experimented with updating the existing clusters by:
- Running k-means reclustering, where we initialized centroids with the past centroids
- Re-computing cluster centroids with newly added articles
When validating results and tuning parameters, we were not able to find a version that met our requirements for cluster stability and consistency over time. Hence, we decided to keep the vectors of old clusters fixed and only add new clusters formed from outliers.
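To illustrate the kind of movement that fixing the vectors avoids, here is a minimal sketch of the centroid re-computation option. The function name and the toy numbers are hypothetical. It folds new articles into an existing centroid as a weighted mean, and shows how a handful of new articles shifts the centre of a cluster.

```python
from typing import List


def updated_centroid(old: List[float], old_count: int, new_points: List[List[float]]) -> List[float]:
    """Mean of the old members and the new points, computed from the old centroid."""
    total = old_count + len(new_points)
    return [
        (old[d] * old_count + sum(p[d] for p in new_points)) / total
        for d in range(len(old))
    ]


# A cluster of 8 articles centred at the origin absorbs 2 articles at (5, 0).
centroid = [0.0, 0.0]
moved = updated_centroid(centroid, 8, [[5.0, 0.0], [5.0, 0.0]])
print(moved)
```

Repeated over many updates, shifts like this can change which articles a cluster attracts, which makes labels harder to keep consistent over time.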
Does it work?
We continuously validated and backtested our model at each stage of development. However, this alone does not guarantee that the final result is useful. Our final test was to check whether the clusters react as expected to given news events. For instance, did most of the articles that journalists tagged with COVID-19 end up in a reasonable cluster?
Figure 4 shows the proportion of articles with COVID-19 tags published within the first wave of the pandemic. We found that clusters related to health and trade are the clusters with the highest proportion of COVID-19 articles. This looks reasonable, given that the FT has a mostly financial focus. We repeated this exercise for other big events and tags, which validated our clusters.
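A check of this kind can be sketched in a few lines. The sketch assumes the proportion is measured within each cluster (tagged articles divided by all articles in that cluster). The cluster names, tags, and data are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def tag_share_by_cluster(articles: List[Tuple[str, Set[str]]], tag: str) -> Dict[str, float]:
    """For each cluster, the fraction of its articles carrying `tag`."""
    totals = Counter(cluster for cluster, _ in articles)
    tagged = Counter(cluster for cluster, tags in articles if tag in tags)
    return {cluster: tagged[cluster] / totals[cluster] for cluster in totals}


# Each entry is (cluster label, set of journalist tags); all values are toy data.
toy_articles = [
    ("health", {"COVID-19"}),
    ("health", {"COVID-19", "NHS"}),
    ("health", {"NHS"}),
    ("trade", {"COVID-19"}),
    ("trade", {"Tariffs"}),
    ("sport", {"Football"}),
]
shares = tag_share_by_cluster(toy_articles, "COVID-19")
print(shares)
```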
A very important element of any data science model is the set of bots that monitor whether jobs are running correctly, check the quality of outputs, and notify developers and stakeholders if there is something unusual.
We set up three notifications for our article clustering bot:
- Check if new clusters have been added
- Check the quality of clusters
- Check that the job is running on schedule
We set up a separate Slack channel that notifies our Editorial stakeholders when new clusters have been created (Figure 5). In that case, our stakeholders are prompted to go to the relevant tool, check the cluster content, and use it to update the cluster names. Stakeholders use their own judgment when naming clusters, supported by automatically suggested labels created using NLP techniques.
Another monitoring point is the cluster quality monitoring dashboard (Figure 6). It checks the number of overlapping words between cluster labels and the articles within each cluster, summary statistics of cluster quality such as the average Euclidean distance between centroids, and the stability of cluster vectors over time. If any of these statistics deviates from its expected range, the data science team gets a notification in the Slack channel and one of our team members investigates the issue.
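Two of these statistics, and the out-of-range check that triggers a notification, can be sketched as below. The function names, the expected ranges, and the toy values are assumptions made for illustration. Sending the actual Slack message is left out.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def mean_centroid_distance(centroids: List[List[float]]) -> float:
    """Average Euclidean distance over all pairs of centroids."""
    pairs = list(combinations(centroids, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def label_overlap(label: str, article_words: List[str]) -> int:
    """Number of distinct label words that also appear in the article words."""
    return len(set(label.lower().split()) & {w.lower() for w in article_words})


def out_of_range(metrics: Dict[str, float], expected: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of metrics that fall outside their (low, high) expected range."""
    return [
        name
        for name, value in metrics.items()
        if not expected[name][0] <= value <= expected[name][1]
    ]


metrics = {
    "mean_centroid_distance": mean_centroid_distance([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]),
    "label_overlap": label_overlap("central banks", ["Central", "banks", "raise", "rates"]),
}
# Hypothetical expected ranges, e.g. derived from back-testing.
expected = {"mean_centroid_distance": (8.0, 12.0), "label_overlap": (1, 10)}
alerts = out_of_range(metrics, expected)
print(alerts)
```

Any name returned by `out_of_range` would be the trigger for a notification to the team.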
Integrated Data Science, FT Product & Tech