Predicting FT Trending Topics
Identifying signals using time-series analysis and unsupervised machine learning.
Understanding the preferences of Financial Times readers is crucial for improving user experience and maintaining engagement with our products. Accurate indicators of which areas are becoming more important can augment journalists’ work by helping them focus on topics of interest.
Trending topics prediction is a data science model built using machine learning and time-series analysis. We define article topics by an unsupervised machine learning algorithm and use time-series analysis to flag anomalies in data.
1. How can we help journalists write more relevant stories?
Over time, different topics arise reflecting the changing interests in society. The streams of data we have available across the FT contain rich information and immediate feedback about what people currently pay attention to and how they feel about certain topics. Some of these topics experience sudden increases in user popularity; these are the “trending” topics. They usually correspond to real-world events such as health events, incidents and political movements. Awareness of trending topics enables the newsroom to anticipate the changing information needs of users ahead of time and make decisions about where to allocate resources.
A time-series is a series of data points observed over time. We define a ‘signal’ as any deviation from the historical time-series pattern. To detect time-series signals, we use various in-house data sets. The biggest benefits of an in-house data set are that it is easily accessible, its marginal costs are small, the data are clean, and entries include high-quality information about our users’ activity.
The main drawback of using an in-house data set is that it does not include external signals about wider reading trends. We will not use search engine or social media trending insights: for instance, information about what’s popular on Twitter or Facebook.
Another drawback is that we cannot draw on the existing literature on signal detection from external data. This is widely discussed in finance, where companies use social media traffic to track trends and Natural Language Processing (NLP) to determine sentiment about a given topic and try to forecast stock market movements. With in-house data we don’t have this body of work to help us.
Overall, however, we believe that predicting trends using in-house data will still help us to retain existing users, as Editorial will have the information to invest in stories more relevant to user preferences.
2. Model overview
The main model run, shown in Figure 1, can be summarised in five steps:
- Gather features.
- Get assignment of articles to topic groups defined by unsupervised machine learning and NLP.
- Collect features into two data sets: one at the day level, another at the day/topic-cluster level.
- Derive Bollinger Bands, detect M/L shape signals, flag these signals.
- Surface data to stakeholders using Slack. Write signals data to Big Query.
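Put together, the five steps can be sketched as a single daily run. All function names below are hypothetical placeholders, and the production model is written in R; this Python sketch only illustrates the control flow:

```python
# Minimal sketch of one daily run of the trending-topics pipeline.
# Every dependency is passed in, so each step can be swapped or stubbed.

def run_daily_model(gather_features, assign_topics, detect_signals, notify, store):
    """Orchestrate one daily run: gather, cluster, detect, surface, persist."""
    raw = gather_features()                            # step 1: gather features
    articles = assign_topics(raw["articles"])          # step 2: topic-cluster assignment
    # step 3: two data sets -- one at the day level, one at day/topic-cluster level
    day_level = raw["day_metrics"]
    topic_level = {(a["day"], a["cluster"]): a for a in articles}
    signals = detect_signals(day_level, topic_level)   # step 4: Bollinger Bands, M/L shapes
    notify(signals)                                    # step 5: surface to stakeholders (Slack)
    store(signals)                                     # ...and write to BigQuery
    return signals
```

Passing the steps as arguments keeps the orchestration testable: each stage can be backtested or mocked independently of the live data sources.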
The model scans through thousands of data points every day. We gather referrer pageviews from various social media platforms and search engines, new subscriptions, interactions with the comment sections, and more. We also break some of this traffic down by the topic groups derived using unsupervised machine learning and NLP.
The model then looks for M-shape or inverted-L-shape signals in the time-series. These shapes are derived from an exponential moving average and confidence intervals to identify consistent outliers; we call these outliers ‘signals’.
Once a signal is identified for a given data set, we flag it and surface it to stakeholders in the form of graphs and text. The model doesn’t make any direct suggestions. It simply automates signal detection and flags the signals to stakeholders. Journalists may then check the graph, decide whether the signal is relevant, and investigate the story manually.
3. Initial analysis and backtesting
The project was created in collaboration between the Data Science and Insights teams. The Insights Team first researched the sources of possible signals and backtested a model based on a simple moving average. They then backtested the model against events such as George Floyd’s murder, COVID-19 waves, the US elections, and the UK Budget. This allowed us to limit the total number of data sources, eliminate redundant signals, and check the sensitivity of the signal.
During backtesting, Insights looked at daily pageviews (PVs) from multiple sources and checked to what extent the moving average was exceeded during relevant periods before the event.
Figure 2 shows the anonymous PVs (red solid line) against the PVs moving average (dotted line). The Insights Team looked at the instances where the solid line exceeded the dotted line. If that happened several times in a row, we interpreted it as a signal. We also looked at days without any breaking news — especially weekends and public holidays — to determine the sensitivity of this approach: we wanted to avoid false positives. A model that surfaced a large number of false-positive signals to our stakeholders would lose their trust, and could simply end up being ignored. We discovered that we sometimes picked up signals from low-volume data. Hence, we introduced volume filters to filter out small data sets.
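The backtesting check above — counting how many days in a row pageviews exceed their trailing moving average — can be sketched in a few lines. This is a Python illustration of the idea, not the exact analysis the Insights Team ran:

```python
def count_consecutive_exceedances(pageviews, window=5):
    """For each day, count how many days in a row (ending on that day)
    pageviews exceeded the trailing simple moving average."""
    streaks = []
    streak = 0
    for i, pv in enumerate(pageviews):
        if i < window:
            # Not enough history for a moving average yet.
            streaks.append(0)
            continue
        sma = sum(pageviews[i - window:i]) / window  # trailing moving average
        streak = streak + 1 if pv > sma else 0       # reset on non-exceedance
        streaks.append(streak)
    return streaks
```

A streak of several consecutive exceedances corresponds to the “several times in a row” pattern we treated as a signal during backtesting.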
4. Time series algorithm
The Insights Team performed backtesting and threshold analysis manually. We didn’t begin with a metric to measure success, or a threshold at which we should classify a result as a signal or dismiss it as a false positive. The Insights Team’s analysis helped us to narrow down useful data sets, check the sensitivity of data sources, and validate our choice of moving averages as a benchmark.
Before we fed variables into our model, we conducted a time-series correlation analysis to remove collinear data sets. We made sure that each time-series was stationary, meaning the progression of time is not a confounding variable; this was relatively easy given that we analysed short time intervals. We then checked cross-correlations using a heat-map and used our domain knowledge to eliminate redundant variables.
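One way to implement this collinearity screen is to first-difference each series (a simple step towards stationarity over short intervals) and compare pairwise Pearson correlations. This is an illustrative Python sketch, not the exact analysis we ran:

```python
import math

def pearson(a, b):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def redundant_pairs(series_by_name, threshold=0.9):
    """Flag pairs of first-differenced time-series whose absolute
    correlation exceeds a threshold -- candidates for removal."""
    # First-difference each series to remove a linear trend.
    diff = {n: [s[i + 1] - s[i] for i in range(len(s) - 1)]
            for n, s in series_by_name.items()}
    names = list(diff)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if abs(pearson(diff[a], diff[b])) > threshold]
```

In practice one of each flagged pair would be dropped using domain knowledge, as described above.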
The next step was deciding how to count signals. We decided to use an exponential moving average and confidence intervals. This methodology is widely used in algorithmic trading, where signals are detected once a value falls outside pre-defined moving-average confidence intervals. The confidence intervals are called Bollinger Bands (Bollinger, John. Bollinger on Bollinger Bands. McGraw Hill, 2002). A Bollinger Band is a line above and below the exponential moving average of the given time-series. It tells us the degree of certainty that a given value will fall between, or outside, these bands. For instance, if we draw Bollinger Bands two standard deviations from a simple 5-day moving average, we would expect roughly 95% of our data to fall within these bands (given that our data is normally distributed).
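A minimal sketch of how such bands can be computed, in Python for illustration (the `span` and `k` parameters here are assumptions, not the production values):

```python
import statistics

def bollinger_bands(values, span=5, k=2.0):
    """Exponential moving average with upper/lower Bollinger Bands
    at k standard deviations, computed over a rolling `span` window."""
    alpha = 2 / (span + 1)  # standard EMA smoothing factor
    ema, upper, lower = [], [], []
    for i, v in enumerate(values):
        # Seed the EMA with the first observation, then smooth.
        e = v if i == 0 else alpha * v + (1 - alpha) * ema[-1]
        ema.append(e)
        # Rolling standard deviation over the last `span` points.
        window = values[max(0, i - span + 1): i + 1]
        sd = statistics.pstdev(window) if len(window) > 1 else 0.0
        upper.append(e + k * sd)
        lower.append(e - k * sd)
    return ema, upper, lower
```

With `k=2.0`, a value above the upper band is the kind of outlier the model treats as a candidate signal.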
Using Bollinger Bands (rather than percentage deviation) reduced the number of arbitrary choices and helped us to mitigate the danger that initial sensitivity analysis performed by the Insights team may become irrelevant.
Once we drew Bollinger Bands, we identified cases where the daily value of a given data set exceeded the upper band. We looked at the most recent 3 days, and if the sum of signals was either 2 (for M shape signal), or 3 (for inverted L shape signal), we flagged this data set and surfaced it to our stakeholders.
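The three-day M/L flagging rule can be expressed as follows (a hypothetical function, in Python for illustration):

```python
def flag_signal(values, upper_bands):
    """Look at the most recent 3 days: 2 exceedances of the upper band
    make an 'M' signal, 3 make an 'inverted L' signal, otherwise no flag."""
    recent = [v > u for v, u in zip(values[-3:], upper_bands[-3:])]
    exceedances = sum(recent)
    if exceedances == 3:
        return "inverted L"
    if exceedances == 2:
        return "M"
    return None  # a single exceedance is treated as noise
```

Requiring at least two exceedances in the window is what filters out one-off spikes, as explained below.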
We decided on M/L shape signals to reduce the number of false positives. Having two or three signals within a 3-day window gives us more certainty that there is some unusual traffic. It is also in line with financial theory, where M shape signals are quite common due to human psychology. We adjusted the parameters and data for our model by backtesting model variations against historical events.
5. Productionising the data science model
Our trending topics model relies on two jobs deployed on the server:
- API article vectorization and article-cluster assignment for topic modeling (AWS Lambda).
- Batch deployment of the main model run which gathers data and detects signals (RStudio Connect).
We use AWS Lambda, a serverless compute service, for near real-time article vectorization. This job transforms article texts into column vectors and assigns newly published articles to topic clusters. If you want to know more about how we approached topic clustering, we described the architecture and infrastructure in our previous article.
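For illustration, cluster assignment can be as simple as picking the centroid with the highest cosine similarity to the article vector. This is a hypothetical, simplified Python version of what the Lambda job does; our earlier article describes the real approach:

```python
import math

def assign_to_cluster(article_vector, centroids):
    """Assign an article vector to the nearest topic-cluster centroid
    by cosine similarity. `centroids` maps cluster id -> centroid vector."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    best_id, best_sim = None, -1.0
    for cluster_id, centroid in centroids.items():
        sim = cosine(article_vector, centroid)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    return best_id
```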
The main model run is deployed as an R Markdown notebook on an RStudio Connect server. The batch job gathers data from the past three weeks, computes Bollinger Bands based on the moving average, and sends any signals through the Slack API.
Our team has written an R library to send clean messages and graphs to our stakeholders in Slack directly from the job deployed on the server.
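As an illustration of the kind of message such a job might produce, here is a hypothetical formatter (in Python; the format, function name, and parameters are all assumptions — our actual library is written in R and posts via the Slack API):

```python
def format_signal_message(dataset_name, signal_shape, latest_value, upper_band):
    """Build the text of a Slack alert for one flagged signal."""
    return (
        f":chart_with_upwards_trend: Signal detected in *{dataset_name}*\n"
        f"Shape: {signal_shape} | latest value {latest_value:,} "
        f"vs upper Bollinger Band {upper_band:,.0f}"
    )
```

Keeping the message short, with the data set, the signal shape, and the values side by side, lets journalists decide at a glance whether a graph is worth opening.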
6. Summary and next steps
The scheduled model scans through thousands of data points every day. Using an unsupervised machine learning algorithm ensures that we surface hard-to-spot and unique data points to stakeholders. Bollinger Bands are generalisable enough to account for the volatility of each time-series, while giving us the flexibility to choose how frequent and how strong signals should be.
This model is the first step towards predicting trending topics. There is a lot of scope for adding extensions. In the future, we may derive a more complex model that includes external data such as tweets and Google Trends. Once we collect more historical data, we may also use it to forecast trending topics with data science techniques such as ARIMA models, regression, or Long Short-Term Memory (LSTM) models.
Integrated Data Science (IDS) team