Replatforming the Lantern API
Background: What Is the Lantern API?
At the FT, our journalists are always looking to find out which stories are resonating with readers. They do this to make sure that our journalism is having the impact we want it to have. Stories that perform well may deserve more promotion, or some reflection on what it was about the story that kept our readers coming back to it. Stories that don’t hit the mark may need a nudge or a rework, or may become a lesson learned about something that our readers just didn’t find as valuable as we thought.
All of this informs not just what we do today, in terms of the stories that make it to the homepage or get promoted on social, but also what happens in the future: where do we put effort into following up, where should we perhaps start a series, and what might we just want to do less of?
A defining factor in finding the answers to these questions is audience data, powered by our internal editorial analytics viewer: Lantern, and Pink Lantern, our overview of the homepage in real time. The data to drive these tools comes from the dedicated ‘lantern-api’.
Article ‘performance’ is reported using several metrics:
- Article pageviews (number of times article pages are requested)
- Median Time on Page (having opened the article page, how long does the typical user spend there before moving on?)
- Percentage ‘Quality Reads’ (what proportion of page-views are deemed to be ‘quality’, ie. the user seems to have read a decent proportion of the article, and not just skimmed over)
- Number of significant reader interactions (eg. comments posted, social shares)
- Click-through-rate (CTR): [homepage only] for each view of the homepage, how many times is a particular article link clicked on.
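As a rough illustration of how these metrics relate to raw tracking data, here is a minimal Python sketch. The `PageView` record, its field names and the pre-computed `quality_read` flag are assumptions made for the example, not the actual Lantern schema.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PageView:
    # Hypothetical record shape; field names are illustrative only.
    article_id: str
    seconds_on_page: float
    quality_read: bool


def summarise(views: list[PageView]) -> dict:
    """Summarise the page views for one article (or any slice of traffic)."""
    if not views:
        return {"pageviews": 0, "median_time_on_page": None, "pct_quality_reads": None}
    quality = sum(1 for v in views if v.quality_read)
    return {
        "pageviews": len(views),
        "median_time_on_page": median(v.seconds_on_page for v in views),
        "pct_quality_reads": 100.0 * quality / len(views),
    }


def click_through_rate(homepage_views: int, link_clicks: int) -> float:
    """Clicks on one article link per view of the homepage."""
    return link_clicks / homepage_views if homepage_views else 0.0
```

Using the median rather than the mean for time on page keeps a handful of tabs left open for hours from distorting the figure.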
The site-wide metrics are calculated for a variety of ‘slices’ of website traffic, according to various content/reader attributes, for example:
- Editorial ‘Desk’ (section of the website)
- Content type (eg. regular article, video, special package)
- Reader type (anonymous, subscriber)
- Reader location (country/region)
- Traffic source (eg. has the page-view originated from clicking a link on a search engine results page or on social media)
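Conceptually, a ‘slice’ is a filter plus a grouping over the same underlying events. The sketch below shows the idea in Python; the event dictionaries and attribute names such as `desk` and `reader_type` are hypothetical stand-ins for the real tracking fields.

```python
from collections import Counter


def pageviews_by_slice(events, slice_by, **filters):
    """Count page views per value of `slice_by`, keeping only events
    that match every key/value pair given in `filters`."""
    counts = Counter()
    for event in events:
        if all(event.get(key) == value for key, value in filters.items()):
            counts[event.get(slice_by, "unknown")] += 1
    return counts
```

For example, `pageviews_by_slice(events, "desk", reader_type="subscriber")` would give subscriber page views per desk.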
Lantern needs access to a ‘realtime’ datasource (up-to-the-minute data), but when Lantern was first built, no such datasource was available at the FT. The FT’s data warehouse (Redshift) was populated by a daily upload process, meaning the freshest tracking data available could be as old as 24 hours.
Therefore, a new database of (almost) realtime traffic data was created especially for lantern-api’s needs. To do this, tracking data coming through the existing data pipeline was processed and dispatched to a new, dedicated database (ElasticSearch).
Why Change It? Project Objectives
- Reduce costs. The infrastructure supporting this dedicated data pipeline/database was costing ~$17k per month.
- Reduce maintenance / complexity / key-person-dependency. The dedicated pipeline was prone to blips; only one engineer was fully au fait with the set-up; the technologies used were not well understood by the wider team.
- Eradicate duplication / divergence of business data logic. We wanted, as far as possible, to use the same data and logic as employed by our BI teams, to present a consistent view of key metrics to the business.
Since Lantern was first built, our BI and Data Platform teams have made more and more use of Google’s data warehouse offering, BigQuery. Crucially, traffic data with much lower latency (10–15 minutes rather than a day) had become available in BQ (by addition of dedicated sinks from our data pipeline); therefore we were able to consider this as an alternative datasource.
Switching to BQ tables that are populated by the standard FT data pipeline, and supported by the team (‘Data Platform’) whose main focus is to provide such a service, would fulfil all three of the project objectives. Thus it was a straightforward decision, although various challenges followed…
Dependencies on other Teams
Although we had (near-enough) realtime page view data available in BQ to satisfy a lot of Lantern’s requirements, there were some data-points which were not fresh enough, eg. article comments, video views, time_on_page, quality_read. So we had a hard dependency on our Data Platform colleagues being able to add to the collection of real-time tables in BQ. They did so by creating an AWS Lambda function which reads tracking events off the existing pipeline, performs some simple transformations and writes directly into new ‘stream’ tables in BQ. The streaming lambda costs only ~$500 per month, and unlocks access to real-time data for many other teams/services.
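To give a flavour of the ‘simple transformations’ performed by the streaming lambda, here is a minimal Python sketch that flattens a raw tracking event into a row for a ‘stream’ table. The event layout, the column names and both function names are assumptions for illustration; the real lambda was built by the Data Platform team and is not shown here, and the write to BQ is deliberately left out.

```python
from datetime import datetime, timezone


def to_stream_row(event: dict) -> dict:
    """Flatten a raw tracking event into a flat row.
    The event layout and column names are hypothetical."""
    ts = datetime.fromtimestamp(event["timestamp_ms"] / 1000, tz=timezone.utc)
    context = event.get("context", {})
    return {
        "event_time": ts.isoformat(),
        "category": event["category"],
        "action": event["action"],
        "content_id": context.get("content_id"),
        "time_on_page_seconds": context.get("time_on_page"),
    }


def handle_batch(events: list[dict]) -> list[dict]:
    """Transform a batch, skipping events that lack the required fields.
    Writing the rows to the warehouse is intentionally omitted."""
    required = ("timestamp_ms", "category", "action")
    return [to_stream_row(e) for e in events if all(k in e for k in required)]
```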
Not knowing what the Data Platform team would be able to provide us, and when, made estimating projected costs/performance very difficult. Without knowing how the data could be structured in BigQuery, we could not be confident that the BQ solution would be fast or cheap enough to be feasible. This led to quite a lot of extra work, calculating estimates for various options.
The main concern about using BigQuery was one of speed. Could our API call the BQ API and get results quickly enough to satisfy clients (the main one being Lantern)? Or would we need to use some sort of (potentially pre-populated) cache?
Some queries were predictable and commonly needed (to serve the main landing pages of Lantern), so it was possible to pre-fetch these on a scheduled basis, and cache them locally. This has two benefits:
- Very quick results served back to client
- Reduced BQ query costs
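The pre-fetch-and-cache idea can be sketched as a small in-memory cache with a time-to-live. This is an illustration only: the `QueryCache` class, the `run_query` callable and the 60-second default are assumptions, not the cache actually used by lantern-api.

```python
import time
from typing import Any, Callable


class QueryCache:
    """Tiny in-memory TTL cache; `run_query` is whatever calls the warehouse."""

    def __init__(self, run_query: Callable[[str], Any], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._run_query = run_query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def prefetch(self, keys: list[str]) -> None:
        """Refresh predictable queries; intended to be called on a schedule."""
        for key in keys:
            self._entries[key] = (self._clock(), self._run_query(key))

    def get(self, key: str) -> Any:
        """Serve from the cache when fresh, otherwise run the query ad hoc."""
        entry = self._entries.get(key)
        if entry and self._clock() - entry[0] < self._ttl:
            return entry[1]
        result = self._run_query(key)
        self._entries[key] = (self._clock(), result)
        return result
```

A scheduler would call `prefetch` with the keys for the main landing pages, while every request goes through `get`, so less predictable queries still work and are cached for a short while.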
Many requests had to remain ad hoc, because they were called too infrequently to be worth caching and couldn’t be predicted anyway (there are countless combinations of filters available to the end-user in the Lantern UI).
We ended up employing a variety of techniques to improve speed:
- Caching query results.
- Pre-fetching data that is known to be commonly and frequently requested, eg. data for the main page.
- Fetching partial datasets in parallel, and then aggregating.
- Optimising the structure of database tables for the queries needed by our api.
- Pragmatism about the queries we support (eg. certain queries disregard tracking data for articles older than 5 years).
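Fetching partial datasets in parallel and then aggregating might look something like the following Python sketch. The `fetch_partition` callable stands in for a real query against one partition (for example one day or one source table); it and the function name are assumptions for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fetch_in_parallel(fetch_partition: Callable[[str], dict[str, int]],
                      partitions: list[str]) -> Counter:
    """Run one query per partition concurrently and sum the per-article counts."""
    totals: Counter = Counter()
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        for partial in pool.map(fetch_partition, partitions):
            totals.update(partial)
    return totals
```

Summing partial results only works for additive metrics such as page views; a median cannot be rebuilt from per-partition medians, so a metric like that would need a different approach.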
We estimated costs by analysing usage of the current API, to give an approximate frequency of each type of query that would need to be run (and against what date-range of data). While BQ represented a much cheaper storage solution than ElasticSearch, we had to be mindful of query costs, which could rack up if we were not careful.
Some of the techniques used to improve speed also had the effect of reducing query costs, eg.
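The shape of such an estimate is simple: for each type of query, multiply how often it runs by how much data it scans, then apply the price. The sketch below assumes on-demand pricing charged per byte processed; the profile fields and the price are placeholder inputs, not real Lantern figures.

```python
def estimate_monthly_cost(query_profiles: list[dict], price_per_tib: float) -> float:
    """Estimate monthly query spend from usage profiles.

    Each profile is a dict with hypothetical keys:
      runs_per_month: how often this type of query is expected to run
      bytes_scanned:  bytes processed by one run
    """
    total_bytes = sum(p["runs_per_month"] * p["bytes_scanned"] for p in query_profiles)
    return total_bytes / 2**40 * price_per_tib
```

Running this for several candidate table layouts, each with its own `bytes_scanned` figures, gives a like-for-like comparison between the options.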
- Optimised structure of database tables.
- Pragmatism about the queries we support (eg. rounding time-ranges to the nearest minute).
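One plausible reason rounding helps is that requests arriving a few seconds apart then map to exactly the same query, and can share a cached result instead of each triggering a new one. A minimal Python sketch, in which the function names and the cache-key format are assumptions:

```python
from datetime import datetime, timedelta


def round_to_minute(ts: datetime) -> datetime:
    """Round to the nearest minute (30 seconds and above rounds up)."""
    floored = ts.replace(second=0, microsecond=0)
    if ts - floored >= timedelta(seconds=30):
        return floored + timedelta(minutes=1)
    return floored


def cache_key(start: datetime, end: datetime) -> str:
    """Build a key from the rounded range so near-identical requests match."""
    return f"{round_to_minute(start).isoformat()}/{round_to_minute(end).isoformat()}"
```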
Some of the speed-saving techniques may have introduced extra cost, such as splitting queries up and running them in parallel. Where possible, the splitting was done by raw dataset so that it didn’t incur too much extra cost. But speed v cost was a perpetual trade-off. Likewise the trade-off with simplicity: some queries could probably have been written to run more cheaply or more quickly, but readability of the code would have suffered enormously.
Aligning Business Data Definitions
Having its own pipeline and datasource, Lantern had also evolved its own definition of a ‘desk’ (roughly speaking, the newsroom team that created the article), which was very different from the one used by non-Lantern analytics tools/reports. Agreeing on a new definition to be shared by both sets of stakeholders was by no means a trivial undertaking. Once done, though, we were able to create common ‘User Defined Functions’ in BigQuery to prevent future divergence of this business logic.
The result is a much simpler implementation, for a fraction of the cost. Speed-wise, it’s faster than the old API in certain cases (pre-fetched or already cached content) and slower in others, but not terribly so.
We have decommissioned the Lantern-specific legacy infrastructure, saving ~$13k per month in AWS costs (where most of the legacy infrastructure was running).