Pulse: The Telegraph journey towards real-time analytics

Industry Outlook

Technology enables publishers to measure the impact that a piece of content is having as soon as it becomes public. These days, reacting to this data is a vital part of promoting quality journalism in the sea of online articles competing for our attention. The real-time understanding of how a story is performing can significantly help to improve the customer experience on both our website and mobile apps. It’s important to know what our registrants and subscribers want to read and how we can deliver articles that are relevant to our audience.

The Challenge

Under this premise, in 2017 the data team was challenged to build a real-time dashboard for display in the newsroom (pictured below), showing which articles were driving registrations and subscriptions.

The first step was to identify a reliable data source on which we could build our analytics. The Telegraph’s entire website ran on Adobe Experience Manager and for this reason, we decided to consume the Adobe Livestream API in order to ingest behavioural information as soon as it was collected. Unfortunately, since no post-processing was applied to those data sets, filtering out the noise and retrieving only relevant records posed a challenge.

The First Iteration

We took an Agile approach and built a simple proof of concept (PoC) to establish that from this specific data source it was possible to extract meaningful analytics. We came up with the following design.

A poller consumes the live stream of data and, without applying any transformation, writes records one by one to Pub/Sub. A Dataflow real-time pipeline then consumes the queue and filters out irrelevant records. The rest of the data is cleaned and enriched before being uploaded into Elasticsearch. One of the beauties of Dataflow is how clean the data transformation process looks once the code is deployed on Google Cloud: a flow diagram is automatically generated showing the different logical steps implemented in the pipeline.
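The filter, clean and enrich steps can be sketched as plain functions. This is a minimal illustration: in production these would be Apache Beam transforms running on Dataflow, and the event types and field names below are assumptions, not the real Adobe Livestream schema.

```python
# Illustrative sketch of the pipeline's logical steps. Field names such as
# "page_url" and "event_type" are assumptions for the example.

RELEVANT_EVENTS = {"page_view", "registration", "subscription"}

def is_relevant(record: dict) -> bool:
    """Filter step: drop noise such as heartbeats or unknown event types."""
    return record.get("event_type") in RELEVANT_EVENTS

def clean(record: dict) -> dict:
    """Clean step: keep only the fields the dashboard needs."""
    return {k: record[k] for k in ("event_type", "page_url", "timestamp")
            if k in record}

def enrich(record: dict) -> dict:
    """Enrich step: derive extra dimensions, e.g. the site section from the URL."""
    parts = record.get("page_url", "").split("/")
    record["section"] = parts[3] if len(parts) > 3 else "unknown"
    return record

def process(stream):
    """Filter -> clean -> enrich, mirroring the generated Dataflow diagram."""
    return [enrich(clean(r)) for r in stream if is_relevant(r)]

events = [
    {"event_type": "page_view",
     "page_url": "https://example.com/news/story-1",
     "timestamp": "2017-09-01T10:00:00Z"},
    {"event_type": "heartbeat"},  # filtered out as noise
]
print(process(events))  # → one enriched record with section "news"
```

Each function maps onto one box in the auto-generated flow diagram, which is what makes bottlenecks easy to locate.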

This makes it easy to identify bottlenecks and errors in the process. All the generated logs are also automatically available in Stackdriver, which handles application monitoring and alerting.

In Elasticsearch only a rolling window of 8 days of data is kept, while the full history is available in real time in BigQuery, with the possibility of plugging Data Studio dashboards on top of it. There are multiple reasons why we decided to adopt Elasticsearch for this specific use case:

  • Knowledge of the technology. Elasticsearch had already been successfully used at The Telegraph in other solutions and we had in-house expertise.
  • It presented the possibility of using Kibana to quickly deliver a dashboard without involving any front-end developer.
  • Horizontal Scalability.
  • Low response time for the type of queries that we wanted to run.
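The eight-day rolling window can be sketched as a small index-pruning job. This assumes daily indices named `pulse-YYYY.MM.DD`, which is a hypothetical naming scheme for the example; the same effect could also be achieved with Elasticsearch's own index lifecycle tooling.

```python
# Sketch: compute which daily indices fall outside an 8-day rolling window.
from datetime import date, timedelta

def indices_to_delete(existing: list, today: date, window_days: int = 8) -> list:
    """Return index names older than the rolling window."""
    cutoff = today - timedelta(days=window_days - 1)
    stale = []
    for name in existing:
        # "pulse-2017.09.01" -> date(2017, 9, 1)
        day = date.fromisoformat(name.removeprefix("pulse-").replace(".", "-"))
        if day < cutoff:
            stale.append(name)
    return stale

today = date(2017, 9, 10)
names = [f"pulse-2017.09.{d:02d}" for d in range(1, 11)]
print(indices_to_delete(names, today))  # → the two indices older than 8 days
```

A scheduled job deleting stale indices keeps the Elasticsearch cluster small and fast, while BigQuery retains the complete history.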

Less than a couple of months after we started the proof of concept, a basic Kibana dashboard was ready.

Figures in the dashboard above are purely illustrative.

The solution went live in September 2017 and despite some limitations, it was really well received by the newsroom.

The Second Iteration

By the beginning of 2018, the product had been in use for a few months and most of the stability issues intrinsic to real-time data processing had been solved. Thanks to the high scalability of Pub/Sub and Dataflow, handling spikes in traffic on our website while showing how content was performing had become trivial.

We decided at that point to go further and build our own bespoke dashboard on top of the same backend system. A few months later a second version of the dashboard, with richer information, was released.

Figures in the dashboard above are purely illustrative.

During this second iteration, we decided to remove Kibana and decouple the visualisation from the storage through an API developed in NodeJS using GraphQL. This was one of the first times we had used GraphQL at The Telegraph and it was a pleasant surprise, since it allowed much more flexibility. We moved away from a rigid contract with multiple endpoints in favour of a simpler approach with fewer endpoints and a clear schema, allowing us to extract and filter data from Elasticsearch in a cleaner way. Below is the updated design.
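The single-endpoint idea can be illustrated with a hypothetical query and the Elasticsearch filter it would resolve to. The schema, field and argument names below are invented for the example; they are not the real Pulse API.

```python
# Hypothetical GraphQL query against the single endpoint: callers pick
# fields and filters instead of hitting one REST endpoint per view.
EXAMPLE_QUERY = """
query {
  articles(section: "news", minutes: 30) {
    headline
    pageViews
    registrations
  }
}
"""

def build_es_query(section: str, minutes: int) -> dict:
    """Resolver sketch: translate GraphQL arguments into an
    Elasticsearch bool-filter query body."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"section": section}},
                    {"range": {"timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        }
    }

print(build_es_query("news", 30))
```

One resolver per field replaces a whole family of bespoke REST endpoints, which is where the extra flexibility came from.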

Pulse

After this second release, we decided to undertake a new challenge.

It’s one thing to have a dashboard displayed on a big wall that doesn’t allow much interaction, but quite another to have a product that allows users to conduct real-time exploration of how our content is performing. The idea of “Pulse” was born.

The PoC phase was officially terminated and we started to consider Pulse as a product with a well-defined roadmap.

A new team led by our Head of Data was created with the right mix of UX designers, data engineers and frontend developers. We ran a few workshops with different business users to understand the needs and priorities of the newsroom. After a couple of weeks, the first designs were ready.

Once these sessions were concluded it was clear which metrics and dimensions were relevant to measure the performance of our articles.

Luckily, from a backend point of view, changes to the core design were minimal, since we were starting from an already strong base. However, most of the requested features required us to significantly extend the solution.

Once we finished collecting the requirements and we had a clear understanding of what we were trying to achieve we updated the architecture as shown below.

In this third phase, it was no longer possible to rely on a single data source to serve the data. Alongside the Adobe live stream we added Chartbeat, the post-processed Adobe Hitlog, and the Unified Content Model (UCM, an article storage platform developed in-house by The Telegraph Engineering team).

The new integration with Chartbeat was developed to offer metrics that could not be tracked through Adobe Analytics, such as the average engaged time on a page for a specific audience.
The post-processed Adobe Hitlog was added to offer a historical view of how our content was performing.

Aside from the new data sources, further development work was necessary. The API used to serve the dashboard was rebuilt from scratch using Python and GraphQL to conform with the stack of technologies that we normally use. A new Redis cache was introduced to improve the response time and offer a smooth experience to the end user. The real-time data pipeline that consumes Adobe live stream data was updated to include the new metrics and offer better data cleansing.
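The Redis layer follows a standard cache-aside pattern: serve a recent answer from the cache if one exists, otherwise query Elasticsearch and cache the result briefly. The sketch below uses an in-memory dict in place of Redis, and the key scheme and TTL are illustrative assumptions.

```python
# Cache-aside sketch; a dict with expiry times stands in for Redis.
import time

CACHE = {}          # key -> (expires_at, cached_result)
TTL_SECONDS = 30    # short TTL: real-time figures go stale quickly

def query_backend(key: str) -> dict:
    """Placeholder for the Elasticsearch round trip."""
    return {"key": key, "page_views": 1234}

def get_stats(key: str) -> dict:
    now = time.monotonic()
    hit = CACHE.get(key)
    if hit and hit[0] > now:            # fresh entry: serve from cache
        return hit[1]
    result = query_backend(key)         # miss or expired: hit the backend
    CACHE[key] = (now + TTL_SECONDS, result)
    return result

print(get_stats("article:42"))          # first call populates the cache
print(get_stats("article:42"))          # second call is served from memory
```

A short TTL keeps the dashboard feeling live while shielding Elasticsearch from repeated identical queries when many journalists watch the same article.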

The need to also classify our articles through a set of tags in near real-time led to a hybrid design where real-time, near real-time and batch data pipelines coexist. For this purpose, a tags data pipeline was developed. It runs every N minutes and, for each article published that day, checks whether a set of conditions is satisfied in order to classify our content accordingly.
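The condition check at the heart of the tags pipeline can be sketched as a small rule table. The tag names and thresholds below are illustrative assumptions, not the rules Pulse actually applies.

```python
# Sketch of the per-article classification step run every N minutes.
# Each rule pairs a tag with a condition on the article's current metrics.
RULES = [
    ("trending",   lambda a: a["page_views_last_hour"] >= 1000),
    ("converting", lambda a: a["registrations"] >= 50),
    ("engaging",   lambda a: a["avg_engaged_seconds"] >= 60),
]

def classify(article: dict) -> list:
    """Return every tag whose condition the article satisfies."""
    return [tag for tag, condition in RULES if condition(article)]

article = {
    "page_views_last_hour": 2500,
    "registrations": 10,
    "avg_engaged_seconds": 75,
}
print(classify(article))  # → ['trending', 'engaging']
```

Because the rules only need fresh aggregates rather than every raw event, this step fits naturally in the near real-time tier of the hybrid design.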

The frontend was also built from scratch. Since we didn’t have anything in place yet, our frontend team started from a blank canvas and in record time developed a responsive dashboard that lets our users explore the statistics of each article or section under a set of predefined filters.

Figures in the dashboard above are purely illustrative.

Pulse went live at the beginning of 2019 and it is now part of the tools that are constantly used by our journalists.

What’s Next

What will be the next step? This time we are definitely going big!

In the coming weeks (from the time of writing), we will release Pulse XL to replace the old editorial dashboard. It will introduce a historical data view, add geographic information to our main dashboard, and unify all our real-time dashboards under the same product.

Regardless of whether you are on mobile, on desktop or in The Telegraph newsroom, Pulse will support our strategy with reliable figures.

Pulse has changed our newspaper’s attitude to data; we are placing more confidence and trust in the information captured about our content. Put simply, we have one of the best available pieces of technology for capturing and analysing the stories that we publish in real-time. Pulse flags segments, such as engaged registered visitors, then prompts journalists on how to convert them to subscribers in real-time. This will be customisable for every team across editorial to ensure all content is achieving its purpose and contributing to the Telegraph’s broader strategy.

Stefano Solimito is a Principal Data Engineer at The Telegraph