Designing better digital metrics in The Times and The Sunday Times newsrooms

Dan Gilbert
News UK Technology
Jan 22, 2020

See here for our latest post on our ongoing adventures with metrics in the newsroom.

At The Times and The Sunday Times, like many publishers, we work hard to make it easier for our newsrooms to use data to make better decisions.

However, traditional digital analytics tools have fallen far short in helping us understand how our content is performing on our website and apps. The metrics these tools offer — average dwell time, unique readers, page views and the like — typically fail to provide meaningful insight, or can fundamentally mislead us about whether content is exceeding our benchmarks or clearly not meeting the needs of our readers.

Over the past few years we have developed INCA (Intelligent Newsroom Contextual Analytics), a tool designed to be accessible and actionable for our newsrooms, and that attempts to quantify how well our content is performing digitally (see here for more background on INCA).

A key aspect of INCA’s development has been the decisions we’ve made about which metrics we display and in what format. Instead of traditional metrics, INCA displays a set of indices. These indices can be used to help evaluate how well an article is performing for a given metric (e.g. dwell time or number of readers) relative to how well we expected it to perform given its context (e.g. length, position in the edition etc).

As Nick Petrie, Deputy Head of Digital for The Times and The Sunday Times puts it:

“INCA indices help augment editorial judgment to inform intelligent decisions using data, from identifying headlines or images that need modifying to commissioning new content. Ultimately it’s about reducing the time it takes for our journalists to get meaningful insights.”

Are longer articles better than shorter articles?

A metric such as average dwell time (for the uninitiated, this measures the average number of seconds a reader on our website or apps spends on an article before navigating to the next one) tends to tell us as much about the length of the article as anything else.

The chart below plots the average dwell time for individual articles against their length (number of words). Whilst we see a lot of variation in the data (we’ll come back to that later), longer articles tend to have longer average dwell times, and shorter articles tend to have shorter average dwell times.

As you would expect, readers tend to spend longer reading longer articles, but other characteristics of the article influence dwell time too, as does the natural variation in the data.

Similarly, if we measure the unique number of subscribers reading an article on our website or apps, we need to be cautious in how we interpret it. The Times and Sunday Times website and apps are published as editions — all readers see the same layout of articles on the page — and many readers navigate through the edition in a similar way to how they would read a newspaper (moving from beginning to end), so articles nearer the top (the beginning) of the edition tend to be seen, clicked on and read by more people.

Is it surprising that our readers are more likely to click on articles further up the page or that they tend to spend more time when reading longer articles? Should we judge the number of clicks on an article at the top of our edition in the same way we judge the clicks on an article right at the bottom? Should we conclude that long articles are better than short articles just because people spend longer reading them? Obviously not.

Yet if we simply use average dwell time, unique readers or any one of many similar metrics that most analytics tools offer, implicitly this is what we are doing, and the insight we can hope to get from these numbers is limited. It’s like trying to make sense of geographic data without having any coordinates to plot them against.

Even when the people using these metrics recognise these limitations (as editors and journalists often do), and the metrics are visualised appropriately and compared to relevant averages, at best users are left to juggle information about the factors that are likely influencing the metric and jump through hoops of mental arithmetic in the hope of gleaning some useful insight.

At worst, these metrics create a negative feedback loop, whereby articles placed more prominently are seen by more readers, which merely validates the original decision to place the articles prominently rather than offering any new insight into which stories are working well with our readers, or might work well in the future.

Accounting for context with indices

To overcome these problems we developed what we call our INCA indices. These indices provide a measure of how well an article performed on a particular metric relative to how well we would expect the article to perform against that metric, after accounting for the context of the article.

If an article exceeds expectations after taking into account the context, it is the first step to gaining some insight from our data into whether the article is resonating with our audience. Perhaps there is something about the specific storyline or writing style in the article that is engaging our readers more than other similar articles.

We care about context because it helps us set our expectations. Different contextual factors affect different metrics. As we’ve seen, average dwell time is influenced by the length of an article. Similarly, the number of subscriber reads is strongly influenced by the position of the article in the edition. In some cases the relationships between context and metrics are indirect (or at least less obvious), but they are still useful in setting expectations. For example, given the same word count, articles in different sections tend to have different dwell times (e.g. readers spend slightly more time reading Sports articles than News articles, after accounting for article length). These systematic differences might be caused by differences in the presentation of the content (for example, different writing styles), differences in audience, or perhaps variations in the manner in which different content is consumed.
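
As a quick illustration of how such a section-level difference might show up in the data (the column names here are hypothetical), a crude check is to compare a length-adjusted dwell time across sections:

```python
# Quick illustration (hypothetical column names) of checking for systematic
# section-level differences after controlling for article length.
import pandas as pd

def dwell_per_word_by_section(articles: pd.DataFrame) -> pd.Series:
    """Average length-adjusted dwell time per section.

    Expects columns: 'section', 'word_count', 'avg_dwell_seconds'.
    """
    adjusted = articles["avg_dwell_seconds"] / articles["word_count"]
    return adjusted.groupby(articles["section"]).mean().sort_values()
```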

Given the number of characteristics we need to consider, we use a machine learning model to come up with the expected value for a specific metric (the target in the model) given the characteristics of the article (the features in the model). We then compare the expected value (the model’s prediction) to the actual metric for the article and rescale the resulting value to generate each index. A user can very easily tell from the index whether a metric is above or below expectations — no mental gymnastics required.
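
To make that concrete, here is a minimal sketch of the expected-vs-actual calculation for a dwell time index, assuming a table of articles with a few contextual features. The feature names, the choice of a gradient-boosted model and the in-sample prediction are illustrative simplifications, not INCA’s actual implementation.

```python
# A minimal sketch of the expected-vs-actual idea (illustrative only).
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical contextual features and target metric
FEATURES = ["word_count", "section", "edition_position", "image_count"]
TARGET = "avg_dwell_seconds"

def dwell_time_index(articles: pd.DataFrame) -> pd.Series:
    """Actual dwell time relative to its contextual expectation (100 = as expected)."""
    X = pd.get_dummies(articles[FEATURES], columns=["section"])
    model = GradientBoostingRegressor()
    model.fit(X, articles[TARGET])            # learn expected dwell time from context
    expected = model.predict(X)               # in practice, predict out-of-sample
    return 100 * articles[TARGET] / expected  # >100 means above expectations
```

A raw value of 100 means the article performed exactly as its context predicted; this raw number is what then gets banded into the coarser display scale described below.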

An index that most of us are familiar with is the body-mass index (BMI) — a normalised measure of body weight based on height (similar to normalising dwell time based on the length of an article). The BMI has come under fire because it fails to account for differences in characteristics such as gender and age group, despite the fact that the ‘idealised’ weight for a given height will differ based on these characteristics (Geoffrey West covers this in his book Scale). A similar example might be standardising educational outcomes to account for differences in characteristics such as household income before making comparisons. A critical part of our work in developing the indices has been our choice of the characteristics (context) that we take into account to avoid unfair comparisons.

The breadth of characteristics that influence metrics is one of the reasons we see so much variation in the earlier chart of dwell time vs. word count. If word count were the only predictable driver of dwell time, we could avoid any machine learning and just use a modified metric based on dwell time per word (the equivalent of BMI). However, beyond length and section, there are a number of other factors that cause a systematic difference in dwell time, including the number of comments, images and videos, the proportion of readers on our apps, and more.

There is also natural variation and uncertainty in the data, and we need to account for that too. The factors that can impact web analytics data are manifold, ranging from inevitable data quality and measurement issues through to one-off events that might trigger interest in a specific topic for a short time (this happens quite often in news…). Not least, we have to account for the wide range of sample sizes across the characteristics we care about. For instance, some sections have a smaller number of articles on which to model the data, and as can be seen in the earlier chart, the longer the article, the fewer examples we have to infer what a ‘normal’ dwell time is.

Whether we are using indices or plain old web analytics metrics, the risk of over-interpreting differences that might be due to chance is always a concern, but one rarely respected in analytics tools and displays of data.

Respecting uncertainty in the data

Machine learning algorithms are also far from perfect when making predictions and you typically need to consider the amount of uncertainty in what the model is telling you. Therefore, when evaluating the performance of articles based on our indices, we deliberately avoid being too precise when making a judgment.

Rather than displaying indices as very precise numbers or scores, we take into account the level of uncertainty in our predictions and present each index as a whole number on a scale from 1 to 5. That is, an index can only be one of five values:

  1. significantly below expectations
  2. below expectations
  3. expected
  4. above expectations
  5. significantly above expectations
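
To give a flavour of how a raw index and its uncertainty might be collapsed into those five bands, here is a small sketch; the thresholds and the simple uncertainty rule are placeholders for illustration, not INCA’s actual cut-offs.

```python
# Illustrative banding of a raw index into the five displayed values.
# Thresholds and the uncertainty rule are placeholders, not INCA's cut-offs.

BANDS = {
    1: "significantly below expectations",
    2: "below expectations",
    3: "expected",
    4: "above expectations",
    5: "significantly above expectations",
}

def band_index(raw_index: float, uncertainty: float) -> int:
    """Map a raw index (100 = as expected) to a 1-5 band.

    If the deviation from 100 is smaller than our uncertainty estimate,
    we refuse to call it anything other than 'expected'.
    """
    deviation = raw_index - 100
    if abs(deviation) <= uncertainty:
        return 3
    if deviation > 2 * uncertainty:
        return 5
    if deviation > 0:
        return 4
    if deviation < -2 * uncertainty:
        return 1
    return 2

print(BANDS[band_index(raw_index=120, uncertainty=15)])  # "above expectations"
```

Collapsing the raw number in this way is what lets the display stay honest about uncertainty without attaching a confidence interval to every figure.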

The screenshot below gives a glimpse of how we display the data for an individual article. We track several indices, but this highlights the two we discussed above (average dwell time and subscriber reads).

We display indices for each metric we track, on a scale from 1–5. Given its context, this article’s dwell time was as you’d expect (index = 3), but it had more subscribers reading it than expected (index = 4).

We originally presented the indices on a scale from 0 to 200, where 100 was expected. However, we quickly ran into trouble with users focusing too much on small differences that were not necessarily meaningful, and discovered we could dramatically increase the coarseness of the indices we displayed while actually improving key insights.

This also helped reduce the need for data visualisations such as confidence intervals to always accompany the indices to ensure uncertainty is communicated. This has made interacting with the data easier for some users, and alongside the increased coarseness we hope it will improve the quality of the discussion that the data stimulates. We can also use simple but effective tabular displays to highlight articles with interesting combinations of indices and prompt investigation, whilst investing more effort in developing visualisations that identify less obvious patterns across articles, as well as the distributions of data that underpin the models, to spot where the indices might lead us astray.

Communicating uncertainty is a growing theme in the data visualisation community that we happened upon whilst grappling with some of these issues ourselves. Kudos to the Encode data visualisation festival in London last year where I first heard Andy Kirk discussing the trend, in particular Jessica Hullman’s work on uncertainty visualisation.

More generally though, we’ve failed to find many parallel examples of metrics that are designed to contextualise information to improve insight, but there are some good sources. Some of the problems of meaningless comparisons between metrics that ignore underlying scaling laws are covered in Geoffrey West’s aforementioned book Scale. For clearly visualising comparisons of actual vs. expected performance in metrics, and respecting uncertainty in the process, there is perhaps no better source than Stephen Few; in Signal he also discusses standardising data to make fair comparisons and the use of visualisation techniques such as funnel plots and statistical process control charts to separate the signal from the noise.

However, there is still relatively little practical guidance on approaches to encoding more context into the metrics themselves, particularly in the domain of digital analytics, and we think this is an area worthy of further research and attention.

As we’ve been working on the indices for a while, we occasionally find new lenses to look at the problem through. Recently, this has been framing it in terms of causal models (inspired by Judea Pearl), as Ed Rushton, a data scientist working on the indices explains:

“We are trying to understand which factors cause a change in each target metric. We have control over some of these factors, such as position on the page, or the presentation of the article. Others are best viewed as being outside of our control, for example, the proportion of traffic arriving from our apps versus the website. Crucially, causal models allow us to introduce a factor which captures the effect of things we cannot observe directly. In this way, we can introduce an X factor, which represents the extent to which the article resonates with our audience.”
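
As a toy illustration of that framing (the factor names are hypothetical and the graph is heavily simplified), the structure might be sketched as a directed graph in which the unobserved ‘X factor’ sits alongside the observable context as a cause of each metric:

```python
# Toy sketch of the causal framing, not a fitted model.
# Keys are causes, values are the metrics they influence; 'resonance_with_readers'
# is the unobserved 'X factor' the indices try to surface via the gap between
# actual and expected performance.
CAUSAL_GRAPH = {
    # factors we control
    "edition_position": ["subscriber_reads"],
    "word_count": ["avg_dwell_time"],
    "headline_and_images": ["subscriber_reads", "avg_dwell_time"],
    # factors outside our control
    "app_vs_web_mix": ["avg_dwell_time"],
    # the unobserved factor
    "resonance_with_readers": ["subscriber_reads", "avg_dwell_time"],
}
```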

Data informed, not data led

Indices move us beyond the traditional web analytics metrics that at best provide a view of the past. This is still useful (we clearly find it helpful to know the total number of people reading our content, and we often look at the actual metrics alongside the indices), but indices help us look forward by learning what works with our readers, what could work better, and where there are opportunities to improve what content we create and how we present it.

The indices help us look forward by learning what works with our readers

We wouldn’t have developed INCA and these indices if we didn’t believe data can be used to improve our product, reveal our best and otherwise overlooked content and support editorial strategy. However, we’re also well aware of the risks of blindly following dictums such as “What gets measured can be managed and improved”, and the predictable pitfalls that beset attempts to improve things through measurement, captured in Jerry Muller’s The Tyranny of Metrics and elsewhere.

The concept we are trying to quantify — the quality of an article — can only be partially captured by data, and the indices represent a weak proxy, albeit significantly better than anything traditional web analytics metrics have offered us. We also make inferences from several indices, each of which captures different and often competing elements of our reader engagement. As such, data is used to inform, not lead the discussion on what content appears in our product.

One luxury we have in developing this approach at The Times and The Sunday Times is that our edition structure means contextual features such as the placement of articles are more stable and simpler to capture than on a hyper-personalised website or app, where content constantly moves about or every reader sees their own personalised layout. Even so, a key lesson for us has been that acquiring context is costly and technology often gets in the way. Contextual data typically lives in systems separate from your analytics data, and those systems are often not designed to talk to each other (our original content management system was one of them). However, the impact of connecting this information has been transformational and the effort worthwhile.
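
To illustrate what that connection involves in practice (the table and column names below are hypothetical), the core step is joining per-article analytics to editorial metadata on a shared article identifier:

```python
# Illustrative join of analytics data with editorial context (hypothetical schemas).
import pandas as pd

# Per-article metrics exported from the analytics platform
analytics = pd.DataFrame({
    "article_id": ["a1", "a2"],
    "avg_dwell_seconds": [95.0, 140.0],
    "subscriber_reads": [12000, 4300],
})

# Contextual metadata from the content management / edition systems
context = pd.DataFrame({
    "article_id": ["a1", "a2"],
    "word_count": [450, 1200],
    "section": ["News", "Sport"],
    "edition_position": [3, 27],
})

# One row per article, carrying both the metrics and the context the model needs
articles = analytics.merge(context, on="article_id", how="inner")
print(articles)
```

Once the two sources share a key, the contextual columns become the features described earlier.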

We hope that our approach will prove valuable to other newsrooms, and that it has broader application across digital products where the critical context of the product or content is not captured in the data used to measure it, but should be.

Please follow News UK Technology and The Digital Times for upcoming posts from our data science team as we continue to experiment and develop INCA and other initiatives.
