Search-driven editorial analytics

Using search to augment analytics for news websites

At Trinity Mirror we publish thousands of articles every day, across dozens of websites. Working with this volume and velocity of content is a challenge.

Traditional web analytics has a natural bias towards the top-performing content. While this works for relatively static sites, on a news website it can mean definitions of success are skewed. Many good pieces of content disappear into a void if they don’t build enough concurrents to flash up on the big leaderboard.

As many other major publishers have identified, traditional web analytics software just isn’t good enough at providing analytics that meets the needs of news organisations. One approach I’ve taken at Trinity Mirror has been to look beyond blunt metrics such as concurrents, unique visitors and page views and take a closer look at the nature of the content itself and the factors that lead to its success or failure.

HiveAlpha

Our in-house editorial analytics platform HiveAlpha started life as a simple tool for the SEO team to keep track of content being published across our network in real-time.

Over time it has grown into a fully-featured editorial analytics and productivity platform. But how it was initially created has shaped a slightly different way of approaching editorial analytics as compared to other publishers’ in-house tools.

Why is it called ‘HiveAlpha’?

It is intended to be a ‘hive’ for users to deposit and access useful information. And ‘alpha’ because it’s something I’ve put together on my (long) daily commute.

Search-driven

At the core of HiveAlpha is a search engine. In common with the Guardian’s Ophan, ElasticSearch is a central technology.

Deep, highly customisable search is built into HiveAlpha thanks to ElasticSearch

ElasticSearch is excellent for analytics (as demonstrated by Kibana), and is used in HiveAlpha as both a NoSQL database and a search engine. This means anything stored is inherently searchable, and its aggregations make visualisation and other analytics tasks easy.
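To give a flavour of what an aggregation looks like, here is a minimal sketch of a search body that counts articles published per day in one section. The field names (`section`, `published_at`) and the function name are assumptions for illustration, not HiveAlpha's actual schema, and `calendar_interval` is the parameter name in recent ElasticSearch versions (older releases used `interval`).

```python
from typing import Any, Dict


def build_daily_volume_query(section: str, days: int = 30) -> Dict[str, Any]:
    """Build a search body that buckets a section's articles by publish day.

    Field names are hypothetical; adjust them to the real index mapping.
    """
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"section": section}},
                    # Date math: from `days` days ago, rounded down to the day.
                    {"range": {"published_at": {"gte": f"now-{days}d/d"}}},
                ]
            }
        },
        "aggs": {
            "articles_per_day": {
                "date_histogram": {
                    "field": "published_at",
                    "calendar_interval": "day",
                }
            }
        },
    }
```

The resulting dictionary can be passed as the request body to the `_search` endpoint. Each bucket in the response carries a date key and a `doc_count`, which is already most of what a standard time-series chart needs.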

Unlike Ophan (as far as I’m aware), HiveAlpha stores the content of every article we publish into a real-time searchable index. This allows us a level of insight into our content that no other web analytics tool I’ve seen offers.

As with any large dataset, part of the problem with using traditional web analytics to analyse millions of pages is findability. By integrating search with content analytics, it’s easy to segment articles on flexible criteria.

Some examples of what this allows us to do:

  • We can identify articles by language patterns. Q: when did we first mention ‘brexit’ in an article? A: 18 June 2013, almost a year before it was first used in the Commons.
  • We can perform highly advanced searches on our content in real-time, with more granularity than is available on Google. For example, we could easily find long-read articles published within the past year that mention a specific topic or phrase, by a specific author, with a lead video and more than three images.
  • With that query we could also view trends over time for other articles that match those criteria, and create a segment in our core business analytics tool to view how that content performs.
  • It’s no secret most news sites’ internal search functions are awful. With a better index internally, we can search a comprehensive index of all the content we’ve ever published, up to the minute. Google is fast, but it’s not always comprehensive, and will filter out content it doesn’t like.
  • Because we index article text separately from other meta-data, the number of false positives you see on Google (which often happen with news sites, as it indexes most read/shared/recent lists as well as the body copy) is massively reduced.
  • We can define new metrics on the fly. Because we index everything we publish, answering a new type of question is often just a case of querying the dataset in a new way. For example, we recently pulled out usage patterns of agency pictures on Mirror.co.uk, something no other internal tool was able to offer.
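Two of the examples above translate fairly directly into ElasticSearch's query DSL. The sketch below is illustrative only: the field names (`body`, `author`, `has_lead_video`, `word_count`, `image_count`, `published_at`), the function names and the 1,500-word threshold for a "long read" are all assumptions, not the real HiveAlpha mapping.

```python
from typing import Any, Dict


def build_long_read_query(phrase: str, author: str,
                          min_words: int = 1500,
                          more_than_images: int = 3) -> Dict[str, Any]:
    """Long reads from the past year by one author, with a lead video
    and more than `more_than_images` images, that mention a phrase."""
    return {
        "query": {
            "bool": {
                # Scored clause: the phrase must appear in the article body.
                "must": [{"match_phrase": {"body": phrase}}],
                # Unscored clauses that narrow the segment.
                "filter": [
                    {"term": {"author": author}},
                    {"term": {"has_lead_video": True}},
                    {"range": {"word_count": {"gte": min_words}}},
                    {"range": {"image_count": {"gt": more_than_images}}},
                    {"range": {"published_at": {"gte": "now-1y"}}},
                ],
            }
        }
    }


def build_first_mention_query(term: str) -> Dict[str, Any]:
    """Return only the earliest published article that mentions a term."""
    return {
        "size": 1,
        "query": {"match": {"body": term}},
        "sort": [{"published_at": {"order": "asc"}}],
    }
```

Something like `build_first_mention_query("brexit")` is the shape of query behind the "when did we first mention it" question. Putting the structural criteria in `filter` rather than `must` keeps them out of relevance scoring, which seems the right fit when they are simple yes/no conditions.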

Visualisation

The aggregations offered by ElasticSearch make it easy to create visualisations for standard analytics charts. But beyond the traditional audience charts, the data indexed in HiveAlpha has enabled experimentation with other visuals, such as the NewsMap below, which shows recently mentioned locations in our stories:

The NewsMap shows an overview of the locations mentioned in our articles

Another visual that ElasticSearch makes possible is an overview of the tags we’ve used in our articles over a given period of time, shown below using the D3.js bubble chart. It’s not as accurate as a bar chart, but it can compress a lot of information into a small space, and it looks nice:

The topic bubble is another way of seeing a broad overview of the content we publish
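As a rough sketch of how a chart like this could be fed, the code below builds a `terms` aggregation over a tag field and reshapes the response buckets into the nested name/value structure that D3's hierarchy layouts typically consume. The field names `tags` and `published_at`, and both function names, are assumptions for illustration.

```python
from typing import Any, Dict


def build_tag_query(since: str = "now-7d", top_n: int = 50) -> Dict[str, Any]:
    """Search body counting the most used tags since a given date."""
    return {
        "size": 0,  # aggregation only
        "query": {"range": {"published_at": {"gte": since}}},
        "aggs": {"tags": {"terms": {"field": "tags", "size": top_n}}},
    }


def buckets_to_bubbles(response: Dict[str, Any],
                       agg_name: str = "tags") -> Dict[str, Any]:
    """Turn terms-aggregation buckets into a name/value tree for a bubble chart."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {
        "name": agg_name,
        "children": [
            {"name": bucket["key"], "value": bucket["doc_count"]}
            for bucket in buckets
        ],
    }


# Example with a hand-written response in the shape ElasticSearch returns.
sample_response = {
    "aggregations": {
        "tags": {
            "buckets": [
                {"key": "EU Referendum", "doc_count": 120},
                {"key": "Euro 2016", "doc_count": 85},
            ]
        }
    }
}
bubbles = buckets_to_bubbles(sample_response)
```

The sample counts above are made up purely to show the shape of the data. Each bubble's area would then be driven by `value`, which is why the chart reads as a broad overview rather than a precise comparison.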

Another nice visual is a calendar of our most viewed content over a given month, in much the same way as Google’s Trends calendar.

Most popular news topics each day in June 2016

One further way we can visualise our content is to see the internal link structure of our articles (which is handy for SEO). For any given topic, section or search, we can see the strongest pages and ensure they are being cited in the right way.

An example is shown below of the Mirror’s coverage of Jeremy Corbyn, created using data exported from HiveAlpha and visualised using the wonderful Gephi, where you can calculate PageRank and express it in a network graph (as well as in tabular form):

Link graph created using GraphML and Gephi
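Gephi does the PageRank calculation for you, but the idea is simple enough to sketch. The code below is a minimal power-iteration version over a dictionary of internal links; it is not Gephi's implementation, and the page names, the function name and the 0.85 damping factor are illustrative assumptions.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Approximate PageRank for a graph given as {page: [pages it links to]}."""
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return {}

    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outbound links is shared across all pages.
        dangling = sum(ranks[page] for page in pages if not links.get(page))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {page: base for page in pages}
        # Each page passes an equal share of its rank to the pages it links to.
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * ranks[source] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical internal links between four pages on one topic.
example_links = {
    "profile": ["topic-hub"],
    "analysis": ["topic-hub"],
    "live-blog": ["topic-hub", "profile"],
    "topic-hub": [],
}
example_ranks = pagerank(example_links)
strongest_page = max(example_ranks, key=example_ranks.get)
```

In this toy graph the page that most other pages link to comes out strongest, which is the kind of check the link graph makes possible at a glance: is the page we want to rank actually the one our own articles cite?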

Summary

There are a lot of exciting things that can be done using search as a basis for analysing content — much more than is available in commercial tools today.

HiveAlpha does maybe 0.1% of what I think it could do, and in a later post I’ll discuss what we’re doing with the more traditional editorial analytics side of the platform.
