Why the BBC is redesigning its dataset

John Larder
BBC Data Science
9 min read · Feb 7, 2018

Fresh start. Blank slate. New beginning.

We’d all like one occasionally. It sounds attractive. Starting again. Resetting.

But it’s not always as easy as we think it’s going to be.

Divorce. Faking your own death. A new digital analytics implementation.

And it’s never really a completely new start. There are always legacy issues that need to be dealt with. Be that deciding who the children go to for Christmas, collecting the life insurance and getting away with it, or removing 8 years of tech debt and legacy configuration.

But it is always an opportunity, a chance to think anew about where you want to go and how you want to get there.

Background

The BBC has been with its current digital analytics supplier since 2011. Over that period, there have been many teams involved in implementing analytics across the huge range of different products and platforms on which the BBC exists. For example, iPlayer alone is available on more than 10,000 different devices.

Hundreds of people have been involved in tagging during that time. Product owners. Developers. BAs. Researchers. Analysts. Suppliers of unknown name and provenance. The BBC has a central analytics function but, beyond a few key rules, only general guidelines exist on what and how to tag. This means our category-defining products such as News, Sport and iPlayer are able to capture the data that is most useful to them at a product level.

Documentation is kept up to date by individual products, but it isn’t always easily searchable and accessible. This is mitigated somewhat by the fact that tags are self-describing. For example, you could probably guess what something like ‘blog_title’ or ‘link_url’ means. But relying on self-description only gets you so far.

We currently have more than 8,000 labels live across the site, from those that generate barely any data to those that generate hundreds of millions of rows per day. Earlier this year we began to retrospectively document what the most commonly occurring 2,000 of these do.

what my 2017 looked like, and what my 2018 will continue to look like

This quickly brought home that we are inadvertently siloing our datasets. As well as a lack of easily accessible documentation, there is a lot of duplication.

For example, the labels ‘page_language’, ‘accept_language’ and ‘blog_language’ all report the same thing but in different locations. All three represent a user consuming content in a particular language, but an analyst has no reliable way of knowing which one to look for.
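To make the duplication concrete, here is a minimal Python sketch of how such aliases could be collapsed into one canonical value at query time. Only the three label names come from the text above; the event structure and the function are hypothetical.

```python
# Hypothetical sketch: collapsing duplicate language labels into one
# canonical value. Only the three label names are real; the event
# structure is an assumption for illustration.
LANGUAGE_ALIASES = ("page_language", "accept_language", "blog_language")

def canonical_language(event):
    """Return the language from whichever legacy label the event carries."""
    for label in LANGUAGE_ALIASES:
        if label in event:
            return event[label]
    return None

events = [
    {"page_language": "fr"},    # French
    {"blog_language": "pcm"},   # Pidgin
    {"accept_language": "eo"},  # Esperanto
]
print([canonical_language(e) for e in events])  # ['fr', 'pcm', 'eo']
```

With a single canonical label, the alias-hunting disappears and the lookup becomes a one-liner.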

This lack of control and governance has given us a dataset that is both wide and undefined. What is optimal on a local level, with each product deciding what and how they want to capture, is sub-optimal when looked at from a pan-BBC perspective.

It also became clear that we are capturing and keeping a lot of data that could be described as low-value. This deepens the dataset and increases the total amount that needs to be stored and processed.

We therefore currently have a dataset that is both very wide and very deep. While not linear, there is a relationship between width & depth and the cost of storage & speed of processing. This is becoming increasingly pertinent for the BBC, with consumption moving from a mode that is primarily broadcast with no data capture, to one where the majority of consumption is over IP. In an IP-first world, every interaction will leave a data trail, and this data is an increasingly important asset that needs to be maintained.

Opportunity

The BBC’s contract with its current digital analytics supplier is nearing its conclusion. This means that we will have to retag the whole estate.

Any supplier that we go with is likely to limit the total number of labels. This will result in a wholesale change in our approach to tagging.

This gives us the chance to completely redesign our approach to data capture and take a much more strategic approach.

We would do this even if we weren’t changing suppliers.

The fact that we are makes it that much easier to rebuild from scratch.

And there are many good reasons for doing this:

  • Personalisation
  • Machine learning
  • Bringing it in-house
  • Efficiency
  • Organisational capability

A more personal BBC

The BBC is changing. While its heritage is broadcast, its future is personal.

“… when we look to reinventing ourselves for the future, personalisation is so fundamental. It’s at the core of so many of the priorities we have set ourselves: from reinventing iPlayer, to reaching 20 million members, to revitalising our education mission.”

Tony Hall, 2017

To make a BBC that is more personal, we need to see and understand what each audience member is doing. The content they engage in. The topics they’re interested in. How they experience the BBC.

Having more than 8,000 undocumented and unknown data points makes this impossible. Having a much slimmer set of known and documented signifiers makes this much simpler.

Until the BBC decided to become much more personal, the heterogeneity of our data wasn’t a problem.

Frustrating sometimes? Yes.

Strategically limiting? No.

But with an increased focus on personalisation, this is changing.

Digital analytics is our largest and, arguably, most important dataset. We collect more than 1 billion audience interactions a day. Interactions that implicitly describe a user’s content preferences. These past touchpoints are essential fuel for powering a personalisation programme. Getting this right is fundamental.

Remember the example of duplicate language labels given earlier? Well, what if the BBC wanted to surface content that existed in a particular language to the set of users that it knew used that language? Be that French, Pidgin or Esperanto. With the existing set-up, you’d be reliant on somebody knowing all the labels that referred to the language content was consumed in. It’s not well documented, and though I highlighted three known versions of the same label, there may well be others.

If as an organisation we want to leverage a user’s past behaviour and preferences to personalise their experience, then that rich dataset needs to be simple to work with.

Consolidating what we track and how we track it is an essential step on the BBC’s journey to becoming a more data-driven and personally relevant organisation. Which brings us on to the next benefit…

Machine learning and AI

Yes, yes. Very hot topic. I know. Doesn’t mean it’s not relevant.

Though the BBC is exploring machine learning and AI, we’re not doing that much on the data science side. Yet.

When we do, having a leaner and better documented dataset will enable us to make strides in machine learning much quicker.

If 90%+ of machine learning is data cleansing, then reducing the amount of data cleansing increases your capacity for machine learning. Q.E.D.

If scaling machine learning and AI is a knowledge management problem, then reducing the knowledge management problem increases your potential.

The idea that you can throw a load of unstructured data into a machine and it’ll somehow magically sort it out is for the birds. To do this right you need to do the hard yards when it comes to data quality. And this is an opportunity for us to seriously increase the quality of our data.

(and though this section is titled ‘Machine learning and AI’, all of the above is just as relevant for conventional modelling — but ‘conventional modelling’ just doesn’t have the same cachet right now)

BYO

Build Your Own. These words will either strike fear into your heart or cause you to spring out of bed every morning with excitement depending on your perspective and/or experience. Though the BBC currently has no definite plans to bring its analytics in-house, it is a medium-term ambition.

Redesigning our approach now is a way station on that journey. Bringing in more discipline, process and governance now will pay dividends if/when we ever build our own.

Efficiency

i. quicker reporting

Restricting the number of unique labels available means it becomes viable to construct cubes containing all captured data, cubes that will return results significantly faster than queries against an n-dimensional database.
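As a toy illustration of why a bounded label set makes cubes feasible: once the dimensions are few and known, events can be rolled up once and every report becomes a lookup rather than a scan. The field names below are illustrative assumptions, not the BBC’s actual schema.

```python
from collections import Counter

# Toy cube: with a small, fixed set of labels, every event can be rolled
# up by (product, action) once, and reports then read the aggregate
# instead of scanning raw rows. Field names are illustrative only.
raw_events = [
    {"product": "news", "action": "page_view"},
    {"product": "news", "action": "page_view"},
    {"product": "iplayer", "action": "play"},
]

# Build the cube once at load time...
cube = Counter((e["product"], e["action"]) for e in raw_events)

# ...then each query is a dictionary lookup, not a table scan.
print(cube[("news", "page_view")])  # 2
```

The same idea scales up: pre-aggregation trades a one-off build cost for cheap, repeated reads, which is exactly what a reporting workload wants.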

With more than 1,000 users across the organisation, the time saved not waiting for queries to run will amount to many thousands of hours per year. There is also the less measurable but equally important fact that the quicker a query runs, the more engaged the end user is and the more they can remain in the flow of what they’re doing.

ii. build once, use many

With greater commonality, it’ll be far easier to build and share: reports, dashboards, models, segmentations, tagging guides. And probably some other things I can’t think of. This will mean less work for everyone. Productivity problem? What productivity problem?

iii. documentation

There will be a smaller set of documentation to keep up to date. It will be easier for developers and business analysts to understand what tracking to implement. It will also be easier for reporters to understand what is being tracked and how to report, and there will be a common language and set of definitions in use across products.

The secret of the industrial revolution was standardisation. The secret of good tagging is also standardisation. And good documentation.

Organisational maturity

End users will not only benefit from efficiency gains but there will be more trust in the data. We will move away from “is this data correct?” to “what can I do with this data?”

This is also part of a more general progression up the analytics maturity curve:

descriptive > diagnostic > predictive > prescriptive

Rather than leveraging data to develop audience-facing features and products via analysis and insights, we will move to a more data-driven offering.

How will we change our approach to achieve this?

Process and governance

The bulk of design and definition creation will occur during the implementation phase. Post-implementation we will work with our colleagues in Data Management to develop a governance framework and processes around label creation and modification.

We will move away from creating specific tags for one or two use cases that are never used again. Any new label should be created with a view to further use. No label should be created without first checking whether an appropriate label already exists.
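One possible shape for that “check before you create” rule is a registry of documented labels that every new request is validated against. The sketch below is a hypothetical illustration; the registry contents, function name and similarity threshold are all assumptions.

```python
import difflib

# Hypothetical "check before you create" guard: a registry of documented
# labels that any new label request is validated against. The registry
# contents and the similarity threshold are illustrative assumptions.
LABEL_REGISTRY = {
    "language": "ISO code of the language the content is consumed in",
    "action_type": "generic interaction verb, e.g. add, remove, play",
}

def propose_label(name):
    """Reject a label that already exists or closely resembles one."""
    if name in LABEL_REGISTRY:
        raise ValueError(f"'{name}' already exists: {LABEL_REGISTRY[name]}")
    similar = difflib.get_close_matches(name, LABEL_REGISTRY, n=1, cutoff=0.7)
    if similar:
        raise ValueError(f"'{name}' resembles existing label '{similar[0]}', reuse it?")
    return name

print(propose_label("playback_speed"))  # genuinely new: accepted
# propose_label("page_language") would be rejected as too close to "language"
```

Even a simple fuzzy check like this would have flagged the duplicate language labels before they ever went live.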

Modular querying

As we move from a label creation process with few rules to one that is far more rigid, we’ll need to educate and explain to end users exactly how they can get the information they need. How what they want to achieve can be done in a modular manner.

For example:

product1_action = product1-favourite-add

is the same as:

action_type = add
action_name = favourite
product_name = product1

While the second of these uses more parameters, it can be easily adapted and applied to all products. The first method would need to be made unique for every product.

Being very specific with a label might work in one case but is not extensible. A modular design enables far more flexibility, is more efficient and is easier to work with.
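One way to picture the extensibility claim: the same three modular parameters describe any action in any product, so no new labels are needed as products are added. A small hypothetical sketch (the builder function and the second example are assumptions, only the favourite example comes from the text):

```python
# Hypothetical sketch of the modular scheme above: one generic event
# shape reused by every product, instead of a bespoke compound label
# per product.
def modular_event(product_name, action_name, action_type):
    return {
        "product_name": product_name,
        "action_name": action_name,
        "action_type": action_type,
    }

# The same three parameters cover the favourite example from the text...
print(modular_event("product1", "favourite", "add"))

# ...and an entirely different product and action, with no new labels.
print(modular_event("product2", "bookmark", "remove"))
```

The specific approach would instead need a new compound label (‘product2_action = product2-bookmark-remove’) minted for every product and action combination.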

Keep only what is valuable, discard what isn’t

“It’s not the daily increase but daily decrease. Hack away at the unessential.”

Bruce Lee

As mentioned, some of what we currently collect and keep could be described as low-value. This isn’t to say it’s not useful, rather that the window within which we can actively make use of it is strictly limited.

Anything that we collect with the specific intent of understanding real-time performance should be purged on a regular basis, unless it can be aggregated up to help us understand broader trends.
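An aggregate-then-purge policy like this can be sketched in a few lines: real-time rows older than a retention window are rolled into daily totals before being dropped. The schema and the 30-day window below are illustrative assumptions.

```python
import datetime as dt

# Hypothetical "aggregate up, then purge" pass: real-time rows older
# than the retention window are rolled into daily totals before being
# dropped. The row schema and 30-day window are assumptions.
RETENTION = dt.timedelta(days=30)

def compact(rows, daily_totals, today):
    """Move expired real-time rows into daily aggregates; keep the rest."""
    kept = []
    for row in rows:
        if today - row["date"] > RETENTION:
            key = (row["date"], row["label"])
            daily_totals[key] = daily_totals.get(key, 0) + row["count"]
        else:
            kept.append(row)
    return kept

today = dt.date(2018, 2, 7)
rows = [
    {"date": dt.date(2018, 1, 1), "label": "play", "count": 5},  # expired
    {"date": dt.date(2018, 2, 1), "label": "play", "count": 3},  # kept
]
totals = {}
rows = compact(rows, totals, today)
print(len(rows), totals)  # 1 {(datetime.date(2018, 1, 1), 'play'): 5}
```

The raw detail disappears, but the daily totals survive to feed the broader trend analysis.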

Putting theory into practice

what my 2018 will feel like

While there are clear benefits in changing our approach, execution won’t be easy. The BBC is a diverse organisation and each product is at a different level of maturity when it comes to analytics.

Changing what currently works well for the majority will be no mean feat, with benefits that may not be obvious in the short-term.

Yet this is something that we have to do if we are to exploit the opportunities that an IP-first future presents.

Call me in 12 months. I’ll let you know how it went.

**********************************

Can’t wait 12 months? Then join us.

We’re (usually) hiring. Search “data science”, “machine learning”, “analytics” or similar on the BBC’s careers site to find out more.

https://careerssearch.bbc.co.uk/jobs/custom/?fields%5b32%5d=581
