Published in BBC Data Science

A Personalised Recommender from the BBC

The BBC produces fantastic content that appeals to a mass audience, such as Killing Eve and Bodyguard on BBC iPlayer, the Danger Mouse games for CBBC, and Match of the Day on BBC Sport, to name a few. Episode 1 of Killing Eve, for example, attracted a whopping 26% of the TV audience during first transmission. While it’s great that we can produce content that is enjoyed by so many different people, it does create data science challenges around understanding the personalities and preferences of individuals.

Term frequency-inverse document frequency (tf-idf) is a metric most associated with text analysis and, in particular, with rudimentary search engines. It assigns a weight to each text document that tells us how relevant that document is to a particular search term. The success of tf-idf is largely down to the inverse document frequency part, which penalises popular words in the search term. Common words such as “the” carry far less information than more niche words like “BBC”. This is how tf-idf differs from simply counting the occurrences of the search terms in each document.

Here at the BBC, we’re using tf-idf for an entirely different application: recommender systems.

By analysing how our audience interacts with our content we can infer similarity between different TV shows on BBC iPlayer, or articles on BBC News. This allows us to make more relevant recommendations and improve the experience that we offer. In terms of content association, our top hitters that have mass appeal reveal less information about a user’s specific interests. Hence, tf-idf allows us to make recommendations that are relevant on an individual user basis rather than just serving up our top content in all recommendations.

In the next post in this series I will discuss how, with the addition of a network graph community detection algorithm, we can use tf-idf to identify topics or genres of content that are defined by user behaviours. This allows us to enrich our metadata, improve recommendations and audience segmentations.

Overview of tf-idf

Let’s begin with an overview of the standard version of tf-idf. In the context of search engines, tf-idf can be used to rank the relevance of a collection of documents relative to a specific search term. It has two stages: the first simply counts how often a word appears within each document (term frequency), and the second penalises popular words that appear in large numbers of documents (inverse document frequency).

Term frequency is a measure of how important a search term is to a specific document. The simplest choice is to count how many times a particular word, t, occurs in the document. Term frequency assigns a specific value to each and every document, d, and is collection-agnostic, meaning it doesn’t know, or care, about the other documents within the collection.

Inverse document frequency is a way of penalising frequently occurring words that appear in many documents, such as the word “the”. This prevents searches for terms such as “the BBC” from being dominated by documents that simply have the most occurrences of “the” but never mention the word “BBC”. In essence, the word “BBC” carries more information than the word “the” and we want our ranking to reflect that. Inverse document frequency depends on the overall collection of documents that we’re searching, as well as the term or word that we are searching for. The value is the same across all documents within the collection.

The idf is calculated by taking the fraction of all documents in our collection, D, that contain the word, inverting it, and then taking the logarithm.
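In symbols, and assuming the usual textbook form with a natural logarithm (the base is not stated in the text), the idf of a word t over the collection D can be written as follows, where |D| is the number of documents in the collection and the denominator counts the documents that contain t:

```latex
\mathrm{idf}(t, D) = \log\left( \frac{|D|}{\left|\{\, d \in D : t \in d \,\}\right|} \right)
```

A word that appears in every document has a fraction of one, so its idf is log(1) = 0.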

Term frequency-inverse document frequency is the product of the term frequency and the inverse document frequency: tf-idf(t, d, D) = tf(t, d) × idf(t, D).
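To make the text version concrete, here is a minimal Python sketch that ranks three toy documents for the search “the BBC”. The documents are invented purely for illustration, the logarithm is assumed to be natural, and summing the tf-idf weights over the words in the query is one simple way to combine them rather than something prescribed above.

```python
import math

# Toy corpus, invented for illustration only.
documents = {
    "doc_a": "the bbc makes the news".split(),
    "doc_b": "the cat sat on the mat by the door of the house".split(),
    "doc_c": "the weather is mild".split(),
}


def tf(term, doc):
    """Raw count of `term` in a single document."""
    return doc.count(term)


def idf(term, docs):
    """log(number of documents / number of documents containing `term`).

    Assumes the term appears in at least one document.
    """
    containing = sum(1 for doc in docs.values() if term in doc)
    return math.log(len(docs) / containing)


def score(query, doc_id, docs):
    """Sum of tf-idf weights over the query words (an assumed combination rule)."""
    return sum(tf(term, docs[doc_id]) * idf(term, docs) for term in query)


query = ["the", "bbc"]

# Ranking by raw counts alone versus ranking by tf-idf.
count_only = {d: sum(tf(t, documents[d]) for t in query) for d in documents}
scores = {d: score(query, d, documents) for d in documents}

print(count_only)
print(scores)
```

With raw counts, doc_b comes out on top because it contains “the” four times, even though it never mentions “bbc”. With tf-idf, “the” appears in every document and so contributes nothing, leaving doc_a as the only document with a non-zero score.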

Using tf-idf for Content Association

A simple example of tf-idf

As a simple use case I will be demonstrating tf-idf on just 3 games and 3 users. The games I have chosen are from 3 of our most popular brands: The Worst Witch, Danger Mouse and Dennis and Gnasher Unleashed. We first introduce the notions of a word, document and corpus for our content.

Word: A word in this context is the full name of a content item such as “Worst Witch Game” or “Danger Mouse Game”.

Document: If we first define a cohort of users that have interacted with a particular game, then a document is a list of all content consumed by these users. For example, suppose we have 3 users’ histories:

User 1:

Danger Mouse Game

Danger Mouse Game

User 2:

Danger Mouse Game

Dennis and Gnasher Unleashed: Leg It Game

Dennis and Gnasher Unleashed: Leg It Game

User 3:

Dennis and Gnasher Unleashed: Leg It Game

Dennis and Gnasher Unleashed: Leg It Game

Worst Witch Game

Worst Witch Game

Worst Witch Game

Only User 3 has played the Worst Witch Game so the document associated with the Worst Witch Game would just be User 3’s history:

Dennis and Gnasher Unleashed: Leg It Game, Dennis and Gnasher Unleashed: Leg It Game, Worst Witch Game, Worst Witch Game, Worst Witch Game

The Dennis and Gnasher Unleashed: Leg It Game, however, is played by both User 2 and User 3 so the Dennis and Gnasher Unleashed: Leg It Game document would be User 2 and 3’s histories concatenated:

Danger Mouse Game, Dennis and Gnasher Unleashed: Leg It Game, Dennis and Gnasher Unleashed: Leg It Game, Dennis and Gnasher Unleashed: Leg It Game, Dennis and Gnasher Unleashed: Leg It Game, Worst Witch Game, Worst Witch Game, Worst Witch Game

Similarly, the Danger Mouse document would be the history of both Users 1 and 2:

Danger Mouse Game, Danger Mouse Game, Danger Mouse Game, Dennis and Gnasher Unleashed: Leg It Game, Dennis and Gnasher Unleashed: Leg It Game

Corpus: The corpus is the collection of all such lists, with one list of concatenated user histories for each content item.
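The three definitions above can be sketched in a few lines of Python. The user histories are copied from the example; the function name `build_corpus` and the dictionary layout are my own choices for illustration.

```python
DM = "Danger Mouse Game"
DG = "Dennis and Gnasher Unleashed: Leg It Game"
WW = "Worst Witch Game"

# User histories from the example above.
histories = {
    "User 1": [DM, DM],
    "User 2": [DM, DG, DG],
    "User 3": [DG, DG, WW, WW, WW],
}


def build_corpus(histories):
    """Map each content item to the concatenated histories of its cohort.

    A user belongs to an item's cohort if the item appears anywhere in
    their history, and their whole history is then added to its document.
    """
    corpus = {}
    for history in histories.values():
        for item in set(history):
            corpus.setdefault(item, []).extend(history)
    return corpus


corpus = build_corpus(histories)
for item, document in corpus.items():
    print(item, "->", document)
```

Each word (content item) ends up with exactly one document, and a user’s history is repeated in every document whose cohort they belong to.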

In standard text-based tf-idf applications, the term frequency can be unfairly biased towards longer documents, because a longer article is likely to contain more occurrences of the search terms. Our content association problem is no different. In this case the challenge arises from the fact that more popular content is consumed by more people who, in general, will have longer collective histories. There is an easy fix, which exploits the fact that we have far fewer content items than there are words in the English language. We can therefore count the occurrences of each content item in a cohort’s history and then normalise, essentially calculating the fraction that each individual item contributes to the combined history.

For our 3 users example above we could write our documents in the form

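TABLE 1: Frequency of games (columns) that are played by different cohorts (rows)

Cohort | Danger Mouse Game | Dennis and Gnasher Unleashed: Leg It Game | Worst Witch Game
Danger Mouse Game | 3 | 2 | 0
Dennis and Gnasher Unleashed: Leg It Game | 1 | 4 | 3
Worst Witch Game | 0 | 2 | 3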

and then divide each row by the row sum to get

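TABLE 2: Fraction of a cohort’s history (row) that is comprised of each game (column)

Cohort | Danger Mouse Game | Dennis and Gnasher Unleashed: Leg It Game | Worst Witch Game
Danger Mouse Game | 0.6 | 0.4 | 0
Dennis and Gnasher Unleashed: Leg It Game | 0.125 | 0.5 | 0.375
Worst Witch Game | 0 | 0.4 | 0.6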

If we take people who have, at some point, played the Worst Witch Game, we can see that the Worst Witch Game makes up 60% of all the games they play, while the remaining 40% are the Dennis and Gnasher Game.
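The counting and normalising steps behind Table 1 and Table 2 can be sketched as follows. The documents are the concatenated histories from the three-user example; the variable names are illustrative only.

```python
from collections import Counter

DM = "Danger Mouse Game"
DG = "Dennis and Gnasher Unleashed: Leg It Game"
WW = "Worst Witch Game"
GAMES = [DM, DG, WW]  # column order

# One document per game: the concatenated histories of its cohort.
documents = {
    DM: [DM, DM, DM, DG, DG],
    DG: [DM, DG, DG, DG, DG, WW, WW, WW],
    WW: [DG, DG, WW, WW, WW],
}

# Table 1: raw play counts, one row per cohort and one column per game.
counts = {}
for cohort, document in documents.items():
    tally = Counter(document)
    counts[cohort] = [tally[game] for game in GAMES]

# Table 2: each row divided by its row sum.
fractions = {
    cohort: [n / sum(row) for n in row] for cohort, row in counts.items()
}

for cohort in GAMES:
    print(cohort, counts[cohort], fractions[cohort])
```

The Worst Witch row comes out as counts of 0, 2 and 3, which normalise to 0, 0.4 and 0.6, matching the 60% and 40% quoted above.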

The inverse document frequency part is obtained by calculating the fraction of non-zero entries in each column of Table 1 or Table 2 above. This tells us what proportion of cohorts consume a particular piece of content. As in the standard version, we then take the inverse of this fraction and its logarithm.
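As a rough sketch of this step, the snippet below computes the idf for each column of the three-user example. The counts are those implied by the user histories above; the natural logarithm is an assumption, since the base is not stated.

```python
import math

GAMES = ["Danger Mouse", "Dennis and Gnasher", "Worst Witch"]

# Play counts from the three-user example: rows are cohorts, columns are games.
counts = [
    [3, 2, 0],  # Danger Mouse cohort
    [1, 4, 3],  # Dennis and Gnasher cohort
    [0, 2, 3],  # Worst Witch cohort
]


def idf(table):
    """idf per column: log(number of rows / number of non-zero entries)."""
    n_rows = len(table)
    weights = []
    for column in zip(*table):
        n_nonzero = sum(1 for value in column if value > 0)
        weights.append(math.log(n_rows / n_nonzero))
    return weights


idf_by_game = dict(zip(GAMES, idf(counts)))
print(idf_by_game)
```

Dennis and Gnasher is played by every cohort, so its idf is log(3/3) = 0, while the other two games are each played by two of the three cohorts and get log(3/2).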

One situation that can arise is that for every pair of content items, there is at least one person who has consumed both. In this event the idf, and consequently the tf-idf, will be zero. To avoid this issue, a threshold can be set for either how many plays a game has to have had (using Table 1) or a lower limit on the proportion of a document consisting of a content item (using Table 2). So the idf would be the fraction of entries in a column above a certain threshold.

The choice of threshold becomes a hyperparameter that can be tuned. A larger threshold will penalise popular content more strongly, but this can often be associated with less popular content items having zeros in every row, resulting in an infinite idf. Hence the need for tuning.
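A thresholded idf might look something like the sketch below, again using the counts from the three-user example. The function name `thresholded_idf`, the strict “greater than” comparison and the use of `math.inf` for columns with no entries above the threshold are all my own assumptions for illustration.

```python
import math

# Play counts from the three-user example: rows are cohorts, columns are games
# (Danger Mouse, Dennis and Gnasher, Worst Witch).
counts = [
    [3, 2, 0],
    [1, 4, 3],
    [0, 2, 3],
]


def thresholded_idf(table, threshold):
    """idf per column, counting only entries strictly above `threshold`."""
    n_rows = len(table)
    weights = []
    for column in zip(*table):
        n_above = sum(1 for value in column if value > threshold)
        # If no cohort clears the threshold the fraction is zero and the
        # idf is infinite, which is the failure mode described above.
        weights.append(math.log(n_rows / n_above) if n_above else math.inf)
    return weights


for threshold in (0, 2, 3):
    print(threshold, thresholded_idf(counts, threshold))
```

With a threshold of 0 this reduces to the non-zero count and Dennis and Gnasher has an idf of zero. At a threshold of 2 only one cohort has more than two Dennis and Gnasher plays, so its idf becomes log(3). At a threshold of 3 the Danger Mouse and Worst Witch columns have no entries left and their idf is infinite.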

The term frequency-inverse document frequency can then be calculated by just multiplying the tf and the idf.
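Putting the two parts together for the three-user example gives the sketch below. It uses the normalised fractions as the term frequency and the non-zero version of the idf with a natural logarithm; both choices follow the description above, but the code itself is only illustrative.

```python
import math

# Fraction of each cohort's history made up of each game, from the
# three-user example. Rows are cohorts, columns are games
# (Danger Mouse, Dennis and Gnasher, Worst Witch).
fractions = [
    [0.6, 0.4, 0.0],
    [0.125, 0.5, 0.375],
    [0.0, 0.4, 0.6],
]

n_cohorts = len(fractions)

# idf per column: log(number of cohorts / number of non-zero entries).
idf = [
    math.log(n_cohorts / sum(1 for value in column if value > 0))
    for column in zip(*fractions)
]

# tf-idf: multiply every entry by the idf of its column.
tf_idf = [[tf * weight for tf, weight in zip(row, idf)] for row in fractions]

for row in tf_idf:
    print([round(value, 3) for value in row])
```

The Dennis and Gnasher column is zero in every row because every cohort has played it, in the same way that “the” carries no weight in a text search.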

It is often useful to normalise the resulting tf-idf weights in some way. This can be achieved by, for example, dividing by the mean or rescaling to give a mean of zero and a standard deviation of one.
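Both options can be sketched in a few lines of Python. Which set of weights gets normalised together (a single row or the whole matrix) is not specified above, so the example simply takes a list of weights; the use of the population standard deviation is likewise an assumption.

```python
import math
import statistics


def divide_by_mean(weights):
    """Rescale so the weights average to one (assumes a non-zero mean)."""
    mean = statistics.fmean(weights)
    return [w / mean for w in weights]


def standardise(weights):
    """Rescale to mean zero and standard deviation one.

    Assumes the weights are not all equal. The population standard
    deviation is used here, which is an assumption.
    """
    mean = statistics.fmean(weights)
    sd = statistics.pstdev(weights)
    return [(w - mean) / sd for w in weights]


# tf-idf weights for the Dennis and Gnasher cohort in the three-user example.
row = [0.125 * math.log(1.5), 0.0, 0.375 * math.log(1.5)]

relative = divide_by_mean(row)
z_scores = standardise(row)
print(relative)
print(z_scores)
```

Dividing by the mean keeps zeros at zero and makes weights comparable across rows of different overall size, whereas standardising centres the weights so that negative values indicate a below-average association.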

A Real Example applied to Content Association

I will now demonstrate the effectiveness of tf-idf on a real example using traffic to 3 Blue Peter Quizzes and 3 of CBBC’s top performing games.

The 3 quizzes in my data set are:

Could you bee a bumblebee Quiz?

Ultimate Unicorn Quiz

Duckling Quiz

and the 3 games I shall be looking at are:

Danger Mouse Game

Worst Witch Game

Dennis and Gnasher Unleashed: Leg It Game

Content Popularity Recommender

First, we will look at what would happen if we simply recommended the most popular content on each page. For this we can use the term-frequency data equivalent to the example given in Table 1. The heat map below shows this; each row indicates how popular different content items are among people who play the content item named on the y-axis. What we see is that there is a large amount of overlap between the 3 games (bottom right 9 squares are dark blue). However, we can also see that the most popular content among people who played the quizzes is also the 3 games (top right 9 squares are darker than the top left 9 squares). A recommender based purely on content popularity would simply recommend the top 3 games on each page with no personalisation.

Content association based on frequency of co-consumption. Dark indicates high values.

Tf-idf Recommender

If we plot the equivalent heat map using our tf-idf weights, we arrive at one like that below. We now see an increase in association between the 3 quizzes (top left 9 squares are now darker than the top right 9 squares). This is a result of the inverse document frequency penalising the weights for the 3 games due to their mass popularity among all users. What we get in this case is 2 subcategories or genres within our content: quizzes and games.

If we used our tf-idf weights in a content recommender system, we’d recommend more Blue Peter quizzes to people who complete a Blue Peter quiz and more games to those who play one of the games.


Using tf-idf weights as measures of content association is a really effective method of identifying meaningful collections of content without the bias of popular content. This can form the basis of recommender systems or be used to enrich metadata based on our audience’s tastes and interests. Both of these help us serve the most relevant content to our users.

In the next article in this collection, I’ll discuss how to apply network community detection to formally identify clusters of content. We use this approach for audience segmentation and creating behaviour-driven genres for enriching our metadata.





Matt Crooks


Senior Data Scientist at the TypeForm
