The BBC is on a journey to become more personalised, and recommendations are an important part of that goal. To date, recommendations in the BBC have been provided primarily by external providers. We feel that offering — and understanding — good recommendations is a crucial area for us in reaching our target audience of young listeners, and so we have started exploring this area in-house. The Datalab team is a relatively new team that specialises in machine learning and looks after recommender systems at the BBC. We work with product groups to develop new ways to personalise their offerings, and also collaborate with BBC R&D.
We want to be able to explain the composition of our recommendations and so we need to understand how they are generated. Our recommendations should reflect the breadth and diversity of our content and meet our editorial guidelines, as well as informing, educating and entertaining! All these were good reasons for us to build the capability to constantly create challengers to the existing recommendation models.
Datalab was assigned this brilliant and fun challenge and began collaborating with the Sounds team, using a multidisciplinary group made up of data scientists, engineers, editorial specialists and product managers.
The team had some prior experience building personalised recommendations for our video clip app BBC+. For BBC+, the recommender was purely content based, using existing metadata information such as genres (e.g. Drama/Medical) or brands (e.g. Glastonbury Festival). This would probably have been a good approach if our content had been labelled for the express purpose of personalisation. However, the BBC’s production workflows were designed to meet the needs of broadcast systems, and we didn’t always have all the labels we would have wanted for recommendations.
Factorisation Machines come with the enticing promise of combining content-based recommendations with collaborative filtering.
Using a standard content-based approach, if a user had listened to podcasts from the genre ‘Health & Wellbeing’, the system would recommend a new episode of Radio 1’s Life Hacks, but it could also recommend Radio 4’s Inside Health, which has a very different tone of voice. By contrast, collaborative filtering matches programmes based on what similar users have enjoyed: if they listen to Radio 1’s Life Hacks, they might be recommended Radio 1 comedy. This model relies on ‘adjacent’ content, much like the ‘customers who bought this also bought that’ recommendations on shopping websites. It often produces better recommendations for established content, but is less effective for fresh content that hasn’t yet been consumed by significant numbers of people (the classic ‘cold start’ problem). Since the BBC continuously produces new content throughout the day, this recommendation strategy by itself would be limiting.
Factorisation machines are a smart way to combine both. They have been around for a few years, and open-source toolboxes exist to support them. Our team programs primarily in Python, so we wanted a toolbox that integrates well with it. Obviously, we also wanted it to be fast, give superior results and be easy to use (more on that later…).
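Under the hood, a factorisation machine scores a single feature vector (user id, item id, genre and brand indicators, all concatenated and mostly one-hot) with a linear term plus pairwise interactions between learned latent vectors; Rendle’s identity makes the interaction term computable in time linear in the number of features. As a rough illustration of the scoring function only (not our production code, and with hypothetical variable names), the prediction can be sketched in a few lines of NumPy:

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Factorisation machine prediction for one feature vector x.

    x  : (n,)   feature values (one-hot user, item, genre, brand, ...)
    w0 : scalar  global bias
    w  : (n,)   per-feature linear weights
    V  : (n, k) latent factor vectors, one k-dimensional row per feature
    """
    linear = w0 + w @ x
    # Pairwise interactions sum_{i<j} <v_i, v_j> x_i x_j, rewritten via
    # Rendle's identity as 0.5 * sum_f [(sum_i V[i,f] x_i)^2
    #                                   - sum_i V[i,f]^2 x_i^2],
    # which costs O(n*k) instead of O(n^2 * k).
    s = V.T @ x                 # (k,) factor-wise sums
    s2 = (V ** 2).T @ (x ** 2)  # (k,) factor-wise sums of squares
    return linear + 0.5 * np.sum(s ** 2 - s2)
```

Because every feature, including an item’s genre or brand, gets its own latent vector, a brand-new programme can still be scored through its metadata. That is what lets the model blend content-based signals with collaborative ones.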
We stored user-item interactions (i.e. the programmes a specific user has listened to) in a BigQuery table. The programme items with the corresponding genre and brand metadata were in a different table, and both needed to be assembled in the correct format for the factorisation machines. Our first choice of toolbox was xlearn. The code seemed relatively mature, running a first test example was easy, and the toolbox offers a variety of learning algorithms. But it was hard to get the data into the correct format and, even now that we have a version up and running, we’re still not sure we got everything right — mainly because the initial results are nowhere near as good as we had wanted (and expected) them to be!
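For the curious, xlearn reads its training data from text files in libsvm-style format (one `label index:value` line per interaction), so the join of interactions and metadata ends up as a sparse one-hot encoding with a global feature index. A toy sketch of that encoding step, with made-up id-to-index mappings (our real pipeline is more involved):

```python
def to_libsvm(label, user_id, item_id, genres,
              user_index, item_index, genre_index):
    """Encode one interaction as a libsvm line: 'label idx:1 idx:1 ...'.

    Feature indices are laid out in blocks: all users first,
    then all items, then all genres (each block one-hot encoded).
    """
    n_users, n_items = len(user_index), len(item_index)
    cols = [user_index[user_id],
            n_users + item_index[item_id]]
    cols += [n_users + n_items + genre_index[g] for g in genres]
    return str(label) + " " + " ".join(f"{c}:1" for c in sorted(cols))
```

Getting this block layout (and keeping it consistent between training and prediction) right is exactly the fiddly part that made us unsure whether our xlearn setup was correct.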
The quality of recommendations can be subjective and we needed a way to test them before making them live anywhere on the BBC’s websites or apps. Predicting past behaviour is one way of doing this, but also comes with all sorts of problems: users only click on what they see, a piece of content might be brilliant, but if it does not appear in the results, the user will not see it and cannot click on it. Recommending the most popular items generally gives good numbers (as by definition these items get the most clicks), but rarely leads to recommendations of fresh content. In practical terms, it’s also a lot of work to set up if your data is stored in ways that were not devised with the ease of access for data scientists in mind…
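To make ‘predicting past behaviour’ concrete: a typical offline metric holds out part of each user’s listening history, asks the model for its top-k recommendations, and counts how often a held-out programme actually appears. A minimal hit-rate sketch, with hypothetical inputs (illustrative only; it inherits all the biases described above):

```python
def hit_rate_at_k(recommended, held_out, k=5):
    """Fraction of users whose top-k recommendations contain at least
    one held-out item that the user really did go on to play.

    recommended : dict of user -> ranked list of item ids
    held_out    : dict of user -> set of item ids hidden from training
    """
    hits = sum(1 for user, recs in recommended.items()
               if set(recs[:k]) & held_out.get(user, set()))
    return hits / len(recommended)
```

A popularity-only recommender tends to score deceptively well on a metric like this, which is one reason we did not want to rely on it alone.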
So we decided to test the results using qualitative evaluation, asking about 20 editorial and other non-technical people to judge our new recommendations against those from an existing provider. We didn’t tell them which set came from which recommender! We used the individual history of the internal test participants to generate the recommendations by both providers and asked for their preference and general feedback.
Most of our test users preferred the recommendations we currently have live to our first set of test recommendations and we weren’t keen on them either, so we knew we had more work to do.
With the overall infrastructure set up, it was quite easy to swap out the toolbox we used for the factorisation machines. We had previously looked at lightFM, and it had a much simpler data format, so we decided to give it a go. We were able to compute new recommendations and run another qualitative experiment in less than two weeks. Our recommendations looked much better, and our test users agreed! However, these are still first results. We don’t feel we’ve fully solved the problem of recommending popular items versus programmes that are strongly tailored towards a specific user’s interests, and we are looking into ways to improve this.
We are happy with the results so far, but there is still a lot of work to do to bring the recommender into production. The infrastructure needs decidedly more work to make it robust and able to scale, and we’d like to do more testing. Having a variety of offline metrics should help us to optimise parameters, and test new algorithms without having to go back to our testing panels every few days. We’re also still looking at a simple content-based recommender to have another baseline, so more results hopefully soon.
We also still have some more fundamental questions that we hope our practical work will help us to answer. For example, can we use the same approach for recommending entertainment as for news, or do we need specialised systems for each domain? And what if we change the medium and move from audio and video to text, or new interfaces like voice controlled devices? Even if the overall editorial guidelines do not change, we might need different technical approaches to be able to achieve them. But we also want to avoid starting from scratch for every new recommender we build, and we’re still trying to figure out how best to do that. In summary, there is lots to do, but it’s exciting and we’re enjoying the challenge!
Want to work with us? https://findouthow.datalab.rocks/