Data Science and Machine Learning Course

Today, the BBC Datalab team is releasing onto the open web the course materials it developed for newcomers to data science at the BBC.

Data science, and machine learning in particular, are trending like never before. NIPS tickets sold out in just 12 minutes (!) this year, quicker than Burning Man, and if anyone still needs convincing, check out this plot:

It shows us that, as recently as this year, both terms overtook the word “hipster” in web searches. While this may not be great news for avocado sales, it does look like a good time to be someone working with data! But, as ever, with exposure often comes confusion. So first, let’s try to get some definitions straight.

The BBC handles terabytes (read: “a lot”) of new data every day, and data science is the field of study that helps us use that data to make better decisions and deliver greater value to our audiences. Machine learning, meanwhile, is just one class of statistical techniques widely used in data science that, as it turns out, is particularly powerful. However, it is not magic. Instead, it is a collection of tools and techniques that use mathematics to draw conclusions from data.

It could be argued that nothing signals the need to publish a training course more clearly than the moment a topic overtakes, in web traffic, a cool word that millennials (used to?) use. But our team did have other reasons for opening this course up. Here are just three:

  1. To get people excited about learning more about data science and machine learning, particularly when data literacy is so high on many employers’ wish lists.
  2. To share some of the interesting problems faced by the BBC in the data space.
  3. To demonstrate how large organisations such as the BBC can use their audience’s data to generate a positive impact.

Continuing to create engaging content while the expectations of our (especially younger) audience members are constantly shifting is one of our greatest challenges. To ensure we meet these new expectations, it is important for us to analyse and understand the conditions and behaviour patterns that lead to more, or less, engagement.

In this course, we use data to explore the question: “What makes BBC audiences engaged?” If you have at least a basic understanding of Python programming (statistics would be a bonus!) and a healthy interest in taking your first steps in data science, we think you will have a lot of fun exploring our course.

We take readers on a four-part journey that follows the typical pattern of many data-led projects. Each part is expected to take around one hour to complete, and the four parts focus in turn on data exploration, data transformation, classification models and regression models.

In data exploration, we first look at how to formulate our data science problem and perform some preliminary analyses to get a better feel for our data. In data transformation, we then introduce readers to some basic machine-learning theory and walk through how we prepare our data for ingestion into our statistical models. In the final two parts (classification and regression), the actual machine learning starts: we explain how to train, evaluate and choose the most appropriate model for our purpose.
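To give a flavour of the four stages described above, here is a minimal Python sketch. It is not taken from the course materials: the data is synthetic, the field names (`sessions`, `minutes_watched`, `engaged`) and every helper function are hypothetical, and the two models (a one-feature threshold classifier and a straight-line fit) are deliberately simple stand-ins chosen so the example runs with the standard library alone.

```python
import random
import statistics


def make_users(n, seed=0):
    """Generate made-up user records (synthetic, NOT real BBC data)."""
    rng = random.Random(seed)
    users = []
    for _ in range(n):
        sessions = rng.randint(1, 30)
        minutes = sessions * rng.uniform(5, 40)
        users.append({"sessions": sessions, "minutes_watched": minutes})
    return users


# 1. Data exploration: summary statistics to get a feel for the data.
def explore(users):
    minutes = [u["minutes_watched"] for u in users]
    return {"mean": statistics.mean(minutes), "median": statistics.median(minutes)}


# 2. Data transformation: standardise a feature and derive a target label.
def transform(users, threshold):
    sessions = [u["sessions"] for u in users]
    mu, sigma = statistics.mean(sessions), statistics.pstdev(sessions)
    return [
        {
            "sessions_scaled": (u["sessions"] - mu) / sigma,
            "minutes_watched": u["minutes_watched"],
            "engaged": u["minutes_watched"] > threshold,  # hypothetical label
        }
        for u in users
    ]


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 3. Classification: predict "engaged" when the scaled feature exceeds a cut.
def accuracy(rows, cut):
    hits = sum((r["sessions_scaled"] > cut) == r["engaged"] for r in rows)
    return hits / len(rows)


def fit_stump(train):
    """Pick the cut point with the best accuracy on the training rows."""
    candidates = sorted({r["sessions_scaled"] for r in train})
    return max(candidates, key=lambda cut: accuracy(train, cut))


# 4. Regression: least-squares line for minutes_watched ~ sessions_scaled.
def fit_line(train):
    xs = [r["sessions_scaled"] for r in train]
    ys = [r["minutes_watched"] for r in train]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def mean_abs_error(rows, slope, intercept):
    errors = [
        abs(r["minutes_watched"] - (slope * r["sessions_scaled"] + intercept))
        for r in rows
    ]
    return statistics.mean(errors)


users = make_users(200)
summary = explore(users)
rows = transform(users, threshold=summary["median"])
train, test = split(rows)

cut = fit_stump(train)
slope, intercept = fit_line(train)

# Evaluate both models on held-out rows they were not fitted on.
print("exploration summary:", summary)
print("classifier test accuracy:", accuracy(test, cut))
print("regression test MAE:", mean_abs_error(test, slope, intercept))
```

Evaluating on held-out rows rather than on the training rows is the part worth noticing: it is what makes comparing and choosing between candidate models meaningful.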

The dataset that you get to work on contains the logs from 10,000 BBC iPlayer users. As you might expect, a public service broadcaster like the Beeb doesn’t take data privacy lightly. So while the dataset we use is ‘real’, you can be sure that we have enforced particularly strong anonymisation so that the identity of the users is impossible to recover.

With an introductory course like this, we don’t expect to make anyone an expert in data science overnight. However, we do hope that those of you who do take it will have a lot of fun, while also gaining some valuable insight into the decisions we make when working with data at the BBC.

The topics covered in the online course only scratch the surface of the data science problems we encounter daily at the BBC. If you would like to find out more about the challenges we face and how we are using data science and machine learning to find innovative solutions to connect with audiences, please get in touch!

We are always looking to improve the content of the course, so if you have any feedback or ideas for further instalments we would really like to hear from you.

Link to course: https://github.com/bbc/datalab-ml-training

And if you found this training easy and had fun doing it, why not join us? https://findouthow.datalab.rocks/