(Machine) Learning from YouTube8M dataset

Forecasting video renditions from descriptive labels

Epic Labs
Oct 31, 2018


Introduction

YouTube-8M (here) is one of Google’s latest contributions to research in Machine Learning and Artificial Intelligence.

Starting in 2017, it is a dataset of 8 million video IDs labeled with more than 4000 classes, released as raw data for a Kaggle competition. This year, 2018, competing data scientists have been summoned once again, and the dataset now contains 6.1 million videos and slightly fewer classes (only 3862). Videos are selected according to criteria such as being public, having at least 1000 views, lasting between 120 and 500 seconds, and being associated with at least one entity from the target vocabulary (one of the 3862 words mentioned above).

The purpose of the challenge is to forecast the class or classes with which a video will be associated by inspecting its internal frame structure and associated labels. Google even provides some tools (here) with the goal of facilitating the task and raising the level of the results.

Last September 9th, a workshop on the experiences of competitors and organizers of the challenge took place in Munich: The 2nd Workshop on YouTube-8M Large-Scale Video Understanding.

Although this year Epic Labs was too busy to attend (we were in IBC!!!), we still wanted to make some contribution to the community and help spread the word about this awesome dataset.

With the launch of Epic Labs’ LightFlow and its Smart Encoding capabilities, we felt inspired to explore the data. For a while now, we have had the intuition that different kinds of videos have different optimal encodings, and that this depends to some extent on their content. YT8M’s labeled videos give us the perfect chance to put that hunch to the test scientifically!

We have divided the necessary tasks into three Jupyter Notebooks available in our GitHub repository.

Crawling YouTube videos

YT8M TensorFlow records are organized in chunks, or “shards”, on the YT8M website.

Instructions on how to get them are available at https://research.google.com/youtube8m/download.html. As the whole dataset takes almost 96 GB of our valuable hard disk, we randomly picked three shards for this article, each listing about 1000 video IDs. For privacy reasons, the IDs in the dataset were encoded by YouTube engineers; instructions and further information on this can be found at https://research.google.com/youtube8m/video_id_conversion.html.

In YT8M’s original dataset there are four main blocks at video level (a minimal reading sketch follows the list):

  1. id, containing the encoded YouTube ID of each video, which we will need to convert into an actual, usable URL.
  2. labels, containing a list of indexes into the vocabulary of 3862 words that describe the general idea of the video.
  3. mean_audio, which we will omit; it contains a very long list of floating-point values representing the audio track.
  4. mean_rgb, also omitted for our purposes, containing the video’s RGB values.
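
For reference, here is a minimal sketch for peeking into one of these video-level records with the TensorFlow 1.x API that was current at the time of writing (the shard filename is hypothetical):

    import tensorflow as tf

    # Iterate over one video-level shard and print each video's id and labels.
    # We skip mean_rgb and mean_audio, as we do throughout this article.
    record_path = 'train0000.tfrecord'  # hypothetical shard filename

    for raw_record in tf.python_io.tf_record_iterator(record_path):
        example = tf.train.Example()
        example.ParseFromString(raw_record)
        features = example.features.feature

        video_id = features['id'].bytes_list.value[0].decode('utf-8')
        labels = list(features['labels'].int64_list.value)
        print(video_id, labels)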

Our aim is to explore this dataset, but also to do something more interesting with it than plain video labeling (everyone does that, after all).

We have turned to the awesome youtube-dl library (https://github.com/rg3/youtube-dl) in order to gather valuable metadata associated with each video (duration, title, resolution ladders, etc.). For the interested reader, a Jupyter Notebook with the implementation of the crawler is available here.

We will merge the id and the labels columns with data extracted from youtube-dl, namely creator, duration, ladder and title.
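
As a hedged sketch, this is roughly how those fields can be pulled with youtube-dl’s Python API (the video URL is a placeholder; the field names come from youtube-dl’s info dictionary):

    import youtube_dl

    ydl_opts = {'quiet': True, 'skip_download': True}
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        # download=False fetches only the metadata, not the video itself
        info = ydl.extract_info('https://www.youtube.com/watch?v=VIDEO_ID',
                                download=False)

    creator = info.get('uploader')
    duration = info.get('duration')  # in seconds
    title = info.get('title')
    # the rendition ladder: one (height, average bitrate) pair per video format
    ladder = [(f.get('height'), f.get('tbr'))
              for f in info.get('formats', [])
              if f.get('vcodec') not in (None, 'none')]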

Once the whole notebook has been executed, we have a DataFrame that looks like this (last five entries displayed below):

Nicely enough, the code also creates a .csv file for each YT8M shard that it processes, so we can exchange information between different notebooks.

Analyzing labels and bitrate ladders

Once the dataset is created, we can begin to explore relationships among labels. In this article, Google AI’s research group explains how labels are distributed across the total video population (the paper refers to the 2017 competition). Their word cloud below shows this distribution, using font size as a proxy for frequency.

Interestingly, the most common labels across all videos appear to be Vehicle and Animation, followed by Video-game, Music-video, Concert, Dance or Football.

Our notebook for this part can be found here.

Label analysis

As explained above, we will be using only a few shards (containing around 3000 video items) to run our experiments. From these, we are mostly interested in videos whose resolution ladders contain all of the renditions 144p, 240p, 360p, 480p, 720p and 1080p. This shrinks our population even further, to a more manageable 583 elements.
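
As a sketch, the filtering step looks roughly like this (the CSV filename and the resolutions column are assumptions about how the crawler stored its output):

    import ast

    import pandas as pd

    REQUIRED = {144, 240, 360, 480, 720, 1080}

    df = pd.read_csv('yt8m_shard_metadata.csv')  # hypothetical crawler output

    # resolution ladders were serialized to the CSV as strings like '[144, 240, ...]'
    def has_full_ladder(ladder_str):
        return REQUIRED.issubset(set(ast.literal_eval(ladder_str)))

    subset = df[df['resolutions'].apply(has_full_ladder)].reset_index(drop=True)
    print(len(subset))  # 583 videos in our case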

With our reduced version of the dataset, we can still create a word cloud, in which the similarities to the 2017 edition are evident:

We can’t claim the exact same frequency distributions, but we are close to it:

Note that the YT8M index column comes directly from the index assigned by the research group in their vocabulary file (label_names_2018.csv), and labels are likely given in order of occurrence (but this is only a guess).
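
For reference, the frequencies behind our word cloud can be counted with a few lines of pandas (the column layout of the vocabulary file and of our CSV is an assumption):

    import ast
    from collections import Counter

    import pandas as pd

    # map YT8M vocabulary indexes to human-readable label names
    vocab = pd.read_csv('label_names_2018.csv')
    index_to_name = dict(zip(vocab.iloc[:, 0], vocab.iloc[:, 1]))

    videos = pd.read_csv('yt8m_shard_metadata.csv')  # hypothetical crawler output

    counts = Counter()
    for label_str in videos['labels']:
        for idx in ast.literal_eval(label_str):  # labels serialized as '[1, 23, ...]'
            counts[index_to_name.get(idx, 'unknown')] += 1

    print(counts.most_common(10))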

Bitrates and ladders

Our next move in this exploration is to find out whether there are any correlations between the video labels and the resolution ladders used by YouTube to deliver their videos in an optimized manner.

With LightFlow we analyze the content to provide the best video quality at a lower bitrate, producing an optimal ABR ladder and encoding profile.

A rendition ladder enables a video distributor to optimize the quantity of bits it takes to broadcast a video stream with the least perceptible change in quality. In the image above, we can see how higher bitrates do not necessarily represent a perceptual increase in quality. Our Smart Encoding optimization process (and also YouTube’s, as explained here) will choose the highest quality at the lowest possible bitrate.

On the other hand, we have our YouTube labels and their respective ABR ladders. As explained here, the labels above can be considered categories with which each of our videos can be associated. This means that each label assigns a video to one of a finite set of groups, or categories.

In essence, we need to supply our algorithms with unique numbers that can later be added, multiplied and processed in matrices and arrays, and character strings lack that property. Moreover, these numbers must be chosen carefully, so that they can easily be normalized between 0 and 1 to improve learning speed. The naive approach is to assign an integer to each category, as we have already done, but this would not comply with the second condition of being easy to normalize, as it implies an order that may be nonexistent. Enter embeddings.

As is nicely explained here, we can better understand embeddings with a simple example. If we wanted to encode the two sentences below, we would start by assigning each word an integer:

Welcome to Epic Labs = [1, 2, 3, 4]

Let’s be Epic = [5, 6, 3, 0]

So far, so good. Note that “Epic” appears in the third place, so it is associated with the number 3, and we use 0 for the “unknown” category. Now we need to map this set of numbers, whose sequential order is arbitrary, into another “space” where the dimensions are fixed and subject to a constant truth. Keras has nice modules to do this.

If we wanted to create an embedding layer in Keras with those phrases we would need the following code:

model.add(Embedding(input_dim=7, output_dim=2, input_length=4))

Here, the first argument (input_dim = 7) is the size of the vocabulary, i.e. the number of distinct integer indexes we can feed in (0 through 6 in our example). The second argument (output_dim = 2) is the size of the embedding vectors, i.e. the number of dimensions onto which we want to project our data. The third argument (input_length = 4) indicates the length of the input sentences.

In order to use it for predictions with some input data, we need to invoke the following:

output_array = model.predict(labels_df)

In our toy example, this would take the arrays representing the sentences ([1, 2, 3, 4] and [5, 6, 3, 0]) and return the corresponding array of embeddings: one 2-dimensional vector per word, i.e. an array of shape (2, 4, 2).

Once the network has been trained, we can get the weights of the embedding layer, which in this case will be of size (7, 2) and can be thought of as the table used to map integers to embedding vectors:

So according to these embeddings, our second training phrase would be represented as:

Let’s be Epic = [[0.7, 1.7], [4.1, 2.0], [0.3, 2.1], [1.2, 3.1]]
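
Putting the toy example together, a minimal runnable sketch with tf.keras looks like this (the embedding weights are randomly initialized here, so the actual numbers will differ from the table above until the layer is trained):

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding

    sentences = np.array([[1, 2, 3, 4],   # "Welcome to Epic Labs"
                          [5, 6, 3, 0]])  # "Let's be Epic" (0 = unknown)

    model = Sequential()
    # a vocabulary of 7 indexes (0..6), projected onto 2 dimensions
    model.add(Embedding(input_dim=7, output_dim=2, input_length=4))
    model.compile(optimizer='adam', loss='mse')

    output_array = model.predict(sentences)
    print(output_array.shape)  # (2, 4, 2): 2 sentences x 4 words x 2 dimensions

    embedding_table = model.layers[0].get_weights()[0]
    print(embedding_table.shape)  # (7, 2): one 2-d vector per vocabulary index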

This is exactly what we have already done with our labels. In the table below, we have the first three videos of our YouTube dataset and their respective label indexes:

The snapshot above is one of the frames from the video with ID 1 in the table. It is named CAT D6K LGP at Zielger and is labeled with ‘Vehicle’, ‘Car’, ‘Tractor’, ‘Ballet’, ‘Diatonic button accordion’ and ‘Super Mario Galaxy’.

Leaving aside that on some occasions Google’s automatic labeling can misbehave, at least the first three labels are fairly accurate (but... Super Mario Galaxy?). This is probably why there was another edition of the competition this year.

Nevertheless, considering that there must be some level of coherence, we can still find some interesting interactions between those labels and the ABR ladders.

The figure below displays the plots for both t-SNE (t-distributed Stochastic Neighbor Embedding) and PCA (Principal Component Analysis). t-SNE is a nonlinear dimensionality reduction technique well suited for visualizing data in a low-dimensional space of two or three dimensions, while PCA is mostly used as a tool in exploratory data analysis and for making predictive models. Both are used here to visualize generic distance and relatedness between populations.

The plots have been generated with three distinct subsets of our dataset: one using only the output of the embeddings, another using only the ABR ladders and a third one where both embeddings and ladders have been merged.
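
The projections themselves can be obtained with scikit-learn; here is a sketch for the merged variant (the array names for the per-video label embeddings and ABR ladders are assumptions):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    # label_embeddings: (n_videos, embedding_dim), ladders: (n_videos, 6)
    features = np.hstack([label_embeddings, ladders])

    pca_points = PCA(n_components=2).fit_transform(features)
    tsne_points = TSNE(n_components=2, perplexity=30,
                       random_state=42).fit_transform(features)

    fig, (ax_pca, ax_tsne) = plt.subplots(1, 2, figsize=(12, 5))
    ax_pca.scatter(pca_points[:, 0], pca_points[:, 1], s=8)
    ax_pca.set_title('PCA')
    ax_tsne.scatter(tsne_points[:, 0], tsne_points[:, 1], s=8)
    ax_tsne.set_title('t-SNE')
    plt.show()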

It is possible to identify some clusters, particularly in the PCA representations, but also in the t-SNE of the bitrate ladders. It seems we could build a model that produces ABR ladders from labeled videos!

Forecasting ladders from labels

Encouraged by our findings, the plan now is to feed a predictor with sequences of labels and have it output the optimal ABR ladder, according to YouTube’s algorithms, for the 144p, 240p, 360p, 480p, 720p and 1080p resolutions. We will divide our data into training and test sets in an 80–20 ratio.
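
A sketch of that split with scikit-learn (the array names are assumptions that we will reuse in the snippets below):

    from sklearn.model_selection import train_test_split

    # label_sequences: padded label indexes, durations: seconds, ladders: (n, 6) bitrates
    (labels_train, labels_test,
     duration_train, duration_test,
     ladders_train, ladders_test) = train_test_split(
        label_sequences, durations, ladders, test_size=0.2, random_state=42)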

The Jupyter Notebook with the implementation can be found here. The diagram for our Deep Learning model is displayed below:

The input layers are slightly different from those defined in the previous part for t-SNE and PCA. Together with the embedding layer, we complement the data with the duration of each video. An LSTM then transforms the vector sequence into a single vector containing information about the entire sequence of labels for each video. On top of that we stack a deep, densely connected network and spice it up with dropout layers of different rates to improve generalization and avoid overfitting.

In this case we are interested in forecasting the rendition ladders, so we have put them in the output layer as a fully connected output, using a ReLU activation function because our problem involves prediction of values, not classification.
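
A minimal Keras sketch of an architecture along these lines is shown below; the layer sizes, dropout rates and padded sequence length are illustrative assumptions, not the exact hyperparameters from our notebook:

    from tensorflow.keras.models import Model
    from tensorflow.keras.layers import (Input, Embedding, LSTM, Dense,
                                         Dropout, Concatenate)

    MAX_LABELS = 20        # assumed padded length of each video's label sequence
    VOCAB_SIZE = 3862 + 1  # 3862 vocabulary entities plus a padding/unknown index

    labels_in = Input(shape=(MAX_LABELS,), name='labels')
    duration_in = Input(shape=(1,), name='duration')

    x = Embedding(input_dim=VOCAB_SIZE, output_dim=32)(labels_in)
    x = LSTM(64)(x)                      # collapse the label sequence into one vector
    x = Concatenate()([x, duration_in])  # complement it with the video duration
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.3)(x)
    x = Dense(64, activation='relu')(x)
    x = Dropout(0.2)(x)
    # one bitrate per rendition: 144p, 240p, 360p, 480p, 720p and 1080p
    ladder_out = Dense(6, activation='relu', name='ladder')(x)

    model = Model(inputs=[labels_in, duration_in], outputs=ladder_out)
    model.compile(optimizer='adam', loss='mse', metrics=['mape'])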

Tadaa! We get a nicely converging model, trained with an Adam optimizer on an MSE (mean squared error) loss function, after only 200 epochs. Note the trick, though: in order to avoid overfitting we implemented an early stopper, while the initial epoch setting was 1000.
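
A sketch of the training call with the early stopper (the patience value is an assumption):

    from tensorflow.keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(monitor='val_loss', patience=20,
                               restore_best_weights=True)

    history = model.fit([labels_train, duration_train], ladders_train,
                        validation_split=0.2,
                        epochs=1000,  # the initial setting; training stops much earlier
                        callbacks=[early_stop])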

Let’s see the results for a single video (the 530th of our test dataset, to pick one):

MAPE (Mean Absolute Percentage Error): 2.53%

YouTube labels:

  1. ‘Fishing’

Not bad! A 2.53% error with only the word “fishing”:

But, alas, a wider look at the predictions over the whole dataset may lead us to a less optimistic conclusion:

  • MAPE for resolution 144 of all test set is: 6.43 %
  • MAPE for resolution 240 of all test set is: 8.05 %
  • MAPE for resolution 360 of all test set is: 26.56 %
  • MAPE for resolution 480 of all test set is: 22.84 %
  • MAPE for resolution 720 of all test set is: 28.59 %
  • MAPE for resolution 1080 of all test set is: 29.14 %

This output shows some relevant statistics of our toy dataset. Apparently, with a set of labels it is possible to predict the ABR fairly accurately for the 144p and 240p resolutions, but from 360p upwards the error grows considerably.
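
For reference, the per-resolution figures above can be computed with a small helper like this (array names follow the earlier sketches):

    import numpy as np

    RESOLUTIONS = [144, 240, 360, 480, 720, 1080]

    def mape(y_true, y_pred):
        """Mean Absolute Percentage Error per column, in percent."""
        return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true), axis=0)

    predictions = model.predict([labels_test, duration_test])
    for res, err in zip(RESOLUTIONS, mape(ladders_test, predictions)):
        print('MAPE for resolution {} of all test set is: {:.2f} %'.format(res, err))
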
As a further exercise, we could use the same setup to forecast ABR ladders using not just the labels but also the RGB and audio channels supplied by YT8M, which are likely to give more interesting results. But that’s another story for another time.

About the author


Epic Labs

Epic Labs is a Software Innovation Center focused on the Media space.