YouTube-8M Training & Inference

Warrick
Google Cloud - Community
8 min read · Apr 30, 2020

Computer Vision | Video Understanding

Continuing on from the previous YouTube-8M Dataset post, this one covers model training using what is provided in the getting started section of the GitHub repo. The goal of the models covered here is to find a specific moment within a video, a task called temporal concept localization.

In the past, metadata was used to search for videos. These newer models classify specific segments of a video at the timestamps where topics appear. For example, they can identify every point in a video where chocolate appears, someone is sleeping, or someone is ice skating.

The steps below show how to train example models, evaluate them, and run inference with the code provided by the YouTube-8M project. The example code is written in Python and uses TensorFlow, and it is a good way to get off the ground when working with this dataset.

File Structure Setup

Start by setting up the following directory structure on the server where you will train, evaluate, and run predictions (a small sketch for creating these folders follows the list).

  • ${HOME}/yt8m/code/
  • ${HOME}/yt8m/models/frame/
  • ${HOME}/yt8m/2/frame/train/
  • ${HOME}/yt8m/3/frame/validate/
  • ${HOME}/yt8m/3/frame/test/
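
If you prefer to script the setup, here is a minimal Python sketch for creating this layout (equivalent to running mkdir -p for each path; the paths simply mirror the list above):

import os

# Create the ${HOME}/yt8m directory layout used throughout this post.
home = os.path.expanduser("~")
for path in [
    "yt8m/code",
    "yt8m/models/frame",
    "yt8m/2/frame/train",
    "yt8m/3/frame/validate",
    "yt8m/3/frame/test",
]:
    os.makedirs(os.path.join(home, path), exist_ok=True)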

Starter Code

Pull the starter code from the GitHub repo.

cd ~/yt8m/code && git clone https://github.com/google/youtube-8m.git

Data

Download the data and place it into the folders below for train, validate, and test, or reference it in Google Cloud. More information on the data and where to find it is in the previous post. A small sketch for sanity-checking one downloaded shard follows the list.

Use these folders for storage:

  • ${HOME}/yt8m/2/frame/train/
  • ${HOME}/yt8m/3/frame/validate/
  • ${HOME}/yt8m/3/frame/test/
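
To check that a shard downloaded correctly, the following sketch parses one frame-level record. It assumes the SequenceExample layout described in the previous post ("id" and "labels" in the context, quantized per-frame "rgb" and "audio" bytes in the feature lists) and an illustrative file path:

import os
import tensorflow as tf

# Read a single record from one training shard (path is illustrative).
pattern = os.path.expanduser("~/yt8m/2/frame/train/train*.tfrecord")
dataset = tf.data.TFRecordDataset(tf.io.gfile.glob(pattern)[:1])

for raw_record in dataset.take(1):
    context, sequences = tf.io.parse_single_sequence_example(
        raw_record,
        context_features={
            "id": tf.io.FixedLenFeature([], tf.string),
            "labels": tf.io.VarLenFeature(tf.int64),
        },
        sequence_features={
            "rgb": tf.io.FixedLenSequenceFeature([], tf.string),
            "audio": tf.io.FixedLenSequenceFeature([], tf.string),
        })
    rgb = tf.io.decode_raw(sequences["rgb"], tf.uint8)  # [num_frames, 1024]
    print(context["id"].numpy(), rgb.shape)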

Starter Algorithms

Once the structure is set up and the data is in place, you are ready to start training with the starter algorithms to create a model. The codebase provides a couple of algorithms to experiment with.

Frame-Level Logistic Regression Model

Logistic models are trained in a “one-vs-all” fashion, meaning a separate binary classifier is trained for each of the 1,000 classes, and a segment-wise logistic model is used. The model produces a predicted score for every class on each frame (up to 5 frames per segment), and those per-frame predictions are averaged (mean-pooled) to get the segment-level prediction. This type of model is like having a panel of judges each give a prediction and averaging the results for the final answer. A minimal sketch of the mean-pooling step follows.
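
The sketch below is a hypothetical illustration of that idea, not the repo’s FrameLevelLogisticModel: a shared logistic layer scores every frame, and the scores are mean-pooled over the segment. The feature size matches the rgb (1024) and audio (128) features used in the commands later; the names are illustrative.

import tensorflow as tf

NUM_CLASSES = 1000          # segment-level vocabulary size
FEATURE_SIZE = 1024 + 128   # rgb + audio features per frame

# One shared logistic (sigmoid) layer applied to every frame.
frame_classifier = tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid")

def segment_predictions(frames):
    # frames: [batch, num_frames, FEATURE_SIZE], up to 5 frames per segment.
    frame_scores = frame_classifier(frames)      # [batch, num_frames, NUM_CLASSES]
    return tf.reduce_mean(frame_scores, axis=1)  # mean-pool -> [batch, NUM_CLASSES]

# Example: a batch of 2 segments with 5 frames each.
dummy_frames = tf.random.uniform([2, 5, FEATURE_SIZE])
print(segment_predictions(dummy_frames).shape)   # (2, 1000)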

Deep Bag of Frame (DBoF) Pooling Model

This model was initially inspired by the classic bag-of-words representations for video classification. A set of randomly selected frames from the segment is used as input and their features are extracted. Each frame’s features are passed through a shared “up-projection” layer, applied to every frame with the same weights, that expands them into a much larger representation. This gives a strong frame-level representation of the input features because it provides more feature detail (it increases the size of the input to the next layer). A pooling layer then collapses the per-frame representations into a segment-level result (reducing the size), which feeds a final classifier. So the model expands the parameters at the frame level and then contracts them to the segment level to produce summary predictions for the segment. More inputs and features can slightly improve the results; for example, including both the visual and audio features can help boost the score. A rough sketch of this expand-then-contract structure follows.
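
As a hypothetical sketch (not the repo’s DbofModel), the expand-then-contract idea can be written like this, with CLUSTER_SIZE and HIDDEN_SIZE as illustrative choices:

import tensorflow as tf

NUM_CLASSES = 1000
FEATURE_SIZE = 1024 + 128   # rgb + audio features per frame
CLUSTER_SIZE = 8192         # expanded per-frame representation (illustrative)
HIDDEN_SIZE = 1024          # hidden layer after pooling (illustrative)

# Shared per-frame up-projection ("bag of frames" encoding).
frame_projection = tf.keras.layers.Dense(CLUSTER_SIZE, activation="relu")
hidden_layer = tf.keras.layers.Dense(HIDDEN_SIZE, activation="relu")
classifier = tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid")

def dbof_predictions(frames):
    # frames: [batch, num_frames, FEATURE_SIZE] randomly sampled from the segment.
    projected = frame_projection(frames)       # expand: [batch, num_frames, CLUSTER_SIZE]
    pooled = tf.reduce_max(projected, axis=1)  # contract: [batch, CLUSTER_SIZE]
    return classifier(hidden_layer(pooled))    # [batch, NUM_CLASSES]

dummy_frames = tf.random.uniform([2, 5, FEATURE_SIZE])
print(dbof_predictions(dummy_frames).shape)    # (2, 1000)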

Train

To train the models, move into the youtube-8m folder (which should be under the ~/yt8m/code/ directory if you followed the setup above) before running the commands, or adjust the paths as needed.

Frame-Level Logistic Regression Model

Use the following terminal command to run the frame-level logistic regression model.

python3 train.py --frame_features --model=FrameLevelLogisticModel \
--feature_names="rgb,audio" --feature_sizes="1024,128" \
--train_data_pattern=${HOME}/yt8m/2/frame/train/train*.tfrecord \
--train_dir="${HOME}/yt8m/models/frame/sample_model_logistic" \
--start_new_model

Using the system I set up previously (an n1-standard-8, CPU only), training ran for over 12 hours, processing 400–500 examples per second over roughly 18,000 steps. The loss started around 8–9 and dropped toward 5–6, where it stayed after about 5 hours, so you could probably run fewer steps, but consider tweaking the hyperparameters. The resulting model was stored in the sample_model_logistic folder. The time it took to run is a good case for using GPUs, which I will explore more in a later post.

Deep Bag of Frame (DBoF) Pooling Model

Use the following command to run the starter code for the DBoF model.

python3 train.py --frame_features --model=DbofModel \
--feature_names="rgb,audio" --feature_sizes="1024,128" \
--train_data_pattern=${HOME}/yt8m/2/frame/train/train*.tfrecord \
--train_dir="${HOME}/yt8m/models/frame/sample_model_dbof" \
--start_new_model

The resulting model is stored in the sample_model_dbof folder. Note that my machine did not have enough compute power to complete training on this model, so you will likely need a GPU for this one.

Train.py Flags & Defaults

Note that there are a number of flags that can be passed to the commands above to adjust the hyperparameters and training settings. The flags and their defaults can be found in the train.py file in the GitHub repo. Here are the flags and defaults in the file:

  • train_dir="/tmp/yt8m_model/"
  • train_data_pattern=""
  • feature_names="mean_rgb"
  • feature_sizes="1024"
  • frame_features=False
  • segment_labels=False
  • model="LogisticModel"
  • start_new_model=False (pass this flag to start a new model; otherwise training resumes from the checkpoint in train_dir)
  • num_gpu=1
  • batch_size=1024
  • regularization_penalty=1.0
  • base_learning_rate=0.01
  • learning_rate_decay=0.95
  • learning_rate_decay_examples=4000000
  • num_epochs=5
  • max_steps=None (maximum number of iterations of the training loop)
  • export_model_steps=1000
  • num_readers=8 (how many threads to use for reading input files)
  • optimizer="AdamOptimizer"
  • clip_gradient_norm=1.0
  • log_device_placement=False

Look in the train.py file for more details about each one and experiment with using them and changing the defaults.
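
For example, a hypothetical frame-level run that overrides a few of these defaults (the flag values here are illustrative, not recommendations) might look like this:

python3 train.py --frame_features --model=FrameLevelLogisticModel \
--feature_names="rgb,audio" --feature_sizes="1024,128" \
--train_data_pattern=${HOME}/yt8m/2/frame/train/train*.tfrecord \
--train_dir="${HOME}/yt8m/models/frame/sample_model_logistic" \
--base_learning_rate=0.001 --batch_size=512 --max_steps=10000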

Model Output Files

After training is done, the model files are stored in the folder that was created. There are several files in the folder, including a graph.pbtxt file describing the model graph and event files that TensorBoard can load to visualize the graph and track training progress.

The main files to focus on come in three types, each of which will have many versions in the folder. By default, TensorFlow’s checkpoint saving method is used, which shards the model’s trained weights into a collection of checkpoint-formatted binary files, along with an index file that records which weights are stored in which shard. The value of saving this way is that you can train the model across multiple machines and split the data between them to speed up training. You can also stop and restart training, and it will pick up where it left off (a sketch of restoring a checkpoint follows the list below).

The file types in the folder are:

  • meta file (.meta): stores the saved graph structure, which needs to be imported before restoring the checkpoint
  • index file (.index): an immutable string-to-string table. Each key is the name of a tensor and its value is a serialized BundleEntryProto, which describes the metadata of a tensor: which of the “data” files contains the content of the tensor, the offset into that file, a checksum, some auxiliary data, etc.
  • data file (.data-00000-of-00001): a TensorBundle collection that saves the values of all variables
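
As a minimal sketch (assuming the TF1-style checkpoints the starter code writes and an illustrative train_dir path), the latest checkpoint can be restored like this:

import tensorflow as tf

train_dir = "/path/to/yt8m/models/frame/sample_model_logistic"  # illustrative

# Find the newest checkpoint prefix, e.g. ".../model.ckpt-18000".
checkpoint = tf.train.latest_checkpoint(train_dir)

with tf.compat.v1.Session() as sess:
    # Rebuild the saved graph from the .meta file, then load the weights
    # from the matching .index/.data shards.
    saver = tf.compat.v1.train.import_meta_graph(checkpoint + ".meta")
    saver.restore(sess, checkpoint)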

Evaluate

Once you have a working model, validate and evaluate it to see whether it generalizes well to new examples.

Evaluate the model using the following command:

# Frame-level
python3 eval.py \
--eval_data_pattern=${HOME}/yt8m/3/frame/validate/validate*.tfrecord \
--train_dir ${HOME}/yt8m/models/frame/sample_model_logistic \
--segment_labels --run_once

# OR DBoF
python3 eval.py \
--eval_data_pattern=${HOME}/yt8m/3/frame/validate/validate*.tfrecord \
--train_dir ${HOME}/yt8m/models/frame/sample_model_dbof \
--segment_labels --run_once

Evaluating the frame-level model took 20 minutes and processed 235,256 examples in total. Note that eval.py has its own flags to adjust how the evaluation runs. Below are the resulting evaluation metrics and details.

  • Examples processed = 235,256
  • Avg_Hit (accuracy rate on first prediction) = 0.558
  • Avg_PERR (precision at equal recall rate) = 0.558
  • Avg_Loss (average loss) = 19.756

The following metrics are popular for measuring detection and retrieval accuracy (a sketch of the GAP computation follows the list):

  • MAP (mean Average Precision / average of area under precision-recall curve) = 0.752
  • GAP (global average precision based on top 20 predictions per example) = 0.778
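
As a simplified sketch of how GAP is typically computed for this task (the starter code has its own metric calculators; this NumPy version is only illustrative), the top 20 predictions per example are pooled, sorted by confidence, and the area under the resulting precision-recall curve is taken:

import numpy as np

def global_average_precision(predictions, labels, top_k=20):
    # predictions, labels: [num_examples, num_classes] arrays (labels are 0/1).
    scores, relevance = [], []
    for pred_row, label_row in zip(predictions, labels):
        top = np.argsort(pred_row)[::-1][:top_k]   # top-k classes for this example
        scores.extend(pred_row[top])
        relevance.extend(label_row[top])
    order = np.argsort(scores)[::-1]               # sort the pooled pairs by confidence
    relevance = np.asarray(relevance)[order]
    precision_at_i = np.cumsum(relevance) / (np.arange(len(relevance)) + 1)
    num_positives = max(labels.sum(), 1)           # total number of true labels
    return float(np.sum(precision_at_i * relevance) / num_positives)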

These results are adequate, and you can do better: tuning the network, getting more data, and trying different model structures are all ways to improve performance.

Inference

To use the model for predictions on new data it has never seen, use the following command.

# Frame-level
python3 inference.py \
--train_dir ${HOME}/yt8m/models/frame/sample_model_logistic \
--output_file=${HOME}/yt8m/models/frame/sample_model_logistic/ks.csv \
--input_data_pattern=${HOME}/yt8m/3/frame/test/test*.tfrecord \
--segment_labels --batch_size=64

# OR DBoF
python3 inference.py \
--train_dir ${HOME}/yt8m/models/frame/sample_model_dbof \
--output_file=${HOME}/yt8m/models/frame/sample_model_dbof/ks.csv \
--input_data_pattern=${HOME}/yt8m/3/frame/test/test*.tfrecord \
--segment_labels --batch_size=64

Inference with the frame-level model took 13 minutes and processed 2,062,258 examples.

After it is done, it writes the prediction file, ks.csv (I shortened the name so it fits on the line above, but name it whatever you want), and prints the location of the result file under your /tmp/ directory; the exact directory is listed after inference completes. You can convert the predicted label numbers with the vocabulary.csv file to see which categories were predicted (see the sketch below). You cannot verify the predictions by looking at the original video files, since the features were compressed. More information on the compression is provided in the previous blog post about the dataset and in the academic papers listed in the Resources section below.
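
Here is a minimal sketch of that lookup, assuming vocabulary.csv has the Index and Name columns described in the previous post and using illustrative label indices:

import csv

# Build a lookup from label index to human-readable category name.
with open("vocabulary.csv", newline="") as f:
    index_to_name = {int(row["Index"]): row["Name"] for row in csv.DictReader(f)}

# Translate a few predicted label indices (values are illustrative).
for label_index in [3, 47, 512]:
    print(label_index, "->", index_to_name.get(label_index, "unknown"))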

To evaluate your performance on the inference results, you can still submit to the Kaggle competition after it has ended and get a score for how your model performs. You can also run your own dataset through this model and see the results, but note that you will need to do a lot of work to get your dataset into the right format for the model.

If you want to compare predictions with known labels, you can run inference against the validation dataset and compare the predicted classes with the ground-truth labels. This is not ideal if, as noted above, you have already used the validation set to evaluate and tune the model; however, it is a way to actually see what the predictions look like.

Wrap up

This post covered how to develop a temporal concept localization model for the YouTube-8M dataset, stepping through the examples for training, evaluating, and running inference with the completed models.

There are many other models to explore and you can start with the winners of the Kaggle competition and look at others who shared solutions in the Discussion boards of each competition. For the latest Kaggle competition, this is the most recent solution.

Most solutions use some type of ensemble model, and these can be interesting and fun to experiment with. The best approach is to start with something that is successful yet simpler, then play around with making adjustments and expanding it. Note that the more complex your model gets, the more compute power you will probably need.

Additionally, there are other video datasets you can explore, like DeepMind’s Kinetics dataset, a well-established video dataset for human action classification and a good alternative in the video space. It contains over 650K video clips covering 700 classes, including actions like playing instruments or hugging, and each clip is a single action lasting about 10 seconds. The dataset is increasingly used the way ImageNet is for images, and it is a good option for pre-training video representation models.

And there you have it, go forth and explore computer vision models.

Resources
