Step-by-Step Deep Learning Tutorial to Build your own Video Classification Model

Pulkit Sharma
Published in Analytics Vidhya, 15 min read, Sep 3, 2019

I have written extensive articles and guides on how to build computer vision models using image data. Detecting objects in images, classifying those objects, generating labels from movie posters — there is so much we can do using computer vision and deep learning.

This time, I decided to turn my attention to the less-heralded aspect of computer vision — videos! We are consuming video content at an unprecedented pace. I feel this area of computer vision holds a lot of potential for data scientists.

I was curious about applying the same computer vision algorithms to video data. The approach I used for building image classification models — was it generalizable?

Videos can be tricky for machines to handle. Their dynamic nature, as opposed to an image’s static one, can make it complex for a data scientist to build those models.

But don’t worry, it’s not that different from working with image data. In this article, we will build our very own video classification model in Python. This is a very hands-on tutorial, so fire up your Jupyter notebooks. This is going to be a very fun ride.

If you’re new to the world of deep learning and computer vision, we have the perfect course for you to begin your journey:

Overview of Video Classification

When you really break it down — how would you define videos?

We can say that a video is a collection of images arranged in a specific order. These images are also referred to as frames.

That’s why a video classification problem is not that different from an image classification problem. For an image classification task, we take images, use feature extractors (like convolutional neural networks or CNNs) to extract features from images, and then classify that image based on these extracted features. Video classification involves just one extra step.

We first extract frames from the given video. We can then follow the same steps as we do for an image classification task. This is the simplest way to deal with video data.

There are actually multiple other ways to deal with videos and there is even a niche field of video analytics. I highly recommend going through the below article to understand how to deal with videos and extract frames in Python:

Also, we will be using CNNs to extract features from the frames of videos. If you need a quick refresher on what CNNs are and how they work, this is where you should begin:

Steps to build Video Classification model

Excited to build a model that is able to classify videos into their respective categories? We will be working on the UCF101 — Action Recognition Data Set which consists of 13,320 different video clips belonging to 101 distinct categories.

Let me summarize the steps that we will be following to build our video classification model:

  1. Explore the dataset and create the training and validation set. We will use the training set to train the model and validation set to evaluate the trained model
  2. Extract frames from all the videos in the training as well as the validation set
  3. Preprocess these frames and then train a model using the frames in the training set. Evaluate the model using the frames present in the validation set
  4. Once we are satisfied with the performance on the validation set, use the trained model to classify new videos

Let’s now start exploring the data!

Exploring the Video Classification dataset

You can download the dataset from the official UCF101 site. The dataset is in a .rar format so we first have to extract the videos from it. Create a new folder, let’s say ‘Videos’ (you can pick any other name as well), and then use the following command to extract all the downloaded videos:

unrar e UCF101.rar Videos/

The official documentation of UCF101 states that:

“It is very important to keep the videos belonging to the same group separate in training and testing. Since the videos in a group are obtained from a single long video, sharing videos from the same group in training and testing sets would give high performance.”

So, we will split the dataset into the train and test set as suggested in the official documentation. You can download the train/test split from here. Keep in mind that since we are dealing with a large dataset, you might require high computation power.
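The group constraint quoted above can be checked mechanically. Here is a minimal sketch, assuming the file names follow the v_<Class>_g<group>_c<clip>.avi pattern seen in the split files. The helper names group_key and shared_groups are hypothetical, and the sample names are illustrative rather than read from the real lists:

```python
import re

# Assumed naming pattern: v_<ClassName>_g<group>_c<clip>.avi
NAME_PATTERN = re.compile(r"v_(?P<cls>[^_]+)_g(?P<group>\d+)_c(?P<clip>\d+)\.avi")

def group_key(video_name):
    """Return (class, group) for a video name, e.g. ('ApplyEyeMakeup', '08')."""
    match = NAME_PATTERN.search(video_name)
    if match is None:
        raise ValueError(f"unexpected video name: {video_name!r}")
    return match.group("cls"), match.group("group")

def shared_groups(train_names, test_names):
    """Return the (class, group) pairs that occur in both splits."""
    train_groups = {group_key(name) for name in train_names}
    test_groups = {group_key(name) for name in test_names}
    return train_groups & test_groups

# Illustrative names only (not read from the real split files)
train_names = ["ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi",
               "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c02.avi"]
test_names = ["ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi"]

leaks = shared_groups(train_names, test_names)
print(leaks)  # an empty set means no group appears in both splits
```

If the set is not empty, clips cut from the same long video sit on both sides of the split and the reported score will look better than it should.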

We now have the videos in one folder and the train/test splitting file in another folder. Next, we will create the dataset. Open your Jupyter notebook and follow the below code block. We will first import the required libraries:

import cv2     # for capturing videos
import math # for mathematical operations
import matplotlib.pyplot as plt # for plotting the images
%matplotlib inline
import pandas as pd
from keras.preprocessing import image # for preprocessing the images
import numpy as np # for mathematical operations
from keras.utils import np_utils
from skimage.transform import resize # for resizing images
from sklearn.model_selection import train_test_split
from glob import glob
from tqdm import tqdm

We will now store the name of videos in a dataframe:

# open the .txt file which have names of training videos
f = open("trainlist01.txt", "r")
temp = f.read()
videos = temp.split('\n')

# creating a dataframe having video names
train = pd.DataFrame()
train['video_name'] = videos
train = train[:-1]
train.head()

This is how the names of videos are given in the .txt file. It is not properly aligned and we will need to preprocess it. Before that, let’s create a similar dataframe for test videos as well:

# open the .txt file which have names of test videos
f = open("testlist01.txt", "r")
temp = f.read()
videos = temp.split('\n')

# creating a dataframe having video names
test = pd.DataFrame()
test['video_name'] = videos
test = test[:-1]
test.head()

Next, we will add the tag of each video (for both training and test sets). Did you notice that the entire part before the ‘/’ in the video name represents the tag of the video? Hence, we will split the entire string on ‘/’ and select the tag for all the videos:

# creating tags for training videos
train_video_tag = []
for i in range(train.shape[0]):
    train_video_tag.append(train['video_name'][i].split('/')[0])

train['tag'] = train_video_tag

# creating tags for test videos
test_video_tag = []
for i in range(test.shape[0]):
    test_video_tag.append(test['video_name'][i].split('/')[0])

test['tag'] = test_video_tag
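The same parsing can be wrapped in one small helper. This is only a sketch: parse_entry is a hypothetical name, and the sample lines are illustrative entries in the <tag>/<file> <index> layout that the split calls in this article rely on:

```python
def parse_entry(line):
    """Split a list entry such as 'Tag/v_Tag_g08_c01.avi 1' into (tag, file_name)."""
    path = line.strip().split(' ')[0]   # drop the optional trailing class index
    tag, file_name = path.split('/')
    return tag, file_name

# Illustrative entries: one with a trailing class index, one without
print(parse_entry("ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1"))
print(parse_entry("ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi"))
```

With pandas, train['video_name'].str.split('/').str[0] builds the same tag column without an explicit loop.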

So what’s next? Now, we will extract the frames from the training videos which will be used to train the model. I will be storing all the frames in a folder named train_1.

So, first of all, create a new folder named ‘train_1’ and then follow the code given below to extract the frames:

# storing the frames from training videos
for i in tqdm(range(train.shape[0])):
    count = 0
    videoFile = train['video_name'][i]
    # capturing the video from the given path
    # ('UCF/' must be the folder that holds the extracted videos, e.g. 'Videos/')
    cap = cv2.VideoCapture('UCF/'+videoFile.split(' ')[0].split('/')[1])
    frameRate = cap.get(5)   # frame rate (cv2.CAP_PROP_FPS)
    while(cap.isOpened()):
        frameId = cap.get(1)   # current frame number (cv2.CAP_PROP_POS_FRAMES)
        ret, frame = cap.read()
        if (ret != True):
            break
        # keeping roughly one frame per second of video
        if (frameId % math.floor(frameRate) == 0):
            # storing the frames in a new folder named train_1
            filename = 'train_1/' + videoFile.split('/')[1].split(' ')[0] + "_frame%d.jpg" % count
            count += 1
            cv2.imwrite(filename, frame)
    cap.release()
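To see which frames the condition frameId % math.floor(frameRate) == 0 keeps, here is a small standalone sketch. The function name sampled_frame_ids is hypothetical, and the guard against a frame rate below 1 is an addition of this sketch, not part of the loop above:

```python
import math

def sampled_frame_ids(total_frames, frame_rate):
    """Frame numbers kept by the rule `frameId % floor(frameRate) == 0`."""
    step = max(1, math.floor(frame_rate))  # guard: a rate below 1 would give a step of 0
    return [frame_id for frame_id in range(total_frames) if frame_id % step == 0]

# A 4-second clip at 25 fps keeps one frame per second
print(sampled_frame_ids(100, 25.0))    # [0, 25, 50, 75]
# With 29.97 fps the step is floor(29.97) = 29
print(sampled_frame_ids(100, 29.97))   # [0, 29, 58, 87]
```

Short clips therefore contribute only a handful of frames each, which keeps the number of images manageable.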

This will take some time as there are more than 9,500 videos in the training set. Once the frames are extracted, we will save the name of these frames with their corresponding tag in a .csv file. Creating this file will help us to read the frames which we will see in the next section:

# getting the names of all the images
images = glob("train_1/*.jpg")
train_image = []
train_class = []
for i in tqdm(range(len(images))):
    # creating the image name
    train_image.append(images[i].split('/')[1])
    # creating the class of image
    train_class.append(images[i].split('/')[1].split('_')[1])

# storing the images and their class in a dataframe
train_data = pd.DataFrame()
train_data['image'] = train_image
train_data['class'] = train_class

# converting the dataframe into csv file
train_data.to_csv('UCF/train_new.csv',header=True, index=False)

So far, we have extracted frames from all the training videos and saved their names in a .csv file along with their corresponding tags. It’s now time to train our model which we will use to predict the tags for videos in the test set.

Training the Video Classification Model

It’s finally time to train our video classification model! I’m sure this is the most anticipated section of the tutorial. I have divided this step into sub-steps for ease of understanding:

  1. Read all the frames that we extracted earlier for the training images
  2. Create a validation set which will help us examine how well our model will perform on unseen data
  3. Define the architecture of our model
  4. Finally, train the model and save its weights

Reading all the video frames

So, let’s get started with the first step, where we will read the frames that we extracted. We will import the libraries first:

import keras
from keras.models import Sequential
from keras.applications.vgg16 import VGG16
from keras.layers import Dense, InputLayer, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, GlobalMaxPooling2D
from keras.preprocessing import image
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.model_selection import train_test_split

Remember, we created a .csv file that contains the names of each frame and their corresponding tag? Let’s read it as well:

train = pd.read_csv('UCF/train_new.csv')
train.head()

This is what the first five rows look like. We have the corresponding class or tag for each frame. Now, using this .csv file, we will read the frames that we extracted earlier and then store those frames as a NumPy array:

# creating an empty list
train_image = []

# for loop to read and store frames
for i in tqdm(range(train.shape[0])):
    # loading the image and resizing it to 224 x 224 (it is read as a 3-channel RGB image)
    img = image.load_img('train_1/'+train['image'][i], target_size=(224,224))
    # converting it to array
    img = image.img_to_array(img)
    # normalizing the pixel value
    img = img/255
    # appending the image to the train_image list
    train_image.append(img)

# converting the list to numpy array
X = np.array(train_image)

# shape of the array
X.shape

Output: (73844, 224, 224, 3)

We have 73,844 images each of size (224, 224, 3). Next, we will create the validation set.

Creating a validation set

To create the validation set, we need to make sure that the distribution of each class is similar in both training and validation sets. We can use the stratify parameter to do that:

# separating the target
y = train['class']

# creating the training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify = y)

Here, stratify = y (which is the class or tag of each frame) keeps a similar distribution of classes in both the training and the validation set.
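To make the idea of stratification concrete, here is a plain-Python sketch that splits each class separately. It is not how scikit-learn implements train_test_split, and the function name stratified_split is hypothetical:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with roughly `test_size` of every class in val."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                      # shuffle within the class only
        n_val = round(len(indices) * test_size)   # per-class share for validation
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels: 10 frames of class 'a' and 5 frames of class 'b'
labels = ['a'] * 10 + ['b'] * 5
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))      # 12 3
print([labels[i] for i in val_idx])      # ['a', 'a', 'b']
```

Both classes keep their 2:1 ratio in the validation set, which is the property the stratify parameter is used for.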

Remember — there are 101 categories in which a video can be classified. So, we will have to create 101 different columns in the target, one for each category. We will use the get_dummies() function for that:

# creating dummies of target variable for train and validation set
y_train = pd.get_dummies(y_train)
y_test = pd.get_dummies(y_test)
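For intuition, one-hot encoding can be written out by hand. The sketch below is a simplified stand-in for get_dummies, not its implementation; the function name one_hot is hypothetical, and ordering the columns by sorted class name is an assumption of the sketch:

```python
def one_hot(labels):
    """Return (classes, rows) where rows[i][j] is 1 if labels[i] == classes[j]."""
    classes = sorted(set(labels))   # one column per class, ordered by name
    position = {name: j for j, name in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[position[label]] = 1
        rows.append(row)
    return classes, rows

classes, rows = one_hot(['Archery', 'Biking', 'Archery'])
print(classes)  # ['Archery', 'Biking']
print(rows)     # [[1, 0], [0, 1], [1, 0]]
```

Note that encoding the training and validation targets separately only gives matching columns when every class appears in both sets, which the stratified split takes care of here.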

Next step — define the architecture of our video classification model.

Defining the architecture of the video classification model

Since we do not have a very large dataset, creating a model from scratch might not work well. So, we will use a pre-trained model and take its learnings to solve our problem.

For this particular dataset, we will be using the VGG-16 pre-trained model. Let’s create a base model of the pre-trained model:

# creating the base model of pre-trained VGG16 model
base_model = VGG16(weights='imagenet', include_top=False)

This model was trained on a dataset that has 1,000 classes. Setting include_top = False drops the fully connected classification layers at the top of the network and keeps only the convolutional base, so that we can add and train our own classifier for this problem.

Now, we will extract features from this pre-trained model for our training and validation images:

# extracting features for training frames
X_train = base_model.predict(X_train)
X_train.shape

Output: (59075, 7, 7, 512)

We have 59,075 images in the training set and the shape has been changed to (7, 7, 512) since we have passed these images through the VGG16 architecture. Similarly, we will extract features for validation frames:

# extracting features for validation frames
X_test = base_model.predict(X_test)
X_test.shape

Output: (14769, 7, 7, 512)

There are 14,769 images in the validation set and the shape of these images has also changed to (7, 7, 512). We will now train a fully connected network on top of these extracted features. This network takes one-dimensional input, so we will reshape each feature map into a single dimension:

# reshaping the training as well as validation frames in single dimension
X_train = X_train.reshape(59075, 7*7*512)
X_test = X_test.reshape(14769, 7*7*512)
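The numbers 7 and 25,088 follow from simple arithmetic: the VGG16 convolutional base halves the height and width five times (once per max-pooling layer) and ends with 512 channels. A quick check, with vgg16_feature_shape as a hypothetical helper name:

```python
def vgg16_feature_shape(height, width, pools=5, channels=512):
    """Spatial size after `pools` stride-2 poolings, plus the channel count."""
    for _ in range(pools):
        height //= 2
        width //= 2
    return height, width, channels

h, w, c = vgg16_feature_shape(224, 224)
print((h, w, c))     # (7, 7, 512)
print(h * w * c)     # 25088
```

A different input size would change the flattened length, so the reshape and the input shape of the dense network have to be adjusted together.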

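It is always advisable to scale the input values to a small range, ideally between 0 and 1, as this helps the model converge faster. Here we divide the extracted features by the maximum value in the training set:

# scaling the features by the maximum value in the training set
max_value = X_train.max()
X_train = X_train/max_value
X_test = X_test/max_value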
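The important detail in this step is that the scale comes from the training set only and is then reused everywhere else. A toy sketch with plain lists, where fit_scale and apply_scale are hypothetical names and the numbers are made up:

```python
def fit_scale(train_rows):
    """Largest value in the training features; computed once and reused."""
    return max(max(row) for row in train_rows)

def apply_scale(rows, scale):
    """Divide every value by the scale learned from the training set."""
    return [[value / scale for value in row] for row in rows]

train_rows = [[0.0, 2.0], [4.0, 1.0]]   # toy stand-ins for flattened features
val_rows = [[1.0, 8.0]]

scale = fit_scale(train_rows)
print(apply_scale(train_rows, scale))  # [[0.0, 0.5], [1.0, 0.25]]
print(apply_scale(val_rows, scale))    # [[0.25, 2.0]]
```

Validation values can end up above 1, which is fine. What matters is that the same scale is kept and applied to any features passed to the model later, including at prediction time.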

Next, we will create the architecture of the model. We have to define the input shape for that. So, let’s check the shape of our images:

# shape of images
X_train.shape

Output: (59075, 25088)

The input shape will be 25,088. Let’s now create the architecture:

#defining the model architecture
model = Sequential()
model.add(Dense(1024, activation='relu', input_shape=(25088,)))
model.add(Dropout(0.5))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(101, activation='softmax'))

We have multiple fully connected dense layers. I have added dropout layers as well so that the model will not overfit. The number of neurons in the final layer is equal to the number of classes that we have and hence the number of neurons here is 101.
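It is worth knowing how large this network is. Each dense layer has inputs × outputs weights plus one bias per output, and dropout layers add no parameters. A quick count, with dense_params as a hypothetical helper name:

```python
def dense_params(layer_sizes):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

sizes = [25088, 1024, 512, 256, 128, 101]
print(dense_params(sizes))        # 26393189
print(dense_params(sizes[:2]))    # 25691136, the first layer alone
```

Almost all of the roughly 26.4 million parameters sit in the first layer, because of the 25,088-dimensional input.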

Training the video classification model

We will now train our model using the training frames and validate the model using validation frames. We will save the weights of the model so that we will not have to retrain the model again and again.

So, let’s define a callback that saves the weights of the best model:

# defining a callback to save the weights of the best model
from keras.callbacks import ModelCheckpoint
mcp_save = ModelCheckpoint('weights.hdf5', save_best_only=True, monitor='val_loss', mode='min')

We will decide the optimum model based on the validation loss. Note that the weights will be saved as weights.hdf5. You can rename the file if you wish. Before training the model, we have to compile it:

# compiling the model
model.compile(loss='categorical_crossentropy',optimizer='Adam',metrics=['accuracy'])

We are using the categorical_crossentropy as the loss function and the optimizer is Adam. Let’s train the model:

# training the model
model.fit(X_train, y_train, epochs=200, validation_data=(X_test, y_test), callbacks=[mcp_save], batch_size=128)

I have trained the model for 200 epochs. To download the weights which I got after training the model, you can use this link.
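The effect of save_best_only=True with monitor='val_loss' and mode='min' can be pictured with a few lines of plain Python. This is a sketch of the idea, not the Keras implementation; best_epoch is a hypothetical name and the losses are made up for illustration:

```python
def best_epoch(val_losses):
    """Return (best_index, best_loss, saves), where `saves` lists the epochs
    at which the validation loss improved and the weights would be written."""
    best_loss = float('inf')
    best_index = None
    saves = []
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:          # mode='min': lower is better
            best_loss, best_index = loss, epoch
            saves.append(epoch)       # the weights file would be overwritten here
    return best_index, best_loss, saves

# Made-up validation losses for illustration only
print(best_epoch([2.9, 2.1, 2.4, 1.8, 1.9]))  # (3, 1.8, [0, 1, 3])
```

So even if later epochs overfit, the file on disk holds the weights from the epoch with the lowest validation loss.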

We now have the weights which we will use to make predictions for the new videos. So, in the next section, we will see how well this model will perform on the task of video classification!

Evaluating our Video Classification Model

Let’s open a new Jupyter notebook to evaluate the model. The evaluation part can also be split into multiple steps to understand the process more clearly:

  1. Define the model architecture and load the weights
  2. Create the test data
  3. Make predictions for the test videos
  4. Finally, evaluate the model

Defining model architecture and loading weights

You’ll be familiar with the first step — importing the required libraries:

from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.preprocessing import image
import numpy as np
import pandas as pd
from tqdm import tqdm
from keras.applications.vgg16 import VGG16
import cv2
import math
import os
from glob import glob
from scipy import stats as s

Next, we will define the model architecture which will be similar to what we had while training the model:

base_model = VGG16(weights='imagenet', include_top=False)

This is the pre-trained model. Next, we will define the classifier that sits on top of it:

#defining the model architecture
model = Sequential()
model.add(Dense(1024, activation='relu', input_shape=(25088,)))
model.add(Dropout(0.5))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(101, activation='softmax'))

Now that we have defined the architecture, we will load the trained weights which we stored as weights.hdf5:

# loading the trained weights
model.load_weights("weights.hdf5")

Compile the model as well:

# compiling the model
model.compile(loss='categorical_crossentropy',optimizer='Adam',metrics=['accuracy'])

Make sure that the loss function, optimizer, and the metrics are the same as we used while training the model.

Creating the test data

You should have downloaded the train/test split files as per the official documentation of the UCF101 dataset. If not, download it from here. In the downloaded folder, there is a file named “testlist01.txt” which contains the list of test videos. We will make use of that to create the test data:

# getting the test list
f = open("testlist01.txt", "r")
temp = f.read()
videos = temp.split('\n')
# creating the dataframe
test = pd.DataFrame()
test['video_name'] = videos
test = test[:-1]
test_videos = test['video_name']
test.head()

We now have the list of all the videos stored in a dataframe. To map the predicted categories with the actual categories, we will use the train_new.csv file:

# creating the tags
train = pd.read_csv('UCF/train_new.csv')
y = train['class']
y = pd.get_dummies(y)

Now, we will make predictions for the videos in the test set.

Generating predictions for test videos

Let me summarize what we will be doing in this step before looking at the code. The below steps will help you understand the prediction part:

  1. First, we will create two empty lists — one to store the predictions and the other to store the actual tags
  2. Then, we will take each video from the test set, extract frames for this video and store it in a folder (create a folder named temp in the current directory to store the frames). We will remove all the other files from this folder at each iteration
  3. Next, we will read all the frames from the temp folder, extract features for these frames using the pre-trained model, predict tags, and then take the mode to assign a tag for that particular video and append it in the list
  4. We will append actual tags for each video in the second list

Let’s code these steps and generate predictions:

# creating two lists to store predicted and actual tags
predict = []
actual = []

# for loop to extract frames from each test video
for i in tqdm(range(test_videos.shape[0])):
    count = 0
    videoFile = test_videos[i]
    # capturing the video from the given path
    cap = cv2.VideoCapture('UCF/'+videoFile.split(' ')[0].split('/')[1])
    frameRate = cap.get(5)   # frame rate (cv2.CAP_PROP_FPS)
    # removing all other files from the temp folder
    files = glob('temp/*')
    for f in files:
        os.remove(f)
    while(cap.isOpened()):
        frameId = cap.get(1)   # current frame number (cv2.CAP_PROP_POS_FRAMES)
        ret, frame = cap.read()
        if (ret != True):
            break
        if (frameId % math.floor(frameRate) == 0):
            # storing the frames of this particular video in temp folder
            filename = 'temp/' + "_frame%d.jpg" % count
            count += 1
            cv2.imwrite(filename, frame)
    cap.release()

    # reading all the frames from temp folder
    images = glob("temp/*.jpg")

    prediction_images = []
    for j in range(len(images)):   # j, so that the outer loop index i is not reused
        img = image.load_img(images[j], target_size=(224,224))
        img = image.img_to_array(img)
        img = img/255
        prediction_images.append(img)

    # converting all the frames for a test video into numpy array
    prediction_images = np.array(prediction_images)
    # extracting features using pre-trained model
    prediction_images = base_model.predict(prediction_images)
    # converting features in one dimensional array
    prediction_images = prediction_images.reshape(prediction_images.shape[0], 7*7*512)
    # note: the training features were divided by X_train.max();
    # apply the same scaling here if you kept that value
    # predicting the class index for each frame
    # (np.argmax over predict() also works on Keras versions without predict_classes)
    prediction = np.argmax(model.predict(prediction_images), axis=1)
    # appending the mode (most frequent class index) to assign the tag to the video
    predict.append(y.columns.values[np.bincount(prediction).argmax()])
    # appending the actual tag of the video
    actual.append(videoFile.split('/')[1].split('_')[1])

This step will take some time as there are around 3,800 videos in the test set. Once we have the predictions, we will calculate the performance of the model.
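The voting step is the heart of this approach: every frame gets its own prediction and the video takes the most frequent one. A standalone sketch of that vote, where video_label is a hypothetical name and breaking ties in favour of the smallest class index is a choice made for this sketch:

```python
from collections import Counter

def video_label(frame_predictions):
    """Most frequent frame-level prediction; ties go to the smallest value."""
    counts = Counter(frame_predictions)
    return min(counts, key=lambda label: (-counts[label], label))

print(video_label([4, 4, 17, 4, 9]))   # 4
print(video_label([3, 8, 8, 3]))       # 3 (tie broken by the smaller index)
```

A few misclassified frames do not change the result as long as most frames of the video agree.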

Evaluating the model

Time to evaluate our model and see what all the fuss was about.

We have the actual tags as well as the tags predicted by our model. We will make use of these to get the accuracy score. On the official documentation page of UCF101, the reported accuracy is 43.90%. Can our model beat that? Let’s check!

# checking the accuracy of the predicted tags
from sklearn.metrics import accuracy_score
accuracy_score(actual, predict)*100

Output: 44.80570975416337

Great! Our model’s accuracy of 44.8% is comparable to what the official documentation states (43.9%).
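The accuracy score here is simply the share of test videos whose predicted tag matches the actual tag. Written out by hand, with accuracy as a hypothetical helper and illustrative tags rather than real predictions:

```python
def accuracy(actual, predicted):
    """Percentage of videos whose predicted tag equals the actual tag."""
    if len(actual) != len(predicted):
        raise ValueError("both lists must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)

actual_tags = ['Archery', 'Biking', 'Diving', 'Biking']      # illustrative tags
predicted_tags = ['Archery', 'Diving', 'Diving', 'Biking']
print(accuracy(actual_tags, predicted_tags))   # 75.0
```

Since this is measured per video rather than per frame, it reflects the result after the voting step.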

You might be wondering why we are satisfied with an accuracy below 50%. Well, the main reason behind this low accuracy is the lack of data. We only have around 13,000 videos and even those are of a very short duration.

End Notes

In this article, we covered one of the most interesting applications of computer vision — video classification. We first understood how to deal with videos, then we extracted frames, trained a video classification model, and finally got a comparable accuracy of 44.8% on the test videos.

We can now try different approaches and aim to improve the performance of the model. One approach I can think of is to use 3D convolutions, which can deal with videos directly.

Since videos are a sequence of frames, we can treat this as a sequence problem as well. So, there are several other possible solutions and I suggest you explore them. Feel free to share your findings with the community.

As always, if you have any suggestions or doubts related to this article, post them in the comments section below and I will be happy to answer them. And as I mentioned earlier, do check out the computer vision course if you’re new to this field.

Originally published at https://www.analyticsvidhya.com on September 3, 2019.
