Video Captioning with Keras

Shreyz-max · Analytics Vidhya · Mar 15, 2021

Generate captions that describe the events of a video automatically

INTRODUCTION

The task of video captioning has become very popular in recent years. With platforms like YouTube and Twitch, and short-video formats like Instagram Reels, video has become a very important means of communication in our daily lives. According to Forbes, over 500 million people watch videos on Facebook every day, and 72 hours of video are uploaded to YouTube every minute. With video gaining such popularity, AI products for video have become a necessity.

A clip showing real-time prediction of video captioning

PRIOR KNOWLEDGE

Understanding this post requires familiarity with LSTMs/RNNs, the basics of the encoder-decoder architecture, and a working knowledge of Keras.

MOTIVATION

I was looking for a unique project to work on when I came across captioning. As I dove into the topic, I realized that while there are tons of good resources on image captioning, there are very few on video captioning. So, I decided to work on this and make video captioning easier for people to implement.

REAL-WORLD APPLICATIONS

We must first understand how important this problem is in real-world scenarios.

  • Better search algorithms: If each video can be described automatically, search algorithms will return finer, more accurate results.
  • Recommendation systems: We could easily cluster videos based on their similarity if their contents can be described automatically.

DATA COLLECTION

For the purpose of this study, I have used the MSVD data set by Microsoft. You can get the data set from here. This data set contains 1450 short YouTube clips that have been manually labeled for training and 100 videos for testing.

Each video has been assigned a unique ID and each ID has about 15–20 captions.

UNDERSTANDING THE DATA SET

On downloading the data set you will find the training_data and testing_data folders. Each folder contains a video subfolder holding the videos used for training and testing, as well as a feat subfolder (short for features) containing the precomputed features of each video. There are also training_label and testing_label json files, which contain the captions for each video ID. We can read the json file as follows:

import os
import json

train_path = 'training_data'
TRAIN_LABEL_PATH = os.path.join(train_path, 'training_label.json')
# mentioning the train/validation split
train_split = 0.85
# loading the json file for training
with open(TRAIN_LABEL_PATH) as data_file:
    y_data = json.load(data_file)

The json file looks as follows:

Each video has multiple captions which mean the same thing

Thus for each video id there are many alternative captions.

EXTRACTING FEATURES OF VIDEO

Video Captioning is a two-part project. In the first part, the features of the video are extracted.

What is a video? One could say a video is a sequence of images, right? So, for each video in the data set, every image, called a frame, is extracted from the video.

The code for this can be seen here.

Since the length of videos is different, the number of frames extracted is also going to be different. So for the sake of simplicity, only 80 frames are taken from each video. Each of the 80 frames is passed through a pre-trained VGG16 and 4096 features are extracted from each frame. These features are stacked to form an (80, 4096) shaped array. 80 is the number of frames and 4096 is the number of extracted features from each frame.

Here you can see the model VGG16 is loaded. Each of the 80 frames from each video is passed into the model to extract features and saved as numpy arrays. Now these features have already been extracted for us in the data set, so we can simply move on to the next step.
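
For illustration, here is a minimal sketch of how such features might be extracted with OpenCV and a pre-trained VGG16. The frame-sampling strategy, the file paths, and the use of the fc2 layer are assumptions on my part, not the exact code from the project.

# Hedged sketch: extract (80, 4096) VGG16 features from one video.
import cv2
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

# VGG16 up to the fc2 layer, which outputs 4096 features per frame
base = VGG16(weights='imagenet')
feature_model = Model(inputs=base.input, outputs=base.get_layer('fc2').output)

def extract_video_features(video_path, num_frames=80):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (224, 224))           # VGG16 input size
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(frame)
    cap.release()
    # keep exactly num_frames frames, sampled uniformly across the video
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    batch = preprocess_input(np.array([frames[i] for i in idx], dtype=np.float32))
    return feature_model.predict(batch)                 # shape (num_frames, 4096)

# features = extract_video_features('training_data/video/some_video.avi')
# np.save('training_data/feat/some_video.avi.npy', features)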

CLEANING AND PREPROCESSING CAPTIONS

Now we will load all the captions and pair them with their video IDs. Here is what I have done. The train_list contains pairs of a caption and its video ID. The only text preprocessing I have done is to add the <bos> and <eos> tokens before and after each caption respectively.

  • <bos> denotes the beginning of the sentence, so the model knows to start predicting from here, and
  • <eos> denotes the end of the sentence; this is where the model knows to stop predicting.

This is how the train_list looks.

Some of the training list items
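
For illustration, here is a minimal sketch of how such pairs could be built from the loaded JSON; the 'caption' and 'id' field names are my assumptions about the file's structure, based on the description above.

# Hedged sketch: pair every caption with its video ID and add <bos>/<eos>.
# The 'caption' and 'id' keys are assumed field names in training_label.json.
train_list = []
for entry in y_data:
    for caption in entry['caption']:
        caption = '<bos> ' + caption + ' <eos>'
        train_list.append([caption, entry['id']])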

The train_list is split into training and validation. The training_list contains 85% of the data and the rest is present in the validation_list.

The vocab_list contains only the captions from the training_list because we will only use the words in the training data for tokenization. After tokenizing, we pad the captions so that all the sentences are of the same length; in my project I have padded all of them to 10 words. You might also have noticed that I only use captions whose length is between 6 and 10 words. Why did I do this?

The longest caption in the whole data set has 39 words, but most captions have between 6 and 10 words. If we did not filter out some of the captions, we would have to pad them all to the maximum caption length, 39 in our case. If most sentences are around 10 words long and we pad them to almost four times their length, the training sequences end up being mostly padding tokens. A model trained on such heavily padded sentences tends to predict mostly padding, producing captions with few actual words and incomplete sentences.

I used only the top 1500 words as the vocabulary for the captions, so any generated caption is built from those 1500 words. Even though the number of unique words is far more than 1500, why do we only use 1500 words for training?

Most of the remaining words appear only once, twice, or three times, which makes the vocabulary noisy and prone to rare outliers. To keep things safe, we use only the 1500 most frequently occurring words.
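
Putting these choices together (a 6-10 word filter, a 1500-word vocabulary, and padding to 10 tokens), the preprocessing could look roughly like the sketch below. The exact tokenizer settings and the point at which the filter is applied are assumptions on my part; note that the default Keras filters would strip '<' and '>', so they must be adjusted for <bos> and <eos> to survive.

# Hedged sketch of the caption preprocessing described above.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_length = 10            # pad every caption to 10 tokens
num_decoder_tokens = 1500  # vocabulary size

# keep only captions with 6-10 tokens (counting <bos> and <eos>; an assumption
# about exactly how the 6-10 word filter was applied)
train_list = [(cap, vid) for cap, vid in train_list
              if 6 <= len(cap.split()) <= 10]

# 85/15 split into training and validation lists
split_at = int(len(train_list) * train_split)
training_list, validation_list = train_list[:split_at], train_list[split_at:]

# build the vocabulary from the training captions only
vocab_list = [cap for cap, _ in training_list]

# remove '<' and '>' from the default filters so <bos>/<eos> survive tokenization
tokenizer = Tokenizer(num_words=num_decoder_tokens,
                      filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(vocab_list)

# example: tokenize and pad one caption to a fixed length of 10
sample = pad_sequences(tokenizer.texts_to_sequences([vocab_list[0]]),
                       maxlen=max_length, padding='post')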

MODEL FOR TRAINING

For text-generation problems, the preferred model is usually an encoder-decoder architecture. Since our problem also requires generating text, we will use this sequence-to-sequence architecture as well. To understand more about this architecture, I would suggest checking out this article.

One thing to know about this architecture is that the final state of the encoder always acts as the initial state of the decoder. In our problem, the encoder is fed the video features and the decoder is fed the captions.

Now that we have established we will be using an encoder-decoder model let us look into how we shall use it.

What is a video again? We can call it a sequence of images, right? For anything sequential we generally prefer RNNs or LSTMs; in our case, we will use an LSTM. To understand LSTMs, refer to this link.

Now that we will use LSTM for the encoder let us look into the decoder. The decoder will generate captions. Captions are basically a sequence of words so we will use LSTMs in the decoder as well.

Training model

Here in the picture, the features of the first frame are fed into the first LSTM cell of the encoder, followed by the features of the second frame, and so on up to the 80th frame. For this problem, we are interested only in the final state of the encoder, so all the other encoder outputs are discarded. The final state of the encoder LSTM then acts as the initial state of the decoder LSTM. In the first decoder LSTM cell, <bos> acts as the input to start the sentence. After that, each word of the caption from the training data is fed in one by one until <eos>.

So, for the example above, if the actual caption is "woman is cooking something", the decoder starts with <bos> in the first decoder LSTM cell. In the next cell, the next word from the actual caption, "woman", is fed in, followed by "is", "cooking", and "something". This ends with the <eos> token.

The number of time steps for the encoder is the number of LSTM steps we unroll, which is 80, and the number of encoder tokens is the number of features per frame, 4096 in our case. The number of time steps for the decoder is 10, and the number of decoder tokens is the vocabulary size, 1500.

Let us look into the code for the model.

Code for training model
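
The full code is in the linked gist; as an illustration, here is a minimal sketch of how such an encoder-decoder could be wired up in Keras with the shapes discussed above. The latent dimension of 512 and the layer names are my assumptions, not necessarily the values used in the original project.

# Hedged sketch of the encoder-decoder training model (latent_dim is assumed).
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

time_steps_encoder, num_encoder_tokens = 80, 4096   # frames x VGG16 features
time_steps_decoder, num_decoder_tokens = 10, 1500   # caption length x vocab size
latent_dim = 512                                    # assumed LSTM hidden size

# Encoder: consumes the (80, 4096) video features; only its final states are kept
encoder_inputs = Input(shape=(time_steps_encoder, num_encoder_tokens), name='encoder_inputs')
_, state_h, state_c = LSTM(latent_dim, return_state=True, name='encoder_lstm')(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: consumes one-hot captions, initialised with the encoder's final states
decoder_inputs = Input(shape=(time_steps_decoder, num_decoder_tokens), name='decoder_inputs')
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, name='decoder_lstm')
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])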

Let us look at the architecture.

LOADING THE DATASET

Now that we know the model, loading data into it is also a very important part of the training.

The number of training data points is around 14k, which will definitely cause RAM issues if everything is loaded at once. To avoid this, I used a data generator with a batch size of 320. Since the model has two inputs, each batch is built as a list of two arrays: the encoder input, which contains the video features, and the decoder input, which contains the captions after tokenizing, padding, and converting to categorical (one-hot) vectors with 1500 labels, the vocabulary size we chose as the number of decoder tokens. I use the yield statement to return each batch; yield statements are what turn a function into a generator. I wrote a custom generator here because we have two inputs. I also load all the video features into a dictionary up front, so the same arrays do not have to be read from disk again and again.
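
A minimal sketch of such a generator, assuming the train_list pairs and tokenizer from the earlier sketches and a feature_dict mapping each video ID to its (80, 4096) array; shifting the target one step for teacher forcing is also an assumption on my part.

# Hedged sketch of a custom data generator for the two-input model.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def batch_generator(pairs, feature_dict, tokenizer, batch_size=320,
                    max_length=10, num_decoder_tokens=1500):
    """pairs: list of (caption, video_id); feature_dict: video_id -> (80, 4096) array."""
    while True:
        for start in range(0, len(pairs), batch_size):
            encoder_in, decoder_in, decoder_out = [], [], []
            for caption, video_id in pairs[start:start + batch_size]:
                seq = tokenizer.texts_to_sequences([caption])[0]
                seq = pad_sequences([seq], maxlen=max_length, padding='post')[0]
                one_hot = to_categorical(seq, num_classes=num_decoder_tokens)
                encoder_in.append(feature_dict[video_id])
                decoder_in.append(one_hot)
                # target = decoder input shifted left by one step (teacher forcing)
                decoder_out.append(np.vstack([one_hot[1:],
                                              np.zeros((1, num_decoder_tokens))]))
            yield ([np.array(encoder_in), np.array(decoder_in)], np.array(decoder_out))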

TRAINING

I trained the model for 150 epochs; one epoch took about 40 seconds to complete. I used the free version of Colab for training, on a Tesla T4.
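
A hedged sketch of how the training call could be wired up with the generator above; the generator, list, and dictionary names come from my earlier sketches, not necessarily the original code.

# Hedged sketch of the training loop wiring.
batch_size = 320
train_gen = batch_generator(training_list, feature_dict, tokenizer, batch_size)
valid_gen = batch_generator(validation_list, feature_dict, tokenizer, batch_size)

model.fit(train_gen,
          steps_per_epoch=len(training_list) // batch_size,
          validation_data=valid_gen,
          validation_steps=len(validation_list) // batch_size,
          epochs=150)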

MODEL FOR INFERENCE

Unlike most neural networks, the training and inference models are different for an encoder-decoder. We do not save the whole model as is after training; we save the encoder and decoder parts separately. Now let us look into the inference model.

First, we will use the encoder model. The features from all 80 frames are passed into it; this part of the model is the same as during training. Here again, we are interested only in the encoder's final state, so all the other outputs are discarded. The final state of the encoder is fed into the decoder as its initial state, along with the <bos> token, so that the decoder predicts the next word.

There are two ways to generate captions: greedy search and beam search. I have implemented both, but for faster results in real-time prediction I will use greedy search. To understand more about greedy and beam search, click here.

Inference model

Now, if the model is trained properly, as you can see above it should predict woman as the next token. Remember how, during training, the next input was always the next word of the caption. Since we have no captions here, the next input is the output of the previous LSTM cell. The output woman is then fed into the next cell along with the state of the previous cell, and the model goes on to predict the next word, is. This continues until the model predicts <eos>, at which point the sentence is complete and we stop predicting.
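
As a rough sketch, the inference models and the greedy decoding loop could look like this, reusing the layers from the training-model sketch above; the single-step decoder input and the stopping criteria are my assumptions.

# Hedged sketch of the inference setup and greedy decoding, reusing
# encoder_inputs, encoder_states, decoder_lstm, decoder_dense and tokenizer.
import numpy as np
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

# Encoder inference model: video features -> final LSTM states
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder inference model: one word + previous states -> next-word distribution + new states
state_input_h = Input(shape=(latent_dim,))
state_input_c = Input(shape=(latent_dim,))
decoder_single_input = Input(shape=(1, num_decoder_tokens))
dec_out, dec_h, dec_c = decoder_lstm(decoder_single_input,
                                     initial_state=[state_input_h, state_input_c])
dec_out = decoder_dense(dec_out)
decoder_model = Model([decoder_single_input, state_input_h, state_input_c],
                      [dec_out, dec_h, dec_c])

def greedy_decode(video_features, max_length=10):
    """video_features: (80, 4096) array for one video."""
    state_h, state_c = encoder_model.predict(video_features[np.newaxis, ...])
    # start decoding with the <bos> token
    target = np.zeros((1, 1, num_decoder_tokens))
    target[0, 0, tokenizer.word_index['<bos>']] = 1.0
    words = []
    for _ in range(max_length):
        probs, state_h, state_c = decoder_model.predict([target, state_h, state_c])
        idx = int(np.argmax(probs[0, -1, :]))
        word = tokenizer.index_word.get(idx, '')
        if word in ('<eos>', ''):
            break
        words.append(word)
        # feed the predicted word back in as the next decoder input
        target = np.zeros((1, 1, num_decoder_tokens))
        target[0, 0, idx] = 1.0
    return ' '.join(words)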

RESULTS

Now I know everyone was waiting for this, so let me show you some more results from the testing data. Mind you, these results use the greedy search algorithm.

a man is performing on a stage
a man is mixing ingredients in a bowl
a cat is playing the piano
a man is spreading a tortilla

Now, it would be wrong to show only the appropriate results. Here are some of the not-so-correct results.

a man is riding a bicycle

The model confuses a bike with a bicycle.

a dog is making a dance

Somehow the model confuses the cat with a dog, and instead of swinging its paws, the model describes it as making a dance. This caption does not make much grammatical sense.

CONCLUSION

Thank you for reading this far. Please refer to my Github for more details.

Some ways to make the training better include shuffling the data more thoroughly and adding videos from many different domains.

Important Point

We must understand that the training data should be semantically similar to the testing data. For example, if I train the model on videos of animals and test it on videos of completely different activities, it is bound to give bad results.

FUTURE WORK

  • Instead of using the provided features, extracting features on my own using models like I3D that are specifically designed for videos.
  • Adding a UI to make this more attractive and deploying it to some platform.
  • Adding embedding layers and attention blocks to train on longer videos.

REFERENCES

https://github.com/CryoliteZ/Video2Text

https://github.com/PacktPublishing/Intelligent-Projects-Using-Python/tree/master/Chapter05

Want more AI content? Follow me on LinkedIn for daily updates.

Thank you for reading. If you enjoyed this article, do give it some claps 👏 . Hope you have a great day ahead!
