Introduction to Image Caption Generation using the Avengers: Infinity War Characters
Deep learning can be a daunting field for beginners. It was no different for me: most of the algorithms and terms sounded like they came from another world! I needed a way to understand the concepts from scratch in order to figure out how things actually work. And lo and behold, I found an interesting way to learn deep learning concepts.
The idea is pretty simple. To understand any deep learning concept, imagine this:
The mind of a newborn baby is capable of performing a trillion calculations. All you need is time (epochs) and nurture (algorithms) to make it understand a “thing” (problem case). I personally call this the babifying technique.
This intuition works because neural networks are inspired by the human brain in the first place. So, re-engineering the problem in these terms should work! Let me explain that with an example.
What if we trained our model on images of American culture, and later asked it to predict labels for traditional Indian folk dancers?
Apply the re-engineering idea to the question. It would be akin to imagining a kid who has been brought up in the USA, and has been to India for a vacation. Guess what label an American kid would predict for this image? Keep that in your mind before scrolling further.
This image shows a lot of traditional dress from Indian culture.
How would a kid born in America (or a model exposed only to an American dataset) caption it?
From my experiments, the model predicted the following caption:
A Man Wearing A Hat And A Tie
It might sound funny if you’re aware of Indian culture, but that’s the bias of algorithms: a model can only describe an image in terms of what it saw during training. Image caption generation works in a similar manner. An image captioning model has two main components.
Understanding Image Caption Generation
The first is an image-based model that extracts the features of the image, and the second is a language-based model that translates the features and objects given by our image-based model into a natural sentence.
In this article, we will be using a pretrained CNN that was trained on the ImageNet dataset. The images are resized to a standard resolution of 224 x 224 x 3 (height x width x RGB channels). This keeps the input size constant for the model, whatever image it is given.
The condensed feature vector is created by a convolutional neural network (CNN). In technical terms, this feature vector is called an embedding, and the CNN model is referred to as an encoder. In the next stage, we will use these embeddings from the CNN as input to the LSTM network, the decoder.
In a sentence language model, the LSTM predicts the next word in a sentence. Given the initial embedding of the image, the LSTM is trained to predict the most probable next word of the sequence. It’s just like showing a person a series of pictures and asking them to remember the details, then later showing them a new image with similar content and asking them to recall it. This “remember” and “recall” job is done by our LSTM network.
Technically, we also insert <start> and <end> tokens to signal the beginning and the end of the caption.
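To make the idea of a constant input a little more concrete, here is a minimal pure-Python sketch of the arithmetic involved. The helper names (normalize_pixel, input_size) are made up for illustration, and the per-channel mean and standard deviation are the ImageNet values that appear in the transform later in this article; the real pipeline does this on whole tensors with torchvision rather than pixel by pixel.

```python
# ImageNet per-channel statistics (the same values that are passed to
# transforms.Normalize later in this article).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    scaled = [channel / 255.0 for channel in rgb]
    return [(value - mean) / std
            for value, mean, std in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]


def input_size(height=224, width=224, channels=3):
    """Number of values the encoder receives for one resized image."""
    return height * width * channels


print(input_size())  # 150528 values, regardless of the original image size
print(normalize_pixel((255, 255, 255)))  # a pure white pixel after normalization
```

Whatever the size of the original photo, the encoder always sees the same number of values, which is what lets a single pretrained network handle any image.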
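The “predict the next word, then feed it back in” loop can be sketched without any deep learning library. In the toy example below, toy_decoder_step is a hypothetical stand-in for one LSTM step: instead of a learned hidden state it just keeps a position counter, and instead of a probability distribution it returns a canned word. Only the shape of the loop is meant to carry over to the real decoder, which also uses special <start> and <end> marker tokens.

```python
# Canned output that the toy "model" will emit, one word per step.
CAPTION = ['a', 'man', 'is', 'holding', 'a', 'stone', '<end>']


def toy_decoder_step(prev_word, state):
    """Stand-in for one LSTM step: returns (next_word, new_state).

    A real LSTM would combine prev_word with its hidden state (initialized
    from the image embedding) and return the most probable next word.
    Here the state is only a position counter into CAPTION.
    """
    return CAPTION[state], state + 1


def greedy_decode(step_fn, max_len=20):
    """Repeatedly ask for the next word until '<end>' or max_len steps."""
    words = []
    word, state = '<start>', 0
    for _ in range(max_len):
        word, state = step_fn(word, state)
        if word == '<end>':
            break
        words.append(word)
    return ' '.join(words)


print(greedy_decode(toy_decoder_step))  # a man is holding a stone
```

The max_len cap matters in practice: a trained model is not guaranteed to ever emit <end>, so the loop needs a hard stop.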
['<start>', 'A', 'man', 'is', 'holding', 'a', 'stone', '<end>']
This way, the model learns from various instances of images and finally predicts the captions for unseen images. To learn and dig deeper, I highly recommend reading the following references:
- Show and Tell: A Neural Image Caption Generator by the Google Research team
- Automatic Image Captioning using Deep Learning (CNN and LSTM) in PyTorch by Analytics Vidhya
To replicate the results of this article, you’ll need to install the prerequisites. Make sure you have Anaconda installed. If you want to train your model from scratch, follow the steps below; otherwise, skip ahead to the pretrained model part.
git clone https://github.com/pdollar/coco.git
cd coco/PythonAPI/
python setup.py build
python setup.py install
cd ../../
git clone https://github.com/yunjey/pytorch-tutorial.git
cd pytorch-tutorial/tutorials/03-advanced/image_captioning/
pip install -r requirements.txt
Now that you have the model ready, you can predict the captions using:
$ python sample.py --image='png/example.png'
The original repository exposes its code through a command-line interface, so you need to pass arguments to its Python scripts. To make it more intuitive, I have written a few handy functions to use the model from our Jupyter Notebook environment.
Let’s begin! Import all the libraries and make sure the notebook is in the root folder of the repository:
import pickle
import matplotlib.pyplot as plt
import numpy as np
import torch
from torchvision import transforms
from build_vocab import Vocabulary
from model import EncoderCNN, DecoderRNN
from PIL import Image
Add this configuration snippet and the load_image function to the notebook:
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Function to load and resize the image
def load_image(image_path, transform=None):
    image = Image.open(image_path)
    image = image.resize([224, 224], Image.LANCZOS)
    if transform is not None:
        # Apply the transform and add a batch dimension
        image = transform(image).unsqueeze(0)
    return image
Hard-code the constants to match the pretrained model. The pretrained model was trained with these parameters, so they should not be modified unless you are training your own model from scratch.
# MODEL DIRS
ENCODER_PATH = './models/encoder-5-3000.pkl'
DECODER_PATH = './models/decoder-5-3000.pkl'
VOCAB_PATH = 'data/vocab.pkl'

# CONSTANTS
EMBED_SIZE = 256
HIDDEN_SIZE = 512
NUM_LAYERS = 1
Now, code a PyTorch function that uses pretrained files to predict the output:
def PretrainedResNet(image_path, encoder_path=ENCODER_PATH,
                     decoder_path=DECODER_PATH,
                     vocab_path=VOCAB_PATH,
                     embed_size=EMBED_SIZE,
                     hidden_size=HIDDEN_SIZE,
                     num_layers=NUM_LAYERS):
    # Image preprocessing
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406),
                             (0.229, 0.224, 0.225))])

    # Load vocabulary wrapper
    with open(vocab_path, 'rb') as f:
        vocab = pickle.load(f)

    # Build models
    encoder = EncoderCNN(embed_size).eval()  # eval mode (batchnorm uses moving mean/variance)
    decoder = DecoderRNN(embed_size, hidden_size, len(vocab), num_layers)
    encoder = encoder.to(device)
    decoder = decoder.to(device)

    # Load the trained model parameters (map_location lets GPU-trained weights load on CPU)
    encoder.load_state_dict(torch.load(encoder_path, map_location=device))
    decoder.load_state_dict(torch.load(decoder_path, map_location=device))

    # Prepare an image
    image = load_image(image_path, transform)
    image_tensor = image.to(device)

    # Generate a caption from the image
    feature = encoder(image_tensor)
    sampled_ids = decoder.sample(feature)
    sampled_ids = sampled_ids[0].cpu().numpy()  # (1, max_seq_length) -> (max_seq_length)

    # Convert word_ids to words, stopping at the end token
    sampled_caption = []
    for word_id in sampled_ids:
        word = vocab.idx2word[word_id]
        sampled_caption.append(word)
        if word == '<end>':
            break
    # Strip the leading '<start> ' and trailing '<end>' markers
    sentence = ' '.join(sampled_caption)[8:-5].strip().title()

    # Reload the original image so it can be displayed with the caption
    image = Image.open(image_path)
    return sentence, image
To predict a caption, use:
predicted_label, image = PretrainedResNet(image_path='IMAGE_PATH')
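The function above removes the <start> and <end> markers by slicing the joined string with [8:-5], which only works when both markers are present in exactly those positions. As an alternative sketch (the helper name tokens_to_sentence is made up for this illustration and is not part of the original repository), the special tokens can be filtered out by name instead:

```python
def tokens_to_sentence(tokens):
    """Join decoded tokens into a title-cased caption.

    Stops at the first '<end>' token and skips the other special tokens,
    so it does not depend on fixed character offsets.
    """
    special = {'<start>', '<end>', '<pad>'}
    words = []
    for token in tokens:
        if token == '<end>':
            break
        if token not in special:
            words.append(token)
    return ' '.join(words).title()


tokens = ['<start>', 'a', 'man', 'wearing', 'a', 'hat', 'and', 'a', 'tie', '<end>']
print(tokens_to_sentence(tokens))  # A Man Wearing A Hat And A Tie
```

This version also behaves sensibly if the model hits its maximum length before producing <end>, a case where the fixed slice would cut off part of the last word.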
We had Hulk. Now we have ML!
Let us get started with producing captions for some scenes from Avengers: Infinity War, and see how well the model generalizes!
Test Image: Mark I
Have a look at the image shown below:
What do you think this image is about? Hold a caption in your mind without scrolling down.
Let’s see how our model captions this image.
Well, the prediction for this image is exactly to the point. Makes me curious if I can train a whole model again just on the Marvel Universe to predict the names. Personally, I would love to see Tony Stark being represented as Iron Man.
Test Image: Mark II
Perfect again! In fact, Tony is holding a mobile phone to call Steve Rogers.
Test Image: Mark III
Honestly, even I am pretty amazed at how much the model has learned. It captured both the foreground and the background information. Although it misclassified the Panther statue as a mountain, it’s still a pretty good prediction overall.
Test image: Mark IV
Oh boy! Rocket Raccoon is going to be really upset. He gets super annoyed when people around the galaxy refer to him as a rabbit or a talking panda. “Dog” is going to get on his nerves a bit!
Plus, the model is trained on cars, and hence spaceships are out of the question here. But I am quite happy that our model successfully predicted Rocket Raccoon sitting near a “window”.
Test image: Mark V
“Woods”, correct. “Man sitting”, correct. “A Rock”, unfortunate, but correct.
Our model does a remarkably good job of captioning these images, even with the odd misclassification. Taking this forward, I would like to train it further on the Marvel Universe to see if the model can recognize the names, context or perhaps even the humor.
Final Test: Avengers 4 Prediction
The model pretty much hints at the new soul world twist in the Avengers 4 plot. I will leave this one for you to interpret! Do let me know what you make of the last image in the comments below.
Artificial intelligence and machine learning are getting more awesome with every breakthrough. I hope you now have a basic intuition of how image captioning works, and had fun doing it the Avengers way.
PS: Ultron is gone for good. We assure you that we are NOT working on that AI singularity yet.
So, take a break and share your love through claps, and don’t forget to subscribe to the Analytics Vidhya publication for more awesome stuff.