Published in Analytics Vidhya

Introduction to Image Caption Generation using Avengers: Infinity War Characters

Source — www.hdwallpapers.in
Guess the caption?
A Man Wearing A Hat And A Tie

Understanding Image Caption Generation

['<start>', 'A', 'man', 'is', 'holding', 'a', 'stone', '<end>']
  1. Show and Tell: A Neural Image Caption Generator by the Google Research team
  2. Automatic Image Captioning using Deep Learning (CNN and LSTM) in PyTorch by Analytics Vidhya
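The tokenized caption above relies on a vocabulary wrapper that maps words to integer IDs and back, with special `<start>` and `<end>` markers framing each sentence. Below is a minimal sketch of what such a `Vocabulary` class might look like; the `idx2word` attribute mirrors how the tutorial's `build_vocab.Vocabulary` is used later, but the details here are illustrative rather than the exact implementation.

```python
# A minimal Vocabulary sketch: bidirectional word <-> index mapping
# with the special tokens used to frame captions.
class Vocabulary:
    def __init__(self):
        self.word2idx = {}
        self.idx2word = {}
        self.idx = 0

    def add_word(self, word):
        # Assign the next free index to any unseen word.
        if word not in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1

    def __call__(self, word):
        # Unknown words fall back to the <unk> token's index.
        return self.word2idx.get(word, self.word2idx.get('<unk>', 0))

    def __len__(self):
        return len(self.word2idx)


vocab = Vocabulary()
for token in ['<pad>', '<start>', '<end>', '<unk>']:
    vocab.add_word(token)
for token in ['a', 'man', 'is', 'holding', 'stone']:
    vocab.add_word(token)

caption = ['<start>', 'a', 'man', 'is', 'holding', 'a', 'stone', '<end>']
ids = [vocab(t) for t in caption]            # words -> integer IDs
words = [vocab.idx2word[i] for i in ids]     # IDs -> words (round trip)
```

The decoder only ever sees the integer IDs; converting its sampled IDs back through `idx2word` is what turns the network's output into the readable caption.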

Prerequisites

git clone https://github.com/pdollar/coco.git
cd coco/PythonAPI/
make
python setup.py build
python setup.py install
cd ../../
git clone https://github.com/yunjey/pytorch-tutorial.git
cd pytorch-tutorial/tutorials/03-advanced/image_captioning/
pip install -r requirements.txt

Pretrained model

$ python sample.py --image='png/example.png'
import torch
import matplotlib.pyplot as plt
import numpy as np
import argparse
import pickle
import os
from torchvision import transforms
from build_vocab import Vocabulary
from model import EncoderCNN, DecoderRNN
from PIL import Image

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Function to load and resize the image
def load_image(image_path, transform=None):
    image = Image.open(image_path)
    image = image.resize([224, 224], Image.LANCZOS)
    if transform is not None:
        image = transform(image).unsqueeze(0)
    return image

# MODEL DIRS
ENCODER_PATH = './models/encoder-5-3000.pkl'
DECODER_PATH = './models/decoder-5-3000.pkl'
VOCAB_PATH = 'data/vocab.pkl'

# CONSTANTS
EMBED_SIZE = 256
HIDDEN_SIZE = 512
NUM_LAYERS = 1

def PretrainedResNet(image_path, encoder_path=ENCODER_PATH,
                     decoder_path=DECODER_PATH,
                     vocab_path=VOCAB_PATH,
                     embed_size=EMBED_SIZE,
                     hidden_size=HIDDEN_SIZE,
                     num_layers=NUM_LAYERS):
    # Image preprocessing (ImageNet mean/std normalization)
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406),
                             (0.229, 0.224, 0.225))])

    # Load vocabulary wrapper
    with open(vocab_path, 'rb') as f:
        vocab = pickle.load(f)

    # Build models
    encoder = EncoderCNN(embed_size).eval()  # eval mode (batchnorm uses moving mean/variance)
    decoder = DecoderRNN(embed_size, hidden_size, len(vocab), num_layers)
    encoder = encoder.to(device)
    decoder = decoder.to(device)

    # Load the trained model parameters
    # (map_location lets the checkpoints load on CPU-only machines)
    encoder.load_state_dict(torch.load(encoder_path, map_location=device))
    decoder.load_state_dict(torch.load(decoder_path, map_location=device))

    # Prepare an image
    image = load_image(image_path, transform)
    image_tensor = image.to(device)

    # Generate a caption from the image
    feature = encoder(image_tensor)
    sampled_ids = decoder.sample(feature)
    sampled_ids = sampled_ids[0].cpu().numpy()  # (1, max_seq_length) -> (max_seq_length)

    # Convert word_ids to words
    sampled_caption = []
    for word_id in sampled_ids:
        word = vocab.idx2word[word_id]
        sampled_caption.append(word)
        if word == '<end>':
            break

    # Strip the '<start> ' prefix and '<end>' suffix, then title-case
    sentence = ' '.join(sampled_caption)[8:-5].title()

    # Return the generated caption along with the original image
    image = Image.open(image_path)
    return sentence, image

plt.figure(figsize=(12, 12))
predicted_label, image = PretrainedResNet(image_path='IMAGE_PATH')
plt.imshow(image)
print(predicted_label)
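The one line doing the real work above is `decoder.sample(feature)`: starting from the image feature, the LSTM greedily picks the highest-scoring word at each step and feeds it back in until it emits `<end>`. Here is an illustrative sketch of that greedy loop; `step()` is a hypothetical stand-in for one LSTM step plus the linear projection to vocabulary scores, and the toy transition table merely imitates a trained network.

```python
# Greedy decoding sketch: repeatedly take the argmax word and feed it back,
# stopping when the <end> token is produced or max_len is reached.
def greedy_decode(step, start_id, end_id, max_len=20):
    word_id, state = start_id, None
    sampled_ids = []
    for _ in range(max_len):
        logits, state = step(word_id, state)                    # scores over vocab
        word_id = max(range(len(logits)), key=lambda i: logits[i])  # argmax
        sampled_ids.append(word_id)
        if word_id == end_id:
            break
    return sampled_ids


# Toy, hypothetical transition table standing in for the trained LSTM:
# token 1 (<start>) scores token 4 highest, 4 scores 5, and 5 scores 2 (<end>).
transitions = {1: [0, 0, 0, 0, 9, 0],
               4: [0, 0, 0, 0, 0, 9],
               5: [0, 0, 9, 0, 0, 0]}

def toy_step(word_id, state):
    return transitions[word_id], state

print(greedy_decode(toy_step, start_id=1, end_id=2))  # → [4, 5, 2]
```

Greedy argmax decoding is the simplest strategy; the referenced Show and Tell paper also discusses beam search, which keeps several candidate sentences alive at each step instead of committing to one.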

We had Hulk. Now we have ML!

<HOLD A CAPTION IN YOUR MIND>
Avenge Us Fan poster — Reddit.com (Hint: The Soul World!)

End Notes

Source: Giphy GIFs

Mohammad Shahebaz

Kaggle Grandmaster 🏅| 👨🏻‍💻 Data Scientist | TensorFlow Dev 🔥