Image Captioning Using Hugging Face Vision Encoder Decoder — Step by Step Guide (Part 2)

Kalpesh Mulye
8 min read · Jul 18, 2022

In the previous article, we briefly discussed encoder-decoders and our approach to the captioning task. We also fine-tuned a language model, which allowed the decoder to learn new words, generate brief captions, and save training time.

This can be thought of as priming our decoder before the actual training on the captioning task.

Before we get our hands dirty with the code, let us understand how the Vision Encoder Decoder module connects the two models (the image encoder and the text sequence generator) and how it deciphers what is present in the image.

To understand this, you need a basic understanding of how transformer attention works and of terms like KEY, QUERY & VALUE. You can learn about transformers in detail in this informative 3-part blog post by Ketan Doshi, or in this Seq2Seq course.

VISION ENCODER DECODER (VED)

2.1 Vision Encoder Decoder Architecture

When we initialize the Vision Encoder Decoder with our pretrained models (Vision Transformer & Roberta in the example above), it creates an image encoder and a language decoder instance and ties their embeddings together using a cross-attention layer.

The encoder embeddings are used as KEY & VALUE and the decoder embeddings are used as QUERY in the cross-attention head.
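To make these roles concrete, here is a minimal, single-head sketch of one cross-attention step in PyTorch. All dimensions and tensor names are illustrative assumptions, not the actual Hugging Face implementation:

import torch

d_model = 768                               # hidden size (illustrative)
enc_states = torch.randn(1, 197, d_model)   # ViT patch embeddings -> KEY & VALUE
dec_states = torch.randn(1, 12, d_model)    # decoder token embeddings -> QUERY

W_q = torch.nn.Linear(d_model, d_model)
W_k = torch.nn.Linear(d_model, d_model)
W_v = torch.nn.Linear(d_model, d_model)

Q = W_q(dec_states)                         # queries come from the decoder
K = W_k(enc_states)                         # keys come from the encoder
V = W_v(enc_states)                         # values come from the encoder

scores = Q @ K.transpose(-2, -1) / d_model ** 0.5   # (1, 12, 197) attention scores
weights = scores.softmax(dim=-1)                    # each decoder token attends over image patches
context = weights @ V                               # (1, 12, d_model) image-aware decoder states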

During training, both the image and the desired caption are passed as inputs to the model. This is an example of TEACHER FORCING.

In the above example, the image and the caption “<start> Dog is running in grass with ball in its mouth <end>” are the inputs to our model.

Using these inputs, the model is forced to generate the same caption; this teaches it the correlation between words in the captions and objects in the input images.

The final output from the linear layer (LM head) is a matrix of size Length of sequence X Vocabulary size (refer to 2.1). After SoftMax, its elements are the probabilities of each word occurring at a given position in the sequence (the green boxes below mark the highest probability).
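As a concrete (hypothetical) illustration of that output shape, the snippet below takes a logits matrix of size sequence length x vocabulary size, applies SoftMax, and picks the highest-probability token at every position; the sizes are made up:

import torch

seq_len, vocab_size = 6, 5000              # illustrative sizes
logits = torch.randn(seq_len, vocab_size)  # LM-head output for one caption

probs = logits.softmax(dim=-1)             # each row sums to 1: word probabilities per position
token_ids = probs.argmax(dim=-1)           # highest-probability token at every position
# token_ids can then be mapped back to words with tokenizer.decode(token_ids)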

2.2 Output Vector to Text Sequence

The generated sequence “<start> Dog is running <end>” is compared with the input caption, and a loss is calculated for every incorrect word in the generated sequence.

At every training step, the model predicts a sequence of words, which is compared with the actual caption, and the loss is backpropagated so the model can learn.
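In Hugging Face terms, a single teacher-forcing step looks roughly like the sketch below. The names model, pixel_values, labels and optimizer are placeholders for the objects defined later; in this guide the Seq2SeqTrainer handles this loop for us:

# pixel_values: preprocessed image batch, labels: token ids of the desired caption
outputs = model(pixel_values=pixel_values, labels=labels)

loss = outputs.loss   # cross-entropy between predicted tokens and the caption
loss.backward()       # propagate the loss back through decoder and encoder
optimizer.step()
optimizer.zero_grad()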

After training, the GIF below shows how we generate a text sequence.

2.3 Caption generation
  1. We start with the image and the start token “<start>” as input to the model.
  2. Using these inputs, the model again generates a Length of sequence X Vocabulary length matrix, but unlike in training (where all the tokens are used), only the second token (the one immediately succeeding the start token) is considered and all the other positions are masked. The GIF above shows that, after the start token, “Dog” was the first word, and all the other words were masked.
  3. Once we get the first word, we repeat the first step, except this time we use “<start> Dog” as our input and generate the next word (3rd token).
  4. This loop continues until the model outputs an end token or reaches a maximum output length (see the sketch after this list).
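A minimal sketch of this loop (greedy decoding, one token at a time) could look like the following. In practice model.generate() does this for us, with beam search and stopping criteria handled internally; start_token_id, end_token_id, max_output_length, pixel_values and model are placeholders here:

generated = [start_token_id]                    # 1. start with only the <start> token
for _ in range(max_output_length):              # 4. cap the output length
    decoder_input = torch.tensor([generated])
    logits = model(pixel_values=pixel_values,
                   decoder_input_ids=decoder_input).logits
    next_token = logits[0, -1].argmax().item()  # 2./3. keep only the newest position
    generated.append(next_token)
    if next_token == end_token_id:              # stop once <end> is produced
        break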

Translation & Captioning Analogy

Now that we know how the model generates a caption, let us briefly discuss how it interprets objects in the input image, using an analogy with machine translation.

In sentence translation (e.g. converting Spanish to English), words from the input sentence are used to select words and structure them into a meaningful translated sentence. Thus every translated word has some association with words in the input sentence. These associations (also known as attention weights) signify which input word or set of input words led to the selection of a specific translated word.

As shown below, the word “try” has a higher attention weight with “tratar”, signifying that “try” was selected predominantly because of “tratar”.

2.4 Translation Attention Matrix

“descubro” has attention weights with “find” (descubro means “I discover” in Spanish) and also with “to” and “out”, indicating which Spanish tokens the model focuses on while generating the corresponding English tokens.

Similarly, in the image captioning task, the words attend to regions, objects or actions in the image. Each word or sequence of words has association (attention) weights with different regions, signifying which region contributed the most towards the selection of that word.

As we can see, the word ‘dog’ is associated with the region of the image containing the dog. ‘Running’ and ‘grass’ have high attention with the 3rd region, where we can see the grass and the legs of the dog. ‘Running’ also has a small attention weight with the first region, where we can clearly see that the dog is running.

2.5 Captioning Attention Matrix

After sufficient training, the model is able to interpret nuances in the image and generate captions by understanding the various objects and their actions in the image.
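If you want to inspect these attention maps yourself after training, generation can return the cross-attention weights. The sketch below is illustrative; the exact nesting of the returned attention tuples depends on the library version and decoding strategy:

out = model.generate(pixel_values,
                     output_attentions=True,
                     return_dict_in_generate=True)

caption_ids = out.sequences[0]
# out.cross_attentions holds, for each generated token, the decoder-over-encoder
# attention weights (per layer and head) over the image patches
cross_attn = out.cross_attentions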

TASK 2

In the second task we will pre-process the data and initialize the VED with a pretrained Vision Transformer (from Hugging Face) as the encoder and our fine-tuned Roberta model (from Task 1) as the decoder.

-> Libraries

import datasets
import transformers
import pandas as pd
import numpy as np
import torch
from torch.utils.data.dataset import Dataset
from pathlib import Path



#Tokenizer from scratch on vocabulary of corpus
from tokenizers import ByteLevelBPETokenizer

# Decoder
from transformers import RobertaConfig
from transformers import RobertaForMaskedLM # RobertaLM for learning
from transformers import RobertaTokenizerFast # After training tokenizer we will wrap it so it can be used by Roberta model



#Encoder-Decoder Model
from transformers import VisionEncoderDecoderModel

#Training
# Seq2SeqTrainer / Seq2SeqTrainingArguments are used below
from transformers import Seq2SeqTrainer
from transformers import Seq2SeqTrainingArguments
# Plain Trainer / TrainingArguments (alternative, not used here)
from transformers import Trainer, TrainingArguments

import torch

# If there's a GPU available...
if torch.cuda.is_available():
    # Tell PyTorch to use the GPU.
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

-> Train Test Split & Pre-processing

import random

def train_test_split(dictionary):
    # hold out 30% of the images (and all their captions) for evaluation
    images = list(dictionary.keys())
    images_test = random.sample(images, int(0.3 * len(images)))
    images_train = [img for img in images if img not in images_test]

    train_dict = {img: dictionary[img] for img in images_train}
    test_dict = {img: dictionary[img] for img in images_test}
    return train_dict, test_dict

train, test = train_test_split(images_caption_dict)
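For reference, the split above assumes images_caption_dict maps each image path to its list of captions. A purely hypothetical example of that structure:

# hypothetical structure of images_caption_dict (image path -> list of 5 captions)
{
    'images/dog_001.jpg': [
        '<s> Dog is running in grass with ball in its mouth <e>',
        '<s> A dog plays fetch on a lawn <e>',
        # ... three more captions
    ],
    # more images ...
}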

The dataset used in this guide has 5 distinct captions associated with each image. Due to compute constraints, I created 5 distinct image-caption pairs to keep the input sequences small, but you can also use all 5 captions concatenated with a separator token by simply passing use_all = True to the function below.

import pandas as pd

def get_df(dictionary, use_all=False):
    df = pd.DataFrame([])

    captions = []
    images = []
    for image, caption_list in dictionary.items():
        if use_all:
            # concatenate all captions for the image with the separator token
            captions.append(tokenizer.sep_token.join(
                [' '.join(capt.replace('<s> ', '').replace(' <e>', '').strip().split(' '))
                 for capt in caption_list]))
            images.append(image)
        else:
            # one (image, caption) pair per caption, truncated to 30 words
            for capt in caption_list:
                captions.append(' '.join(capt.replace('<s> ', '').replace(' <e>', '').strip().split(' ')[:30]))
                images.append(image)

    df['images'] = images
    df['captions'] = captions
    return df

train_df = get_df(train)
test_df = get_df(test)

-> Initialize Encoder Feature Extractor & Decoder Tokenizer

The Feature Extractor of the Vision Transformer is analogous to a tokenizer: just as a tokenizer turns text into a sequence of tokens ready for input, the feature extractor slices the 3D image into a sequence of 3D cropped patches that serve as tokens.

from transformers import ViTFeatureExtractor
from transformers import RobertaTokenizerFast
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = RobertaTokenizerFast.from_pretrained('Byte_tokenizer')
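To see the analogy in action, you can run both on a sample and inspect their outputs. The shapes below are for the default 224x224 ViT configuration, and the image path is a placeholder:

from PIL import Image

sample = feature_extractor(Image.open('some_image.jpg').convert("RGB"), return_tensors="pt")
print(sample.pixel_values.shape)  # torch.Size([1, 3, 224, 224]); the ViT later splits this into 16x16 patches

encoded = tokenizer("Dog is running in grass", return_tensors="pt")
print(encoded.input_ids.shape)    # token ids for the caption, e.g. torch.Size([1, 7])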

-> Initialize Vision Encoder Decoder (VED)

We will initialize the VED model with the Vision Transformer as the encoder (from Hugging Face) and our pre-trained Roberta model as the decoder (as shown in 2.1 above).

REMEMBER to pass “tie_encoder_decoder = True”; this is crucial, so make sure it is included in your code.

This setting ties the two models together through the cross-attention layer we discussed in detail above.

# set encoder decoder tying to True
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", 'RobertaMLM', tie_encoder_decoder=True)
model.to(device)

Since the tokenizer is initialized separately, the parameter settings below are necessary for the model to correctly interpret the tokenizer's output.

# set special tokens used for creating the decoder_input_ids from the labels
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
# make sure vocab size is set correctly
model.config.vocab_size = model.config.decoder.vocab_size

# set beam search parameters
model.config.eos_token_id = tokenizer.sep_token_id
model.config.max_length = 20
model.config.early_stopping = True
model.config.no_repeat_ngram_size = 3
model.config.length_penalty = 2.0
model.config.num_beams = 4
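These values become the defaults used by model.generate() later, so generation can be called without repeating them. Equivalently, you could override them per call (pixel_values here is an illustrative placeholder):

# equivalent per-call override of the generation settings above (illustrative)
generated_ids = model.generate(pixel_values, max_length=20, num_beams=4,
                               no_repeat_ngram_size=3, length_penalty=2.0,
                               early_stopping=True)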

The VED model requires only 2 inputs:

  1. pixel_values : the sequence of 3D image patches produced by the feature extractor.
  2. labels : the token ids produced by the tokenizer from the desired caption.

-> Create Dataset for training & evaluation
from torch.utils.data import Dataset
from PIL import Image
batch_size = TRAIN_BATCH_SIZE  # change to 16 for full training

# longest caption (in words) across train and test, used as the decoder max length
max_length = max(np.max(train_df['captions'].apply(lambda x: len(x.split(' ')))),
                 np.max(test_df['captions'].apply(lambda x: len(x.split(' ')))))

class IAMDataset(Dataset):
    def __init__(self, df, tokenizer, feature_extractor, decoder_max_length=max_length):
        self.df = df
        self.tokenizer = tokenizer
        self.feature_extractor = feature_extractor
        self.decoder_max_length = decoder_max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get file name + text
        img_path = self.df['images'][idx]
        caption = self.df['captions'][idx]
        # prepare image (i.e. resize + normalize)
        image = Image.open(img_path).convert("RGB")
        pixel_values = self.feature_extractor(image, return_tensors="pt").pixel_values
        # add labels (input_ids) by encoding the text
        labels = self.tokenizer(caption, truncation=True,
                                padding="max_length",
                                max_length=self.decoder_max_length).input_ids
        # important: make sure that PAD tokens are ignored by the loss function
        labels = [label if label != self.tokenizer.pad_token_id else -100 for label in labels]

        encoding = {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}
        return encoding

train_dataset = IAMDataset(train_df,
                           tokenizer=tokenizer,
                           feature_extractor=feature_extractor)
eval_dataset = IAMDataset(test_df,
                          tokenizer=tokenizer,
                          feature_extractor=feature_extractor)
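A quick sanity check on the dataset; the image shape is for the default ViT input size, and the label length comes from the max_length computed above:

sample = train_dataset[0]
print(sample['pixel_values'].shape)  # torch.Size([3, 224, 224])
print(sample['labels'].shape)        # torch.Size([max_length]) with PAD positions set to -100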

-> ROUGE Metrics to evaluate performance

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the number of matching n-grams between the sequence generated by the model and the desired sequence. Learn more about ROUGE in this article.
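For intuition, here is a toy ROUGE-2 computation using the same datasets metric that is loaded below; the example strings are made up:

import datasets

rouge = datasets.load_metric("rouge")
toy = rouge.compute(predictions=["dog is running in the grass"],
                    references=["a dog is running in grass"],
                    rouge_types=["rouge2"])["rouge2"].mid
print(toy.precision, toy.recall, toy.fmeasure)  # fraction of matching bigrams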

# load rouge for validation
rouge = datasets.load_metric("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # all unnecessary tokens are removed
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str,
                                 rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
captioning_model = 'VIT_Captioning'

training_args = Seq2SeqTrainingArguments(
    output_dir=captioning_model,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    fp16=True,
    evaluation_strategy="epoch",
    do_train=True,
    do_eval=True,
    logging_steps=1024,
    save_steps=2048,
    warmup_steps=1024,
    #max_steps=1500,  # delete for full training
    num_train_epochs=TRAIN_EPOCHS,
    overwrite_output_dir=True,
    save_total_limit=1,
)

from transformers import default_data_collator

# instantiate trainer
trainer = Seq2SeqTrainer(
    tokenizer=feature_extractor,
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
)

# Fine-tune the model, training and evaluating on the train dataset
trainer.train()
trainer.save_model('Image_Cationing_VIT_Roberta_iter2')

Results

Let's see how our model performs:

t = VisionEncoderDecoderModel.from_pretrained('Image_Cationing_VIT_Roberta_iter2')
temp = test_df.sample(1).images.iloc[0]   # pick a random test image
Image.open(temp).convert("RGB")           # display it (in a notebook)
pixel_values = feature_extractor(Image.open(temp).convert("RGB"), return_tensors="pt").pixel_values
print(tokenizer.decode(t.generate(pixel_values)[0]))
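If you want the printed caption without the special start/end markers, pass skip_special_tokens=True to the decode call, the same flag used in compute_metrics above:

print(tokenizer.decode(t.generate(pixel_values)[0], skip_special_tokens=True))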

Thank you for reading.

