Image Captioning Using Hugging Face Vision Encoder Decoder — A Step-by-Step Guide (Part 1)

Kalpesh Mulye
9 min read · Jul 7, 2022


In this tutorial we will learn to create our very own image captioning model using the Hugging Face library.

Along with the code walkthrough, we will briefly discuss key concepts, with a few examples, to get an intuitive idea of how the transformer architecture works. This guide is divided into 2 main tasks:

  1. Train & fine-tune a language model on the captions (any transformer: BERT, RoBERTa, etc.) — Part 1
  2. Initialize & train the captioning model using the Vision Encoder Decoder module of the Hugging Face library — Part 2

The Flickr8k dataset was used to train both models.

You can download the images from here. All the training captions and image file names are stored in data.json.

For the complete code of this tutorial, use my GitHub link.

What is Image Captioning, in a Nutshell?

Image captioning is an end-to-end sequence-to-sequence task where the image pixels are the input sequence and a caption describing the image is the desired output.

Because images and text sequences are very different in nature, two different models tied together (one dedicated to ENCODING the image and the other to DECODING it into a text sequence) are required to solve this task.

This idea is analogous to the traditional encoders and decoders used for processing and propagating electronic signals.

The encoder-decoder architecture underpins state-of-the-art sequence-to-sequence tasks like language translation and text summarization, which harness the power of state-of-the-art transformers like BERT, RoBERTa, GPT-2, etc.

In these tasks, the encoder is essentially a transformer model which converts the information in the text into a mathematical tensor. It employs self-attention heads, which use Key, Query and Value vectors to encode the text and draw the context of the entire input into an embedding vector. This solves problems of traditional NLP techniques, such as a short window of attention and the inability to compute in parallel. The tensor from the encoder is connected to the decoder through a cross-attention layer. This layer helps the decoder attend to the encoder output, extract the required information, and deliver the desired output for the task at hand.
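
To make the Key, Query and Value idea concrete, here is a minimal sketch (my own illustration, not from the original post) of scaled dot-product self-attention in PyTorch; the tensor names and shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) projections of the token embeddings
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled to stabilise gradients
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    # Softmax over the key dimension gives the attention weights
    weights = F.softmax(scores, dim=-1)
    # Each output is a weighted sum of the value vectors
    return weights @ v

# Illustrative shapes: a batch of 2 sentences, 10 tokens, 64-dim heads
q = k = v = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 10, 64])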

Image captioning is analogous to other seq2seq tasks; the only difference is that instead of translating from one language to another we are translating from an image to a language, which mathematically is a very similar task since both inputs are ultimately just numeric tensors.

You can learn about the transformer architecture in detail in this informative 3-part blog post by Ketan Doshi and also in this Seq2Seq guide course.

Before ATTENTION from Transformers

Before the advent of transformers, the encoders chosen for the captioning task were CNN architectures pretrained on image classification or segmentation. The last hidden state from the pretrained CNN serves as the initial cell state for a recurrent decoder such as an LSTM or GRU, which generates the desired caption.

Image Captioning Using CNN and RNN networks

After ATTENTION from Transformers

Due to advances in transformers in both computer vision and NLP, transformers have replaced both the encoder and the decoder architectures.

The encoder for the image is a Vision Transformer (explained below), and the LSTMs or GRUs are replaced by a transformer decoder, which can generate better captions thanks to its ability to understand context.

Note: The newer EfficientNetV2 outperforms ViT on some benchmarks, so there is still a tussle for the best architecture in computer vision.

The output embeddings from the ViT encoder are connected to the decoder transformer, which can be any transformer architecture like RoBERTa, BERT or GPT-2, through a cross-attention layer to generate the text describing the image.

Image captioning using transformers (Reference: 1)
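
As a small preview of Part 2, this is roughly how the Hugging Face VisionEncoderDecoderModel ties a ViT encoder to a text decoder through cross-attention; the checkpoint names below are illustrative assumptions, not the exact ones used later in this guide.

from transformers import VisionEncoderDecoderModel

# Tie a pretrained ViT encoder to a pretrained RoBERTa decoder.
# The cross-attention layers in the decoder are added and randomly initialised,
# which is why the combined model still needs to be trained on caption data.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # illustrative encoder checkpoint
    "roberta-base",                       # illustrative decoder checkpoint
)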

What is a Vision Transformer?

A recent paper on Vision Transformers, “An Image is Worth 16x16 Words,” introduced the idea that every image can be interpreted as a sequence of image patches, analogous to tokens or words in a sentence. These image-patch tokens can be used as input to a transformer for any CV task. This addresses a limitation of CNNs, namely their restricted receptive field, since the basic idea of a CNN is to tie together only neighboring pixels using a kernel that convolves over the image.

Vision Transformer (Reference: 1)
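
Here is a minimal sketch (my own illustration, not from the paper's code) of how a 224x224 image can be split into 16x16 patches and flattened into a sequence of "visual tokens":

import torch

image = torch.randn(3, 224, 224)   # (channels, height, width)
patch_size = 16

# Split into non-overlapping 16x16 patches: 14 x 14 = 196 patches
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
# Flatten every patch into a vector -> a "sentence" of 196 visual tokens
tokens = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)
print(tokens.shape)  # torch.Size([196, 768])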

Now that we have a brief idea of the models used in the upcoming guide, let us begin creating our very own image captioning model.

TASK 1

We will fine-tune a language model on our corpus of captions, which will allow the decoder to learn new words (if any) and generate brief captions. This fine-tunes the self-attention weights of the transformer to build a better context of the sentences.

This will also save us training time and complexity on the actual captioning task, because while training the captioning model the optimizer can focus on fine-tuning the cross-attention layers, as the self-attention layers will already be primed to represent the sentences (captions) well.

-> Model Parameters

Defining hyperparameters like learning rate, batch size, and the maximum length of the text sequence to be generated by the decoder.

TRAIN_BATCH_SIZE = 20   # input batch size for training
VALID_BATCH_SIZE = 5    # input batch size for evaluation
TRAIN_EPOCHS = 2        # number of epochs to train
VAL_EPOCHS = 1
LEARNING_RATE = 1e-4    # learning rate
WEIGHT_DECAY = 0.01
SEED = 42               # random seed
MAX_LEN = 128           # max length of the tokenized input captions
SUMMARY_LEN = 20        # maximum length of caption generated
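
SEED is defined above but not applied anywhere in the original snippets; as a small optional addition (my own sketch), you can seed the random number generators for reproducible runs:

import random
import numpy as np
import torch

def set_seed(seed):
    # Seed Python, NumPy and PyTorch (CPU + GPU) RNGs for reproducibility
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(SEED)
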
import torch

# If there's a GPU available...
if torch.cuda.is_available():
    # Tell PyTorch to use the GPU.
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

Loading the JSON file with the image file names and the captions associated with each image.

Preprocessing the captions by stripping the start and end tokens already present in data.json (this is not a strictly necessary step, since most tokenizers append start and end tokens by default when you tokenize a sentence).

import os
import json
import pandas as pd

# LOCATION OF JSON AND IMAGES
os.chdir(r'C:\Users\kalpe\Great lakes\ML project')

# CONVERTING TO DICTIONARY
with open('data.json', 'r') as openfile:
    json_object = json.load(openfile)

images_caption_dict = dict(json_object)

images_path = 'C:/Users/kalpe/Documents/Great lakes/ML project/Flickr8k_Dataset/Flicker8k_Dataset/'
images = list(images_caption_dict.keys())

# PREPENDING IMAGE PATHS TO THE IMAGE FILE NAMES
for image_path in images:
    if image_path.endswith('jpg'):
        new = images_path + image_path.split('/')[-1]
        images_caption_dict[new] = images_caption_dict.pop(image_path)
    else:
        images_caption_dict.pop(image_path)

# PREPROCESSING CAPTIONS BEFORE TOKENIZING
df = pd.DataFrame([])
captions = []
images = []
for image in list(images_caption_dict.keys()):
    caption = images_caption_dict[image]
    for capt in caption:
        captions.append(capt.replace('<s> ', '').replace(' <e>', '').strip())
        images.append(image)

df['images'] = images
df['captions'] = captions

-> Train a Custom Tokenizer

It is always better to train a tokenizer on your own corpus of text so that it doesn't miss out on words important for your specific task. If a word in our corpus is not present in the tokenizer's vocabulary, it may be assigned the unknown “<UNK>” token, which is a loss of information.

The captions need to be written out as text files so that the tokenizer can learn from our corpus of captions.

# Store values in a dataframe column (Series object) to files, one file per record
os.mkdir("./text_split")

def column_to_files(column, prefix, txt_files_dir="./text_split"):
    # The prefix is a unique ID to avoid overwriting a text file
    i = prefix
    # For every value in the df, with just one column
    for row in column.to_list():
        # Create the filename using the prefix ID
        file_name = os.path.join(txt_files_dir, str(i) + '.txt')
        try:
            # Create the file and write the column text to it
            f = open(file_name, 'wb')
            f.write(row.encode('utf-8'))
            f.close()
        except Exception as e:  # catch exceptions (e.g. empty rows)
            print(row, e)
        i += 1
    # Return the last ID
    return i

data = df["captions"]
# Removing the end of line character \n
data = data.replace("\n", " ")
# Set the ID to 0
prefix = 0
# Create a file for every description value
prefix = column_to_files(data, prefix)

In this example we will train a Byte-Level BPE tokenizer, which starts constructing tokens at the byte/character level rather than the word level.

Example: “Higher the throw higher the scoring ability”

For the above sentence, the tokenizer first starts from the individual characters and, after repeatedly merging frequent chunks together, the final tokens might look something like (the exact notation depends on the tokenizer):

[ “<s>”, “High”, “er</w>”, “the</w>”, “throw</w>”, “high”, “er</w>”, “the</w>”, “scor”, “ing</w>”, “ability</w>”, “<e>” ]

To learn more about different types of tokenizers, refer to this awesome blog post.

%%time
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("text_split/*.txt")]

# Initialize a Byte-Level BPE tokenizer (lowercasing all text)
tokenizer = ByteLevelBPETokenizer(lowercase=True)

# Customize training
tokenizer.train(files=paths, vocab_size=10000, min_frequency=2,
                show_progress=True,
                special_tokens=["<s>",
                                "<pad>",
                                "<e>",
                                "<unk>",
                                "<mask>"])

After training, save the tokenizer; we will then load it through the decoder model's tokenizer wrapper (RobertaTokenizerFast), which wraps our trained tokenizer.

os.mkdir('Byte_tokenizer')
tokenizer.save_model('Byte_tokenizer')
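
To sanity-check the trained tokenizer (an optional step, my own addition; the sample caption is made up), you can encode a sentence and inspect the resulting sub-word tokens:

# Quick check of the trained tokenizer on a sample caption
encoding = tokenizer.encode("a dog is running on the beach")
print(encoding.tokens)   # sub-word tokens produced by the Byte-Level BPE
print(encoding.ids)      # their integer ids in the 10,000-token vocabulary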

-> Training the Decoder

I have used RoBERTa as the text decoder model in this demonstration; you can choose any transformer model like DeBERTa, BERT, ALBERT, etc. (the tokenizing and initialization steps might differ for different models).

We are training the decoder on the captions to fine-tune it on the language structure and help it create better, brief descriptions of images.

To train the language model we use the masked language modelling (MLM) objective. In this method, a random word in the sentence is masked during training and the model tries to predict the word that most likely belongs there. The model keeps learning until it starts predicting the right words.

Initialize the model with the config and use the RoBERTa tokenizer wrapper as shown below. To learn more about masked language modelling (MLM) training, check out this blog post.

from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast

config = RobertaConfig(
    vocab_size=10000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

model = RobertaForMaskedLM(config=config)
model.to(device)
print('Num parameters: ', model.num_parameters())

# Create the fast tokenizer from the one we trained
tokenizer = RobertaTokenizerFast.from_pretrained('Byte_tokenizer', max_len=MAX_LEN)

Before training the model, we need to wrap the data in a PyTorch Dataset so it can be fed to the Trainer.

from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, df, tokenizer):
        # Tokenize every caption and keep only the input ids
        self.examples = []
        for example in df.values:
            x = tokenizer.encode_plus(example, max_length=MAX_LEN, truncation=True, padding=True)
            self.examples += [x.input_ids]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        # We'll pad at the batch level.
        return torch.tensor(self.examples[i])

# Create the train and evaluation datasets
train_dataset = CustomDataset(df['captions'][:38000], tokenizer)
eval_dataset = CustomDataset(df['captions'][38000:], tokenizer)

We also use a handy tool, the data collator for language modelling, which randomly masks tokens during training.

from transformers import DataCollatorForLanguageModeling

# Data collator that randomly selects 10% of the tokens for masking
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.1
)
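
As a quick illustration (my own addition), you can call the collator on a couple of dataset items to see the random masking in action; positions that are not masked get the label -100 so they are ignored by the loss:

# Inspect what the collator produces for a small batch
batch = collator([train_dataset[0], train_dataset[1]])
print(batch["input_ids"])  # a few token ids replaced by the <mask> token (or random tokens)
print(batch["labels"])     # original ids at masked positions, -100 everywhere else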

Now, using the previously defined hyperparameters, we will set up the training arguments and start training the model.

from transformers import Trainer, TrainingArguments

model_folder = "RobertaMLM"

# Define the training arguments
training_args = TrainingArguments(
    output_dir=model_folder,
    fp16=True,
    overwrite_output_dir=True,
    evaluation_strategy='epoch',
    num_train_epochs=TRAIN_EPOCHS,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=VALID_BATCH_SIZE,
    save_steps=8192,
    #eval_steps=4096,
    save_total_limit=1,
)

# Create the trainer for our model
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    #prediction_loss_only=True,
)

# Train the model
trainer.train()
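
After training, you can also get the evaluation loss and (an optional addition of mine) convert it to perplexity, a common metric for language models:

import math

# Evaluate on the held-out captions and report perplexity
eval_results = trainer.evaluate()
print(f"Eval loss: {eval_results['eval_loss']:.3f}")
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")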

Time to check how the model is performing!!

from transformers import pipeline

fill = pipeline(
    "fill-mask",
    model='RobertaMLM',
    tokenizer='Byte_tokenizer'
)
fill("a girl going into a <mask> building")

Finally, save the decoder and the tokenizer for further use in the second part.

tokenizer.save_pretrained('Byte_tokenizer')
trainer.save_model(model_folder)

This fine-tuned language transformer will serve as the decoder in our final model.

Please refer to my next blog post for Task 2, where we will train the Vision Encoder Decoder model to generate the image captions.

References:

https://github.com/edumunozsala/RoBERTa_Encoder_Decoder_Product_Names

https://arxiv.org/abs/2109.10282

