Building a Multimodal Model Architecture in Python
Introduction
This article is based on my research experience as a student, before ChatGPT became popular. At that time there was still very little discussion about multimodal training; the focus was heavily on pre-trained transformer models, which were, and still are, considered state-of-the-art foundation models. The discussion usually covered what they are and how they work, and less often how to leverage their power. Innovation, however, often comes from connecting old concepts in new ways. One day, while searching for inspiration for a competition idea, I was sidetracked by a computer vision multimodal paper that inspired me to experiment with building my own multimodal architecture on top of these pre-trained foundation models.
As we approach the era of Artificial General Intelligence (AGI), multimodal training has become more and more relevant in the field of AI. Drawing inspiration from the intuitive way we teach children, speaking to them while showing them objects, multimodal learning extracts information from different aspects of the same object. This approach allows AI systems to not only understand language but also interpret visual and auditory inputs, mimicking the holistic way humans learn and comprehend information.
In this exploration, we’ll build our own multimodal model. We’ll use a case study to show how an AI model could leverage multimodal data to gain more nuanced learning abilities. Our case study focuses on an image classification task where the model is tasked with predicting the room location category based on both a photo of the place and a text description accompanying the photo.
Dataset
The dataset used in this case study comes from a Kaggle competition, SoCS Hackathon — AI, which is based on MIT's Indoor Scene Recognition dataset. It covers 11 classes of room types: 2,003 labeled images of places form the training set, and 1,167 images not present in the training set form the test set. Additionally, 320 of the 1,167 test images are available as a validation set, while the remainder is hidden and used to determine the final leaderboard.
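As a quick sanity check, the snippet below simply counts the images in each class folder. It is not part of the original notebook; it assumes the training images are extracted to a folder such as /content/train/train with one subdirectory per room category, which is the layout expected later by keras.utils.image_dataset_from_directory. Adjust the path to your own setup.
import os

# hypothetical path; matches the data_dir used later in the article
data_dir = '/content/train/train'

# count the images inside each class subdirectory
for class_name in sorted(os.listdir(data_dir)):
    class_dir = os.path.join(data_dir, class_name)
    if os.path.isdir(class_dir):
        n_images = len(os.listdir(class_dir))
        print(f'{class_name}: {n_images} images')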
Model Architecture
In this article we will mainly explore how to create a multimodal classifier. Our architecture will consist of two models: a language model and a vision model. For the language model we will use RoBERTa, a transformer-based variant of BERT that uses dynamic masking to achieve more robust results, and for the vision model we will use ViT, a transformer-based vision model.
Before jumping into the model, we need to generate a text caption for each image. To do this we can use a generative captioning model. In this case study we use a GPT-2 based captioner, which has a relatively small computational cost.
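The captioning step itself is not shown in the notebook, so here is a minimal, illustrative sketch of how captions could be generated. It assumes the publicly available nlpconnect/vit-gpt2-image-captioning checkpoint and the Hugging Face image-to-text pipeline; the exact captioner used originally may differ.
# illustrative captioning sketch (assumption: uses the public
# 'nlpconnect/vit-gpt2-image-captioning' checkpoint, not necessarily
# the exact captioner from the original notebook)
import os
import pandas as pd
from transformers import pipeline

captioner = pipeline('image-to-text', model='nlpconnect/vit-gpt2-image-captioning')

rows = []
data_dir = '/content/train/train'  # hypothetical path, one subfolder per class
for label_idx, class_name in enumerate(sorted(os.listdir(data_dir))):
    class_dir = os.path.join(data_dir, class_name)
    for file_name in sorted(os.listdir(class_dir)):
        caption = captioner(os.path.join(class_dir, file_name))[0]['generated_text']
        # integer label matches the alphabetical class index used by
        # keras.utils.image_dataset_from_directory
        rows.append({'text': caption, 'label': label_idx})

# save the captions in the same format the article loads later ('caption_true.csv')
pd.DataFrame(rows).to_csv('caption_true.csv')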
Our multimodal architecture will then receive paired inputs of images and text captions. The import and preprocessing of the paired data are shown in the code snippet below.
# imports
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow import keras
from transformers import RobertaTokenizer

# constants used throughout the preprocessing pipeline
MAX_LENGTH = 128            # token length for the RoBERTa tokenizer
BATCH_SIZE = 16
SHUFFLE_BUFFER_SIZE = 2048  # buffer size for shuffling (value not given in the original)
# import the data
data_dir = '/content/train/train'
# import the images (shuffle=False keeps them aligned with the caption file)
img_size = (224, 224)
train_ds = keras.utils.image_dataset_from_directory(
    data_dir,
    shuffle=False,
    image_size=img_size,
    batch_size=BATCH_SIZE)
# import the captions and their labels
df = pd.read_csv('caption_true.csv', index_col=0)
tensor_train = tf.data.Dataset.from_tensor_slices((df['text'].values, df['label'].values))
# load the RoBERTa tokenizer used to encode the captions
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# apply tokenizer encoding to a single sentence
def convert_sentence_to_features(sentence, tokenizer):
    return tokenizer.encode_plus(
        sentence,
        max_length=MAX_LENGTH,
        add_special_tokens=True,
        padding='max_length',
        return_attention_mask=True,
        truncation=True
    )
# map the pre-processed tensors to the model's named input layers
def map_features_to_dict(image_list, input_ids_list, attention_masks_list, label_list=[], token_type_ids_list=[]):
    dict_ = {}
    if len(token_type_ids_list) > 0:
        if len(label_list) > 0:
            dict_ = {
                'image': image_list,
                'input_ids': input_ids_list,
                'token_type_ids': token_type_ids_list,
                'attention_mask': attention_masks_list,
            }, label_list
        else:
            dict_ = {
                'image': image_list,
                'input_ids': input_ids_list,
                'token_type_ids': token_type_ids_list,
                'attention_mask': attention_masks_list,
            }
    else:
        if len(label_list) > 0:
            dict_ = {
                'image': image_list,
                'input_ids': input_ids_list,
                'attention_mask': attention_masks_list,
            }, label_list
        else:
            dict_ = {
                'image': image_list,
                'input_ids': input_ids_list,
                'attention_mask': attention_masks_list,
            }
    return dict_
# pre-process the text into tokens and masks, pair them with the images, then map to the input layer names
def encode_sentences(dataset, imageset, tokenizer, evaluation=False):
    input_ids_list = []
    token_type_ids_list = []
    attention_masks_list = []
    label_list = []
    image_list = []
    tensor_dataset = []
    # the evaluation set has no labels
    if evaluation:
        for images in imageset:
            for image in images:
                image_list.append(image)
        # apply the tokenizer, then append the results to the lists
        for message in tfds.as_numpy(dataset):
            bert_input = convert_sentence_to_features(message.decode(), tokenizer)
            input_ids_list.append(bert_input['input_ids'])
            attention_masks_list.append(bert_input['attention_mask'])
            if 'token_type_ids' in bert_input:
                token_type_ids_list.append(bert_input['token_type_ids'])
    else:
        for images, label in imageset:
            for image in images:
                image_list.append(image)
        for message, label in tfds.as_numpy(dataset):
            bert_input = convert_sentence_to_features(message.decode(), tokenizer)
            input_ids_list.append(bert_input['input_ids'])
            attention_masks_list.append(bert_input['attention_mask'])
            label_list.append([label])
            if 'token_type_ids' in bert_input:
                token_type_ids_list.append(bert_input['token_type_ids'])
    # pair each image with the tokens and mask produced by the tokenizer encoding
    if len(token_type_ids_list) > 0:
        if len(label_list) > 0:
            tensor_dataset = tf.data.Dataset.from_tensor_slices((image_list, input_ids_list, attention_masks_list, label_list, token_type_ids_list))
        else:
            tensor_dataset = tf.data.Dataset.from_tensor_slices((image_list, input_ids_list, attention_masks_list, token_type_ids_list))
    else:
        if len(label_list) > 0:
            tensor_dataset = tf.data.Dataset.from_tensor_slices((image_list, input_ids_list, attention_masks_list, label_list))
        else:
            tensor_dataset = tf.data.Dataset.from_tensor_slices((image_list, input_ids_list, attention_masks_list))
    return tensor_dataset.map(map_features_to_dict)
AUTOTUNE = tf.data.AUTOTUNE
ds_train_encoded = (encode_sentences(tensor_train, train_ds, tokenizer=tokenizer)
                    .cache()
                    .shuffle(SHUFFLE_BUFFER_SIZE)
                    .batch(BATCH_SIZE)
                    .prefetch(AUTOTUNE))
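Before building the model, it can be helpful to confirm the shapes coming out of the pipeline. This quick check is not in the original notebook; it simply prints the shapes of one encoded batch.
# illustrative sanity check: inspect one batch of the encoded training dataset
for features, labels in ds_train_encoded.take(1):
    print(features['image'].shape)           # (BATCH_SIZE, 224, 224, 3)
    print(features['input_ids'].shape)       # (BATCH_SIZE, MAX_LENGTH)
    print(features['attention_mask'].shape)  # (BATCH_SIZE, MAX_LENGTH)
    print(labels.shape)                      # (BATCH_SIZE, 1)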
After preprocessing the data, we define each layer of our model. We start with the input layers and route them to the corresponding language and vision models. Next, we concatenate the output embeddings from these models and pass them through a multi-layer perceptron. Finally, a softmax activation turns the output of the last dense layer into probabilities over the 11 categories.
# imports for the model
from tensorflow.keras import layers
from transformers import TFViTForImageClassification, ViTFeatureExtractor, TFRobertaForSequenceClassification

IMG_SHAPE = (224, 224, 3)

# init the ViT vision model
model_id = 'google/vit-base-patch16-224'
vit_model = TFViTForImageClassification.from_pretrained(model_id)
# ViT input layer
image = layers.Input(shape=IMG_SHAPE, dtype='float32', name='image')
# init the RoBERTa language model
roberta = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_hidden_layers=8)
# RoBERTa input layers
input_ids = tf.keras.Input(shape=(MAX_LENGTH, ), dtype='int32', name='input_ids')
attention_mask = tf.keras.Input(shape=(MAX_LENGTH, ), dtype='int32', name='attention_mask')
# additional preprocessing for ViT, using the feature extractor's normalization statistics
feature_extractor = ViTFeatureExtractor.from_pretrained(model_id)
def vit_preprocess_input(img):
    mean = feature_extractor.image_mean
    std = feature_extractor.image_std
    # scale to the value range [0, 1] first and then normalize
    img = img / 255
    mean = tf.constant(mean)
    std = tf.constant(std)
    img = (img - mean) / std
    # ViT expects channels-first input: (batch, channels, height, width)
    return tf.transpose(img, (0, 3, 1, 2))
# optional augmentation layer (the original layer is not shown; a simple flip/rotation is used here as a stand-in)
data_augmentation = keras.Sequential([
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.1),
])
# build the model
# RoBERTa branch: [0] is the output of its classification head
r = roberta([input_ids, attention_mask])[0]
# ViT branch
aug = data_augmentation(image)  # this part is optional
v = vit_preprocess_input(aug)
v = vit_model.vit(v)[0]
# concatenate the ViT [CLS] embedding with the RoBERTa output
x = layers.Concatenate()([v[:, 0, :], r])
# the MLP head
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.5)(x)
# softmax over the 11 room categories
outputs = layers.Dense(11, activation="softmax")(x)
model = keras.Model(inputs=[image, input_ids, attention_mask], outputs=[outputs])
# use a general adaptive optimizer; the outputs are already softmax probabilities,
# so from_logits must be False
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])
# train
history = model.fit(ds_train_encoded,
                    # validation_data=valid_ds,  # this part is optional
                    epochs=1)
Result
After some tuning, training of the final architecture converged within 6 to 10 epochs. Testing on the validation set then gave a weighted F1 score of 97%.
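For reference, here is a minimal sketch of how that validation score could be computed. It assumes a validation image directory (valid_dir) and a matching caption file ('caption_valid.csv') in the same format as the training captions; neither name comes from the original notebook.
# illustrative evaluation sketch; valid_dir and 'caption_valid.csv' are assumed names
import numpy as np
from sklearn.metrics import f1_score

valid_images = keras.utils.image_dataset_from_directory(
    valid_dir, shuffle=False, image_size=img_size, batch_size=BATCH_SIZE)
df_valid = pd.read_csv('caption_valid.csv', index_col=0)
tensor_valid = tf.data.Dataset.from_tensor_slices((df_valid['text'].values, df_valid['label'].values))

# reuse the same preprocessing pipeline, but without shuffling so order is preserved
ds_valid_encoded = encode_sentences(tensor_valid, valid_images, tokenizer=tokenizer).batch(BATCH_SIZE)

y_prob = model.predict(ds_valid_encoded)                   # softmax probabilities
y_pred = np.argmax(y_prob, axis=1)                         # predicted class indices
y_true = np.concatenate([y for _, y in ds_valid_encoded])  # true labels, shape (n, 1)
print('weighted F1:', f1_score(y_true.ravel(), y_pred, average='weighted'))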
The model we created was submitted to the competition to run on the remaining unseen data and achieved a score of 88%. Compared to the first-place winner, who used a natively multimodal pre-trained BEiT, our handmade RoBERTa + ViT model didn't stand a chance. For competitions, or for day-to-day tasks, it is usually better to simply use a pre-trained, already-optimized multimodal model.
However, the primary focus of this article is not to compete with existing state-of-the-art models but to delve into the inner workings of multimodal models, gaining insights to inspire future advancements in the field.
Closing
In this case study we did a deep dive into creating a multimodal neural network. By combining the visual and textual information provided by the room dataset, our multimodal approach aims to give the model a richer understanding of the context of each image. The process began with preprocessing the data and defining the model's layers, routing inputs through specialized language and vision models. We then concatenated the extracted features and passed them through a multi-layer perceptron, ultimately using the softmax function for classification.
Reference
The full notebook can be found here: https://colab.research.google.com/drive/15Acr9hwdNkZOV65U9tPX2zwDKiinjXF4?usp=sharing