Building a Multimodal Model Architecture in Python
Introduction
This article is based on my research experience as a student, before ChatGPT became popular. At that time there was still very little discussion about multimodal training; the focus was heavily on pre-trained transformer models, which were, and still are, considered state-of-the-art foundation models. The discussion usually covered what they are and how they work, and less often how to leverage their power. Innovation, however, often comes from connecting old concepts in new ways. One day, while searching for inspiration for a competition idea, I was sidetracked by a computer vision multimodal paper that inspired me to experiment with building my own multimodal architecture on top of these pre-trained foundation models.
As we approach the era of Artificial General Intelligence (AGI), multimodal training has become more and more relevant in the field of AI. Drawing inspiration from the intuitive way we teach children, speaking to them while showing them objects, multimodal learning extracts information from different aspects of the same object. This approach allows AI systems to not only understand language but also interpret visual and auditory inputs, mimicking the holistic way humans learn and comprehend information.
In this exploration, we’ll build our own multimodal model. We’ll use a case study to show how an AI model could leverage multimodal data to gain more nuanced learning abilities. Our case study focuses on an image classification task where the model is tasked with predicting the room location category based on both a photo of the place and a text description accompanying the photo.
Dataset
The dataset used in this case study comes from a Kaggle competition, SoCS Hackathon — AI, which is based on MIT's Indoor Scene Recognition dataset. It covers 11 classes of room types: 2,003 labeled images of places form the training set, and 1,167 images not present in the training set form the test set. Additionally, 320 of the 1,167 test images are available as a validation set, while the remainder is hidden and used to determine the final leaderboard.
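As a quick sanity check, the snippet below simply counts the images in each class folder. It is not part of the original notebook; it assumes the training images are extracted to a folder such as /content/train/train with one subdirectory per room category, which is the layout expected later by keras.utils.image_dataset_from_directory. Adjust the path to your own setup.
import os

# hypothetical path; matches the data_dir used later in the article
data_dir = '/content/train/train'

# count the images inside each class subdirectory
for class_name in sorted(os.listdir(data_dir)):
    class_dir = os.path.join(data_dir, class_name)
    if os.path.isdir(class_dir):
        n_images = len(os.listdir(class_dir))
        print(f'{class_name}: {n_images} images')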
Model Architecture
In this article we will mainly explore how to create a multimodal classifier. Our architecture will consist of two models: a language model and a vision model. For the language model we will use RoBERTa, a transformer-based variant of BERT that uses dynamic masking to achieve more robust results, and for the vision model we will use ViT, a transformer-based vision model.
Before jumping into the model, we need to generate a text caption for each image. To do this we can use a generative captioning model. In this case study we use a GPT-2 based captioner, which has a relatively small computational cost.
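The captioning step itself is not shown in the notebook, so here is a minimal, illustrative sketch of how captions could be generated. It assumes the publicly available nlpconnect/vit-gpt2-image-captioning checkpoint and the Hugging Face image-to-text pipeline; the exact captioner used originally may differ.
# illustrative captioning sketch (assumption: uses the public
# 'nlpconnect/vit-gpt2-image-captioning' checkpoint, not necessarily
# the exact captioner from the original notebook)
import os
import pandas as pd
from transformers import pipeline

captioner = pipeline('image-to-text', model='nlpconnect/vit-gpt2-image-captioning')

rows = []
data_dir = '/content/train/train'  # hypothetical path, one subfolder per class
for label_idx, class_name in enumerate(sorted(os.listdir(data_dir))):
    class_dir = os.path.join(data_dir, class_name)
    for file_name in sorted(os.listdir(class_dir)):
        caption = captioner(os.path.join(class_dir, file_name))[0]['generated_text']
        # integer label matches the alphabetical class index used by
        # keras.utils.image_dataset_from_directory
        rows.append({'text': caption, 'label': label_idx})

# save the captions in the same format the article loads later ('caption_true.csv')
pd.DataFrame(rows).to_csv('caption_true.csv')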
Our multimodal architecture will then receive paired inputs of images and text captions. The import and preprocessing of the paired data are shown in the code snippet below.
# imports
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow import keras
from transformers import RobertaTokenizer

# constants used throughout the preprocessing pipeline
MAX_LENGTH = 128            # token length for the RoBERTa tokenizer
BATCH_SIZE = 16
SHUFFLE_BUFFER_SIZE = 2048  # buffer size for shuffling (value not given in the original)
# import the data
data_dir = '/content/train/train'
# import the images (shuffle=False keeps them aligned with the caption file)
img_size = (224, 224)
train_ds = keras.utils.image_dataset_from_directory(
    data_dir,
    shuffle=False,
    image_size=img_size,
    batch_size=BATCH_SIZE)
# import the captions and their labels
df = pd.read_csv('caption_true.csv', index_col=0)
tensor_train = tf.data.Dataset.from_tensor_slices((df['text'].values, df['label'].values))
# load the RoBERTa tokenizer used to encode the captions
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# apply tokenizer encoding to a single sentence
def convert_sentence_to_features(sentence, tokenizer):
    return tokenizer.encode_plus(
        sentence,
        max_length=MAX_LENGTH,
        add_special_tokens=True,
        padding='max_length',
        return_attention_mask=True,
        truncation=True
    )
# map the pre-processed tensors to the model's named input layers
def map_features_to_dict(image_list, input_ids_list, attention_masks_list, label_list=[], token_type_ids_list=[]):
    dict_ = {}
    if len(token_type_ids_list) > 0:
        if len(label_list) > 0:
            dict_ = {
                'image': image_list,
                'input_ids': input_ids_list,
                'token_type_ids': token_type_ids_list,
                'attention_mask': attention_masks_list,
            }, label_list
        else:
            dict_ = {
                'image': image_list,
                'input_ids': input_ids_list,
                'token_type_ids': token_type_ids_list,
                'attention_mask': attention_masks_list,
            }
    else:
        if len(label_list) > 0:
            dict_ = {
                'image': image_list,
                'input_ids': input_ids_list,
                'attention_mask': attention_masks_list,
            }, label_list
        else:
            dict_ = {
                'image': image_list,
                'input_ids': input_ids_list,
                'attention_mask': attention_masks_list,
            }
    return dict_
# pre-process the text into tokens and masks, pair them with the images, then map to the input layer names
def encode_sentences(dataset, imageset, tokenizer, evaluation=False):
    input_ids_list = []
    token_type_ids_list = []
    attention_masks_list = []
    label_list = []
    image_list = []
    tensor_dataset = []
    # the evaluation set has no labels
    if evaluation:
        for images in imageset:
            for image in images:
                image_list.append(image)
        # apply the tokenizer, then append the results to the lists
        for message in tfds.as_numpy(dataset):
            bert_input = convert_sentence_to_features(message.decode(), tokenizer)
            input_ids_list.append(bert_input['input_ids'])
            attention_masks_list.append(bert_input['attention_mask'])
            if 'token_type_ids' in bert_input:
                token_type_ids_list.append(bert_input['token_type_ids'])
    else:
        for images, label in imageset:
            for image in images:
                image_list.append(image)
        for message, label in tfds.as_numpy(dataset):
            bert_input = convert_sentence_to_features(message.decode(), tokenizer)
            input_ids_list.append(bert_input['input_ids'])
            attention_masks_list.append(bert_input['attention_mask'])
            label_list.append([label])
            if 'token_type_ids' in bert_input:
                token_type_ids_list.append(bert_input['token_type_ids'])
    # pair each image with the tokens and mask produced by the tokenizer encoding
    if len(token_type_ids_list) > 0:
        if len(label_list) > 0:
            tensor_dataset = tf.data.Dataset.from_tensor_slices((image_list, input_ids_list, attention_masks_list, label_list, token_type_ids_list))
        else:
            tensor_dataset = tf.data.Dataset.from_tensor_slices((image_list, input_ids_list, attention_masks_list, token_type_ids_list))
    else:
        if len(label_list) > 0:
            tensor_dataset = tf.data.Dataset.from_tensor_slices((image_list, input_ids_list, attention_masks_list, label_list))
        else:
            tensor_dataset = tf.data.Dataset.from_tensor_slices((image_list, input_ids_list, attention_masks_list))
    return tensor_dataset.map(map_features_to_dict)
AUTOTUNE = tf.data.AUTOTUNE
ds_train_encoded = (encode_sentences(tensor_train, train_ds, tokenizer=tokenizer)
                    .cache()
                    .shuffle(SHUFFLE_BUFFER_SIZE)
                    .batch(BATCH_SIZE)
                    .prefetch(AUTOTUNE))
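Before building the model, it can be helpful to confirm the shapes coming out of the pipeline. This quick check is not in the original notebook; it simply prints the shapes of one encoded batch.
# illustrative sanity check: inspect one batch of the encoded training dataset
for features, labels in ds_train_encoded.take(1):
    print(features['image'].shape)           # (BATCH_SIZE, 224, 224, 3)
    print(features['input_ids'].shape)       # (BATCH_SIZE, MAX_LENGTH)
    print(features['attention_mask'].shape)  # (BATCH_SIZE, MAX_LENGTH)
    print(labels.shape)                      # (BATCH_SIZE, 1)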
After preprocessing the data, we define each layer of our model. We start with the input layers and route them to the corresponding language and vision models. Next, we concatenate the output embeddings from these models and pass them through a multi-layer perceptron. Finally, a softmax activation turns the output of the last dense layer into probabilities over the 11 categories.
# imports for the model
from tensorflow.keras import layers
from transformers import TFViTForImageClassification, ViTFeatureExtractor, TFRobertaForSequenceClassification

IMG_SHAPE = (224, 224, 3)

# init the ViT vision model
model_id = 'google/vit-base-patch16-224'
vit_model = TFViTForImageClassification.from_pretrained(model_id)
# ViT input layer
image = layers.Input(shape=IMG_SHAPE, dtype='float32', name='image')
# init the RoBERTa language model
roberta = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_hidden_layers=8)
# RoBERTa input layers
input_ids = tf.keras.Input(shape=(MAX_LENGTH, ), dtype='int32', name='input_ids')
attention_mask = tf.keras.Input(shape=(MAX_LENGTH, ), dtype='int32', name='attention_mask')
# additional preprocessing for ViT, using the feature extractor's normalization statistics
feature_extractor = ViTFeatureExtractor.from_pretrained(model_id)
def vit_preprocess_input(img):
    mean = feature_extractor.image_mean
    std = feature_extractor.image_std
    # scale to the value range [0, 1] first and then normalize
    img = img / 255
    mean = tf.constant(mean)
    std = tf.constant(std)
    img = (img - mean) / std
    # ViT expects channels-first input: (batch, channels, height, width)
    return tf.transpose(img, (0, 3, 1, 2))
# optional augmentation layer (the original layer is not shown; a simple flip/rotation is used here as a stand-in)
data_augmentation = keras.Sequential([
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.1),
])
# build the model
# RoBERTa branch: [0] is the output of its classification head
r = roberta([input_ids, attention_mask])[0]
# ViT branch
aug = data_augmentation(image)  # this part is optional
v = vit_preprocess_input(aug)
v = vit_model.vit(v)[0]
# concatenate the ViT [CLS] embedding with the RoBERTa output
x = layers.Concatenate()([v[:, 0, :], r])
# the MLP head
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.5)(x)
# softmax over the 11 room categories
outputs = layers.Dense(11, activation="softmax")(x)
model = keras.Model(inputs=[image, input_ids, attention_mask], outputs=[outputs])
# use a general adaptive optimizer; the outputs are already softmax probabilities,
# so from_logits must be False
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])
# train
history = model.fit(ds_train_encoded,
                    # validation_data=valid_ds,  # this part is optional
                    epochs=1)
Result
After some tuning, training of the final architecture converged within 6 to 10 epochs. Testing on the validation set then gave a weighted F1 score of 97%.
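For reference, here is a minimal sketch of how that validation score could be computed. It assumes a validation image directory (valid_dir) and a matching caption file ('caption_valid.csv') in the same format as the training captions; neither name comes from the original notebook.
# illustrative evaluation sketch; valid_dir and 'caption_valid.csv' are assumed names
import numpy as np
from sklearn.metrics import f1_score

valid_images = keras.utils.image_dataset_from_directory(
    valid_dir, shuffle=False, image_size=img_size, batch_size=BATCH_SIZE)
df_valid = pd.read_csv('caption_valid.csv', index_col=0)
tensor_valid = tf.data.Dataset.from_tensor_slices((df_valid['text'].values, df_valid['label'].values))

# reuse the same preprocessing pipeline, but without shuffling so order is preserved
ds_valid_encoded = encode_sentences(tensor_valid, valid_images, tokenizer=tokenizer).batch(BATCH_SIZE)

y_prob = model.predict(ds_valid_encoded)                   # softmax probabilities
y_pred = np.argmax(y_prob, axis=1)                         # predicted class indices
y_true = np.concatenate([y for _, y in ds_valid_encoded])  # true labels, shape (n, 1)
print('weighted F1:', f1_score(y_true.ravel(), y_pred, average='weighted'))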
The model we created was submitted to the competition to run on the remaining unseen data and achieved a score of 88%. Compared to the first-place winner, who used a natively multimodal pre-trained BEiT, our handmade RoBERTa + ViT model didn't stand a chance. For competitions, or for day-to-day tasks, it is usually better to simply use a pre-trained, already-optimized multimodal model.
However, the primary focus of this article is not to compete with existing state-of-the-art models but to delve into the inner workings of multimodal models, gaining insights to inspire future advancements in the field.
Closing
In this case study we did a deep dive into creating a multimodal neural network. By combining the visual and textual information provided by the room dataset, our multimodal approach aims to give the model a richer understanding of the context of each image. The process began with preprocessing the data and defining the model's layers, routing inputs through specialized language and vision models. We then concatenated the extracted features and passed them through a multi-layer perceptron, ultimately using the softmax function for classification.
Reference
The full notebook can be found here: https://colab.research.google.com/drive/15Acr9hwdNkZOV65U9tPX2zwDKiinjXF4?usp=sharing