Training Multimodal GPT using PyTorch Lightning — Part 1

Sanjana Tule
MatrixnTensors
5 min read · Feb 5, 2024


Introduction

Multimodal models are designed to process and generate information from multiple modalities, such as text, images, and possibly other forms of data. GPT (Generative Pre-trained Transformer) is a type of transformer-based language model trained on large datasets to understand and generate human-like text. Combining the two, an ideal multimodal GPT has the ability to answer questions related to images. A natural application is a multimodal chatbot, where the questions themselves can be posed in various formats, such as audio or text. In this two-part blog series, we explore the technical and training details of the multimodal GPT model. This article covers multimodal training and fine-tuning for visual instruction following. The project is based on the work done in LLaVA 1.5, with some key differences.

Model Architecture

The architecture of the multimodal GPT model includes two pretrained models, CLIP and Phi-2, with a trainable projection layer between them. CLIP (Contrastive Language-Image Pre-Training) is a multimodal vision and language model trained with a contrastive learning framework, where it learns to bring similar image and text pairs closer together in the representation space while pushing dissimilar pairs apart.

Phi-2 is a small language model with 2.7 billion parameters, yet it showcases state-of-the-art performance among base language models with fewer than 13 billion parameters.

The projection layer is a simple set of linear layers with residual connections (shown in Figure 2).
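As a rough sketch, a projection layer of this kind might look like the following. The hidden sizes (768 for a CLIP ViT-B encoder, 2560 for Phi-2) and the number of blocks are assumptions for illustration, not the exact configuration used here.

```python
import torch.nn as nn

class ProjectionBlock(nn.Module):
    """A single linear block with a residual connection (cf. Figure 2)."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.act = nn.GELU()

    def forward(self, x):
        # Residual connection around the linear layer.
        return x + self.act(self.linear(x))

class ProjectionLayer(nn.Module):
    """Maps CLIP image features into the Phi-2 embedding space."""
    def __init__(self, clip_dim: int = 768, phi_dim: int = 2560, num_blocks: int = 2):
        super().__init__()
        self.input_proj = nn.Linear(clip_dim, phi_dim)  # match Phi-2's hidden size
        self.blocks = nn.Sequential(*[ProjectionBlock(phi_dim) for _ in range(num_blocks)])

    def forward(self, image_feats):
        return self.blocks(self.input_proj(image_feats))
```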

The CLIP model converts the image into meaningful representations that integrate both visual and textual information. The CLIP image representations are sent to the projection layer and converted into embeddings compatible with the Phi-2 language model; the projection layer is responsible for aligning the embeddings between the CLIP and Phi-2 models. The text or question, on the other hand, is sent through the Phi-2 tokenizer and embedding layer to produce the corresponding text embeddings for the input question. The image and text embeddings are then concatenated and forwarded through the Phi-2 layers.
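A minimal sketch of this forward pass, assuming CLIP and Phi-2 are loaded through Hugging Face transformers (the function and variable names below are illustrative), could look like this:

```python
import torch

def multimodal_forward(clip_vision, projection, phi2, tokenizer, pixel_values, question):
    # 1. Encode the image with the frozen CLIP vision encoder.
    with torch.no_grad():
        image_feats = clip_vision(pixel_values=pixel_values).last_hidden_state  # (B, patches, clip_dim)

    # 2. Project CLIP features into Phi-2's embedding space.
    image_embeds = projection(image_feats)  # (B, patches, phi_dim)

    # 3. Tokenize and embed the question with Phi-2's own embedding layer.
    text_ids = tokenizer(question, return_tensors="pt").input_ids
    text_embeds = phi2.get_input_embeddings()(text_ids)  # (B, seq_len, phi_dim)

    # 4. Concatenate image and text embeddings and forward through Phi-2.
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    return phi2(inputs_embeds=inputs_embeds)
```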

Figure 1 below shows the high-level architecture of the multimodal GPT model. The architecture is based on LLaVA 1.5, where LLaVA stands for Large Language and Vision Assistant.

Figure 1: Multimodal GPT model architecture. CLIP and Phi-2 are pretrained models and are frozen. The image and question are the inputs to the model. The model is trained to generate answers to the question.
Figure 2: Projection layer architecture.

Training Strategy

The model is trained in two stages: stage 1 (pretraining) and stage 2 (fine-tuning). In both stages, the CLIP and Phi-2 models are frozen, whereas the projection layer is trainable. In stage 2, Phi-2 is additionally fine-tuned using QLoRA. QLoRA provides a memory-efficient way to fine-tune large language models without compromising too much on performance.
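A sketch of this setup, assuming the models come from Hugging Face transformers and the LoRA adapters from peft, is shown below. The checkpoint names, LoRA hyperparameters, and target_modules are assumptions for illustration and depend on the exact Phi-2 implementation you load.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, CLIPVisionModel
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Frozen CLIP vision encoder (a 224px ViT-B variant).
clip_vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
for p in clip_vision.parameters():
    p.requires_grad = False

# Stage 2: load Phi-2 in 4-bit for QLoRA fine-tuning.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
phi2 = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", quantization_config=bnb_config, trust_remote_code=True
)
phi2 = prepare_model_for_kbit_training(phi2)

# Attach LoRA adapters; the base Phi-2 weights stay frozen and only the
# adapters (plus the projection layer) receive gradients.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # depends on the Phi-2 implementation
    task_type="CAUSAL_LM",
)
phi2 = get_peft_model(phi2, lora_config)
```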

We train the model on a single 48 GB GPU using PyTorch Lightning, following the same strategy as described in LLaVA 1.5.
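Wiring this into PyTorch Lightning could look roughly like the following LightningModule skeleton; the optimizer choice, learning rate, and precision setting are illustrative assumptions. The dataloader is expected to supply labels padded with -100 over the image positions (see the data preparation below).

```python
import pytorch_lightning as pl
import torch

class MultimodalGPT(pl.LightningModule):
    """Minimal LightningModule sketch: frozen CLIP and Phi-2, trainable projection."""

    def __init__(self, clip_vision, projection, phi2):
        super().__init__()
        self.clip_vision = clip_vision
        self.projection = projection
        self.phi2 = phi2

    def training_step(self, batch, batch_idx):
        image_embeds = self.projection(
            self.clip_vision(pixel_values=batch["pixel_values"]).last_hidden_state
        )
        text_embeds = self.phi2.get_input_embeddings()(batch["input_ids"])
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        # Labels are padded with -100 over the image positions, so only text
        # tokens contribute to the loss.
        out = self.phi2(inputs_embeds=inputs_embeds, labels=batch["labels"])
        self.log("train_loss", out.loss)
        return out.loss

    def configure_optimizers(self):
        # Only parameters with requires_grad=True (the projection layer and,
        # in stage 2, the LoRA adapters) are updated.
        trainable = [p for p in self.parameters() if p.requires_grad]
        return torch.optim.AdamW(trainable, lr=1e-4)

trainer = pl.Trainer(accelerator="gpu", devices=1, precision="bf16-mixed", max_epochs=1)
# trainer.fit(MultimodalGPT(clip_vision, projection, phi2), train_dataloader)
```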

In the stage 1 pretraining step, we train the model to predict captions for an input image.

In the stage 2 fine-tuning step, we continue training the projection layer and fine-tune Phi-2 using QLoRA for instruction following. The model is trained to predict the answer to a given question about an image.

The key differences in training with respect to LLaVA 1.5:

Stage 1:

  1. 1x A6000 vs 8x A100 GPU.
  2. CLIP-ViT-B-224px vs CLIP-ViT-L-336px.
  3. COCO 2017 as opposed to LCS-558K.
  4. Phi-2 (2.7B) instead of Vicuna (13B parameter model).

Stage 2:

  1. 8x A6000 vs 8x A100 GPU.
  2. CLIP-ViT-B-224px vs CLIP-ViT-L-336px.
  3. Instruct150K subset as opposed to the whole dataset.
  4. Phi-2 (2.7B) instead of Vicuna (13B parameter model).

Data Preparation

For stage 1, we use the COCO 2017 dataset to train the model to predict captions for a given input image. The dataset consists of 118,287 images and 591,753 captions. Figure 3 shows an example image and its captions.

Figure 3: Example of an image and its captions from the COCO 2017 dataset.

For stage 1 training, we structure the data for the GPT autoregressive training approach. Here, the image embeddings remain constant in the input, while the captions undergo sequential shifts, one word at a time, during the training phase. This means that each word in the caption is conditioned on the image and the preceding words in the caption. We also add a separator token between the image and the caption so the model can learn where the image ends; the separator used in this case is ‘[Visualize this]’. An illustrative example is provided in Figure 4 below.

Figure 4: Training data preparation for step 1.
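To make the stage 1 label construction concrete, here is a minimal sketch of how a single caption example could be assembled. It assumes a Hugging Face tokenizer; the image itself is injected later as projected CLIP embeddings, so only its positions are masked out of the loss. The helper name is hypothetical.

```python
import torch

IGNORE_INDEX = -100  # positions with this label are ignored by the loss

def build_caption_example(tokenizer, caption, num_image_tokens):
    """Stage 1 sketch: image embeddings, then '[Visualize this]', then the caption."""
    sep_ids = tokenizer("[Visualize this]", add_special_tokens=False).input_ids
    cap_ids = tokenizer(caption, add_special_tokens=False).input_ids + [tokenizer.eos_token_id]

    # The image is injected as embeddings, not token ids, so input_ids holds
    # only the separator and the caption.
    input_ids = sep_ids + cap_ids
    labels = (
        [IGNORE_INDEX] * num_image_tokens   # image positions: no loss
        + [IGNORE_INDEX] * len(sep_ids)     # separator: no loss
        + cap_ids                           # caption tokens: predicted autoregressively
    )
    return torch.tensor(input_ids), torch.tensor(labels)
```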

For stage 2, we use the Instruct150K dataset, which was created by applying GPT-4 prompts to COCO images. The dataset has 199,770 question-and-answer pairs associated with 81,479 images. Figure 5 shows an example image, question, and answer.

Figure 5: Examples from Instruct150K dataset.

For stage 2, the training process is akin to stage 1, but now both the image embeddings and the question are kept fixed in the input, and the shift is applied to the answer. This means that every word in the answer is conditioned on the image, the question, and the preceding words of the answer. A separator token is added between the question and the answer. An illustrative example is provided in Figure 6 below.

Figure 6: Training data preparation for step 2.
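Analogously, a stage 2 example can be assembled so that the loss falls only on the answer tokens; the separator string used below is a placeholder assumption, not the token used in the original setup.

```python
import torch

IGNORE_INDEX = -100  # positions with this label are ignored by the loss

def build_qa_example(tokenizer, question, answer, num_image_tokens, sep="[QA]"):
    """Stage 2 sketch: image and question are fixed; only the answer is predicted."""
    q_ids = tokenizer(question, add_special_tokens=False).input_ids
    sep_ids = tokenizer(sep, add_special_tokens=False).input_ids  # placeholder separator
    a_ids = tokenizer(answer, add_special_tokens=False).input_ids + [tokenizer.eos_token_id]

    input_ids = q_ids + sep_ids + a_ids
    labels = (
        [IGNORE_INDEX] * (num_image_tokens + len(q_ids) + len(sep_ids))  # no loss on image, question, separator
        + a_ids                                                          # loss only on the answer
    )
    return torch.tensor(input_ids), torch.tensor(labels)
```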

The data is now ready, and we can move on to training the model. In the upcoming second part of this series, we’ll delve into the code, the training of the model, and the results. Part 2 of this series can be accessed here.

References

Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. “Learning Transferable Visual Models From Natural Language Supervision.” 2021.

Gunasekar, Suriya, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. “Textbooks Are All You Need.” 2023.

Liu, Haotian, Chunyuan Li, Yuheng Li, and Yong Jae Lee. “Improved Baselines with Visual Instruction Tuning.” 2023.

Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. “QLoRA: Efficient Finetuning of Quantized LLMs.” 2023.

Sharavan, Rohan. “The School of A.I.” 2023.
