Analysis of KOSMOS-1

Hemant Vikani
Version 1
Published Mar 13, 2023 · 6 min read

1. INTRODUCTION

The following document is intended to provide a high-level review of the KOSMOS-1 model. KOSMOS-1 is a multimodal large language model (MLLM) developed by Microsoft that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). KOSMOS-1 is trained from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. The authors evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or fine-tuning. Experimental results show that KOSMOS-1 achieves impressive performance on:
1. Language understanding, generation, and even OCR-free NLP (directly fed with document images).
2. Perception-language tasks, including multimodal dialogue, image captioning and visual question answering.
3. Vision tasks, such as image recognition with descriptions (specifying classification via text instructions).

They also show that MLLMs can benefit from cross-modal transfer, i.e., knowledge can be transferred from language to multimodal tasks and from multimodal tasks back to language. KOSMOS-1 is trained with about 1.6B parameters, which is considerably smaller than other large language models such as GPT-3 with 175B parameters. It is worth noting that KOSMOS-1 possesses an innate capability to read and understand text present in rendered images without using any external OCR tools.
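To make the "arbitrarily interleaved text and images" input concrete, below is a minimal sketch of how a few-shot multimodal prompt of the kind described in the paper could be laid out. The list-of-dicts structure, field names, and file paths are assumptions made purely for illustration; Microsoft has not released an API or input specification for KOSMOS-1.

```python
# Hypothetical layout of a few-shot, interleaved image-text prompt.
# The structure and field names are invented for illustration; KOSMOS-1
# has no public API, so this is not an actual input format.
few_shot_prompt = [
    {"type": "image", "path": "examples/cat.jpg"},
    {"type": "text",  "content": "Question: What animal is this? Answer: A cat."},
    {"type": "image", "path": "examples/dog.jpg"},
    {"type": "text",  "content": "Question: What animal is this? Answer: A dog."},
    # The query the model should complete, given the two in-context examples above.
    {"type": "image", "path": "query/unknown.jpg"},
    {"type": "text",  "content": "Question: What animal is this? Answer:"},
]
```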

2. KOSMOS-1

KOSMOS-1 is still in its research phase, and Microsoft does not yet provide an API for it. It is able to perform various tasks such as language understanding, generation, OCR-free NLP, image captioning, visual question answering, and image recognition with descriptions.

Model architecture consists of:

1. MAGNETO: a Transformer variant (paper) used as the backbone architecture. MAGNETO has better training stability and superior performance across modalities.

2. xPos: the xPos (paper) relative position encoding is used for better long-context modelling. The method generalizes better to different lengths, i.e., training on short sequences while testing on longer ones; a simplified sketch of the idea follows below.
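As a rough illustration (not the exact xPos formulation), the sketch below shows the core idea of scaling rotary-style query and key vectors so that attention scores pick up a factor that decays smoothly with relative distance, which is what helps extrapolation to longer sequences. The function name, the gamma constant, and the exact scaling schedule are assumptions made for this sketch.

```python
import torch

def xpos_like_scales(seq_len: int, head_dim: int, gamma: float = 0.4):
    # One decay base per rotary dimension pair; gamma is a smoothing constant
    # assumed here for illustration, not necessarily the value used in the paper.
    pair_idx = torch.arange(head_dim // 2, dtype=torch.float32) / (head_dim // 2)
    zeta = (pair_idx + gamma) / (1.0 + gamma)                  # (head_dim // 2,)

    positions = torch.arange(seq_len, dtype=torch.float32)     # (seq_len,)
    q_scale = zeta[None, :] ** positions[:, None]              # scale query at position m by zeta**m
    k_scale = zeta[None, :] ** (-positions[:, None])           # scale key at position n by zeta**(-n)
    # A dot product q_m . k_n then carries a factor zeta**(m - n), i.e. the
    # attention score decays smoothly with relative distance, which supports
    # training on short contexts while testing on longer ones.
    return q_scale, k_scale

q_scale, k_scale = xpos_like_scales(seq_len=16, head_dim=64)
print(q_scale.shape, k_scale.shape)  # torch.Size([16, 32]) torch.Size([16, 32])
```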

Some sample examples for different tasks mentioned in this paper are:

1. Visual Explanation: In this task, we are asking for an explanation of the image. In Figure 1, the input prompt (an image with a question) is provided to the model and the output is a visual explanation.

Figure 1 (Source)

2. Visual Question Answering: In this task, we are asking a question related to the image. In Figure 2, an input prompt (an image with a question) is provided to the model and the output is an answer.

Figure 2 (Source)

3. Web Page Question Answering: In this task, we are asking a question related to a web page image. In Figure 3, an input prompt (an image with a question) is provided to the model and the output is an answer.

Figure 3 (Source)

4. Number recognition: In this task, we are asking a question to extract the numbers in the image. In Figure 4, an input prompt (an image with a question) is provided to the model and the output is an answer.

Figure 4 (Source)

5. Image Captioning: In Figure 5, an input prompt (an image with text) is provided to the model and the output is an image caption.

Figure 5 (Source)

6. OCR: In Figure 6, the input prompt (an image with text) is provided to the model and the output is the text written in the image.

Figure 6 (Source)

7. Visual dialogue: In Figure 7, we can see the conversation based on the input image.

Figure 7 (Source)

8. IQ Test: In Figure 8, an input prompt (consisting of a flattened image matrix and a verbal instruction) is provided to the model. Each candidate image is appended to the prompt separately, and the model is queried whether it is correct. The final prediction is the candidate with the highest probability of being answered "yes"; a sketch of this scoring loop appears after Figure 8.

Figure 8 (Source)
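Below is a minimal sketch of the candidate-scoring procedure described above. The score_yes() callback stands in for KOSMOS-1 (which has no public API), and the prompt format reuses the hypothetical list-of-dicts layout shown earlier; both are assumptions for illustration only.

```python
def solve_iq_item(context_prompt, candidate_images, score_yes):
    """score_yes(prompt) -> probability that the model answers 'yes' (hypothetical callback)."""
    scores = []
    for candidate in candidate_images:
        # Append one candidate image and the verification question to the prompt.
        prompt = context_prompt + [
            {"type": "image", "path": candidate},
            {"type": "text", "content": "Is it correct? Answer:"},
        ]
        scores.append(score_yes(prompt))
    # The final prediction is the candidate with the highest probability of "yes".
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index, scores
```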

2.1. Model Details

The models are trained on web-scale multimodal corpora. The training datasets consist of text corpora, image-caption pairs, and interleaved data of images and text. KOSMOS-1 has about 1.6B parameters, which is considerably smaller than other large language models such as GPT-3 with 175B parameters.

Limitations: One of the main limitations of the model is the number of tokens it can process at once. When training or using a transformer-based language model, the input text is first tokenized and then converted into numerical representations (called "embeddings") that can be processed by the model.

The KOSMOS-1 model accepts at most 2,048 tokens (the smaller parts a sentence or text is divided into) per input, meaning a user can only provide a text or sentence of up to 2,048 tokens. This limits use cases that require a significantly larger number of words/tokens.
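As a quick illustration of what such a token budget means in practice, the snippet below checks whether a piece of text fits within a 2,048-token window. KOSMOS-1's tokenizer is not public, so the GPT-2 tokenizer from Hugging Face is used here purely as a stand-in; the exact token counts for KOSMOS-1 would differ.

```python
from transformers import AutoTokenizer

MAX_TOKENS = 2048  # context limit reported for KOSMOS-1

# GPT-2's tokenizer is used only to illustrate how a token budget is checked;
# actual token counts for KOSMOS-1 would differ.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def fits_in_context(text: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if the tokenized text fits in the model's context window."""
    return len(tokenizer.encode(text)) <= max_tokens

print(fits_in_context("KOSMOS-1 is a multimodal large language model."))  # True
```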

2.2. Evaluation Table

Evaluation of all tasks performed by KOSMOS-1, based on the results reported in the paper.

Refer to the paper for the task descriptions.

2.3. Other Commercial Ideas

Possible commercially viable ideas are listed below. More can be said about them once an API is released; additional research and development time would be required to assess feasibility and design details.

- Financial Document classification

KOSMOS-1 (OCR-free language understanding) could help classify financial documents in PDF or image formats directly, without using any OCR system to extract the text; a sketch of this idea follows this list.

- Entity extraction

By using KOSMOS-1, we could extract important entities by querying for a particular entity in the image. For example, it could be used to extract relevant or required data directly from invoices and financial documents.

- IT Support

KOSMOS-1 could help automate IT support by answering users' queries in real time. We have already seen an example above where a Windows 11 dialogue image was provided with the question, "How to restart the computer?"

- Image captioning

KOSMOS-1 can be used to generate captions or text descriptions for images.

- Sentiment classification

We can classify the sentiment of text in an image by using KOSMOS-1 (OCR-free language understanding).
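As a rough sketch of the document-classification idea above: assuming a hypothetical score(prompt, continuation) callback that returns the model's likelihood of a continuation (no such public API exists for KOSMOS-1), each candidate label could be scored directly against the document image, with no OCR step.

```python
# Hedged sketch of OCR-free document classification. The score() callback,
# label set, and prompt format are all invented for illustration; KOSMOS-1
# has no public API.
LABELS = ["invoice", "bank statement", "purchase order", "tax form"]

def classify_document(image_path, score):
    """score(prompt, continuation) -> likelihood of the continuation (hypothetical)."""
    prompt = [
        {"type": "image", "path": image_path},
        {"type": "text", "content": "Question: What type of financial document is this? Answer:"},
    ]
    # Pick the label the model considers the most likely continuation.
    return max(LABELS, key=lambda label: score(prompt, label))
```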

3. Conclusion

KOSMOS-1 has demonstrated its ability to perform various tasks such as image captioning, visual question answering, web page question answering, OCR-free language understanding, image classification with and without descriptions, chain-of-thought prompting, and cross-modal transfer. While some tasks, such as OCR-free language understanding, image captioning, chain-of-thought prompting, visual common-sense reasoning, and language tasks, have yielded promising results, others, such as the IQ test and web page question answering, are not satisfactory and require significant improvements. As KOSMOS-1 is compared on specific tasks against an LLM baseline and other models whose details have not been revealed, it is not clear what the evaluation is based on. Further analysis and development are necessary to enhance the performance of the system on these specific tasks. The model is also limited in the number of tokens (the smaller parts a sentence or text is divided into) it accepts as input.

It is important to note that while the results in the research paper may appear promising, a thorough evaluation of the demo/API (not available yet) is essential for a comprehensive understanding of KOSMOS-1’s capabilities.

About the author:
Hemant Vikani is a Data Scientist here at Version 1.
