What’s new in GPT-4: An Overview of the GPT-4 Architecture and Capabilities of Next-Generation AI

Amol Wagh
Mar 16, 2023


What is GPT-4, and what are its potential capabilities?

GPT-4 is a new large multimodal language model created by OpenAI that can accept image and text inputs and emit text outputs. It exhibits human-level performance on various professional and academic benchmarks.

GPT-4 uses a transformer-based neural network architecture. A transformer architecture allows for a better understanding of the relationships between words in text. It also uses an attention mechanism that lets the neural network work out which pieces of data are more relevant than others. Please refer to my other blog for details on the transformer architecture: Architecture of OpenAI ChatGPT
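To make the attention idea concrete, here is a minimal, illustrative sketch of scaled dot-product attention in PyTorch. The function name, tensor sizes, and toy input are my own assumptions for illustration; they are not taken from GPT-4, whose internals OpenAI has not published.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model); all shapes here are illustrative
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # similarity of every token to every other token
    weights = F.softmax(scores, dim=-1)            # how strongly each token attends to the others
    return weights @ v                             # weighted mix of value vectors

# Toy example: a "sentence" of 5 tokens with 64-dimensional embeddings
x = torch.randn(1, 5, 64)
out = scaled_dot_product_attention(x, x, x)  # self-attention
print(out.shape)  # torch.Size([1, 5, 64])
```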

What is Multimodality?

Multimodal technology refers to systems that can process and integrate multiple types of inputs and outputs, such as text, speech, image, video, gesture, etc. Multimodal systems can enable more natural and efficient human-computer interactions.

One example of a multimodal architecture is the Multimodal Architecture and Interfaces recommendation by the World Wide Web Consortium (W3C). It introduces a generic structure and a communication protocol that allow the modules in a multimodal system to communicate with each other. It also proposes an event-driven architecture as a general frame of reference, focused on control flow and data exchange.

Another example of a multimodal architecture is the one used by GPT-4. It consists of three main components: an encoder that transforms image and text inputs into vector representations; a decoder that generates text outputs from vector representations; and an attention mechanism that allows the encoder and decoder to focus on relevant parts of the inputs and outputs.

How Does Multimodality Work?

The pretraining data for GPT-4 is likely somewhat similar to that of KOSMOS-1, which was trained on text and images. KOSMOS-1 was trained on the datasets below:

Figure 1: Sources and contents of the KOSMOS-1 training datasets. [Ref: 2302.14045.pdf (arxiv.org)]

Visual examples from the Kosmos-1 paper show the model analyzing images and answering questions about them, reading text from an image, writing captions for images, and taking a visual IQ test with 22–26 percent accuracy. See the examples in Figure 1.1 and Figure 1.2.

Figure 1.1: Selected examples generated by KOSMOS-1. Blue boxes are input prompts and pink boxes are KOSMOS-1 outputs. The examples include (1)-(2) visual explanation, (3)-(4) visual question answering, (5) web page question answering, (6) a simple math equation, and (7)-(8) number recognition. [Source: Kosmos-1 paper]
Figure 1.2: Selected examples generated by KOSMOS-1. Blue boxes are input prompts and pink boxes are KOSMOS-1 outputs. The examples include (1)-(2) image captioning, (3)-(6) visual question answering, (7)-(8) OCR, and (9)-(11) visual dialogue. [Source: Kosmos-1 paper]

For training, each modality must be converted to a representation in the same embedding space. That means we need a sequence of same-length vectors generated from both text and images.
For text, this is straightforward, since the tokens are already discrete. In KOSMOS-1, each token is assigned an embedding learned during training, with the consequence that words of similar semantic meaning end up closer together in the embedding space. KOSMOS-1 handles images using the MetaLM approach, which provides a general-purpose interface supporting natural language interactions with other non-causal models. A pre-trained image encoder generates embeddings that are passed through a connector layer, which projects them to the same dimension as the text embeddings. KOSMOS-1 can then handle image embeddings while predicting text tokens (shown in Figure 2 below).
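As a rough sketch of that connector idea, a single linear projection is enough to bring image embeddings into the text embedding space. The Connector class and the dimensions below are my own illustration, not the actual KOSMOS-1 or GPT-4 code.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, for illustration only
IMAGE_ENC_DIM = 1024   # size of embeddings from the frozen image encoder
TEXT_EMBED_DIM = 2048  # size of the language model's token embeddings

class Connector(nn.Module):
    """Projects image-encoder outputs into the text embedding space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(IMAGE_ENC_DIM, TEXT_EMBED_DIM)

    def forward(self, image_embeddings):
        # image_embeddings: (batch, num_patches, IMAGE_ENC_DIM)
        return self.proj(image_embeddings)  # (batch, num_patches, TEXT_EMBED_DIM)

# Once projected, image embeddings can be interleaved with text token embeddings
text_emb = torch.randn(1, 6, TEXT_EMBED_DIM)              # a toy text prefix/suffix
img_emb = Connector()(torch.randn(1, 4, IMAGE_ENC_DIM))   # 4 toy image patch embeddings
sequence = torch.cat([text_emb[:, :3], img_emb, text_emb[:, 3:]], dim=1)
print(sequence.shape)  # torch.Size([1, 10, 2048]) -- one mixed text/image sequence
```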

As described in Figure 2 below, to achieve this the model must learn the relationship between text and images. Each image consists of multiple embeddings (positional locations 7–12 in Figure 2), which are passed through the transformer. During training, only the embedding predicted after seeing all the image embeddings (e.g. x9 in Figure 2) is used to calculate the loss. When predicting this token, the transformer can still attend to all the image embeddings, allowing the model to learn the relationship between text and images.

Figure 2: Interleaved text and image embeddings in the MetaLM-style setup. [Ref: research article 2206.06336.pdf (arxiv.org)]
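The sketch below illustrates this training detail with hypothetical tensor sizes and a made-up image span: the language-modelling loss is computed only where text tokens are predicted, while positions occupied by image embeddings are masked out.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 32000, 12                  # hypothetical sizes
logits = torch.randn(1, seq_len, vocab_size)     # transformer outputs at every position
targets = torch.randint(0, vocab_size, (1, seq_len))

# Only positions that predict a text token contribute to the loss;
# positions inside the image span get the ignore index.
IGNORE = -100
is_image_pos = torch.zeros(1, seq_len, dtype=torch.bool)
is_image_pos[:, 6:11] = True                     # a hypothetical image span
masked_targets = targets.masked_fill(is_image_pos, IGNORE)

loss = F.cross_entropy(logits.view(-1, vocab_size),
                       masked_targets.view(-1),
                       ignore_index=IGNORE)
print(loss.item())
```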

GPT-4 most likely uses a combination of a Vision Transformer (ViT) and a Flamingo-style visual language model for image processing. Let's look at the concepts behind this architecture.

The architecture used for the image encoder is a pre-trained Vision Transformer (ViT), which is common for image processing tasks. The ViT splits an image into a set of fixed-size 'patches' (typically via a single convolutional projection), as shown in Figure 3. These image patches are flattened and projected into a sequence of tokens, which are processed by the transformer to produce an output embedding. The ViT is encoder-only.
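As an illustration of the patch step (sizes are hypothetical, only loosely matching ViT-L/14), a single strided convolution turns an image into a sequence of patch tokens:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 224x224 RGB image, 14x14 patches, 768-dim patch embeddings
patch_size, embed_dim = 14, 768
patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)
patches = patchify(image)                    # (1, 768, 16, 16) -- one vector per patch
tokens = patches.flatten(2).transpose(1, 2)  # (1, 256, 768) -- a sequence of patch tokens
print(tokens.shape)
# These patch tokens (plus positional information) are what the ViT's
# transformer encoder processes to produce image embeddings.
```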

The approach used to train the ViT is the Contrastive Language-Image Pre-Training (CLIP) task [9]. Roughly speaking, images and text share an embedding space, and the model is trained so that matching image-text pairs have a high cosine similarity.
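To show what that objective looks like, here is a minimal CLIP-style contrastive loss with random stand-in embeddings; the batch size, embedding dimension, and temperature are hypothetical.

```python
import torch
import torch.nn.functional as F

# A toy batch of 8 matching image-text pairs, already embedded (random here)
image_emb = F.normalize(torch.randn(8, 512), dim=-1)  # unit-length image embeddings
text_emb = F.normalize(torch.randn(8, 512), dim=-1)   # unit-length text embeddings

logits = image_emb @ text_emb.T / 0.07  # cosine similarities, scaled by a temperature
labels = torch.arange(8)                # the i-th image matches the i-th caption

# Symmetric cross-entropy pulls matching pairs together and pushes the rest apart
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
print(loss.item())
```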

During KOSMOS-1 training, the ViT parameters are frozen, except for the last layer. The exact model is CLIP ViT-L/14. GPT-4 may use this approach as well. Alternatively, it’s not unreasonable that with enough data, the image encoder can be trained from scratch.

Figure 3: Vision Transformer (ViT) architecture. Image encoder for KOSMOS-1 (Image is split into patches and processed by the transformer) [Ref from published paper]

Flamingo takes a different approach to multimodal language modelling, and it could be a likely architecture for GPT-4. Flamingo also relies on a pre-trained image encoder, but instead uses the generated embeddings in cross-attention layers that are interleaved into a pre-trained language model.

Figure 4: Overview of the Flamingo model. The Flamingo models are a family of visual language models (VLMs) that can take as input visual data interleaved with text and produce free-form text as output. Key to their performance are novel architectural components and pretraining strategies. [Source: DeepMind]
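Here is a minimal sketch of that cross-attention idea, loosely following Flamingo's gated design. The class, dimensions, and simplifications are mine, and whether GPT-4 actually uses anything like this is speculation.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8  # hypothetical dimensions

class GatedCrossAttention(nn.Module):
    """Text hidden states attend to image embeddings; a learned gate starts at
    zero so the frozen language model is initially undisturbed."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh-gated residual, as in Flamingo

    def forward(self, text_hidden, image_emb):
        attended, _ = self.attn(query=text_hidden, key=image_emb, value=image_emb)
        return text_hidden + torch.tanh(self.gate) * attended

text_hidden = torch.randn(1, 10, d_model)  # hidden states from a frozen LM layer
image_emb = torch.randn(1, 64, d_model)    # embeddings from a frozen image encoder
out = GatedCrossAttention()(text_hidden, image_emb)
print(out.shape)  # torch.Size([1, 10, 512])
```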

Performance & Competencies Of GPT-4

The OpenAI team tested GPT-4 on a diverse set of benchmarks, including simulated exams that were originally designed for humans. They did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each such exam they ran a variant with those questions removed and reported the lower of the two scores. They consider the exam results to be representative. Exams were sourced from publicly available materials.

Exam questions included both multiple-choice and free-response questions; the OpenAI team designed separate prompts for each format, and images were included in the input for questions that required them. The evaluation setup was designed based on performance on a validation set of exams, and they report final results on held-out test exams. Overall scores were determined by combining multiple-choice and free-response question scores using publicly available methodologies for each exam.

Below are statistics from GPT-4's exam results, compared with the earlier GPT-3.5:

Table 1: GPT-3.5 & GPT-4 performance on academic & professional exams (Source: OpenAI GPT-4 Tech Report)
Figure 5: Graph of GPT-3.5 & GPT-4 performance on academic and professional exams. (Source: OpenAI GPT-4 Tech Report)

GPT-4 Visual Inputs Examples

Figure 6: Example prompt demonstrating GPT-4's visual input capability. The prompt consists of a question about an image, which GPT-4 is able to understand and answer [Source: OpenAI Tech Report]
Figure 7: Example prompt demonstrating GPT-4’s visual input capability. The prompt requires image understanding [Source: OpenAI Tech Report]
Figure 8: Example prompt demonstrating GPT-4’s visual input capability. The prompt consists of a question about an image with multiple panels which GPT-4 is able to answer [Source: OpenAI Tech Report]

What Could the Practical Use Cases Be?

There are many potential use cases for GPT-4: creating a functional website from a hand-drawn sketch, or transforming a sketch into an architecture diagram or model; analyzing medical images and scans to provide details about health or disease; identifying, classifying, and removing NSFW content in images; or powering a dating app's matchmaking, using profile data and preferences to determine whether a match is worth pursuing and even automating the follow-up process.

Recently, OpenAI partnered with several companies, including Be My Eyes, Stripe, Khan Academy, and Morgan Stanley, to test GPT-4's capabilities. Let's look at some of these use cases and see what GPT-4 can truly deliver.

Be My Eyes:
The Danish company Be My Eyes uses a GPT-4-powered 'Virtual Volunteer' within its app to help blind and low-vision people with their everyday activities. It allows them to read website content, navigate challenging real-world situations, and make well-informed decisions in the moment, much like a human volunteer would.

Source: Be My Eyes

Stripe
Stripe leverages GPT-4 to streamline the user experience and combat fraud. Like much of the financial industry, Stripe's support team had been using GPT-3 to improve the quality of its customer service. It is now using GPT-4 to scan websites and learn how businesses use the platform so that it can tailor its support to their needs. It can operate as a virtual assistant to developers, comprehending their inquiries, scanning technical material, summarizing solutions, and providing summaries of websites. Using GPT-4, Stripe can also monitor community forums such as Discord for signs of fraudulent activity and act on them quickly.

Source: OpenAI

Khan Academy
Khan Academy, a company that provides educational resources online, has begun utilizing GPT-4 features to power an artificially intelligent assistant called Khanmigo. In 2022, they started testing GPT-4 features; in 2023, the Khanmigo pilot program will be available to a select few. Initial assessments suggest that GPT-4 could help students learn specific topics of computer programming while also gaining a broader appreciation for the relevance of their study. In addition, Khan Academy is trying out different ways that teachers might use new GPT-4 features in the curriculum development process.

Source: OpenAI

Morgan Stanley
Morgan Stanley wealth management deploys GPT-4 to organize its vast knowledge base. A leader in wealth management, Morgan Stanley maintains a content library with hundreds of thousands of pages of knowledge and insights spanning investment strategies, market research and commentary, and analyst insights. This vast amount of information is housed across many internal sites, largely in PDF form, requiring advisors to scan through a great deal of information to find answers to specific questions. Such searches can be time-consuming and cumbersome. With the help of OpenAI’s GPT-4, Morgan Stanley is changing how its wealth management personnel locate relevant information. Starting last year, the company began exploring how to harness its intellectual capital with GPT’s embeddings and retrieval capabilities — first GPT-3 and now GPT-4. The model will power an internal-facing chatbot that performs a comprehensive search of wealth management content and effectively unlocks the cumulative knowledge of Morgan Stanley Wealth Management.

Source: OpenAI
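To make the embeddings-and-retrieval pattern mentioned above concrete, here is a toy sketch under my own assumptions. It is not Morgan Stanley's or OpenAI's actual pipeline, and the embed() function is just a random placeholder for a real embedding model.

```python
import numpy as np

# A toy, in-memory version of embedding-based retrieval. embed() is a random
# stand-in for a real embedding model; the document snippets are invented.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)  # unit-length vector

documents = [
    "Overview of municipal bond strategies",
    "Quarterly equity market commentary",
    "Analyst notes on the technology sector outlook",
]
doc_vectors = np.stack([embed(d) for d in documents])

query = "What is the outlook for tech stocks?"
scores = doc_vectors @ embed(query)         # cosine similarity (vectors are unit length)
best = documents[int(np.argmax(scores))]
# In a real system, the retrieved passage(s) would be handed to the chat model
# as context so it can answer the advisor's question from internal content.
print(best)
```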

How to access and use GPT-4?

Check out my other article for detailed steps on accessing GPT-4: How to access and use GPT-4: Comprehensive Guide

What’s the Future Scope?

Multimodal modelling can be extended further to generate images, audio, and video. This requires each signal to be discretized into tokens that can be converted back into a coherent signal; importantly, the quality of the reconstructed signal needs to be maintained. This could allow future GPT models to generate images (like DALL-E 2, Stable Diffusion, etc.), video (like Synthesia, D-id, InVideo, etc.), or music (like Jukebox, Elevate AI, etc.) from text or speech prompts. Questions from the user could even be answered with memes or GIFs. Moreover, you'd be able to have a conversation and ask the model to respond in a popular celebrity's, or anyone else's, voice.
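As a rough, hypothetical sketch of what 'discretizing a signal into tokens' can look like (this is a generic vector-quantization idea, not a description of any specific OpenAI model):

```python
import torch

# Map continuous signal frames to the nearest entry in a codebook, producing
# discrete tokens a language model could predict. The codebook here is random
# and purely illustrative; a real one would be learned.
codebook = torch.randn(1024, 64)    # 1024 possible tokens, each a 64-dim code
frames = torch.randn(50, 64)        # e.g. 50 encoded frames of audio or image data

distances = torch.cdist(frames, codebook)  # distance from every frame to every code
tokens = distances.argmin(dim=1)           # discrete token ids, shape (50,)

reconstruction = codebook[tokens]          # map token ids back to (approximate) frames
print(tokens[:10])
```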

Conclusion

Stay tuned and follow me to learn how accurate GPT-4 and ChatGPT are across different business use cases. With the release of GPT-4, it's clear that the AI world is on the brink of exciting new developments. With its potential, GPT-4 promises to revolutionize the way we communicate and express ourselves. With continued advancements in AI, we can expect even more powerful tools and technologies to emerge in the near future, changing the way we live and work in ways we can only imagine. The possibilities are truly endless, and the future of AI is looking brighter than ever before.

Many thanks for taking the time to read! I appreciate every clap and comment; follow me to get further updates.

*Note: Some of the content in this article was generated with ChatGPT 3.5 and the Bing search engine.
