Harnessing OpenAI’s GPTs for Multimodal AI in Text and Image Creation

Dec 7, 2023

In this post, I share the steps for crafting a customized GPT with OpenAI GPTs that can create both images and text.
By the way, it took me just 2 hours. Crazy, right?
One of the most exciting announcements made on the first OpenAI developer day, held on November 6, 2023, was the release of GPTs.
GPTs allow us to create custom versions of ChatGPT that combine instructions, extra knowledge, and any combination of skills.
These GPTs offer a quick and easy way to build a ChatGPT extension through a “no-code” platform, simplifying the development of complex multimodal chatbots.

First, let’s understand what a multimodal chatbot is.
Multimodal chatbots are advanced types of chatbots that can interact with users through multiple modes of communication. Unlike traditional chatbots that primarily rely on text input and output, multimodal chatbots can understand and respond using a variety of inputs and outputs, such as text, voice, images, and videos.

GPTs are a new way for anyone to create a tailored version of ChatGPT to be more helpful in their daily life, at specific tasks, at work, or at home — and then share that creation with others. For example, GPTs can help you learn the rules to any board game, help teach your kids math, or design stickers.

Creating one is as easy as starting a conversation, giving it instructions and extra knowledge, and picking what it can do, like searching the web, making images, or analyzing data.

The following capabilities are supported in GPTs —

DALL·E 3 — newest text-to-image model
GPT-4 with Vision — GPT-4 can accept images as inputs and generate captions, classifications, and analyses.
Web browsing
Code interpreter using Python sandbox for data analysis

Moreover, you can add your proprietary APIs that this GPT can use.

This concept is similar to what has stemmed from open-source projects like Agents in LangChain, a popular framework for building LLM applications.
See the Agents section in my previous blog post for more detail.
The main difference with GPTs is that you don’t need coding skills.
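
To make the comparison concrete, here is a minimal, library-free sketch of the agent idea: a set of tools plus something that picks one per question. It is not LangChain code and not how GPTs work internally. The tool functions and the keyword-based `choose_tool` are hypothetical stand-ins; in a real agent (or a GPT), the LLM itself makes that choice from the tool descriptions.

```python
from typing import Callable, Dict


# Hypothetical tools: each one only reports what it would do.
def lookup_country(query: str) -> str:
    return f"[country API] would look up: {query}"


def search_web(query: str) -> str:
    return f"[web search] would search for: {query}"


def draw_image(query: str) -> str:
    return f"[image model] would draw: {query}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup_country": lookup_country,
    "search_web": search_web,
    "draw_image": draw_image,
}


def choose_tool(question: str) -> str:
    """Keyword stand-in for the LLM's tool choice (an assumption, for illustration)."""
    text = question.lower()
    if "picture" in text or "image" in text:
        return "draw_image"
    if "last month" in text or "current" in text or "latest" in text:
        return "search_web"
    return "lookup_country"


def run(question: str) -> str:
    """Route the question to the chosen tool and return its output."""
    return TOOLS[choose_tool(question)](question)
```

With GPTs, the equivalent of `TOOLS` is assembled from the capabilities and actions selected in the builder, so none of this has to be written by hand.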

I used a publicly available REST API to gather information about countries, such as their flags, currencies, and borders. The customized ChatGPT then used this data to create personalized images. I also set it up to fetch current information from the internet.

You can find the public API of countries here
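
For readers who want to poke at the API outside of ChatGPT, here is a minimal sketch using only the Python standard library. It assumes the v2 base URL and the `/name/{name}` endpoint that appear in the spec later in this post; the helper names `country_url` and `fetch_country` are made up for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from the spec later in this post.
BASE_URL = "https://restcountries.com/v2"


def country_url(name: str) -> str:
    """Build the lookup-by-name URL, percent-encoding the country name."""
    return f"{BASE_URL}/name/{quote(name)}"


def fetch_country(name: str) -> dict:
    """Return the first match for `name`; the endpoint responds with a JSON array."""
    with urlopen(country_url(name), timeout=10) as response:
        return json.load(response)[0]


# Example usage (performs a real network request):
# country = fetch_country("brazil")
# print(country["flag"], country["currencies"], country["borders"])
```

The fields read in the commented usage lines (`flag`, `currencies`, `borders`) are the ones this post relies on.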

Creating a custom ChatGPT has several steps:

1) Configuration using GPT Builder

My goal was to ask questions about a country, including its flag, and to have ChatGPT generate an image representing the country’s characteristics, including its flag.
I also wanted it to be able to search the web, so I asked it to provide current data, such as population statistics.
After a short chat with the GPT Builder bot (under the Create tab) describing my purpose, it produced the following parameters for my customized ChatGPT:

Name — Geo Explorer
Description — Assists in leveraging REST APIs for country data and visualizing it.
Instructions — Geo Explorer is designed to assist users in creating applications that use public REST APIs for detailed country data, including flags, currency, and borders. It generates images representing a country’s characteristics, incorporating its flag, and provides current data such as population statistics. Geo Explorer guides users in accessing and using these APIs effectively, offering both detailed explanations and concise answers based on user preference. It emphasizes accuracy and up-to-date information, avoiding speculative data. Clarification is provided as needed to understand specific user requirements. Responses are tailored to be both informative and visually engaging, suitable for inquiries about various countries.
Chosen Capabilities — Web Browsing and DALL·E Image Generation

2) Create actions

I needed to enter an OpenAPI schema, but this API doesn’t provide one.
I worked around this by asking ChatGPT to generate one, giving it the relevant code from here as context.

So now I can provide the following spec file:

openapi: 3.0.0
info:
  title: RestCountry API v2
  version: 1.0.0
  description: API for retrieving information about countries by name
servers:
  - url: https://restcountries.com/v2
paths:
  /name/{name}:
    get:
      summary: Get countries by name
      operationId: getCountriesByName
      parameters:
        - name: name
          in: path
          required: true
          schema:
            type: string
          description: Name of the country
        - name: filters
          in: query
          required: false
          schema:
            type: array
            items:
              type: string
          description: Fields to filter the output of the request
      responses:
        '200':
          description: An array of countries
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: '#/components/schemas/Country'
        '404':
          description: Country not found
components:
  schemas:
    Country:
      type: object
      properties:
        name:
          type: string
        topLevelDomain:
          type: array
          items:
            type: string
        alpha2Code:
          type: string
        alpha3Code:
          type: string
        currencies:
          type: array
          items:
            type: object
            properties:
              code:
                type: string
              name:
                type: string
              symbol:
                type: string
        capital:
          type: string
        callingCodes:
          type: array
          items:
            type: string
        altSpellings:
          type: array
          items:
            type: string
        region:
          type: string
        subregion:
          type: string
        population:
          type: integer
        latlng:
          type: array
          items:
            type: number
        demonym:
          type: string
        area:
          type: number
        gini:
          type: number
        timezones:
          type: array
          items:
            type: string
        borders:
          type: array
          items:
            type: string
        nativeName:
          type: string
        numericCode:
          type: string
        languages:
          type: array
          items:
            type: object
            properties:
              iso639_1:
                type: string
              iso639_2:
                type: string
              name:
                type: string
              nativeName:
                type: string
        flag:
          type: string
        regionalBlocs:
          type: array
          items:
            type: object
            properties:
              acronym:
                type: string
              name:
                type: string
        cioc:
          type: string
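
To show how a client might consume a response shaped like the `Country` schema above, here is a small sketch that maps a subset of its fields onto Python dataclasses. The class and function names are hypothetical, only a handful of the schema’s properties are covered, and the sample record is made-up placeholder data rather than real API output.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Currency:
    code: str
    name: str
    symbol: str


@dataclass
class Country:
    name: str
    capital: str
    population: int
    flag: str
    borders: List[str] = field(default_factory=list)
    currencies: List[Currency] = field(default_factory=list)


def parse_country(raw: dict) -> Country:
    """Map one JSON object from the response array onto the dataclasses.

    Missing optional fields fall back to empty values instead of raising.
    """
    return Country(
        name=raw["name"],
        capital=raw.get("capital", ""),
        population=raw.get("population", 0),
        flag=raw.get("flag", ""),
        borders=list(raw.get("borders", [])),
        currencies=[
            Currency(c.get("code", ""), c.get("name", ""), c.get("symbol", ""))
            for c in raw.get("currencies", [])
        ],
    )


# Placeholder record for illustration only (not real API output).
sample = {
    "name": "Exampleland",
    "capital": "Sample City",
    "population": 1000,
    "flag": "https://example.com/flag.svg",
    "borders": ["AAA", "BBB"],
    "currencies": [{"code": "EXD", "name": "Example dollar", "symbol": "$"}],
}
country = parse_country(sample)
print(country.name, country.borders, country.currencies[0].code)
```

The GPT itself never needs this code. The schema plays the same role for the model that the dataclasses play here: it tells the consumer which fields to expect and what type each one has.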

I did not add any authentication method since this is a POC.
If you are adding a REST API for corporate purposes, I suggest adding one. Moreover, Enterprise customers will be able to deploy internal-only GPTs via ChatGPT Enterprise.

3) Adding Knowledge

I didn’t need to enrich ChatGPT with proprietary data, so I didn’t utilize the feature of uploading files to the model.

Now we are ready to go since the action is set up :)

Testing Geo Explorer GPT

The first question is: which countries border Brazil?

We see that a request is sent to the API in the correct format, and we get an answer.
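
As a rough sketch of the step between the raw API response and a readable reply, the function below turns the `borders` array of one country record into a sentence. The function name and the sample record are hypothetical. In the GPT, the model does this phrasing itself, and it can also translate the border codes into country names.

```python
def describe_borders(country: dict) -> str:
    """Turn the `borders` array of one country record into a sentence."""
    name = country.get("name", "This country")
    borders = country.get("borders", [])
    if not borders:
        return f"{name} has no land borders listed."
    return f"{name} borders {len(borders)} countries: {', '.join(borders)}."


# Placeholder record for illustration only (not real API output).
record = {"name": "Exampleland", "borders": ["AAA", "BBB"]}
print(describe_borders(record))
```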

Second question — create a picture of Brazil

We can see that this customized ChatGPT followed our guidelines and generated an image representing the country’s characteristics, incorporating its flag.

Third question — how many tourists visited Germany in the last month?

We see that this time, the tool the LLM invoked was Bing search, giving us answers based on current data.

Lastly, I asked for a picture of Japan —

Summary

The new GPTs product is convenient and easy to use.
It allows us to enrich the LLM with data from proprietary APIs or feed it additional files using the knowledge mechanism.
All of this is done without writing a single line of code, which is fantastic.
In my next blog post, I will cover how to create a GenAI application with functionality similar to the one I crafted here, using Agents and tools from the LangChain framework.

Reference

OpenAI GPTs Documentation