Harnessing OpenAI’s GPTs for Multimodal AI in Text and Image Creation

Nir Bar
6 min read · Dec 7, 2023

In this post, I will share the steps for crafting a customized GPT with OpenAI GPTs that can create both images and text.
And by the way, it took me just 2 hours. Crazy, right?
One of the most exciting announcements made on the first OpenAI developer day, held on November 6, 2023, was the release of GPTs.
GPTs allow us to create custom versions of ChatGPT that combine instructions, extra knowledge, and any combination of skills.
These GPTs offer a quick and easy way to build a ChatGPT extension through a “no-code” platform, simplifying the development of complex multimodal chatbots.

First, let’s understand what a multimodal chatbot is.
Multimodal chatbots are advanced types of chatbots that can interact with users through multiple modes of communication. Unlike traditional chatbots that primarily rely on text input and output, multimodal chatbots can understand and respond using a variety of inputs and outputs, such as text, voice, images, and videos.

GPTs are a new way for anyone to create a tailored version of ChatGPT to be more helpful in their daily life, at specific tasks, at work, or at home — and then share that creation with others. For example, GPTs can help you learn the rules to any board game, help teach your kids math, or design stickers.

Creating one is as easy as starting a conversation, giving it instructions and extra knowledge, and picking what it can do, like searching the web, making images, or analyzing data.

GPTs support the following capabilities:

  • DALL·E 3 — newest text-to-image model
  • GPT-4 with Vision — GPT-4 can accept images as inputs and generate captions, classifications, and analyses.
  • Web browsing
  • Code interpreter using Python sandbox for data analysis

Moreover, you can connect your own proprietary APIs for the GPT to use.

The concept is similar to what open-source projects have already introduced, such as Agents in LangChain, a popular framework for building LLM applications.
For more detail, see the Agents section of my previous blog post.
The main difference with GPTs is that you don’t need coding skills.

I used a publicly available REST API to gather information about countries, such as their flags, currencies, and borders. The customized ChatGPT then used this data to create personalized images. I also set it up to fetch current information from the internet.

The public countries API is served at https://restcountries.com/v2, the server URL used in the spec file further down.

Creating a custom ChatGPT has several steps:

1) Configuration using GPT Builder

My goal was to ask questions about a country, including its flag, and to have ChatGPT generate an image representing the country’s characteristics, flag included.
I also wanted it to search the web so that it could provide current data, such as population statistics.
After a short chat with the GPT Builder bot (under the Create tab), in which I described my purpose, it created the following parameters for my custom ChatGPT:

  • Name — Geo Explorer
  • Description — Assists in leveraging REST APIs for country data and visualizing it.
  • Instructions — Geo Explorer is designed to assist users in creating applications that use public REST APIs for detailed country data, including flags, currency, and borders. It generates images representing a country’s characteristics, incorporating its flag, and provides current data such as population statistics. Geo Explorer guides users in accessing and using these APIs effectively, offering both detailed explanations and concise answers based on user preference. It emphasizes accuracy and up-to-date information, avoiding speculative data. Clarification is provided as needed to understand specific user requirements. Responses are tailored to be both informative and visually engaging, suitable for inquiries about various countries.
  • Chosen Capabilities — Web Browsing and DALL·E Image Generation

2) Create actions

An action requires an OpenAPI schema, but this API doesn’t publish one.
I worked around this by asking ChatGPT to generate one, giving it the relevant code of the API as context.

That gave me the following spec file:

openapi: 3.0.0
info:
  title: RestCountry API v2
  version: 1.0.0
  description: API for retrieving information about countries by name
servers:
  - url: https://restcountries.com/v2
paths:
  /name/{name}:
    get:
      summary: Get countries by name
      operationId: getCountriesByName
      parameters:
        - name: name
          in: path
          required: true
          schema:
            type: string
          description: Name of the country
        - name: filters
          in: query
          required: false
          schema:
            type: array
            items:
              type: string
          description: Fields to filter the output of the request
      responses:
        200:
          description: An array of countries
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: '#/components/schemas/Country'
        404:
          description: Country not found
components:
  schemas:
    Country:
      type: object
      properties:
        name:
          type: string
        topLevelDomain:
          type: array
          items:
            type: string
        alpha2Code:
          type: string
        alpha3Code:
          type: string
        currencies:
          type: array
          items:
            type: object
            properties:
              code:
                type: string
              name:
                type: string
              symbol:
                type: string
        capital:
          type: string
        callingCodes:
          type: array
          items:
            type: string
        altSpellings:
          type: array
          items:
            type: string
        region:
          type: string
        subregion:
          type: string
        population:
          type: integer
        latlng:
          type: array
          items:
            type: number
        demonym:
          type: string
        area:
          type: number
        gini:
          type: number
        timezones:
          type: array
          items:
            type: string
        borders:
          type: array
          items:
            type: string
        nativeName:
          type: string
        numericCode:
          type: string
        languages:
          type: array
          items:
            type: object
            properties:
              iso639_1:
                type: string
              iso639_2:
                type: string
              name:
                type: string
              nativeName:
                type: string
        flag:
          type: string
        regionalBlocs:
          type: array
          items:
            type: object
            properties:
              acronym:
                type: string
              name:
                type: string
        cioc:
          type: string
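
To make the spec a bit more concrete, here is a minimal Python sketch of the URL that the getCountriesByName operation describes. The helper name build_country_url is hypothetical and only for illustration. The repeated filters parameter follows OpenAPI 3.0's default serialization for query arrays; whether the live API actually honors that parameter is an assumption the spec alone does not confirm.

```python
from urllib.parse import quote, urlencode

# Server URL and path template copied from the spec file.
BASE_URL = "https://restcountries.com/v2"
PATH_TEMPLATE = "/name/{name}"


def build_country_url(name, filters=None):
    """Build the GET URL for the getCountriesByName operation.

    Hypothetical helper, written for illustration only.
    """
    # Percent-encode the path parameter so names with spaces stay valid.
    url = BASE_URL + PATH_TEMPLATE.format(name=quote(name, safe=""))
    if filters:
        # OpenAPI 3.0 default for query arrays (style=form, explode=true):
        # the parameter name is repeated once per value.
        url += "?" + urlencode([("filters", f) for f in filters])
    return url


if __name__ == "__main__":
    print(build_country_url("brazil"))
    print(build_country_url("south africa", ["name", "borders"]))
```

This is roughly the request the GPT assembles on your behalf when it decides to call the action; with GPTs you never write it yourself.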

I did not add any authentication method since this is a POC.
If you are connecting a REST API for corporate use, I suggest adding one. Moreover, Enterprise customers will be able to deploy internal-only GPTs via ChatGPT Enterprise.
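
As a rough illustration of what authentication looks like on the wire, the sketch below attaches an API key to a request as a bearer token. The URL, the key value, and the helper name are all placeholders, and the Bearer scheme is an assumption that depends on what the protected API expects; the countries API used in this post needs no key at all.

```python
from urllib.request import Request


def build_authenticated_request(url, api_key):
    """Return a GET request carrying the key in an Authorization header.

    Hypothetical helper: the Bearer scheme is an assumption, not a
    requirement of any specific API.
    """
    headers = {"Authorization": f"Bearer {api_key}"}
    return Request(url, headers=headers, method="GET")


if __name__ == "__main__":
    # Placeholder URL and key, not a real endpoint or credential.
    req = build_authenticated_request(
        "https://api.example.com/v1/items", "MY_API_KEY"
    )
    print(req.get_method(), req.full_url)
    print(req.get_header("Authorization"))
```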

3) Adding Knowledge

I didn’t need to enrich ChatGPT with proprietary data, so I didn’t utilize the feature of uploading files to the model.

Now we are ready to go since the action is set up :)

Testing Geo Explorer GPT

First question: which countries border Brazil?

We see that a request is sent to the API in the correct format, and we get an answer.
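
For readers who want to reproduce that lookup outside ChatGPT, here is a minimal sketch of the same call in Python. The function names are hypothetical, the borders field comes from the Country schema earlier in the post, and the sample payload is made up for illustration rather than copied from the API.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def extract_borders(countries):
    """Return the `borders` list of the first country in an API response."""
    if not countries:
        return []
    # Default to an empty list in case a response omits `borders`.
    return countries[0].get("borders", [])


def fetch_borders(name, timeout=10):
    """Call the public endpoint from the spec and return border codes."""
    url = "https://restcountries.com/v2/name/" + quote(name, safe="")
    with urlopen(url, timeout=timeout) as response:
        return extract_borders(json.load(response))


if __name__ == "__main__":
    # Made-up payload in the shape of the Country schema, so the parsing
    # logic can be tried without network access.
    sample = json.loads('[{"name": "Exampleland", "borders": ["AAA", "BBB"]}]')
    print(extract_borders(sample))
```

Calling fetch_borders("brazil") would send the real request. The GPT goes one step further and turns the raw response into a readable answer.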

Second question: create a picture of Brazil.

We can see that this custom ChatGPT followed our guidelines and generated an image representing the country’s characteristics, incorporating its flag.

Third question: how many tourists visited Germany in the last month?

This time, the LLM invoked the Bing search tool, which gave us an answer based on current data.

Lastly, I asked for a picture of Japan.

Summary

GPTs are a convenient, easy-to-use new product.
They let us enrich the LLM with data from proprietary APIs or feed it additional files through the knowledge mechanism.
All of this is done without writing a single line of code, which is fantastic.
In my next blog post, I will cover how to create a GenAI application with functionality similar to the one I crafted here, using Agents and tools in the LangChain framework.

Reference

OpenAI GPTs Documentation

Nir Bar

Senior Software Engineer at CyberArk, love using technology to solve complex problems. Deeply passionate about the GenAI revolution