I enabled ChatGPT to “see” images and made it play Dixit with my friends

To celebrate the week of Bing’s integration with ChatGPT, I built an AI bot based on GPT-3 and BLIP-2 to play Dixit and gathered some friends and co-workers to play against it.

Kha Vu Chan
10 min read · Feb 13, 2023

Have you ever played Dixit? It’s a popular party game that features a deck of cards with surreal and imaginative illustrations. During each turn, a player takes on the role of “storyteller” and provides a brief, imaginative clue to describe the card they have selected. The other players then select a card from their hand that they believe most closely matches the description, with the aim of deceiving others into thinking it was the storyteller’s card. Finally, players must guess which card the storyteller chose.

This game requires players to use their creativity and imagination. Sounds like something that AI excels at, right? Let’s take a look at one of its plays:

A picture of four Dixit cards on the table.
A real round of Dixit: three human players against GPT-3. Wow! The AI completely dominated this round by achieving the highest score possible: it correctly guessed the storyteller’s card, and all the other players preferred the card that GPT-3 chose over the storyteller’s card!

During this round, a human storyteller selected a card depicting a large monster dropping a child into a maze and gave the hint “you can do it.” Upon the reveal of the cards… Wow! What just happened?! The bot accurately guessed the storyteller’s card, while the other players mistook the card the AI had chosen (an old man fishing with no success, despite an abundance of fish in the pond) for the storyteller’s card. This resulted in a perfect score for the round, showcasing the AI’s complete domination over the humans (for a brief moment in time)!

For those curious, I served as the AI’s human assistant, responsible for physically carrying out actions such as picking up, laying out, and capturing photos of the cards. The AI’s creator became its first subordinate! Oh, the irony…

Okay, it’s quite good at choosing a matching card for the given clue, but can it generate creative and imaginative captions for the cards? Hell yeah it can! Here are a few examples:

Here are a few random examples of captions generated by GPT-3 for Dixit cards during a real game. My personal favorite is the first one, “Being Heard Matters.” The final one sounds quite intriguing — a vampire embraces its transformation into… a garlic.

Just a year ago, this level of creativity and imaginative thinking was unheard of. It operates just like a genuine human! Even more impressive, you can prompt it with various personalities — a movie buff, an anime nerd, a programmer, a musician, and have it generate themed clues! The potential for enhancements and modifications is limitless.

If you want to see more examples from a real game with real people (and with real pizza 🍕🍕🍕), scroll down to the end of this blog post 😉.

In this article, you may notice that I refer to “ChatGPT” (which is sometimes also called “GPT-3.5”) and “GPT-3” interchangeably. It’s just clickbait 😛 “ChatGPT” has become a popular catchphrase recently, and using the term can attract more attention and increase the ranking of the article (oh yeah, I’ll sell my soul for likes and upvotes). All examples here were generated using the public model text-davinci-003.

Inner workings. Spelled out.

A reader in February 2023 may feel puzzled: large language models like GPT-3 cannot see images (yet), so how could one possibly play Dixit? Easy! Just give it access to image captioning and VQA models! To make the whole pipeline more robust, we can even use multiple models at the same time.

Here are the models that I used to build this Dixit AI:

  • text-davinci-003 — The most capable OpenAI GPT-3 model available to the public to date, serving as the primary “brain” behind our Dixit AI. The outputs of all other models will be fed into it.
  • BLIP-2 — A recently published state-of-the-art image captioning and visual question-answering model.
  • GIT large — Microsoft’s previous state-of-the-art image captioning model, published just a month prior.
  • BLIP-1 large — Despite being published a year ago, it still produces reliable results for images with easily recognizable objects.
  • CLIP — While it’s a relatively outdated visual-language embedding model, it still plays a small role in my pipeline (providing image–clue similarity scores).
  • Azure object detection service — I have included this model as it can sometimes be useful for object recognition.

All these models are seamlessly integrated by LangChain, an incredible library for prompting LLMs and augmenting them with other tools — in this case, the visual-language models that enable GPT-3 to “see” the world.
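
To make this concrete, here is a minimal sketch (not my exact code) of how a visual-language model can be exposed to GPT-3 as a LangChain tool; the helper blip2_answer and the prompt wording are placeholders of my own:

```python
# Minimal sketch: wiring a VQA model into GPT-3 via a LangChain tool.
# `blip2_answer` is a stand-in for whatever wrapper you put around BLIP-2.
from langchain.llms import OpenAI
from langchain.agents import Tool, initialize_agent

def blip2_answer(question: str) -> str:
    """Placeholder: run BLIP-2 VQA on the current card image and return its answer."""
    return "a short answer from BLIP-2 about the card"

tools = [
    Tool(
        name="CardVQA",
        func=blip2_answer,
        description="Ask a short, concrete question about the current Dixit card image.",
    )
]

llm = OpenAI(model_name="text-davinci-003", temperature=0.7)
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)
agent.run("Describe this Dixit card in as much visual detail as possible.")
```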

Generate a detailed image description

Here is the complete process for generating detailed image descriptions using GPT-3, BLIP-2, and other Visual-Language models:

Full schema of how I made GPT-3 generate a detailed description for a Dixit card, using visual-language models as tools. Nothing fancy, just some engineering and trial and error.
  • Step 1: Generate Captions with BLIP-2, GIT-large, BLIP-1, and Azure Object Detection Service. Combine all descriptions into one, then label GIT-large and BLIP-1 as “Less Trustworthy” and BLIP-2 as “Highly Trustworthy.” This gives GPT-3 a granular differentiation between reliable and unreliable sources.
  • Step 2: Prompt GPT-3 to consider what details in the image it wants to clarify while providing context about the Dixit game to prevent it from asking unhelpful questions.
  • Step 3: Let GPT-3 talk with BLIP-2 😁. Observing the conversation between the blind but intelligent GPT-3 and the sighted but naive BLIP-2 as they make sense of the Dixit card is the most entertaining aspect of the pipeline.
  • Step 4: Provide all chat history to GPT-3 and have it generate a detailed description of the Dixit card. It may hallucinate additional details, but in the context of the Dixit game, this is not a bug but a desirable feature.
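
To give a flavor of Steps 1 and 4, here is a rough sketch using the old-style openai completion API; the captions and prompt wording below are simplified placeholders, not my verbatim prompts:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

# Step 1 (placeholders): captions from the different models, labeled by trustworthiness.
captions = {
    "BLIP-2 (highly trustworthy)": "a monster holding a small child above a maze",
    "GIT-large (less trustworthy)": "a statue of a man in a garden",
    "BLIP-1 (less trustworthy)": "a giant holding a baby",
    "Azure object detection (less trustworthy)": "person, outdoor, toy",
}
caption_block = "\n".join(f"- {source}: {text}" for source, text in captions.items())

# Step 4 (simplified): ask GPT-3 to merge everything into one detailed description.
prompt = (
    "You are describing a card from the board game Dixit. Several image-captioning "
    "models looked at the card; some are more trustworthy than others.\n\n"
    f"{caption_block}\n\n"
    "Write a single detailed description of the card, preferring details that the "
    "trustworthy model and multiple models agree on."
)
response = openai.Completion.create(
    model="text-davinci-003", prompt=prompt, max_tokens=200, temperature=0.7
)
print(response["choices"][0]["text"].strip())
```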

Why not just let BLIP-2 generate all the descriptions directly, without the GPT-3 interpretation step? I tried, but usually the result is either too short (I tried forcing a larger min_length, a larger repetition_penalty, beam search, and other parameters, to no avail), contains mistakes (especially on more abstract cards), or is simply too dry (GPT-3 can hallucinate the “mood” of the image, which is quite handy in Dixit). I would love to hear your suggestions.
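
For reference, this is roughly what those attempts looked like with the Hugging Face transformers integration of BLIP-2; the checkpoint name and generation parameters are just values I experimented with, not a recommended configuration:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.float16
).to("cuda")

image = Image.open("card.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)

# Trying to force longer, less repetitive captions; in practice this didn't help much.
out = model.generate(
    **inputs,
    min_length=30,
    max_new_tokens=60,
    num_beams=5,
    repetition_penalty=1.5,
)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```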

One might also question the use of older image captioning models when BLIP-2 is light-years ahead of everything else. It is because I’m implicitly using the concept of “consistency”: by prompting GPT-3 to be skeptical of individual models, we ensure that if all models agree on something, it’s more trustworthy. Dixit is a visually challenging game, and even BLIP-2 doesn’t always provide accurate descriptions, so relying on multiple models helps to cover its weaknesses.

Generate a creative clue for a card

In the storyteller’s round, the bot has to select a card and describe it with an imaginative clue. To make the AI generate a creative description for a given image, I prompt the GPT-3 model with different personalities: I provide it with the detailed image description generated in the previous step and encourage it to explicitly explain what the image reminds it of.

Prompting GPT-3 with different personalities to generate a short clue for a given image. The generic personality usually provides the most plausible clues, but other personalities can sometimes yield interesting results.

Finally, after a long chain of prompts, I ask GPT-3 to summarize its inner-thoughts scratchpad into a single short clue.
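
Stripped down to its essence, the personality prompting looks roughly like this; the persona descriptions and prompt wording are illustrative rather than my actual prompts:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

PERSONAS = {
    "generic": "a creative Dixit player",
    "movie buff": "a Dixit player who relates everything to famous movies",
    "anime nerd": "a Dixit player who thinks in anime and manga tropes",
}

def generate_clue(card_description: str, persona: str = "generic") -> str:
    # First, let the model think out loud about associations (the scratchpad).
    scratchpad = openai.Completion.create(
        model="text-davinci-003",
        prompt=(
            f"You are {PERSONAS[persona]}. Here is a detailed description of your card:\n"
            f"{card_description}\n\n"
            "Think step by step about what this image reminds you of: moods, stories, "
            "idioms, references."
        ),
        max_tokens=250,
        temperature=0.9,
    )["choices"][0]["text"]

    # Then compress the scratchpad into one short, imaginative clue.
    clue = openai.Completion.create(
        model="text-davinci-003",
        prompt=(
            f"Your thoughts about a Dixit card:\n{scratchpad}\n\n"
            "Summarize them into one short, imaginative Dixit clue (a few words):"
        ),
        max_tokens=20,
        temperature=0.9,
    )["choices"][0]["text"]
    return clue.strip()
```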

Guess the card from a given clue

The most complicated prompt chain of the Dixit AI is in the card-guessing stage. For each card, whether from the bot’s hand or other players’ piles, I generate a comprehensive explanation as to why the card may be related to the given clue using the following process:

How I generate a detailed explanation of why an image might be related to the given clue. It involves quite a bit of talking between GPT-3 and BLIP-2. CLIP embeddings help with more visual clues.
  • First, I create a detailed generic image description through the pipeline outlined in the previous section. This description does not use the clue as a prior, so it can be generated before the game.
  • Next, I prompt GPT-3 to explain how this image may be related to the given clue and permit it to interact with BLIP-2 if it needs to clear up any details. Yes, the two AI models are talking to each other again!
  • I then concatenate the detailed explanation from the previous step with the cosine similarity score of the image–clue CLIP embeddings into a tuple (a minimal sketch of the CLIP part follows below).
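
The CLIP part of that tuple is just the cosine similarity between the card image and the clue text, along these lines (the checkpoint name here is the standard public one, not necessarily the exact one I used):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image_path: str, clue: str) -> float:
    """Cosine similarity between a card image and a clue text in CLIP embedding space."""
    image = Image.open(image_path).convert("RGB")
    inputs = clip_processor(text=[clue], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = clip.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item()
```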

The inclusion of the cosine similarity score of the CLIP embeddings in the final explanation is grounded in the observation that players may approach Dixit in different ways: sometimes they give clues based on a logical chain of abstractions, while other times they provide purely visual descriptions. That’s why I thought the bot would need both chained reasoning and visual-language similarity.

Finally, I politely ask GPT-3 to find the most plausible explanation, taking into consideration the Softmax probabilities of CLIP similarity scores with lowered temperature, and choose the best card that fits the clue (the schema below also shows a situation from an actual game):

To choose a card that matches a given clue, I generate explanations for each card using the previous schema and then kindly ask GPT-3 to compare them and decide which explanation is the most logical and plausible.
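
As for the softmax over CLIP scores, a lowered temperature simply sharpens the distribution so that the visually closest card stands out; a minimal version (the temperature value here is an arbitrary example):

```python
import numpy as np

def softmax_with_temperature(scores, temperature: float = 0.05):
    """Turn raw CLIP similarity scores into sharpened probabilities."""
    scores = np.asarray(scores, dtype=np.float64) / temperature
    scores -= scores.max()  # subtract the max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

# Raw cosine similarities of each candidate card against the clue (made-up numbers).
print(softmax_with_temperature([0.21, 0.27, 0.19, 0.24]).round(3))
# -> [0.147 0.487 0.098 0.267]
```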

The prompts in reality are much longer than in the illustrations above. Creating an appropriate prompt that strikes the right balance between being too general and too specific involves a significant amount of experimentation and refinement.

Wrapping everything into a Telegram bot

The only reason I make small projects like this is to have fun with people. I needed some sort of front end to smoothly interact with this AI from my phone, while the main script is hosted on an Azure instance with an Nvidia A100 GPU. Telegram was a natural choice.

Detect Dixit cards on a photo

Taking a photo of each card during a real game is too slow (I don’t want my friends to wait forever), so I wrote an OpenCV script to find multiple Dixit cards in a single photo:

Example of how the OpenCV-based card detector works in three stages: filtering out regions of high contrast, applying binarization, finding edges and lines, and then detecting rectangular-shaped objects. In the last (third) image, you can see that all cards were correctly detected.

This detector was written purely with OpenCV. First, I used Bilateral Filtering to eliminate potential background textures (e.g. a wooden table) while preserving the edges. Next, I applied Adaptive Thresholding to highlight high-contrast regions and used the Canny detector to identify edges. After several morphology operations to reduce noise and reinforce the more important regions, I employed the Probabilistic Hough Transform and identified the contour hierarchy in the processed image. Finally, for each top-level contour, I fitted a polygonal hull and determined whether it was a quadrilateral. If necessary, I also performed perspective transformations on the detected cards. In the age of Deep Learning, traditional Computer Vision still matters!
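
A condensed sketch of that pipeline is below; the thresholds and kernel sizes are illustrative, and the Hough-transform step and proper corner ordering are omitted for brevity:

```python
import cv2
import numpy as np

def detect_cards(image_bgr: np.ndarray) -> list:
    """Return warped, top-down crops of rectangular cards found in a photo."""
    # 1. Smooth away background texture (e.g. a wooden table) while keeping edges.
    smooth = cv2.bilateralFilter(image_bgr, 9, 75, 75)
    gray = cv2.cvtColor(smooth, cv2.COLOR_BGR2GRAY)

    # 2. Highlight high-contrast regions, then find edges.
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 21, 5
    )
    edges = cv2.Canny(binary, 50, 150)

    # 3. Morphology to close gaps in the card borders.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel, iterations=2)

    # 4. Keep only large top-level contours that look like quadrilaterals.
    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cards = []
    for cnt in contours:
        if cv2.contourArea(cnt) < 10_000:  # ignore small noise blobs
            continue
        hull = cv2.convexHull(cnt)
        approx = cv2.approxPolyDP(hull, 0.02 * cv2.arcLength(hull, True), True)
        if len(approx) != 4:
            continue
        # 5. Perspective-warp each quadrilateral to a canonical card size
        #    (in real code the four corners should be ordered consistently first).
        src = approx.reshape(4, 2).astype(np.float32)
        dst = np.float32([[0, 0], [0, 499], [349, 499], [349, 0]])
        warp = cv2.getPerspectiveTransform(src, dst)
        cards.append(cv2.warpPerspective(image_bgr, warp, (350, 500)))
    return cards
```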

Yes, I could have made this detection pipeline simpler. But I needed this detector to be absolutely bulletproof and work under any lighting conditions, so I can gather my friends and have a smooth board game night.

Telegram bot interface

The following screenshots were taken during a real Dixit game:

Screenshot of the telegram bot, taken during a real Dixit game with my friends and co-workers. The UX is simple and flexible enough to let us have a smooth game.

The Telegram bot interface is quite minimalistic, as you can see from the screenshots above, with commands like /add (add cards to the bot’s hand), /status (display the images in its hand and their short clues), /clue (choose a card that matches a given clue), and a few other operations.
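
The bot itself is a thin layer over these commands. A bare-bones version of the /clue command with python-telegram-bot (v13-style API) could look like the sketch below, where choose_card_for_clue stands in for the whole guessing pipeline described earlier:

```python
from telegram import Update
from telegram.ext import Updater, CommandHandler, CallbackContext

def choose_card_for_clue(clue: str) -> str:
    """Placeholder for the full clue-guessing pipeline described above."""
    return "card_3.jpg"

def clue_command(update: Update, context: CallbackContext) -> None:
    clue = " ".join(context.args)  # e.g. "/clue you can do it"
    update.message.reply_text(f"I would play: {choose_card_for_clue(clue)}")

updater = Updater(token="YOUR_TELEGRAM_BOT_TOKEN")
updater.dispatcher.add_handler(CommandHandler("clue", clue_command))
updater.start_polling()
updater.idle()
```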

Full source code of this Dixit AI: https://github.com/hav4ik/dixit-chatgpt.

Dixit in AI research

To my surprise, a few scientific articles have been published around Dixit.

How the game went

Here is what our game looked like, with me being the subordinate of GPT-3, responsible for executing physical actions on its behalf:

Photo of 3 humans playing against a GPT-3 + BLIP-2 bot, operated by me. And a pizza that GPT-3 cannot eat!

AI generating clues

Here are a few more examples of the clues generated by the AI during this game that it played with my friends:

Clues generated by the AI during a real game (the AI generates clues for all cards in its hand). Only two of them got played out during the game.

AI choosing a card that matches the clue

At the beginning of a round when another player is the storyteller, the AI has to choose one of the cards in its hand that matches the given clue. The examples below were not cherry-picked.

For the given clue “not vegan”, the AI chose a card depicting a boy hungrily looking at a giant shell.
For the given clue “born to be wild”, the AI chose a card depicting a human covered in leaves hiking in the mountains. A pretty good choice, if you ask me!
The legendary “you can do it” round, where the AI completely dominated! Notice that the last card and the 4th one can also be described as “you can do it”, but the AI chose the best option, which clearly indicates that it understood the deep context of the image.

AI guessing storyteller’s card

Guessing the storyteller’s card is the hardest thing in Dixit. Here is where the AI struggles the most.

On the left: all cards fit the clue “I’m brave” quite well — in fact, most of the human players guessed the same card that the AI guessed. On the right: a correct guess by the AI.
Here, in both cases our AI failed to correctly guess the storyteller’s card. On the left, I think the AI’s guess is still pretty good. On the right, most players guessed the same card that our AI guessed.

The AI’s success rate is pretty close to human level though — in many rounds, the majority’s guess agreed with the AI’s guess. The limiting factor here is the BLIP-2 model and the whole image description pipeline.

Final thoughts

In the future, large language models are expected to become multi-modal, as evidenced by recent developments such as Flamingo and BLIP-2. The tricks outlined in this article will likely become obsolete by 2024, or even in the latter half of 2023. Nevertheless, it is fascinating to observe the capabilities of GPT-3 despite its limited ability to interact with the outside world through other modalities.

By the way, ChatGPT helped me write most of this blog post as well 😉 Oh, and GitHub Copilot helped me write most of the code.


Kha Vu Chan

Software Engineer @ Microsoft Bing. Former Machine Learning Engineer @ Samsung Research. My personal blog for longer posts: https://hav4ik.github.io/