Guess The Country

Pamudu
10 min read · Dec 31, 2023

Introduction

The “Guess The Country” game was inspired by the demo Google presented to showcase its new model, “Gemini.” Seeing what the Gemini models can do, and how they could be used to build an immersive game, left me truly amazed. The game concept shown in that demonstration gives users a dynamic experience with audio and interactive features.

I particularly appreciated the performance of the Google Gemini Ultra, as highlighted in my previous LinkedIn post. The model seamlessly integrates text, video, and audio functionalities, and the “Guess the Country” game, featured in the Google demo, exemplifies Gemini’s prowess.

Figure 1: The ‘Guess the Country’ game as featured in the “Google Gemini Ultra” demonstration video.

In the Google demo, the game describes a country using emojis, accompanied by audio descriptions. The challenge for users is to guess the country based on these hints. Users actively engage with the machine, expressing reactions such as “that’s easy,” which makes the game highly interactive.

I replicated this game so that it can be played in a local setting. Although we lack access to the audio, video, and text multimodal capabilities of Google Gemini, which are currently exclusive to Google, we can build a dedicated game environment for the “Guess the Country” game.

Here is our version of the game. The first step is to set up the world map so that it is fully visible to the camera. Basic instructions are displayed within the game, and players must guess the country within a specified time frame based on the displayed emojis. For players who get stuck, there is a hint feature: hints give descriptive information about the emojis, helping players make more informed guesses. As the game unfolds, players keep guessing, trying to identify each country before the time runs out.

Methodology

Now, let’s discuss how this game is created step by step.

STEP 1

First, we need to get a good map. I had a hard time finding a suitable map for this game, and I tried several maps bought from the stationery shop. Most of the maps are blue, making it difficult to differentiate between the sea and the countries. A black-and-white map with clear country boundaries works better, and that’s the type of map I am using.

Figure 2: A good map for our game; the boundaries of the countries are clearly visible.

STEP 2

The next step is to find a way to identify the country locations with the corresponding country names. For that, I am using an image segmentation model — specifically, the YOLOv8 segmentation model.

Why YOLOv8?

I chose YOLOv8 because YOLO models provide excellent real-time speed and, starting from the pre-trained COCO weights, the model can be trained with a relatively small number of annotated images. Another reason is that YOLOv8 is very easy to use; Ultralytics has packaged it, and with just 3 to 4 lines of code, we can train the model. Additionally, we can conveniently train the model on Google Colab.

Computer Vision Algorithms Fail…

It is worth mentioning that I initially considered using traditional computer vision algorithms to separate the countries. However, lighting conditions vary, and placing the camera above the map introduces shadows, so I decided to use a learned segmentation model instead.

2.1 Camera Setup

I used my phone’s camera with a mobile app called ‘IP Webcam.’ To make it work, both my phone and computer had to be on the same WiFi. I kept the phone steady using a phone holder for the right height. If you have a webcam with adjustable height, it makes things easier.

2.2 Image Annotation

First, I set up the camera and captured video recordings, which I later converted into frames. To reduce redundancy, I did not save every frame; instead, I saved one frame at a fixed skipping interval, since adjacent frames are nearly identical.
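The frame-skipping idea can be sketched as follows. This is a minimal illustration rather than the script used for the project: `every_nth` and `step` are hypothetical names, and plain integers stand in for decoded frames, which in practice would come from a video reader such as OpenCV's `cv2.VideoCapture`.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def every_nth(frames: Iterable[T], step: int) -> Iterator[T]:
    """Yield one frame out of every `step`, starting with the first."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return islice(frames, 0, None, step)


# Stand-in for decoded video frames: 10 frames numbered 0..9.
kept = list(every_nth(range(10), step=4))
print(kept)  # [0, 4, 8]
```

A larger `step` gives fewer, more varied images to annotate; a smaller one keeps more near-duplicates.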

Figure 3: Extract frames from the video

Next, I compiled an image dataset and uploaded it to the Roboflow annotation platform; other annotation platforms or local tools like LabelMe could be used as well. I annotated the images with bounding boxes, focusing on eight countries: the United States of America, Australia, Greenland, Brazil, Canada, India, Russia, and China. These larger countries made annotation easier.

The annotated images were augmented by variations in random brightness, exposure, contrast, and zoom. This augmentation was selected to simulate potential real-time scenarios where such variations might occur.
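To make the brightness and contrast variations concrete, here is a small sketch of what such an augmentation does to pixel values. It is not Roboflow's implementation; the function names and the random ranges are assumptions chosen for illustration, and the image is a tiny grayscale grid held in nested lists.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale image as rows of 0..255 values


def adjust(image: Image, contrast: float, brightness: float) -> Image:
    """Apply p -> contrast * p + brightness to every pixel, clamped to 0..255."""
    return [
        [max(0, min(255, round(contrast * p + brightness))) for p in row]
        for row in image
    ]


def random_variant(image: Image, rng: random.Random) -> Image:
    """Draw a random contrast and brightness change (ranges are assumptions)."""
    contrast = rng.uniform(0.8, 1.2)
    brightness = rng.uniform(-25, 25)
    return adjust(image, contrast, brightness)


sample = [[0, 100], [200, 255]]
print(adjust(sample, contrast=1.0, brightness=30))  # [[30, 130], [230, 255]]
```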

Manual image segmentation is a time-consuming task. Instead of segmenting the countries directly, I drew bounding boxes around them. These bounding boxes didn’t need to match exactly; an approximate fit was sufficient.

After that, the annotations were downloaded in PASCAL VOC format XML files from the Roboflow platform instead of the YOLO format, as the PASCAL format is easy for our later processing.
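Reading the boxes back out of a PASCAL VOC file needs only the standard library, which is part of what makes the format convenient. The sketch below assumes the usual VOC layout (`object` elements with a `name` and a `bndbox`); the helper name `read_voc_boxes` and the sample annotation are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Box = Tuple[str, int, int, int, int]  # (label, xmin, ymin, xmax, ymax)


def read_voc_boxes(xml_text: str) -> List[Box]:
    """Extract every labeled bounding box from a PASCAL VOC annotation."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        coords = [
            int(float(bndbox.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        ]
        boxes.append((label, *coords))
    return boxes


# Made-up annotation for illustration; real files come from the Roboflow export.
EXAMPLE = """
<annotation>
  <filename>frame_0001.jpg</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>Brazil</name>
    <bndbox><xmin>410</xmin><ymin>380</ymin><xmax>520</xmax><ymax>510</ymax></bndbox>
  </object>
</annotation>
"""

print(read_voc_boxes(EXAMPLE))  # [('Brazil', 410, 380, 520, 510)]
```

Each `(xmin, ymin, xmax, ymax)` tuple is in pixel coordinates, so it can be used as a box prompt without further conversion.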

I’ve generated segmentation masks using Meta AI’s Segment Anything model (SAM). I provided bounding boxes as prompts to the SAM model and the model produced corresponding masks for those bounding boxes.

Figure 4: The workflow for image annotation: starting with bounding box annotation using Roboflow, exporting these annotations in Pascal VOC format, then feeding them into the Segment Anything Model (SAM) to obtain segmented masks.

Code for automated annotating process

Following this preparatory work, we now possess a dataset with segmentation masks for the selected countries.
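Before training, each mask has to be written in the label format that Ultralytics YOLO segmentation training reads: one line per object, holding the class index followed by the polygon's x and y coordinates normalized to the 0..1 range. The article does not show this conversion, so the sketch below is an assumption about how it could look. It starts from a polygon outline that has already been traced from the mask (for example with OpenCV's `cv2.findContours`), and `yolo_seg_line` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple


def yolo_seg_line(class_id: int, polygon: Sequence[Tuple[float, float]],
                  width: int, height: int) -> str:
    """Format one object as `class x1 y1 x2 y2 ...` with coordinates in 0..1."""
    if len(polygon) < 3:
        raise ValueError("a polygon needs at least three points")
    values: List[str] = [str(class_id)]
    for x, y in polygon:
        values.append(f"{x / width:.6f}")
        values.append(f"{y / height:.6f}")
    return " ".join(values)


# Hypothetical triangle outline in a 200 x 100 pixel image, class 3.
print(yolo_seg_line(3, [(0, 0), (100, 0), (100, 50)], width=200, height=100))
# 3 0.000000 0.000000 0.500000 0.000000 0.500000 0.500000
```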

STEP 3

The subsequent step involves training the segmentation model. I chose the YOLOv8 Small model and trained it on Google Colab using a T4 GPU for 100 epochs. This way we have created a segmentation model for our game.
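A training run along these lines might look like the sketch below. It uses the Ultralytics `YOLO` class with the `yolov8s-seg.pt` checkpoint and 100 epochs as described above; the image size of 640 and the dataset path are assumptions, not values taken from the project. The import sits inside the function so that the file can be loaded even where the `ultralytics` package is not installed.

```python
from typing import Any, Dict


def training_args(data_yaml: str, epochs: int = 100, imgsz: int = 640) -> Dict[str, Any]:
    """Collect the arguments passed to `model.train` (imgsz is an assumed value)."""
    return {"data": data_yaml, "epochs": epochs, "imgsz": imgsz}


def train_segmentation(data_yaml: str) -> None:
    """Fine-tune YOLOv8 Small (segmentation) from its pretrained checkpoint."""
    from ultralytics import YOLO  # imported here so the sketch loads without the package

    model = YOLO("yolov8s-seg.pt")  # pretrained small segmentation weights
    model.train(**training_args(data_yaml))


# In a Colab cell, with a dataset config at a hypothetical path:
# train_segmentation("datasets/countries/data.yaml")
print(training_args("data.yaml"))
```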

Figure 5: Trained results on YOLOv8 segmentation model.

STEP 4

The next step is to prepare emojis with their descriptions for different countries. I used GPT-4V for that task and organized them in an Excel file. You can also come up with your own creative emojis and descriptions.

Country Code: USA
Country Name: United States of America
Emojis: 🗽 🦅 🍔 🌽
Description: Symbolized by the Statue of Liberty, representing freedom and democracy. The bald eagle signifies strength and freedom. Hamburgers are iconic in cuisine, and corn is a staple crop.

Country Code: AUS
Country Name: Australia
Emojis: 🦘 🐨 🏄‍♂️ 🌊
Description: Known for unique wildlife, including kangaroos and koalas. Famous for surfing spots and beautiful ocean waves. The landscapes range from beaches to deserts.

Country Code: GREENLAND
Country Name: Greenland
Emojis: 🧊 🐋 🎣 ❄️
Description: Characterized by vast ice sheets and a cold climate. Whaling and fishing are traditional activities. The environment is harsh but beautiful, with pristine snow and ice landscapes.

Country Code: BRAZIL
Country Name: Brazil
Emojis: ⚽ 🌳 🎭 ☕
Description: Celebrated for its love of soccer and the lush Amazon Rainforest. Vibrant Carnival festival showcases rich cultural heritage. Renowned for coffee production.

Country Code: CANADA
Country Name: Canada
Emojis: 🍁 🏒 🐻 ❄️
Description: Famous for natural beauty, including maple trees and autumn leaves. Ice hockey is a beloved sport. Known for diverse wildlife and cold winters.

Country Code: INDIA
Country Name: India
Emojis: 🕌 🌶️ 🐅 🧘‍♂️
Description: A land of diverse cultures and history, symbolized by the majestic Taj Mahal. Spicy cuisine and vibrant festivals reflect the rich heritage. Tigers roam national parks, and yoga has its origins here.

Country Code: RUSSIA
Country Name: Russia
Emojis: 🏰 🐻 🚀 ❄️
Description: Known for rich history and cultural landmarks like the Kremlin. Vast landscapes are home to bears and a cold climate. Significant contributions to space exploration.

Country Code: CHINA
Country Name: China
Emojis: 🐲 🍚 🏮 👲
Description: History is symbolized by the dragon, a sign of power and good luck. Cuisine, including rice, is a staple. Traditional festivals and attire reflect a rich cultural tapestry.
Figure 6: Created Excel file with country emojis and descriptions

Excel file Link
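Loading this table into the game could be done roughly as follows. To stay within the standard library, the sketch assumes the sheet has been exported to CSV with the four column names shown above; reading the Excel file directly would need an extra package. `CountryCard` and `load_cards` are hypothetical names.

```python
import csv
import io
import random
from dataclasses import dataclass
from typing import List


@dataclass
class CountryCard:
    code: str
    name: str
    emojis: str
    description: str


def load_cards(csv_text: str) -> List[CountryCard]:
    """Parse rows with the four columns shown above into CountryCard objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        CountryCard(row["Country Code"], row["Country Name"],
                    row["Emojis"], row["Description"])
        for row in reader
    ]


# Two rows abridged from the table above.
CSV_TEXT = (
    "Country Code,Country Name,Emojis,Description\n"
    "AUS,Australia,🦘 🐨 🏄‍♂️ 🌊,Known for unique wildlife.\n"
    "CANADA,Canada,🍁 🏒 🐻 ❄️,Ice hockey is a beloved sport.\n"
)

cards = load_cards(CSV_TEXT)
card = random.choice(cards)  # the country the player must find this round
print(card.name)
```

The `emojis` field is what the game shows as the puzzle, and `description` is what the hint reveals.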

STEP 5

5.1 Identify hand gestures

For hand detection, I’ve chosen the MediaPipe library due to its speed and accuracy in identifying hand key points. The library detects 21 hand landmarks, offering detailed information about hand poses. Unlike alternative keypoint detection algorithms such as YOLO Pose, MediaPipe has a purpose-built module for hand tracking, making it a reliable choice for precise and efficient hand pose recognition.

5.2 Check which finger is up

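def is_finger_raised(tip_y, pip_y, image_height):
    # Image y grows downward, so a raised fingertip has a smaller y than its PIP joint.
    # image_height is unused here; it is kept so existing callers still work.
    return tip_y < pip_y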
Figure 7: Explanation of the logic behind identifying whether the finger is raised or not

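In image coordinates, the y-axis points downward, so a point higher up in the frame has a smaller y value. If the y-coordinate of the fingertip is less than the y-coordinate of the proximal interphalangeal (PIP) joint, it implies that the finger is raised.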
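Applying the same test to every finger gives the set of raised fingers, which is what the gestures are built on. The sketch below works on a plain list of 21 `(x, y)` points in MediaPipe Hands landmark order, so it does not need MediaPipe itself. The index pairs follow the MediaPipe hand landmark numbering; treating the thumb with the same vertical test (tip 4 against IP joint 3) is a simplification assumed here, not something the article specifies.

```python
from typing import Dict, List, Sequence, Tuple

# MediaPipe Hands landmark indices: (tip, joint below it) for each finger.
FINGER_JOINTS: Dict[str, Tuple[int, int]] = {
    "thumb": (4, 3),    # tip vs. IP joint (simplified vertical test)
    "index": (8, 6),    # tip vs. PIP joint
    "middle": (12, 10),
    "ring": (16, 14),
    "pinky": (20, 18),
}


def raised_fingers(landmarks: Sequence[Tuple[float, float]]) -> List[str]:
    """Return the names of fingers whose tip lies above its reference joint.

    `landmarks` holds 21 (x, y) points in image coordinates, where y grows
    downward, so "above" means a smaller y value.
    """
    if len(landmarks) != 21:
        raise ValueError("expected 21 hand landmarks")
    return [
        name
        for name, (tip, joint) in FINGER_JOINTS.items()
        if landmarks[tip][1] < landmarks[joint][1]
    ]


# Synthetic hand: every point at y = 0.8, then lift index and middle fingertips.
hand = [(0.5, 0.8)] * 21
hand[8] = (0.5, 0.3)
hand[12] = (0.55, 0.3)
print(raised_fingers(hand))  # ['index', 'middle']
```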

5.3 The Country Identification Process

To determine whether the current index finger position is within a country, I need to input the set of masks from the YOLO model and the current position of the tip of the index finger to a custom function.

import numpy as np

def find_mask_for_coordinate(masks, coordinate):
    """Return the index of the first mask containing `coordinate`, else None."""
    x, y = coordinate

    # Ensure that the coordinate indices are integers
    x_index = int(x)
    y_index = int(y)

    for mask_index, mask in enumerate(masks):
        mask_array = np.array(mask)

        # Masks are indexed as [row, column], i.e. [y, x]
        pixel_value = mask_array[y_index, x_index]
        if pixel_value == 1:
            # Return the index of the mask if the point is inside
            return mask_index

    return None

The function checks whether the desired location pixel value is equal to 1, indicating that the coordinate is inside the current mask. If so, it returns the index of that mask. If the coordinate is not found in any mask, the function returns None. The purpose of this function is to identify which mask, if any, contains a given coordinate.

Figure 8: Segmented binary masks from the YOLO model

In this context, the set of masks refers to the output of the YOLO segmentation model, which provides a binary mask for each country. In my case, I have obtained 8 masks since I annotated data only for 8 countries. This function enables the identification of the specific country corresponding to a given coordinate, allowing us to determine the country the player is pointing to.
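A mask index on its own is not yet a country name: each mask comes with a class id, and the class id maps to a name (in Ultralytics results these are available as `result.boxes.cls` and `result.names`). The sketch below shows that last lookup with plain nested lists instead of model output, and adds a bounds check so a fingertip outside the frame returns None instead of raising. `country_at` and the toy masks are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # rows of 0/1 values, indexed as mask[y][x]


def country_at(masks: Sequence[Mask], class_ids: Sequence[int],
               names: Dict[int, str], point: Tuple[float, float]) -> Optional[str]:
    """Return the country whose mask covers `point`, or None."""
    x, y = int(point[0]), int(point[1])
    for mask, class_id in zip(masks, class_ids):
        inside_image = 0 <= y < len(mask) and 0 <= x < len(mask[0])
        if inside_image and mask[y][x] == 1:
            return names[class_id]
    return None


# Two toy 2 x 4 masks: class 0 covers the left half, class 1 the right half.
masks = [
    [[1, 1, 0, 0], [1, 1, 0, 0]],
    [[0, 0, 1, 1], [0, 0, 1, 1]],
]
names = {0: "Brazil", 1: "India"}
print(country_at(masks, [0, 1], names, (3, 1)))  # India
print(country_at(masks, [0, 1], names, (9, 9)))  # None
```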

STEP 6: GUI Implementation

For the implementation of the GUI, I chose the PySide6 library due to its permissive license, which is suitable for our project development. To incorporate real-time camera feed and hand position detection, a separate thread runs alongside the main thread to ensure smooth functionality.
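The worker-thread pattern can be illustrated without any GUI code. The sketch below uses the standard library's `threading` and `queue` as a stand-in; in a PySide6 application the same role would typically be played by a `QThread` that emits a signal for each processed frame. The strings stand in for camera frames, and all names here are hypothetical.

```python
import queue
import threading
from typing import List


def camera_worker(frames: List[str], out: "queue.Queue", stop: threading.Event) -> None:
    """Push frames to the GUI side until told to stop, then send a sentinel."""
    for frame in frames:
        if stop.is_set():
            break
        out.put(frame)
    out.put(None)  # sentinel: no more frames


frame_queue: "queue.Queue" = queue.Queue()
stop_event = threading.Event()
worker = threading.Thread(
    target=camera_worker,
    args=(["frame-0", "frame-1", "frame-2"], frame_queue, stop_event),
    daemon=True,
)
worker.start()

# The main (GUI) thread consumes results without doing the heavy work itself.
received = []
while (item := frame_queue.get()) is not None:
    received.append(item)
worker.join()
print(received)  # ['frame-0', 'frame-1', 'frame-2']
```

Keeping capture and hand detection off the main thread is what lets the window stay responsive while frames are being processed.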

Combining Everything

Let’s delve into the game dynamics. The initial window provides instructions, and the user can start by clicking the button or pressing Enter. The game then moves to the next window, where the actual gameplay takes place. On the left side, there’s a camera view, and players are tasked with pointing to the country represented by the emojis.

Behind the scenes, the game employs MediaPipe’s hand-tracking module to detect hand points as players move their hands. For every hand position, the game checks which fingers are raised, and different finger configurations trigger different actions. To avoid false selections while the hand is moving, players must hold their hand over a country for a specified duration before the selection is registered.
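The hold-to-select rule can be sketched as a small state machine. This is one possible way to write it, not the project's actual code: `DwellSelector` and the 1.5 second hold time are assumptions, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from typing import Optional


class DwellSelector:
    """Report a target only after it has been pointed at continuously."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds
        self._target: Optional[str] = None
        self._since = 0.0

    def update(self, target: Optional[str], now: float) -> Optional[str]:
        """Feed the country under the fingertip; return it once the hold elapses."""
        if target != self._target:
            # Pointer moved to a different country (or off the map): restart timer.
            self._target = target
            self._since = now
            return None
        if target is not None and now - self._since >= self.hold_seconds:
            self._since = now  # restart so the selection is not fired every frame
            return target
        return None


selector = DwellSelector(hold_seconds=1.5)
print(selector.update("India", now=0.0))  # None (timer starts)
print(selector.update("India", now=1.0))  # None (held for 1.0 s)
print(selector.update("China", now=1.2))  # None (moved, timer restarts)
print(selector.update("China", now=2.8))  # China (held for 1.6 s)
```

In the game loop, `now` would come from a clock such as `time.monotonic()`.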

The gameplay involves users pointing to a country using their index finger. Correct guesses contribute to the correct score, while incorrect ones increment the incorrect score.

For a hint on the displayed emojis, players raise two fingers, and the corresponding hint appears in the GUI. To pause the game, players raise all five fingers; to resume, they repeat the same five-finger gesture. This gesture-driven approach adds an engaging layer to the gaming experience.
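The mapping from gestures to actions described above can be sketched as follows. It is a simplified assumption of how the dispatch might be organized: the class name and action strings are hypothetical, and "two raised fingers" is interpreted as any two fingers, since the text does not say which ones.

```python
from typing import Set


class GestureController:
    """Translate the set of raised fingers into a game action."""

    def __init__(self) -> None:
        self.paused = False

    def action_for(self, raised: Set[str]) -> str:
        if len(raised) == 5:
            # The same gesture pauses and resumes, so flip the state.
            self.paused = not self.paused
            return "pause" if self.paused else "resume"
        if self.paused:
            return "ignore"  # no pointing or hints while paused
        if raised == {"index"}:
            return "point"
        if len(raised) == 2:
            return "hint"
        return "ignore"


ALL_FINGERS = {"thumb", "index", "middle", "ring", "pinky"}
controller = GestureController()
print(controller.action_for({"index"}))            # point
print(controller.action_for({"index", "middle"}))  # hint
print(controller.action_for(ALL_FINGERS))          # pause
print(controller.action_for({"index"}))            # ignore
print(controller.action_for(ALL_FINGERS))          # resume
```

Called on every frame, the five-finger branch would toggle repeatedly while the hand is held open, so in practice it would need the same kind of hold or debounce logic used for country selection.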

LIMITATIONS

1) The detection and tracking method employed by MediaPipe relies on first detecting the entire hand. If the hand cannot be detected, tracking won’t occur, and detection may be missed when only part of the hand is visible.

2) Optimal results are achieved when the camera remains stationary during the gameplay. This is because the country masks are initially detected, and subsequent calculations are based on that information. If there are changes in the positions of these country masks due to camera movement, it can impact the gameplay results.

3) Clear visibility of the fingers is crucial for the system to function. If the fingers are covered for any reason, the system won’t work effectively, particularly in identifying the pointing fingers.

4) The functionality of the system relies on the YOLOv8 segmentation model. If different types of country maps are used, it may lead to failures in the segmentation and detection process.

5) The system needs to be trained with a diverse set of world maps. Annotating these maps even with the bounding boxes is a time-consuming task.

6) If the user plays the game a second time, the same country, emojis, and hints will be present.

Improvements

1. Implement difficulty levels, such as easy (5 emojis), medium (4 emojis), and hard (3 emojis), adjusting the time given for each difficulty level.

2. Introduce a Game Over scenario upon reaching 3 incorrect guesses.

3. Set up a high-score system to track and display players’ achievements.

4. Enhance the hint interaction by having the system speak the hint, increasing the overall interactivity of the game.

5. Use different sets of emojis and hints for each game to provide a fresh and varied experience with new challenges every time.

This concludes the implementation of Guess the Country. Feel free to share any suggestions, and I invite you to give it a try!

Github Link for this project

Pamudu Ranasinghe
