Taming GPT, Part 2: Lessons from Three False Starts in Generative AI + Gaming
Introduction
Since my previous post, I’ve continued to tinker with prototypes that build generative AI into the core of novel gaming experiences. Not every prototype reaches a playable state, but that doesn’t mean we can’t learn from them.
I’d like to open by sharing a brief analogy. I like to think of exploring potential gameplay concepts as if I were assessing the best way to climb a mountain range:
- Which peak should be our target?
- Among numerous climbing paths, which is the most accessible?
- If we’ve ascended a false peak or local optimum, how would that prepare us for the subsequent summit?
This analogy is not unlike Jon Radoff’s articulation of creativity as a searching function:
If the universe is a nearly-infinite number of possibilities, parameters and variables — then perhaps creativity is about applying efficient processes towards this search for effective solutions.
In recent months, I’ve undertaken several short expeditions, surveying the ‘mountain’. While I didn’t ‘summit’ any peaks — meaning, I didn’t create playable games — I did traverse several valleys and am now ready to share my discoveries.
In this post, I’ll share my process of how I evaluated generative AI hypotheses across three separate projects. For each project, I’ve included:
- Premise: A succinct outline of my initial concept.
- Inspirations: Existing products that fuelled my enthusiasm for the premise.
- Testable hypothesis: The critical question I aimed to answer.
- Work done: An overview of relevant tasks completed and strategies attempted.
- Conclusion: My personal takeaways from the project.
- Conditions to revisit: What factors may trigger a reconsideration of this concept?
Project #1: Isometric Tile City Builder
Premise: A 2.5D isometric city builder where the player’s city tiles are unique to their session.
Testable Hypothesis: Can GenAI construct 2.5D isometric tiles that both a) integrate seamlessly next to each other in a game, maintaining stylistic coherence, and b) encourage boundless creativity through player-provided (or game-provided) textual LLM prompts?
Inspirations:
- Scenario.gg’s product forms the cornerstone of this project, offering a stylistically consistent response to text-to-image prompts.
- The “Sparks of AGI” talk stirred my interest by proposing the idea of using GPT to “sketch” basic shapes with code, which a diffusion model could then refine into a high-fidelity piece of art.
Initially, I experimented with TikZ, as suggested in the video. However, I soon realized that TikZ is specific to LaTeX, and I had reservations about running it from Python (and ultimately C#, as would be necessary for a Unity game prototype).
So, I naturally turned to ChatGPT for advice and it recommended OpenSCAD and Pycairo. I decided to start with OpenSCAD. After a few attempts with simple primitives, I was able to create fairly complex 3D structures from text prompts. My ChatGPT prompt follows:
Very good. Let's try another.
Please give me openscad code for a 3d model that includes the following features:
- the base tile is a green square
- Somewhere on the square, there is a brown simple house - perhaps a cube with a triangle on top
- There is a large light grey wind turbine, including the tower and simple rectangles making up the turbine blades
Copying and running the ChatGPT-provided code resulted in this image:
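For anyone wanting to script this loop instead of pasting into the GUI, OpenSCAD can be driven headlessly from Python via its command-line interface. This is a sketch of that idea, not the exact workflow I used; the function and paths are illustrative:

```python
import shutil
import subprocess
from pathlib import Path

def render_scad(scad_source: str, png_path: str, size=(512, 512)) -> list:
    """Write ChatGPT-provided OpenSCAD code to disk and render it to a PNG
    via the openscad CLI. Returns the command list for inspection."""
    scad_path = Path(png_path).with_suffix(".scad")
    scad_path.write_text(scad_source)
    cmd = [
        "openscad",
        "-o", png_path,                      # output image
        f"--imgsize={size[0]},{size[1]}",    # canvas size in pixels
        str(scad_path),
    ]
    # Only invoke OpenSCAD if it is actually installed on this machine.
    if shutil.which("openscad"):
        subprocess.run(cmd, check=True)
    return cmd
```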
Incorporating this into Stable Diffusion’s web tool as image-to-image — supplemented by additional text input outlining my requirements — yielded some reasonably appealing tiles.
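I did this step through the web tool, but the same image-to-image pass can be scripted; the sketch below uses Hugging Face’s diffusers library as a stand-in (an assumption about tooling, and the prompt and parameters are illustrative, not the exact values I used):

```python
# Illustrative image-to-image refinement of the OpenSCAD render.
PROMPT = (
    "isometric game tile, green grass base, brown house with pitched roof, "
    "light grey wind turbine, stylized, clean edges"
)
PARAMS = {"strength": 0.6, "guidance_scale": 7.5}  # 0.6 keeps the sketch's layout

def refine_sketch(sketch_path: str, out_path: str) -> None:
    # Heavy imports kept local: this is a sketch, and the pipeline needs a GPU
    # plus a model download to actually run.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    sketch = Image.open(sketch_path).convert("RGB").resize((512, 512))
    result = pipe(prompt=PROMPT, image=sketch, **PARAMS).images[0]
    result.save(out_path)
```

The `strength` parameter is the interesting dial here: lower values stay closer to the primitive sketch, higher values let the diffusion model depart from it.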
When experimenting with several other prompts, I found that most didn’t yield results as successful as my initial attempt.
Please give me openscad code for a 3d model that includes the following features:
- the base tile is a green square
- There are little dark green triangles representing tents
- there is a little orange pyramid representing a campfire
This approach encountered two significant challenges:
- The primitives generated by ChatGPT lacked consistency in their quality.
- OpenSCAD utilized a perspective view instead of an orthogonal one, inhibiting the visually seamless alignment of adjacent tiles.
For my next strategy, I engaged with Pycairo to rectify the perspective view problem and experiment with more complex shapes. In this instance, I was able to dictate the perspective by setting a green diamond as the base tile, then prompting ChatGPT to illustrate on top of it:
I am using pycairo within Python to create sketches of an isometric tile to be used in a game. I would like to ask you to draw a picture for me using colored shapes defined by pycairo lines. Rules of this process:
Provide output in lines of pycairo to be run from python.
The image should be viewed from an apparent 45-degree isometric, orthographic perspective. This should have the effect of all regular shapes in the image being defined by diagonal lines, as if seen from the side and slightly from above.
The content of the image should all be placed upon an isometric view of a square tile, with the eastern and western corners touching the edges of the 512x512 canvas. I will provide you the code for this tile upon which to draw things.
There should be no background color.
Do not include “surface.write_to_png”.
Do not include any code within a def.
Respond only with the code snippet itself, enclosed with triple backticks. Exclude setting the scene, e.g. “Here's an example of Python code…”. After the code snippet, do not explain yourself. However, within the code snippet, use comments to describe what you are trying to do, e.g. “# Draw tree trunk (brown)”.
Be as detailed as you can. Feel free to imagine many shapes, if you can think of them, and if you can define them using python lines. 100 lines is good, 200 lines is better, 500 lines is probably too much.
Be creative. Use different colors for stylistic effect, and add minor details if you like, so long as you include the subject material.
Please give me python code to depict this drawing using pycairo containing the following features:
- the terrain on the tile is green (like grass)
- on the tiles are three trees. One is a leafy, deciduous oak-like tree. Another is a pine tree. The third is a weeping willow.
Below is the start of the code. Continue afterwards.
import math
import cairo
# Set up the canvas
surface = cairo.ImageSurface(cairo.FORMAT_ARGB32, 512, 512)
context = cairo.Context(surface)
# Draw isometric square tile
context.move_to(256, 512)
context.line_to(512, 384)
context.line_to(256, 256)
context.line_to(0, 384)
context.close_path()
context.set_source_rgb(0, 1, 0) # Green
context.fill()
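Since the prompt forbids the model from calling `surface.write_to_png`, the caller’s side needs a small harness that extracts the fenced snippet from the reply, runs it against the preamble above, and saves the canvas. This is an illustrative sketch of that harness, not my exact code:

```python
import re

def run_reply(reply: str, preamble: str, out_path: str, namespace=None) -> None:
    """Extract the triple-backtick snippet from ChatGPT's reply, execute it
    after the fixed tile preamble, and save the canvas (the prompt forbids
    the model from calling write_to_png itself)."""
    match = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
    snippet = match.group(1) if match else reply
    ns = namespace if namespace is not None else {}
    exec(preamble, ns)   # sets up surface/context as in the preamble above
    exec(snippet, ns)    # ChatGPT's drawing commands
    ns["surface"].write_to_png(out_path)
```

Executing model-generated code this way is obviously only acceptable in a local prototyping loop, never with untrusted input.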
After several attempts, I obtained the following images. At first glance, it seemed like progress was being made, as I could envision the tiles fitting together. However, the initial sketch still left room for improvement:
I tried geometric shapes to see if pycairo was better with something that might look like buildings:
…
Please give me python code to depict this drawing using pycairo containing the following features:
- the terrain on the tile is green (like grass)
- the tile is that of a cross section of a tall castle wall. The castle wall extends from northwest to southeast, as if a section of a much longer wall. The exterior of the wall is facing southwest and is dark grey.
- From the southeastern perspective, we can see the interior of the castle wall, which is filled with light grey.
…
Finally, I tried exactly what was attempted in the Sparks of AGI video:
…
Please give me python code to depict this drawing using pycairo containing the following features:
screenshot of a 3d city-building game showing a terrain with:
- a river from left to right
- a desert with a pyramid below the river
- a city with many highrises above the river
…
Faced with unsatisfactory sketches and unwieldy results from Stable Diffusion, I decided to step away from this approach.
Conclusion: Throughout the process of generating these tiles, I found that maintaining stylistic consistency often conflicts with enabling unrestricted player expression. Ensuring certain elements of the tile remained consistent, such as the base diamond tile, while simultaneously allowing for virtually anything to be depicted on top proved challenging. I later discovered that I had arrived at the same conclusion that Reed Berkowitz did on this podcast.
Conditions to revisit: I predict there are two breakthroughs needed to make headway here:
- Technically, can you enforce consistency in specific regions of the image? I am not familiar enough to say, but perhaps ControlNet might offer a potential solution.
- From a design perspective, how can we involve the user in adjusting or approving the generated image? This could involve adjustable tools like DragGAN, or other human-in-the-loop UX strategies.
Project #2: Chimera Sports Network
Premise: ChimeraQuest with a new twist on quest resolution. Instead of quests resolving in a Discord thread, they’re completed by your creatures “live on TV”, narrated and analyzed by AI sportscasters on a never-ending live stream.
Testable Hypothesis: Can GenAI consistently create an audio-visual feed that responds to player actions?
Inspirations:
- AtheneLive, a Twitch creator, has developed endless AI-generated content programs, such as AIJesus. These channels operate 24/7, with content created in direct response to viewer comments.
In previous work, I validated the quality of quest resolution text in ChimeraQuest, and felt confident in generating endless text content of satisfactory quality. Consequently, I saw the riskiest aspects of the audio track as the following:
- Text-to-speech conversion for commentary.
- Text-to-audio translation for sound effects, audience reactions, etc.; ensuring it sounds “authentic.”
- Real-time, continuous audio generation.
My target milestone was to develop an approximately one-minute-long audio segment featuring AI speakers setting the scene for an upcoming quest event. This segment would integrate sound effects and AI commentary, designed in such a way that multiple similar segments could be generated using a series of templates.
Here are the text-to-speech options I evaluated:
I looked at a few options for text-to-SFX as well but could not find any of acceptable quality. Meanwhile, Freesounds.com offered an abundance of straightforward, relevant audio clips that required minimal editing (example).
The script itself was easy to create via this ChatGPT prompt:
I would like you to provide me a script for two characters speaking as commentators at a sporting event. The first character is the Primary Commentator, nicknamed Primary, whose role it is to narrate what is happening and explain to the audience the sporting event. The second character is the Color Commentator, nicknamed Color, whose role it is to provide additional details about the event, its competitors, and reactions to what Primary is describing.
The contest itself is called The Chimera Talent Show, and it is similar to a Dog Show, where intelligent creatures attempt to impress judges by completing feats. Generally, be respectful of the creatures: express admiration at their abilities, excitement when they succeed, and sympathy when they fail.
At this point in the story:
You are returning from commercial break just as the next competitor is about to begin their attempt.
The next competitor is named Floofy, and Floofy is a Stripey Snout, which is a creature descended from a giant anteater and a clownfish.
The next feat is attempting to climb over a Fence.
Have the characters explain to the audience what they are watching, and set the scene for Floofy attempting the feat.
Conclude the script right before Floofy attempts his feat.
In the script below, provide all dialog in the following json format, and provide no additional context before or after. I will get you started:
{
"Primary": "Hello, and welcome back to the Chimera Talent Show."
}
My experimentation with Microsoft’s Voice Gallery was quite efficient; it didn’t take long to identify the characters and their corresponding voice styles.
To synthesize the audio scripts, I employed xml structures following the speechsdk guidelines. This approach quickly and cost-effectively generated all the necessary audio snippets.
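Concretely, each line of the script gets wrapped in an SSML envelope before synthesis. The sketch below shows the shape of that wiring; the voice names and credentials are illustrative placeholders, not necessarily what I used:

```python
# Illustrative wiring of script lines to Azure Speech SSML; the voice names
# and file layout are assumptions, not the project's actual choices.
VOICES = {"Primary": "en-US-GuyNeural", "Color": "en-US-JennyNeural"}

def line_to_ssml(speaker: str, text: str) -> str:
    """Wrap one line of dialog in the SSML envelope the Speech SDK expects."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<voice name="{VOICES[speaker]}">{text}</voice>'
        "</speak>"
    )

def synthesize_line(speaker: str, text: str, wav_path: str) -> None:
    # Local import: requires the azure-cognitiveservices-speech package
    # and real credentials to run.
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
    audio = speechsdk.audio.AudioOutputConfig(filename=wav_path)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config, audio_config=audio)
    synthesizer.speak_ssml_async(line_to_ssml(speaker, text)).get()
```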
Ultimately, I used pydub’s AudioSegment to import and overlay all individual audio files at specified start times, resulting in the final, multi-layered audio file:
Conclusions:
- Endless audio is viable now, as evidenced by this demo and existing 24/7 streams.
- Video, however, presents a greater challenge. While I could use deepfake technology to create talking heads for sportscasters, similar to Athene’s approach, generating actual on-screen content appears to be both crucial and complex.
- From the perspective of interactive gameplay, I’d prefer a significantly higher level of user interactivity influencing on-screen elements. This current structure offers a limited throughput, with only one creature completing a quest every few minutes.
Conditions to revisit:
- Advances in video generation technology could make revisiting this concept worthwhile. Once it’s possible — and controllable — this could transform into an incredibly captivating experience.
- From a design perspective, could there be ways to feature more players per minute on screen?
Project #3: Endzone
Premise: Players field a team of footballer NPCs for a fantastical backyard football game, coaching NPCs using natural language. Football plays occur on livestream.
Testable Hypothesis: Can we design an AI that can interpret and act on controls provided via natural language inputs?
Inspirations:
- SaltyBet offers a 24/7 interactive stream where players can wager on custom characters competing in a Street Fighter-style game. What piqued my interest was the hidden logic dictating the characters’ actions; could a player potentially ‘coach’ these characters to improve their performance?
- Tecmo Bowl’s simplistic physics, plays, and accessible fantasy aspects seemed like an excellent starting point.
Initially, I laid down some basic “laws of physics” to dictate character interactions on the field. Drawing heavy inspiration from Tecmo Bowl, these rules were designed to ensure that on-field dynamics remained intriguing and unpredictable even without changes in the coach’s instructions.
Once these mechanics were functioning adequately, I proceeded to validate the central hypothesis of controlling the characters through natural language. From prior projects, I was fairly confident that ChatGPT could generate code. However, there was a significant challenge: any new code provided would need to be compiled before the game could run. This problem is compounded by the risk of ChatGPT producing faulty code, leading to suboptimal character intelligence or a malfunctioning game.
To bypass this hurdle, I decided to abstract all potential low-level mechanics for characters to a higher-level script. This high-level logic served as a buffet of methods from which a custom compiler could select to fulfill the objectives of the human player’s natural language input.
An example applied to just one character:
Knight3 is the runningback
Run to the north end of the endzone
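The “buffet of methods” idea can be sketched as a whitelist of high-level actions that the natural-language layer is constrained to select from, so a bad generation is rejected rather than crashing the game. All names below are hypothetical, not code from the prototype:

```python
# Illustrative sketch of the "buffet of methods": the compiler may only pick
# from whitelisted high-level actions. Method and landmark names are invented.
from dataclasses import dataclass, field

@dataclass
class Character:
    name: str
    role: str = "player"
    orders: list = field(default_factory=list)

    # The "buffet": every behavior the compiler may select from.
    def assign_role(self, role: str):
        self.role = role

    def run_to(self, landmark: str):
        self.orders.append(("run_to", landmark))

    def block_nearest_opponent(self):
        self.orders.append(("block", "nearest"))

ALLOWED = {"assign_role", "run_to", "block_nearest_opponent"}

def apply_plan(char: Character, plan: list) -> None:
    """Apply an LLM-produced plan, rejecting anything off the whitelist."""
    for step in plan:
        method = step["method"]
        if method not in ALLOWED:
            raise ValueError(f"method not in buffet: {method}")
        getattr(char, method)(*step.get("args", []))

# e.g. the two instructions above, expressed as a structured plan:
knight3 = Character("Knight3")
apply_plan(knight3, [
    {"method": "assign_role", "args": ["runningback"]},
    {"method": "run_to", "args": ["north end of the endzone"]},
])
```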
Despite the successful execution of some very simple plays, I quickly realized the complexity involved in designing higher-level logic that accurately mirrors the coach’s intentions while simultaneously behaving realistically. Consider the choices that a running back might face at any given time:
- Where should they run based on the positions of all the defensive opponents?
- When should they risk going wide for a touchdown versus playing it safe and pushing directly forward for yardage?
Coupled with the anticipated difficulty of building a custom compiler, this task, while possible, was more technically challenging than I was prepared to tackle at that moment.
Conclusion: Regrettably, I couldn’t test the central concept of “Using natural language to coach characters,” as the scope of my approach proved too vast. The takeaway is a reminder of the classic “MVP” lesson: I should have started with a project of a much smaller scope to test the hypothesis effectively.
Conditions to revisit:
- This concept seems possible now, especially for team-based play. One approach could be to incorporate actual reinforcement learning for agent AI (example). Could the agents be designed to learn the “physics” of the world while following natural language instructions appropriately?
- Alternatively, we could employ simpler models where player-coaches provide guidance to character AI. What if, instead of football, players coached a 5v5 matchup of auto-battlers, akin to AFK Arena?
Looking Forward
As anyone who’s built with AI can attest, this new technology is easier to demo than to polish into a user-ready product. While none of the above concepts reached a playable stage, I learned a great deal about the potential challenges along the path.
Despite the pitfalls, my enthusiasm for the application of Generative AI in gaming remains undimmed. The potential of each project discussed in this post still intrigues me, even though they didn’t reach fruition, and there are countless more GenAI concepts yet to be conceived. I’ve only scratched the surface of what’s possible with the tools currently at my disposal; imagine the possibilities with the advancements yet to come.
As always, I’m keen to connect with fellow enthusiasts and communities of like-minded creators. Please feel free to reach out! I look forward to exchanging thoughts and ideas.