AI in video games: a historical evolution, from Search Trees to LLMs. Chapter 3: 2000–2023

Jose J. Martinez
19 min read · Nov 15, 2023
AI in video games: a historical evolution, generated using DALL·E 3

Introduction

Welcome to the third chapter of our ongoing series exploring the evolutionary journey of AI in video games. Our exploration began in 1950, chronicling the emergence of AI up until 1980 in this inaugural chapter.

There, we delved into the foundational years of video game AI, encompassing the years 1950–1980, and examined the primary approaches of that era: Discrete Logic, Data Trees, Search Algorithms, Selective Search (Minimax, Alpha-Beta, Monte Carlo), Knowledge Bases, AI Patterns, Behaviors, Interaction Meshes, and Text-Based Gaming strategies.

If you missed the first chapter 1️⃣, I highly recommend reviewing it before proceeding. Your support and feedback would be greatly appreciated!

Afterwards, our journey took us to the period between 1980 and 2000, covered in the second chapter 2️⃣. This era witnessed the integration of well-established techniques from other domains into video game development, including Pathfinding Algorithms, Finite State Machines, Artificial Neural Networks, and Behaviour Trees. Unfortunately, the uneven pace of AI development (not just in video games but across all domains) led to what would later be termed the First and Second AI Winters.

We’ve arrived at the culmination of this series: an exploration that commences with Navigation Meshes (aka Navmeshes), Symbolic NLP and Behavior Collections; delves into the intricacies of Behavior Trees and Procedural Generation; examines Adaptive Behavior as a facet of Reinforcement Learning, including newer techniques such as Neuroevolution (in contrast to fixed topologies trained with Gradient Descent); highlights NVIDIA’s notable contributions in Graphics, including DLSS; and finally reaches the current NLP revolution caused by the dawn of Embeddings, Transformers, Large Language Models, Generative AI, Text-to-Speech, Speech-to-Text, Voice Generation and Retrieval-Augmented Generation (RAG), which are now shaping the landscape of content generation and immersive conversations with NPCs.

A word cloud of the concepts of this chapter

Let’s embark on this final journey together!

Navmeshes: 3D Pathfinding

Navigating the vast virtual landscapes of video games requires intricate pathfinding systems to guide characters seamlessly through complex terrains. One key element that plays a crucial role in this process is the Navmesh. In this first section, we’ll delve into the world of Navmeshes, exploring their significance in enhancing gaming experiences and improving AI behavior.

At its core, a Navmesh, short for Navigation Mesh, is a fundamental component of pathfinding systems in video games. It represents a spatial mesh that defines walkable areas within a game environment, allowing characters to intelligently navigate through the terrain. Its main difference is that it is spatial and built for continuous space, whereas classic grid-based pathfinding is discrete and presupposes fixed, discrete coordinates.

On a), a discrete-coordinates path generated using Pathfinding over a grid. On b), a spatial, continuous path found using Navmeshes.

Navmeshes include the following features:

  1. Pathfinding Optimization: Navmeshes serve as a powerful tool for optimizing pathfinding calculations. By precomputing and storing information about walkable areas, games can significantly reduce the computational load associated with real-time pathfinding, thereby enhancing overall performance.
  2. Dynamic Adaptability: Unlike traditional grid-based navigation systems, Navmeshes provide a dynamic and adaptable solution. They can account for irregularly shaped environments, inclines, and varying terrain, enabling more realistic and fluid character movements.
  3. Collision Avoidance: Navmeshes contribute to effective collision avoidance strategies. Characters can intelligently navigate around obstacles, adjusting their paths in real-time to avoid collisions and create a more immersive gaming experience.
  4. Integration with AI: The integration of Navmeshes with AI systems is pivotal in defining non-player character (NPC) behavior. Developers can specify regions on the Navmesh to influence NPC movement patterns, allowing for diverse and context-aware interactions within the game world.
  5. Usage in Level Design: Level designers leverage Navmeshes to craft engaging and challenging environments. They can designate specific areas as off-limits or strategically position obstacles, influencing how both player-controlled and AI characters traverse the game space.
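The pathfinding optimization in point 1 is essentially graph search over the precomputed mesh. As a minimal sketch (the polygons, coordinates and adjacency below are made up, and Dijkstra stands in for the A* variants engines actually use), a navmesh reduced to polygon centroids can be searched like this:

```python
import heapq
import math

# Hypothetical navmesh: each walkable polygon is reduced to its centroid,
# and adjacency lists record which polygons share a walkable edge.
centroids = {
    "A": (0.0, 0.0), "B": (2.0, 0.0), "C": (4.0, 0.0), "D": (2.0, 2.0),
}
adjacency = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def navmesh_path(start, goal):
    """Dijkstra over the polygon adjacency graph (A* would add a heuristic)."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in adjacency[node]:
            if nxt not in visited:
                step = dist(centroids[node], centroids[nxt])
                heapq.heappush(frontier, (cost + step, nxt, path + [nxt]))
    return None  # no walkable route between the two polygons

print(navmesh_path("A", "D"))  # ['A', 'B', 'D']
```

A real engine would then smooth this polygon-level path (e.g. with funnel/string-pulling) into the continuous route the character actually walks.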

Modern engines such as Unity and Unreal include Navmeshes as ready-to-use components, which can analyze the meshes of the environment and determine which areas are walkable:

Navmesh in Unreal, with green walkable areas.

2005. Symbolic NLP and Believable Characters with Behaviour Collections.

The landscape of Natural Language Processing underwent a profound transformation with the introduction of BERT in 2018 and subsequently, Large Language Models (LLMs) from 2020 — revolutionizing the field (acknowledging earlier concepts such as Word Embeddings, which will also be explored). However, a pivotal moment in the narrative of Natural Language Processing and Behavior Agents in Video Games occurred with the release of Façade, an interactive story. This event can be rightfully deemed a significant milestone, contributing to the evolution of these technologies.

Façade immerses players as a friend of Trip and Grace, a couple who invited the player to their New York City apartment for drinks. Despite the initial pleasant atmosphere, tensions arise between the couple upon the player’s arrival. In the game, NLP is incorporated to enable players to converse with Trip and Grace by typing sentences.

Façade utilizes Hap/ABL, a programming language for encoding what its authors call Believable Agents with Behavior Collections. These behaviors consist of conditions (triggers), sequential steps and a final NLP interaction with the user, called a mental act. Each step’s execution determines the availability of the subsequent step: a successful step enables the next one, while a failure causes the overall behavior to fail.

An example sequential behavior in ABL is shown below.

sequential behavior AnswerTheDoor()
{
    [...] with success_test { w = (KnockWME) } wait;
    act sigh();
    subgoal OpenDoor();
    subgoal GreetGuest();
    mental_act { [...] }
}

In the example above, the game employs a sequential behavior model where there is a condition (waits for someone to knock on the door) and then a series of steps (sighs, greets) if the condition succeeds. If the conditions and the steps are completed, then the mental act (interaction with the player using NLP) is triggered.
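A minimal Python sketch of these ABL semantics (the step names and the log are purely illustrative) could look like this:

```python
def sequential_behavior(steps):
    """Run steps in order; any failing step fails the whole behavior,
    mirroring ABL's sequential-behavior semantics."""
    for step in steps:
        if not step():
            return False  # the behavior fails as soon as one step fails
    return True

# Illustrative steps for AnswerTheDoor: each returns True on success.
log = []
def wait_for_knock():  log.append("knock");  return True
def sigh():            log.append("sigh");   return True
def open_door():       log.append("open");   return True
def greet_guest():     log.append("greet");  return True

ok = sequential_behavior([wait_for_knock, sigh, open_door, greet_guest])
print(ok, log)  # True ['knock', 'sigh', 'open', 'greet']
```

Only when the whole sequence succeeds would the final mental act (the NLP interaction) fire.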

The game’s major component lies in its system for interpreting player keyboard inputs, functioning as an NLP challenge in AI. While the technology behind it has advanced enormously in recent years, 20 years ago Façade struggled to tackle this problem, necessitating shortcuts due to time constraints.

This language processing system consists of two phases:

  1. A Multilabel Text-Classification problem: Mapping text to around 30 Discourse Acts to discern the player’s intentions, a setup frequently used for intent detection.
  2. A Parsing Mechanism: Utilizing over 800 rules to understand a wide array of behaviors, emotions, and idioms in the English language. Due to limitations, some words like ‘melon’ were banned because of slang meanings. The system prioritized responding to user input, leading to more false positives than silence or false negatives, and lacked nuanced comprehension of the text.
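Phase 1 can be illustrated with a toy rule-based classifier; the acts and keywords below are invented for the example, while the real game used around 30 acts and over 800 rules:

```python
# Toy sketch of Façade-style discourse-act mapping with hand-written rules.
RULES = {
    "greet":    {"hi", "hello", "hey"},
    "agree":    {"yes", "sure", "okay"},
    "disagree": {"no", "never"},
    "praise":   {"nice", "lovely", "great"},
}

def discourse_acts(text):
    """Multilabel classification: return every act whose keywords match."""
    words = set(text.lower().replace("!", " ").replace(",", " ").split())
    return sorted(act for act, kws in RULES.items() if words & kws)

print(discourse_acts("Hello! Nice apartment"))  # ['greet', 'praise']
```

The multilabel part is visible here: a single utterance can trigger several discourse acts at once.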

The NLP technique used then was Syntactic Parsing, which analyzed words grammatically and matched preprogrammed structures.

This approach falls within the category of Symbolic Artificial Intelligence, which are currently being supplanted by Language Models — Neural Networks such as Transformers. These models can comprehend the semantics of text and function within a mathematical space, diverging from the reliance on rules and parsed parts-of-speech. However, further discussion on this shift will be provided later when we talk about Large Language Models.

2008. Behaviour Trees and Procedural generation.

The year 2008 marked a significant milestone, primarily due to the emergence of revolutionary concepts in gaming mechanics and procedural generation. This chapter delves into the transformative impact of Spore (2008), a groundbreaking video game that introduced and popularized two fundamental elements: Behavior Trees and Procedural Generation.

Spore, 2008.

Behaviour Trees

Through its inventive use of Behavior Trees enabling complex decision-making for in-game entities, and its sophisticated use of procedural generation, offering limitless variations in world creation, Spore set a precedent for future games and laid the groundwork for an entirely new level of player experience and interactivity.

Let’s first define what a Behavior Tree is. A Behavior Tree (BT) system is a library for declaring and ticking Hierarchical Finite State Machines operating on Actors. At the highest level, a behavior tree is made up of arrays of Deciders (what we previously called Triggers or Conditions). Eventually the deciders bottom out and reference specific “behaviors”. Deciders do the decision-making in the tree. A decider has a function called Decide which looks at the state of the world (what we previously called input variables in Behaviours) and calculates whether the decider should activate.

Deciders activate their referenced Behavior when they activate. Modern game engines such as Unreal Engine offer specific interfaces for modelling hierarchical trees of behaviours:

Modelling of behaviour trees in Unreal Engine 5.2

A simplified example of one of Spore’s Behaviour Trees is the following (obtained from the Spore Behaviour Trees documentation here):

ACTOR
 |-- FLEE
 |-- GUARD --> [ YELL_FOR_HELP | FIGHT | PATROL | REST ]
 |-- FIGHT
 |-- EAT ----> [ FIND_FOOD | EAT_FOOD ]
 '-- IDLE ---> [ PLAY | REST ]
                 '-- PLAY --> [ FLIP | ROLL | DANCE ]
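The tree above can be sketched in code as deciders that inspect the world state and either return a leaf behavior or descend into a subtree. A minimal illustrative version (the conditions and world keys are made up):

```python
class Decider:
    """A decider pairs a condition on the world state with a behavior:
    either a leaf action name or a nested list of deciders, as in
    Spore's hierarchical trees."""
    def __init__(self, condition, behavior):
        self.condition = condition
        self.behavior = behavior

def tick(deciders, world):
    """Return the first active behavior, descending into subtrees."""
    for d in deciders:
        if d.condition(world):
            if isinstance(d.behavior, list):   # subtree of deciders
                return tick(d.behavior, world)
            return d.behavior                  # leaf behavior name
    return "IDLE"                              # default when nothing fires

# Illustrative slice of the diagram: the GUARD and EAT branches.
tree = [
    Decider(lambda w: w["threat"], [
        Decider(lambda w: w["outnumbered"], "YELL_FOR_HELP"),
        Decider(lambda w: True, "FIGHT"),
    ]),
    Decider(lambda w: w["hungry"], "FIND_FOOD"),
]

print(tick(tree, {"threat": True, "outnumbered": True}))  # YELL_FOR_HELP
```

Ticking the tree every frame (or every few frames) is what keeps the actor’s behavior reactive to a changing world.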

Behaviour Trees were included in Unreal Engine as early as March 23, 2014. They can be used to execute branches containing logic, and rely on another asset called the Blackboard, which serves as the brain of a Behavior Tree.

An example of Procedural Generation with the Spore: Creature Editor

The Creature Editor allowed you to model your own creatures, to which textures, overlays, colours, and patterns were then applied procedurally. Wikipedia defines procedural generation as follows:

Procedural Generation (sometimes shortened as proc-gen) is a method of creating data algorithmically as opposed to manually, typically through a combination of human-generated content and algorithms coupled with computer-generated randomness and processing power. In computer graphics, it is commonly used to create textures and 3D models. In video games, it is used to automatically create large amounts of content in a game.

Spore Creature Editor

Procedural Generation uses algorithms to dynamically create diverse game content, including terrains, textures, characters, quests, and narratives. It operates on mathematical rules, with the following main components:

  • An algorithmic Core: Central to procedural generation.
  • A type of content to generate: Terrains, textures, characters, objects, quests, and narratives. Modern games like Starfield used it for star system and planet generation, while platformers such as Spelunky employ it to produce ever-changing levels.
  • Noise Functions: Key components like Perlin or Simplex noise functions produce consistent yet random values across different spaces (2D, 3D, or higher dimensions), ideal for creating varied terrains and textures.
  • Seed Values: Procedural algorithms often start with a ‘seed’ value to maintain consistency in generated content. Sharing this seed enables players to experience identical worlds, promoting the reusability of generated content.
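The seed idea can be shown in a few lines, using simple seeded value noise rather than true Perlin or Simplex noise (the function name and smoothing pass are illustrative):

```python
import random

def heightmap(seed, width, height, smooth=2):
    """Seeded value-noise terrain: same seed, same world. This is a
    simplification of Perlin/Simplex noise, which interpolate gradients
    instead of averaging random samples."""
    rng = random.Random(seed)          # the seed makes the world reproducible
    grid = [[rng.random() for _ in range(width)] for _ in range(height)]
    # Crude smoothing pass: average each cell with its 8 neighbours
    # (wrapping at the edges) to turn white noise into rolling terrain.
    for _ in range(smooth):
        grid = [[sum(grid[(y + dy) % height][(x + dx) % width]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9
                 for x in range(width)] for y in range(height)]
    return grid

a = heightmap(seed=42, width=8, height=8)
b = heightmap(seed=42, width=8, height=8)
print(a == b)  # True: sharing the seed reproduces the identical terrain
```

This is also why the memory/CPU trade-off mentioned below exists: nothing is stored but the seed, and the terrain is recomputed on demand.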

While procedurally generated content is memory-efficient by generating content on-the-fly, it can demand higher CPU usage, as the game computes content rather than retrieving it. Balancing these aspects is crucial for developers to optimize game performance.

Example of procedural generation assets with Unity

Reinforcement Learning.

Reinforcement Learning (RL) stands as the pivotal branch of AI that transforms computerized agents into adept players across the Atari gaming realm. At its core, RL involves algorithms wherein an Agent takes on the responsibility of making decisions within a defined environment. The agent refines its decision-making prowess through numerous interactions within the environment, learning and adapting its rules through a continuous process of trial and error. Harnessing reinforcement learning, an agent can swiftly attain mastery over games that were once the subject of weeks or months of practice during our childhood.

On the left, the model learning. On the right, the model after several stages of Reinforcement Learning (taken from Mauro Comi Medium Article available here)

Let’s see a recent example, MARI/O (Lua code here), a Neural Network trained using Reinforcement Learning to learn how to play Super Mario Bros and Super Mario World.

Super MARI/O, a project to train an Agent to play Super Mario Bros / World using Reinforcement Learning

Using what they call a Fitness function, they try to maximize it over a series of evolutionary epochs. Fitness is calculated from how far Mario has got and in how much time. This comes from Neuroevolution, the specific Reinforcement Learning technique used for this use case (more about this later). But before discussing anything else, let’s cover the concepts common to any Reinforcement Learning setting, from training NLP models to Agents in videogames:

  1. The Reward Function. This is the heuristic we use to make the model understand it is improving. In our MARI/O case, it is a Fitness Function, defined by the distance from the start divided by the elapsed time: R(x) = d(x, start) / t(x, start).
  2. Policy. The Policy is the brain of the agent, which contains the decisions to make given the status of the environment. In the case of MARI/O, it would tell Mario what key to push (UP, DOWN, X…) given the sprites on the screen. When we use Reinforcement Learning, we train the Policy, based on a reward function.
  3. The Agent. Our actor, which will be managed by the Policy to perform several actions, and whose actions will be evaluated by the Reward Function.
  4. The Environment. The series of Input Variables to be taken into account. In case of Mario, the sprites in the screen. This usually requires a component called the Interpreter, which transforms the information of the pixels into variables for the model.
  5. The State. Usually another input variable, it contains information not about the environment but about the actor itself, and is very similar to the logic of using Finite State Machines to track the current status of an Actor.

The commons of Reinforcement Learning: environment, agents, rewards, interpreters, states.

Based on the Input Parameters (Environment, State), the Policy gets trained on moving the Agent based on the Reward Function.
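These five pieces fit together in a loop. A deliberately tiny sketch (all names invented, no real RL library involved) of an agent walking a one-dimensional level under the distance-over-time reward above:

```python
def run_episode(policy, env_length=10, max_steps=40):
    """One episode: the policy maps an observation (environment + state)
    to an action, the environment advances, and the reward function
    scores progress as distance / time, as in MARI/O's fitness."""
    position, t = 0, 0                                 # agent position + time
    while position < env_length and t < max_steps:
        observation = {"position": position, "t": t}   # environment + state
        action = policy(observation)                   # the policy decides
        if action == "RIGHT":
            position += 1
        t += 1
    return position / t                                # reward ~ distance/time

# Two hand-written policies: training would *search* for the better one.
greedy = lambda obs: "RIGHT"
dither = lambda obs: "RIGHT" if obs["t"] % 2 == 0 else "WAIT"

print(run_episode(greedy) > run_episode(dither))  # True: better policy, higher reward
```

Training replaces the hand-written policies with a parameterized one (e.g. a neural network) whose parameters are adjusted to increase the episode reward.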

The next step in Reinforcement Learning: Neuroevolution

If you watched the video about Super MARI/O, you may have noticed that there was something else there. When training, they talk about Genotypes (Genomes), Phenotypes and Generations.

These concepts come from Neuroevolution, and basically consist of an innovative way of training a model: instead of using a fixed Neural Network topology / architecture, the topology itself (layers, neurons, etc.) changes as training runs. The Neural Network is trained over Generations, producing new Phenotypes (topologies) of Neural Networks, each of them encoded by Genotypes, which carry the weights, the interactions between neurons, and the rest of the information accumulated during training.
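A NEAT-inspired toy sketch of such topology-changing mutation (these are not the real NEAT operators, just an illustration of genomes growing over generations):

```python
import random

def mutate(genome, rng):
    """Illustrative neuroevolution mutation: a genome is a list of
    (src, dst, weight) connection genes over numbered neurons. Most
    mutations perturb a weight; occasionally the topology itself grows
    by adding a new connection."""
    genome = list(genome)
    if rng.random() < 0.8:                       # perturb an existing weight
        i = rng.randrange(len(genome))
        src, dst, w = genome[i]
        genome[i] = (src, dst, w + rng.uniform(-0.5, 0.5))
    else:                                        # grow the topology
        neurons = {n for s, d, _ in genome for n in (s, d)}
        src, dst = rng.sample(sorted(neurons), 2)
        genome.append((src, dst, rng.uniform(-1, 1)))
    return genome

rng = random.Random(0)
genome = [(0, 2, 0.5), (1, 2, -0.3)]             # two input->output genes
for generation in range(20):                     # evolve over generations
    genome = mutate(genome, rng)
print(len(genome) >= 2)  # True: the genome only grows or keeps its size
```

In a full system, each genotype is decoded into a phenotype (an actual network), evaluated with the fitness function, and the best genomes seed the next generation.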

An example of evolution of the phenotype of a Neural Network after generations, extracted from Paul Pauls Medium Article available here

If we look at the results of a MARI/O training process, we can see it reaches a high fitness score of about 4000 after 32 generations (32 evolutions of the Neural Network).

On the X axis, the number of generations of the training process. On the Y axis, the score of the reward (fitness) function.

2017-2020s. The dawn of transformers for Language Modelling. First Generative models.

The inheritance from Word Embeddings, such as Google’s Word2Vec and Stanford’s GloVe, played a crucial role in shaping the present of NLP. Word embeddings provided a way to represent words as numbers (dense vectors), capturing semantic relationships and contextual information and allowing analysis (such as similarity computation) using mathematical operations (such as cosine distance).

The cosine distance between words Coffee and Tea is much smaller than between Ball and Crocodile
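The caption’s comparison can be reproduced in a few lines of Python; the 3-d vectors below are hand-made stand-ins for real embedding vectors (which typically have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = u.v / (|u||v|); higher means more related words."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Tiny illustrative "embeddings"; the dimensions could be read as
# (drink-ness, hot-ness, animal-ness).
vec = {
    "coffee":    [0.9, 0.8, 0.0],
    "tea":       [0.8, 0.7, 0.0],
    "ball":      [0.0, 0.1, 0.1],
    "crocodile": [0.1, 0.0, 0.9],
}

print(cosine_similarity(vec["coffee"], vec["tea"]) >
      cosine_similarity(vec["ball"], vec["crocodile"]))  # True
```

Cosine distance is simply 1 minus this similarity, so related words end up close together in the vector space.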

But in the realm of NLP and AI, the real breakthrough came with the release of Transformers, working on top of embeddings. Originally introduced in a seminal paper by Vaswani et al. in 2017, Transformers revolutionized the field with their attention mechanism, enabling parallel processing of input sequences and capturing long-range dependencies.

This architectural innovation laid the foundation for various applications, including NLU (Natural Language Understanding) and NLG (Natural Language Generation). One noteworthy offspring of Transformers is BERT (Bidirectional Encoder Representations from Transformers). Introduced by Google in 2018, BERT pushed the boundaries of natural language understanding by considering both left and right context during pre-training, and a specific type of attention called Self-Attention. This bidirectional approach enabled BERT to capture richer contextual information, leading to improved performance on a wide range of NLP tasks.

Self attention heads capture which words from the surroundings are important in defining the word we are processing.

Unlike traditional sequence models, BERT’s Self-Attention allows each word to contribute to the representation of others, fostering a richer understanding of context and relationships within the sequence. This mechanism has proven to be instrumental in various NLP tasks, enhancing the model’s ability to handle intricate linguistic patterns and improving performance on tasks like language translation, sentiment analysis, and text generation.
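Stripped of its learned projection matrices, Self-Attention reduces to a few lines. The sketch below uses identity projections (real Transformers learn separate W_q, W_k and W_v matrices, plus multiple heads):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Minimal scaled dot-product self-attention. Each row of X is a
    token embedding; each output row is a weighted mix of ALL rows,
    which is how every word contributes to the representation of
    the others."""
    d = len(X[0])
    out = []
    for q in X:                                   # one query per token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]                     # attend over all keys
        weights = softmax(scores)                 # attention distribution
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # three 2-d "tokens"
Y = self_attention(X)
print(len(Y), len(Y[0]))  # 3 2: same shape as the input sequence
```

Because each output is a convex combination of every input, no token is processed in isolation, which is exactly the contextual mixing the paragraph above describes.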

As the trajectory of AI progressed, the integration of Generative Language Models emerged as a prominent application. Large language models (LLMs), particularly exemplified by OpenAI’s GPT series (GPT-1 was released in 2018, GPT-4 in 2023), demonstrated unprecedented capabilities in generating coherent and contextually relevant text. These models, trained on massive amounts of diverse data, learned to understand and replicate the nuances of human languages.

Generative AI in Video Games, first applications.

Content generation finds widespread applications across various domains of knowledge, yet its utilization within the video game industry has been relatively limited as of 2023. The initial forays into this realm included:

  1. Game Content Generation: Generation of quest texts, descriptions, character backgrounds, etc. Some games, such as AI Roguelite, are 100% AI-driven, generating the descriptions of places and the images of your own adventure.
Images and stories created by AI as you progress in AI Roguelite. Worth mentioning is the game’s reaction to an event, saying “The AI decides that a new character has appeared”.

You can now easily integrate LLMs with projects such as LLamaSharp for C#. In this case, I used a quantized Mistral 7B to provide quick, real-time content generation for my video game.

I was missing some content for my Lord of the Rings-based game and generated it with Mistral 7B as I was programming.
  2. NLP for Procedural Generation: NLP techniques contribute to procedural content generation, allowing for the automatic creation of game content such as levels, environments, and quests. One example is MarioGPT, a paper introducing level generation using GPT-2.
A figure extracted from the paper, explaining GPT’s interpretation of the Natural Language prompts and the outputs produced.

3. Adaptive Game Environments: Content generation using NLP extends to the adaptation of in-game environments based on player actions, including dynamic changes in weather, terrain, etc.

4. Localization: NLP enables the generation of localized content, allowing games to adapt to different languages, cultures, and player preferences. Uncharted 4: A Thief’s End, Final Fantasy XIV, League of Legends, Overwatch and Horizon Zero Dawn are some of the games which have used such techniques for Localization purposes.

And last, but not least, Intelligent Non-Player Characters (NPCs), which are worth being addressed separately.

LLMs in Video Games: Intelligent NPCs

One compelling avenue for the application of generative LLMs is in the realm of Intelligent Non-Player Characters (NPCs) in video games. Traditionally, NPCs were scripted with predefined dialogues and behaviors, limiting the player’s interaction to predetermined scenarios. Generative LLMs, however, offer the potential to create dynamic and responsive NPCs that can adapt to player actions and provide a more immersive gaming experience.

By leveraging the contextual understanding and generative capabilities of LLMs, game developers can create NPCs with the ability to engage in natural and contextually appropriate conversations. These NPCs can respond dynamically to the player’s choices, ensuring a personalized and evolving narrative within the game. This not only enhances the overall gaming experience but also opens up new possibilities for storytelling and character development.

One of the first of its kind was the Skyrim mod Herika - The ChatGPT Companion. By leveraging GPT-3.5, you get a companion who understands and answers your questions, while also having a personality of her own. The latest versions include Text-to-Speech (TTS, or Speech Synthesis) and Speech-to-Text (STT, or Speech Recognition) functionality, so that the answers coming from GPT-3.5 are rendered as voice, and you can dictate instead of typing.

Interaction with the companion using Natural Language Processing

There are concepts in LLMs which are very relevant in video-games:

  1. Personality. The idea of characters with their own personalities, emotions, backgrounds and goals, resembling those of a human, motivated companies such as InWorld.ai and Charisma.ai to create SDKs that make Game Developers’ lives easier.
  2. Ethics. Preventing jailbreaking is mandatory, or you may be facing situations like this:
Well, this is embarrassing…

3. Long-term memory. Remembering what has happened is crucial to mimic human behaviours. Usually this is done by using Vector Stores (a specific kind of embeddings database) to store past interactions and then retrieving them using what is called the RAG (Retrieval-Augmented Generation) approach.

The way RAG works is simple: before asking the AI to generate an answer, we query our memories Vector Store for anything from past conversations relevant to the current topic, and we send it along with the actual question. This technique is very frequently used in other fields of business NLP to retrieve context, FAQs, former discussions or interactions with clients, etc.

A schema of how RAG works
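A toy sketch of that flow, using word overlap in place of a real embedding-based Vector Store (the memory texts and function names are invented):

```python
# The NPC's "memory" store: past interactions with the player.
memories = [
    "the player rescued my brother from the bandit camp",
    "the player haggled over the price of a sword",
    "we talked about the harvest festival",
]

def retrieve(question, memories, k=1):
    """Rank stored memories by word overlap with the question. A real
    vector store would embed both and rank by cosine similarity."""
    q = set(question.lower().split())
    return sorted(memories,
                  key=lambda m: len(q & set(m.lower().split())),
                  reverse=True)[:k]

def rag_prompt(question):
    """RAG sketch: prepend the retrieved memories to the actual question
    before sending everything to the language model."""
    context = "\n".join(retrieve(question, memories))
    return f"Context:\n{context}\n\nQuestion: {question}"

print(rag_prompt("how is your brother doing?"))
```

The NPC never has to hold its entire history in the prompt: only the retrieved, relevant memories are injected per question.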

4. Real-time response. LLMs are known to be computationally and monetarily expensive. That’s why in most cases they run on external servers which can handle big workloads and are accessible via API. However, recent optimizations are allowing compressed / pruned smaller models to run on local machines leveraging gaming GPUs (not without adding bias, losing accuracy, etc.).

Picture taken from https://tinyml.substack.com/p/navigating-the-complexities-of-llm

About InWorld.ai

One of the most promising companies in the field is InWorld.ai, created by former employees of Google Dialogflow. Inworld.ai offers a comprehensive suite of features for creating lifelike NPCs, including integrations through SDKs for different engines (Unreal Engine, Unity), in Node.js, via API, etc.

Here are the main features they offer:

  1. Performance Real-time AI: Optimized for real-time experiences, Inworld offers low-latency interactions that scale. They orchestrate across multiple large language models (LLMs) to deliver higher quality interactions with faster inference and lower costs.
  2. Advanced Behavior Character Brain: This feature adds depth and realism to NPCs. Inworld’s multimodal AI mimics the full range of human expression including conversation, voice, animations, and emotions.
  3. Awareness Contextual Mesh: This feature allows NPCs to be rendered within the logic, lore, and fantasy of their worlds. It allows the addition of custom knowledge, content, safety guardrails, narrative controls, and more to keep your AI in-character and in-world.
  4. Real-Time Generative AI: They claim to be production-ready and auto-scalable, solving one of the biggest headaches of applying LLMs in real-time.
  5. Character Brain: Distinct character personalities can be configured in minutes using natural language and simple controls. Smart NPCs can learn and adapt, navigate relationships with emotional intelligence, have memory and recall, and are capable of autonomously initiating goals and actions that are integrated with gameplay.
  6. Personality: You can create distinct personalities for your AI NPCs by describing them in Natural Language.
  7. Emotions: NPCs express emotion in response to users. Emotions can be mapped to animations, goals, and triggers.
  8. Real-Time Voices: Use built-in voices for minimal latency or integrate third-party services like Eleven Labs, a known Generative AI Voice tool.
Eleven Labs webpage, offering Generative AI Voice features

9. Goals & Actions: Use client-defined triggers or activate goals through intent recognition or motivations.

10. Contextual Mesh: Keep characters rendered within the logic and fantasy of their worlds, avoid hallucinations, and give each interaction a personalized context.

11. Knowledge: Add shared lore and personal knowledge characters should have about their worlds, contexts, and backgrounds.

12. Configurable Safety: Conversational safety varies by use case; determine yourself what topics characters are able to discuss.

13. Player Profiles: Add personalized interactions with NPCs that can remember players, relationships, and context.

14. 4th Wall: Characters only draw from elements and knowledge that exist in their worlds, avoiding hallucinations that break immersion.

Inworld’s platform is designed to be accessible to writers, designers, and creators, with integrations available for popular game engines like Unreal and Unity.

In their demo webpage, you can already see a lot of creations, such as those for The Elder Scrolls V: Skyrim, Stardew Valley, Mount and Blade II: Bannerlord and Roblox:

Take a look at InWorld.ai demo page here

In particular, this is a video of what Mount & Blade II: Bannerlord looks like when you talk to an NPC in an immersive way, using InWorld:

InWorld immersive NPC in Mount and Blade II: Bannerlord.

2023. Nvidia’s DLSS to synthetically generate frames

NVIDIA’s Deep Learning Super Sampling (DLSS) stands as a groundbreaking advancement in the realm of computer graphics, particularly within the gaming industry. The innovation lies in its ability to seamlessly marry AI and gaming, fundamentally altering the landscape of visual fidelity and performance.

DLSS addresses a perennial challenge in gaming: the trade-off between image quality and real-time rendering speed. Traditionally, increasing a game’s resolution for sharper visuals meant a heavier computational load, impacting frame rates. DLSS ingeniously flips this paradigm. Leveraging the power of Neural Networks, DLSS employs a two-step process:

  1. Firstly, an AI model is extensively trained using high-resolution images. This model, through the intricate patterns it learns, gains the ability to predict how a lower-resolution image should appear in higher fidelity. This training phase is crucial, representing the neural network’s comprehension of diverse gaming scenarios.
  2. Secondly, during real-time gameplay, DLSS utilizes this trained model to upscale lower-resolution images on the fly. The result is a visually impressive output that closely resembles native high-resolution rendering. Astonishingly, this is achieved with a considerably reduced computational load, leading to higher frame rates and smoother gaming experiences.
An example of frame upscaling in Forza Horizon 5
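DLSS itself is a proprietary neural model, but the baseline it improves on is easy to show. A nearest-neighbour upscaler, the naive counterpart of step 2 above:

```python
def upscale_nearest(img, factor):
    """Nearest-neighbour upscaling: each low-resolution pixel is simply
    repeated into a factor x factor block. This is the blocky baseline
    that a learned upscaler like DLSS dramatically improves on by
    predicting plausible high-frequency detail instead."""
    return [[img[y // factor][x // factor]
             for x in range(len(img[0]) * factor)]
            for y in range(len(img) * factor)]

low_res = [[0, 1],
           [2, 3]]
print(upscale_nearest(low_res, 2))
# [[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3]]
```

Rendering at low resolution and upscaling is what saves GPU time; the quality of the upscaler determines whether the result looks blocky (as here) or close to native resolution (as with DLSS).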

Want to know more?

I hope you enjoyed the series. While we delved into various AI approaches in videogames across three posts, there are still numerous facets we couldn’t explore.

If you want to know more, I recommend you to keep an eye on:

Hugging Face posts on Video Games AI, including:

  1. AI for Game Development: Creating a Farming Game
  2. Creating an AI Robot NPC using Transformers and Unity Sentis
  3. Thomas Simonini Medium and webpage, who is doing a great job in creating a community around Video games AI.
  4. The HF Discord Game Dev channel, coordinated by Thomas Simonini. They have a video games AI course to be released soon!

Inworld.ai Blog and discord

ElevenLabs Blog

Rest assured, I’ll be producing additional content on AI in video games and sharing coding projects in Unity and Unreal, catering especially to those with a technical inclination.

Thank you for your readership. Stay tuned for more exciting insights and practical demonstrations!

About me

My name is Juan Martinez, and I’m a Sr. AI engineer who has been working in the field of NLP for about 15 years. I’m now focused on Video Game Development, and especially on the intersection between Video Games and AI. If you are in the video games arena or just want to say hi, I’m happy to connect on LinkedIn!
