NPC-GPT — An exploration of large language models in video games

Sean Kim
May 15, 2023


Code: https://github.com/kylearbide/npcgpt

By Kyle Arbide, Sean Kim, and Asare Buahin

Introduction

The video game industry has seen incredible growth over the past few decades, fueled in large part by advances in technology and the increasing accessibility of gaming to a wider audience. With the rise of smartphones and other mobile devices, gaming has become more convenient and accessible than ever, and the industry has responded by producing more diverse and engaging titles that cater to a wider range of interests, preferences, and player types. As a result, the video game industry now generates more revenue than the film and sports industries combined, bringing in billions of dollars each year and employing millions of people worldwide. The immersive and interactive nature of gaming has also created passionate communities of players who are deeply invested in the games they love, further fueling the industry's growth. The COVID-19 pandemic bolstered this success even further: as people around the world quarantined in their homes, many turned to video games as an outlet for entertainment and social connection.

With the recent rise of large language models such as BERT and ChatGPT promising to transform every industry in some way, the video game industry seems like a natural fit to benefit. In many ways, video games have become interactive movies and a robust storytelling medium. With limits on budgets and staff hours, especially for indie games, large language models can alleviate some of the development overhead while giving players a dynamic, unique, and more immersive experience. Leveraging this technology, writers can focus on the main storyline dialogue to keep the player experience controlled, and use large language models for supplementary in-game experiences such as side-character NPC dialogue.

The potential benefits of this approach are obvious. Not only could these large-language-model-powered NPCs deliver unique dialogue, they would also allow players to respond and converse in a way that closely resembles natural conversation, further increasing immersion. This approach could also greatly benefit a game's replay value and post-game experience. These NPCs could serve a wide range of functions, from assigning side quests and offering in-game transactions to populating an otherwise sparse map or simply fleshing out the game's universe.

Our Approach

There are multiple layers to consider when creating an engaging and interactive NPC. Natalie Mikkelson from GameDeveloper describes worldbuilding and writing good NPC dialogue through three general steps: building the world, establishing a style, and knowing the different dialogue types. Each of these factors presents its own considerations and potential solutions when building NPCs powered by large language models. To address each of these points effectively, we first had to determine the scope of our proof-of-concept project. After researching different games, we settled on building our proof of concept for Stardew Valley. Stardew Valley is computationally inexpensive while offering a robust modding framework and an extensive modding community. It is also relatively open-ended in terms of story and caters to different player types, allowing us creative freedom in how we designed our NPC without being restricted by a linear or strict storyline. With these considerations in mind, we came up with a three-fold approach.

Starting with building the world, it is important that the NPC, as a character within the game, has a realistic and believable place in the game's universe. This is where the first step of our approach comes in. The first model is a character generation model that dynamically generates an NPC's backstory and persona on the fly. Mikkelson describes building out your world as laying out the geography and fleshing out the specific townships, races, relationships, and factions. These steps serve as the mission statement of the character generation model: the character's short backstory and persona have to carve out a realistic place for the character within the game, including where the character is from, what relationships (if any) the character has within the game, and what hobbies and interests the character has.

Next, we considered how to establish a dialogue style for the NPC and how to pick between the different NPC dialogue types that are common in video games. This is where the second step of our approach comes in. The second model is a character dialogue model that takes the generated backstory and persona from model 1 as input and allows the player to converse with the NPC. Picking a dialogue style was straightforward, since the dialogue model would be trained on the existing pre-written dialogue from other NPCs in the game. This ensures the dialogue model closely follows the writing style of the existing game characters and, as a result, fits more seamlessly within the game's world.

Finally, Mikkelson describes the common dialogue types that NPCs are usually given during game development. There are two broad categories: idle chatter and actionable conversations. Since the dialogue model is dynamic enough to hold conversations with the player, fulfilling the idle-chatter side, we focused on the potential actionable intent of the dialogue. This is where the third step in our approach comes in. The third step uses two instances of the spaCy Matcher class to identify and extract any actionable intent in the conversation. In Stardew Valley, there are two main types of quests: item-retrieval quests and mob-related quests. The NPC might say to the player, "Hey there, I heard you're quite the adventurer. Would you be willing to collect 10 cranberry pips for me? I need them for a new recipe I'm working on." To dynamically process this quest, the third script needs to identify the target item (cranberry pips), the target quantity (10), and the potential reward for completing the quest (if applicable).

Part 1: Character Generation

1.1 Training Dataset

Vanilla Stardew Valley has only 45 existing villagers (NPCs). Since we wanted the generated short bios/personas used for training to be similar in structure, formatting, and content to the existing characters, we needed a way to generate new sample character bios to supplement the limited number of existing NPC backstories in the game. To do this, we used ChatGPT with the following prompt: "Can you generate sample Stardew Valley characters with unique names and full paragraph bios at least 50 words long similar to the Stardew Valley villager wiki pages. Make sure the characters are from the various locations in the game Stardew Valley." We ended up with about 2,500 sample character bios to train on. Each bio had a similar format, starting with the character's name, then providing some personality descriptors, an occupation, a hometown/village, and additional information about hobbies or relationships.

Example: “Jenna is a friendly and outgoing woman who runs the local pub in Pelican Town. She loves to chat with her customers and is always willing to lend an ear to those who need it. Jenna is also an accomplished chef and loves to experiment with new recipes.”
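For readers who want to reproduce this step programmatically rather than through the ChatGPT interface, here is a minimal sketch using the same openai.ChatCompletion interface we use later for the dialogue dataset. The model name, batching loop, and response parsing are illustrative assumptions; only the prompt text comes from our actual process.

import openai

PROMPT = (
    "Can you generate sample Stardew Valley characters with unique names "
    "and full paragraph bios at least 50 words long similar to the Stardew "
    "Valley villager wiki pages. Make sure the characters are from the "
    "various locations in the game Stardew Valley."
)

bios = []
while len(bios) < 2500:  # target dataset size
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumption; any chat-capable model works
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=1000,
    )
    text = response["choices"][0]["message"]["content"]
    # each completion contains several bios separated by blank lines
    bios.extend(p.strip() for p in text.split("\n\n") if p.strip())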

1.2 Dataset Preparation

The starting point for the character generation model was OpenAI's pre-trained GPT2LMHeadModel. The first step in preparing the data for fine-tuning was to create a custom PyTorch Dataset class to hold, format, and load the training data from the CSV file. The custom Dataset class tokenizes each bio with the GPT2Tokenizer, adds a beginning-of-sequence token to the front of the bio, appends an end-of-sequence token to the end, and stores the resulting record as a LongTensor.

def __init__(self, bios: pd.Series, gpt2_type='gpt2', max_length=1022, truncate=0, **kwargs):
    ''' Constructor.

    Parameters
    ----------
    bios : pd.Series
        Pandas series of sample bios to populate the dataset.
    gpt2_type : str
        GPT-2 type to load in from the transformers module.
    max_length : int
        Specifies the maximum length of a token sequence after tokenization.
    truncate : int
        Whether to truncate the dataset to a specific size.
    **kwargs : dict, optional
        Arbitrary keyword arguments.
    '''
    self.tokenizer = GPT2Tokenizer.from_pretrained(gpt2_type, **kwargs)
    self.bios = []

    for idx, text in bios.items():  # Series.items(); iteritems() is deprecated

        if (truncate > 0) and (idx == truncate):
            break

        # tokenize the bio, sampling a random max_length-token window if it is too long
        bio_tokens = self.tokenizer.tokenize(text)
        if len(bio_tokens) > max_length:
            start = np.random.randint(len(bio_tokens) - max_length)
            bio_tokens = bio_tokens[start:(start + max_length)]

        # wrap the token ids in bos/eos tokens and store as a LongTensor
        self.bios.append(torch.LongTensor([
            self.tokenizer.bos_token_id,
            *self.tokenizer.convert_tokens_to_ids(bio_tokens),
            self.tokenizer.eos_token_id
        ]))

    self.bios_count = len(self.bios)

Before training could start, we calculated the word count of each sample bio and filtered out bios that were too short (fewer than 35 words) or too long (more than 80 words). Then a small 15% test set was split off from the training dataset. In the test set, the last 35 words of each bio were removed and moved to a separate data column to serve as the label.
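A minimal sketch of this filtering and splitting step, assuming the bios live in a pandas DataFrame (the file name, column names, and random seed are illustrative assumptions):

import pandas as pd

df = pd.read_csv("sample_bios.csv")  # hypothetical file name
df["word_count"] = df["bio"].str.split().str.len()
df = df[(df["word_count"] >= 35) & (df["word_count"] <= 80)]

# hold out 15% of the bios for evaluation
test = df.sample(frac=0.15, random_state=42)
train = df.drop(test.index)

# in the test set, the last 35 words become the label to predict
test["label"] = test["bio"].apply(lambda b: " ".join(b.split()[-35:]))
test["bio"] = test["bio"].apply(lambda b: " ".join(b.split()[:-35]))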

Lastly, a custom function, pack_tensor, was used to dynamically batch the data into a single tensor of a maximum length. Because of the size of the GPT2LMHeadModel and of the data, this lets the data be processed more efficiently, saving computational resources compared to padding individual tensors with 0s to make them all the same length.

def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    ''' Dynamically batches data into a single tensor of length max_seq_len. Due to the size of
    GPT-2 and the data, this is done to more efficiently use computational resources instead of
    padding single tensors with 0s to make them the same length.

    Parameters
    ----------
    new_tensor : torch.Tensor
        The new data to be batched.
    packed_tensor : torch.Tensor or None
        The existing packed_tensor to add the new_tensor to.
    max_seq_len : int
        Maximum sequence length for the packed tensors.

    Returns
    -------
    (torch.Tensor, bool, torch.Tensor or None)
        The new packed tensor. Boolean value indicating whether the packing was successful.
        If the packing was not successful, the tensor that could not be packed, None otherwise.
    '''
    if packed_tensor is None:
        return new_tensor, True, None
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        return packed_tensor, False, new_tensor
    else:
        # eos and bos tokens are the same, only need one between sequences
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None
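To make the three return values concrete, here is a minimal sketch of how a loop might consume pack_tensor. The names bio_tensors and train_step are hypothetical stand-ins for the tensors built by the Dataset class and a single optimization pass, each bio is assumed to be shaped (1, seq_len), and the 768 max length is illustrative:

packed = None
for bio in bio_tensors:
    packed, fit, leftover = pack_tensor(bio, packed, max_seq_len=768)
    if fit:
        continue        # the bio fit into the current pack; keep accumulating
    train_step(packed)  # hypothetical: one forward/backward pass on the full pack
    packed = leftover   # the bio that didn't fit seeds the next pack
if packed is not None:
    train_step(packed)  # flush the final, partially filled pack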

1.3 Model Training/Evaluation

After the data was loaded and formatted, the training function is called. The training function uses a linear schedule with 200 warmup steps, the AdamW optimizer, and a value of -1 for num_training_steps. Since we are working with text data, the labels used when computing the intermediate outputs during training are the input tensor itself. The loss is then computed and the next epoch starts. We found that training for 15 epochs was sufficient to avoid overfitting while saving computational resources.
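A minimal sketch of that training setup, assuming model is the GPT2LMHeadModel and packed_batches are the tensors produced above; the learning rate is an illustrative assumption, while the warmup steps and num_training_steps values are the ones described above:

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=2e-5)  # lr is an assumption
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=200, num_training_steps=-1
)

model.train()
for epoch in range(15):
    for packed in packed_batches:
        # for causal language modeling the labels are the inputs themselves;
        # the model shifts them internally to score next-token predictions
        loss = model(packed, labels=packed)[0]
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()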

The evaluation function is called next to test the trained model on the 15% test set. We used top-p (nucleus) sampling to predict the next word in the bio, with a top_p value of 0.80 and a temperature of 1.00. Since the last 35 words were split off from each bio to serve as the label, the model gets fed the first (bio_length - 35) words as a seed prompt. The maximum number of words to generate is set to 60, so the prediction loop stops either when the next predicted token is the end-of-sequence token or when it hits the 60-token maximum.

for word in range(bio_length):

    # get the prediction for the next word
    outputs = model(prompt_toks_ids, labels=prompt_toks_ids).to_tuple()
    # unpack the output
    loss = outputs[0]
    logits = outputs[1]
    hidden_state = outputs[2]
    # slice just the predictions for the last word and then divide by the temperature
    logits = logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

    # sort the logits with the most likely first
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    # apply the softmax function to the logits to convert them to probabilities,
    # then apply the cumulative sum function along the column
    cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

    # create a boolean tensor indicating which indices to set to the filter value
    remove_indices = cum_probs > top_p
    # we never want to remove the first token, so as not to end up with an empty tensor;
    # shift the values to the right (the last index will always be greater than top_p since it equals 1)
    remove_indices[..., 1:] = remove_indices[..., :-1].clone()
    # set the first index to False (0) so it will never get dropped
    remove_indices[..., 0] = 0
    # use `remove_indices` as a boolean mask on the sorted indices
    indices_to_remove = sorted_indices[remove_indices]
    # replace the selected logits with the filter value (-inf)
    logits[:, indices_to_remove] = filter

    # after the filter values have been assigned, re-compute the probabilities and sample one token
    next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
    # concatenate the new predicted token id to the original encoded prompt
    prompt_toks_ids = torch.cat((prompt_toks_ids, next_token), dim=1)

    # boolean to determine whether the bio has finished
    finished = (next_token.item() == tokenizer.eos_token_id)
    if finished:
        break

num_generated = (prompt_toks_ids.shape[-1] - num_token_ids)

output_list = list(prompt_toks_ids.cpu().squeeze().numpy())
# only grab the generated text
generated_list = output_list[-num_generated:]
generated_text = f"{tokenizer.decode(generated_list)}{'' if finished else tokenizer.eos_token}"

return generated_text

Part 2: Character Dialogue

2.1 Dataset Building

2.1.1 Choosing Conversations

When building the dataset for our character dialogue, we needed to decide what types of conversations to include, since the types we selected would determine the capabilities of the NPC dialogue. For our use case, we focused on conversations that discussed the NPC's background, discussed the in-game weather and season, defined items within the game, and prompted quests for the player. We felt these conversation types were enough to show, as a proof of concept, an NPC's ability to understand its environment and reflect its persona. The last two types, definitions and quests, were particularly important because they let us tie in the game's knowledge base.

2.1.2 Knowledge Base

Collecting and building conversations around the game's knowledge base is key to ensuring the NPCs are embedded in the reality of the world they exist within. A dictionary of crops, minerals, mobs, and other in-game items was created by mining the game's wiki, and each entity was assigned both a definition and a quest prompt. This ensured that our training data consisted only of items that exist within the game, allowing for a more immersive experience.
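For illustration, the knowledge base consumed by the prompt-building code in the next section might look roughly like this (the "locations" and "mobs" keys appear in our code; the remaining keys and the specific entries are assumptions):

kb = {
    "locations": ["Pelican Town", "Cindersap Forest", "The Mines"],
    "mobs": ["Green Slime", "Bat", "Shadow Brute"],
    "crops": ["Parsnip", "Cranberries", "Starfruit"],
    "minerals": ["Quartz", "Amethyst", "Earth Crystal"],
}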

2.1.3 Prompt Building

Originally, the plan for the NPC dialogue model was to train on the original game dialogue. This repository by Sean Roberts and Stephanie Rennick made it simple to mine the game files and create a structured dataset. After training on this corpus, we quickly ran into its limitations. Mainly, in Stardew Valley it is very rare to have a conversation where the player provides any input at all, meaning our corpus was full of one-liner NPC conversations. The resulting model was limited in its flexibility to player input, so we ultimately decided to go in a different direction.

Our new dataset would be created by prompting the ChatGPT API and storing the results in the form of a conversation. The availability of the API allowed us to automatically feed prompts concerning different items, conversations, and NPC personalities to ensure diversification of the dataset. This also meant we could easily tie in the knowledge base by automatically generating prompts for each item.

def run_kb_calls(kb: dict):
    outputs = []
    # Create the personalities list
    personas = format_personalities(personas_df)
    for key in kb.keys():  # Loop through the knowledge base
        if key == "locations":
            for item in kb[key]:
                prompt = build_location_input(item)
                response = openai.ChatCompletion.create(model=model,
                                                        messages=prompt,
                                                        max_tokens=300)
                response_text = response['choices'][0]['message']['content']
                personality = random.choice(personas)
                output = convoSample(personality, f"tell me about the {item.lower()} location .", format_input(response_text))
                output.add_candidates([])
                print(output.to_json())
                outputs.append(output.to_json())

        elif key == "mobs":
            for item in kb[key]:
                prompt = build_location_input(item)
                response = openai.ChatCompletion.create(model=model,
                                                        messages=prompt,
                                                        max_tokens=300)
                response_text = response['choices'][0]['message']['content']
                personality = random.choice(personas)
                output = convoSample(personality, f"tell me about the {item.lower()} mob .", format_input(response_text))
                output.add_candidates([])
                print(output.to_json())
                outputs.append(output.to_json())

        else:
            for item in kb[key]:
                # singularize the category name ("crops" -> "crop")
                item_type = key.replace("_", " ")
                if item_type.endswith("s"):
                    item_type = item_type[:-1]
                prompt = build_item_input(item, item_type)
                response = openai.ChatCompletion.create(model=model,
                                                        messages=prompt,
                                                        max_tokens=300)
                response_text = response['choices'][0]['message']['content']
                personality = random.choice(personas)
                output = convoSample(personality, f"tell me about the {item.lower()} {item_type.lower()} .", format_input(response_text))
                output.add_candidates([])
                print(output.to_json())
                outputs.append(output.to_json())

    return outputs

The result was a dataset of roughly 15,000 two- to six-line conversations, each surrounding one of the aforementioned topics. The conversations also specified a variety of personas, each generated by model 1.

2.2 Model Training, Evaluation, and Limitations

For training our conversational model, we followed this Medium post, which guided us in using PyTorch and Hugging Face to fine-tune GPT models. The guide also implements persona-based conversation, which is exactly what we were hoping to achieve. The training method uses a multiple-choice loss, where multiple candidate lines are presented to the model along with the correct option. A sample entry of our dataset can be found here. The base model we selected was DialoGPT, as it is the most compatible with the conversational use case.
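For readers without access to the link above, a dataset entry in the persona-chat style format the guide uses looks roughly like this (the field names follow that format; the content is invented for illustration and is not an actual record from our dataset):

sample = {
    "personality": [
        "jenna is a friendly and outgoing woman .",
        "she runs the local pub in pelican town .",
    ],
    "utterances": [{
        "history": ["tell me about the parsnip crop ."],
        # the last candidate is the correct reply; the others are
        # distractors scored by the multiple-choice loss
        "candidates": [
            "i hear the mines are dangerous this time of year .",
            "the parsnip is a hearty crop that grows in the spring .",
        ],
    }],
}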

After some hyperparameter tuning, we ended up with a model that can hold plausible conversations with relative consistency. For example, when the model is prompted to introduce itself, it provides a short description of its AI-generated personality. It can also be prompted to give the player quests, descriptions of the weather, and item descriptions.

The model clearly does not perform well when moved outside its training distribution into conversations on other topics. Whether this is a result of model selection, catastrophic forgetting, or our training set and methods is unclear.

Part 3: Entity Recognition

For the entity recognition portion of our approach, we used two spaCy Matchers to detect whether the NPC dialogue carries any actionable intent. The first matcher detects whether one of three patterns is present in the dialogue: a buy-transaction pattern, an item-quest pattern, or a mob-quest pattern. When one of the three patterns is detected, the second matcher isolates and returns the relevant information in the dialogue, including which pattern was detected, the target quantity requested, the target item or mob, and the price or reward if specified.
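A minimal sketch of the first matcher, with a simplified item-quest pattern (the actual patterns we use are more elaborate; the token attributes below are illustrative):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# e.g. "collect 10 cranberry pips", "bring me 5 quartz"
item_quest = [
    {"LEMMA": {"IN": ["collect", "bring", "gather", "find"]}},
    {"POS": "PRON", "OP": "?"},                            # optional "me"
    {"LIKE_NUM": True},                                     # target quantity
    {"POS": {"IN": ["NOUN", "PROPN", "ADJ"]}, "OP": "+"},   # target item
]
matcher.add("ITEM_QUEST", [item_quest])

doc = nlp("Would you be willing to collect 10 cranberry pips for me?")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)
# prints matches such as: ITEM_QUEST -> collect 10 cranberry pips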

Part 4: Integration into Stardew Valley: The Modding Process

Stardew Valley is equipped with a robust modding framework called SMAPI, which provides a variety of tools and APIs for creating Stardew Valley mods in C#. For our use case, we leveraged content packs, event-based actions, and in-game components to fully integrate our AI pipeline into the game. We also leveraged a series of existing C# mods and packages, including IronPython, ML.NET, and ONNX Runtime.

4.1 Content Packs and Event Triggering

Content packs allow you to add and edit in-game maps, NPCs, and warp locations. Using content packs, we created a sample character that serves as the stand-in for our AI, named 'CaptAIn'. A sample area was also created for CaptAIn to stay within, ensuring the character is easy to locate.

Event triggering also plays a large part in making the models' integration seamless to gameplay. With it, we can link actions that trigger on player inputs. This is used to trigger the character generation model when CaptAIn is clicked, trigger the conversational model when the player finishes typing, and trigger quest-granted and quest-completed notifications. Events in Stardew Valley must be pre-determined and cannot be updated during gameplay, so we worked around this by adding a linked chest that serves as the quest-completion checkpoint.

4.2 Model Integration

Once event triggering was established, we needed a way for player interactions to be fed into our models and for the outputs to be displayed in the game interface. This was particularly challenging because, up to this point, all of our model interfacing had been done in Python, while Stardew Valley is only programmable in C#. We tested a number of solutions, including the following methods.

4.2.1 Models as .exe files

While we were not able to trigger the Python methods directly, we could run executable files containing the Python model-interaction code on event triggers. Using auto-py-to-exe, we created one executable for the character generation model and another for a combination of the conversational and quest recognition models. With this method, we achieved full game integration: user inputs could be fed to the executable as command-line arguments, and outputs could be read using C#'s StreamReader. While this was a fully functional solution, its main limitation was speed. With executables, models cannot be pre-loaded, meaning each interaction required re-initializing the model as well as prompting it. Under this method, every personality generation event takes ~10 seconds and each conversation event takes ~5 seconds.

public string run_convo(string user_input, string persona)
{
    ProcessStartInfo start = new ProcessStartInfo();
    start.FileName = @"file-path";
    start.Arguments = $"--user_input \"{user_input}\" --persona \"{persona}\"";
    start.UseShellExecute = false;        // Do not use the OS shell
    start.CreateNoWindow = false;         // We don't need a new window
    start.RedirectStandardOutput = true;  // Any output generated by the application will be redirected back
    start.RedirectStandardError = true;   // Any error in standard output will be redirected back (for example, exceptions)
    using (Process process = Process.Start(start))
    {
        using (StreamReader reader = process.StandardOutput)
        {
            string stderr = process.StandardError.ReadToEnd(); // Here are the exceptions from our Python script
            string result = reader.ReadToEnd();                // Here is the result of stdout (for example: print("test"))
            return result;
        }
    }
}
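On the Python side, the packaged executable's entry point might look like the following sketch. The argument names mirror the C# caller above; generate_reply is a hypothetical stand-in for loading the models and producing a response:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--user_input", required=True)
parser.add_argument("--persona", required=True)
args = parser.parse_args()

# model loading and generation happen on every invocation, which is
# why each conversation event takes roughly five seconds
reply = generate_reply(args.persona, args.user_input)  # hypothetical helper
print(reply)  # the C# StreamReader picks this up from standard output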

4.2.2 Models using onnxruntime

ONNX Runtime is an inference package that supports optimizing and packaging models for integration with user-facing applications, and it interoperates with Hugging Face models. Because ONNX Runtime includes C# bindings, we could solve some of our speed issues by preloading the models on game start. To achieve this, we used a combination of PyTorch, Hugging Face, and ONNX to convert our existing models into the correct format. While much more promising than executables, this method brought its own limitations: C# is not equipped with the same tokenizers that are available in Python. We attempted to replicate the tokenizers in C#, but have not had success.
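A minimal sketch of the export step for a GPT-2-based model (a hedged illustration, not our exact script: the output path and opset are assumptions, and disabling the cache and dict outputs simplifies tracing):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# return_dict=False and use_cache=False give the tracer a plain
# (logits,) tuple instead of a ModelOutput with past key/values
model = GPT2LMHeadModel.from_pretrained("gpt2", return_dict=False, use_cache=False)
model.eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

dummy = tokenizer("Jenna is a friendly", return_tensors="pt")["input_ids"]

torch.onnx.export(
    model,
    dummy,
    "character_generation.onnx",  # hypothetical output path
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                  "logits": {0: "batch", 1: "sequence"}},
    opset_version=13,
)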

4.3 Intent Recognition to In-Game Event

Of the similar work we explored during this project, we feel this is the most distinctive feature our pipeline of models provides. Unlike most existing conversational AI models that have been adapted for video games, we aimed to take things a step further and link the outcomes of conversations directly to in-game events. We achieved this by running our entity recognition model across the conversation and returning any matches. Once the game code receives a match, it triggers a series of actions, including prompting the user that a quest has been assigned and changing the quest requirements to include the game items and quantity specified in the conversation. We hope this type of integration serves as a proof of concept for AI integration in games, showing how immersive and seamless it can actually be.

4.4 Final Integration

In the end, we were able to use .exe files called from C# to create a fully functional AI NPC within Stardew Valley. To view a short demo, check out this video:

5: Future Direction

In the time we have been building this project, the portfolio of open-source large language models has grown greatly in both size and capability. Projects such as gpt4all have made using and training the next generation of large language models much more accessible for use cases like ours. Updating our dataset and training processes to use these kinds of models could greatly increase the quality of our conversational outputs.

Additionally, we would consider hosting the pipelines behind an API and publishing our mod for public use. Serving the models through an API would make running the mod feasible for users with lower-performance devices, and player feedback would provide significant insight for building additional features and tuning the models.

Finally, a further avenue for our project is creating more visually detailed character models. Due to time constraints, we did not create a unique character sprite for the NPC we tested. Sprites could provide another layer of detail to pair with our character dialogue model and the quest events generated with the help of our entity recognition model. We could also expand NPC customization by exploring dynamic NPC movement schedules. Each character in Stardew Valley has a specific movement pattern that puts them in a specific area at a specific time based on in-game events. Creating a dynamic schedule for an NPC would let us incorporate place and time into our character dialogue and entity recognition models, making for more unique character dialogue.

Conclusion

Artificial intelligence products such as the one we demonstrate here provide significant benefits to both players and game developers. Players get a more immersive experience, where every non-player character is equipped to engage in situational, unique conversation and to influence gameplay and quests. Developers get tools to create an immersive world, expand the immersiveness of existing worlds, or build entirely new concepts that were never before possible. We hope that our project provides insight and inspiration for others to discover the possibilities of AI in the gaming space.

TLDR

We built fully conversational AI NPCs in Stardew Valley! These NPCs have unique AI-generated personas and can hold a variety of conversations. We are also among the first to translate conversations into in-game quest events. The AI is fully modded into the game, and a demo of the functionality can be seen here:

If the player base desired it, we would consider building this out publicly, as well as improving loading speeds, model performance, and mod features.
