Programming on Hard Mode: Markov chains for creating game titles

Reid Case
16 min readNov 2, 2021

Click to read Part 3 Defining an MVP

To really convince myself that I am actually working on a project, I figure I’ll need a name for it. It would let me refer back to the project when discussing with others at the least.

The thing is, I am horrible at naming things. I had an iguana when I was a kid. The thing lived more than 10 years. I never named it myself. I just called it lizard. My father called it Earl after the father character in the sitcom, Dinosaurs. It turned out to be an Earlene after we discovered an egg in the enclosure. I’ve made plenty of paintings and drawings, none have titles. I don’t name my bikes. I don’t usually name my cars, although one did pick up a name over time.

My home smart devices are iterations of “bun” short for bunny, what I called a pet rabbit that actually had a name that I never used. Rather, these devices are refered to by their status as not being “the bun.” notTheBun, notTheBun2, …

More importantly, I get stuck for weeks at this stage when starting a new project. I can’t come up with a name. Whether it be a website, a game, random utility program. The only thing Ive been able to name consistently are my gaming PCs. Why? I have no idea.

I need a utility to name my stuff — press a button, out comes a name.

Let’s use Markov Chains!

We will start with a list of real names. If I want a game title, I run the utility, pass it some argument, it creates its “training” set and does its Markovian magic, out comes a title.

  • I don’t need a UI. Running from Terminal or CMD will suffice.
  • I will implement this in Python, on this iPad to stick with the theme.
  • I will use GitHub for source control and also to host code snippets to post here.

Finally, I will use this utility to figure out a title for this project. If it produces a good one, I’ll purchase a related domain for it. If that domain isnt available, then I have the utitlity to keep generating titles until I find one that works.

What is a Markov Chain?

According to Wikipedia a Markov Chain, “…is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.” Wikipedia contributors. (2021, October 13). Markov chain. Wikipedia. https://en.wikipedia.org/wiki/Markov_chain

What the hell is that?

What Wikipedia is describing is a sequence of “events” that occur based on the probability that they might occur given the previous sequence of events. In other words, given some sequence of events like {I left my apartment, then I walked down the street, then I went in the cafe, then I ordered a bagel, the bagel was prepared and handed to me, I sat down at a table, and held the bagel to my mouth}, what is likely to occur next? I take a bite of the bagel? Maybe I flap my arms like chicken wings? I hit the lottery jackpot?

The most likely event to follow in the above scenario would be that I take a bite. Given more options, there may even be conflicting probable events. Maybe a friend enters the cafe and sits down at my table, maybe the server rushes from behind the counter to inform me that they forgot the lox. Anyway, the basic premise is that we have some singular “event” that is observed to frequently follow some series of “events.” This forms the chain.

In the case of generating a name for my project, the events will be characters. These will be sourced from real names of games. I will extract n-grams from this list of existing names paired with their terminating character. I will mash all these together into a massive dictionary and use that dictionary to reassemble words to form my project name.

That seems simple enough.

So, I know I need a few things. I am going to be using Python to build this Markov Chain name generating utility. I will need an IDE. I will be using GitHub for source control. All this needs to be done on my iPad Pro. I will need a list of game titles. I probably want some sort of environment within which to use this utility.

Let’s start with the IDE. I have chosen Pythonista 3 for the time being. There may be better or worse out there. This IDE seemed to cover what I need. It offers support for both Python 2 and 3. It emulates the repl so that I can run my code. I dont know all of the details and features it offers, yet.

Code written in Pythonista 3 can be saved to the local file system or to iCloud. That seems versatile enough to start. It comes with most of the modules you may need, but is still lacking a few common ones. It also has a few modules that let you interact with the iPad system itself. I am curious about playing with these, but I need to stay focused.

There is small catch, it is a paid app. It was a mere $9.99, and I feel that was fully worth it. I mean, I just spent how much on this iPad? I can spare a fellow software engineer, or team of them, $10 for an app.

Okay, the IDE seems ready. Let’s take a look.

Pythonista 3 on iPad Pro 12.9" in dark mode

Creating a file is simple enough. Click New File… and select where you want to save it. To run your script, click the play button on the upper right corner. The page will transition to the repl and show you the output.

My first Python script on the iPad
Repl area of Pythonista 3

Everything is good so far. I’m gettting excited that this whole personal challenge may work out. Let’s play with the repl just to get a lay of the land. On the bottom of the screen you can enter Python just as you would at the Terminal.

Using the repl to enter code

It will take input and print just as expected. This was an easy test, but so far everything is passing. Now lets turn up the difficulty. We may want to install a few extra modules along the way. In fact, I guarantee we will need to install other modules. Having access to a proper bash shell would be a really nice feature. Unfortunately, Apple doesn’t offer us that. There are a few apps out there that do, but I would really like them to be a part of Pythonista’s environment.

We are in luck, StaSh is wonderful module that can be used in Pythonista 3 to emulate a bash shell and allows us to do a few things as if this were a perfectly normal OS to be developing any sort of software in. It isnt perfect, and it cant do everything, but hoepfully it can do enough. This isnt a tutorial for installing and configuring Pythonista 3 with StaSh, so I will leave that part up to you. It isn’t hard though.

StaSh running in Pythonista 3 on iPad Pro 12.9"

Great, now I have a bash shell to work in if needed. I tried getting virtual environments installed, but was not successful. It seems that this is unecessary on iOS using Pythonista 3. I will see how that goes. I’d rather not have to manually edit out uneeded modules from my requirements.txt files.

On to source control. I already have a GitHub account. If you don’t, I recommend looking into signing up. It would be weird these days if you didnt have one already. There are a few other repo hosts our there. Im satisfied with GitHub.

There is a GitHub app for iOS, but it does not allow us to create local repos or really do much source controlling beyond administering repositories on GitHub. I installed it anyway. Im sure Ill need it eventually.

I still need a way to create and manage local git repositories. Trying to install git through StaSh resulted in an error similar to below.

Working Copy to the rescue. The app is free, but I highly recommend purchasing the pro tier. Again, $20 is not much in the grand scheme of things. Especially so when comparing it to the cost of this iPad. The interface is intuitive, it integrates with GitHub, BitBucket, and GitLab. There is a student version, so if you are still in school this may be a good option for you.

The interface is intuitive for someone familiar with using git and other clients. It is GUI based, so you bash shell die-hards may feel left out. I found there were some tricks to getting a folder established locally, linked in Working Copy and available in Pythonista 3. I assure you, it does work. It mostly involves linking or opening as an external folder/file. If there is a demand for a tutorial on how to use the two apps together, please comment below.

One complaint many expressed on forums and other online communities was that using Working Copy on an iCloud folder meant they would need internet access just to be productive. I have a few feelings about this.

  1. I’m working on a tablet. I didn’t spring for the LTE version, but I can always hotspot my phone and gain internet access if needed. Otherwise, I dont plan on carrying this iPad way out into the Southern California chaparral, outside of wifi or cellular range any time soon just to plunk away at this surprisingly tactile Magic Keyboard. Basically, I’ll always have some form of internet access available.
  2. You can use the Files utility/app in iOS to build out your project folder, then link it in Working Copy and create your remotes and etc., then open the folder as an external file/folder in Pythonista 3. Everything seems gravy after that. Write code, quick test if possible, commit, push, profit… If only it was that easy.

Now that the controversial subjects are covered, I can get to coding.

I need to start with a local folder to do my work in. I used the Files utility in iOS to create a new folder. Then, in Pythonista 3, I attached this folder thorugh the external files/folders option on the right navigation bar. This isnt a tutorial, but below is where I found the option. Just click Open, find your folder and add it. You also should create a file stub just to have somethign in there to commit. I’ve already created a very basic file to start with.

On to initializing my repository. In Working Copy you should add this external folder as well. Do so by linking to an external folder. Establish your remote, initialize the repository, and perform your first commit.

Everything should be ready to go.

We need a list of names of games. Storing them in a file is fine.

Read Programming on Hard Mode: Making API calls in Python using the IGDB API to learn how I got my list of game titles to use in this post.

Now that we have a list of game titles we can begin with the generator. Start by drafting out how we may want to interact with the generator. I have no idea how you all may start, but this helps me get an idea for the structure of the classes I might use.

Basically, we need to bring in the list and do some minor processing to it to make sure its useable. Of course, I dont want to do this within my title generator. I want to be able to use this generator for other data sources in the future, maybe…

Next I want to be able to create an instance of this object that is the title generator, set it up with a “pad,” pass it the list through fit(), then call transform() to get a title back. The title will be printed.

I commented out the import of numpy because of my weird issue with Pythonista 3 and StaSh. One day I’ll deal with this.

Moving on, create two class stubs. Each has functions within them based on how I wanted to interact from the previous gist. We should be able to instantiate each, passing my “pad” length to the Markov Chain, then retrieve the data list from my DataHandler, pass it to the Markov Chain’s fit(), then retrieve a game title through the Markov Chain’s transform().

To give us something to work with, the get_list() function of DataHandler returns a basic, multi-string list.

Before we go any farther, lets discuss the basic operation of this Markov Chain we are trying to make. Its honestly really simple. It takes a list of words (game titles), processes them in such a fashion that it can use them, builds a model from the list, and reconstructs new words (game titles). It does this with probabilistic forecasting. Its really not all that exotic relative to other modern forecasting methods in Natural Language Processing.

How does this model forecast words? Well, it really just forecasts the next character in a word based on previous character(s). The previous character(s) is defined by our “pad.” The pad length is the length of the n-gram we form from previous sequentially appearing characters. If we are given the word elephant, and we are given a pad length of 3, then some of the legal n-grams formed of the word elephant might be ele, pha, ant, and so on. Illegal n-grams, for our purposes, might be ept, hel, and so on.

These n-grams are paired with the next observed character such that for each instance of an n-gram there is a single character that follows it. Using the elephant example we might see n-gram:character pairs of ele:p, pha:n, lep:h and so on.

But wait!

What about ant? What follows the last n-gram? Also, what about the first character? What precedes it? How do we handle the beginning and end of the word (game title)?

A trivial solution is to “pad” the word on either end. Hence “pad.” This acts as a prefix and suffix and must be added to each word (game title) in our list before building the model. Using the elephant example, we would also see n-gram:character pairs consisting of our pad:first character, and on the other end, some set of ending character(s) plus a few pad characters paired to a single pad character.

For this code I decided to use the underscore character (_). If I determine to use a pad of length 3, as in the elephant example, I will produce pairs as follows, ___:e, __e:l, _el:e, ant:_, nt_:_, and t__:_.

This also gives us a trivial stopping condition when generating words (game titles). If my current sequence is equivalent to “pad,” then I stop. It also provides a bit of a warm start, using “pad” to start the generation sequence. However, if I prime the resulting sequence with “pad” then the current sequence will be “pad” and the algorithm may decide its time to stop predicting. This isn't a hard problem to solve as we will see.

Let’s create a utility class for our Markov Chain to use for establishing this padding feature. The utility class gets loaded up with a padding character and the length of the pad desired. It has accessors to get the pad itself, and to get the length of the pad — probably redundant. Additionally, it has functions to add the appropriate pads to a list of words passed in and a function to remove the padding from a single word passed in.

The MarkovChain class is updated to reflect the usage of the new utility class. When it is initialized, the utility class is instantiated. When fit() is called, pad_data() is applied to the data list passed in. Notice that I updated the function parameters to be a little less confusing — maybe… The transform() function now applies strip_pad() to the data passed out of it.

Next, let’s work on the data handler just to get it out of the way. Remember, we can’t use Pandas… So, just open the csv using the csv module and run it out into a list. Be careful not to bring the list in as a list of lists of game titles, nor do you want to bring it in as a list of lists of characters. We want a list of game titles. Nothing more.

I also modified the MarkovChain class to use the data it is getting when it’s functions are called. We should be able to test all of this. The data handler should open the file, read everything into a list, pass that along to the Markov Chain. The chain will add the padding to every word, then print to list for a sanity check. When transform() is called it will strip the padding from the first element and return that to be printed.

Hopefully it does this quickly. I may do future posts about parallelizing this just to see if I can use the multiprocessing module in Python on this iPad. For my list of 50000 game titles, this code runs fast enough.

With the utility and data handler out of the way, we can get back to business with the Markov Chain. It will need an object to hold the model. It will also need code to build that model, then code to use that model to create new game titles.

The data structure of choice is a key-value object like a dictionary. At the highest level, the model will have keys representing each of the n-grams extracted from each of the game titles that it experiences. It should have no duplicates. From the elephant example the keys would be ele, lep, eph, pha and so on. The values for each of these keys would be a nested key-value dictionary like object where the keys are the characters that are observed following each n-gram key. The values at this level would be the probability for which they should be expected to follow the n-gram.

To illustrate I need a few more words. Lets use dog, cat, frog, bird, fox, and owl. This collection of words would produce the following dictionary using a length 2 n-gram.

Notice that the values are not the raw counts, but the percent times the value followed the n-gram key. In the case of the raw pad, we have an interesting set of probabilities for which actual characters may start the sequence. Also note that for the n-grams og and g_, which appear twice in the data each, the character _ follows each 100% of the time, just as it does for the other terminating n-grams.

Scale this idea out to the 50000 game titles I have in a list from IGDB.

My solution is pretty ugly, but it works and does so quick enough. I go through each padded game title, break it into n-grams of the desired length, check if it exists in my dictionary, if it doesn't, add it with its own dictionary with a single key of the character we also identified and its raw count of appearance. If it does exist, then I check if the character exists, if it doesn't, I add it and the value 1. If it does exist I increment it’s raw count.

You may notice that I added raw count and not the percentage of times it appeared for each n-gram. I run through the keys of the model one last time and calculate that percentage and replace the raw count with it.

There may be a prettier, more performant, and/or more Pythonic way of doing this. Again, it works and its fast enough with my list of games. I find this fairly legible. For those of you copying this, feel free to optimize and comment about your optimizations.

We have made some headway. Now all we need is for the function transform() to use the model to reconstruct a potential game title. To do this is somewhat the reverse of building the model. We start with a seeded pad n-gram, and randomly select the next character from the recorded observations based on the probability it appeared. Then we advance the pointer, establishing a new n-gram that includes the new character, and then do it all over again. Remember that we should stop when we get the pad again. Infact, we can save a few cycles by stopping when we hit a single pad character, since we are not expecting the pad character in the dataset (this is an assumption. Beware, one day we could end up with this character in the dataset).

Finally, strip the padding off and return the generated game title. Below is the entire code used.

I added a loop to create a bunch of potential titles, just to get an idea for its performance. I also added a check against the original list as I would like to know if the title already exists. I wouldn't want to rip someone else off.

Below are some of the game titles it generated that were not in the list using a pad length of 5.

  • Raft Adventures of the Party Phrase Files: City: Episode One
  • Total Chavo
  • Fight
  • Jack & Dressup Salon
  • Bliss Insane In The Fallenge
  • Point Blackwater League 3
  • The White
  • Rifter Sports 2: Heidi-Express Up
  • Bombers
  • Dream Shovel
  • Baja Bugged Union
  • Imagine: Nihon Prison
  • Call of Fate
  • The Time
  • Super Duel
  • Star Vipers
  • Everyday Cat’s Chess Blitzy
  • Line Races
  • ABC Memorial by Fire Blizzard
  • Slickpoo : The Game

Play around with pad length. I found that the longer the length, the more likely it was to create a game title that was already in teh list, or more closely resembled one there. Shorter lengths produces somewhat more erratic titles. Length 3 is a bit too erratic. Length 4 produced some whacky titles such as Space Twistery Boxes — Toddle — Champions — KI-Luxury of Destructor X: Steam House Vol. 1: BR 411 ‘ICE-T’ EMU Add-On. Yes, thats one single long title. Using length 1 seems to take too long to run, or I may have a bug in my code. Either way, I dont expect a length of 1 to produce coherent titles. Cranking it up to 20 resulted in just pure copies from the list.

So which game title will I choose? I’ll let you know next post. Great work! Stage and commit so you can show it off.

--

--