Programming Rewordable: A Tale of Computer-Assisted Word Game Design

This is the story of how I failed to make a computer-generated word-building card game.

The game in question is Rewordable, which I co-designed with Adam Simon and Tim Szetela. The tagline we chose for the game is “the uniquely fragmented word game,” which is a good description: it’s a word-building game where the units of play are cards bearing one-, two-, and three-letter sequences of letters. Rewordable was chosen for the NYU Game Center’s Incubator program, and (as of this writing) the Kickstarter for the initial print run of the game is 262% funded. Almost everyone who plays the game loves it, and as a team we’re really proud of what we made together.

So why do I feel like a failure? Because of programmer’s hubris. Let me explain.

Scrabble sucks

We initially made Rewordable in 2008, as the final project for a game design class taught by Frank Lantz at ITP. The initial impetus for the game was the frustration we shared with traditional word-building games—like Scrabble—that reward rote knowledge of short, obscure dictionary words over more accessible, creative expression based on every English speaker’s internal knowledge of word structure. Our strategy for solving this problem was to make a word game where you play with sequences of letters instead of individual letters, thereby encouraging—and enforcing—the formation of longer, more interesting words. (If you’re interested in learning more about this argument, I gave a talk at Indiecade East 2015 on the topic called Beyond the Scrabble Word List. The slides are available here.)

Rewordable is a card game, and card games have a particular material limitation—namely, that they must be finite, and preferably finite enough to fit in a reasonably-sized box. So the question arose: which sequences of letters should be on the cards? As the resident computer programmer and linguist on the design team, it fell to me to take the first stab at this problem.

N-gram frequencies

Using the 26 letters of the English alphabet, there are 17,576 possible three-letter sequences (26 × 26 × 26), 676 possible two-letter sequences, and (of course) 26 possible one-letter sequences. Of these, only a small percentage are actually found in English words. My task was to identify which of these sequences to print on cards and include in the game. Our goal was for word formation in Rewordable to feel natural, intuitive, even a little bit magical: the words should seem to fall out of the deck, almost unbidden. My first instinct was that we could achieve this with the sequences of letters that occur most frequently in English words.

So I took a list of English words and ran a character-level n-gram frequency analysis on it. (“N-gram” is the fancy computational linguistics term for “a sequence of N items” where N can be any number and “item” might mean a letter or a word.) This yielded a list of the most common one-, two- and three-letter sequences. The top 120 in such an analysis are:

For the original Rewordable deck, we ran with a list of cards very similar to this one, padding it out with additional single letters to form a deck of 160. That was the version of the deck that we turned in to Professor Lantz in 2008 as our final game design project, and it was also the version that we had available for print-and-play on the Rewordable site for years.
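The analysis itself is only a few lines of Python. Here’s a minimal sketch of that kind of character-level n-gram count—not the original 2008 script, and the word list filename is a placeholder:

```python
from collections import Counter

def char_ngrams(word, n):
    """Yield every n-character substring of a word."""
    return (word[i:i + n] for i in range(len(word) - n + 1))

counts = Counter()
with open("words.txt") as f:            # placeholder: one English word per line
    for line in f:
        word = line.strip().lower()
        for n in (1, 2, 3):
            counts.update(char_ngrams(word, n))

# The 120 most frequent one-, two- and three-letter sequences
for seq, freq in counts.most_common(120):
    print(seq, freq)
```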

Tio and Ati

Simple and neat, right? At first glance, that list of n-grams looks fine. You can probably pick through the list and start spelling words with these sequences without too much trouble: LI+ST, ST+ING, ON+ION, and so forth.

I felt good about this deck because it was almost entirely computer-generated: the algorithm had given us the answer to the problem. As a computer programmer and an engineer, this was very satisfying to me.

But when we actually played games with this deck, there was a problem: some of the cards were impossible to use, and would get stuck in people’s hands, sometimes until the very end of the game. The TIO card appeared often in the word “ratio” (RA+TIO), but hardly ever in any other combinations. Drawing the ATI card was essentially like drawing a card with the pile-of-poo emoji on it: useless garbage. The reason for this is that while both TIO and ATI are very common sequences of English letters (think of words like information and computation; count how many times you can find these two sequences in this very article!), they’re difficult to use in a word unless you also have other scaffolding, in the form of letter sequences like ON or individual letters like A and N.

Even when you have that scaffolding available, a sequence like ATI is usually found only in longer words than a sequence like SH. In one word list that I experimented with, words containing ATI are on average about 12 letters long, whereas words containing SH are on average only about 9 letters long. So to make a word with ATI, you need not just the specific sequences that almost always go with it (like ON), but also many other sequences already in play, just so you can fit it into a word at all.

The Matrix

Despite this frustrating TIO, the deck stayed in this state for a long time. But then, when we were accepted to the NYU Game Center Incubator program last summer, I decided to buckle down and solve the problem of “sticky cards” once and for all. And I decided to do it the only way I knew how: good old fashioned computer programming. I made a new directory in my Rewordable project called “science,” opened up Jupyter Notebook and got to work.

The idea was to find a new procedure for selecting the sequences in the Rewordable deck. Ideally, the procedure would be clever, simple, and require no human intervention after the results had been found: clean, pure algorithmic design.

I took as axiomatic for my summer experiments that I should drop my “exhaustive” list of all English words, and instead use a list of the most common English words. (I ended up using spaCy’s word frequency information for some of my experiments and this list of the 10000 most common English words for others.) This strategy would ensure that frequencies weren’t thrown out-of-whack by sequences that occur mostly in rare words—and attenuate the frequency of sequences like ATI, which are found in a lot of unique words but not as many common words.

The first thing I tried was a matrix of sequence co-occurrences. The idea was that I could make a better deck by selecting not the most frequent sequences, but the sequences that happened most frequently with other sequences. In other words: the friendliest sequences. I got a big Pandas data frame that looked like this:

Pandas data frame for co-occurrence of sequences, 4945 by 4945

This is just a slice of the larger table, but it shows, for example, that the sequence CC occurs in the same word as TED thirteen times (e.g., “accumulated”), while the sequence CAR occurs in the same word as TE 35 times (say “cartel” or “carbohydrate”). Gathering the 120 “friendliest” sequences yielded the following list:

s, e, i, t, n, d, g, ng, a, ing, ed, r, l, es, y, o, in, ti, er, c, ly, on, te, le, rs, ion, ts, m, al, io, u, en, at, nt, ns, tio, h, ted, ers, ic, st, b, ati, li, it, ne, re, tin, p, ri, an, ty, ent, ce, ie, ons, or, ll, ve, ies, ble, ate, me, ss, ar, is, us, ta, lly, ity, ry, v, se, bl, k, ess, ive, all, ca, na, el, si, ra, nts, ia, ou, nce, ls, ab, iv, de, f, red, ous, abl, nc, ni, men, ica, ter, les, tic, ge, he, to, di, hi, nes, ds, et, ci, cal, tes, nd, sh, il, ul, tiv, lin, la

This is an interesting list, a bit different from the list based on raw frequency. But ATI and TIO are both still in there! So I decided this wasn’t the right tack.
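(For the curious, this is roughly how a co-occurrence table like that can be put together with Pandas. It’s a sketch under assumptions rather than my actual notebook code—the word list filename is a placeholder, and the real matrix was 4,945 × 4,945:)

```python
from collections import Counter, defaultdict
import pandas as pd

def sequences(word):
    """The set of one-, two- and three-letter substrings of a word."""
    return {word[i:i + n] for n in (1, 2, 3) for i in range(len(word) - n + 1)}

cooccur = defaultdict(Counter)
with open("common_words.txt") as f:     # placeholder: list of common English words
    for line in f:
        seqs = sequences(line.strip().lower())
        for a in seqs:
            for b in seqs:
                if a != b:
                    cooccur[a][b] += 1

# Square DataFrame: how many words each pair of sequences shares
df = pd.DataFrame(cooccur).fillna(0).astype(int)

# The 120 "friendliest" sequences: those that co-occur with others most often
friendliest = df.sum(axis=1).sort_values(ascending=False).head(120)
print(friendliest)
```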

The Spread

It occurred to me that what I wanted wasn’t necessarily the friendliest sequences, but the most versatile sequences: sequences that could be used to begin a word, or end a word, or go somewhere in the middle. So I ranked all of the sequences by their distribution, assigning “0” for every occurrence of a sequence at the beginning of a word, “1” for every occurrence at the end of a word, and a number in between, based on position, for occurrences in the middle of the word.

The idea was that a list of the sequences ranked by the spread (in this case, standard deviation) of their positional scores would give me the sequences that occurred in the widest variety of positions in words. Here are the sequences with the largest positional spread, along with a graph that shows the distribution:

Visual representation of intra-word distribution of several letter sequences

These graphs show you that, for example, when the letter S is in a word, it usually occurs at the beginning or at the end. SH, on the other hand, occurs mostly at the beginning, while NS occurs mostly at the end. L has a remarkably even distribution. For reference, here are some of the sequences with the lowest spread:

Lowest-spread sequences

This shows, e.g., that even though DIS is relatively frequent (1953 occurrences in the corpus), it really only ever appears at the beginning of words.
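The scoring itself is straightforward to sketch. Here’s a rough version, assuming the same kind of common-word list as above (the minimum-occurrence cutoff is an arbitrary assumption for illustration, not a value from my actual notebook):

```python
from collections import defaultdict
import statistics

positions = defaultdict(list)
with open("common_words.txt") as f:     # placeholder: list of common English words
    for line in f:
        word = line.strip().lower()
        for n in (1, 2, 3):
            last = len(word) - n
            for i in range(last + 1):
                # 0.0 for word-initial, 1.0 for word-final, in between otherwise
                positions[word[i:i + n]].append(i / last if last > 0 else 0.0)

# Rank sequences by the spread (standard deviation) of their positions,
# skipping sequences too rare to have a meaningful distribution
spread = {seq: statistics.pstdev(pos)
          for seq, pos in positions.items() if len(pos) >= 50}
for seq, s in sorted(spread.items(), key=lambda kv: -kv[1])[:20]:
    print(f"{seq:>3}  {s:.3f}")
```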

Chipmunk

While I was lost in Standard Deviation Land, Tim was actually at the Game Center, playtesting the game with anyone who would sit down with him for twenty minutes. I was trying to get a sense of how the deck might work statistically, while he was getting hands-on experience learning about how decks with particular compositions worked in actual play. I sent him a few decks derived from my experiments with sequence distribution, but we both agreed that some of the cards the procedure proposed (including odd fellows like NC, TIC, AB) were unlikely to work in practice.

Meanwhile, Tim had been playtesting a smaller, svelte version of the deck with only 64 cards, which he called Chipmunk. The idea behind Chipmunk was that it would be smaller, and therefore both cheaper to produce and simpler to play with (since we could eliminate the setup step that required players to count out the appropriate number of cards before starting to play).

I was very skeptical about Chipmunk at first, especially because Tim had decided which sequences to include in the deck without consulting the almighty computer program first (gasp!). But I played a round and it was a revelation: every round of Rewordable that I’d played up to that point was fun, but ponderous—not entirely unlike a slime mold spreading across the table. The Chipmunk deck was quick and light. Each turn just fell into place, like magic.

Chipmunk deck composition. Many of these cards didn’t make it to the final deck!

A big problem with early versions of Rewordable was that unlucky, unplayable hands were fairly frequent, meaning that players often had to pass their turn. We weren’t sure how to fix this problem. After playing a round with Chipmunk, I realized that playing with a smaller deck actually made it easier to form words. This was partially because lower-frequency sequences weren’t present in the smaller deck, but also because each individual card had a smaller surface area for “incompatibility” with other sequences.

Deck genetics

We ultimately rejected the idea of a 64-card deck, because of its limited expressive power (you’d get almost exactly the same words in every game). But after playing with Chipmunk, I realized that a smaller, carefully-curated deck in which the cards were chosen not for their frequency but for their mutual compatibility could lead to a more fluid play experience—with fewer passes.

So now my problem wasn’t evaluating how suitable individual sequences were—it was evaluating entire decks of cards for suitability. This is, computationally speaking, significantly more difficult: there are only around eighteen thousand possible one-, two-, or three-letter sequences using the 26 letters of the English alphabet, a small enough number that each one can be quickly and individually evaluated for (more or less) any measure of fitness I throw at it. But there are (26³+26²+26)¹²⁰ ways to pick a 120-card deck from those same eighteen thousand sequences (counting order and allowing repeats)—a 512-digit number. Even winnowing this down by excluding decks that contain unattested n-grams, the possibility space is enormous.
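(A quick sanity check on that figure:)

```python
>>> 26**3 + 26**2 + 26                       # possible one- to three-letter sequences
18278
>>> len(str((26**3 + 26**2 + 26) ** 120))    # digits in the number of possible decks
512
```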

Given that the expected delivery date of our Kickstarter is April 2017, well before the heat-death of the universe, I needed a way to traverse this possibility space quickly. My initial thought was to use a genetic algorithm, seeded with the decks that Tim had been playtesting. I wrote a function to “evaluate” a deck by simulating thousands of rounds of play, keeping track of how many unique words were formed during each game and how even the distribution of card usage was, with the idea of rewarding decks that were expressive and “balanced.” Decks that did well were “mutated” by introducing new random sequences, and then evaluated again. Here’s a typical result after a few generations:

This was in August, and Team Rewordable’s internal deadline for finalizing deck composition was approaching fast. The genetic algorithm was producing interesting results, but it was tough to tweak the fitness function (the game simulator) to avoid local maxima. And even evaluating one deck by simulating thousands of rounds took a few seconds—which doesn’t seem like much, but is long enough to make it difficult to run the genetic algorithm with a population large enough to be truly effective.
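To give a sense of the shape of it: stripped of the game simulator, the loop looked something like the sketch below. The fitness stub, the sequence pool, and the mutation and selection parameters here are placeholders for illustration, not the code I actually ran:

```python
import random

# Placeholder pool: in reality, the ~18,000 attested one- to three-letter sequences
ALL_SEQUENCES = ["ing", "ed", "er", "ati", "tio", "on", "s", "a", "t", "re", "st", "sh"]

def simulate_rounds(deck, rounds=1000):
    """Stand-in for the real fitness function, which simulated `rounds` games
    and scored each deck on unique words formed and evenness of card usage."""
    return random.random()              # placeholder score

def mutate(deck, k=5):
    """Return a copy of the deck with k cards swapped for random sequences."""
    new = list(deck)
    for i in random.sample(range(len(new)), k):
        new[i] = random.choice(ALL_SEQUENCES)
    return new

def evolve(seed_decks, generations=20, survivors=4, children=8):
    """Tiny mutate-and-select loop: keep the best decks, mutate them, repeat."""
    population = list(seed_decks)
    for _ in range(generations):
        population.sort(key=simulate_rounds, reverse=True)
        best = population[:survivors]
        population = best + [mutate(d) for d in best for _ in range(children)]
    return max(population, key=simulate_rounds)
```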

Eventually I realized that I had an even better fitness function at my disposal. But in order to use it I had to give up on my dream of a pure, clean, algorithmic solution.

That fitness function? Tim’s playtesting.

A hybrid solution

Over the summer, Tim and I regularly sat down together and had lengthy discussions—which, I’ll be honest, sometimes became heated arguments—about Rewordable’s “deck composition”: which sequences should be included in the deck. (One argument that still stings is whether or not to include an S card in the deck… but that’s a topic for another Medium post.) I’d feed him the latest outputs of the procedures I was working on, and he’d playtest them and tweak them (often beyond recognition), and I would incorporate his insights back into my experiments. It was a good system, but slow and clunky.

As it became clear that the genetic algorithm was unlikely to yield the results I wanted on its own, I realized that the fitness function I’d written to evaluate the playability of the randomly-generated decks could actually be a standalone program. If Tim could run that evaluation program with a deck of his own design, he’d be able to identify problems with it, make adjustments, and feed the adjusted deck back into the program, getting thousands of rounds of insight without having to organize thousands of rounds of playtesting. Here’s some sample output from the evaluation program, evaluating a deck from earlier this summer, code-named Lemur.

And it worked out great! Tim’s been using the program since then to guide deck design, and I’ve been making tweaks to it as we’ve made new discoveries and recognized new requirements. Here are a few ways we’ve used the program in practice:

  • The program shows how often each sequence was used, including sequences that weren’t used at all (a rough sketch of this kind of tally follows the list). Almost all decks can be improved simply by eliminating unused or little-used sequences and replacing them with others. (Fortunately, my experiments with sequence distribution provided a fruitful list of promising sequences to try.) Why doesn’t the Rewordable deck have single-card J, X, Q, or Z? Because the simulator showed us that, across thousands of rounds of simulated play, they usually went entirely unused.
  • The program shows a list of all the words formed during the simulated rounds, which gives us an idea of what the “texture” of the game will be like and what words are likely to show up when players actually play the game. It also helps us make sure that cards that occur frequently are, in fact, being used in different words, not just in the same word (or similar words) over and over.
  • The “patterns” section of the output shows how frequently cards with sequences of different lengths are used, and in what ways. This part of the output was vital in balancing the reward tokens—we specifically designed the deck to have lots of yellow/yellow (3 letter/3 letter) words to make the “All yellow word” reward easier to win.
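None of this output requires anything exotic. Here’s a rough sketch of the kind of usage tally described in the first bullet, under the assumption that the simulator hands back each played word as the list of cards that formed it (an illustration, not the shipped tool):

```python
from collections import Counter

def usage_report(deck, plays):
    """Tally how often each card in the deck was used across simulated rounds.

    `plays` is assumed to hold one entry per word played, each a list of the
    card sequences that formed the word, e.g. [["re", "w", "or", "d"], ...].
    """
    used = Counter()
    for cards in plays:
        used.update(cards)
    # Report every card in the deck, including those that were never used
    for card in sorted(set(deck), key=lambda c: -used[c]):
        print(f"{card:>3}  used {used[card]} times")
```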

So, I failed: the deck that we’ll ship to our Kickstarter backers will not be the pure, clean 100% computer-generated deck of my most secret computer science dreams. But thanks to our hybrid approach—statistical experimentation, combined with real-life playtesting, informed by simulation—I think we ended up with a very successful deck. And I can’t wait for you all to play a few games with it!