Other AIs, and the Question of Reward Design

This post is part of ChurrPurr.ai, a challenge to design an online strategy game and the AI to master it. Play the latest version of the game here.

Steven Adler
3 min read · Oct 31, 2017

Today was a longer day at work, with a farewell to a colleague heading back to Singapore, so I didn't get too much time this evening to work on the Churr-Purr project proper. That said, I did come across a fairly interesting article on other AI developments that illustrates many of the questions I'm now pondering.

Gyroscope’s AI v AI with 8 Street Fighter II characters

Adam Fletcher and the folks at Gyroscope Software trained an AI on the arcade game Street Fighter II and then held an elimination tournament at the Samsung Developer Conference, in which attendees attempted to predict the eventual tournament winner. Very cool project, and I appreciated learning about some of the intricacies of classic game systems.

One aspect that struck me while reading is how the Gyroscope team defined the AI’s reward function — and in particular, whether I might be handicapping my AI by giving it no information about how well it’s faring until the game’s conclusion:

A common question asked is why we didn’t have a “win” as the reward function. In short, it creates a delayed reward, which makes training much more difficult and lengthy. The health gap was a reasonable heuristic that we believed would lead to wins — and, it did.
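
For intuition, here's a minimal sketch of the distinction the quote is drawing: a sparse win-only reward versus a dense per-step health-gap reward. This is not Gyroscope's actual code; the state fields and magnitudes are assumptions of mine.

```python
# Sparse vs. dense rewards for a fighting game (illustrative only;
# the `state` fields below are hypothetical, not Gyroscope's API).

def sparse_win_reward(state) -> float:
    """Pays out only when the round ends: a delayed signal that is
    hard to learn from, since most steps return 0."""
    if not state.round_over:
        return 0.0
    return 1.0 if state.we_won else -1.0

def dense_health_gap_reward(prev_state, state) -> float:
    """Pays out every step: the change in (our health - their health),
    a proxy for winning that gives immediate feedback on each action."""
    prev_gap = prev_state.our_health - prev_state.their_health
    gap = state.our_health - state.their_health
    return gap - prev_gap
```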

Currently my plan is for the Churr-Purr AI to receive no score information until a game is complete, so as to avoid favoring short-run techniques that are suboptimal in the long run. An alternative would be to award points as an opponent's pieces are eliminated, but to award a much larger number of points for the overall win condition, such that the game's final outcome significantly outweighs the path to it.
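
To make the two candidates concrete, here's a rough sketch of both schemes; the point values and function signatures are placeholders I'm assuming for illustration, not a committed design.

```python
# Two candidate scoring schemes for the Churr-Purr AI.
# Magnitudes are illustrative assumptions, not final values.

WIN_POINTS = 100.0      # terminal payout for winning the game
CAPTURE_POINTS = 1.0    # shaping payout per opponent piece removed

def terminal_only_score(game_over: bool, won: bool) -> float:
    """Scheme 1: no feedback at all until the game concludes."""
    if not game_over:
        return 0.0
    return WIN_POINTS if won else -WIN_POINTS

def shaped_score(game_over: bool, won: bool, captures_this_turn: int) -> float:
    """Scheme 2: small points per capture, dwarfed by the win/loss payout."""
    score = CAPTURE_POINTS * captures_this_turn
    if game_over:
        score += WIN_POINTS if won else -WIN_POINTS
    return score
```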

If I were confident that aggressively removing an opponent’s pieces were always the right strategy, which does feel somewhat intuitive (one step closer to winning, right?), I would feel more comfortable encouraging the AI toward removing pieces whenever possible.

On the other hand, I am quite not-good at Churr-Purr, so I don't much trust my own judgments of optimal play. Additionally, I've come to recognize that superior stage-one piece counts can be thwarted by stage-two positioning (or at least can be squandered if the remaining pieces are placed suboptimally), so I'm not sure that piece counts are the be-all and end-all here.

In chess, for example, there's a popular motif around sacrifice plays, where a player gives up material now in exchange for a stronger position later; what if my Churr-Purr AI never discovered these paths, or, relatedly, could easily be tempted by an opponent's sacrifices?

RIP Ron, one of the great chess sacrifices of all time.

To some extent, the short-run score effects should align well with victories, provided the AI searches deep enough, since I'll make sure that winning overpowers the short-term accumulation of points. That said, since my AI's eventual search approach is very much in flux, I'm not yet ready to commit to one score-keeping method or the other.
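
One way to keep that guarantee honest is to make the win value strictly larger than any achievable material total, so that any line the search finds ending in a win outranks any line that merely grabs pieces. A sketch, reusing the constants from the scheme above and assuming a hypothetical cap on pieces per player:

```python
MAX_PIECES = 9  # hypothetical cap on pieces per player, not Churr-Purr's real count
WIN_POINTS, CAPTURE_POINTS = 100.0, 1.0  # as in the earlier sketch

# A win must outweigh capturing every piece the opponent has,
# so terminal outcomes dominate material within the search horizon.
assert WIN_POINTS > MAX_PIECES * CAPTURE_POINTS

def evaluate(node) -> float:
    """Static evaluation for an eventual game-tree search (node fields
    are hypothetical): terminal outcomes dominate, material breaks ties."""
    if node.is_terminal:
        return WIN_POINTS if node.we_won else -WIN_POINTS
    return CAPTURE_POINTS * (node.their_pieces_lost - node.our_pieces_lost)
```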

All in all, the Street Fighter II project made for quite an interesting AI read, and maybe there's something to learn from its graphics to make Churr-Purr a bit more flashy…

P.S. I found Adam and the Gyroscope team's excellent Medium post through Import AI, the weekly mailing list of Jack Clark (OpenAI's Strategy and Communications Director). I'd highly recommend it for those interested in keeping up with the frontier of AI research.

Read the previous post. Read the next post.

Steven Adler is a former strategy consultant focused across AI, technology, and ethics.

If you want to follow along with Steven’s projects and writings, make sure to follow this Medium account. Learn more on LinkedIn.
