Beatris: an Evil Tetris AI
A data science guide to pitting a Tetris AI that aims to beat the game against Beatris, a game AI that plays to beat you.
This project was built by Amogh Agnihotri, Reese Costis, Prajakta Joshi, Rebecca Phung, Suhas Raja, and Ramya Rajasekaran. Play against Beatris and find our scripts and models here.
Tetris was first created in 1984 by Soviet software engineer Alexey Pajitnov and was the first game exported to the United States from the Soviet Union. Since its release on the Game Boy in 1989, Tetris has been widely regarded as one of the greatest video games of all time, and it is now the second most downloaded game ever, behind Minecraft.
The game itself is a tile-matching puzzle in which different geometric shapes descend from the top of the screen. Points are accumulated when an entire row of blocks is filled, at which point the row is cleared. If the player cannot clear rows quickly enough, the screen fills with blocks and the game ends. The game never ends in player victory; the player can only clear as many lines as possible before an inevitable loss.
Tetris is popular both in the mainstream and in the software development community. Many AIs have been created to beat the game, but far fewer have been made to make the game as hard as possible for the player. Our goal for this project was to create a malicious AI that chooses the hardest pieces to give a smart Tetris AI playing the game. We’ll call the smart Tetris AI the “agent” and the malicious game AI “Beat Tetris,” or “Beatris” for short. We chose an agent implementation from GitHub user nuno-faria; you can find the implementation code here. His AI uses deep reinforcement learning with q-learning to learn to beat normal Tetris within 2,000 games. Using organic game results, we aimed to train and improve the agent and subsequently train Beatris to give the agent the worst time, and score, possible.
The Agent — How it Works
We first had to train the agent against the normal Tetris game to create a formidable opponent for Beatris. Here’s an explanation of its training process.
At a High-Level
Let’s start with the highest-level explanation. The agent begins by making random choices in-game. It learns by studying the state of the board and receiving a varying numerical reward based on the move it made. At the end of every episode, it trains a neural network on the moves made and the rewards received, feeding them into a q-learning algorithm in an attempt to maximize reward. Before we dive into the specifics of our agent’s training process, we must first understand q-learning.
Q-learning is a learning algorithm that tries to find the best action in response to the current state. In our case, the action is a placement of the given piece and the state is the current board layout. The agent starts by taking random actions and receiving a total reward at the end of those actions, then updates its estimates to maximize future reward.
During training, a q-table is created. The q-table’s axes are state and action and all values are initialized the first time a state and corresponding action are chosen. The q-table becomes a reference for the agent to select the best action for a state based on the corresponding q-value.
The agent can take action in two ways. It can either use the q-table and look at the q-value for every possible action for the current state and take the action corresponding to the highest q-value, or it can take a random action. Selecting a random action allows the agent to find new states that may not be found by using the existing q-table.
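This explore-or-exploit choice is commonly called epsilon-greedy selection. A minimal sketch of the idea, assuming a hypothetical `q_table` dictionary keyed by (state, action) pairs rather than the project’s actual data structures:

```python
import random

def choose_action(q_table, state, possible_actions, epsilon):
    """Epsilon-greedy selection: with probability epsilon take a random
    action (explore); otherwise take the action with the highest known
    q-value for this state (exploit)."""
    if random.random() < epsilon:
        return random.choice(possible_actions)  # explore a new placement
    # Exploit: unseen (state, action) pairs default to a q-value of 0.0
    return max(possible_actions, key=lambda a: q_table.get((state, a), 0.0))
```

With `epsilon=0` the choice is fully deterministic; with `epsilon=1` it is fully random.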
In our case, the q-table is updated after every finished game with the following equation: Q_state = reward + discount × Q_next_state, where discount is a multiplier between 0.8 and 0.9 that discounts the value of the next-state rewards being considered. We made the reward dependent upon (lines cleared)² and bumpiness.
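The update rule above can be written directly as a one-liner (the variable names here are illustrative, not the project’s):

```python
def update_q(reward, discount, q_next_state):
    """Q_state = reward + discount * Q_next_state,
    applied after each finished game."""
    return reward + discount * q_next_state

# e.g. a reward of 1 this step, discount 0.9, and an
# estimated next-state value of 10:
update_q(1, 0.9, 10)  # 10.0
```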
After training is done, the q-table is finished and the agent can then use the q-table in future games to respond to all possible states. For example, knowing the given piece and the state of the board, the agent will perform every action on the given piece with the given state and then reference the q-values for every resulting state, finally taking the action corresponding to the highest q-value. If you want a more in-depth explanation of q-learning before you continue, check out this article.
For our application of q-learning, the agent evaluates the state of the board based on four heuristics (think of them as features) that it trains its behavior on. Those are:
- Lines cleared
- Number of holes (empty spaces completely surrounded by already played pieces)
- Bumpiness (the sum of the difference between the heights of adjacent columns)
- Total height
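These heuristics can be computed directly from the board grid. A minimal sketch, assuming the board is represented as a list of rows with 0 for empty cells and 1 for filled cells (the real implementation’s representation may differ):

```python
def column_heights(board):
    """Height of each column: distance from the topmost filled cell
    down to the floor (0 for an empty column)."""
    rows, cols = len(board), len(board[0])
    heights = []
    for c in range(cols):
        h = 0
        for r in range(rows):
            if board[r][c]:
                h = rows - r
                break
        heights.append(h)
    return heights

def count_holes(board):
    """Empty cells with at least one filled cell above them
    in the same column."""
    holes = 0
    for c in range(len(board[0])):
        seen_block = False
        for r in range(len(board)):
            if board[r][c]:
                seen_block = True
            elif seen_block:
                holes += 1
    return holes

def bumpiness(heights):
    """Sum of absolute height differences between adjacent columns."""
    return sum(abs(a - b) for a, b in zip(heights, heights[1:]))
```

Total height is then just `sum(column_heights(board))`, and lines cleared is reported by the game engine after each placement.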
The agent also takes in a couple of hyperparameters: episodes and epsilon_stop_episode. The number of episodes determines how many games the agent will play to train. The epsilon_stop_episode parameter determines at which episode, or game attempt, the agent should stop making any random decisions for the rest of training. To control this rate of random decision-making, the agent keeps track of an epsilon value. Epsilon is a value between 0 and 1 giving the probability that a given piece placement is random rather than deliberate: 1 means every decision is random, and 0 means every decision is deliberate. Epsilon starts at 1 and is decremented uniformly every episode from the first episode to the stop episode, so that the agent makes more and more deliberate decisions as it trains. For example, if episodes is 1000 and the stop episode is 800, then at episode 1 the agent makes all random decisions with epsilon 1, and from episode 800 onward it makes all deliberate decisions with epsilon 0. The random choices allow the agent to keep discovering new actions at different states that perform even better than its prior knowledge.
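The linear decay schedule just described can be sketched as follows (function and parameter names are ours, not the project’s):

```python
def epsilon_at(episode, epsilon_stop_episode, start=1.0, end=0.0):
    """Linearly decay epsilon from start to end over the first
    epsilon_stop_episode episodes; hold at end afterwards."""
    if episode >= epsilon_stop_episode:
        return end
    step = (start - end) / epsilon_stop_episode
    return start - step * episode
```

With episodes=1000 and epsilon_stop_episode=800, epsilon is 1.0 at episode 0, roughly 0.5 at episode 400, and 0.0 from episode 800 to the end of training.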
So now we know how the agent evaluates the playing field state and what hyperparameters we can use to influence the training. But how does the agent evaluate its own performance in order to improve? That’s where the scoring function comes in.
Initially, we had one scoring function that doubled as both the reward (the training score the agent aims to maximize) and a metric for evaluating the agent’s performance. That scoring function was primarily based on (lines cleared)² × board_width, where lines cleared measures how many lines were cleared by the most recent piece placement. This worked fine at first, while we were tuning state features and hyperparameters to improve agent performance. However, we found that the most significant change in agent behavior came when we changed the reward function to include various other terms, such as bumpiness. The problem with changing the reward function is that agents with different reward functions cannot be compared by their final scores, since the function producing those scores differs. So, we created a normalized score, based solely on (lines cleared)², that could be used to evaluate agents against each other.
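A sketch of the split between the training reward and the reward-independent evaluation metric (the names and the standard 10-column board width here are our illustrative choices):

```python
def training_reward(lines_cleared, board_width=10):
    """Reward term the agent trains on; this is the piece we later
    varied across experiments."""
    return (lines_cleared ** 2) * board_width

def normalized_score(lines_cleared_per_move):
    """Reward-independent metric used to compare different agents:
    sum of (lines cleared)^2 over every move of a game."""
    return sum(n ** 2 for n in lines_cleared_per_move)

normalized_score([1, 0, 2, 0, 1])  # 1 + 4 + 1 = 6
```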
Tuning Agent Performance
Our agent’s first training session resulted in an average testing score of 16, meaning 4 lines cleared. After reading nuno-faria’s README on GitHub and seeing the amazing performance of his agent, which seemingly never loses, we were sorely disappointed at 4 lines cleared. We were also confused about why our agent was performing so differently from nuno-faria’s. So we set off to find ways to improve the agent’s performance by altering the training process.
Our initial instinct in tuning the agent’s performance was to modify the input features (state parameters) and the hyperparameters. After experiencing only marginal improvement, we had more success experimenting with the training scoring (reward) function, extending it with more heuristics like the number of holes and maximum bumpiness. The following sections explain our attempts in depth.
The hyperparameters we played with included the number of training episodes, the epsilon stop episode, and the size of the DQN. By increasing the number of episodes, we hoped the agent would have more games to train on (the equivalent of increasing the size of a dataset when training a linear model) and therefore recognize state/action patterns increasingly well. According to the training graph in nuno-faria’s README, the agent’s score should spike around 1,400 episodes (training games played) and increase very quickly after that. However, changing episodes from 2,000 to 10,000 produced negligible results for us.
We also experimented with the epsilon stop episode. We noticed that the agent’s performance would spike around the epsilon stop episode and then, inexplicably, decrease significantly afterwards. However, no alternative epsilon stop episode resulted in better performance.
Adding New Features
Next, we tried adding new state features. Upon observing Tetris world championship games, we noticed that professional players use various techniques to make the Tetris board appear “smooth”. This equates to having low bumpiness, low height, and minimizing the number of holes. We first added maximum bumpiness, the biggest height difference between any two columns on the board, to the feature set. However, adding this new feature did not show any improvement in the score. We also tried removing the number of lines cleared from the feature set. Unfortunately, this did not lead to a higher score either. So, we decided to stick to our original state feature set.
A Heavier Death Penalty
When logging values into the q-table, the initial implementation of the agent reduced the reward by 2 when the agent lost. Since we really wanted to discourage death, or loss of the game, we tried punishing the agent by reducing the reward by 200 on loss. Adding this heavier death penalty improved the score considerably, so we incorporated it into our final agent.
Improved Reward Function
We noticed that the agent would leave the leftmost and rightmost columns empty even when it received a piece that would fit well there. To encourage a “smoother” pattern, we added a negatively weighted maximum-bumpiness term to the reward. With this tactic, the average score improved from 16 to about 135 and the maximum score improved from 60 to about 316,000, so we kept negative maximum bumpiness in the reward function for our final agent. Still, the agent created a considerable number of holes, so we added a similar negative term for the number of holes; however, it did not increase the agent’s score. In the final reward function, for every round the agent survives it gains a baseline reward of 1, increased if any lines are cleared, decreased based on bumpiness, and decreased much more heavily if the game ends.
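A sketch of this final reward shape, with placeholder weights (the actual coefficients and board width are illustrative assumptions, not the exact values from our training config):

```python
def agent_reward(lines_cleared, max_bumpiness, game_over,
                 board_width=10, bump_weight=0.5, death_penalty=200):
    """Per-round reward: +1 baseline for surviving the round, a
    quadratic bonus for cleared lines, a penalty scaled by maximum
    bumpiness, and a large penalty if the placement ends the game."""
    reward = 1 + (lines_cleared ** 2) * board_width
    reward -= bump_weight * max_bumpiness
    if game_over:
        reward -= death_penalty
    return reward
```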
Final Agent Technique and Performance
Our final agent had an average score of 135 and a maximum score of over 300,000. We kept the original four state features: bumpiness, lines cleared, holes, and height. Our final reward function was based on (lines cleared)² and bumpiness.
Existing “Beatris-like” Implementations
We were inspired to build Beatris after discovering Hatetris, an adversarial Tetris implementation that chooses the next piece to give the player using the following algorithm: test all possible placements of all possible pieces, determine which piece has the worst best-case placement scenario, and spawn that piece. The metric for deciding which scenario is worst is simply the maximum height of the field after piece placement.
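That worst-best-case selection is essentially one step of minimax, and can be sketched in a few lines. The `placements_for` and `simulate` helpers here are hypothetical stand-ins for a real game engine; `simulate` is assumed to return the maximum field height after a given placement:

```python
def pick_worst_piece(board, pieces, placements_for, simulate):
    """Hatetris-style selection: for each candidate piece, find the
    player's best placement (lowest resulting max height), then spawn
    the piece whose best case is worst (highest)."""
    def best_case_height(piece):
        return min(simulate(board, piece, p)
                   for p in placements_for(board, piece))
    return max(pieces, key=best_case_height)
```

This is why the strategy is so slow: every decision simulates every placement of every piece.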
However, this strategy is extremely time-consuming and does not use any machine learning models to make a decision. We wanted to make an even more “evil” AI. The highest score attained against Hatetris is 30 lines cleared. In our implementation, we wanted to make use of more advanced machine learning techniques and improve (reduce) the score a user could get against it.
How Beatris Works
Beatris is an adversarial model based on a deep q-learning network similar to the original Tetris agent’s. However, for every move the agent makes, Beatris logs the move with a negative reward; in other words, Beatris is punished for many of the same things the agent is rewarded for. Find our implementation here. In Beatris’ reward function, for every round the agent survives, Beatris receives a baseline negative reward of 1 and is further punished for any lines the agent clears. It is incentivized to increase bumpiness and holes, and is most heavily incentivized to end the game.
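A sketch of that mirrored reward, again with placeholder weights rather than the exact coefficients from our code:

```python
def beatris_reward(lines_cleared, bumpiness, holes, game_over,
                   board_width=10, bump_weight=0.5, hole_weight=1.0,
                   end_bonus=200):
    """Beatris is punished for the agent surviving (-1 per round) and
    for cleared lines, rewarded for bumpiness and holes, and rewarded
    most heavily for ending the game."""
    reward = -1 - (lines_cleared ** 2) * board_width
    reward += bump_weight * bumpiness + hole_weight * holes
    if game_over:
        reward += end_bonus
    return reward
```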
Initial Beatris Performance
When we tested our initial agent (average score 16) against Beatris, we observed a new average score of 13. More interestingly, Beatris stifled the maximum score from 60 down to 19. Against the improved agent, Beatris lowered the average score from 135 to 50. However, we ran into interesting situations when the models interacted and wanted to improve Beatris’ performance further. We first attempted to fix the infinite loop problem.
The Infinite Loop Problem
We noticed that the training process always got stuck at a certain point. By rendering the game more often, we were able to pinpoint the cause: the game entered an infinite loop in which Beatris provided the same piece over and over again, helping the agent clear the same lines. Beatris was being too greedy in its choice of piece. By adding the number of holes to Beatris’ reward function, we incentivized Beatris to force more holes onto the board, hoping that would break the loop. It worked, and we were able to reduce the agent’s average score to 18.
While our original goal was to make an adversarial Tetris AI to play against, our AI vs. AI training process made this project two-pronged: creating the agent and creating Beatris. With the agent, we had hoped to get an almost unbeatable Tetris AI off the bat; instead, we spent half of our project time training and tuning it to an acceptable level of performance, increasing its average score from 16 to 135. With Beatris, once we decided on training a negative-reward deep q-learning network, it was relatively straightforward to create problems for our agent, especially since we understood the q-learning process and our domain much better after working to improve the agent. When all was said and done, the final agent’s average score dropped from 135 to 18 when playing against Beatris, a huge improvement over the earlier drop from 135 to 50. The agent’s maximum score was stifled from 316,000 to 45!
Some things we wished we could have tried if we had more time include:
- Comparing agent performance against Beatris and against Hatetris
- Continuing to train the agent until it can play indefinitely
- Training Beatris against human play
- Creating a smoother human-playable interface