Atari Games 🎮🤖

Sirojsuxrobov
9 min read · Mar 6, 2023


Description of the Project:

Once upon a time the Atari was the (only) console of choice. Created by Nolan Bushnell and Ted Dabney in 1972, it played a big part in the huge popularity of arcade games and created a home console boom in the 70s and 80s — with some of the best Atari games still remembered fondly by gamers of a certain age.

Retro games from the Atari platform were quite well-liked back then.

My Mission:

My mission is to build an AI that “plays” Atari video games: several models, each solving a different Atari game.

Three games were suggested for the project (CartPole, SpaceInvaders, Pacman), but to stand out from the competition I chose my own set of four games.

My deliverables:

  • a model that plays Cart Pole
  • a model that plays Mountain Car
  • a model that plays Freeway
  • a model that plays IceHockey

To complete the task, we need to “connect” a special library: OpenAI Gym.

What is Reinforcement Learning?

Reinforcement learning is a machine learning training method that rewards desired behaviors and/or penalizes undesirable ones. A reinforcement learning agent can typically perceive its environment, act, and learn from its own mistakes.

Developers provide a way of rewarding desired actions and penalizing undesirable behaviors in reinforcement learning. In order to motivate the agent, this technique assigns positive values to desired acts and negative values to undesirable behaviors. This trains the agent to seek maximal overall reward over the long run in order to arrive at the best possible outcome.
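As a minimal illustration of this loop (a sketch of my own, using the classic Gym API used throughout this post, where step() returns four values), the agent repeatedly observes the environment, acts, and accumulates the reward signal it is trying to maximize:

import gym

env = gym.make('CartPole-v1')
observation = env.reset()

total_reward = 0
done = False
while not done:
    action = env.action_space.sample()   # a trained agent would pick actions based on the observation
    observation, reward, done, info = env.step(action)
    total_reward += reward               # the signal reinforcement learning maximizes over time

print('Total reward:', total_reward)
env.close()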

Requirements for building the following AI models:

deap
stable_baselines3
gym[atari]==0.19.0
tensorflow
tf-nightly
keras

Cart Pole

For the first two games (Cart Pole and Mountain Car), I used genetic algorithms to build the AI.

Let me briefly explain what that is…

What Is the Genetic Algorithm?

The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution. The genetic algorithm repeatedly modifies a population of individual solutions. At each step, the genetic algorithm selects individuals from the current population to be parents and uses them to produce the children for the next generation. Over successive generations, the population “evolves” toward an optimal solution.

You can apply the genetic algorithm to solve a variety of optimization problems that are not well suited for standard optimization algorithms, including problems in which the objective function is discontinuous, nondifferentiable, stochastic, or highly nonlinear. The genetic algorithm can also address problems of mixed integer programming, where some components are restricted to be integer-valued.

This flow chart outlines the main algorithmic steps.

The genetic algorithm uses three main types of rules at each step to create the next generation from the current population (a minimal sketch in code follows the list):

  • Selection rules select the individuals, called parents, that contribute to the population at the next generation. The selection is generally stochastic, and can depend on the individuals’ scores.
  • Crossover rules combine two parents to form children for the next generation.
  • Mutation rules apply random changes to individual parents to form children.
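To make these three rule types concrete, here is a minimal, self-contained DEAP sketch of a single generation on a toy problem (maximizing the sum of a 5-bit chromosome); the problem, sizes, and probabilities are purely illustrative and not part of the project:

import random
from deap import base, creator, tools

# Toy problem: maximize the sum of a 5-bit chromosome.
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, 5)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", lambda ind: (sum(ind),))
toolbox.register("select", tools.selTournament, tournsize=2)   # selection rule
toolbox.register("mate", tools.cxOnePoint)                     # crossover rule
toolbox.register("mutate", tools.mutFlipBit, indpb=0.2)        # mutation rule

pop = toolbox.population(n=10)
for ind in pop:
    ind.fitness.values = toolbox.evaluate(ind)

# One generation: select parents, clone them, then apply crossover and mutation.
offspring = list(map(toolbox.clone, toolbox.select(pop, len(pop))))
for child1, child2 in zip(offspring[::2], offspring[1::2]):
    if random.random() < 0.9:
        toolbox.mate(child1, child2)
        del child1.fitness.values, child2.fitness.values
for mutant in offspring:
    if random.random() < 0.1:
        toolbox.mutate(mutant)
        del mutant.fitness.values

# Re-evaluate modified individuals and let the offspring become the next generation.
for ind in offspring:
    if not ind.fitness.valid:
        ind.fitness.values = toolbox.evaluate(ind)
pop[:] = offspring

The full Cart Pole and Mountain Car scripts below follow exactly this pattern, with a custom elitism-aware loop (algelitism.eaSimpleElitism) driving the generations.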

In Cart Pole, a pole is attached to a cart that can move only horizontally (left or right), and the task is to keep the pole upright. Since the pole starts out slightly tilted, simply standing still will not help.

We can also make use of the OpenAI Gym package and the command we are already familiar with to create such a virtual environment:

import gym
env = gym.make('CartPole-v1')

The reset() method is used to initialize the environment, and the step() method is used to send commands at each iteration:

observation = env.reset()
observation, reward, done, info = env.step(action)

The observation object contains four real numbers:

  • cart position (from -2.4 to 2.4);
  • cart velocity (-∞; +∞);
  • pole angle (about ±0.418 rad, i.e. roughly ±24°);
  • pole velocity at the tip (-∞; +∞).

For each step at which the pole does not fall (deviates from the vertical by less than about 12°), a reward of 1.0 is assigned. Accordingly, the episode ends if:

  • the pole deviates from the vertical by more than about 12°;
  • the cart moves away from the center by more than 2.4 units.

The goal of the algorithm is to keep the pole from falling for 500 steps, so the maximum reward is 500 units. To achieve this, at every step we need to issue the right command for the cart, depending on the state of the pole (a naive hand-coded attempt is sketched right after this list):

  • 0 — impulse to move the cart to the left;
  • 1 — impulse to move the cart to the right.
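As a baseline, here is a naive hand-coded policy (my own illustration, not part of the project) that simply pushes the cart in the direction the pole is leaning:

import gym

env = gym.make('CartPole-v1')
observation = env.reset()

total_reward = 0
done = False
while not done:
    # observation[2] is the pole angle: push left (0) if it leans left, right (1) otherwise
    action = 0 if observation[2] < 0 else 1
    observation, reward, done, info = env.step(action)
    total_reward += reward

print('Total reward:', total_reward)   # usually well below the 500 maximum
env.close()

Such a one-line rule rarely keeps the pole up for long, which is why we instead let a genetic algorithm tune the weights of a small neural network controller.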

To implement the genetic algorithm, we will use the library deap.

Here is the code:

import gym
import numpy as np
import matplotlib.pyplot as plt
from deap import creator
from deap import base, algorithms
from deap import tools
import random
import algelitism                   # local module with the elitism-aware EA loop
from neuralnetwork import NNetwork  # local module containing the NNetwork class (shown below)
import time


env = gym.make('CartPole-v1')

NEURONS_IN_LAYERS = [4, 1]   # 4 observation values in, 1 output neuron (the action)
network = NNetwork(*NEURONS_IN_LAYERS)
LENGTH_CHROM = NNetwork.getTotalWeights(*NEURONS_IN_LAYERS)  # chromosome = all network weights

LOW = -1.0
UP = 1.0
ETA = 20
POPULATION_SIZE = 20
P_CROSSOVER = 0.9
P_MUTATION = 0.1
MAX_GENERATIONS = 50
HALL_OF_FAME_SIZE = 2

hof = tools.HallOfFame(HALL_OF_FAME_SIZE)

RANDOM_SEED = 42
random.seed(RANDOM_SEED)

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("randomWeight", random.uniform, -1.0, 1.0)
toolbox.register("individualCreator", tools.initRepeat, creator.Individual, toolbox.randomWeight, LENGTH_CHROM)
toolbox.register("populationCreator", tools.initRepeat, list, toolbox.individualCreator)
population = toolbox.populationCreator(n=POPULATION_SIZE)


def getScore(individual):
    # Fitness: total reward collected by the network with this set of weights.
    network.set_weights(individual)

    observation = env.reset()
    actionCounter = 0
    totalReward = 0

    done = False
    while not done:
        actionCounter += 1
        action = int(network.predict(observation.reshape(1, -1)))
        observation, reward, done, info = env.step(action)
        totalReward += reward

    return totalReward,


toolbox.register("evaluate", getScore)
toolbox.register("select", tools.selTournament, tournsize=2)
toolbox.register("mate", tools.cxSimulatedBinaryBounded, low=LOW, up=UP, eta=ETA)
toolbox.register("mutate", tools.mutPolynomialBounded, low=LOW, up=UP, eta=ETA, indpb=1.0/LENGTH_CHROM)

stats = tools.Statistics(lambda ind: ind.fitness.values)
stats.register("max", np.max)
stats.register("avg", np.mean)

population, logbook = algelitism.eaSimpleElitism(population, toolbox,
                                                 cxpb=P_CROSSOVER,
                                                 mutpb=P_MUTATION,
                                                 ngen=MAX_GENERATIONS,
                                                 halloffame=hof,
                                                 stats=stats,
                                                 verbose=True)

maxFitnessValues, meanFitnessValues = logbook.select("max", "avg")

best = hof.items[0]
print(best)

plt.plot(maxFitnessValues, color='red')
plt.plot(meanFitnessValues, color='green')
plt.xlabel('Generation')
plt.ylabel('Max/average fitness')
plt.title('Dependence of maximum and average fitness on generation')
plt.show()

# Replay the best individual found by the GA.
network.set_weights(best)
observation = env.reset()
action = int(network.predict(observation.reshape(1, -1)))

while True:
    env.render()
    observation, reward, done, info = env.step(action)
    if done:
        break
    time.sleep(0.03)
    action = int(network.predict(observation.reshape(1, -1)))

env.close()

Reacting to the pole angle alone is not enough, though, since the controller also has to account for the position and velocity of the cart and the angular velocity of the pole.

This is where a neural network, or rather a neural network controller, comes in. Its inputs are the four values of the observation object: the position of the cart, the velocity of the cart, the angle of the pole, and the velocity of its tip. Neural networks are a great fit when you need to map such input data to a specific output (here, the action).

import numpy as np


class NNetwork:
    @staticmethod
    def getTotalWeights(*layers):
        # Number of weights, including one bias per neuron in every layer.
        return sum([(layers[i] + 1) * layers[i + 1] for i in range(len(layers) - 1)])

    def __init__(self, inputs, *layers):
        self.layers = []
        self.acts = []

        self.n_layers = len(layers)
        for i in range(self.n_layers):
            self.acts.append(self.act_relu)
            if i == 0:
                self.layers.append(self.getInitialWeights(layers[0], inputs + 1))
            else:
                self.layers.append(self.getInitialWeights(layers[i], layers[i - 1] + 1))

        # The output layer uses a threshold activation, producing 0 or 1 (the action).
        self.acts[-1] = self.act_th

    def getInitialWeights(self, n, m):
        return np.random.triangular(-1, 0, 1, size=(n, m))

    def act_relu(self, x):
        x[x < 0] = 0
        return x

    def act_th(self, x):
        x[x > 0] = 1
        x[x <= 0] = 0
        return x

    def get_weights(self):
        return np.hstack([w.ravel() for w in self.layers])

    def set_weights(self, weights):
        # Unpack a flat chromosome (list of numbers) back into per-layer weight matrices.
        off = 0
        for i, w in enumerate(self.layers):
            w_set = weights[off:off + w.size]
            off += w.size
            self.layers[i] = np.array(w_set).reshape(w.shape)

    def predict(self, inputs):
        f = inputs
        for i, w in enumerate(self.layers):
            f = np.append(f, 1.0)        # append the bias input
            f = self.acts[i](w @ f)
        return f
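A quick usage sketch of this class (the weight values below are illustrative; the file name neuralnetwork.py is assumed because the main script imports from it):

import numpy as np
from neuralnetwork import NNetwork   # assuming the class above lives in neuralnetwork.py

# A 4-input, single-output network, as used for Cart Pole.
net = NNetwork(4, 1)

# Total number of evolvable weights, including one bias: (4 + 1) * 1 = 5.
print(NNetwork.getTotalWeights(4, 1))   # 5

# The GA evolves a flat list of 5 numbers and loads it into the network.
net.set_weights([0.1, -0.3, 0.5, 0.2, 0.0])

# predict() returns an array holding 0 or 1; int(...) turns it into the CartPole action.
observation = np.array([0.02, -0.01, 0.03, 0.04])
print(int(net.predict(observation.reshape(1, -1))))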

Mountain Car

The goal of the game is to drive the car up the hill.

The goal of training the agent is to learn to choose actions a, depending on the current state of the environment s, so as to maximize the total reward r.

At regular intervals, one of the following actions can be applied to the car:

0 indicates acceleration to the left, 1 coasting (no acceleration), and 2 acceleration to the right.
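For orientation, the observation for MountainCar contains just two numbers, the car's position and velocity, and the episode ends either when the car reaches the flag at position 0.5 or when the 200-step limit runs out; the fitness function below relies on both facts. A small sketch (my own illustration):

import gym

env = gym.make('MountainCar-v0')
observation = env.reset()
print(observation)   # e.g. [-0.52  0.  ]  ->  [position, velocity]

# Always accelerating to the right (action 2) is not enough to climb the hill,
# so the episode only ends when the 200-step limit is reached.
done = False
steps = 0
while not done:
    observation, reward, done, info = env.step(2)
    steps += 1

print(steps, observation[0])   # 200 steps, final position still short of the flag at 0.5
env.close()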

Here is the code:

import gym
import matplotlib.pyplot as plt
import numpy as np
from deap import creator
from deap import base, algorithms
import algelitism                   # local module with the elitism-aware EA loop
from deap import tools
import random


env = gym.make('MountainCar-v0')

LENGTH_CHROM = 200   # chromosome = a fixed sequence of 200 actions
POPULATION_SIZE = 50
P_CROSSOVER = 0.9
P_MUTATION = 0.2
MAX_GENERATIONS = 150
HALL_OF_FAME_SIZE = 3

hof = tools.HallOfFame(HALL_OF_FAME_SIZE)

RANDOM_SEED = 42
random.seed(RANDOM_SEED)

creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", list, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
toolbox.register("randomAction", random.randint, 0, 2)
toolbox.register("individualCreator", tools.initRepeat, creator.Individual, toolbox.randomAction, LENGTH_CHROM)
toolbox.register("populationCreator", tools.initRepeat, list, toolbox.individualCreator)
population = toolbox.populationCreator(n=POPULATION_SIZE)


def getCarScore(individual):
    # Fitness (minimized): a negative bonus for reaching the flag early,
    # otherwise the remaining distance to the flag.
    FLAG_LOCATION = 0.5
    observation = env.reset()
    actionCounter = 0

    for action in individual:
        actionCounter += 1
        observation, reward, done, info = env.step(action)
        if done:
            break

    if actionCounter < LENGTH_CHROM:
        # Reached the flag early: the fewer steps used, the lower (better) the score.
        score = 0 - (LENGTH_CHROM - actionCounter) / LENGTH_CHROM
    else:
        # Did not reach the flag: score is the distance left to the flag.
        score = abs(observation[0] - FLAG_LOCATION)

    return score,


toolbox.register("evaluate", getCarScore)
toolbox.register("select", tools.selTournament, tournsize=2)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutUniformInt, low=0, up=2, indpb=1.0/LENGTH_CHROM)

stats = tools.Statistics(lambda ind: ind.fitness.values)
stats.register("min", np.min)
stats.register("avg", np.mean)


population, logbook = algelitism.eaSimpleElitism(population, toolbox,
                                                 cxpb=P_CROSSOVER,
                                                 mutpb=P_MUTATION,
                                                 ngen=MAX_GENERATIONS,
                                                 halloffame=hof,
                                                 stats=stats,
                                                 verbose=True)

minFitnessValues, meanFitnessValues = logbook.select("min", "avg")

best = hof.items[0]
print(best)

plt.plot(minFitnessValues, color='red')
plt.plot(meanFitnessValues, color='green')
plt.xlabel('Generation')
plt.ylabel('Min/average fitness')
plt.title('Dependence of minimum and average fitness on generation')
plt.show()

# Replay the best action sequence found by the GA.
observation = env.reset()

for action in best:
    env.step(action)
    env.render()

env.close()

Freeway

In the next two games, I used the RL algorithm A2C:

In the field of Reinforcement Learning, the Advantage Actor Critic (A2C) algorithm combines two families of Reinforcement Learning algorithms: policy-based and value-based. Policy-based agents directly learn a policy (a probability distribution over actions) mapping input states to output actions.

Recall the policy gradient:
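In its standard (REINFORCE-style) form, the gradient of the expected return J(θ) with respect to the policy parameters θ can be written as

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \,\big],

where G_t is the return collected from time step t. In A2C, the return G_t is replaced by an advantage estimate A(s_t, a_t) = Q(s_t, a_t) − V(s_t), computed with the help of the learned critic V(s), which reduces the variance of the gradient estimate.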

Actor-critic methods are a popular class of deep reinforcement learning algorithms, and a solid grasp of them is essential for understanding the current research frontier. The term “actor-critic” is best thought of as a framework, or a class of algorithms, satisfying the criterion that there exist parameterized actors and critics.
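As a tiny numerical sketch of the idea (the numbers and the one-step advantage form are my own illustration, not taken from the Stable-Baselines3 internals):

gamma = 0.99

# One transition (s, a, r, s') and the critic's value estimates for s and s'.
reward = 1.0          # reward received after taking action a in state s
value_s = 2.5         # critic's estimate V(s)
value_s_next = 3.0    # critic's estimate V(s')

# One-step advantage estimate: A(s, a) ≈ r + gamma * V(s') - V(s).
advantage = reward + gamma * value_s_next - value_s
print(advantage)      # 1.47

# The actor is nudged in the direction of grad(log pi(a|s)) * advantage,
# while the critic is regressed toward the target r + gamma * V(s').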

import gym
from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_atari_env
import os


def make_env():
    # Play a few episodes with random actions to get a feel for the environment.
    env = gym.make('Freeway-v0')
    episodes = 5
    for episode in range(1, episodes + 1):
        state = env.reset()
        done = False
        score = 0

        while not done:
            env.render()
            action = env.action_space.sample()
            n_state, reward, done, info = env.step(action)
            score += reward
        print('Episode:{} Score:{}'.format(episode, score))
    env.close()


def train_model():
    # Train A2C on 4 parallel Atari environments with 4 stacked frames.
    env = make_atari_env('Freeway-v0', n_envs=4, seed=0)
    env = VecFrameStack(env, n_stack=4)
    log_path = os.path.join('Training1', 'Logs1')
    model = A2C("CnnPolicy", env, verbose=1, tensorboard_log=log_path)
    model.learn(total_timesteps=1500000)
    a2c_path = os.path.join('Training1', 'Saved Models1', 'A2C_model1')
    model.save(a2c_path)


def test_model():
    # Load the saved model, evaluate it, and then let it play indefinitely.
    a2c_path = os.path.join('Training1', 'Saved Models1', 'A2C_model1')
    env = make_atari_env('Freeway-v0', n_envs=1, seed=0)
    env = VecFrameStack(env, n_stack=4)
    model1 = A2C.load(a2c_path, env)
    evaluate_policy(model1, env, n_eval_episodes=5, render=True)

    obs = env.reset()
    while True:
        action, _states = model1.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()
    env.close()


if __name__ == "__main__":
    # make_env()
    # train_model()
    test_model()

Here I trained the model for 1,500,000 timesteps and saved the resulting model to use for testing.

The maximum score in Freeway is 32, and the trained model scores between 19 and 23.

Ice Hockey

In this game, hockey is played by two teams of two. The objective is to put more goals into the opponent's net.

import gym
from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_atari_env
import os


def make_env():
    # Play a few episodes with random actions to get a feel for the environment.
    env = gym.make('IceHockey-v0')
    episodes = 5
    for episode in range(1, episodes + 1):
        state = env.reset()
        done = False
        score = 0

        while not done:
            env.render()
            action = env.action_space.sample()
            n_state, reward, done, info = env.step(action)
            score += reward
        print('Episode:{} Score:{}'.format(episode, score))
    env.close()


def train_model():
    # Train A2C on 4 parallel Atari environments with 4 stacked frames.
    env = make_atari_env('IceHockey-v0', n_envs=4, seed=0)
    env = VecFrameStack(env, n_stack=4)
    log_path = os.path.join('Training', 'Logs')
    model = A2C("CnnPolicy", env, verbose=1, tensorboard_log=log_path)
    model.learn(total_timesteps=1000000)
    a2c_path = os.path.join('Training', 'Saved Models', 'A2C_model')
    model.save(a2c_path)


def test_model():
    # Load the saved model, evaluate it, and then let it play indefinitely.
    a2c_path = os.path.join('Training', 'Saved Models', 'A2C_model')
    env = make_atari_env('IceHockey-v0', n_envs=1, seed=0)
    env = VecFrameStack(env, n_stack=4)
    model1 = A2C.load(a2c_path, env)
    evaluate_policy(model1, env, n_eval_episodes=5, render=True)

    obs = env.reset()
    while True:
        action, _states = model1.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()
    env.close()


if __name__ == "__main__":
    # make_env()
    # train_model()
    test_model()

Here the total number of training timesteps is one million.

As you can see, both teams score the same number of points in the first three minutes. If you extend the game time, the agent’s performance will get even better.

Conclusions

To conclude, we used genetic algorithms and the A2C reinforcement learning algorithm to build AI models that can play the four games above.

Thank you for your attention!
