Reinforcement Learning for Atari Video Games: A Deep Q-Network Approach

Umar Kabir
9 min read · Jun 24, 2023

Introduction

Reinforcement Learning (RL) has shown remarkable success in training Artificial Intelligence (AI) agents to excel at complex tasks. In this article, we explore the application of RL and the Deep Q-Network (DQN) algorithm to tackle the challenge of building AI agents capable of playing Atari video games.

Atari games provide a diverse and challenging environment, making them an ideal testbed for RL algorithms. By leveraging OpenAI Gym, a popular RL framework, we develop models for three environments: CartPole (strictly a classic-control task rather than an Atari game, but a useful warm-up) and the Atari games Space Invaders and Pacman.

Through this article, we delve into the underlying concepts of RL and DQN, discuss our approach in training the models, present the results obtained, and explore potential avenues for future improvement. Join us on this exciting journey as we unlock the potential of AI in conquering the world of Atari video games.

Background: Reinforcement Learning and Deep Q-Networks

Reinforcement Learning (RL) is a branch of machine learning that focuses on training agents to make sequential decisions in an environment to maximize cumulative rewards. RL algorithms learn through interaction with the environment, receiving feedback in the form of rewards or penalties.

One popular RL algorithm is the Deep Q-Network (DQN), which combines RL with deep neural networks. DQN revolutionized the field by introducing a neural network architecture to approximate the action-value function, known as Q-values. The Q-values represent the expected rewards for taking different actions in a given state.
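Concretely, the Q-value of a state-action pair is the expected discounted return from taking that action and acting optimally thereafter:

Q(s, a) = E[ r_t + γ·r_{t+1} + γ²·r_{t+2} + … | s_t = s, a_t = a ]

where γ (gamma) is the discount factor. During training, the network's estimate of Q(s, a) is pulled towards the one-step target r + γ·max_a′ Q(s′, a′), which is the target used in the code later in this article.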

The DQN algorithm utilizes experience replay, where past experiences are stored in a replay memory and randomly sampled to break correlations in the data. This technique facilitates more stable and efficient learning.
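As a minimal sketch of the idea (the code later in this article keeps its memory in a plain Python list, but the principle is the same), a replay memory can be a bounded deque that is sampled uniformly at random; the capacity and batch size below are illustrative values, not the ones used later:

import random
from collections import deque

# Bounded replay memory: old experiences are discarded once capacity is reached
replay_memory = deque(maxlen=10000)

def remember(state, action, reward, next_state, done):
    replay_memory.append((state, action, reward, next_state, done))

def sample_batch(batch_size=32):
    # Uniform random sampling breaks the correlation between consecutive time steps
    return random.sample(replay_memory, batch_size)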

By training a DQN on a sequence of states, actions, rewards, and next states, the agent learns to make optimal decisions in complex environments with high-dimensional input spaces, such as Atari video games.

In the next sections, we will explore how DQN and RL concepts are applied to build AI agents that can play Atari games effectively.

Understanding Atari Video Games and OpenAI Gym

Atari video games, released in the late 1970s and 1980s, serve as a benchmark for testing AI algorithms. These games provide diverse challenges, requiring skills such as decision-making, timing, and strategic thinking.

OpenAI Gym is a popular Python library that provides a standardized environment for developing and evaluating RL algorithms. It includes a wide range of Atari games, allowing researchers and developers to train AI agents on these games.

By leveraging OpenAI Gym, we can access the game environments, retrieve observations, and take actions using RL algorithms. This integration enables us to train AI agents to learn and improve their performance in playing Atari video games.
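To make this concrete, here is a minimal interaction loop using the classic (pre-0.26) Gym API assumed throughout this article, with a random policy standing in for a trained agent:

import gym

env = gym.make('CartPole-v0')

state = env.reset()          # initial observation
done = False
total_reward = 0

while not done:
    action = env.action_space.sample()                 # random action instead of a learned policy
    next_state, reward, done, info = env.step(action)  # classic API: obs, reward, done, info
    total_reward += reward
    state = next_state

env.close()
print(f"Episode finished with total reward {total_reward}")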

Problem Statement: Building an AI Agent to Play Atari Games

The goal of this project is to develop an AI agent that can surpass human-level performance in playing Atari video games. We aim to build models using reinforcement learning and the Deep Q-Network (DQN) approach to train the agents to learn optimal strategies and achieve high scores across a variety of Atari games.

Approach: Deep Q-Network (DQN) Algorithm

Our approach utilizes the Deep Q-Network (DQN) algorithm to train AI agents for playing Atari games. The key components of our approach are as follows:

  • Model Architecture

I designed a neural network model that takes the game state as input and predicts Q-values for each action. The model consists of multiple layers with ReLU activation, culminating in a linear output layer.

  • Experience Replay

We employ experience replay, storing the agent's experiences in a replay memory buffer. During training, we sample batches of experiences at random to break temporal correlations and improve learning stability.

  • Exploration vs. Exploitation Trade-off

We balance exploration (taking random actions) and exploitation (taking actions based on the learned Q-values) by gradually decreasing the exploration rate over time, allowing the agent to transition from exploratory behavior to more optimal actions (a quick sanity check on this schedule follows this list).

  • Training and Evaluation Process

The agent was trained over a sequence of game episodes, updating the Q-values iteratively. After training, the agent’s performance was evaluated by measuring the average reward achieved during gameplay.
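As a quick back-of-the-envelope check on the exploration schedule mentioned above (using the hyperparameters from the DQN class shown below: epsilon starts at 1.0, is multiplied by 0.995 at each decay step, and is floored at 0.01):

import math

epsilon_decay, epsilon_min = 0.995, 0.01

# Number of decay steps before epsilon reaches its floor: 0.995**n <= 0.01
steps_to_min = math.ceil(math.log(epsilon_min) / math.log(epsilon_decay))
print(steps_to_min)  # roughly 919 steps

If epsilon is decayed once per episode, as in the training loops below, then after only 10 episodes the agent is still acting almost entirely at random (epsilon ≈ 0.95), which is worth keeping in mind when interpreting the reported rewards.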

Implementation Details

  • CartPole: Balancing the Pole

In the CartPole game, the AI agent learns to balance a pole on a cart. The agent receives observations related to the cart’s position and velocity, as well as the pole’s angle and angular velocity. It must take actions to maintain balance and prevent the pole from falling.

  • Space Invaders: Alien Invasion

In the Space Invaders game, the AI agent faces an alien invasion. The agent receives pixel-based observations of the game screen and must learn to shoot down the invading aliens while avoiding their attacks.

  • Pacman: Navigating the Maze

In the Pacman game, the AI agent controls the iconic character Pacman in a maze filled with ghosts and pellets. The agent receives observations of the maze layout and must navigate through the maze, eat pellets, and avoid the ghosts.

For each game, I trained a separate model with the DQN algorithm so that each agent could learn an effective strategy for its environment.
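To get a feel for what each agent actually receives, the observation and action spaces of the three environments can be inspected directly (this assumes the Atari extras for Gym are installed; the environment IDs match the ones used in the evaluation code below):

import gym

for env_id in ['CartPole-v0', 'SpaceInvaders-v0', 'MsPacman-v0']:
    env = gym.make(env_id)
    print(env_id, env.observation_space, env.action_space)
    env.close()

# CartPole returns a 4-dimensional vector (cart position and velocity, pole angle and angular velocity),
# while Space Invaders and Ms. Pacman return raw RGB frames of shape (210, 160, 3).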

Code

Creating DQN Class

The DQN class represents a Deep Q-Network (DQN) agent for reinforcement learning. It uses a neural network model to approximate the Q-values of different actions in a given state. The DQN agent can remember past experiences, select actions based on an exploration-exploitation strategy, and update its model through experience replay.

# Shared imports for the code in this article
import random

import gym
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam


class DQN:
    def __init__(self, state_space, action_space):
        self.state_space = state_space    # shape of an observation
        self.action_space = action_space  # number of discrete actions
        self.memory = []                  # replay memory of past experiences
        self.gamma = 0.95                 # discount factor
        self.epsilon = 1.0                # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.model = self.build_model()

    def build_model(self):
        model = Sequential()
        model.add(Flatten(input_shape=self.state_space))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_space, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
        return model

    def remember(self, state, action, reward, next_state, done):
        """Store a single experience tuple in the replay memory."""
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        """Epsilon-greedy action selection."""
        if np.random.rand() <= self.epsilon:
            return np.random.randint(self.action_space)
        q_values = self.model.predict(state, verbose=0)
        return np.argmax(q_values[0])

    def experience_replay(self, batch_size):
        """Sample a random batch of experiences and update the Q-network."""
        if len(self.memory) < batch_size:
            return
        batch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in batch:
            # Bellman target: r + gamma * max_a' Q(s', a') for non-terminal transitions
            target = reward
            if not done:
                target = reward + self.gamma * np.amax(
                    self.model.predict(next_state, verbose=0)[0])
            q_values = self.model.predict(state, verbose=0)
            q_values[0][action] = target
            self.model.fit(state, q_values, epochs=1, verbose=0)

    def update_exploration_rate(self):
        """Decay epsilon towards its minimum value."""
        self.epsilon = max(self.epsilon * self.epsilon_decay, self.epsilon_min)

  • CartPole Training and Evaluation
# Training loop for CartPole
env = gym.make('CartPole-v0')
agent = DQN(env.observation_space.shape, env.action_space.n)

batch_size = 32
episodes = 10

for episode in range(episodes):
    state = env.reset()
    done = False
    total_reward = 0

    while not done:
        env.render()

        # Add a batch dimension before feeding the state to the network
        state_input = np.expand_dims(state, axis=0)

        # Choose an action (epsilon-greedy)
        action = agent.act(state_input)

        # Take the action in the environment (classic Gym API: obs, reward, done, info)
        next_state, reward, done, _ = env.step(action)
        next_state_input = np.expand_dims(next_state, axis=0)

        # Store the experience in memory
        agent.remember(state_input, action, reward, next_state_input, done)

        # Update the state and accumulate the reward
        state = next_state
        total_reward += reward

    # Learn from a random batch of past experiences and decay the exploration rate
    agent.experience_replay(batch_size)
    agent.update_exploration_rate()

    print(f"Episode: {episode+1}, Total Reward: {total_reward}")

# Save the trained model so it can be reloaded for evaluation
agent.model.save('cartpole_dqn_model.h5')

# Evaluate the trained model
env = gym.make('CartPole-v0')

# Load the trained model
model = tf.keras.models.load_model('cartpole_dqn_model.h5')

# Evaluate the model for 10 episodes
total_reward = 0
for episode in range(10):
    state = env.reset()
    done = False
    while not done:
        # Reshape the state
        state = np.expand_dims(state, axis=0)

        # Get the action from the model
        action = np.argmax(model.predict(state))

        # Take the action in the environment
        next_state, reward, done, _ = env.step(action)

        state = next_state
        total_reward += reward

print(f"Average reward: {total_reward / 10}")

  • Space Invaders Training and Evaluation
# Training loop for Space Invaders
env = gym.make('SpaceInvaders-v0')
agent = DQN(env.observation_space.shape, env.action_space.n)

batch_size = 32
episodes = 10

for episode in range(episodes):
    state = env.reset()
    done = False
    total_reward = 0

    while not done:
        env.render()

        # Add a batch dimension before feeding the state to the network
        state_input = np.expand_dims(state, axis=0)

        # Choose an action (epsilon-greedy)
        action = agent.act(state_input)

        # Take the action in the environment (classic Gym API: obs, reward, done, info)
        next_state, reward, done, _ = env.step(action)
        next_state_input = np.expand_dims(next_state, axis=0)

        # Store the experience in memory
        agent.remember(state_input, action, reward, next_state_input, done)

        # Update the state and accumulate the reward
        state = next_state
        total_reward += reward

    # Learn from a random batch of past experiences and decay the exploration rate
    agent.experience_replay(batch_size)
    agent.update_exploration_rate()

    print(f"Episode: {episode+1}, Total Reward: {total_reward}")

# Save the trained model so it can be reloaded for evaluation
agent.model.save('spaceinvaders_dqn_model.h5')

# Evaluate the trained model
env = gym.make('SpaceInvaders-v0')

# Load the trained model
model = tf.keras.models.load_model('spaceinvaders_dqn_model.h5')

# Evaluate the model for 10 episodes
total_reward = 0
for episode in range(10):
    state = env.reset()
    done = False
    while not done:
        # Reshape the state
        state = np.expand_dims(state, axis=0)

        # Get the action from the model
        action = np.argmax(model.predict(state))

        # Take the action in the environment
        next_state, reward, done, _ = env.step(action)

        state = next_state
        total_reward += reward

print(f"Average reward: {total_reward / 10}")

  • Pacman Training and Evaluation
# Training loop for Pacman
env = gym.make('MsPacman-v0')
agent = DQN(env.observation_space.shape, env.action_space.n)

batch_size = 32
episodes = 10

for episode in range(episodes):
    state = env.reset()
    done = False
    total_reward = 0

    while not done:
        env.render()

        # Add a batch dimension before feeding the state to the network
        state_input = np.expand_dims(state, axis=0)

        # Choose an action (epsilon-greedy)
        action = agent.act(state_input)

        # Take the action in the environment (classic Gym API: obs, reward, done, info)
        next_state, reward, done, _ = env.step(action)
        next_state_input = np.expand_dims(next_state, axis=0)

        # Store the experience in memory
        agent.remember(state_input, action, reward, next_state_input, done)

        # Update the state and accumulate the reward
        state = next_state
        total_reward += reward

    # Learn from a random batch of past experiences and decay the exploration rate
    agent.experience_replay(batch_size)
    agent.update_exploration_rate()

    print(f"Episode: {episode+1}, Total Reward: {total_reward}")

# Save the trained model so it can be reloaded for evaluation
agent.model.save('pacman_dqn_model.h5')

# Evaluate the trained model
env = gym.make('MsPacman-v0')

# Load the trained model
model = tf.keras.models.load_model('pacman_dqn_model.h5')

# Evaluate the model for 10 episodes
total_reward = 0
for episode in range(10):
    state = env.reset()
    done = False
    while not done:
        # Reshape the state
        state = np.expand_dims(state, axis=0)

        # Get the action from the model
        action = np.argmax(model.predict(state))

        # Take the action in the environment
        next_state, reward, done, _ = env.step(action)

        state = next_state
        total_reward += reward

print(f"Average reward: {total_reward / 10}")

Discussion and Future Work

The approach of using the Deep Q-Network (DQN) algorithm to train AI agents for playing Atari games has yielded promising results. The trained models demonstrated clear progress in learning sensible strategies and improving their scores across the different games.

However, there are still opportunities for future improvements. One area of focus could be fine-tuning the model architectures and hyperparameters to enhance performance. Additionally, exploring more advanced RL algorithms, such as Double DQN or Dueling DQN, could lead to further improvements in the agents’ capabilities.
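As an illustration of what such a change involves, Double DQN keeps a separate, slowly updated target network: the online network selects the best next action, while the target network evaluates it, which reduces the overestimation of Q-values that plain DQN suffers from. A minimal sketch of that target computation (assuming an online_model and a target_model with the same architecture as the model above, and numpy imported as np) might look like this:

def double_dqn_target(online_model, target_model, reward, next_state, done, gamma=0.95):
    """Hypothetical sketch of the Double DQN one-step target; not part of the code above."""
    if done:
        return reward
    # next_state is assumed to already include a batch dimension, as in the loops above.
    # The online network chooses the action...
    best_action = np.argmax(online_model.predict(next_state, verbose=0)[0])
    # ...while the target network evaluates it
    next_q = target_model.predict(next_state, verbose=0)[0][best_action]
    return reward + gamma * next_q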

Furthermore, expanding the training dataset by including more Atari games and increasing the training duration could enhance the agents’ generalization and adaptability to new environments.

Overall, the approach sets a strong foundation, and with continued research and refinement, we can expect even more impressive results in building AI agents that excel at playing Atari video games.

Conclusion

In this project, I used reinforcement learning and the Deep Q-Network (DQN) approach to build models capable of playing video games, training a separate model for each of three games: CartPole, Space Invaders, and Pacman.

The results indicate that the models achieved varying levels of success in playing the games. While the Space Invaders model performed relatively well with an average reward of 9.4, the CartPole and Pacman models showed room for improvement with average rewards of 76.5 and 70.0, respectively.

Further exploration and experimentation with different architectures, hyperparameters, and training techniques may lead to enhanced performance and more optimal gameplay. By continuing to refine and iterate on the models, we can strive to build AI agents that surpass human-level performance in playing Atari video games.

