Reinforcement Learning for CartPole Using a Deep Q-Network with the Keras Library (Updated 2024)

Wiwat Pitakworarat
5 min read · Jun 28, 2024


This article is organized into 5 parts:

1. Introduction of DQN Keras Cartpole

The CartPole environment is a classic problem in the field of reinforcement learning, where the goal is to balance a pole on a moving cart. A Deep Q-Network (DQN) is a powerful method that uses deep learning to approximate the Q-value function, which helps in deciding the best action to take in each state. In this article, we will walk through the implementation of a DQN to solve the CartPole problem using Keras.
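At the core of DQN is the Q-learning target: the immediate reward plus the discounted value of the best action available in the next state. The following is a conceptual sketch only; the numbers are made up for illustration and are not part of the implementation later in this article.

import numpy as np

# Illustrative values: the reward for one step and the network's Q-value
# estimates for the two actions in the next state (made-up numbers).
reward = 1.0
discount_factor = 0.95
q_next = np.array([0.8, 1.2])

# The Q-learning target the network is trained to match for the chosen action.
target = reward + discount_factor * np.max(q_next)
print(target)  # 2.14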

2. Data for DQN Keras Cartpole

The CartPole environment is part of the OpenAI Gym toolkit. It provides a simple interface to interact with the environment and collect data for training our DQN model. The data consists of states, actions, rewards, and next states, which we will use to train our model; a minimal sketch of collecting one such transition follows the list below.

  • State: A 4-dimensional vector representing the position and velocity of the cart and the angle and angular velocity of the pole.
  • Action: Either move the cart left or right.
  • Reward: +1 for every time step the pole remains upright.
  • Next State: The state of the environment after taking the action.
  • Done: A boolean indicating whether the episode has ended.
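As a quick illustration, here is a minimal sketch of collecting a single transition from the environment, assuming a Gym version (0.26 or newer) whose reset() returns (state, info) and whose step() returns five values, as the code later in this article expects:

import gym

env = gym.make('CartPole-v1')
state, info = env.reset()
action = env.action_space.sample()  # a random action: 0 (left) or 1 (right)
next_state, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated      # the "Done" flag described above
print(state, action, reward, next_state, done)
env.close()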

3. Platform Using VSCode and Python

For this project, we will use Visual Studio Code (VSCode) as our Integrated Development Environment (IDE) and Python as our programming language. VSCode provides a robust environment with extensions for Python that simplify development and debugging.

4. Implementation of DQN Keras Cartpole

Step 1: Import Libraries

import gym
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Dense
from tensorflow.keras import optimizers
from collections import deque
import random

Step 2: Set Up Logging

import logging
import os

# Define the log folder and file name
log_folder = 'logs'
log_file = 'batch.log'

# Create the log folder if it doesn't exist
if not os.path.exists(log_folder):
    os.makedirs(log_folder)

# Define the full path to the log file
log_path = os.path.join(log_folder, log_file)

# Create a logger with the name of the current module
logger = logging.getLogger(__name__)

# Specify a file handler that writes logs to the log file, in write mode ('w')
file_handler = logging.FileHandler(log_path, mode='w')  # 'w' stands for write

# Set the logging level for the file handler
file_handler.setLevel(logging.INFO)  # Set to the desired level (e.g., INFO or WARNING)

# Define the format of the log messages
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
file_handler.setFormatter(formatter)

# Clear existing handlers (if any) to avoid duplicate log entries
logger.handlers = []

# Attach the file handler to the logger
logger.addHandler(file_handler)

# Example of logging a WARNING message
logger.warning("-------- Start --------")

Step 3: Create Neural Network

#-- Inherit from the Keras Sequential class
class NeuralNetwork(Sequential):
    #-- takes the number of observations, the number of actions, and the learning rate
    def __init__(self, observation_space, action_space, learning_rate):
        #-- super().__init__() builds the underlying Sequential (deep learning) model
        super().__init__()
        #-- input layer plus first hidden layer: 24 nodes, ReLU activation
        self.add(Dense(24, input_shape=(observation_space,), activation="relu"))
        #-- second hidden layer: 24 nodes, ReLU activation
        self.add(Dense(24, activation="relu"))
        #-- output layer: one linear Q-value per action
        self.add(Dense(action_space, activation="linear"))
        self.compile(loss='mse', optimizer=optimizers.Adam(learning_rate=learning_rate))

    #-- state = X (data), target_output = Y (label)
    def train(self, state, target_output):
        self.fit(state, target_output, epochs=1, verbose=0)

    #-- predict the expected reward (Q-value) for each action in a given state
    def predict_expected_reward_for_each_action(self, state):
        return self.predict(state)
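Before wiring this into the agent, you can sanity-check the network on a dummy state. This is just a sketch: the state values below are made up, and the printed Q-values will be arbitrary because the network is untrained.

network = NeuralNetwork(observation_space=4, action_space=2, learning_rate=0.001)
sample_state = np.reshape([0.0, 0.0, 0.05, 0.0], [1, 4])
q_values = network.predict_expected_reward_for_each_action(sample_state)
print(q_values.shape)  # (1, 2): one Q-value per action
print(q_values)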

Step 4: Create Agent

class Agent():
    def __init__(self, epsilon_initial=0.5):
        #-- replay memory: stores up to 10 million transitions
        self.memory = deque(maxlen=10000000)
        self.batch_size = 20
        self.learning_rate = 0.001
        #-- discount factor
        self.discount_factor = 0.95
        self.epsilon = epsilon_initial
        #-- decay factor for reducing epsilon over time
        self.decay_factor = 0.99
        #-- lists used inside the class methods to store data for later inspection
        self.reward_for_each_episode = []
        self.batch_length_store = []
        self.batch_store = []
        self.state_store = []
        self.state_next_store = []
        self.action_store = []
        self.reward_store = []
        self.qupdate_store = []
        self.qvalue_before_adj_store = []
        self.qvalue_after_adj_store = []
        self.terminal_store = []
        self.neural_store = []
        self.epsilon_store = []
        self.q_values_train_store = 0
        self.run_count = 0
        self.reward_pergame_store = []
        #-- neural network: 4 observations in, 2 actions out
        self.neural_network = NeuralNetwork(4, 2, self.learning_rate)

    def play(self, env, number_of_episode=10, isRender=False):
        # def play(self, env, number_of_episode=3000):
        for i_episode in range(number_of_episode):
            if self.epsilon < 0.025:
                print("Epsilon is lower than the threshold")
                print("Last episode run:", i_episode)
                break
            print("Episode {} of {}".format(i_episode + 1, number_of_episode))

            #-- the state is a 4-value observation
            state, info = env.reset()
            #-- reshape the state into a 1x4 array
            state = np.reshape(state, [1, 4])

            # self.epsilon *= self.decay_factor

            total_reward = 0
            end_game = False
            truncated = False

            while not (end_game or truncated):
                if isRender:
                    env.render()
                if self.__with_probability(self.epsilon):
                    #-- explore: choose a random action
                    action = self.__getActionByRandomly(env)
                else:
                    #-- exploit: choose the action with the highest expected reward
                    action = self.__getActionWithHighestExpectedReward(state)

                #-- both end_game and truncated are needed to end the loop
                new_state, reward, end_game, truncated, info = env.step(action)
                new_state = np.reshape(new_state, [1, 4])
                if end_game or truncated:
                    reward = -200
                    self.run_count += 1
                else:
                    total_reward += reward
                self.reward_pergame_store.append([i_episode, reward])
                #-- store the transition in memory
                self.remember(state, action, reward, new_state, end_game or truncated)
                state = new_state
                self.experience_replay()
            self.reward_for_each_episode.append(total_reward)

    def __with_probability(self, probability):
        return np.random.random() < probability

    def __getActionByRandomly(self, env):
        return env.action_space.sample()

    def __getActionWithHighestExpectedReward(self, state):
        return np.argmax(self.neural_network.predict_expected_reward_for_each_action(state)[0])

    def __getExpectedReward(self, state):
        return np.max(self.neural_network.predict_expected_reward_for_each_action(state)[0])

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    #-- training function
    def experience_replay(self):
        #-- do nothing until at least batch_size transitions are stored
        if len(self.memory) < self.batch_size:
            return
        #-- sample a random batch from memory
        batch = random.sample(self.memory, self.batch_size)

        #-- log the batch
        batch_length = len(batch)
        self.batch_length_store.append(batch_length)
        self.batch_store.append(batch)
        logger.warning(f"length of batch is : {batch_length}")

        for state, action, reward, state_next, terminal in batch:
            #-- keep the sampled data for later inspection
            self.state_store.append(state)
            self.state_next_store.append(state_next)
            self.action_store.append(action)
            self.reward_store.append(reward)
            self.terminal_store.append(terminal)

            #-- Q-learning target: reward plus discounted best future value
            q_update = reward
            if not terminal:
                q_update = reward + self.discount_factor * self.__getExpectedReward(state_next)
            self.qupdate_store.append([q_update, action])
            #-- current Q-values predicted by the neural network
            q_values = self.neural_network.predict_expected_reward_for_each_action(state)
            q_values_copy = q_values.copy()
            self.qvalue_before_adj_store.append(q_values_copy)
            #-- replace the Q-value of the taken action with the target, then train
            q_values[0][action] = q_update
            self.neural_network.train(state, q_values)
            self.qvalue_after_adj_store.append(q_values)
            self.q_values_train_store = q_values

        #-- decay epsilon after each replay step
        self.epsilon *= self.decay_factor
        self.epsilon_store.append(self.epsilon)
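One thing worth noting is how quickly exploration disappears with these settings. Epsilon starts at 0.5 and is multiplied by 0.99 after every call to experience_replay(), i.e. after every environment step once the memory holds at least 20 transitions, so it drops below the 0.025 cutoff used in play() after roughly 300 training steps. A quick back-of-the-envelope check:

# Rough check of how many replay steps it takes for epsilon to fall below 0.025
import math
epsilon_initial, decay_factor, cutoff = 0.5, 0.99, 0.025
steps = math.log(cutoff / epsilon_initial) / math.log(decay_factor)
print(round(steps))  # about 298 steps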

Step 5: Run the Agent

#-- gym.make('CartPole-v1', render_mode="human") makes training very slow
# env = gym.make('CartPole-v1', render_mode="human")
env = gym.make('CartPole-v1')
agent = Agent()
agent.play(env, 10, False)
# agent.play(env)
plt.title("Performance over time")
plt.ylabel('TotalReward')
plt.xlabel('Episode')
plt.plot(agent.reward_for_each_episode)
plt.show()
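If you want to actually watch the agent balance the pole, you can create a second environment with render_mode="human" (assuming Gym 0.26+, where this mode opens a window and draws every step automatically). This is only a sketch for visual inspection; keep in mind that play() keeps training and decaying epsilon while it runs:

# Watch a few episodes in a render window (much slower than headless training)
eval_env = gym.make('CartPole-v1', render_mode="human")
agent.play(eval_env, 3, False)
eval_env.close()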

Step 6: Check the Results

display("Q Update & Action :", agent.qupdate_store[-5:])
display("Q Value Before Update :", agent.qvalue_before_adj_store[-5:])
display("Q Value After Update :",agent.qvalue_after_adj_store[-5:])
display("Q Value Last Update :",agent.q_values_train_store[0])

5. Conclusion for DQN Keras Cartpole

Implementing a DQN to solve the CartPole problem demonstrates the power of combining reinforcement learning with deep learning. Using Keras, we can build and train a neural network to approximate the Q-value function effectively. This method can be extended to more complex environments and problems, showcasing the versatility of DQNs. By following the steps outlined in this article, you can experiment with different architectures and hyperparameters to further enhance the performance of your model. Happy coding!
