Application of a Genetic Algorithm for Policy Search in the OpenAI ‘CartPole-v1’ Environment

Rahul Kaplesh · Published in The Startup · Sep 28, 2020

Aim: To implement a genetic algorithm for policy search to solve the OpenAI ‘CartPole-v1’ environment. The ‘CartPole-v1’ environment consists of a pole balanced on a cart which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. A reward of +1 is awarded for every time-step the pole remains upright. An episode ends when the pole tilts more than 15 degrees from the vertical or the cart moves more than 2.4 units from the centre. The goal was to collect an average reward of 300 over a population of 100; this is taken as the population from which the best strategy is extracted.
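
To make the interface concrete, here is a minimal sketch of driving the environment with a random policy, using the same classic Gym API as the code later in this post. This is purely illustrative and not part of the author's solution:

import gym

env = gym.make('CartPole-v1')
state = env.reset()
total_reward = 0
done = False
while not done:
    action = env.action_space.sample()          # random action: 0 (push left) or 1 (push right)
    state, reward, done, _ = env.step(action)   # +1 reward for every step the pole stays up
    total_reward += reward
print(total_reward)                              # a random policy typically survives only a few steps
env.close()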

Tech Stack Used:

  1. OpenAI Gym: https://gym.openai.com/ — for simulating the CartPole environment in Python
  2. DEAP: https://github.com/deap/deap — for implementing the genetic algorithm for policy search
  3. PyTorch: https://pytorch.org/ — for implementing neural networks in Python

An untrained agent trying out the CartPole environment is shown below:

[Animation: Untrained agent for CartPole]

The code is given below:
import gym
import numpy as np
import torch
import matplotlib.pyplot as plt
from IPython import display

from agent import Agent

env = gym.make('CartPole-v1')
agent = Agent(env)
frames = []

state = env.reset()
img = plt.imshow(env.render(mode='rgb_array'))
frames.append(env.render(mode='rgb_array'))
for i in range(1, 200):
    state = torch.from_numpy(state).float()
    with torch.no_grad():
        action = agent.forward(state)
    next_state, reward, done, _ = env.step(np.argmax(action.numpy()))
    img.set_data(env.render(mode='rgb_array'))
    frames.append(env.render(mode='rgb_array'))
    plt.axis('off')
    text = plt.text(20, 0, 'Step-' + str(i))
    display.display(plt.gcf())
    display.clear_output(wait=True)
    text.set_visible(False)
    state = next_state
    if done:
        break

env.close()

The network used for the model was a fully connected network with a single hidden layer of 32 units. A genetic algorithm was applied to search for the weights and biases: following the cross-entropy method, the best weights of a generation are perturbed with Gaussian noise to generate the new population. New candidate weights are created as below:

def create_weights(best_Weights):
    weight = {}
    weight['fc1_W'] = best_Weights['fc1_W'] + (sigma * np.random.randn(agent.getfc1_W_size()))
    weight['fc1_b'] = best_Weights['fc1_b'] + (sigma * np.random.randn(agent.getfc1_b_size()))
    weight['fc2_W'] = best_Weights['fc2_W'] + (sigma * np.random.randn(agent.getfc2_W_size()))
    weight['fc2_b'] = best_Weights['fc2_b'] + (sigma * np.random.randn(agent.getfc2_b_size()))
    return weight
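
The Agent class itself is not listed in this post (it lives in the linked repository). A minimal sketch of what it could look like, assuming a PyTorch module with a single 32-unit hidden layer and the size helpers used by create_weights above:

import torch
import torch.nn as nn

class Agent(nn.Module):
    # Illustrative sketch only; the actual Agent is defined in the author's repository.
    def __init__(self, env, hidden_size=32):
        super().__init__()
        self.env = env
        state_size = env.observation_space.shape[0]    # 4 for CartPole
        action_size = env.action_space.n                # 2 for CartPole
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return torch.softmax(self.fc2(x), dim=-1)

    # Flat parameter sizes used when sampling Gaussian noise in create_weights()/mutate()
    def getfc1_W_size(self):
        return self.fc1.weight.numel()
    def getfc1_b_size(self):
        return self.fc1.bias.numel()
    def getfc2_W_size(self):
        return self.fc2.weight.numel()
    def getfc2_b_size(self):
        return self.fc2.bias.numel()

With softmax outputs, the np.argmax(action.numpy()) call in the rollout code above simply picks the greedy action.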

The fitness of a particular policy is evaluated by taking the mean reward over 100 episodes run with that policy. The evaluation operator is defined as:

def evaluation(individual):
    mean_reward = []
    for i in range(100):
        reward = agent.evaluate(individual, gamma, max_t)
        mean_reward.append(reward)
    return np.mean(mean_reward),
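
agent.evaluate() is also not shown in the post. A plausible sketch, assuming it loads the individual's flat weight vectors into the network and rolls out a single episode, returning the discounted return:

import numpy as np
import torch

# Intended as a method of the Agent class sketched above (illustrative only).
def evaluate(self, weights, gamma=1.0, max_t=1000):
    # Load the individual's flat weight vectors into the network layers
    self.fc1.weight.data = torch.from_numpy(weights['fc1_W'].reshape(self.fc1.weight.shape)).float()
    self.fc1.bias.data = torch.from_numpy(weights['fc1_b']).float()
    self.fc2.weight.data = torch.from_numpy(weights['fc2_W'].reshape(self.fc2.weight.shape)).float()
    self.fc2.bias.data = torch.from_numpy(weights['fc2_b']).float()

    # Roll out one episode and accumulate the discounted return
    episode_return = 0.0
    state = self.env.reset()
    for t in range(max_t):
        state = torch.from_numpy(state).float()
        with torch.no_grad():
            action = self.forward(state)
        state, reward, done, _ = self.env.step(np.argmax(action.numpy()))
        episode_return += reward * (gamma ** t)
        if done:
            break
    return episode_return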

The crossover and mutation operators over the individuals' weights are defined as:

def crossover(ind1, ind2):
    # Copy the first layer from ind1 into ind2 and the second layer from ind2 into ind1
    ind2['fc1_W'] = ind1['fc1_W']
    ind2['fc1_b'] = ind1['fc1_b']
    ind1['fc2_W'] = ind2['fc2_W']
    ind1['fc2_b'] = ind2['fc2_b']
    return ind1, ind2

def mutate(individual):
    # Re-sample one randomly chosen layer around the current best weights
    num = random.choice([1, 2])
    if num == 1:
        individual['fc1_W'] = best_Weights['fc1_W'] + (sigma * np.random.randn(agent.getfc1_W_size()))
        individual['fc1_b'] = best_Weights['fc1_b'] + (sigma * np.random.randn(agent.getfc1_b_size()))
    if num == 2:
        individual['fc2_W'] = best_Weights['fc2_W'] + (sigma * np.random.randn(agent.getfc2_W_size()))
        individual['fc2_b'] = best_Weights['fc2_b'] + (sigma * np.random.randn(agent.getfc2_b_size()))
    return individual,
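
On the DEAP side, these operators have to be registered with a toolbox. A minimal sketch of how that registration could look; the individual and population creation shown here is an assumption built on create_weights, not necessarily the author's exact setup:

from deap import base, creator, tools

# Maximise a single-objective fitness (mean episode reward)
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", dict, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
# An individual is a dict of weight arrays sampled around the current best weights
toolbox.register("individual", tools.initIterate, creator.Individual,
                 lambda: create_weights(best_Weights))
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

toolbox.register("evaluate", evaluation)   # mean reward over 100 episodes
toolbox.register("mate", crossover)        # exchange whole layers between individuals
toolbox.register("mutate", mutate)         # re-sample one layer around the best weights
toolbox.register("select", tools.selBest)  # keep the K fittest individuals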

The pseudocode for the genetic algorithm as implemented is:

for i in NUM_GEN:
    SELECT the K best individuals out of the initial population
    SET the best weights to the mean weights of this population
    APPLY crossover and mutation with some probability over this population
    CREATE new individuals out of the best weights
    ASSIGN fitness
    IF average fitness of the population > 400.0:
        TERMINATE the generation loop
SELECT the best weights from the last generation
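
Stitched together with the DEAP toolbox sketched earlier, the generation loop could look roughly as follows. The hyperparameters (NUM_GEN, K, CXPB, MUTPB, POP_SIZE) are assumptions rather than the author's exact settings:

import random
import numpy as np

NUM_GEN, K = 50, 10        # assumed values
CXPB, MUTPB = 0.5, 0.2     # assumed crossover / mutation probabilities
POP_SIZE = 50

pop = toolbox.population(n=POP_SIZE)
for ind in pop:
    ind.fitness.values = toolbox.evaluate(ind)

for gen in range(NUM_GEN):
    # SELECT the K best individuals and recentre best_Weights on their mean
    elite = toolbox.select(pop, K)
    for key in best_Weights:
        best_Weights[key] = np.mean([ind[key] for ind in elite], axis=0)

    # APPLY crossover and mutation with some probability to clones of the elite
    offspring = [toolbox.clone(ind) for ind in elite]
    for ind1, ind2 in zip(offspring[::2], offspring[1::2]):
        if random.random() < CXPB:
            toolbox.mate(ind1, ind2)
    for ind in offspring:
        if random.random() < MUTPB:
            toolbox.mutate(ind)

    # CREATE new individuals around the updated best weights and ASSIGN fitness
    offspring += toolbox.population(n=POP_SIZE - len(offspring))
    for ind in offspring:
        ind.fitness.values = toolbox.evaluate(ind)
    pop = offspring

    # TERMINATE once the population's average fitness crosses the target
    if np.mean([ind.fitness.values[0] for ind in pop]) > 400.0:
        break

best_policy = toolbox.select(pop, 1)[0]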

The generation-wise fitness statistics are as follows:

gen  evals  std           min   avg     max
0    50     4.49639856    16.0  19.68   44.0
1    50     41.98026203   16.0  30.24   256.0
2    50     59.20237833   16.0  46.72   256.0
3    50     127.49036669  16.0  82.92   816.0
4    50     175.55573018  16.0  129.84  816.0
5    50     224.06677576  16.0  152.0   1000.0
6    50     241.28171418  16.0  189.88  1000.0
7    50     339.65140954  16.0  305.8   1000.0
8    50     424.66975216  16.0  401.96  1000.0
Environment solved in 8 generations!
Environment solved in 87.6688 seconds

The final agent's performance in the CartPole environment is shown below:

[Animation: Trained Agent for CartPole-v1]

The full working code can be found in the repository linked below.

Further work which can be done:

Performance can easily be improved with this policy-search approach by running the fitness calculation for each individual in separate processes; the DEAP library supports this, though it has not been attempted here. Simulated annealing and adaptive noise scaling could also be applied to the noise added to the weights when forming the population, for further improvement. A minimal sketch of the parallel evaluation is given below.
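
DEAP lets you replace its built-in map with a multiprocessing pool's map, so that evaluate calls are dispatched across worker processes. A minimal sketch, assuming the toolbox and population from the sketches above:

import multiprocessing

if __name__ == "__main__":
    pool = multiprocessing.Pool()
    toolbox.register("map", pool.map)   # evaluations now run across worker processes

    # Fitness evaluation then becomes a single parallel call
    fitnesses = list(toolbox.map(toolbox.evaluate, pop))
    for ind, fit in zip(pop, fitnesses):
        ind.fitness.values = fit

    pool.close()
    pool.join()

Note that the evaluation function and the individuals must be picklable for the pool to ship them to the worker processes.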
