Explaining Evolution Strategies for Reinforcement Learning

Parminder Singh
Published in We Are Orb · Apr 21, 2017
Today we are evolving some agents!

Sometimes simple ideas give great results!

Recently OpenAI, a non-profit co-founded by Elon Musk, demonstrated that Evolution Strategies, earlier used mostly as a plain optimisation technique, can be used to train agents that learn to interact with an environment, reinforcement-learning style. With enough parallel workers it learned the environments in far less wall-clock time than DeepMind’s state-of-the-art A3C, and it is much more scalable because very little communication is needed between the workers.

I loved the idea and thought about implementing it myself, since I had been planning to build an RL agent for a long time, but all the other algorithms are too performance-intensive for my low-on-specs PC.

To follow what comes afterwards, you need to know these:

  1. Reinforcement Learning:
    Watch this video to see a real-life application of reinforcement learning; it is much better than reading articles.
  2. A little bit of C++, because that is what I used for the code. It will still have a Python-ish style, but those curly braces are going to stay!

OK, so what is an Evolution Strategy?
The way I understand it: you have an agent with random weights. You add noise to those weights to get several noisy copies, let each copy try the task, and collect the rewards. The main weights are then shifted towards the noise with the best performance. The whole process is repeated again and again, and each iteration is called a Generation.

Generation by Generation the weights keep getting better, and after several Generations we get Jarvis! Well... not really, but we do get an agent that shows some charisma while playing, rather than those random click-any-button bots.
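If you prefer code to words, here is a minimal, self-contained sketch of that loop on a toy problem I made up purely for illustration: moving a random weight matrix towards a fixed target matrix. The constants, the evaluate lambda and the baseline subtraction are my own choices for the sketch; the real agent we build below plays a Gym game instead.

// toy_es.cpp : a minimal, self-contained sketch of the Evolution Strategy loop.
// The "task" here is a toy: move a random weight matrix towards a fixed target.
#include <armadillo>
#include <iostream>
#include <vector>

using namespace std;
using namespace arma;

int main()
{
  const size_t rows = 4, cols = 8;       // stand-ins for num_actions / inputs
  const size_t num_workers = 5;
  const size_t num_generations = 200;
  const double sigma = 0.1;              // noise multiplier
  const double alpha = 0.02;             // learning rate

  mat target  = randu(rows, cols);       // the toy goal
  mat weights = randu(rows, cols);       // the agent starts with random weights

  // Reward: the closer the weights are to the target, the higher the reward.
  auto evaluate = [&](const mat& w) { return -accu(square(w - target)); };

  cout << "initial distance: " << -evaluate(weights) << endl;

  for (size_t gen = 0; gen < num_generations; gen++)
  {
    vec rewards(num_workers);
    vector<mat> noises(num_workers);

    // Each worker tries the task with its own noisy copy of the weights.
    for (size_t i = 0; i < num_workers; i++)
    {
      noises[i]  = randn(rows, cols);
      rewards[i] = evaluate(weights + sigma * noises[i]);
    }

    // Shift the weights towards the noises that earned above-average reward.
    double baseline = mean(rewards);
    mat update = zeros(rows, cols);
    for (size_t i = 0; i < num_workers; i++)
      update += (rewards[i] - baseline) * noises[i];

    weights += alpha * update / (num_workers * sigma);
  }

  // The final distance should be noticeably smaller than the initial one.
  cout << "final distance:   " << -evaluate(weights) << endl;
  return 0;
}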

Now let’s start working on a solution!

#include <iostream>
#include <vector>
#include <string>
#include <armadillo>
#include "../environment.hpp"
// Convenience for eyes
using namespace std;
using namespace arma;
using namespace gym;

Just a bunch of headers. The Armadillo library gives us mathematical functions for matrices that keep the code short. environment.hpp is a helper from gym_tcp_api, a nice interface for interacting with OpenAI’s Gym from any language; I used C++ here because the API already contains most of the code for C++.

Now let’s define the class and its constructor,

class EvolutionStrategyAgent
{
 public:
  mat model;
  size_t frameW, frameH, num_actions;

  EvolutionStrategyAgent(size_t frameW,
                         size_t frameH,
                         size_t num_actions) :
      frameW(frameW),           // Member initializer list: copy the
      frameH(frameH),           // constructor arguments into the members.
      num_actions(num_actions)
  {
    model = randu(num_actions, frameW * frameH * 3);
  }

As our agent will use the game’s frame to make decisions, we take the frame width, frame height and number of actions from the user.
For simplicity, our model is just a 1-layer fully connected network, so we create a random matrix of dimensions [num_actions, frameW * frameH * 3]. The 3 is there because each pixel of the frame carries an RGB value.
Note that I used the uniform random generator randu, as I want the values to be evenly distributed between 0 and 1.

We can then get the decision for a frame with a single matrix multiplication (the numbers below are just illustrative dimensions),

ActionMatrix[6, 1] = Model[6, 10000] * Frame[10000, 1]
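In Armadillo that decision really is one multiplication. Here is a tiny standalone example with the same made-up dimensions; the Model and Frame below are just random placeholders:

#include <armadillo>
#include <iostream>

using namespace std;
using namespace arma;

int main()
{
  // Illustrative dimensions only: 6 actions, 10000 values in the flattened frame.
  mat   Model   = randu(6, 10000);      // the 1-layer model
  vec   Frame   = randu<vec>(10000);    // stands in for a flattened RGB frame

  vec   actions = Model * Frame;        // [6 x 1] score for each action
  uword best    = actions.index_max();  // the action with the largest score

  cout << "chosen action: " << best << endl;
  return 0;
}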

Now let’s make the Play function!

void Play(string environment,
          string host,
          string port)
{
  size_t num_workers = 5;
  double sigma = 1;      // Noise multiplier
  double alpha = 0.005;  // Learning rate
  size_t input = frameW * frameH * 3;

  mat workerRewards(num_workers, 1);
  vector<mat> epsilons(num_workers);

We defined some constants for the learning algorithm: we will use 5 workers in each generation. At every generation we create some noise and multiply it by sigma, which makes the noise larger (sigma > 1), smaller (sigma < 1) or leaves it unchanged (sigma == 1). alpha is the learning rate, which we will use later on. workerRewards is a matrix that will hold the reward we get from each worker, and epsilons is a vector of matrices that stores the noise used by each worker.

while(1) // Until I am bored!
{
  // Run the for loop in different threads.
  #pragma omp parallel for
  for(size_t i = 0; i < num_workers; i++)
  {
    mat epsilon = randn(num_actions, input);     // Random noise
    mat innerModel = model + (sigma * epsilon);  // The perturbed inner model
    epsilons[i] = epsilon;

Here we create the 5 workers, each in its own thread; thanks to OpenMP it took just one line: #pragma omp parallel for. (If you compile with g++ or clang, remember to enable OpenMP with the -fopenmp flag.)

Now we create epsilon using the random normal generator randn, with the same dimensions as the weights, because we need to add the two together afterwards.

A pause point here… why did I use a normal distribution?
The reason is that a normal distribution gives random numbers concentrated around 0. This protects the worker from jumping too far away; it can still happen, but the probability is lower. Learning is a long process :-) No shortcuts to learning!
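If you want to see that concretely, here is a tiny standalone check; the sample size of 100000 is an arbitrary choice:

#include <armadillo>
#include <iostream>

using namespace std;
using namespace arma;

int main()
{
  vec noise = randn<vec>(100000);   // standard normal noise

  // Most normal noise is small: roughly 68% lies within [-1, 1] and
  // roughly 95% within [-2, 2], so a perturbed model usually stays
  // close to the original one.
  cout << "fraction within [-1, 1]: "
       << accu(abs(noise) <= 1.0) / double(noise.n_elem) << endl;
  cout << "fraction within [-2, 2]: "
       << accu(abs(noise) <= 2.0) / double(noise.n_elem) << endl;
  return 0;
}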

OK, now we have the noise. Let’s multiply it by sigma and then create another matrix that will serve as the worker’s inner model: the main model plus the noise.

Environment env(host, port, environment);

env.compression(9);

// Create a folder for the agent's files.
string folder("./dummy/");
folder += to_string(i) + '/';

// Monitor its moves.
env.monitor.start(folder, true, false);
env.reset();
env.render();

Now we set up an environment for the worker and a folder where it will store the recording and game data.

double totalReward = 0;
while(1) // Until the episode is complete
{
  mat action = innerModel * vectorise(env.observation);
  mat maxAction = action.index_max();

  env.step(maxAction);
  totalReward += env.reward;

  if (env.done)
  {
    break;
  }
}

env.close();
workerRewards[i] = totalReward;

This is the thinking part of the model, and it is quite easy on the eyes too!

  1. Get an action matrix by computing Model * Frame.
  2. Pick the action with the largest value.
  3. Perform that action.
  4. Get the reward and add it to the total reward.
  5. If the environment is done, stop playing; otherwise go back to 1.
  6. Close the environment and write the total reward into the reward matrix.

Wow! That was easy :P

Now comes the Optimisation part!

Once all the workers have finished, let’s combine the rewards with their epsilons,

mat sumRxEpsilon = zeros(num_actions, input);

for(size_t i = 0; i < num_workers; i++)
{
  mat stdReward = (workerRewards[i] - mean(workerRewards)) /
                   stddev(workerRewards);

  sumRxEpsilon += epsilons[i] * as_scalar(stdReward);
}

There is a catch here; we’ll get to it.

  1. First, make a matrix of the same dimensions as the model, but this time filled with zeros.
  2. We calculate the standardised reward: for rewards [2, 4, 6] we take the mean (4 here) and the standard deviation, i.e. roughly how far the values typically sit from the mean (2 here), and compute

    standardised reward = (reward - mean(rewards)) / stddev(rewards)

Subtracting the mean and dividing by the standard deviation turns [2, 4, 6] into [-1, 0, +1] (there is a quick numerical check of this right after the list). This ensures that the values are never too large: getting a reward of 500 when everyone is getting 100 is a similar situation to getting 5 when everyone is getting 1. A larger reward doesn’t automatically mean a larger jump; the larger jump should happen when all rewards are low and one of them shines, i.e. The One!

3. Then we multiply the standardised reward with that worker’s epsilon/noise. This magnifies the noise that earned a better reward, similar to the multiplication with sigma, but that one was uniform across all noises; here it depends on the reward of that noise.

4. At last, we add them all into one master matrix that now contains the path to victory, the path to light!
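Here is a quick, self-contained check of the standardisation step with the example rewards [2, 4, 6] from above (Armadillo’s stddev uses the sample standard deviation by default, which is exactly the 2 we computed):

#include <armadillo>

using namespace arma;

int main()
{
  vec rewards = {2, 4, 6};          // the example rewards from the text

  // Subtract the mean (4) and divide by the standard deviation (2).
  vec standardised = (rewards - mean(rewards)) / stddev(rewards);

  standardised.print("standardised rewards:");   // -1, 0, +1
  return 0;
}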

Why did we do this? you may ask... here comes the formula,

θ(t+1) = θ(t) + α · (1 / (n · σ)) · Σ F(j) · ε(j),  summing j over the workers 1 … n

Whoa!

This is the final step; after this, the agent will be able to learn! But maybe you are confused, scared of what is happening here. Don’t fear…

Let’s break it down! Ain’t no need to get scared of it!

Theta just means the model, and t just means the step. So we are adding something to the old model to get the next model!

Alpha (that α symbol), if you recall, is the learning rate. It is a very small number, so it shrinks whatever it multiplies. This keeps our steps small; otherwise we might jump from one hill to another without ever reaching the bottom of the hills where the true treasure lies.

That large Σ means: add up whatever is on its right for j = 1, 2, 3, …, n. Here n means num_workers! F(j) is the reward for worker j, and ε(j) is the epsilon/noise for that worker. We have done this step earlier and the sum matrix is waiting to be used.

We have a sum of n numbers; if we divide it by n, we get… yes… yessss! The mean! Also, remember we multiplied the epsilon by sigma? We have to take that scaling back out now, so we divide by sigma as well, hence sum / (n * sigma).

Putting it in simple words:
Add to the model (scaled down by the learning rate) the mean of reward × noise over the workers… duh… you were getting scared of such a simple thing.

In code it looks like this,

model = model + alpha * (sumRxEpsilon / (num_workers * sigma));

Now for sanity’s sake, let’s print the rewards so we know learning is happening,

cout << "Worker Rewards: ";
for(double reward : workerRewards){
cout << reward << " ";
}
cout << "\n";

Done!

Results:

They see me rolling! They hating!

This is a sample from the 4th generation’s recording. The agent learned to shoot to collect rewards and to move around a little to kill more aliens! In the reward log you can see that the workers got noticeably better overall, which shows that a little bit of learning has happened.

Note that in the 2nd generation the performance suddenly dropped. That is because the random initialisation we did in the 1st generation takes a few extra generations to stabilise. As we have a simple model of just one layer, it took only one generation, and after that things evened out between the workers.

Your Takeaways!

This was a very simple implementation, and quite a bad one; there are so many things that can be improved. So here is some homework you all can do,

  1. Make a similar agent in a language you love :-) I didn’t use any high-level ML library, so it is easy to remake in any language. For the interactions, there is an API reference in gym_tcp_api’s README.md.
    You just have to send JSON in that format and the server does the interaction for you. The server then returns the gym’s state, which you can use for thinking in the next step.
  2. Add a larger model! If you are comfortable with ML libraries, you can use a larger model and get amazing results; note that you will need a separate noise matrix for each layer’s weights.
  3. Try more environments! There are many in Gym, pick your poison here.
  4. Start reading research papers on arXiv. Each algorithm can be broken down into smaller parts, and you will see the true beauty of the maths used in computer science once your brain starts to notice patterns in the formulas.

That’s all for this article, so get your reinforcement learning agents rolling!
There is still a lot for us humans to figure out; ride the boat and we can reach true AI faster! :-)
