Whenever I want to learn a new machine learning method, I write a single file example with a Minimum Viable Experiment (MVE). Last time I did an MVE to learn RL and Q-Learning. This time I want to learn Evolution Strategies.
Before I show you the code I came up with, let us talk a bit about a fun recurring equation. Say you want to create a model of a variable y (ex. price of a stock, say if an image shows a cat or a dog, the steering of a self driving car, etc). You go ahead and use a model f(x) to approximate y. Where x are inputs (ex. previous stock values, a jpeg image, a video of the road, etc). That model f also have parameters w to be tuned. Just like in a neural network that at this point everybody heard about.
A simple way to measure how well your model performs is to calculate the average value of L = (y-f(x))^2. Then, you can update the parameters w using information about the gradient of L: grad(L) = -2(y-f(x))*grad(f(x)).
Note that there are two main terms multiplied in the last equation. A quality measure -2(y-f(x)) and a model sensitivity grad(f(x)). If we forget precise formulations for a bit and just think about matching possible quality measures and model sensitivity, we can see that this same equation is used elsewhere, especially in RL. For example, the REINFORCE algorithm also updates model parameters according to R*grad(log(p(x)) where R is a reward measure and log(p(x)) is a log-probability of certain actions of a policy function.
Taking this to an extreme, where we assume that a model sensitivity is solely defined by small random jitters n we add to our model parameters w, we get grad(L) = R * n. Where again R are rewards defined by completing a given task. Obviously that approximation is not really precise and should have a lot of errors when estimating good values to update w. Good thing about it is we can usually compensate for lack of precision with exhaustive trial, error and lots of data.
And that is it. We can now try to solve the same catch game from the previous post, but this time with a much simpler formulation. The trade off is more compute. Now you can check out the code. It is pretty short and should be simple to understand, but let me know if something isn’t clear.