BYOBBO

Build your own black-box optimizer

David Sweet
May 13, 2019 · 19 min read
Design a controller to drive this car. (Image from OpenAI Gym.)

tl;dr: Train an RNN to be a super-efficient, custom black-box optimizer. Motivation, background, and tutorial (with code).

The Story

You’re an automotive engineer. You’ve designed a controller to drive a car up a mountain. The controller has parameters you need to set. To be sure the controller functions properly with the parameters you choose, you must evaluate them via experiment.

To select parameters you could:

1) Set them by hand using your knowledge of the controller and the domain, then evaluate them experimentally. If they work well enough, stop. Don’t optimize the parameters because experiments are expensive. This is fine, but you suspect that better parameters exist, which means a better, more valuable controller exists. To find it you’d need to experiment with other parameter settings, but since you’ve already tried the ones that made sense based on your domain knowledge, further parameters would just be guesses. You don’t want to waste money running experiments on guesses.

2) Vary the parameters systematically using algorithms created for this purpose and run experiments. This helps a bit, but doesn’t get you to “optimal” because your budget for experiments runs out. The optimizer kind of wanders around in parameter space anyway — sometimes testing parameters that you just know won’t be that much different from what’s already been tested or that you just know aren’t likely to be very good.

3) Build a simulation. Simulations are much cheaper to run than experiments. You optimize the parameters in simulation, then validate the simulation-optimal parameters via experiment. Sometimes they work. You’re not convinced that “simulation-optimal” is “real-world optimal”. But maybe you can improve the simulator. As you study your simulation results more closely and try to get them to match the real world you begin to think that building a high-fidelity simulator is actually harder than designing the controller.

Not very satisfying.

Here’s a new idea: Maybe you could build a custom optimizer that contains your domain knowledge and have it design a sequence of experiments to tune your controller. Your domain knowledge could help keep the sequence short, and the use of experiments will keep the evaluations accurate. Let’s try that!

But first, some preliminaries.

Simulation is Hard

The allure of simulation is that you might evaluate a controller’s quality quickly and cheaply. More quickly and cheaply than if you built the controller into a physical system and really ran it. The catch is that your evaluations reflect performance in simulation. This might be correlated to performance in real life, but there will be discrepancies.

In the mountain car example above, perhaps when building the simulation we inaccurately modeled friction inside the car’s mechanical parts, drag due to the air, the shape of the hill, delays in the response of the car’s controls to commands, delays in communication from the car’s sensors to the controller, the time it takes for the controller to compute the next command, changes in ground friction due to the car rolling over the ground, etc.

Oh, and then there are all of the things we didn’t even think of.

Facebook published a nice overview of an ML engineering process called the Facebook Field Guide to ML. In Episode 6 they use the term “online-offline gap” to describe the difference between offline evaluation and online evaluation. The speaker notes that the predictions made by the model in production can influence the data that is collected, whereas in the historical data this doesn’t happen.

In literature discussing the design of robotics controllers by simulation optimization, they sometimes use the term “reality gap” [M. Palmer, D. Miller] [J. Tobin, et al.] to describe this problem. Simulator mismatch could lead to the design of a controller that damages a robot by forcing joints to their limits or just knocking the robot over.

Recently, this problem got wide coverage at Tesla Autonomy Day [see video at 2:02:40–2:05:30]. They discuss the problem of capturing the full complexity of the real world in a simulation. That would require the simulator to be as complex as the real world. They use a simulator, but it doesn’t make experimentation unnecessary. Elon Musk sums it up with: “You don’t know what you don’t know”.

Adding optimization into the mix just makes it worse. The optimizer can’t tell the difference between the accurate and inaccurate parts of the simulation. If there are modeling errors in the simulator that your controller can use to increase its evaluation, optimization will seek out those errors. Call this “simulation optimization bias” [F. Muratore, et al.].

You might argue that making a higher-fidelity simulation would make those advantageous mismatches less of a problem, but remember: You’re using an optimizer. It’ll find them!

Coping With Simulation Optimization Bias

There are many ways to attack this problem of simulation mismatch. One you might be familiar with is model-free reinforcement learning. Two less familiar ones are domain randomization and meta-learning.

Model-Free Reinforcement Learning

This is the approach that was used by AlphaGo [Silver, et al.], the Atari-playing Deep Q Network [Mnih, et al.], OpenAI’s Dota2 competitor [OpenAI Five], and many, many other RL papers.

Model-free RL dispenses with the simulation entirely and optimizes controller parameters by direct interaction with the environment — by experiment. Note that all of the examples above actually applied model-free RL to environments that were fast and cheap — Go, Atari, Dota2 — just like simulations! Typically these approaches take too many experiments — hundreds, thousands, or millions depending on the problem — to be applied directly to physical systems.

Not always, though. This paper [M. Riedmiller, et al.], called “Learning to Drive a Real Car in 20 Minutes”, uses fitted Q iteration to tune a neural network. The controller took 5 state variables as input and produced one of 5 discrete actions as output. However, this network had orders of magnitude fewer parameters to tune than the ones used for Go, Atari, and Dota2.

In the RL literature, the traditional engineering process of building a simulator, optimizing a controller in that simulation, then applying the controller to the real system is given a machine learning treatment and called model-based reinforcement learning (MBRL).

This lecture slide [C. Finn / S. Levine, page 5] compares the sample efficiency (i.e., how many experiments do I need to run?) of several classes of RL algorithms. It states that MBRL methods are around 10 times more sample efficient than Q-Learning, and 100 times more sample efficient than policy gradient methods (another model-free approach, used, e.g., in Dota2).

Conclusion: Let’s not ditch our simulator.

Domain Randomization

OpenAI produced two great papers describing and demonstrating a technique called domain randomization [J. Tobin, et al.] [M. Andrychowicz, et al.]. It’s superficially simple: Optimize your controller on a (randomized) family of simulations, each of which has the real dynamics mis-modeled in some way. Hope that whatever is correct about the simulations is consistent across them and that the incorrectness is contained in the variations between simulations. For the controller to do well in all of the simulations it will be forced to capitalize on the correctly-modeled dynamics and to cope with the variation from simulation to simulation. The authors of [J. Tobin, et al.] were able to optimize an object detector entirely on simulated images and apply it as part of a controller in a real robot. In [M. Andrychowicz, et al.] they train a robotic hand to manipulate a cube in simulation and then transfer the controller directly to a real robot.

In [F. Sadeghi, S. Levine] the authors train a quadrotor in simulation on randomized “hallways” with varying floorplans, lighting, texture maps for the walls, and so on. They were able to successfully run the resultant controller on a real quadrotor.

In two of these papers [J. Tobin, et al.][F. Sadeghi, S. Levine] the authors suggest that a next step might be to further improve the controller by incorporating real-world data. Training a controller to work on all of a set of randomized environments results in “overly-conservative” [Z. Xie., et al.] performance on a specific environment (i.e., the real one).

Conclusion: Use domain randomization, but incorporate real-world data to keep the controller from being overly conservative.

Meta Reinforcement Learning

One way to fine-tune a controller to the real-world could be to use Meta Reinforcement Learning (MetaRL). MetaRL — and meta-learning algorithms in general — “learn to learn”. Rather than learning a task they might learn to tune the parameters of a controller (or a model in meta-supervised learning). A nice overview is available from BAIR.

A typical meta-learning problem setup consists of a set of training tasks and a set of test tasks (analogous to training data and test data in supervised learning). The meta-learner is trained to optimize learners on the training tasks (“in-task” training) and then tested for generalization by optimizing a learner on the test tasks (“out-of-task” testing). In our mountain car problem, the training tasks would be a set of domain-randomized simulators, and the test task would be the real world. The “meta-learner” would be an optimizer designed specifically to optimize the car’s controller.

Meta Black Box Optimization

In our mountain car scenario, we already have a controller. We just don’t know the best parameters to use. We’re not designing from scratch.

To tune our controller, we need a black-box optimizer. That’s an optimizer that takes in only objective function evaluations. (More to the point: It doesn’t know derivatives of the function being optimized.) In [Y. Chen, et al.] they train RNNs to do black-box optimization on a broad class of functions. The RNN performs as well as or better than Spearmint, a generic black-box optimizer. We’ll follow a similar route, except we’ll create an optimizer just for the Mountain Car and hope that this customization will lead to a huge improvement in sample efficiency.

Our controller was built by hand using domain knowledge. It’s not a neural network or other function approximator. In contrast, the meta-learner is a neural network.

Is this getting confusing? Let’s try to lay out a clear plan.

The Plan

Take our existing mountain-car simulator and parameterize it to produce a family of simulators.

To determine the quality, q(x), of a set of controller parameters, x, run the simulation once.

Define a Meta-BBO function approximator. It’ll be a recurrent neural network (RNN) that takes as input the last-tried x and its corresponding q(x). Its output will be the next x to try:

x[n+1] = RNN(x[n], q[n])

The output*, x[n+1], is the plan for the next experiment. The RNN’s job is to design a sequence of experiments to tune the controller.

[* Hey, there. I don’t know how to do subscripts on Medium, so I’m going to write x[n] for “x sub n” in the text, even though I’m using a real subscript in the displayed equations.]**

[** I don’t know how to do footnotes, either.]

But wait, there’s more.

The RNN has parameters, too — its weights, w. These weights will be optimized via simulation. To evaluate one set of RNN weights we’ll get x[0] from the RNN (our RNN will provide this initial x[0] without any RNN input). Now randomize and fix the simulator’s shape parameters. This is the domain randomization part.

We’ll generate q[0] by running one simulation of the controller using controller parameters x[0]. Then we’ll compute:

x[1] = RNN(x[0], q[0])

then find q[1] with the simulation, then x[2] with the RNN (fed x[1] and q[1]), then q[2] with the simulation. Each time we find a q[n], we’re using the same, fixed shape parameters.

We’ll run this entire three-simulation process 100 times with different randomly-generated shape parameter sets. The average of the 100 q[2]’s obtained will be the objective value for the RNN’s weights. Call that p(w). That’s what we need to maximize.
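To make the procedure concrete, here is a minimal sketch of evaluating p(w) for one candidate set of weights. The interfaces rnn.first_x, rnn.step, simulate, and sample_shape_params are hypothetical names standing in for the project’s actual code:

```python
import numpy as np

def meta_objective(rnn, simulate, sample_shape_params,
                   n_tasks=100, n_experiments=3):
    """Sketch of p(w): average final quality over randomized simulators."""
    final_qs = []
    for _ in range(n_tasks):
        shape = sample_shape_params()   # randomize the simulator, then hold it fixed
        rnn.reset()                     # clear the RNN's hidden state
        x = rnn.first_x()               # x[0], produced with no input
        q = None
        for _ in range(n_experiments):  # three simulations per randomized task
            q = simulate(shape, x)      # q[n]: quality of controller parameters x
            x = rnn.step(x, q)          # x[n+1] = RNN(x[n], q[n])
        final_qs.append(q)              # keep only the final q
    return float(np.mean(final_qs))     # p(w): the value we want to maximize
```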

Once that’s done and the RNN’s weights are known, we’ll install the controller in the real car. Then we can run the car with x[0] while measuring q[0] via real-world experiment, calculate x[1] with the RNN, etc. until we know q[2].

Then we can ask, did it work? Did x[2] yield a high value of q[2]? Did we find good parameters for the controller by running just three real-world experiments?

Parabola, By Hand

To get a feel for how such a plan could possibly work, let’s work through a simple example.

Imagine that we’re building a controller for some simple physical system. The controller has just one parameter, and the quality metric returned by the real world just happens to be a parabola:

q(x) = a·x² + b·x + c

(Imagine the parabola opening downward, so it has a maximum.)

We know it’s a parabola because we have some domain knowledge. We just don’t know which values of a, b, and c apply. We kind of don’t care, either, because what we really need is the x value that maximizes q. We plug that x into our controller and we’re done.

How could we tune this controller? We could try a generic black-box optimizer. It wouldn’t know that q(x) is a parabola, so it would use some generic exploration rules and feel its way around parameter space until we tell it to stop. Fig. 1 shows the Nelder-Mead algorithm (from scipy.optimize.minimize; see code) optimizing a parabola. It finds a good answer after about 15 “experiments”.

Fig. 1 Nelder-Mead optimizes over a parabola in ~15 experiments
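For reference, a run like the one behind Fig. 1 can be reproduced with a sketch like this. The parabola coefficients and starting point below are illustrative, not the ones used in the post, and since scipy minimizes we hand it -q:

```python
import numpy as np
from scipy.optimize import minimize

a, b, c = -1.0, 1.0, -0.15            # an illustrative downward-opening parabola
def q(x):
    return a * x**2 + b * x + c

evaluations = []                      # record every "experiment" Nelder-Mead requests
def neg_q(x):
    evaluations.append(float(x[0]))
    return -q(float(x[0]))

result = minimize(neg_q, x0=[2.0], method="Nelder-Mead",
                  options={"xatol": 1e-3, "fatol": 1e-3})
print(f"{len(evaluations)} experiments, best x = {result.x[0]:.3f}")
```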

We can do it in four.

Here’s how. Choose three random x values: x[0], x[1], x[2]. Run one experiment for each to measure q[0], q[1], q[2]. Define

X β = Q

where X is the 3×3 matrix whose i-th row is (x[i]², x[i], 1), Q = (q[0], q[1], q[2]), and β = (a, b, c).

Solve for β:

β = X⁻¹ Q

Now that we know a, b, and c, we can find the value of x that maximizes q:

x[max] = -b / (2a)

We needed three experiments to determine the three parameters (a, b, c), plus one experiment to test the final x[max], for a total of four experiments.
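The same by-hand recipe, written out as a few lines of numpy; the three x values and their measured q’s below are illustrative (taken from a parabola whose maximum sits at x = 0.5):

```python
import numpy as np

x = np.array([-1.0, 0.0, 2.0])       # three "experiments" (illustrative values)
q = np.array([-2.15, -0.15, -2.15])  # measured qualities (illustrative values)

X = np.column_stack([x**2, x, np.ones_like(x)])  # rows: (x[i]^2, x[i], 1)
a, b, c = np.linalg.solve(X, q)                  # solve X beta = Q for beta = (a, b, c)
x_max = -b / (2.0 * a)                           # vertex of the fitted parabola
print(f"a={a:.2f}, b={b:.2f}, c={c:.2f}, x_max={x_max:.2f}")  # x_max = 0.50
```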

We could solve it in only four experiments because we knew the function was a parabola. This kind of prior knowledge of a problem is termed inductive bias [M. Botvinick]. In short, inductive bias says that it’s easier to find an answer if you already mostly know the answer.

Analogy to MetaMountainCar

Let’s make the analogy to the parabola example above explicit.

The simulator computes q(x). The experiment also computes q(x).

The simulation parameters (ex., the shape-of-the-mountain parameters) play the role of a, b, and c. The real experiment has a single set of values for a, b, and c, but we don’t know them.

We’ll run many iterations of the simulation to tune the RNN. When we’re done, the RNN will know enough about q(x) to determine the optimal x by running only a few experiments — experiments that it (the RNN) prescribes.

You have embedded your domain knowledge in the simulator and the controller design. The RNN will learn that knowledge during its optimization and have it encoded in its weights. That’s its inductive bias.

The RNN weights encode a strategy for determining x[max] via a short sequence of experiments.

Are We Getting a Free Lunch?

No.

There’s a theorem called “No Free Lunch” [D. Wolpert, W. Macready] that says that all black-box optimizers are equivalently good when their performance is averaged over all possible problems.

But it also says that if you do better on one class of problems, you’ll do worse on others. We’re trying to do extremely well on one problem — the mountain car. There’s no reason to expect the optimizer (the RNN) we generate to do well on any other problem. No free lunch. No problem.

Parabolas and RNNs

To test out this RNN Meta-BBO plan, let’s first try it on the parabola. We know we could solve it by hand in four steps, so if this whole idea works the RNN should be able to solve it in four steps, too.

First we optimize the RNN on simulation — where “simulation” just computes the function

q(x) = a·x² + b·x + c

The RNN designs a sequence of four experiments for each of 100 random parabolas (where a “random parabola” means a randomly-chosen set of values for a, b, and c). The value that is plotted in Fig. 2 is p(w), the average of 100 values of q(x[3]) where q(x[3]) is the quality of the fourth and final experiment in each of the 100 runs.

Fig. 2 Optimizing the 1,138 weights of the RNN via CMA-ES.

Now watch a single sequence of “experiments” (Fig. 3) constructed by the RNN on a random parabola.

Fig. 3 A sequence of x values is generated by the optimized RNN. It maximizes q(x) in four experiments.

A sequence of four experiments. The value of x in each experiment is a function of all previous x’s and q’s. The red dot represents the final experiment, where q(x) = -0.00000095. We see that the RNN finds a very good estimate of x[max] in four tries.

For good measure, let’s try this with x being a vector. We’ll test cases where the dimension of x is n[dim]=3 and n[dim]=10.

When you read the code, note that I actually express q as

q(x) = q[0] - s·|x - x[0]|²

where q[0], s, and x[0] are the randomized parameters. It’s the same thing but easier to read, IMO, especially when x has more than one dimension. Also, the squared-distance term, |x - x[0]|², is usually called the “sphere function” when x has more than one dimension.

There are 2 + n[dim] free parameters in a sphere function (q[0], s, and the n[dim] values of x[0]), so it takes 2 + n[dim] + 1 experiments to find and test x[max].
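To make the randomized task concrete, here is a sketch of one random sphere “simulator”; the sampling ranges are illustrative, not the ones used in the post:

```python
import numpy as np

def make_sphere_task(n_dim, rng):
    """Return one randomized objective q(x) = q0 - s * |x - x0|^2."""
    q0 = rng.uniform(-1.0, 1.0)              # height of the maximum
    s = rng.uniform(0.5, 2.0)                # curvature (s > 0, so q has a maximum)
    x0 = rng.uniform(-1.0, 1.0, size=n_dim)  # location of the maximum
    def q(x):
        return q0 - s * float(np.sum((np.asarray(x) - x0) ** 2))
    return q

q = make_sphere_task(n_dim=3, rng=np.random.default_rng(0))
print(q(np.zeros(3)))
```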

Fig. 4a Optimizing the RNN (1,238 weights) for a sphere function with n[dim]=3.
Fig. 4b Optimizing the RNN (202,516 weights) for a sphere function with n[dim]=10.

Optimizing the RNN

The RNN’s weights were optimized with CMA-ES, a state-of-the-art black-box optimizer [N. Hansen].

My implementation is a modified version of code from pycma. To speed up the calculations I retained only the code I needed for this project. In the end I included:

  • sep-CMA-ES code [R. Ros, N. Hansen]. This limits the covariance matrix calculation to just the diagonal elements, dramatically reducing memory usage and computation time. This is needed to optimize high-dimensional problems (more than, say, 1000 dimensions).
  • two-point step-size adaptation [N. Hansen]. This worked on the high-dimensional problems I tested, whereas cumulative step-size adaptation (an earlier method) did not. Alas, I have no further insight.

A discussion of the application of CMA-ES to high-dimensional RL tasks is in [N. Müller, T. Glasmachers].

Domain randomization makes the RNN objective very noisy. To combat this, I added noise-control code based on uh-cma-es. It differs in that (i) it only increases sigma and does not change the number of evaluations when the data are too noisy to rank the test parameters, and (ii) it assesses the “rankability” of parameters using bootstrap resampling. Please see [N. Hansen, et al.] for the uh algorithm, [V. Heidrich-Meisner and C. Igel] for an application to RL, and my code for this implementation.
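For orientation, here is roughly what the weight optimization looks like when written against stock pycma. The post’s actual optimizer is the stripped-down, modified sep-CMA-ES with noise handling described above, not stock pycma; make_rnn and the other helpers are hypothetical, and meta_objective is the p(w) sketch from earlier:

```python
import cma            # pip install cma  (pycma)
import numpy as np

# p(w): the meta-objective sketched earlier; make_rnn, simulate, and
# sample_shape_params are hypothetical stand-ins for the project's code.
def p(w):
    return meta_objective(make_rnn(w), simulate, sample_shape_params)

n_weights = 794       # weight count of the MetaMountainCar RNN
es = cma.CMAEvolutionStrategy(np.zeros(n_weights), 0.5,
                              {"CMA_diagonal": True})  # diagonal covariance, like sep-CMA-ES
while not es.stop():
    candidates = es.ask()                              # a population of weight vectors
    es.tell(candidates, [-p(w) for w in candidates])   # pycma minimizes, so feed it -p(w)
    es.disp()

w_best = es.result.xbest                               # best weights found
```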

Meta Mountain Car

Drive up the hill to reach the flag.

The mountain car simulator code (MetaMountainCar) we’ll use is derived from OpenAI gym’s MountainCarContinuous-v0. The task is to drive a car up a hill using as little force as possible. The catch is that the engine is underpowered so that you can’t just drive straight up the hill. You have to rock back and forth a bit first.

I’ve modified it to represent a parameterized family of simulators:

  • shapeA, shapeB determine the shape of the hill. The original values were shapeA=.0025, shapeB=3; we’ll choose shapeA in the range [.0005, .0025] and shapeB in [2, 6]. (A sampling sketch follows this list.)
  • bFlip flips the sign of x before returning the state to the controller.
  • The car starts still at the bottom of the hill. The original gym version started in a random position.
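Here’s what that domain-randomization draw might look like; sample_shape_params is a hypothetical name, and the real sampling lives in the project code:

```python
import numpy as np

def sample_shape_params(rng=np.random.default_rng()):
    """Draw one randomized MetaMountainCar configuration."""
    return {
        "shapeA": rng.uniform(0.0005, 0.0025),  # hill-shape parameter A
        "shapeB": rng.uniform(2.0, 6.0),        # hill-shape parameter B
        "bFlip": bool(rng.integers(0, 2)),      # flip the sign of x seen by the controller
    }
```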

The controller has two parameters, x[0] and x[1]. (N.B.: x is the distance variable in the mountain car geometry; x[i] are the parameters to the controller. Sorry.)

Objective Function

Since x has only two dimensions, we can visualize q(x). Here it is for two choices of the shape parameters (Figs. 5a and 5b).

Fig. 5a MetaMountainCar(shapeA = 0.0025, shapeB = 3.0, bFlip=False)
q(x) is maximized at 98.43 at (x[0], x[1]) = (-0.10, -0.00).
Fig. 5b MetaMountainCar(shapeA = 0.0010, shapeB = 2.0, bFlip=True)
q(x) is maximized at 98.76 at (x[0], x[1]) = (0.10, 0.40).

Notice:

  • The optimal x[i] are different for each setting of the shape parameters.
  • bFlip=True generates a reflection in q(x). This moves the optimal parameters far from their values when bFlip=False.
  • q receives +100 for hitting the goal and loses a little for exerting a controlling force, so all q values are less than 100. Values less than 0 indicate that the goal was not hit. (A sketch of this bookkeeping follows the list.)
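The sketch below assumes a gym-style step() interface for MetaMountainCar; make_meta_mountain_car and controller are hypothetical names standing in for the project’s code:

```python
def measure_q(x, shape_params):
    """Run one episode and return q(x) for controller parameters x = (x[0], x[1])."""
    env = make_meta_mountain_car(**shape_params)
    obs, done, q = env.reset(), False, 0.0
    while not done:
        force = controller(obs, x)          # the hand-built, two-parameter controller
        obs, reward, done, _ = env.step([force])
        q += reward                         # +100 for the goal, minus a small force penalty
    return q
```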

MetaMountainCar Solution

The RNN’s job is to optimize q(x) in a very small number of iterations. Let’s fix that number at 3 to make it challenging.

I used an RNN with four hidden layers of 10 nodes each. Each node had hidden state, and a layer was updated with:

h[n,l] = f( h[n,l-1], h[n-1,l] )

where n indexes time and l indexes layers. For layer l=0, h[n,l-1] is replaced with the network’s input vector: the last-tried controller parameters and their measured quality, (x[n], q[n]).

The output of the network, read off the last hidden layer, is the next set of controller parameters to try, x[n+1].

The RNN code is in rnnLayer.hpp and rnn.hpp.
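As a concrete illustration of that layer update, here is one possible form in numpy; the real implementation is the C++ in rnnLayer.hpp and rnn.hpp, and the tanh activation and weight layout below are assumptions, not a transcription of that code:

```python
import numpy as np

class RNNLayer:
    def __init__(self, n_in, n_nodes, rng):
        self.W = rng.normal(scale=0.1, size=(n_nodes, n_in))     # weights on h[n, l-1]
        self.U = rng.normal(scale=0.1, size=(n_nodes, n_nodes))  # weights on h[n-1, l]
        self.b = np.zeros(n_nodes)
        self.h = np.zeros(n_nodes)                               # hidden state

    def step(self, below):
        # h[n, l] = f( h[n, l-1], h[n-1, l] ), here f = tanh of an affine map
        self.h = np.tanh(self.W @ below + self.U @ self.h + self.b)
        return self.h
```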

The RNN’s weights were optimized via CMA-ES (Fig. 6), as described above:

Fig. 6 Optimizing the RNN (794 weights) for MetaMountainCar.
Fig. 7 The final RNN running three experiments on each of 100 randomly-generated simulators

Fig. 7 shows the final RNN running its sequence of three experiments on 100 randomly-generated simulators. The first two experiments show wide variation in performance as the RNN samples the parameter space. The final experiment, however, yields a tight distribution around q(x)=94.9.

Out-of-Task Testing

I hate to break it to you so late in this blog post, but I didn’t build the controller into a real car and drive the car up a mountain.

We still need to evaluate the RNN’s generalization ability somehow, however, so I wrote a second simulator called RealMountainCar (ahem… “real”).

First, note that during optimization the RNN was evaluated on simulations with randomly-sampled shape parameters. These parameters change for every evaluation of q(x). The plot you see in Fig. 6 shows the performance on 100 different random simulators for each iteration. That plot is all in-task.

RealMountainCar is based on MetaMountainCar but with these changes:

  • set shapeA = .0015, shapeB = 4.5, bFlip = False
  • modified the range of allowed forces and velocities
  • moved the goal
  • added drag (a force proportional to -v²)

The sequence of “experiments” designed by the RNN and run on RealMountainCar produced:

Experiment 1: q = -89.682
Experiment 2: q = 91.719
Experiment 3: q = 92.222

It worked! The car got up the hill and hit the goal! Despite all the changes, apparently the “real” mountain car was similar enough to our simulator so that the RNN could optimize the controller on it.

Optimizing Directly

Ok. Reality check. The RNN optimized RealMountainCar in only three experiments. But did it do better than a generic black box optimizer would have?

I ran CMA-ES directly on the controller parameters and evaluated the controller on RealMountainCar for three iterations. That process was repeated 100 times. The mean and standard deviation of q were:

Experiment 1: q = 44.9 +/- 51.2
Experiment 2: q = 47.3 +/- 51.9
Experiment 3: q = 39.6 +/- 50.7

CMA-ES uses a default population size of 6 for this problem. That means CMA-ES generates 6 parameter sets at once, evaluates them all, then repeats. The results above are the first three of those initial 6 parameter sets. They’re random guesses. That’s because random guessing is the best you can do on your first try — when you don’t yet have any information about the problem.

The fraction of the mountain car parameter space that is good (where “good” is, say, q > 90) is large enough that these guesses sometimes get a good score. On average, though, they don’t produce good results. Since they’re random guesses, that shouldn’t be surprising, but it was still worth checking: now we know for sure that our RNN does better than, at least, random guessing.

Additionally, note that one of the results of optimizing the RNN (part of its inductive bias) is that the RNN’s first guess isn’t random. It’s tailored to the problem.

What if we let CMA-ES continue optimizing — beyond just initial guesses? How many experiments would it take to get the car up the hill? How much of a speed up is MetaBBO producing?

Fig. 8 CMA-ES directly optimizing the controller on RealMountainCar. Mean and stddev of 100 runs is shown.

Fig. 8 shows CMA-ES applied directly to the controller parameters, x, on RealMountainCar. It takes almost 800 experiments to match the performance that our custom RNN achieves in only 3. That’s over two orders of magnitude speedup (266x, to be precise).

Discussion

We imagined an industrial controller design problem along with its engineering tools:

  • a simulator built with domain knowledge
  • a parameterized controller built with domain knowledge
  • a way to experimentally evaluate the controller

We asked whether we could produce a better controller by tuning in experiment than we could by tuning in simulation.

Experiments are costly, and generic black-box optimizers need too many experiments to produce a good result. Our solution was to build a new optimizer customized for our controller. The new optimizer (the MetaBBO / RNN) was tuned via simulation — which is cheap — then employed in the “real world”.

Could you apply a methodology like this to an existing, industrial problem? Maybe, but there are some open questions:

  • Could this process be scaled to controllers with more dimensions?
  • What about more complex quality functions (q(x))?
  • What about noise in q(x)?
  • How similar does the real world have to be to your simulator to make this work? In what ways? How do you define or quantify the relevant similarities?
  • Can your simulation run quickly enough to optimize the RNN in a reasonable amount of time?

The first two questions might simply require larger RNNs and more computation. The third might be addressed by adding noise to the simulated q(x) when optimizing the RNN.

The last two questions are related in this way: If you figure out in which ways your simulator needs to match reality, perhaps you could build a simple simulator that captures only those similarities. Then, maybe domain randomization could compensate for the use of a very low-fidelity (but really fast) simulation. OpenAI’s dexterous robotic hand controller suggests that this is possible.

Thanks for reading. The source code for this project is available on github.
