Differentiable Programming and Neural ODEs for Accelerating Model Based Reinforcement Learning and Optimal Control

Strategies learned in under a minute: one-go swing-up (left), resonant incremental swing-up with a force constraint (right)

Abstract

Motivation: (much) faster reinforcement learning

Toy Problem

Source: https://www.ashwinnarayan.com/post/cartpole-dynamics/
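For reference, these are the cart-pole equations of motion implemented in the code below (transcribed directly from that implementation; x is the cart position, theta the pole angle measured from the upright position, and F the lateral control force):

\ddot{x} = \frac{F + m \sin\theta \, (L \dot{\theta}^2 - g \cos\theta)}{M + m \sin^2\theta},
\qquad
\ddot{\theta} = \frac{-F \cos\theta - m L \dot{\theta}^2 \sin\theta \cos\theta + (M + m) g \sin\theta}{L \, (M + m \sin^2\theta)}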

Magic ingredient: differentiable programming
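Differentiable programming means gradients can be taken through (almost) arbitrary code, including a simulation. As a minimal illustration (not part of the original listing, written against Zygote's standard gradient API), reverse-mode AD here differentiates a loss straight through an explicit-Euler simulation loop, with no hand-derived adjoint:

using Zygote

# toy rollout of dx/dt = -k * x with explicit Euler steps, returning a scalar loss
function rollout_loss(k)
    x = 1.0
    for _ in 1:10
        x += -k * x * 0.1
    end
    return x^2
end

# d(loss)/dk, computed by differentiating through the loop itself
dloss_dk = Zygote.gradient(rollout_loss, 0.5)[1]

The same mechanism is what lets the training loop below backpropagate through the ODE solve and into the controller network's weights.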

Code

# packages used throughout (the FastChain / sciml_train API comes from DiffEqFlux)
using DifferentialEquations, DiffEqFlux, Flux, Plots, JSON

# physical params
m = 1 # pole mass, kg
M = 2 # cart mass, kg
L = 1 # pole length, m
g = 9.8 # gravitational acceleration, m/s^2

# map angle to [-pi, pi)
modpi(theta) = mod2pi(theta + pi) - pi
#=
system dynamics derivative
du: du/dt, state vector derivative updated in place
u: state vector (x, dx, theta, dtheta)
p: parameter function, here lateral force on the cart as a function of time
t: time
=#
function cartpole(du, u, p, t)
    # position (cart), velocity, pole angle, angular velocity
    x, dx, theta, dtheta = u
    force = p(t)
    du[1] = dx
    du[2] =
        (force + m * sin(theta) * (L * dtheta^2 - g * cos(theta))) /
        (M + m * sin(theta)^2)
    du[3] = dtheta
    du[4] =
        (-force * cos(theta) - m * L * dtheta^2 * sin(theta) * cos(theta) +
         (M + m) * g * sin(theta)) / (L * (M + m * sin(theta)^2))
end
# neural network controller, here a simple MLP
# inputs: cos(theta), sin(theta), theta_dot
# output: cart force
controller = FastChain((x, p) -> x, FastDense(3, 8, tanh), FastDense(8, 1))
# initial neural network weights
pinit = initial_params(controller)
#=
system dynamics derivative with the controller included
=#
function cartpole_controlled(du, u, p, t)
    # controller force response
    force = controller([cos(u[3]), sin(u[3]), u[4]], p)[1]
    du[5] = force
    # plug force into system dynamics
    cartpole(du, u[1:4], t -> force, t)
end
# initial condition: x, dx, theta, dtheta, impulse (theta = pi is the pole hanging down)
u0 = [0; 0; pi; 0; 0]
tspan = (0.0, 1.0)
N = 50
tsteps = range(tspan[1], tspan[2], length = N)
dt = (tspan[2] - tspan[1]) / N
# set up ODE problem
prob = ODEProblem(cartpole_controlled, u0, tspan, pinit)
# wrangles output from ODE solver
function format(pred)
    x = pred[1, :]
    dx = pred[2, :]
    theta = modpi.(pred[3, :])
    dtheta = pred[4, :]
    # take derivative of impulse to get force
    impulse = pred[5, :]
    tmp = (impulse .- circshift(impulse, 1)) / dt
    force = [tmp[2], tmp[2:end]...]
    return x, dx, theta, dtheta, force
end

# solves ODE
function predict_neuralode(p)
    tmp_prob = remake(prob, p = p)
    solve(tmp_prob, Tsit5(), saveat = tsteps)
end

# loss to minimize as a function of neural network parameters p
function loss_neuralode(p)
    pred = predict_neuralode(p)
    x, dx, theta, dtheta, force = format(pred)
    loss = sum(theta .^ 2) / N + 4theta[end]^2 + dx[end]^2
    return loss, pred
end
i = 0 # training epoch counter
data = 0 # time series of state vector and control signal

# callback function run after each training epoch
callback = function (p, l, pred; doplot = true)
    global i += 1
    global data = format(pred)
    x, dx, theta, dtheta, force = data
    # output every few epochs
    if i % 50 == 0
        println(l)
        display(plot(tsteps, theta))
        display(plot(tsteps, x))
        display(plot(tsteps, force))
    end
    return false
end

result = DiffEqFlux.sciml_train(
    loss_neuralode,
    pinit,
    ADAM(0.05),
    cb = callback,
    maxiters = 1000,
)
p = result.minimizer

# save model and data
open(io -> write(io, json(p)), "model.json", "w")
open(io -> write(io, json(data)), "data.json", "w")
gr()
x, dx, theta, dtheta, force = data
anim = Animation()
plt = plot(
    tsteps,
    [modpi.(theta .+ 0.01), x, force],
    title = ["Angle" "Position" "Force"],
    layout = (3, 1),
)
display(plt)
savefig(plt, "cartpole_data.png")
for (x, theta) in zip(x, theta)
    cart = [x - 1 x + 1; 0 0]
    pole = [x x + 10 * sin(theta); 0 10 * cos(theta)]
    plt = plot(
        cart[1, :],
        cart[2, :],
        xlim = (-10, 10),
        ylim = (-10, 10),
        title = "Cartpole",
        linewidth = 3,
    )
    plot!(plt, pole[1, :], pole[2, :], linewidth = 6)
    frame(anim)
end
gif(anim, "cartpole_animation.gif", fps = 10)
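Not part of the original listing, but the saved weights can be read back in the same way, e.g. to deploy or inspect the trained controller (a sketch using JSON.parsefile; force_at is a hypothetical helper name):

# reload the controller weights saved above
p_loaded = Float32.(JSON.parsefile("model.json"))
force_at(theta, dtheta) = controller([cos(theta), sin(theta), dtheta], p_loaded)[1]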

Results

For the force-constrained, resonant swing-up (right animation above), the time horizon is extended and the loss gains penalties on cart excursion and on the peak force:

tspan = (0.0, 10.0)
...
loss = sum(theta .^ 2) / N + 4theta[end]^2 + dx[end]^2 + 0.1sum(x .^ 2) / N + 0.001maximum(force .^ 2)

Caveats

Discussion

  • Credit goes to automatic differentiation, namely Julia's Zygote. It saves us from having to hand-code the gradient of (almost) arbitrary code, just as TensorFlow does for the restricted case of neural network chains.
  • A neural ODE is a continuum model that requires no time discretization or interpolation of the dynamics: the ODE solver is simply told to sample at whatever time points are needed to compute the loss. The method is therefore well suited to continuous-time physical systems, but it also works on discrete-time systems; AD would differentiate a computer game loop just the same (and even faster).
  • Finally, we can make the system simulation data driven by replacing the physics (or part of it) with a system-dynamics neural network. We train it on observation data, such as sensor time series from a vehicle or industrial process, and then plug in the controller network to compute the optimal control; a minimal sketch follows this list.
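A sketch of that last point, reusing the FastChain / sciml_train API from above. The names here (dynamics_nn, learned_dynamics, data_obs, and so on) are hypothetical, and data_obs is an assumed 4 x N matrix of observed states sampled at tsteps:

# neural network that stands in for the analytic physics
dynamics_nn = FastChain(FastDense(4, 16, tanh), FastDense(16, 4))
p_dyn = initial_params(dynamics_nn)

function learned_dynamics(du, u, p, t)
    du .= dynamics_nn(u, p) # the network predicts the state derivative
end

prob_dyn = ODEProblem(learned_dynamics, u0[1:4], tspan, p_dyn)

# fit the learned dynamics to the observed trajectory
function loss_dyn(p)
    pred = Array(solve(remake(prob_dyn, p = p), Tsit5(), saveat = tsteps))
    return sum(abs2, pred .- data_obs), pred
end

res_dyn = DiffEqFlux.sciml_train(loss_dyn, p_dyn, ADAM(0.05), maxiters = 300)

In practice the control force would also be fed to the dynamics network as an extra input; once trained, the learned model can stand in for cartpole inside cartpole_controlled and the controller optimized exactly as before.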

Acknowledgment

References & Tutorials

Note

Paul Shen

Engineer Entrepreneur | Stanford MS Electrical Engineering | Former MD Candidate IUSM