# Differentiable Programming and Neural ODEs for Accelerating Model-Based Reinforcement Learning and Optimal Control

Strategies learnt in under a minute: one-go swing up (left), resonant incremental swing up with force constraint (right)

## Code

```julia
using DiffEqFlux, OrdinaryDiffEq, Flux, Plots, JSON

# physical params
m = 1   # pole mass kg
M = 2   # cart mass kg
L = 1   # pole length m
g = 9.8 # acceleration constant m/s^2

# map angle to [-pi, pi)
modpi(theta) = mod2pi(theta + pi) - pi

#=
system dynamics derivative
du: du/dt, state vector derivative updated in place
u: state vector (x, dx, theta, dtheta)
p: parameter function, here lateral force exerted by cart as a fn of time
t: time
=#
function cartpole(du, u, p, t)
  # position (cart), velocity, pole angle, angular velocity
  x, dx, theta, dtheta = u
  force = p(t)
  du[1] = dx
  du[2] =
    (force + m * sin(theta) * (L * dtheta^2 - g * cos(theta))) /
    (M + m * sin(theta)^2)
  du[3] = dtheta
  du[4] =
    (-force * cos(theta) - m * L * dtheta^2 * sin(theta) * cos(theta) +
     (M + m) * g * sin(theta)) / (L * (M + m * sin(theta)^2))
end
```
```julia
# neural network controller, here a simple MLP
# inputs: cos(theta), sin(theta), theta_dot
# output: cart force
controller = FastChain((x, p) -> x, FastDense(3, 8, tanh), FastDense(8, 1))
# initial neural network weights
pinit = initial_params(controller)
```
```julia
#=
system dynamics derivative with the controller included
=#
function cartpole_controlled(du, u, p, t)
  # controller force response
  force = controller([cos(u[3]), sin(u[3]), u[4]], p)[1]
  # 5th state: impulse, the time integral of the force
  du[5] = force
  # plug force into system dynamics
  cartpole(du, u[1:4], t -> force, t)
end

# initial condition: cart at rest, pole hanging down, zero impulse
u0 = [0; 0; pi; 0; 0]
tspan = (0.0, 1.0)
N = 50
tsteps = range(tspan[1], tspan[2], length = N)
dt = (tspan[2] - tspan[1]) / N
# push!(u0, 0)

# set up ODE problem
prob = ODEProblem(cartpole_controlled, u0, tspan, pinit)

# wrangles output from ODE solver
function format(pred)
  x = pred[1, :]
  dx = pred[2, :]
  theta = modpi.(pred[3, :])
  dtheta = pred[4, :]
  # take derivative of impulse to get force
  impulse = pred[5, :]
  tmp = (impulse .- circshift(impulse, 1)) / dt
  force = [tmp[2], tmp[2:end]...] # first entry wraps around, so reuse the second
  return x, dx, theta, dtheta, force
end

# solves ODE
function predict_neuralode(p)
  tmp_prob = remake(prob, p = p)
  solve(tmp_prob, Tsit5(), saveat = tsteps)
end
```
```julia
# loss to minimize as a function of neural network parameters p
function loss_neuralode(p)
  pred = predict_neuralode(p)
  x, dx, theta, dtheta, force = format(pred)
  loss = sum(theta .^ 2) / N + 4theta[end]^2 + dx[end]^2
  return loss, pred
end
```
```julia
i = 0    # training epoch counter
data = 0 # time series of state vector and control signal

# callback function after each training epoch
callback = function (p, l, pred; doplot = true)
  global i += 1
  global data = format(pred)
  x, dx, theta, dtheta, force = data
  # output every few epochs
  if i % 50 == 0
    println(l)
    display(plot(tsteps, theta))
    display(plot(tsteps, x))
    display(plot(tsteps, force))
  end
  return false
end

result = DiffEqFlux.sciml_train(
  loss_neuralode,
  pinit,
  ADAM(0.05),
  cb = callback,
  maxiters = 1000,
)
p = result.minimizer

# save model and data
open(io -> write(io, json(p)), "model.json", "w")
open(io -> write(io, json(data)), "data.json", "w")
```
```julia
gr()
x, dx, theta, dtheta, force = data
anim = Animation()
plt = plot(
  tsteps,
  [modpi.(theta .+ 0.01), x, force],
  title = ["Angle" "Position" "Force"],
  layout = (3, 1),
)
display(plt)
savefig(plt, "cartpole_data.png")

for (x, theta) in zip(x, theta)
  cart = [x - 1 x + 1; 0 0]
  pole = [x x + 10 * sin(theta); 0 10 * cos(theta)]
  plt = plot(
    cart[1, :],
    cart[2, :],
    xlim = (-10, 10),
    ylim = (-10, 10),
    title = "Cartpole",
    linewidth = 3,
  )
  plot!(plt, pole[1, :], pole[2, :], linewidth = 6)
  frame(anim)
end
gif(anim, "cartpole_animation.gif", fps = 10)
```

## Results

For the force-constrained incremental swing up, the horizon is lengthened and the loss gains cart position and force penalties:

```julia
tspan = (0.0, 10.0)
...
loss = sum(theta .^ 2) / N + 4theta[end]^2 + dx[end]^2 + 0.1sum(x .^ 2) / N + 0.001maximum(force .^ 2)
```

## Discussion

• Credit goes to automatic differentiation, namely Julia's Zygote. It saves us from having to hand-code the gradient of (almost) arbitrary code, just as TensorFlow did for the restricted case of neural network chains.
• A neural ODE is a continuum model: it requires no time discretization or interpolation of the dynamics, and the ODE solver samples at whatever time points are needed to compute the loss. The method is not only well suited to continuous-time physical systems; it also works on discrete-time systems, since AD differentiates through a computer game loop just the same (and even faster).
• Finally, the system simulation can be made data driven: replace the physics (or part of it) with a system-dynamics neural network, train it on observation data such as sensor time series from a vehicle or industrial process, then plug the controller network back in to compute the optimal control.
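The claim about discrete-time systems can be sketched directly: Zygote differentiates a plain hand-written Euler loop with no ODE solver involved. The toy damped oscillator, step count, and gains below are illustrative assumptions, not part of the cartpole setup.

```julia
using Zygote

# toy discrete-time system: damped oscillator stepped with explicit Euler;
# rollout_loss maps a stiffness k to the final squared state norm
function rollout_loss(k)
  x, v = 1.0, 0.0
  dt = 0.01
  for _ in 1:100
    a = -k * x - 0.1 * v          # spring force plus damping (assumed gains)
    x, v = x + dt * v, v + dt * a # one Euler step
  end
  return x^2 + v^2
end

# Zygote differentiates straight through the loop, just as through an ODE solve
grad = gradient(rollout_loss, 2.0)[1]
```

The same gradient could then drive any optimizer, exactly as `sciml_train` does for the continuous-time problem above.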
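A minimal sketch of that data-driven variant: a network with the same in-place signature as `cartpole` stands in for the physics. The network size and input layout (four states plus the applied force) are assumptions for illustration; `p_dyn` would first be fitted to observed state/derivative pairs before the controller is trained against the learned model.

```julia
using DiffEqFlux

# neural network standing in for the hand-written physics:
# inputs are the 4 states plus the applied force, output is d/dt of the 4 states
dynamics_nn = FastChain(FastDense(5, 32, tanh), FastDense(32, 4))
p_dyn = initial_params(dynamics_nn)

# same in-place signature as cartpole, so it drops into cartpole_controlled
function cartpole_learned(du, u, p, t)
  force = p(t)
  du[1:4] .= dynamics_nn([u; force], p_dyn)
end
```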
