Model-Based Control using Neural Network: A Case Study
Control simulation of a mechanical system using a neural network-based model predictive control algorithm
Automation and control engineering play a crucial role in industry. Applications range from robotics, manufacturing, and process systems to medicine, finance, energy management, and even epidemiology. In recent decades, many intelligent control methods have been developed. In this article, I will show you how to combine artificial neural networks with model predictive control. We will study the case of controlling a nonlinear Multi-Input Multi-Output (MIMO) system through real-time (online) learning.
Motivation
Modern control systems often have multiple inputs and multiple outputs with nonlinear behavior. As a mechanical engineering undergrad, I learned that many physical systems can be modeled as mass-spring-damper systems. Normally, such systems are controlled using a PID controller.
This work is inspired by Nagabandi et al., 2017, who used a neural network in an MPC loop to perform MuJoCo locomotion tasks in robotics [2]. There is a popular toolkit for comparing and developing reinforcement learning algorithms called OpenAI Gym, where you can experiment with standardized mechanical control tasks such as the inverted pendulum. However, I could not find any ready-made framework for mass-spring-damper control using neural networks, so I created my own simulation in Python.
UPDATE (13/01/2021): The code for this is now available on my GitHub.
Problem
Suppose we have three identical point masses m = 0.5 kg, with initial positions x1, x2, and x3. Each damper has a damping constant d = 0.25 N·s/m. The springs are nonlinear, with a spring force described by:
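A plausible form, assuming a cubic hardening spring consistent with the units of k and kp given next:

$$F_{\text{spring}} = k\,\Delta x + k_{p}\,\Delta x^{3}$$

where Δx denotes the spring deflection.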
with k = 217 N/m and kp = 63.5 N/m³. The forces u1 and u3 are the "control inputs", sometimes also referred to as "actuation forces" or "actions"; I will call them actions from now on. The disturbance dist disrupts the motion of the system. Say it is a random value between -100 N and 100 N.
Admittedly, this problem is quite unrealistic. The system is strongly nonlinear and the controller has to act at an extremely high frequency. This is the main challenge.
The main goal here is to steer the masses m1 and m3 to their desired reference positions x1_ref and x3_ref. An artificial neural network will be used to predict the future positions x1 and x3 based on the forces u1, dist, and u3.
The challenge:
- We only get measurements of x1 and x3 at every timestep (Δt = 0.001 s).
- The system starts from arbitrary positions x1, x2, x3 with random initial velocities v1, v2, v3.
- The system dynamics are unknown AND we are not able to calculate the equations of motion using Newton’s law.
Approach
Sometimes in the real world, it's impossible to calculate the dynamics of a system without any approximations. We will try to use machine learning techniques to solve this problem.
Control of complex systems involves two parts: system identification and controller design. In system identification, we create the dynamics model.
We know that neural networks can be used to recognize patterns. Thus, we can train the network to predict future behavior based on past information. This can be done using supervised learning regression algorithms. We classify this as a time-series forecasting problem in machine learning.
Additional challenge: the mechanical system might experience a sudden change in dynamics, for example if the masses are reduced or replaced with different weights. Imagine you are a taxi driver. Your passengers change over time. Say you get a new load of 200 kg. This will slightly change the dynamic behavior of the vehicle. Perhaps it is not the best of examples, but you get the idea.
Knowing that dynamics can change, the designed neural network must be able to perform adaptive learning. Thus, I will design a framework to do both dynamics learning and system control online. This can be realized using reinforcement learning. Now, let’s dive deeper into the concepts!
Model
The term model here can be understood as the dynamics function: a set of "rules" that determines how the states x (the positions of the masses) change with respect to time t. Mathematically, it predicts the next states x based on the current states and the actions u:
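A minimal sketch of this relation in the usual discrete-time form (the notation here is mine):

$$x_{t+1} = f(x_t,\, u_t), \qquad (x_t,\, u_t) \in Z$$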
where Z defines the constraints (limitations) of the states and actions.
Why use a model?
Because it increases explainability.
Many real-world physical systems also have model constraints concerning safety and system compatibility. Furthermore, model-free algorithms are not as sample-efficient as model-based algorithms.
Neural Network Model
The neural network’s goal here is to be the model: learn the dynamics function of our mechanical system. It’s easy…
We give the neural network real-time state measurements. Then train it with some algorithm. All this without actually knowing Newton’s law!
I won’t get into the details of how and why it works. There are enough articles and videos on the internet that explain neural network training. In this section, I will just present my neural network design.
At first, I considered using a recurrent neural network (RNN) such as the NARX. However, I found that online training of an RNN would take too long inside the reinforcement learning loop. So I used the simplest architecture here: the Multilayer Perceptron (MLP).
More layers would require more computational power, which is undesirable in real-time control. I modified the network to include past output measurements of the mechanical system, giving it so-called time-delayed inputs.
Additionally, the neural network will predict the change in position Δx instead of the actual position of the next states x. This is sometimes referred to as a residual connection, or simply as a data manipulation technique. It exploits the knowledge that in our dynamical system the state changes Δx are very small, since one timestep is only 0.001 seconds.
Neural Network Design
Remember: The only available real-time measurements are x1 and x3, every 0.001 seconds. Assume there are no measurements available for x2, or any state velocities to train the network.
For the experiment, I set the time delay to 2 timesteps. This provides 'new' information for learning: the past states (xt-1 and xt-2).
Inputs: The neural network inputs are the actions u (u1 and u3), the current states xt-0 (x1 and x3), as well as the past changes in states diff1 and diff2 (Δxt-1 and Δxt-2).
Outputs: The neural network outputs are the predicted changes of the states (Δx1 and Δx3). All the neural network layers are fully connected.
Since we have 2 measured states (x1 and x3), 2 actions (u1 and u3), and a disturbance input (dist), I will need in total (1x2) + (3x2) + 1 = 9 input neurons and 2 output neurons for my design.
I set 40 hidden neurons within the MLP architecture. The leaky rectified linear unit (LReLU) activation function was chosen with a= 0.5.
I experimented with a lot of configurations for our case. I achieved the best training performance using Adam optimizer with a learning rate of 0.001 and batch size 16. The data (input and output pairs) are preprocessed before training to have a mean 0 and standard deviation 1. All the weights and biases of the neural network were initialized with Glorot uniform initialization.
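As an illustration, here is a minimal sketch of such a network in Keras. The choice of Keras is my assumption (any deep learning library works); the hyperparameters follow the description above:

```python
import tensorflow as tf

# 9 inputs: u1, u3, the disturbance dist, the current states x1 and x3,
# and the past state changes (diff1, diff2) for both measured states.
# 2 outputs: the predicted changes Δx1 and Δx3 for the next timestep.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(9,)),
    tf.keras.layers.Dense(40, kernel_initializer="glorot_uniform"),
    tf.keras.layers.LeakyReLU(alpha=0.5),
    tf.keras.layers.Dense(2, kernel_initializer="glorot_uniform"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="mse")

# Online training is then done on standardized mini-batches of size 16, e.g.:
# model.train_on_batch(x_batch, dx_batch)
```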
These settings are purely experimental, but they should give you the idea.
The accuracy of the model plays a crucial role in controlling a dynamic system, so the model has to be trained well before control is applied. Speaking of which, let's now discuss a popular control technique: model predictive control.
Model Predictive Control (MPC)
MPC is one of the most widely used methods for controlling multivariable systems. As the name suggests, control is applied based on the predictions made by the model. The optimization is usually done using a receding horizon strategy.
This strategy provides a simple method for determining future actions by optimizing over the predicted future trajectories. See the figure below.
We can express this strategy as the control “objective function”:
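A standard receding-horizon form consistent with the description below, where ŷ denotes the predicted outputs, w the reference values, and λ a weighting factor (this notation is my assumption):

$$J(t) = \sum_{j=1}^{N} \big(\hat{y}(t+j) - w(t+j)\big)^{2} + \lambda \sum_{j=1}^{N_u} \big(\Delta u(t+j-1)\big)^{2}$$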
Now, the above equation might look complicated, but it is actually very simple: we want to pick the actions u(t) that produce the smallest J(t).
To apply this strategy, we need to know three things:
- desired reference values of y (trajectory goal, set by the user)
- predicted values y over some time horizon N (acquired from the model)
- actions u over some time horizon Nu (calculated / random guess)
J(t) here is referred to as the "cost". A larger cost means worse control performance.
The first quadratic term is the total predicted error. It is calculated by summing the squared differences between the predicted future trajectories and the desired trajectories.
The second quadratic term represents the change of actions in each timestep over the control horizon Nu. It penalizes large changes in actions. Smooth (less jerky) actions are desirable because, for example, real motors cannot change their output by much over a small period.
Evaluation Metrics
Before going deeper into the control algorithm, we must design the evaluation metrics. In this experiment, I define "control performance" as two things: 1) trajectory deviation, and 2) action smoothness and total impulse. The simulation is done in discrete time with Δt = 0.001 s.
1. The trajectory deviation,
with yt being the predicted values and wt denoting the reference values in the future trajectories. Meanwhile, s is the total number of simulation timesteps.
2. Total impulse and action smoothness
We want both of these values and the RMSE to be as small as possible. However, there is often a trade-off between them.
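For illustration, here is a minimal sketch of these metrics in Python. I assume the trajectory deviation is an RMSE against the reference, the total impulse is the summed |u|·Δt, and the smoothness is the summed absolute change of u; the exact formulas in the original figures may differ:

```python
import numpy as np

def control_metrics(y, w, u, dt=0.001):
    """y: measured trajectories, w: reference trajectories, u: applied actions."""
    rmse = np.sqrt(np.mean((y - w) ** 2))             # trajectory deviation
    total_impulse = np.sum(np.abs(u)) * dt            # total impulse of the actions
    smoothness = np.sum(np.abs(np.diff(u, axis=0)))   # total change of actions
    return rmse, total_impulse, smoothness
```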
Control Strategy
Combined with the neural network-based MPC, we will apply a simple control method called random shooting.
How does it work?
In the MPC algorithm, 300 candidate actions are randomized at each timestep. Repeating this N times and passing the candidates through the model, we get 300 candidate trajectories for each state. Finally, the candidate trajectories are used to select the action sequence that minimizes the objective function.
Additionally, I will limit the change of possible actions u in each timestep. By doing this, we can omit the second quadratic term in the objective function. Instead, I added a horizonpenalty term to the equation:
where the horizonpenalty is… a term I completely made up. It can be whatever you want! In general, you should design it to achieve better control performance.
I experimented with some terms that penalize high system velocities. How?
The trick is subtracting the first states from the last states of the predicted system trajectories. Change of distance over time = velocity. We will come back to this later.
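To make this concrete, here is a minimal sketch of random shooting combined with this kind of horizonpenalty. The helper predict_delta(x, u) is a hypothetical wrapper around the neural network (standardization, time-delayed inputs, and de-standardization included); the actual implementation may differ:

```python
import numpy as np

def random_shooting_mpc(predict_delta, x_t, x_ref, u_prev, n_candidates=300,
                        horizon=10, u_max=1000.0, du_max=500.0, penalty_weight=0.0):
    """Return the first action of the candidate action sequence with the lowest cost.

    predict_delta(x, u) is assumed to return the predicted change of states Δx
    for one timestep.
    """
    # Sample candidate action sequences, limited in magnitude and rate of change
    du = np.random.uniform(-du_max, du_max, size=(n_candidates, horizon, 2))
    u_seq = np.clip(u_prev + np.cumsum(du, axis=1), -u_max, u_max)

    costs = np.zeros(n_candidates)
    for i in range(n_candidates):
        x = np.array(x_t, dtype=float)
        x_start = x.copy()
        for j in range(horizon):
            x = x + predict_delta(x, u_seq[i, j])   # propagate the learned dynamics
            costs[i] += np.sum((x - x_ref) ** 2)    # predicted tracking error
        # One possible "horizonpenalty": penalize the net displacement over the
        # horizon, which discourages high system velocities
        costs[i] += penalty_weight * np.sum((x - x_start) ** 2)
    return u_seq[np.argmin(costs)][0]
```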
Now you might be asking, how is this all related to reinforcement learning?
The Reinforcement Learning (RL) Framework
So in RL, the learning is done by continuous interaction of the agent with the environment. This can also be called “online learning”. In “model-free” RL algorithms, the agent learns by doing.
In model-based reinforcement learning, the agent models the environment and plans the best action based on the model. If you want a detailed explanation, please read this amazing article by Jonathan Hui.
Shortly explained: in control engineering, the "agent" is the controller. The "environment" is the dynamical system plus its surroundings (where the disturbances come from). We get the states from our "observations".
The "reward" can be seen as the (negative of the) objective function and flows from the environment to the agent. Finally, the "action" represents the control inputs.
In general, there are two types of behavior in RL:
- Exploration: the goal is to learn as much as possible
- Exploitation: the goal is to get as much reward as possible
The major work in this article is online learning of the local dynamics. This means that the neural network is not pre-trained. Instead, it is continuously learning the dynamics function of the system.
Here, I designed the framework for our task:
Experiment design for validating the framework:
And the pseudo algorithm is:
We can see from the algorithm that control is only applied once enough data has been collected. There are two phases in this algorithm: exploration and exploitation.
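To make the two phases concrete, here is a rough Python-style outline of the loop as I understand it from the description in this article; the helper function names are placeholders, not the actual implementation:

```python
# Rough outline with placeholder helper functions, not the actual implementation
dataset = []
u_prev = np.zeros(2)
for t in range(num_timesteps):                     # one timestep = 0.001 s
    x_t = measure_states()                         # only x1 and x3 are measurable
    if in_exploration_phase(t):
        u_t = random_exploration_action(u_prev)    # rate-limited random forces
    else:
        u_t = random_shooting_mpc(predict_delta, x_t, x_ref, u_prev)
    apply_to_system(u_t)
    dataset.append(build_training_pair(x_t, u_t))  # inputs and measured Δx targets
    if enough_data(dataset):
        train_on_minibatch(model, dataset)         # online training of the network
    u_prev = u_t
```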
Data Generation
First, we will randomize the actions for some period to collect initial data. In RL terms this is called "exploration". We train the neural network in batches. Some rules for this exploration period (sketched in code after the list):
- The maximum force will be 500 N.
- It can only change at most by 500 N in one timestep.
- We define one timestep as 0.001 seconds.
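A minimal sketch of such a rate-limited random exploration policy (the exact sampling scheme used in the original code is my assumption):

```python
import numpy as np

def random_exploration_action(u_prev, u_max=500.0, du_max=500.0):
    """Random action respecting the force limit and the per-timestep rate limit."""
    du = np.random.uniform(-du_max, du_max, size=np.shape(u_prev))
    return np.clip(u_prev + du, -u_max, u_max)
```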
If we derive the equations of motion using Newton’s law, we can simulate this mechanical system. The initial conditions are steady-state (zero velocities).
We can simulate the system behavior using the library scipy.integrate.odeint. The resulting output trajectories for the first 20 seconds are shown below.
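For illustration, here is a minimal simulation sketch with scipy.integrate.odeint. The chain topology (m1, m2, m3 coupled by nonlinear springs and dampers, with u1 acting on m1, dist on m2, and u3 on m3) and the cubic spring law are my assumptions, not taken from the original code:

```python
import numpy as np
from scipy.integrate import odeint

m, d, k, kp = 0.5, 0.25, 217.0, 63.5     # mass, damping, spring constants

def spring_force(dx):
    # Assumed cubic hardening spring: F = k*dx + kp*dx^3
    return k * dx + kp * dx**3

def dynamics(state, t, u1, u3, dist):
    x1, x2, x3, v1, v2, v3 = state
    f12 = spring_force(x2 - x1) + d * (v2 - v1)   # force of the m1-m2 link on m1
    f23 = spring_force(x3 - x2) + d * (v3 - v2)   # force of the m2-m3 link on m2
    a1 = (u1 + f12) / m
    a2 = (dist - f12 + f23) / m
    a3 = (u3 - f23) / m
    return [v1, v2, v3, a1, a2, a3]

# Integrate one timestep from steady-state initial conditions
state0 = np.zeros(6)
t_span = np.linspace(0.0, 0.001, 2)
next_state = odeint(dynamics, state0, t_span, args=(100.0, -50.0, 0.0))[-1]
```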
We can see the system is extremely unstable and very hard to predict. The steps above can also be repeated to generate test sets for offline training.
NN Prediction Capability
This is not related to the algorithm above, but we want to test if the neural network can predict accurately.
As discussed before, the data will be “standardized” before being fed into the NN model. Both the mean and standard deviation are sampled every num_calib=20000 data points.
Take a look at the resulting data distribution below for 20000 data points:
All input states now have mean 0 and standard deviation 1. This brings the input states to a common scale, enabling fast training of the neural network.
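A minimal sketch of this standardization step, assuming the statistics are simply recomputed from the most recent num_calib samples:

```python
import numpy as np

def calibrate(data):
    """Recompute standardization statistics, e.g. from the last 20000 samples."""
    mean = data.mean(axis=0)
    std = data.std(axis=0) + 1e-8    # avoid division by zero
    return mean, std

def standardize(x, mean, std):
    return (x - mean) / std
```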
For this testing purpose, I trained the neural network with 20000 data points (input-output pairs). The error is barely visible when it predicts one timestep into the future, as shown in the trajectory plot below:
Next, we can also propagate the learned dynamics forward to make multistep predictions. In fact, this is what we are really doing in the framework. Below, we can see example results for predicting N= 10 timesteps into the future:
Despite visible errors, this network can predict the trend of the trajectories. The prediction accuracy will get worse if we increase the prediction horizon N. This is why it is called “local dynamics learning”.
Control Algorithm
As explained before, we will use random shooting in combination with MPC.
To figure out the best horizonpenalty, the algorithm was first simulated with a pre-trained network. Below, we can compare some example horizonpenalty terms and their resulting control performance.
RMSE100 is a metric for comparing controller performance; it gives the controller 0.1 s (100 timesteps) to react before being benchmarked. In this model-based RL algorithm, I implemented horizonpenalty (b) for the objective function.
Now let's see the simulation results of this framework!
Exploration
Random actions are applied to the system for the first 20 seconds of the simulation. At this point, we obtain the first values of the mean and standard deviation. With this information, online training of the neural network starts.
At 40 seconds, the neural network has been trained with 2 seconds of the old dynamics and 18 seconds of the new dynamics.
Exploitation
At this point, we can be fairly confident that the neural network has learned the dynamics function, so we can stop exploring and start "exploiting" our model by applying the control algorithm.
Due to the exploration, the system now oscillates very fast. The state conditions of the system at 40 seconds are shown in the table below.
In this exploitation phase, I will increase the maximum possible action to 1000 N. This was done to speed up the control process. In general, we could reach the desired values within 0.3 seconds using this algorithm.
Final Results
The results here show the continuation of the simulation from 40 seconds.
I set the desired values [x1, x3] = [1, 2] m for the next 0.3 seconds and then changed them back to the origin [0, 0] m.
The simulation was repeated twice: with and without disturbance dist.
In the first simulation, no disturbance was applied to the system:
In the second simulation, a disturbance dist between -100 N and 100 N was randomly applied to the system:
TL;DR
I developed a model-based reinforcement learning framework using a neural network model and model predictive control. I used this framework to simulate the control of a nonlinear MIMO system.
In general, this model-based RL algorithm could:
- learn the local dynamics of the system in real-time.
- achieve good control performance, regardless of the initial state condition after exploration.
- handle sudden dynamic changes and possible disturbances to the system.
Requirements:
Possible Improvement:
- Use input convex neural networks to learn the whole system dynamics, which allows the MPC problem to be solved to global optimality [3].
- Implementation of the control algorithm on computing platforms for a faster run time (e.g. OpenCL).
Future Work
The mechanical system presented here is purely fictional. An exciting direction would be to simulate real-world problems. The presented framework above needs to be modified depending on the domain problem, but the general idea stays the same.
If you are interested in robotics movement control, you should try out the MuJoCo physics engine.
Thank you for reading and have fun exploring!
References
1. Hagan, M.T.; Demuth, H.B.; De Jesús, O.: An introduction to the use of neural networks in control systems. Int. J. Robust Nonlinear Control, vol. 12, John Wiley and Sons Ltd, 2002, pp. 959-985.
2. Nagabandi, A.; Kahn, G.; Fearing, R.S.; Levine, S.: Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning. CoRR abs/1708.02596, 2017.
3. Chen, Y.; Shi, Y.; Zhang, B.: Optimal Control Via Neural Networks: A Convex Approach. arXiv: Optimization and Control, 2019.
4. Sutton, R.S.; Barto, A.G.: Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018. ISBN 0262039249.
5. Wong, W.; Chee, E.; Li, J.; Wang, X.: Recurrent Neural Network-Based Model Predictive Control for Continuous Pharmaceutical Manufacturing. Mathematics 6, 2018, 242.
6. Rawlings, J.; Mayne, D.Q.; Diehl, M.: Model Predictive Control: Theory, Computation, and Design, 2nd Edition. Nob Hill Publishing, LLC, Santa Barbara, California, 2019.