# Which one is better: Reinforcement Learning or Model Predictive Control? The Inverted Pendulum Case

*“Have you ever thought about that? If so, then which one is better?”*

*Disclaimer: This post is based on my own experience and knowledge. If you think something here is not written properly, feel free to contact me.*

If you come from a control engineering background, that is surely a tough question. As a former control engineering student, before designing a controller I always started with system modeling, that is, writing down the mathematical model that best represents the system. With that model in hand, we can design a control system and be confident, to the extent that the model is accurate, that the system will follow its reference point as designed.

When designing a control system, there are many frameworks and methods to choose from, ranging from classical to modern control theory. In classical control, where we deal with Single-Input Single-Output (SISO) systems and transfer functions, we can use bang-bang control, PID control, pole placement, etc.

In modern control, where we deal with Multi-Input Multi-Output (MIMO) systems, we need to convert the system into state-space form, which rewrites a higher-order system as a set of first-order equations. There are many methods for this setting, such as Model Predictive Control (MPC), the Linear Quadratic Regulator (LQR), robust control, Pontryagin’s Maximum/Minimum Principle (PMP), the Kalman filter (for state estimation), etc.

Recently, many people working in Artificial Intelligence (AI) and Machine Learning (ML) have been trying to apply these techniques to control. This is not actually new: researchers have long tried using neural networks to control systems, but with a limitation, since a neural network can handle a control task only under the specific conditions it was trained on. If the conditions change, we need to retrain it. This is very different from classical/modern control, where we can design for robustness to handle such cases.

Later, researchers used the PID control framework with the PID parameters tuned by a neural network. Since then, AI applications in control have grown quickly, and many other methods have been used to control systems, such as ant colony optimization, genetic algorithms, fuzzy control, etc. And now, as the AI and ML world keeps growing, there is a method called reinforcement learning.

In general, we can divide the control world into conservative control systems (classical and modern control) and intelligent control systems (AI and ML).

Now we will try to answer the question above: which one is better for controlling an inverted pendulum, model predictive control or reinforcement learning?

REINFORCEMENT LEARNING

**Reinforcement learning** (RL) is an area of **machine learning** built around two components: agents and environments. It is considered one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

How does it work?

The agent applies actions to the environment in order to maximize some notion of cumulative reward. Each time the agent acts, the environment updates its state and returns the new state together with a reward; the agent uses these to update how it chooses its next action. In control terms, we can think of the agent as the controller and the environment as the system we want to control.
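The interaction loop described above can be sketched in plain Python. This is only an illustration under stated assumptions: `ToyEnvironment` and `simple_agent` are hypothetical stand-ins invented for this sketch (they are not part of OpenAI Gym), and the agent here follows a fixed hand-written rule instead of learning.

```python
import random


class ToyEnvironment:
    """Hypothetical 1-D environment: the state is a position that should stay near 0."""

    def reset(self):
        self.position = 0.0
        self.steps = 0
        return self.position

    def step(self, action):
        # action 0 pushes left, action 1 pushes right; a small random disturbance is added
        push = -1.0 if action == 0 else 1.0
        self.position += push + random.uniform(-0.5, 0.5)
        self.steps += 1
        reward = 1.0  # +1 for every step survived
        done = abs(self.position) > 3.0 or self.steps >= 50
        return self.position, reward, done


def simple_agent(observation):
    """Hand-written policy (no learning): push back toward the origin."""
    return 0 if observation > 0 else 1


def run_episode(env, policy):
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(observation)                  # the agent acts on the current state
        observation, reward, done = env.step(action)  # the environment returns new state and reward
        total_reward += reward                        # cumulative reward the agent wants to maximize
    return total_reward


random.seed(0)
episode_return = run_episode(ToyEnvironment(), simple_agent)
print(episode_return)
```

A learning agent would replace `simple_agent` with a policy whose parameters are adjusted from the collected rewards, which is what the steps below do with a neural network.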

To model the environment for the inverted pendulum task, we’ll use OpenAI Gym, a toolkit developed by OpenAI for developing and comparing reinforcement learning algorithms. It provides several pre-defined environments for training and testing reinforcement learning agents, including ones for classic physics control tasks.

So here are the steps to implement RL with OpenAI Gym in Python [1]:

Step1: Install OpenAI Gym & Call the libraries

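```python
# Notebook shell command; mitdeeplearning is needed for the mdl.util helpers used in Step9
!pip install gym mitdeeplearning

import tensorflow as tf
import numpy as np
import base64, io, time, gym
import IPython, functools
import matplotlib.pyplot as plt
from tqdm import tqdm
import mitdeeplearning as mdl
```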

Step2: Initiate the environment, we will use the cart pole environment as it best represents the inverted pendulum case

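```python
env = gym.make("CartPole-v0")
# Note: env.seed() and the 4-value env.step() return used below belong to the
# older Gym API (before version 0.26).
env.seed(1)

# Number of discrete actions (push left or push right); used by the agent below
n_actions = env.action_space.n
```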

Step3: Define cart pole agent

```python
### Define the Cartpole agent ###

# Defines a feed-forward neural network
def create_cartpole_model():
    model = tf.keras.models.Sequential([
        # First Dense layer
        tf.keras.layers.Dense(units=32, activation='relu'),
        # Define the last Dense layer, which will provide the network's output.
        tf.keras.layers.Dense(units=n_actions, activation=None)
    ])
    return model

cartpole_model = create_cartpole_model()
```

This cart pole agent is a feed-forward neural network with one hidden dense layer of 32 units and 2 outputs, one for each action (push the cart left or right).
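To make the shape of this network concrete, here is a minimal pure-Python sketch of the same 4-input, 32-hidden-unit, 2-output forward pass. It is only an illustration: the helper names `dense` and `random_layer`, the random weights, and the sample observation are assumptions of this sketch, not the Keras implementation.

```python
import random


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: outputs[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    outputs = []
    for j in range(len(biases)):
        z = sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        outputs.append(max(0.0, z) if activation == "relu" else z)
    return outputs


def random_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)
# CartPole observation: cart position, cart velocity, pole angle, pole angular velocity
n_observations, n_hidden, n_actions = 4, 32, 2
w1, b1 = random_layer(n_observations, n_hidden, rng)
w2, b2 = random_layer(n_hidden, n_actions, rng)

observation = [0.01, -0.02, 0.03, 0.04]   # made-up sample observation
hidden = dense(observation, w1, b1, activation="relu")
logits = dense(hidden, w2, b2)            # one unnormalized score per action

# Weights plus biases of both layers
n_parameters = (n_observations * n_hidden + n_hidden) + (n_hidden * n_actions + n_actions)
print(len(hidden), len(logits), n_parameters)  # 32 2 226
```

So the agent is small: 226 trainable parameters map a 4-number observation to 2 action scores.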

Step4: Define the agent’s action

```python
### Define the agent's action function ###

# Function that takes observations as input, executes a forward pass through model,
# and outputs a sampled action.
# Arguments:
#   model: the network that defines our agent
#   observation: observation which is fed as input to the model
# Returns:
#   action: choice of agent action
def choose_action(model, observation):
    # add batch dimension to the observation
    observation = np.expand_dims(observation, axis=0)
    # Feed the observations through the model to predict the log probabilities of each possible action
    logits = model.predict(observation)
    # pass the log probabilities through a softmax to compute true probabilities
    prob_weights = tf.nn.softmax(logits).numpy()
    # Randomly sample from the prob_weights to pick an action.
    action = np.random.choice(n_actions, size=1, p=prob_weights.flatten())[0]
    return action
```
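The softmax-then-sample idea in `choose_action` can be checked on its own with a small standard-library sketch. The helper names `softmax` and `sample_action` and the example logits are assumptions made for this illustration only.

```python
import math
import random


def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [0.0, math.log(3.0)]   # action 1 is three times as likely as action 0
probs = softmax(logits)         # approximately [0.25, 0.75]

rng = random.Random(0)
samples = [sample_action(logits, rng) for _ in range(10_000)]
frequency_of_1 = sum(samples) / len(samples)
print(probs, frequency_of_1)
```

Sampling, rather than always taking the most likely action, is what lets the agent keep exploring while it learns.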

Step5: Define the agent’s memory

```python
### Agent Memory ###

class Memory:
    def __init__(self):
        self.clear()

    # Resets/restarts the memory buffer
    def clear(self):
        self.observations = []
        self.actions = []
        self.rewards = []

    # Add observations, actions, rewards to memory
    def add_to_memory(self, new_observation, new_action, new_reward):
        self.observations.append(new_observation)
        self.actions.append(new_action)
        # Update the list of rewards with new reward
        self.rewards.append(new_reward)

memory = Memory()
```

Step6: Define reward function

```python
### Reward function ###

# Helper function that normalizes an np.array x
def normalize(x):
    x -= np.mean(x)
    x /= np.std(x)
    return x.astype(np.float32)

# Compute normalized, discounted, cumulative rewards (i.e., return)
# Arguments:
#   rewards: reward at timesteps in episode
#   gamma: discounting factor
# Returns:
#   normalized discounted reward
def discount_rewards(rewards, gamma=0.95):
    discounted_rewards = np.zeros_like(rewards)
    R = 0
    for t in reversed(range(0, len(rewards))):
        # update the total discounted reward
        R = R * gamma + rewards[t]
        discounted_rewards[t] = R

    return normalize(discounted_rewards)
```
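A small worked example may help show what the discounting does. The sketch below re-implements the same recursion with the standard library only; the function name `discounted_returns`, the three-step reward list, and `gamma=0.5` are chosen purely for illustration (the code above uses `gamma=0.95`).

```python
import math


def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = running * gamma + rewards[t]
        returns[t] = running
    return returns


def normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]

normalized = normalize(returns)
print(normalized)
```

Earlier time steps get larger returns because more reward still lies ahead of them. After normalization, early actions in the episode carry a positive weight and the last actions before the failure carry a negative one.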

Step7: Define loss function

```python
### Loss function ###

# Arguments:
#   logits: network's predictions for actions to take
#   actions: the actions the agent took in an episode
#   rewards: the rewards the agent received in an episode
# Returns:
#   loss
def compute_loss(logits, actions, rewards):
    # Compute the negative log probabilities
    neg_logprob = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=actions)
    # Scale the negative log probability by the rewards
    loss = tf.reduce_mean(neg_logprob * rewards)
    return loss
```
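To see what this loss computes numerically, here is a minimal standard-library sketch of the same quantity: the mean over time steps of the negative log probability of the chosen action, weighted by the return. The helper names and the toy numbers are assumptions of this sketch; the actual training uses the TensorFlow version above.

```python
import math


def log_softmax(logits):
    """Log probabilities of each action, computed in a numerically stable way."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_total for z in logits]


def policy_gradient_loss(logits_batch, actions, returns):
    """Mean over time steps of -log pi(a_t | s_t) * G_t."""
    total = 0.0
    for logits, action, ret in zip(logits_batch, actions, returns):
        neg_logprob = -log_softmax(logits)[action]
        total += neg_logprob * ret
    return total / len(actions)


# Two time steps with equal logits, so each action has probability 0.5
logits_batch = [[0.0, 0.0], [0.0, 0.0]]
actions = [0, 1]

loss = policy_gradient_loss(logits_batch, actions, returns=[2.0, 0.0])
print(loss)  # ln(2), about 0.693

balanced = policy_gradient_loss(logits_batch, actions, returns=[1.0, -1.0])
print(balanced)  # 0.0
```

Minimizing this loss raises the probability of actions that were followed by above-average returns and lowers it for actions followed by below-average returns.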

Step8: Use loss function to define a training step for the learning algorithm

```python
### Training step (forward and backpropagation) ###

def train_step(model, optimizer, observations, actions, discounted_rewards):
    with tf.GradientTape() as tape:
        # Forward propagate through the agent network
        logits = model(observations)
        # Call the compute_loss function to compute the loss
        loss = compute_loss(logits, actions, discounted_rewards)

    # Run backpropagation to minimize the loss using the tape.gradient method
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```

Step9: Run the cart pole

```python
### Cartpole training! ###

# Learning rate and optimizer
learning_rate = 1e-3
optimizer = tf.keras.optimizers.Adam(learning_rate)

# instantiate cartpole agent
cartpole_model = create_cartpole_model()

# to track our progress
smoothed_reward = mdl.util.LossHistory(smoothing_factor=0.9)
plotter = mdl.util.PeriodicPlotter(sec=2, xlabel='Iterations', ylabel='Rewards')

if hasattr(tqdm, '_instances'): tqdm._instances.clear()  # clear if it exists
for i_episode in range(500):

    plotter.plot(smoothed_reward.get())

    # Restart the environment
    observation = env.reset()
    memory.clear()

    while True:
        # using our observation, choose an action and take it in the environment
        action = choose_action(cartpole_model, observation)
        next_observation, reward, done, info = env.step(action)
        # add to memory
        memory.add_to_memory(observation, action, reward)

        # is the episode over? did you crash or do so well that you're done?
        if done:
            # determine total reward and keep a record of this
            total_reward = sum(memory.rewards)
            smoothed_reward.append(total_reward)

            # initiate training: remember we don't know anything about how the
            # agent is doing until it has crashed!
            train_step(cartpole_model, optimizer,
                       observations=np.vstack(memory.observations),
                       actions=np.array(memory.actions),
                       discounted_rewards=discount_rewards(memory.rewards))

            # reset the memory
            memory.clear()
            break
        # update our observations
        observation = next_observation
```

In general, the chart below summarizes the whole workflow:

MODEL PREDICTIVE CONTROL

**Model Predictive Control** (MPC) is widely known as an advanced process control method that is used to control a process while satisfying a set of constraints. In recent years it has also been used to control electrical and mechanical systems.

How does it work?

MPC uses a model of the system to make predictions about the system’s future behavior. It solves an optimization problem online to find the optimal control action that drives the predicted output to the reference. MPC can handle multi-input multi-output systems that may have interactions between their inputs and outputs. It can also handle input and output constraints [3].

In a nutshell, MPC is an optimization method that re-solves a finite-horizon optimization for the control input at every time step. It is called online because the optimization is repeated while the system is running, until the system reaches its set point. This differs from other optimal control methods, where the control gain is calculated only once before the process starts.
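The receding-horizon idea can be sketched on a deliberately simple system. Everything in this sketch is an assumption made for illustration: the single-integrator model, the three candidate inputs, the horizon of 5 steps, the cost weights, and the brute-force search. A real MPC for the cart pole, such as the one adapted in [2], uses a proper solver and the cart pole model instead.

```python
from itertools import product

DT = 0.1                     # time step
ACTIONS = (-1.0, 0.0, 1.0)   # candidate control inputs
HORIZON = 5                  # number of steps predicted at each optimization


def model(x, u):
    """Assumed toy model: a single integrator, x[k+1] = x[k] + DT * u[k]."""
    return x + DT * u


def horizon_cost(x, sequence, reference):
    """Predict the state over the horizon and accumulate tracking and input cost."""
    cost = 0.0
    for u in sequence:
        x = model(x, u)
        cost += (x - reference) ** 2 + 0.01 * u ** 2
    return cost


def mpc_step(x, reference):
    """Search all input sequences over the horizon, but return only the first input."""
    best = min(product(ACTIONS, repeat=HORIZON),
               key=lambda seq: horizon_cost(x, seq, reference))
    return best[0]


x, reference = 0.0, 1.0
trajectory = [x]
for _ in range(30):
    u = mpc_step(x, reference)  # re-optimize at every time step
    x = model(x, u)             # in this sketch the "plant" is identical to the model
    trajectory.append(x)

print(trajectory[-1])
```

Only the first input of each optimized sequence is applied; at the next step the whole optimization is repeated from the newly measured state, which is what makes the method "online".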

So here are the steps to implement MPC with OpenAI Gym in Python [2]:

Step1: Determine the mathematical model of cart pole

There are actually many mathematical models of the cart pole that we can use. Which one fits depends on the assumptions; for example, some models ignore the moment of inertia of the pole. So we need to choose the model wisely.
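As one concrete example of such a model, the sketch below writes out the frictionless cart pole equations in the form that, to my knowledge, the Gym CartPole environment itself uses, with its default parameter values. Treat both the equations and the constants as assumptions of this sketch; this is not necessarily the model used by the MPC code in this post.

```python
import math

# Assumed parameters (believed to match Gym's CartPole defaults)
GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
TOTAL_MASS = MASS_CART + MASS_POLE
HALF_LENGTH = 0.5                      # half of the pole length
POLE_MASS_LENGTH = MASS_POLE * HALF_LENGTH
TAU = 0.02                             # seconds between state updates


def accelerations(state, force):
    """Cart and pole accelerations; theta is measured from the upright position."""
    x, x_dot, theta, theta_dot = state
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + POLE_MASS_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / TOTAL_MASS))
    x_acc = temp - POLE_MASS_LENGTH * theta_acc * cos_t / TOTAL_MASS
    return x_acc, theta_acc


def euler_step(state, force):
    """Advance the state by one time step with explicit Euler integration."""
    x, x_dot, theta, theta_dot = state
    x_acc, theta_acc = accelerations(state, force)
    return (x + TAU * x_dot,
            x_dot + TAU * x_acc,
            theta + TAU * theta_dot,
            theta_dot + TAU * theta_acc)


upright = (0.0, 0.0, 0.0, 0.0)
pushed = accelerations(upright, force=10.0)                # push the cart to the right
tilted = accelerations((0.0, 0.0, 0.05, 0.0), force=0.0)   # small tilt, no force
print(pushed, tilted)
```

With no force, a small tilt gives a positive angular acceleration, so the pole keeps falling: the upright position is an unstable equilibrium, which is why a controller is needed at all.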

Step2: Determine the cost function
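The cost function itself is not reproduced here. A common choice in MPC, shown below only as an assumption and not necessarily the one used in this post, is a quadratic cost that penalizes the squared deviation of each state from its reference plus the squared control input over the prediction horizon. The weights in the example are made up.

```python
def quadratic_cost(states, inputs, reference, q_weights, r_weight):
    """Sum over the horizon of weighted squared state errors and squared inputs."""
    cost = 0.0
    for state, u in zip(states, inputs):
        cost += sum(q * (s - ref) ** 2
                    for q, s, ref in zip(q_weights, state, reference))
        cost += r_weight * u ** 2
    return cost


# One predicted step of a cart pole state: [position, velocity, angle, angular velocity]
predicted_states = [[0.1, 0.0, 0.2, 0.0]]
planned_inputs = [1.0]
reference = [0.0, 0.0, 0.0, 0.0]     # cart at the origin, pole upright, everything at rest
q_weights = [1.0, 0.0, 10.0, 0.0]    # hypothetical weights: the angle matters most
r_weight = 0.1                       # hypothetical penalty on control effort

cost = quadratic_cost(predicted_states, planned_inputs, reference, q_weights, r_weight)
print(cost)  # about 0.51 = 1*0.1^2 + 10*0.2^2 + 0.1*1^2
```

The weights express the design trade-off: a larger weight on the angle makes the controller prioritize keeping the pole upright over keeping the cart centered or saving control effort.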

Step3: Call the libraries

```python
import matplotlib.animation as animation
import numpy as np
from mpc import MPC  # MPC class from a local mpc.py module (presumably the code adapted from [2])
import gym
import mitdeeplearning as mdl
```

Step4: Run the MPC models

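```python
env = gym.make('CartPole-v0')
env.seed(1)

start_theta = 0
mpc = MPC(0.5, 0, start_theta, 0)
action = 0

for i_episode in range(1):
    observation = env.reset()
    for t in range(500):
        env.render()
        observation, reward, done, info = env.step(action)
        # Pass the observation to the MPC with position shifted by 0.5 and angle shifted by pi
        a = mpc.update(observation[0] + 0.5, observation[1], observation[2] + np.pi, observation[3])
        # CartPole only accepts discrete actions, so the magnitude of the MPC output
        # sets the force and its sign selects the direction
        env.env.force_mag = abs(a)
        #print(a)
        if a < 0:
            action = 0
        else:
            action = 1
        if done:
            # the loop keeps running even after the environment reports done
            pass
```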

RESULTS

From the results above, given the objective of keeping the pole upright with a maximum deviation of 15 degrees, we can see that RL gives qualitatively more satisfying results than MPC. RL succeeds in keeping the pole upright from the beginning, while MPC fails at first, even though it does keep the pole upright after that.

These results are surprising, given how much effort goes into the MPC calculation: we need a mathematical model of the system, which tends to be a pain point for control engineers. Of course, we could improve things by finding another mathematical model or tuning the MPC parameters. But the most important point here is that RL gives better results even though we don’t know the mathematical model of the cart pole: using a policy-based strategy, the algorithm succeeds in controlling the cart to keep the pole upright.

DISCUSSION

Coming from a control background, I certainly cannot say that RL beats MPC so completely that we won’t need control theory in the future. The model of the system is still important, because with it we can check the stability of the system using Lyapunov’s theorem [4]. If you can guarantee the stability of the system, then you can be much more confident that your controller will not fail when something unexpected happens.
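As a small illustration of what such a stability check looks like, the sketch below tests a quadratic Lyapunov function for a discrete-time linear system x[k+1] = A x[k]. The matrix `A` is a hypothetical closed-loop system chosen for this example, not the cart pole, and the candidate function V(x) = xᵀx is an assumption that happens to work for this particular `A`.

```python
def mat_vec(a, x):
    """Multiply matrix a (list of rows) by vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def lyapunov_value(x):
    """Candidate Lyapunov function V(x) = x^T x."""
    return sum(x_i ** 2 for x_i in x)


# Hypothetical stable closed-loop dynamics
A = [[0.9, 0.1],
     [0.0, 0.8]]

# V decreases along every trajectory if M = A^T A - I is negative definite.
m11 = A[0][0] ** 2 + A[1][0] ** 2 - 1.0
m12 = A[0][0] * A[0][1] + A[1][0] * A[1][1]
m22 = A[0][1] ** 2 + A[1][1] ** 2 - 1.0
# A symmetric 2x2 matrix is negative definite if m11 < 0 and its determinant is positive
is_negative_definite = m11 < 0 and (m11 * m22 - m12 ** 2) > 0

# Numerical confirmation along one simulated trajectory
x = [1.0, 1.0]
values = [lyapunov_value(x)]
for _ in range(20):
    x = mat_vec(A, x)
    values.append(lyapunov_value(x))

print(is_negative_definite, values[0], values[-1])
```

The algebraic test covers every initial state at once, which is the kind of guarantee a simulation alone cannot give; the simulated trajectory is only a sanity check of the same conclusion.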

Even though RL gives a satisfactory result, there is still a limitation: we cannot prove that the RL algorithm will succeed until we simulate it. The good news is that research on interpretable machine learning is developing. Hopefully, this will be an answer for skeptical, conservative control engineers about using RL in real cases, especially industrial ones, where safety requirements are very strict [5].

REFERENCE

[1] The code is adapted from MIT Deep Learning Bootcamp: https://github.com/aamini/introtodeeplearning

[2] The code is adapted from Philip Zucker’s blog: http://www.philipzucker.com/model-predictive-control-of-cartpole-in-openai-gym-using-osqp/

[3] Wikipedia, Model predictive control: https://en.wikipedia.org/wiki/Model_predictive_control