Which one is better: Reinforcement Learning or Model Predictive Control? Inverted Pendulum — Case*

Lucas Suryana
May 23 · 9 min read

“Have you ever thought about that? If so, then which one is better?”

*Disclaimer: This post is based on my own experience and knowledge. If you think something here is not written properly, feel free to contact me.

If you come from a control engineering background, that is surely a tough question. As a former control engineering student, before designing a controller I always started by modeling the system, that is, by writing down the mathematical model that best represents it. With that model in hand, we can design a control system and reason rigorously about whether the system will follow its reference point as designed, at least to the extent that the model matches reality.

The mathematical model represents: Electric, Fluid, Mechanical systems

When designing a control system, there are many frameworks and methods to choose from, ranging from classical to modern control theory. In classical control, where we deal with Single-Input Single-Output (SISO) systems and transfer functions, we can use bang-bang control, PID control, pole placement, etc.
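As a rough illustration of the classical approach, here is a minimal discrete-time PID loop driving a simple first-order plant toward a set point. The plant (dx/dt = -x + u), the gains, and the step size are all assumptions I picked for the example; they are not taken from any system discussed in this post.

```python
class PID:
    """Textbook discrete-time PID controller (illustrative sketch)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, reference, measurement):
        error = reference - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_pid(reference=1.0, steps=2000, dt=0.01):
    """Simulate the assumed plant dx/dt = -x + u with forward Euler."""
    pid = PID(kp=2.0, ki=1.0, kd=0.0, dt=dt)  # assumed gains
    x = 0.0
    for _ in range(steps):
        u = pid.update(reference, x)
        x += dt * (-x + u)
    return x


final_state = simulate_pid()
print(f"state after 20 s: {final_state:.3f}")
```

The integral term is what removes the steady-state error here; with only the proportional term the state would settle short of the reference.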

Classical Control

In modern control, where we deal with Multi-Input Multi-Output (MIMO) systems, we need to convert the system into state-space form, which breaks a higher-order system into a set of first-order equations. There are many methods for this setting, such as Model Predictive Control (MPC), the Linear Quadratic Regulator (LQR), robust control, Pontryagin's Maximum/Minimum Principle (PMP), the Kalman filter (for state estimation), etc.
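To make the state-space idea concrete, here is a small sketch that rewrites a second-order mass-spring-damper equation, m·x'' + c·x' + k·x = u, as two first-order equations and simulates it. The system and its parameter values are assumptions chosen for illustration.

```python
def mass_spring_damper(state, u, m=1.0, c=0.5, k=2.0):
    """State-space form of m*x'' + c*x' + k*x = u.

    The state is (position, velocity), so the single second-order
    equation becomes two first-order ones:
        position' = velocity
        velocity' = (u - c*velocity - k*position) / m
    """
    position, velocity = state
    return (velocity, (u - c * velocity - k * position) / m)


def euler_step(f, state, u, dt):
    """Advance the state by one forward-Euler step."""
    derivative = f(state, u)
    return tuple(s + dt * d for s, d in zip(state, derivative))


# Release the mass from position 1.0 with no input and let it settle.
state = (1.0, 0.0)
for _ in range(5000):
    state = euler_step(mass_spring_damper, state, u=0.0, dt=0.01)
print(state)
```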

Modern Control

Recently, many people working in Artificial Intelligence (AI) and Machine Learning (ML) have tried to apply these techniques to control. This is not actually new: researchers have long tried using a neural network to control a system, but the approach has a limitation, since a neural network can only handle a control task under the specific conditions it was trained on. If the conditions change, we need to retrain it. This differs greatly from classical/modern control, where we can design for robustness to handle such cases.

Later, scientists tried using the PID control framework with the PID parameters tuned by a neural network. Since then, AI applications in control have grown quickly, and many new methods have been used to control systems, such as ant colony optimization, genetic algorithms, fuzzy control, etc. And now, as AI and ML keep growing, there is a method called reinforcement learning.

In general, we can divide the control world into conservative control systems (classical and modern control) and intelligent control systems (AI and ML).

Now we will try to answer the question above: which one is better for controlling an inverted pendulum, model predictive control or reinforcement learning?


Reinforcement learning (RL) is an area of machine learning built around two components: agents and environments. It is considered one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

How does it work?

The agent acts on the environment in order to maximize a notion of cumulative reward. When the agent takes an action, the environment updates its state and returns the new state together with a reward; the agent uses these to update how it chooses its next action. In control terms, we can think of the agent as the controller and the environment as the system we want to control.
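The interaction loop can be sketched in a few lines of plain Python. The environment below is a made-up toy (a point on a line that must stay within a limit), and the "agent" is a hand-written rule rather than a learned policy; both are assumptions for illustration only. The `reset`/`step` method names simply mirror the style of interface used later in this post.

```python
import random


class ToyBalanceEnv:
    """Hypothetical 1-D environment: keep a point within [-limit, limit]."""

    def __init__(self, limit=5, max_steps=50):
        self.limit = limit
        self.max_steps = max_steps

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # Action 0 pushes left, action 1 pushes right; a random disturbance is added.
        push = -1 if action == 0 else 1
        self.position += push + random.choice([-1, 0, 1])
        self.steps += 1
        done = abs(self.position) > self.limit or self.steps >= self.max_steps
        reward = 1.0  # +1 for every step the episode is still running
        return self.position, reward, done


def centering_policy(observation):
    """Hand-written 'agent': always push back toward zero."""
    return 0 if observation > 0 else 1


def run_episode(env, policy):
    """Run one episode and return the cumulative reward."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)                   # agent acts
        observation, reward, done = env.step(action)   # environment responds
        total_reward += reward
    return total_reward


random.seed(0)
print(run_episode(ToyBalanceEnv(), centering_policy))
```

A learning agent would replace `centering_policy` with a policy whose parameters are adjusted from the rewards it collects.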

Reinforcement Learning Workflow

To model the environment for the inverted pendulum task, we will use a toolkit developed by OpenAI called OpenAI Gym. OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It provides several predefined environments for training and testing reinforcement learning agents, including ones for classic physics control tasks.

OpenAI Gym

So here are the steps to implement RL with OpenAI Gym in Python [1]:

Step1: Install OpenAI Gym & Call the libraries

!pip install gym mitdeeplearning
import tensorflow as tf
import numpy as np
import base64, io, time, gym
import IPython, functools
import matplotlib.pyplot as plt
import mitdeeplearning as mdl  # plotting helpers used in Step9
from tqdm import tqdm

Step2: Initiate the environment. We will use the cart pole environment, as it best represents the inverted pendulum case

env = gym.make("CartPole-v0")
n_actions = env.action_space.n  # number of discrete actions (2: push left or push right)
Cart Pole System in OpenAI Gym

Step3: Define cart pole agent

### Define the Cartpole agent ###
# Defines a feed-forward neural network
def create_cartpole_model():
    model = tf.keras.models.Sequential([
        # First Dense layer
        tf.keras.layers.Dense(units=32, activation='relu'),
        # Last Dense layer, which provides the network's output:
        # one unnormalized log probability (logit) per action
        tf.keras.layers.Dense(units=n_actions, activation=None)
    ])
    return model

cartpole_model = create_cartpole_model()

This cart pole agent is described as a feed-forward neural network with 32 hidden dense units and 2 outputs.

Cart Pole Agent description

Step4: Define the agent’s action

### Define the agent's action function ###
# Function that takes observations as input, executes a forward pass through model,
# and outputs a sampled action.
# Arguments:
#   model: the network that defines our agent
#   observation: observation which is fed as input to the model
# Returns:
#   action: choice of agent action
def choose_action(model, observation):
    # add batch dimension to the observation
    observation = np.expand_dims(observation, axis=0)
    # feed the observation through the model to get the logits
    # (unnormalized log probabilities) of each possible action
    logits = model.predict(observation)
    # pass the logits through a softmax to compute true probabilities
    prob_weights = tf.nn.softmax(logits).numpy()
    # randomly sample from prob_weights to pick an action
    action = np.random.choice(n_actions, size=1, p=prob_weights.flatten())[0]
    return action
The agent’s action
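To see what the softmax-and-sample step does without TensorFlow, here is a standard-library sketch. The function names and the example logits are mine, chosen for illustration; this is the same idea as `choose_action`, not the code used in this post.

```python
import math
import random


def softmax(logits):
    """Turn raw network outputs (logits) into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# With logits [2.0, 0.0], action 0 is favored but action 1 is still explored.
probs = softmax([2.0, 0.0])
print([round(p, 3) for p in probs])
print(sample_action([2.0, 0.0]))
```

Sampling instead of always taking the most likely action is what lets the agent keep exploring while it learns.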

Step5: Define the agent’s memory

### Agent Memory ###
class Memory:
    def __init__(self):
        self.clear()

    # Resets/restarts the memory buffer
    def clear(self):
        self.observations = []
        self.actions = []
        self.rewards = []

    # Add observations, actions, rewards to memory
    def add_to_memory(self, new_observation, new_action, new_reward):
        self.observations.append(new_observation)
        self.actions.append(new_action)
        # Update the list of rewards with new reward
        self.rewards.append(new_reward)

memory = Memory()

Step6: Define reward function

### Reward function ###
# Helper function that normalizes an np.array x
def normalize(x):
    x -= np.mean(x)
    x /= np.std(x)
    return x.astype(np.float32)

# Compute normalized, discounted, cumulative rewards (i.e., return)
# Arguments:
#   rewards: reward at timesteps in episode
#   gamma: discounting factor
# Returns:
#   normalized discounted reward
def discount_rewards(rewards, gamma=0.95):
    discounted_rewards = np.zeros_like(rewards)
    R = 0
    for t in reversed(range(0, len(rewards))):
        # update the total discounted reward
        R = R * gamma + rewards[t]
        discounted_rewards[t] = R
    return normalize(discounted_rewards)
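A tiny worked example helps to see what the discounting does. The sketch below repeats the same backward recursion in plain Python; the three-step episode and gamma = 0.5 are values I chose so the arithmetic is easy to follow by hand.

```python
def discounted_returns(rewards, gamma):
    """Return-to-go for each timestep: R_t = r_t + gamma * R_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]
print(normalize(returns))
```

Early actions get a larger return because they are credited with everything that happened after them; after normalization, above-average returns become positive and below-average ones negative.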

Step7: Define loss function

### Loss function ###
# Arguments:
#   logits: network's predictions for actions to take
#   actions: the actions the agent took in an episode
#   rewards: the rewards the agent received in an episode
# Returns:
#   loss
def compute_loss(logits, actions, rewards):
    # Compute the negative log probabilities of the actions that were taken
    neg_logprob = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=actions)
    # Scale the negative log probability by the rewards
    loss = tf.reduce_mean(neg_logprob * rewards)
    return loss
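For readers who want to see the arithmetic behind this loss, here is a plain-Python sketch of the same quantity: the mean over timesteps of the negative log probability of the chosen action, weighted by that step's return. The helper names and the numbers are illustrative assumptions.

```python
import math


def log_softmax(logits):
    """Log probabilities of each action, computed in a numerically stable way."""
    largest = max(logits)
    log_norm = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return [z - log_norm for z in logits]


def policy_gradient_loss(logits_per_step, actions, returns):
    """Mean of (-log pi(a_t | s_t)) * R_t over the episode."""
    terms = []
    for logits, action, ret in zip(logits_per_step, actions, returns):
        neg_logprob = -log_softmax(logits)[action]
        terms.append(neg_logprob * ret)
    return sum(terms) / len(terms)


# One step, two equally likely actions, return of 2.0: loss = 2 * ln(2).
print(policy_gradient_loss([[0.0, 0.0]], [0], [2.0]))
```

Minimizing this loss raises the probability of actions that were followed by a positive (normalized) return and lowers the probability of actions followed by a negative one.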

Step8: Use loss function to define a training step for the learning algorithm

### Training step (forward and backpropagation) ###
def train_step(model, optimizer, observations, actions, discounted_rewards):
    with tf.GradientTape() as tape:
        # Forward propagate through the agent network
        logits = model(observations)
        # Call the compute_loss function to compute the loss
        loss = compute_loss(logits, actions, discounted_rewards)
    # Run backpropagation to minimize the loss using the tape.gradient method
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

Step9: Run the cart pole

### Cartpole training! ###
# Learning rate and optimizer
learning_rate = 1e-3
optimizer = tf.keras.optimizers.Adam(learning_rate)
# instantiate cartpole agent
cartpole_model = create_cartpole_model()
# to track our progress
smoothed_reward = mdl.util.LossHistory(smoothing_factor=0.9)
plotter = mdl.util.PeriodicPlotter(sec=2, xlabel='Iterations', ylabel='Rewards')
if hasattr(tqdm, '_instances'): tqdm._instances.clear()  # clear if it exists
for i_episode in range(500):
    plotter.plot(smoothed_reward.get())
    # Restart the environment
    observation = env.reset()
    memory.clear()
    while True:
        # using our observation, choose an action and take it in the environment
        action = choose_action(cartpole_model, observation)
        next_observation, reward, done, info = env.step(action)
        # add to memory
        memory.add_to_memory(observation, action, reward)
        # is the episode over? did you crash or do so well that you're done?
        if done:
            # determine total reward and keep a record of this
            total_reward = sum(memory.rewards)
            smoothed_reward.append(total_reward)
            # initiate training: remember we don't know anything about how the
            # agent is doing until it has crashed!
            train_step(cartpole_model, optimizer,
                       observations=np.vstack(memory.observations),
                       actions=np.array(memory.actions),
                       discounted_rewards=discount_rewards(memory.rewards))
            # reset the memory
            memory.clear()
            break
        # update our observations
        observation = next_observation

The chart below summarizes the whole workflow:

Reinforcement Learning Workflow


Model Predictive Control (MPC) is widely known as an advanced process control method that is used to control a process while satisfying a set of constraints. In recent years it has also been used to control electrical and mechanical systems.

How does it work?

MPC uses a model of the system to make predictions about the system's future behavior. It solves an optimization problem online to find the optimal control action that drives the predicted output to the reference. MPC can handle multi-input multi-output systems that may have interactions between their inputs and outputs. It can also handle input and output constraints [3].

In a nutshell, MPC is an optimization method that re-solves a finite-horizon optimization problem at every time step: it optimizes a sequence of inputs over the horizon, applies only the first one, and then repeats the calculation from the newly measured state. It is called online because this optimization runs repeatedly while the system is operating. This differs from methods such as LQR, where the control gain is calculated only once before the process starts.
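The receding-horizon idea can be sketched on a toy problem. Below, the plant is an unstable scalar system, the input is constrained to three allowed values, and the optimizer is a brute-force search over all input sequences in the horizon. All of these are simplifying assumptions for illustration; a real MPC implementation (including the one used later in this post) would use a proper solver and the cart pole model.

```python
from itertools import product

A, B = 1.2, 1.0              # assumed unstable plant: x[k+1] = A*x[k] + B*u[k]
INPUTS = (-1.0, 0.0, 1.0)    # input constraint: only these values are allowed
HORIZON = 5                  # number of steps predicted ahead


def predicted_cost(x, inputs, q=1.0, r=0.1):
    """Simulate the model over the horizon and add up the quadratic cost."""
    cost = 0.0
    for u in inputs:
        x = A * x + B * u
        cost += q * x * x + r * u * u
    return cost


def mpc_action(x):
    """Find the best input sequence for the horizon, return only its first input."""
    best = min(product(INPUTS, repeat=HORIZON),
               key=lambda seq: predicted_cost(x, seq))
    return best[0]


# Closed loop: re-optimize at every step from the newly "measured" state.
x = 3.0
history = [x]
for _ in range(15):
    u = mpc_action(x)
    x = A * x + B * u
    history.append(x)
print([round(v, 3) for v in history])
```

Note that the optimization is repeated at every step even though only one input is applied each time; this is what lets MPC react to disturbances and respect the input constraint.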

So here are the steps to implement MPC with OpenAI Gym in Python [2]:

Step1: Determine the mathematical model of cart pole

Cart pole mathematical model

There are actually many cart pole mathematical models that we can use. Which one applies depends on the assumptions made; for example, some models ignore the moment of inertia of the pole. So we need to choose the model wisely.

Other cart poles mathematical model in state-space
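As one concrete example of such a model, the sketch below implements a commonly used frictionless cart pole model with a uniform pole, integrated with a simple Euler step. To my knowledge this matches the dynamics and default parameters of Gym's CartPole environment, but treat the parameter values and the function name as assumptions; it is not necessarily the model used by the MPC code in [2].

```python
import math

GRAVITY = 9.8       # m/s^2
MASS_CART = 1.0     # kg
MASS_POLE = 0.1     # kg
HALF_LENGTH = 0.5   # m, half of the pole length
DT = 0.02           # s, integration step


def cartpole_step(state, force):
    """One Euler step of the cart pole; theta = 0 means the pole is upright."""
    x, x_dot, theta, theta_dot = state
    total_mass = MASS_CART + MASS_POLE
    pole_mass_length = MASS_POLE * HALF_LENGTH
    sin_t, cos_t = math.sin(theta), math.cos(theta)

    temp = (force + pole_mass_length * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / total_mass))
    x_acc = temp - pole_mass_length * theta_acc * cos_t / total_mass

    return (x + DT * x_dot,
            x_dot + DT * x_acc,
            theta + DT * theta_dot,
            theta_dot + DT * theta_acc)


# Without any control force, a small initial tilt keeps growing.
state = (0.0, 0.0, 0.05, 0.0)
for _ in range(50):
    state = cartpole_step(state, force=0.0)
print(f"pole angle after 1 s without control: {state[2]:.2f} rad")
```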

Step2: Determine the cost function

Quadratic cost function
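A quadratic cost typically penalizes the squared state error and the squared control effort, summed over the prediction horizon. Here is a minimal sketch with a diagonal state weight; the weights and the example numbers are assumptions of mine, not the values used in [2].

```python
def quadratic_stage_cost(state_error, control, q_diag, r):
    """Cost of one step: sum_i q_i * e_i^2 + r * u^2."""
    state_term = sum(q * e * e for q, e in zip(q_diag, state_error))
    return state_term + r * control * control


def trajectory_cost(errors, controls, q_diag, r):
    """Total cost over a predicted trajectory (one error and one input per step)."""
    return sum(quadratic_stage_cost(e, u, q_diag, r)
               for e, u in zip(errors, controls))


# State error ordered as (x, x_dot, theta, theta_dot).
# Assumed weights: the pole angle error is penalized 10x more than the cart position.
q_diag = (1.0, 0.0, 10.0, 0.0)
r = 0.01
print(quadratic_stage_cost((0.5, 0.0, 0.1, 0.0), 2.0, q_diag, r))
```

Raising a weight in `q_diag` tells the optimizer to care more about that state; raising `r` makes it more reluctant to use large forces.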

Step3: Call the libraries

import matplotlib.animation as animation
import numpy as np
import gym
import mitdeeplearning as mdl
from mpc import MPC  # MPC class from the code referenced in [2], assumed to be saved locally as mpc.py

Step4: Run the MPC models

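env = gym.make('CartPole-v0')

start_theta = 0
mpc = MPC(0.5, 0, start_theta, 0)
action = 0
for i_episode in range(1):
    observation = env.reset()
    for t in range(500):
        observation, reward, done, info = env.step(action)
        # pass the (offset) observation to the MPC and get the desired force
        a = mpc.update(observation[0] + 0.5, observation[1], observation[2] + np.pi, observation[3])
        # the magnitude of the force becomes the push strength for the next step
        env.env.force_mag = abs(a)
        # the sign of the force picks the direction: 0 pushes left, 1 pushes right
        if a < 0:
            action = 0
        else:
            action = 1
        if done:
            # stop stepping once the environment reports that the episode is over
            break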
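One detail of the loop above deserves a note: the cart pole environment only accepts a discrete action (push left or push right), while MPC computes a signed, continuous force. The loop works around this by using the sign of the force to pick the action and its magnitude to set the push strength. A standalone sketch of that mapping, with a hypothetical helper name:

```python
def force_to_gym_action(force):
    """Split a signed force into (discrete action, push magnitude).

    Action 0 pushes the cart to the left, action 1 pushes it to the right.
    """
    action = 0 if force < 0 else 1
    return action, abs(force)


print(force_to_gym_action(-3.5))  # (0, 3.5)
print(force_to_gym_action(2.0))   # (1, 2.0)
```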


Simulation result

From the results above, given the objective of keeping the pole upright with a maximum deviation of 15 degrees, we can see that RL gives qualitatively more satisfying results than MPC. RL succeeds in keeping the pole upright from the beginning, whereas MPC fails at the beginning, even though it manages to keep the pole upright after that.

This result is surprising, because MPC takes a lot of effort: we need a mathematical model of the system, which tends to be a pain point for control engineers. Of course, we might overcome the problem by choosing another mathematical model or different MPC parameters. But the most important point here is that RL gives better results even though we do not know the mathematical model of the cart pole: using a policy-based strategy, the algorithm succeeds in controlling the cart to keep the pole upright.


Still, coming from a control background, I cannot say that RL beats MPC so completely that we will not need control theory in the future. The model of the system remains important, because with it we can check the stability of the system using Lyapunov's theorem [4]. If you can guarantee the stability of the system, then you can be confident that your controller will not fail when unexpected things happen, as long as the assumptions behind the model hold.

Even though RL gives a satisfactory result here, it has a limitation: we cannot prove that the RL algorithm will succeed until we simulate it. The good news is that research on interpretable machine learning is under way. Hopefully, this will answer the concerns of skeptical, conservative control engineers about applying RL in real cases, especially in industrial settings, which are very strict about safety [5].


[1] The code is adapted from MIT Deep Learning Bootcamp:

[2] The code is adapted from Philip Zucker’s blog:




Analytics Vidhya
