State-wise Safe Reinforcement Learning: A Survey

Two classes, existing methods, and some trade-offs

Weiye Zhao
Towards NeSy
9 min read · Mar 22, 2023


In this blog, we would like to share an interesting survey paper that discusses recent advances in the field of safe reinforcement learning. Although RL has demonstrated impressive performance in many continuous control tasks, one of the major obstacles to applying RL in real-world applications is providing RL with a state-wise safety guarantee. Enforcing state-wise constraints is essential in many challenging tasks such as autonomous driving and robot manipulation. This blog summarizes the existing state-wise safe RL works.

If you’re interested in the detailed survey, please check out the full paper directly.


Introduction

State-wise safe RL?

Reinforcement learning (RL) has made impressive strides in games and control tasks, but the lack of safety assurance is a major obstacle to its application in real-world scenarios. RL agents prioritize maximizing reward over ensuring safety, which can lead to unsafe or even catastrophic outcomes. Safe RL approaches have emerged to provide safety guarantees during or after training, but early attempts have mainly focused on cumulative or probabilistic constraints. However, many critical constraints in real-world applications are instantaneous and deterministic, depending only on the current state of the system. State-wise safe RL is therefore a challenging but crucial step toward applying RL to real-world scenarios, and the RL community should devote more attention and effort to it.

Math Formulation

Then we can dive into the more mathematical formulation of this problem. I will try to keep the explanation simple and avoid overusing equations. We will discuss three parts:

  1. What is Markov Decision Process (MDP)?
  2. What is Constrained Markov Decision Process (CMDP)?
  3. What is State-wise Constrained Markov Decision Process (SCMDP)?

Markov Decision Process (MDP) is a mathematical framework used to model decision-making problems where outcomes are partly random and partly under the control of a decision-maker. In an MDP, an agent makes decisions in a sequence of states, where each state represents a situation in which the agent can take a specific action. The result of an action in a state is determined by a probability distribution, and the agent receives a reward based on the outcome. The key assumption of an MDP is the Markov property, which states that the probability distribution of the next state and reward depends only on the current state and action, and not on the past history of the process. MDPs can be used to solve various decision-making problems, such as robotic control, finance, and healthcare. The goal of an MDP is to find a policy, which is a mapping from states to actions that maximizes the expected sum of rewards over time. Here is an example:
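Since the example figure is not reproduced here, the standard MDP objective it illustrates can be written as

$$\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big],$$

where $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory induced by the policy $\pi$, $r$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor.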

Here I assume readers have basic knowledge of RL concepts, such as the discount factor, the reward function, and state-action transition pairs.

Constrained Markov Decision Process (CMDP) is a variant of the standard Markov Decision Process (MDP) that incorporates constraints into the decision-making problem. In a CMDP, the goal is to find a policy that maximizes the expected sum of rewards subject to a set of constraints. Solving a CMDP can be done with a variety of methods, such as Lagrangian relaxation, linear programming, or dynamic programming. Following is an example of the CMDP objective:
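Since the original objective is shown as an image, here is a reconstruction in one standard form, where $c_i$ denotes the $i$-th cost function and $d_i$ its budget:

$$\max_{\pi \in \Pi_C} \; \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big], \qquad \Pi_C = \Big\{\pi : \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\Big] \le d_i, \;\; i = 1, \dots, m\Big\}$$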

Here $\pi$ denotes the policy, and $\Pi_C$ denotes the set of policies that satisfy the cost constraints.

State-wise Constrained Markov Decision Process (SCMDP) is a special type of CMDP in which an allowable policy must satisfy the state-wise safety constraint at every time step. For example, in autonomous driving, the state-wise safety constraint requires the policy to output actions such that the vehicle and surrounding obstacles remain collision-free at every time step. Mathematically, the set of feasible SCMDP policies can be expressed as:
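Reconstructing the missing equation in one standard form (the notation in the survey may differ slightly), the feasible policy set is

$$\bar{\Pi}_C = \Big\{\pi : c_i(s_t, a_t) \le w_i, \;\; \forall t, \; i = 1, \dots, m\Big\},$$

where $w_i$ is the hard threshold on the $i$-th state-wise cost, and the SCMDP objective is to maximize the expected return over $\pi \in \bar{\Pi}_C$.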

At every time step, every state-wise cost should satisfy a hard constraint.

Existing Methods for Solving SCMDP

Two Classes

With the mathematical background in place, we can introduce the representative methods for solving the SCMDP. Existing methods can be divided into two categories: (i) achieving state-wise safety after policy convergence (After Convergence), and (ii) achieving state-wise safety during policy training (During Training).

After Convergence: this refers to methods whose policies may be infeasible during training but whose converged stationary policy is feasible. Essentially, these methods use a soft penalty, provided by a safety critic $Q_C(s,a)$, to guide policy training. The safety critic can be provided by humans or learned from the environment. This approach does not impose hard constraints on the exploration space, which allows for more efficient policy learning, but the policy may still be unsafe during training, and convergence to a safe policy may be challenging in some cases. This figure demonstrates the process:

state-wise safe RL policy after convergence.

During Training: this approach is stricter than achieving state-wise safety after convergence, because all intermediate policies during training must be feasible. Two classes of methods exist for ensuring state-wise safety during training.

The first class guarantees safety by projecting policies back into the safe policy set whenever they leave it. However, this approach requires a significant amount of prior knowledge about the environment and may be more conservative in exploration, as illustrated below:

Project policies into safe policies set

The second class of methods progressively explores the policy space starting from a safe initial policy, maintaining an up-to-date inner approximation of the safe policy set as information is gathered. Policy updates are only made when there is enough information to confirm that the updated policy still lies within the safe policy set. This approach requires less prior knowledge than the first class but is more conservative in exploration. It is illustrated below:

state-wise safe policy via progressive safe exploration

Next, I will introduce one representative method for each category.

State-wise Safety After Convergence

One typical approach is to use a hierarchical policy to enforce state-wise safety constraints. The upper layer of the policy is a reinforcement learning policy, while the lower layer is a safety monitor that filters actions to ensure the policy satisfies the safety constraints. To do this, a safety critic $Q_C(s,a)$ is introduced that estimates the chance of future safety violations. Conservative Safety Critics (CSC) is an algorithm that trains both $Q_C$ and the policy so that $Q_C$ provides a conservative estimate of the likelihood of being unsafe given a state-action pair. If the estimated safety violation exceeds a predefined threshold, new actions are re-sampled from the policy until the safety critic agrees the proposed action is safe. Mathematically,
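reconstructing the condition in a standard form (the original post shows the equation as an image), an action $a \sim \pi(\cdot \mid s)$ is accepted only if

$$Q_C(s, a) \le \epsilon,$$

where $\epsilon$ is the predefined safety-violation threshold; otherwise a new action is sampled from $\pi$.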

As the policy converges, the safety critic becomes well trained, so state-wise safety is guaranteed as long as we sample an action the critic agrees is safe.
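For intuition, here is a minimal sketch of such an action filter in Python; the `policy` and `q_c` callables, the threshold, and the sampling budget are illustrative assumptions, not the CSC implementation:

    import numpy as np

    def filtered_action(policy, q_c, state, threshold=0.1, max_tries=100):
        """Re-sample actions from the policy until the learned safety critic
        q_c(s, a) estimates the violation chance to be below the threshold;
        otherwise fall back to the least-unsafe candidate seen."""
        best_action, best_score = None, np.inf
        for _ in range(max_tries):
            action = policy.sample(state)   # candidate from the task policy
            score = q_c(state, action)      # estimated chance of future violation
            if score <= threshold:          # critic agrees: accept the action
                return action
            if score < best_score:          # remember the least-unsafe candidate
                best_action, best_score = action, score
        return best_action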

State-wise Safety During Training (Projection)

Although they may seem similar to the approach introduced in the previous section, During Training approaches satisfy state-wise safety constraints by correcting actions at each step, projecting them onto a feasible action set in which the cost constraints are satisfied. This again involves constructing a hierarchical safe agent, with an upper layer that generates reference actions and a lower layer that performs the projection. The solution is also a hierarchical-structured policy, with the difference that the lower layer usually has prior knowledge about the cost function (via assumptions on the system dynamics) and is therefore more capable of providing a state-wise safety guarantee during policy training than the hierarchical policy discussed in the previous section.

One good approach is the Implicit Safe Set Algorithm (ISSA), a hierarchical-structured safe policy that consists of an upper layer and a lower layer. The upper layer uses the PPO algorithm to generate task-oriented reference actions, while the lower layer filters out unsafe control actions by projecting them onto a set of safe controls. The policy is constructed using an energy-function-based safety index, which ensures that the state-wise safe policies are exactly those that keep the safety index non-positive at all time steps. ISSA does not require explicit knowledge of the system dynamics, only the kinematic limits, which makes it scalable. To ensure a state-wise safety guarantee with a black-box dynamics model, ISSA solves a constrained safe-control optimization problem at every time step via multi-directional line search. However, the line search requires one-step simulation through a digital twin, which is a pretty strong assumption.
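To make the projection idea concrete, here is a heavily simplified, sampling-based sketch (not ISSA's actual multi-directional line search); the black-box one-step simulator `step`, the safety index `phi`, and the action bounds are assumptions for illustration:

    import numpy as np

    def project_action(ref_action, state, step, phi, action_low, action_high,
                       num_candidates=256, rng=None):
        """Project a reference action onto an (approximate) safe action set.
        step(s, a) returns the next state from a black-box simulator (digital twin);
        phi(s) is a safety index, with phi(s) <= 0 meaning the state is safe.
        Among sampled candidates whose one-step successor keeps phi non-positive,
        return the one closest to the reference action."""
        rng = rng or np.random.default_rng()
        if phi(step(state, ref_action)) <= 0:       # reference action already safe
            return ref_action
        candidates = rng.uniform(action_low, action_high,
                                 size=(num_candidates, len(ref_action)))
        safe = [a for a in candidates if phi(step(state, a)) <= 0]
        if not safe:                                # no safe candidate found;
            return ref_action                       # a real method needs a fallback
        dists = [np.linalg.norm(a - ref_action) for a in safe]
        return safe[int(np.argmin(dists))]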

State-wise Safety During Training (Progressive)

It is possible to construct state-wise safe reinforcement learning agents without prior knowledge of the environment. One representative method, proposed by Berkenkamp et al., learns a single policy that generates actions with high-probability safety guarantees at every step during training. This is achieved by defining safe policies as those that keep the system within a subset of the state space known as the region of attraction (ROA), and by learning the unknown system dynamics through Gaussian process modeling. The algorithm iteratively updates the policy and the dynamics model while taking only safe actions. Policy optimization with ROA estimation is formulated as an optimization problem that maximizes the size of the ROA while ensuring that the policy decreases a Lyapunov energy function within the ROA. This approach achieves full exploration within the reachable parts of the state space and is demonstrated on an inverted pendulum simulation without ever letting the pendulum fall. However, the method has some drawbacks, including the difficulty of obtaining a proper Lyapunov function for defining the ROA and the possibility of non-stationary system dynamics.
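For concreteness, the safety condition behind the ROA construction can be written in a standard form (the exact statement in Berkenkamp's paper additionally involves confidence bounds from the Gaussian process model): the Lyapunov function $v$ must decrease along the closed-loop dynamics $f$ inside the ROA,

$$v\big(f(s, \pi(s))\big) - v(s) < 0, \quad \forall s \in \mathcal{S}_{\mathrm{ROA}},$$

so that every state in the ROA is driven back toward the safe equilibrium and the system never leaves the ROA.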

Some End Notes on Trade-offs

So far we have introduced three different categories of methods for solving state-wise safe reinforcement learning, and we are now ready to draw some conclusions about the trade-offs of current methods. There are two major trade-offs.

The first trade-off is between guarantee and scalability: strong assumptions can lead to solid safety guarantees, but only for a limited class of systems, while weaker assumptions make it harder to derive theoretical guarantees for complicated dynamics. In return, weaker assumptions lead to better scalability across different dynamical systems.

The second trade-off is between asymptotic and in-training safety: asymptotic safety may be acceptable when the agent's actions have a low probability of causing harm in the early stages of training, with the risk increasing only as the agent becomes more proficient. Strict safety from the beginning of training can be more difficult to implement, as it often requires prior knowledge about the environment or task. The best approach depends on the application and the level of safety required.

About author

Weiye Zhao is a fourth-year Ph.D. candidate at the Robotics Institute of Carnegie Mellon University, with a primary research focus on providing provable safety guarantees for robots. Specifically, he is interested in constrained reinforcement learning, nonconvex optimization, and safe control. He has published in numerous renowned venues, including ACC, CDC, AAAI, CoRL, L4DC, RCIM, LCSS, and others. Additionally, he serves on the program committee for several conferences, including IV, AAAI, CoRL, CDC, ACC, L4DC, and more.

