Verifiable Program Triggered Deep Reinforcement Learning Agents

Briti Gangopadhyay
Published in CodeX · 8 min read · Aug 30, 2021
The Hierarchical Program-Triggered Reinforcement Learning (HPRL) framework. The right side of the figure depicts the asynchronous training of RL agents, each with its own observation (oᵢ), action space (aᵢ), and reward signal (rᵢ). The left side depicts the program-controlled execution of a driving task, where the program checks functional safety and triggers RL agents according to the route plan generated by the mission planner or the control strategy recommended by the safety module.

Deep Reinforcement Learning (DRL) is a popular technique that uses neural networks to map states to actions for complex tasks such that a reward signal is maximized. Given their superhuman performance in game-playing tasks, such as AlphaGo mastering Go [1], these algorithms are being extensively studied for learning policies that can autonomously drive intelligent vehicles [2, 3]. The use of DRL agents in autonomous driving leads to a smooth, human-like driving experience, but DRL networks have two major drawbacks:

  1. They suffer from a cold start: the initial policies during learning are far from optimal, and the network requires a large number of training samples to converge to a reasonable policy.
  2. Functional safety, a vital requirement for the certification of safety-critical systems, cannot be guaranteed, due to the non-interpretable and opaque nature of neural networks.

We address each of the aforementioned problems in our Hierarchical Program-Triggered Reinforcement Learning (HPRL) framework.

“If you’re overwhelmed by the size of a problem, break it down into smaller pieces” — Chuck Close

Human beings often use a delicate combination of traffic/road rules and experience while driving. Human driving policies are semantic rather than geometric and numerical: for example, “drive straight till the junction and turn right” rather than “drive 50 metres at the current speed and steer 90°”. These semantic instructions have sub-goals that can be organised into hierarchical abstractions, and the sub-goals are heavily regulated by symbolic rules applied at various points in time and space (for example, stop at the red light). Motivated by this, we break the driving task into a set of basic driving manoeuvres: straight drive, right/left turn, and right/left lane change. We asynchronously train an RL agent for each of these manoeuvres with manoeuvre-specific actions, observation space, and reward signal, which makes the agents more sample-efficient than a single flat DRL policy. At execution time, the agents are triggered by a structured program to complete a driving task. Functional safety requirements are incorporated as embedded assertions in the program, which facilitates a formal proof of the correctness of the driving strategy. We now take a closer look at the formulation of one of the DRL agents, the one responsible for the straight-drive manoeuvre.
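
To make the triggering idea concrete, here is a minimal Python sketch of how a route plan of semantic sub-goals could dispatch to pre-trained manoeuvre agents. All names here (drive_route, Manoeuvre, agents) are illustrative assumptions rather than the framework's actual API; the safety checks described later would sit between the program and each agent.

from typing import Callable, Dict, List

# A manoeuvre label selects one of the pre-trained, manoeuvre-specific agents,
# e.g. "straight", "left_turn", "right_turn", "left_lane_change", "right_lane_change".
Manoeuvre = str

def drive_route(route_plan: List[Manoeuvre],
                agents: Dict[Manoeuvre, Callable[[], None]]) -> None:
    for sub_goal in route_plan:
        # Trigger the pre-trained RL policy for the current sub-goal.
        agents[sub_goal]()

# Example route plan: "drive straight till the junction and turn right".
# drive_route(["straight", "right_turn"], agents)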

Straight Driving Agent: The objective of this agent is to track speed, maintain its lane, and sustain a safe distance from leading vehicles. The observation space for the straight-driving agent is (δx_t, δx_c, v_ego, δx_lon), where δx_t is the L2 distance to the sub-task target way-point, δx_c is the distance from the centre of the lane, v_ego is the current speed of the ego vehicle, and δx_lon is the distance to the leading vehicle. The actions control acceleration, deceleration, and restricted steering. The reward function combines terms that reward speed tracking, lane keeping, and maintaining a safe distance from the leading vehicle.
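
As an illustration, the observation vector and a reward of this general shape could be sketched as follows. The individual reward terms, weights, target speed, and safe gap below are assumptions for illustration only, not the exact values used in the paper.

import numpy as np

def observation(dx_target: float, dx_centre: float,
                v_ego: float, dx_lead: float) -> np.ndarray:
    # (δx_t, δx_c, v_ego, δx_lon) packed as a flat observation vector.
    return np.array([dx_target, dx_centre, v_ego, dx_lead], dtype=np.float32)

def reward(dx_target: float, dx_centre: float, v_ego: float, dx_lead: float,
           v_target: float = 8.0, safe_gap: float = 10.0) -> float:
    r_speed = -abs(v_ego - v_target)               # track the target speed
    r_lane = -abs(dx_centre)                       # stay close to the lane centre
    r_progress = -dx_target                        # move towards the sub-task way-point
    r_safe = -10.0 if dx_lead < safe_gap else 0.0  # penalise an unsafe following distance
    return r_speed + r_lane + 0.1 * r_progress + r_safe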

The sample efficiency of flat versus hierarchical DRL, and the reward obtained by each DRL agent, are shown in the plots below:

(a) Comparison between individual agents learning separate policies, a hierarchical DQN choosing between two policies, and a flat DQN learning the overall policy for the task of turning right at an intersection after a straight drive. (b) Average reward per step for all manoeuvre-specific DQN models.

The Structured Program P and Embedded Assertions

The role of the structured program P is to take a sequence of sub-routes for a driving task as input and to invoke a suitable (pre-trained) RL agent for executing each sub-route. Besides acting as the sequencer and trigger for the RL agents, program P also acts as a safety shield built from safety specifications gleaned from well-known driving rules. These specifications are represented in Linear Temporal Logic (LTL) [4]. We specifically use the following LTL operators to represent a safety property φ: the Always operator ▢φ, specifying that φ should hold at all time steps; the Next operator 〇φ, specifying that φ should hold at the next time step; and the Release operator φ₁ R φ₂, specifying that φ₂ should hold up to and including the time when φ₁ becomes true. We elucidate this by encoding the requirement: the ego vehicle shall not make any longitudinal movement if its distance to the lead vehicle falls below a specified safety threshold. To track the safe distance between the ego vehicle (the vehicle driven by the HPRL framework) and the lead vehicle, we use the proposition Dlon, which is true when the distance between the ego vehicle and the lead vehicle is safe. If the distance is safe at time t and becomes unsafe at time t + 1, then from t + 1 the ego vehicle should brake, come to a stop, and stay stopped until the stop manoeuvre estop is released by the distance becoming safe again at some future time. The requirement applies at all times, which is expressed using the Always operator ▢. The requirement is encoded as the following property φ:

▢((Dlon ∧ 〇¬Dlon) → 〇(Dlon R estop))

This simply translates to: always (safe distance, and if the distance is not safe at the next time step, then from the next time step stay stopped until released by the distance becoming safe). We encode five such assertions inside the check_functional_safety() function, and they are checked before triggering the neural-network agent for the given route option. We then formally verify the correctness of the structured program using the program verification tool Nagini [5], an automated verifier for Python capable of formally proving the validity of embedded assertions over all possible runs of the program.
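
The LTL semantics of φ can be made concrete with a small finite-trace check. The sketch below is purely illustrative and is not part of the paper's verification flow (which uses Nagini's static verifier); it simply evaluates the property over example traces of the propositions Dlon and estop.

# phi = ▢((Dlon ∧ 〇¬Dlon) → 〇(Dlon R estop)) evaluated over a finite trace
# of {"Dlon": ..., "estop": ...} observations, one entry per time step.

def release_holds(trace, t, releaser, held):
    # φ1 R φ2 at position t: φ2 holds up to and including the first step
    # where φ1 holds; on a finite trace, φ2 holding throughout also counts.
    for i in range(t, len(trace)):
        if not held(trace[i]):
            return False
        if releaser(trace[i]):
            return True
    return True

def phi_holds(trace):
    for t in range(len(trace) - 1):
        if trace[t]["Dlon"] and not trace[t + 1]["Dlon"]:
            if not release_holds(trace, t + 1,
                                 releaser=lambda s: s["Dlon"],
                                 held=lambda s: s["estop"]):
                return False
    return True

# Distance becomes unsafe at t = 1, so estop must hold until it is safe again.
good = [{"Dlon": True, "estop": False}, {"Dlon": False, "estop": True},
        {"Dlon": False, "estop": True}, {"Dlon": True, "estop": True}]
bad = [{"Dlon": True, "estop": False}, {"Dlon": False, "estop": False}]
print(phi_holds(good), phi_holds(bad))  # True False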

In order to formally prove that the structured program guarantees the property φ, the following assertions are embedded in the structured program and verified using Nagini:

1. Assertion to guarantee that the longitudinal safety hazard is correctly captured: The function check_functional_safety() in Algorithm 1 invokes the method is_vehicle_hazard(…), which is responsible for checking whether the longitudinal safety constraint with leading vehicles is violated. This method checks whether the L2 distance between the leading vehicle and the ego vehicle is hazardous and, if so, returns True. In order to formally prove that is_vehicle_hazard(…) returns True when the L2 distance is less than the specified safe distance, the following assertion is embedded:

Assert(Implies(L2_distance <= safe_distance, Result() == True))

The return value of this method is stored in the flag variable long_hazard_detected. This assertion therefore guarantees that the flag long_hazard_detected is set when the antecedent of φ, namely Dlon ∧ 〇¬Dlon, becomes true. (A sketch of how these contracts might be written together is shown after assertion 3 below.)

2. Assertion to guarantee safe override: In any execution of the structured code, if the flag long_hazard_detected is found to be set, then the structured program must override the RL agent and set brake = 1.0, steering = 0.0 and throttle = 0.0 (the high-level actuation model of the autonomous driving simulator CARLA [6]). The assertion embedded in the function check_functional_safety() is therefore:

Assert(Implies((long_hazard_detected and not lane_change),
               vehicle.brake == 1.0 and
               vehicle.steer == 0.0 and
               vehicle.throttle == 0.0))

This assertion accounts for the (Dlon ∧ 〇¬Dlon) → 〇estop part of the temporal property φ.

3. Assertion to guarantee the release of the override: If the structured program detects no longitudinal hazard, then it must allow check_functional_safety() to return the control suggested by the RL controller, namely the output of the function neural_control(). This is expressed by the following embedded assertion:

Assert(Implies(not long_hazard_detected, control == neural_control()))
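
The sketch below, referenced in assertion 1, shows how the three assertions might sit together in a Nagini-annotated check_functional_safety(). All class, function, and parameter names here are assumptions for illustration; the hazard guarantee is phrased as a postcondition (Ensures over Result()), which is how Nagini typically expresses guarantees on return values, and VehicleControl is a simple stand-in for CARLA's actuation interface.

from nagini_contracts.contracts import Assert, Ensures, Implies, Result

class VehicleControl:
    # Stand-in for CARLA's high-level actuation model.
    def __init__(self, throttle: float, steer: float, brake: float) -> None:
        self.throttle = throttle
        self.steer = steer
        self.brake = brake

def is_vehicle_hazard(l2_distance: float, safe_distance: float) -> bool:
    # Assertion 1: a hazard is reported whenever the gap to the lead vehicle
    # is at or below the safe threshold (i.e. whenever ¬Dlon holds).
    Ensures(Implies(l2_distance <= safe_distance, Result() == True))
    return l2_distance <= safe_distance

def check_functional_safety(l2_distance: float, safe_distance: float,
                            lane_change: bool,
                            rl_control: VehicleControl) -> VehicleControl:
    long_hazard_detected = is_vehicle_hazard(l2_distance, safe_distance)
    if long_hazard_detected and not lane_change:
        # Assertion 2: safe override -- full brake, no steering, no throttle.
        control = VehicleControl(throttle=0.0, steer=0.0, brake=1.0)
        Assert(Implies(long_hazard_detected and not lane_change,
                       control.brake == 1.0 and
                       control.steer == 0.0 and
                       control.throttle == 0.0))
    else:
        # Assertion 3: release of the override -- return the RL agent's control.
        control = rl_control
        Assert(Implies(not long_hazard_detected, control is rl_control))
    return control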

In our example, if the structured program satisfies all the assertions outlined above, then we have proven the following:

a. The structured program correctly sets the flag, long_hazard_detected, when the distance to the lead vehicle falls below the safety margin.

b. The structured program overrides the RL agent to guarantee that the brake is actuated when long_hazard_detected is True.

c. The structured program releases the override (namely, the braking) when long_hazard_detected is no longer true. Together with the previous assertions, this guarantees the target safety property: ▢((Dlon ∧ 〇¬Dlon) → 〇(Dlon R estop)).

The HPRL framework has been tested on several benchmark driving scenarios in CARLA [6], an open-source simulator for dynamic urban driving. The video below shows an autonomous vehicle driving safely while following traffic rules under the HPRL framework. More details on this work can be found in the published version [7].

Conclusion

There is growing consensus on the use of RL-based controllers for imitating human-like behaviour in autonomous driving tasks, but it is, in general, too ambitious to expect a single RL agent to learn all complex manoeuvres. The HPRL framework provides a platform for training several RL agents to learn different manoeuvres and for orchestrating their use with a supervisory program. The supervisory program is formally verifiable and can therefore provably override the action chosen by the RL model when mandated by the specified formal safety properties.

References

[1] D. Silver et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.

[2] E. Perot, M. Jaritz, M. Toromanoff, and R. De Charette, “End-to-end driving in a realistic racing game with deep reinforcement learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 474–475.

[3] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep reinforcement learning framework for autonomous driving,” Electron. Imag., vol. 2017, no. 19, pp. 70–76, 2017.

[4] A. Pnueli, “The temporal logic of programs,” in Proc. 18th Annu. Symp. Found. Comput. Sci. (SFCS), Sep. 1977, pp. 46–57.

[5] M. Eilers and P. Müller, “Nagini: A static verifier for Python,” in Computer Aided Verification (Lecture Notes in Computer Science, vol. 10981), Springer, 2018, pp. 596–603.

[6] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proc. 1st Annu. Conf. Robot Learn., 2017, pp. 1–16.

[7] B. Gangopadhyay, H. Soora, and P. Dasgupta, “Hierarchical program-triggered reinforcement learning agents for automated driving,” IEEE Trans. Intell. Transp. Syst., vol. PP, no. 99, pp. 1–10, Jul. 2021.
