Reinforcement Learning-based Strategies for Traffic Light Control with an Emphasis on Cyclic Patterns

Zhentao Fan
Published in AI4SM · Oct 24, 2023

By Zhentao Fan, Xueqi Duan, and Hexuan Zhang as part of the course project of ECE1724H: Bio-inspired Algorithms for Smart Mobility. Dr. Alaa Khamis, University of Toronto, 2023.

Generated by DALL-E 3

Background & Significance:

With the process of urbanization, efficient transportation systems have become indispensable to the functioning of modern cities. At the heart of these systems lies the complicated coordination of vehicles, pedestrians, and cyclists, all managed by a seemingly simple yet crucial component: the traffic light. It plays a significant role in ensuring the smooth flow of traffic, reducing congestion, minimizing travel time, and most importantly, ensuring the safety of all road users.

Figure 1. Traffic accidents in Toronto, 2007–2017

With the advent of autonomous vehicles and smart city initiatives, the traditional paradigms of traffic light control are being challenged. The integration of advanced technologies and data-driven approaches offers the potential to revolutionize how traffic lights respond to dynamic road conditions in real-time. Utilizing these advancements can lead to more resilient, adaptive, and intelligent transportation systems that cater to the evolving needs of modern urban environments.

Description & Challenges:

In this article, we will investigate traffic light control algorithms that optimize traffic flow, most of which are Reinforcement Learning-based. The agents are traffic lights, and their signals are the actions. Their policies are what we want to train and optimize. Various objective functions or evaluation metrics may be considered [3]:

  • Queue Length: This measures the total number of vehicles waiting in line within the road system.
  • Travel Time: This refers to the duration between a car’s entry and exit from the system. A prevalent aim is to reduce the average time vehicles spend in the network.
  • Throughput: This represents the count of vehicles that have successfully navigated through the road network within a specific timeframe.

Similarly, the reward can also have different setups. For example, it can be the number of vehicles stopped at each intersection or the sum of the vehicle waiting time at each lane. The formal mathematical definition and modeling will be presented in the next sections.
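As a rough illustration (a sketch only, not the exact reward used later; the lane-count input format is an assumption for this example), a queue-length reward for a single intersection could be computed like this:

# Hypothetical sketch: queue-length reward for one intersection.
# `waiting_counts` is assumed to map each lane id to the number of
# vehicles currently stopped in that lane (not any simulator's real API).
def queue_length_reward(waiting_counts):
    # Negative total queue length: shorter queues give a higher reward.
    return -float(sum(waiting_counts.values()))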

Challenges:

One challenge is ensuring the safety of reinforcement learning agents in real-world settings. Since RL techniques rely on trial and error, mistakes during the learning phase could have severe or even deadly consequences in real-world scenarios, such as traffic signal failures causing accidents. Thus, integrating risk management into RL is vital to mitigate undesirable outcomes both during and after the RL training phase. One possible cause of this risk is the distribution shift of the state-action space between the simulation environment and the real world, which also leads to a gap between actual performance and simulated performance. For example, to accelerate RL training, a simulation may simplify the model and assume all vehicles are i.i.d. in their configurations, so that all vehicles share the same maximum speed and the same default gap and acceleration [11].

Interests & Possible Breakthrough:

One specific topic that we are interested in is cyclic actions, as it has not been heavily discussed before; few traditional RL-based methods focus on it. So why is it important?

Cyclic Signal Patterns:

In urban traffic management, the implementation of cyclic patterns in traffic signals plays a pivotal role in reflecting realistic traffic flows and enhancing safety. Unlike non-cyclic or dynamic patterns, which may adjust signal phases based on real-time traffic conditions, cyclic patterns follow a fixed sequence of light changes. For example, A → B → C → D → A →…… This predictability is crucial in busy urban intersections where the complexity and volume of both vehicular and pedestrian traffic demand a structured approach to ensure smooth and safe movement.

Figure 2. From the DataLight paper [12]

Advantages of Cyclic Traffic Signals

  1. Predictability and Order: Cyclic patterns provide a predictable and orderly switch between phases, reducing the likelihood of confusion among drivers and pedestrians. This predictability is essential for safety, as road users are aware of the expected signal changes and can act accordingly.
  2. Pedestrian-Friendly Approach: Cyclic signals often include dedicated pedestrian crossing phases, ensuring that pedestrians have sufficient and regular opportunities to cross roads safely. This aspect is particularly important in urban settings, where pedestrian traffic is as significant as vehicular traffic.
  3. Enhanced Safety: Consistency in signal patterns reduces the risk of accidents caused by sudden or unexpected signal changes. Drivers and pedestrians familiar with the signal sequence at a regular intersection can navigate it with greater confidence and reduced risk of error.

The General Formulation & Modeling:

As described in the above section, we suppose the traffic signals of an intersection are controlled by one agent.

Multi-intersections with Multi-agents:

For the multi-agent problem, we still model it as an MDP, formulated as the tuple ⟨S, O, A, P, R, π, γ⟩.

For multi-agent cases, the state space is larger, and we assume that each agent can only observe part of the system [1].

  • S: the system state space. It encompasses the entire state of the system. Given that there are N intersections within the system, each state s in S captures the comprehensive status of these intersections.
  • O: the observation space. It is the subset of the system state that an agent can perceive. For a specific agent i at time t, its observation is denoted as oᵗᵢ. In a typical implementation, it may include: 1. the current phase of the intersection, indicating which direction has the green light, usually expressed as a one-hot vector; 2. the count of vehicles present on each lane linked to the intersection (see the sketch after this list).
  • A: the action space. At each step t, aᵗ = (aᵗ₁, aᵗ₂, …, aᵗₙ), where aᵗᵢ denotes the action that agent i takes at time t.
  • P: the transition. P(sᵗ⁺¹| sᵗ, aᵗ ): S × A₁ × · · · × Aₙ → Ω(S), where Aᵢ denotes the action space of agent i and Ω(S) denotes the state distributions.
  • R: the reward system. The reward function: S × A₁ ×· · · × Aₙ → R. For some specific implementations, it is the negative value of the sum of the length of queues of all lanes of all agents.
  • γ: the discount factor. The agent aims to identify a policy that maximizes the expected cumulative reward. This cumulative return is calculated as Gᵗ = Σᵢ γⁱ rᵗ⁺ⁱ.
  • π: the policy. S→ A.
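As a rough sketch of what one agent's observation could look like in code (the function name, input format, and number of phases below are assumptions for illustration only):

import numpy as np

NUM_PHASES = 4  # assumed number of signal phases

def build_observation(current_phase, lane_vehicle_counts):
    # One-hot encoding of the current phase ...
    phase_one_hot = np.zeros(NUM_PHASES)
    phase_one_hot[current_phase] = 1.0
    # ... concatenated with the per-lane vehicle counts.
    counts = np.asarray(lane_vehicle_counts, dtype=float)
    return np.concatenate([phase_one_hot, counts])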

The process is to train the Q-function for each agent i and use the Qᵢ(θₙ) to approximate the total reward Gᵢᵗ.

Given these components, we also want to define two value functions:

  • Q-function: the action-value function Q^π(s, a) represents the expected return (cumulative discounted reward) of taking action a in state s and then following policy π: Q^π(s, a) = E_π[ Σₖ γᵏ rₜ₊ₖ | sₜ = s, aₜ = a ]. It essentially quantifies the value of performing a specific action in a given state.
  • V-function: the state-value function V^π(s) represents the expected return of being in state s and then following policy π: V^π(s) = E_π[ Σₖ γᵏ rₜ₊ₖ | sₜ = s ]. It measures the value of a state without specifying which action to take.

The relationship between these two functions is V^π(s) = E_{a∼π(·|s)}[ Q^π(s, a) ]; the value of a state is the expected action value under the policy.

Specifically, the goal is also equivalent to finding the optimal Q function Q*(s, a) or the optimal V function V*(s), which represents the maximum expected return achievable for any state-action pair or state.
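For completeness, the optimal value functions satisfy the standard Bellman optimality relations:

Q*(s, a) = E[ rₜ + γ · max_{a′} Q*(sₜ₊₁, a′) | sₜ = s, aₜ = a ],  V*(s) = max_a Q*(s, a)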

Categories of Existing Work in TSC:

In this article, instead of differentiating different categories of RL methods and algorithms based on their choice of rewards, action schemes, or policy learning, we categorize them based on the cooperation strategies among their agents.

Coordination: Coordination, as its name suggests, refers to the cooperation among traffic lights, which can enhance signal control in situations involving multiple intersections.

  • Hard-coded methods: The easiest way to achieve coordination is simply to set a fixed time offset between the starts of the green phases at different intersections. This does not require an RL formulation. However, this simple treatment does not give global optimality, as such pre-calculated offsets cannot adapt to dynamic traffic environments [1, 7].
  • Single Global Agent: As mentioned in previous sections, this is also called the centralized method that uses one single global agent to see the states of all intersections and to control all the intersections. The dimension of the action space for this agent would explode, as the joint action space grows exponentially with the number of intersections. Therefore one main disadvantage of this strategy is that it is not scalable.

With the recent progress in RL enhancing performance for standalone traffic signal control, there have been initiatives to develop strategies that coordinate multi-agent reinforcement learning (MARL) agents. Claus and Boutilier [2] divide MARL into two types: joint action learners and independent learners. This categorization can be further adapted to the traffic signal control problem here [5].

  • Joint action learners: Instead of computing max_a Q(s, a) or Q(o₁, …, oₙ, a), this method “factorizes the global Q-function as a linear combination of local subproblems” [5]. The objective then becomes max_{aᵢ,aⱼ} Σ_{i,j} Qᵢ,ⱼ(oᵢ, oⱼ, aᵢ, aⱼ), where i, j index neighboring agents. Some work suggests adding pre-defined weights [9], so that the objective becomes max_{aᵢ,aⱼ} Σ_{i,j} wᵢ,ⱼ Qᵢ,ⱼ(oᵢ, oⱼ, aᵢ, aⱼ).
  • Independent learners: This modeling method is how we define the problem in the previous section of the “multi-intersection with multi-agents” scenario. It uses independent RL (IRL) agents to control the traffic lights, and each agent is responsible for an intersection. Each agent trains its own policy without getting the reward values of other agents, which is different from joint action learning methods.

Independent RL (IRL) can be further divided into two categorizations:

  • Without communication: This modeling treats the control of each intersection as an individual task, so the objective becomes max_{aᵢ} Σᵢ Qᵢ(oᵢ, aᵢ). That is, the observation of each agent only includes its own local environment, and conflicts among agents cannot be resolved through any form of communication. In a uniform road network, this method can still exhibit coordination among agents; however, in a complicated and non-stationary network, the trained policies do not converge easily nor show a clear pattern of coordination.
  • With communication: Agents using this method can communicate with each other, share their observations, and behave as a group. The corresponding objective becomes max_{aᵢ} Σᵢ Qᵢ(Ω(oᵢ, Nᵢ), aᵢ), where Nᵢ is the neighborhood representation of intersection i and Ω(oᵢ, Nᵢ) is the function modeling local observations from neighborhoods. CoLight [1], one of these methods, uses Graph Attention Networks [6] to learn the dynamic interactions between the target agent and the hidden states of its neighboring agents. A small sketch contrasting the two variants follows this list.
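A minimal sketch of the contrast between the two IRL variants, assuming each agent i exposes a function q_i that maps an observation vector to Q-values over its actions (the names below are placeholders, not part of any cited implementation):

import numpy as np

def act_without_communication(q_i, local_obs):
    # The agent only sees its own local observation.
    return int(np.argmax(q_i(local_obs)))

def act_with_communication(q_i, local_obs, neighbor_obs_list):
    # The local observation is augmented with (a representation of) the
    # neighbors' observations, e.g. a learned aggregation as in CoLight.
    joint_obs = np.concatenate([local_obs, *neighbor_obs_list])
    return int(np.argmax(q_i(joint_obs)))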

The Baseline Method & Benchmark: DataLight [12]

As mentioned in the section “Interests & Possible Breakthrough”, we want to optimize strategies that output cyclic actions. Also, as discussed in the “Challenges” section, online training in the real world raises serious safety and ethical issues, while training in an idealized simulation may yield a policy that overfits the simulation environment. That is why we look at DataLight [12]: it uses offline training based on datasets collected from the real world, and it modifies the traditional action-choosing strategy so that the actions/phases are cyclic.

Key Features of DataLight

  1. Offline Reinforcement Learning: DataLight [12] operates on the principle of offline RL, learning effective control policies from pre-collected datasets. This approach negates the need for real-time environmental interaction, making it more practical and less risky for real-world traffic scenarios.
  2. Cyclical Offline Dataset (COD): The model is trained on a Cyclical Offline Dataset, designed to emulate common real-world traffic conditions. This dataset facilitates the ease of data collection and ensures the model’s training is grounded in realistic traffic scenarios.
  3. Arbitrary to Cyclical Transformation (ATC): A notable feature of DataLight is its ability to convert various RL-based methods from arbitrary phase sequences to a cyclical structure, aligning with the standard practice in most real-world intersections. This transformation ensures the model’s compatibility with prevalent traffic signal patterns.

The Network of DataLight:

The neural network that DataLight uses can be briefly summarized in four steps.

  • Lane feature embedding
  • Phase feature construction
  • Phase correlation
  • Phase score prediction

This can be visualized in the following diagram.

Credit to DataLight Paper [12]

As shown above, each phase will get a phase score. Since it is training a Q-network, the phase score here is just a Q-value for the corresponding state-action pair.

The Action-Choosing Strategy of DataLight:

The ATC strategy can transform any non-cyclic phases into a cyclic sequence. It can be briefly explained with the following code:

import numpy as np

# ATC: choose an action with a traditional non-cyclic RL method (0, 1, 2, or 3),
# then force the executed phase sequence to stay cyclic.
action_next = np.argmax([q1, q2, q3, q4])  # q-values obtained from the non-cyclic RL method
if action_next == action_prev:
    # The non-cyclic method wants to keep the current phase, so keep it.
    action_predict = action_next
else:
    # Otherwise advance one step in the fixed cycle A -> B -> C -> D -> A.
    action_predict = (action_prev + 1) % 4
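For example, suppose the current phase is 0 and the non-cyclic Q-values favour phase 2 (illustrative values only, reusing the variable names from the snippet above):

q1, q2, q3, q4 = 0.2, 0.1, 0.9, 0.3
action_prev = 0
# Running the ATC snippet above gives action_predict == 1: the controller
# advances one step in the cycle instead of jumping straight to phase 2.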

New Method: CyclicLight

Analysis for Possible Improvement from the DataLight:

If we look at the ATC transformation that DataLight proposes, it appears to be an ad-hoc method: the network itself is not cyclic, and the action-choosing strategy is rather naive.

To analyze the problem quantitatively and to see the possible room for optimization, and for convenience in the calculations, instead of evaluating the total travel time of all vehicles we evaluate the total waiting time of vehicles at each intersection. So the objective function would be:

min_π E_{S_init ∼ D} [ Σᵢ Σⱼ Tᵢ,ⱼ ]

Here, Tᵢ,ⱼ means the time interval that vehicle i waited at the jᵗʰ intersection, S_init refers to the initial state of the system, π is the given policy that the agents in the system need to follow, and D is the distribution over all initial states. That is, this expression evaluates the expected sum of waiting time given the policy and the initial state.

Due to constraints in the simulation, and also for simplification, the state of the system is captured at a fixed interval T. Therefore, we re-express Tᵢ,ⱼ as T · Σₜ Iᵢ,ⱼ,ₜ, where Iᵢ,ⱼ,ₜ is an indicator with a value of either 0 or 1 that indicates whether vehicle i is waiting at intersection j at timestamp t.

This is equivalent to

min_π E_{S_init ∼ D} [ Σₜ T · Σᵢ Σⱼ Iᵢ,ⱼ,ₜ ]

Now, if we set the reward R(s, a) (or rₜ) of this MDP to the negative value of the sum of queue lengths over all intersections, that is, we define r to be

rₜ = −k · Σⱼ qₜ(j)

where qₜ(j) means the sum of the queue lengths over all lanes at intersection j and k is a scalar for tuning the network, then based on the definition of the indicator Iᵢ,ⱼ,ₜ:

Σᵢ Σⱼ Iᵢ,ⱼ,ₜ = Σⱼ qₜ(j) = −rₜ / k

The objective function would then be the following, if we let c = T/k, which is a positive constant:

min_π − c · E_{S_init ∼ D} [ Σₜ rₜ ]

which is equivalent to optimizing this objective function:

max_π E_{S_init ∼ D} [ Σₜ rₜ ]

Therefore, when γ → 1, the objective would be

max_π E_{S_init ∼ D} [ Σₜ γᵗ rₜ ]

Recall that, in Reinforcement Learning, the state-value function of a policy π is

V^π(s) = E_π [ Σₜ γᵗ r̂ₜ | s₀ = s ]

The hat over r just means the empirical reward under the policy π.

By substituting the accumulated rewards with the V-value, the objective becomes

max_π E_{S_init ∼ D} [ V^π(S_init) ]

As we use the same policy over the distribution of initial states globally after training the Q-network, and that policy is exactly what we optimize to maximize this expectation, we can drop the superscript π of V.

To conclude the transformation of the objective function, we get

max E_{S_init ∼ D} [ V(S_init) ],  with V(s) = max_a Q(s, a)

This objective function just means that we need to optimize the Q-values over the state distribution.

For the constraint part of this optimization problem under DataLight's algorithm, it can be formalized as follows. Given the current state s and the previous action a_prev,

π̂(s, a_prev) = a_prev, if argmax_a Q̂(s, a) = a_prev;  otherwise π̂(s, a_prev) = (a_prev + 1) mod 4

Q̂ denotes the empirical Q-values obtained from a traditional non-cyclic RL Q-network, e.g. CoLight, AttentionLight, or some other known method. Also, notice that unlike traditional RL for an MDP, where π takes only the state s as its parameter, the π of DataLight needs the previous action a as an extra input to the action-choosing strategy. Further, Q(s, a) and V(s), unlike the empirical Q̂, are not calculated or used in DataLight; they serve only for the analysis of this optimization. However, they could still be computed if we initialized them with random parameters in a Q-network (or, more traditionally, with zeros in a Q-table) and updated them through the Bellman operator at each iteration.

Here π̂ is fixed.

As we can see, the action-choosing strategy is guided by another, non-cyclic Q-network (or, equivalently, the non-cyclic policy), not by Q-values of its own. Normally, the proper way of using RL to solve the problem and update the Q-values would look like the following, where the difference is that π is updated iteratively along with the Q-values:

Q(s, a) ← r(s, a) + γ · max_{a′} Q(s′, a′),  π(s) = argmax_a Q(s, a)

As the policy and the Q-values are decoupled in DataLight, the V-values over the state distribution that we want to maximize will be underestimated by DataLight. This gives us an opportunity to optimize. Based on this, we propose a new cyclic Q-network.

From DataLight to CyclicLight:

The transformation from DataLight to a network that outputs actions cyclically is not trivial, because you cannot force the Q-values it returns to follow the cyclic pattern. You cannot simply lower the Q-value for a certain state-action pair ⟨s, a⟩, or add an extra negative reward r′(s, a) for it when it does not satisfy the cyclic condition after the previous action, because the same pair may still be cyclic for other previous state-action pairs such as ⟨s′, (a−1) mod 4⟩ and ⟨s′, a⟩ (where s′ is the previous state and the primitive actions are 0, 1, 2, and 3). Further, modifying the reward system would break the derivation of the objective function shown in the section “Analysis for Possible Improvement from the DataLight”.

Meta-Action:

The first step to make the Q-network itself cyclical is to introduce the meta-action that CyclicLight proposes:

Suppose the original actions/phases are A, B, C, and D. A meta-action, either 0 or 1, indicates whether to keep the same action or to move to the next action in the cycle at timestamp t.

Therefore, we modified the network based on DataLight so that it only outputs the meta-action, which then guides the agent to choose actions cyclically. The base model is AttentionLight [14], and the output is the Q-values for the two meta-actions.

The Cyclic Network:

  • Input layer: The network takes two inputs. That is the state-action pair <s, a>. To have the current action included is important because the current action would decide whether the next action is cyclic or not and can provide information for either choosing meta-action 0 or 1.
  • Lane embedding: It first encodes the actions and lane features respectively. The lane features that would be used and encoded include the number of vehicles present in each incoming (entering) lane, the number of vehicles that are actively moving in each incoming lane, and the number of vehicles in certain segments in each incoming lane.
  • Feature correlation: It uses multi-head self-attention [15] to learn the correlation among lane features, as proposed in AttentionLight [14].
  • Feature & Phase fusion: Then we concatenate two embedded inputs and fuse them together.
  • Correlation: It uses multi-head self-attention again to encode the fused phases and features themselves.
  • MLP & Flatten: A multi-layer perceptron (two layers in CyclicLight) is then applied after the MHA, and the result is flattened into 1×N neurons.
  • Output layer: The network outputs two Q-values, one for meta-action 0 and one for meta-action 1. The action-choosing strategy is then straightforward: if q_values[0] is higher than q_values[1], choose meta-action 0 and keep the same signal phase, and vice versa (see the sketch after this list).
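A minimal sketch of this action-choosing step, assuming the network returns the two meta-action Q-values as a length-2 array (the function name and signature are illustrative, not the released implementation):

import numpy as np

def choose_next_phase(meta_q_values, current_phase, num_phases=4):
    # Meta-action 0 keeps the current phase; meta-action 1 advances cyclically.
    meta_action = int(np.argmax(meta_q_values))
    if meta_action == 0:
        return current_phase
    return (current_phase + 1) % num_phases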

Analysis of the Optimization:

Notice that for this cyclic network, the policy π and the Q-values are updated iteratively during training:

Q(s, ma) ← r + γ · max_{ma′} Q(s′, ma′),  π(s) = argmax_{ma} Q(s, ma)

Here ma refers to the meta-action.

Also, the reward is maintained the same, so the derived objective function still holds its meaning.

Now let's look at how we optimize the problem. Here are the optimized constraints:

π(s) = argmax_{ma ∈ {0, 1}} Q(s, ma)

Notice that here the actions ma are meta-actions. More importantly, this can only be achieved under the condition that the state s itself incorporates the current primitive action.

And if we translate it into the format of the primitive actions:

π(s, a) ∈ { a, (a + 1) mod 4 }

Suppose phase A is action 0, B is action 1, C is action 2, and D is action 3. Then “(a + 1) mod 4” is just the cyclic successor of a. This evaluation of the policy π is optimal with respect to the cyclic constraint, since “being cyclic” is the only requirement the policy needs to satisfy.

Offline Training for CyclicLight:

Dataset for CyclicLight :

As mentioned when introducing the baseline method, DataLight uses a Cyclical Offline Dataset (COD), which stores the data that has a cyclic pattern in terms of the actions.

However, this COD still does not satisfy our need for offline training. The reason is that these datasets only record data when there is a change in the actions. Therefore, even though the state-action pairs stored in the dataset are cyclic, if we translate them into meta-actions, meta-action 1 dominates meta-action 0 in the COD that DataLight proposed. Meta-action 1, which refers to advancing to the next primitive action in the cyclic sequence, accounts for over 95% of the COD distribution. This is problematic for training CyclicLight, as we are training Q-values for meta-actions: if meta-action 1 overwhelms meta-action 0, the Q-values for meta-action 0 cannot be properly updated. Recall the update rule

Q(s, ma) ← r + γ · max_{ma′} Q(s′, ma′)

which is only applied to the (s, ma) pairs that actually appear in the offline dataset.

As the rewards of meta-action 1 accumulate and propagate much faster, the Q-network trained won’t converge in this case. Therefore, when generating the dataset for CyclicLight, the first idea would be to have meta-action 0 and meta-action 1 roughly equally weighted in the dataset distribution.

import numpy as np

# Random cyclic agent used to generate the cyclic dataset (WMCOD):
# meta-action 0 and meta-action 1 are sampled with equal probability.
class RandomCyclicAgent:
    def __init__(self):
        self.last_action = 0

    def choose_action(self, state):
        meta_action = np.random.randint(2)       # 0: keep phase, 1: advance
        if meta_action:
            action = (self.last_action + 1) % 4  # next phase in the cycle
        else:
            action = self.last_action            # keep the current phase
        self.last_action = action
        return action

Based on our experiment, this could help the training for the CyclicLight and make the corresponding Q-network cyclic. Let’s call it a Weighted Meta-action Cyclic Offline Dataset (WMCOD).

Content of the Dataset Supposed by CyclicLight:

The lane features included in the dataset are:

  • the number of vehicles present in each incoming (entering) lane;
  • the number of vehicles actively moving in each incoming lane;
  • the number of vehicles in certain segments of each incoming lane;
  • the queue lengths in a designated segment of the incoming lanes;
  • the queue lengths in the outgoing lanes;
  • the number of vehicles in each outgoing lane of the intersection;
  • as well as the current action (phase).

Each entry is a list of [current state, current phase (primitive action), next state, queue-length reward]. As discussed earlier, for the Q-network of CyclicLight, both the state and the next state incorporate their current phase (primitive action). Therefore, we can check whether each entry is cyclic, since the actions are included in the states.
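As a rough illustration of one such entry (the feature dimension and values below are placeholders; the exact layout in the released code may differ):

import numpy as np

# [current state, current phase (primitive action), next state, queue-length reward]
current_state = np.zeros(12)   # lane features, with the current phase folded into the state
next_state = np.zeros(12)      # lane features after the transition
entry = [current_state, 1, next_state, -7.0]  # phase 1; reward = -(total queue length)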

Conservatism for Offline Learning:

Conservatism is also used, as in DataLight, since the training is offline. Using a conservative Bellman equation avoids overestimating the Q-values of actions that are out-of-distribution (OOD).

For CyclicLight, we further add a regularizer on the empirical policy μ being trained, as we don't want μ to deviate too far from realistic traffic light behaviour during training. For example, we don't want agents to get stuck in the same phase all the time (i.e., always choosing meta-action 0), which stops the policy from improving, nor do we want the cyclic agent to change phases at every step, since the transition cost of switching from one phase to another is high (in the experiment, the yellow-light interval is 5 seconds).

The regularizer CyclicLight uses for μ is the negative KL-divergence against a fine-tuned or chosen prior distribution ρ over actions given states [13], that is, R(µ) = −Dₖₗ(µ ∥ ρ), given ρ(a|s). To make the Q-network robust and safe in the real world, ρ can be chosen based on the action distribution of real-world, traditional fixed-time traffic lights, and adjusted for specific regions.
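For reference, this plugs into the regularized conservative Q-learning objective of [13] (written schematically; α is the conservatism weight, π̂_β the behaviour policy that generated the dataset, B̂^π the empirical Bellman operator, and Q̄ the target Q-function):

min_Q max_µ  α · ( E_{s∼D, a∼µ(·|s)}[Q(s, a)] − E_{s∼D, a∼π̂_β(·|s)}[Q(s, a)] ) + ½ · E_{s,a,s′∼D}[ (Q(s, a) − B̂^π Q̄(s, a))² ] + R(µ),  with R(µ) = −Dₖₗ(µ ∥ ρ)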

Experiment & Analysis for Improvement:

Here is CyclicLight's test result on a 4×4 road network in Hangzhou, evaluated after each training round. It was trained with the WMCOD dataset we compiled and shows a clear pattern of convergence, whereas the test plot of DataLight does not show a clear pattern of convergence, because its cyclic policy is decoupled from the Q-network training.

45 rounds for CyclicLight at Downtown Hangzhou, trained with WMCOD

Then we compared it to the best configurations that DataLight uses for training [12]. (According to their paper, they use a cyclic dataset, instead of the random and expert datasets, to train a model based on AttendLight and AttentionLight, and, to make the policy cyclic, they apply the ATC strategy.)

We extracted the last 5 rounds and compared them to DataLight. There is a 1.5% improvement (reduction) on average in terms of total waiting time; CyclicLight therefore reduces congestion while enforcing the cyclic pattern, which benefits other factors such as safety and pedestrian waiting time.
While this improvement seems small, the non-cyclic baseline model (which DataLight builds on, based on AttendLight and AttentionLight) outperforms DataLight by only 11.5% in the average waiting time of the last 5 rounds. That is a relaxed, overestimated upper bound on the possible improvement, because that baseline is not even cyclic.

The shadow part is the distance between DataLight and the non-cyclic baseline RL model that DataLight uses, which gives one of the upper bounds for all possible improvements.
DataLight → CyclicLight (→ Non-Cyclic Baseline)

Therefore, CyclicLight achieved roughly a 13% improvement (1.5% / 11.5% ≈ 13%) within this possible room for optimization and improvement.

Here is the Github Repo For this Final Project.

Future Work & Investigation:

For the cyclic pattern, we expect it to bring larger benefits in a bigger road network whose agents use coordination mechanisms such as CoLight. Due to the limited amount of time, the Cyclic-CoLight we implemented does not converge yet. We have already compiled a new dataset for training Cyclic-CoLight.

Conclusion:

In this article, we investigated waiting-time reduction for offline RL-based methods in TSC and analyzed the possible room for optimization of DataLight and its ATC strategy. To achieve the improvement, we first introduced a new concept, the meta-action, that makes the Q-network itself cyclic. For the training of CyclicLight, we further compiled a new type of dataset, WMCOD, to balance the distribution of meta-actions in the dataset so that the Q-values are properly propagated/updated during training. Lastly, to deploy the model more safely in a real-world setting, we imposed a KL-divergence regularizer on the unseen actions of the empirical policy during training. We obtained a 1.5% improvement in waiting time compared to DataLight, which is about a 13% improvement within the possible room for optimization, benchmarked against the non-cyclic RL model that DataLight uses.

Reference:

[1]H. Wei, N. Xu, H. Zhang, G. Zheng, et al. CoLight: Learning Network-level Cooperation for Traffic Signal Control. In CIKM, 2019.

[2]C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI, 1998.

[3] H. Wei, G. Zheng, V. Gayah, and Z. Li. A Survey on Traffic Signal Control Methods. arXiv, 2020.

[4] L. A. Prashanth and S. Bhatnagar. Reinforcement learning with average cost for adaptive control of traffic lights at intersections. ITSC, 2011.

[5] H. Wei, G. Zheng, V. Gayah, and Z. Li. Recent Advances in Reinforcement Learning for Traffic Signal Control: A Survey of Models and Evaluation. ACM SIGKDD Explorations Newsletter, 22(2):12–18, 2021. doi:10.1145/3447556.3447565.

[6] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph Attention Networks. ICLR, 2018.

[7] C. Chen, H. Wei, N. Xu, G. Zheng, M. Yang, Y. Xiong, K. Xu, and Z. Li. Toward A Thousand Lights: Decentralized Deep Reinforcement Learning for Large-Scale Traffic Signal Control. In AAAI, 2020, pp. 3414–3421. doi:10.1609/aaai.v34i04.5744.

[8] P. Varaiya. Max Pressure Control of a Network of Signalized Intersections. Transportation Research Part C: Emerging Technologies, 36:177–195, 2013. doi:10.1016/j.trc.2013.08.014.

[9] T. Tan, F. Bao, Y. Deng, A. Jin, Q. Dai, and J. Wang. Cooperative deep reinforcement learning for large-scale traffic grid signal control. IEEE transactions on cybernetics, 2019.

[10] https://github.com/Chacha-Chen/MPLight/blob/master/baseline/maxpressure_agent.py

[11] https://github.com/traffic-signal-control/sample-code/blob/master/data/hangzhou_1x1_bc-tyc_18041607_1h/flow.json

[12] L. Zhang and J. Deng. Data Might be Enough: Bridge Real-World Traffic Signal Control Using Offline Reinforcement Learning. arXiv:2303.10828, 2023.

[13] A. Kumar, A. Zhou, G. Tucker, and S. Levine. Conservative Q-Learning for Offline Reinforcement Learning. arXiv:2006.04779, 2020.

[14] L. Zhang, S. Xie, and J. Deng. Leveraging Queue Length and Attention Mechanisms for Enhanced Traffic Signal Control Optimization. 2023.

[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention Is All You Need. In Advances in Neural Information Processing Systems 30, 2017.

Appendix I

Simulation:

Before we look at the dataset, let’s first list the simulation that we use in our project.

Dataset for Traditional RL Method & Visualization

The link below is the main dataset that we are investigating and gaining insight into. Most of them include both the road network file and the traffic flow file in JSON format. Some of them have signal_intersection_i_j.txt saved to record the actions of each agent and the corresponding timestamp.

Let’s look at a certain road net configuration and data from the real world to see the importance of the coordination of traffic lights. We take a portion of the road net of downtown Hangzhou as an example and we specifically look at 16 intersections that are highlighted in red below.

import osmnx as ox
import matplotlib.pyplot as plt

GuDang = (30.284698, 120.109238)
graph = ox.graph_from_point(GuDang, network_type='drive', dist=2100)

#Intersec: Osmid lat lon row col (bottom-to-up, left-to-right)
node_1_1 = 27127956 # 30.278049 120.096021 1 1
node_1_2 = 26496837 # 30.278366 120.106039 1 2
node_1_3 = 3382408994 # 30.278606 120.115016 1 3
node_1_4 = 26197702 # 30.279129 120.125075 1 4
node_2_1 = 332556593 # 30.283895 120.094802 2 1
node_2_2 = 26607405 # 30.284117 120.10446 2 2
node_2_3 = 26197709 # 30.284443 120.114585 2 3
node_2_4 = 26197716 # 30.284997 120.124611 2 4
node_3_1 = 332556598 # 30.288751 120.093849 3 1
node_3_2 = 33720424 # 30.289040 120.102869 3 2
node_3_3 = 262910008 # 30.289688 120.114313 3 3
node_3_4 = 26197727 # 30.290408 120.124187 3 4
node_4_1 = 2942981091 # 30.296996 120.093084 4 1
node_4_2 = 2908449025 # 30.295688 120.101862 4 2
node_4_3 = 30067206 # 30.296274 120.113794 4 3
node_4_4 = 31885920 # 30.296754 120.123847 4 4

targetIntersection = [
node_1_1, node_1_2, node_1_3, node_1_4,
node_2_1, node_2_2, node_2_3, node_2_4,
node_3_1, node_3_2, node_3_3, node_3_4,
node_4_1, node_4_2, node_4_3, node_4_4
]

# Create a list of node colors: default color for most nodes and a special color for the highlighted nodes
nc = ['r' if node in targetIntersection else 'none' for node in graph.nodes()]

# Create a list of node sizes: larger size for the highlighted nodes and default size for the rest
ns = [100 if node in targetIntersection else 10 for node in graph.nodes()]

# Plot the graph with the specified node colors and sizes
fig, ax = ox.plot_graph(graph, node_color=nc, node_size=ns)

Then we load the dataset that includes the road network data and the traffic data. With some simplification of the model, for each intersection there are four incoming approaches and four outgoing approaches, where each approach has three lanes. The traffic flow data below is collected from camera data in Hangzhou.

import json
import matplotlib.pyplot as plt
import argparse
import sys
from google.colab import drive

drive.mount('/content/drive')
FOLDERNAME = 'ECE1724/hangzhou_4x4_gudang_18041610_1h'

sys.path.append('/content/drive/MyDrive/{}'.format(FOLDERNAME))

%cd /content/drive/MyDrive/$FOLDERNAME

with open("roadnet_4X4.json", 'r') as road_file:
road_data = json.load(road_file)

with open("hangzhou_4x4_gudang_18041610_1h.json", "r") as file:
new_volume_data = json.load(file)

We here visualize it and demonstrate the traffic volume of each lane to gain insight into the dataset.

import numpy as np
import matplotlib.pyplot as plt

# Extract roads and intersections from roadnet_data
intersections = road_data['intersections']
roads = road_data['roads']

# Create a dictionary to store traffic volume for each road
traffic_volume = {road['id']: 0 for road in roads}

# Update traffic_volume based on traffic_data
for entry in new_volume_data:
    for road_id in entry['route']:
        traffic_volume[road_id] += 1

def add_arrow(ax, start_x, start_y, end_x, end_y, color):
    """Add an arrow to the plot to indicate road direction."""
    offset = 50  # Offset to shift the arrow for visibility
    shrink_factor = 0.7  # Factor to shorten the arrow

    # Calculate the direction of the road
    dx = end_x - start_x
    dy = end_y - start_y

    # Normalize the direction
    length = np.sqrt(dx**2 + dy**2)
    dx /= length
    dy /= length

    # Calculate offset starting and ending points for the arrow, and shorten the arrow
    arrow_start_x = start_x + dy * offset + dx * (1 - shrink_factor) * length / 2
    arrow_start_y = start_y - dx * offset + dy * (1 - shrink_factor) * length / 2
    arrow_end_x = end_x + dy * offset - dx * (1 - shrink_factor) * length / 2
    arrow_end_y = end_y - dx * offset - dy * (1 - shrink_factor) * length / 2

    # Plot the arrow
    ax.arrow(arrow_start_x, arrow_start_y, arrow_end_x - arrow_start_x, arrow_end_y - arrow_start_y,
             head_width=50, head_length=50, fc=color, ec=color)

def add_text(ax, start_x, start_y, end_x, end_y, text):
    """Add text to the plot."""
    offset = 80  # Offset to shift the text for visibility

    dx = end_x - start_x
    dy = end_y - start_y

    # Normalize the direction
    length = np.sqrt(dx**2 + dy**2)
    dx /= length
    dy /= length

    # Calculate offset position for the text
    text_x = (start_x + end_x) / 2 + dy * offset
    text_y = (start_y + end_y) / 2 - dx * offset

    ax.text(text_x, text_y, text, fontsize=10, ha='center', va='center', color='grey')

def visualize_traffic_red_with_shortened_arrows(intersections, roads, traffic_volume):
    fig, ax = plt.subplots(figsize=(13, 10))
    max_volume = max(traffic_volume.values())

    # Draw intersections
    for intersection in intersections:
        x, y = intersection['point']['x'], intersection['point']['y']
        ax.plot(x, y, 'o', color='red')

    # Draw roads, overlay traffic volume, and add shortened offset arrows and offset text
    for road in roads:
        start_x, start_y = road['points'][0]['x'], road['points'][0]['y']
        end_x, end_y = road['points'][1]['x'], road['points'][1]['y']

        color_intensity = traffic_volume[road['id']] / max_volume

        # Add shortened direction arrow with an offset
        add_arrow(ax, start_x, start_y, end_x, end_y, plt.cm.Reds(color_intensity))

        # Add offset text
        add_text(ax, start_x, start_y, end_x, end_y, str(traffic_volume[road['id']]))

    # Set a colormap for traffic volume
    norm = plt.Normalize(min(traffic_volume.values()), max(traffic_volume.values()))
    cmap = plt.cm.Reds

    # Add colorbar to indicate traffic volume
    sm = plt.cm.ScalarMappable(cmap=cmap, norm=norm)
    sm.set_array([])
    plt.colorbar(sm, ax=ax, label="Traffic Volume")

    ax.set_title("Traffic Volume Visualization with Direction")
    ax.set_xlabel("X Coordinate")
    ax.set_ylabel("Y Coordinate")
    ax.grid(True)
    plt.tight_layout()
    plt.show()

visualize_traffic_red_with_shortened_arrows(intersections, roads, traffic_volume)
Here is a visualization of the dataset detailing traffic volume across 16 specific intersections, arranged in a 4x4 grid. The data captures one hour of traffic in the Gudang sub-district of Hangzhou, China. Traffic density is represented by shades of red, with darker hues indicating higher volumes. Arrows denote the direction of traffic flow on each road.

The plot demonstrates the traffic distribution in the area and delineates the flow in each direction. Some edges where one of the arrows has a value of 0 indicate that the real road is one-way (the road network and action phases have been simplified). Notably, bottlenecks are evident at the intersections node_4_1 and node_4_4, as well as along Yuhangtang and Gudun Roads. Through this spatial distribution, we can see a propagation trend in traffic volume: roads or intersections with high traffic often neighbor others with similarly high traffic volumes.

For example, let’s examine one of the bottlenecks Yuhangtang Road, and its traffic flow along each direction.

import networkx as nx
from optalgotools import routing

start_node = node_4_4
end_node = node_4_1

# Generate a BFS tree from the start node
bfs_tree = nx.bfs_tree(graph, source=start_node)

# Traverse the BFS tree to get the route from the start node to the end node
bfs_route = [end_node]
while bfs_route[-1] != start_node:
    bfs_route.append(list(bfs_tree.predecessors(bfs_route[-1]))[0])
bfs_route = bfs_route[::-1]  # reverse the route to start from the start node

# Plot the route
routing.draw_route(graph, bfs_route)

We can see that the traffic volume decreases gradually after each intersection in each direction. This subplot shows the relationship between neighboring intersections and implies the importance of coordinating traffic lights. If these intersections did not cooperate and could not see each other's observations, they might not be able to predict incoming traffic and plan an efficient policy, and vehicles might suffer frequent stops at each intersection. With a well-trained coordination mechanism, the signals of neighboring traffic lights can be timed so that vehicles pass through intersections with fewer stops, which also reduces the queue length in each lane.

Intersections (4, 1), (4, 2), (4, 3), (4, 4). Recall we index the intersection (row_id, col_id) from bottom to top and from left to right.

Here is the visualization of the naive signal policy from real-world data. There is no coordination among the agents; each action is just a combination of signals at a single intersection.

Illustration of actions. Note that these four actions don't correspond to the actions below. Image created by the author through simulation.

The following is a segment of the signal history of the MaxPressure implementation [10], which maximizes network throughput based on local information at each intersection, given knowledge of mean turn ratios and saturation rates [8].

phase_1 = max((np.sum(state['coming_vehicle'][3:6])+np.sum(state['coming_vehicle'][12:15]) - \
np.sum(state['leaving_vehicle'][3:6]) - np.sum(state['leaving_vehicle'][12:15])),0)
phase_2 = max((np.sum(state['coming_vehicle'][21:24])+np.sum(state['coming_vehicle'][30:33]) - \
np.sum(state['leaving_vehicle'][21:24]) - np.sum(state['leaving_vehicle'][30:33])), 0)
phase_3 = max((np.sum(state['coming_vehicle'][0:3]) + np.sum(state['coming_vehicle'][9:12])-\
np.sum(state['leaving_vehicle'][0:3]) - np.sum(state['leaving_vehicle'][9:12])), 0)
phase_4 = max((np.sum(state['coming_vehicle'][18:21]) + np.sum(state['coming_vehicle'][27:30])- \
np.sum(state['leaving_vehicle'][18:21]) - np.sum(state['leaving_vehicle'][27:30])), 0)

self.action = np.argmax([phase_1, phase_2, phase_3, phase_4])

Since it uses np.argmax() to choose the best action and ties resolve to the first index, the default action is phase 1. That is why action 1 appears to be the dominant action in the plot, when it actually is not.
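A quick check of this tie-breaking behaviour (the phase labels in the plots are 1-indexed, so index 0 corresponds to phase 1):

import numpy as np

print(np.argmax([0, 0, 0, 0]))  # -> 0, i.e. phase 1, whenever all pressures are zero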

import matplotlib.pyplot as plt
import os

file_paths = [
    "./signal_plan_MP_intersection_1_1.txt", "./signal_plan_MP_intersection_1_2.txt", "./signal_plan_MP_intersection_1_3.txt", "./signal_plan_MP_intersection_1_4.txt",
    "./signal_plan_MP_intersection_2_1.txt", "./signal_plan_MP_intersection_2_2.txt", "./signal_plan_MP_intersection_2_3.txt", "./signal_plan_MP_intersection_2_4.txt",
    "./signal_plan_MP_intersection_3_1.txt", "./signal_plan_MP_intersection_3_2.txt", "./signal_plan_MP_intersection_3_3.txt", "./signal_plan_MP_intersection_3_4.txt",
    "./signal_plan_MP_intersection_4_1.txt", "./signal_plan_MP_intersection_4_2.txt", "./signal_plan_MP_intersection_4_3.txt", "./signal_plan_MP_intersection_4_4.txt"
]

# Function to read and parse the files
def read_signal_file(file_path):
    with open(file_path, "r") as file:
        lines = file.readlines()
    timestamps, signals = [], []
    for line in lines:
        timestamp, signal = line.strip().split(",")
        timestamps.append(float(timestamp))
        signals.append(int(float(signal)))
    return timestamps, signals

# Extracting signal plans from all files
signal_data = {}
for path in file_paths:
    intersection_name = os.path.basename(path).replace("signal_plan_MP_", "").replace(".txt", "")
    signal_data[intersection_name] = read_signal_file(path)

def plot_signal_phases_with_steps(intersection_name, timestamps, signals, ax):
    """Plot the signal phases using step functions on a given axis."""
    ax.set_title(f"Intersection {intersection_name}")
    ax.set_xlabel("Time (s)")
    ax.set_ylabel("Phase")
    ax.set_ylim(0.5, 4.5)
    # Iterate over the timestamps and signals to plot the durations
    for i in range(0, len(timestamps), 2):
        if i + 1 == len(timestamps):
            break
        start_time, end_time = timestamps[i], timestamps[i+1]
        signal_phase = signals[i]
        ax.plot([start_time, end_time], [signal_phase, signal_phase], color='k', lw=1.5)

# Create a 1x4 grid of plots (intersections in row 4) with the step function visualization method
fig, axs = plt.subplots(1, 4, figsize=(20, 5), sharex=True, sharey=True)

for i in [4]:
    for j in range(1, 5):
        intersection_name = f"intersection_{i}_{j}"
        timestamps, signals = signal_data[intersection_name]
        plot_signal_phases_with_steps(intersection_name, timestamps, signals, axs[j-1])
        axs[j-1].set_xlim(300, 600)

plt.tight_layout()
path_to_save = './intersection_signals.png'
fig.savefig(path_to_save, dpi=300, bbox_inches='tight')
This is the visualization of MaxPressure Signals of a time period from 300 seconds to 600 seconds.

The length of each line segment represents the duration of each action (or combination of signals). For this plot, let's focus on action 3 at intersections (4, 3), (4, 2), and (4, 1). Action 3 is a combination of signals that allows vehicles to pass through the intersection from East to West. From this, we see that there is a predictable delay between the traffic signals, caused by the travel time of vehicles from one intersection to the next. Furthermore, the duration of this action decreases across the intersections. This is easily explained, as part of the traffic volume flows onto other roads after passing through each intersection. As MaxPressure has been proven to maximize the throughput of the global road network and is an instantaneous reflection of the traffic volume at each intersection, a reasonable coordination mechanism should give us a policy with a similar pattern, reducing queue lengths and the total travel time.

This includes the agents’ actions over one hour using MaxPressure.

This is the plot over the whole dataset. Recall that the default action (or the 'Null' action) is action 1, so to get a better intuition, it is better to ignore action 1 and focus on the top three lines in each subplot. Since this visualization covers one hour of data, some information, such as the duration of each signal, cannot be displayed in as much detail as in the previous plot. But as MaxPressure is an algorithm that optimizes throughput based on the observation of 'pressure', a certain evaluation of traffic volumes, we can still infer the relative traffic pressure at each intersection by inspecting the corresponding phases. For example, across intersections (1, 1), (1, 2), (1, 3), and (1, 4), action 2 is the most dominant combination of signals that these agents take, so we can infer that most vehicles are moving from West to East on Wensan Road.

However, we still need to notice that the analysis above uses a simplified model of a 4 by 4 grid. There are more traffic lights along the roads other than these 16 intersections if we examine them in more detail. That is why, in the plot of the traffic volume above, you cannot calculate the inflow to an intersection based solely on the outflow of its neighbors.

import overpass
import geopandas as gpd
import pandas as pd
import folium

gu_dang_location = (30.284698, 120.109238)
gu_dang_location_lat_lon = (120.109238, 30.284698)

def getGeoDataPointsWithRadius(highway_type, center_location, radius):
    lat, lon = center_location
    query = f"""
    node
      ["highway"="{highway_type}"]
      (around:{radius}, {lat}, {lon});
    out center;
    """
    response = api.get(query)
    gdf = gpd.GeoDataFrame.from_features(response)
    return gdf[gdf.geometry.type == 'Point']

api = overpass.API()

traffic_light_gdf_points = getGeoDataPointsWithRadius("traffic_signals", gu_dang_location, 1300)

traffic_light_map_options = {
    "location": gu_dang_location,
    "zoom_start": 15
}
traffic_light_map = folium.Map(tiles="OpenStreetMap", **traffic_light_map_options)

folium.CircleMarker(location=gu_dang_location, radius=3, color="red", fill=True).add_to(traffic_light_map)
folium.Circle(gu_dang_location, radius=1300, color='red', fill=True, fill_opacity=0.1).add_to(traffic_light_map)

for _, row in traffic_light_gdf_points.iterrows():
    folium.CircleMarker(location=(row['geometry'].y, row['geometry'].x),
                        radius=4,
                        color="blue",
                        fill=True).add_to(traffic_light_map)

traffic_light_map

Note, this Colab is just for the midterm paper not for the final project.

Appendix II

Here is the Github Repo For this Final Project.
