
Reinforcement Learning in Dynamic Resource Allocation: Optimizing Humanitarian Assistance and Disaster Relief

--

HADR missions are a tough nut to crack from a resource allocation perspective: incomplete information, the stochasticity of a dynamic system, uncertainty about future needs, conditions and constraints, and a mix of short- and long-term objectives. Megan Seah Li Ping built a simulator based on a Markov Decision Process model of asset allocation, then applied reinforcement learning to learn optimal resource allocation policies amidst these uncertainties and mixed time-horizon objectives. This article is a derivative of her original work.

Background

Dynamic resource allocation is a formidable challenge that encompasses a blend of complexities: information gaps, stochastic systems, uncertain future scenarios, objectives spanning multiple time horizons, and the deferred impact of decisions. This article explores asset management and resource allocation within the realm of Humanitarian Assistance and Disaster Relief (HADR) missions, where these intricacies converge. Additionally, we examine the potential of Reinforcement Learning (RL) as an instrumental framework for devising optimal resource allocation strategies.


Project Objective

The primary goal of this project is to construct a simulation tool built on a Markov Decision Process (MDP) framework. This simulation tool replicates the process of assigning resources and integrates reinforcement learning techniques to derive an optimal decision-making strategy. Complementing this, a rewards function calculator is integrated to assess the effectiveness of different actions, and a monitoring dashboard visualizes key metrics and results. The simulation tool serves as a critical component for both testing and showcasing the implemented reinforcement learning algorithm, while also providing valuable insights into the complexities of resource allocation within the context of asset management.


Solution Overview

Figure 1: An overview of what we’ll unpack in detail

Data Ingestion & Processing

The initial input encompasses three distinct datasets:

  1. Mission Data: This dataset contains comprehensive information about each mission, detailing its unique attributes and characteristics.
  2. Supply Data: Here, the dataset provides an intricate account of individual assets, outlining their specific attributes and capabilities.
  3. Task Data: This dataset outlines the essential prerequisites for assets to effectively fulfill specific tasks.

In the process of Data Ingestion and Processing, the three datasets are ingested and read into the system. To ensure data quality, thorough validation checks are executed, ensuring the presence of all essential fields while adhering to predefined constraints. Since certain variables within the datasets may be encoded as strings, a technique called feature encoding is employed. This conversion transforms data into more computationally efficient formats such as integers or lists. This transformation enhances the efficiency of subsequent calculations and comparisons conducted during both the training and simulation phases.
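As a rough illustration of these steps, the sketch below shows how validation and feature encoding might look in pandas. The file names, column names and capability encoding are assumptions for illustration, not the project's actual schema.

```python
import pandas as pd

# Hypothetical schema; the real datasets may use different fields.
REQUIRED_MISSION_COLS = ["mission_id", "task_id", "priority"]
CAPABILITY_LEVELS = {"basic": 0, "intermediate": 1, "advanced": 2}

def load_and_validate(path: str, required_cols: list[str]) -> pd.DataFrame:
    """Read a CSV and check that all essential fields are present."""
    df = pd.read_csv(path)
    missing = [c for c in required_cols if c not in df.columns]
    if missing:
        raise ValueError(f"{path} is missing required fields: {missing}")
    return df

def encode_features(df: pd.DataFrame) -> pd.DataFrame:
    """Convert string-encoded variables into integers/lists for faster comparisons."""
    if "capability" in df.columns:
        df["capability"] = df["capability"].map(CAPABILITY_LEVELS)
    if "supported_tasks" in df.columns:
        # e.g. "transport;medical" -> ["transport", "medical"]
        df["supported_tasks"] = df["supported_tasks"].str.split(";")
    return df

# Example usage (hypothetical file name):
# missions = encode_features(load_and_validate("missions.csv", REQUIRED_MISSION_COLS))
```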

Moreover, the datasets undergo additional preprocessing steps, including merging operations. The purpose of these steps is to create a comprehensive, well-structured dataset that can serve as the foundation for the simulation. This formatted dataset takes the form of a grid-like structure, resembling a coordinate system. Within this structure, every task associated with each mission is precisely represented. Attributes corresponding to the assets allocated to each task, as well as indicators of task fulfillment, are included within this representation. To visualize this arrangement, refer to Figure 2 for an illustrative depiction of the general layout of this state representation.

Figure 2: A general layout of the state
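To make Figure 2 concrete, here is a minimal sketch of such a grid-like state. The per-task attributes shown (assigned asset, capability level, fulfillment flag) are illustrative assumptions only.

```python
# One cell per (mission, task) "coordinate", holding the attributes of the
# allocated asset and an indicator of task fulfillment.
state = {
    ("mission_1", "task_1"): {"assigned_asset": "asset_7", "capability": 2, "fulfilled": True},
    ("mission_1", "task_2"): {"assigned_asset": None, "capability": None, "fulfilled": False},
    ("mission_2", "task_1"): {"assigned_asset": "asset_3", "capability": 1, "fulfilled": False},
}

def state_to_key(state: dict) -> str:
    """Serialize the state deterministically so it can be used as a Q-Table key."""
    return str(sorted((k, tuple(sorted(v.items()))) for k, v in state.items()))
```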

Training & Simulation

Both the training and simulation phases of this project employ the Q-Learning algorithm, differing primarily in the number of iterations performed. During training, 10,000 iterations are executed, while in simulation, only a single iteration takes place. Each iteration consists of 20 discrete time steps, where each step corresponds to a specific period of time.

The Q-Learning algorithm, a prominent reinforcement learning technique, is harnessed for its efficacy in training the Q-Function. This Q-Function essentially manifests as a Q-Table, which operates as a lookup table. Within this table, states are matched with corresponding sets of actions that are explored in those states. Additionally, the table stores projected future rewards, known as Q-values.

In the training phase, the agent actively interacts with the environment, refining Q-values through iterative learning. Over time, this process culminates in the generation of an optimized Q-Table that embodies optimal decision-making strategies. In contrast, during the simulation phase, the agent employs this optimized Q-Table to meticulously select the most advantageous set of actions at each time step based on the current state’s Q-value.

The Q-Learning algorithm begins with the initialization of a Q-Table, represented here as a 2D dictionary. String representations of both states and actions serve as keys: each state string is associated with multiple action string keys, each mapped to a Q-value representing the expected reward for that particular state-action pair. Figure 3 provides a visual overview of the Q-Table's general layout. At the onset of each time step, the process depicted in Figure 4 is set into motion.

Figure 3: A general layout of the Q-Table
Figure 4: Process outline at each time step of training & simulation
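A minimal sketch of such a Q-Table, using nested dictionaries with string keys as in Figure 3, might look like the following; the key encodings shown are purely illustrative.

```python
from collections import defaultdict

# Outer key: string-encoded state; inner key: string-encoded action (or action set);
# value: the learned Q-value (expected future reward) for that state-action pair.
q_table: dict[str, dict[str, float]] = defaultdict(dict)

state_key = "mission_1|task_2|unassigned"       # illustrative encodings only
action_key = "assign(asset_3 -> task_2)"

q_table[state_key][action_key] = 0.0            # unseen pairs start at 0

def best_action(q_table: dict, state_key: str):
    """Return the action with the highest Q-value for a known state, else None."""
    actions = q_table.get(state_key)
    if not actions:
        return None
    return max(actions, key=actions.get)
```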

The strategic employment of the Epsilon Greedy Strategy addresses the inherent tradeoff between exploration and exploitation. Here, epsilon (ε) signifies the exploration probability, while (1 − ε) signifies the exploitation probability. During the training phase, ε is initially set to a higher value (ε > 0.5) to foster exploration across state and action spaces. In contrast, during the simulation phase, ε is initialized as 0 (i.e. pure exploitation) to gauge the learning progress and assess the Q-Table’s quality.

In the exploration process, the agent systematically enumerates all feasible actions for each asset within the current state. These actions must adhere to the organization’s defined policies and constraints. In the current implementation, the agent randomly selects a designated number (y) of distinct actions from this pool of available options. Future iterations could model this selection process with a distribution function aligned with the organization’s policies.

Conversely, during exploitation, the agent, grounded in the current state, chooses actions from the Q-Table that correspond to the highest associated Q-value. Subsequently, this chosen set of actions, arising from the exploration/exploitation approach, triggers the generation of a new state and the corresponding immediate rewards. These rewards are computed by evaluating the compatibility of asset attributes with task prerequisites, thus determining the asset’s suitability for the given task.
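A minimal sketch of this epsilon-greedy selection, reusing the y parameter above and assuming a precomputed list of feasible actions, could look like this:

```python
import random

def choose_actions(q_table: dict, state_key: str, feasible_actions: list[str],
                   epsilon: float, y: int) -> list[str]:
    """Pick y actions: explore with probability epsilon, otherwise exploit the Q-Table."""
    if random.random() < epsilon:
        # Exploration: randomly sample y distinct feasible actions.
        return random.sample(feasible_actions, k=min(y, len(feasible_actions)))
    # Exploitation: rank feasible actions by their known Q-values (unseen actions default to 0).
    ranked = sorted(feasible_actions,
                    key=lambda a: q_table.get(state_key, {}).get(a, 0.0),
                    reverse=True)
    return ranked[:y]
```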

Upon obtaining the new state and immediate rewards resulting from the chosen action set, a fresh Q-value is calculated, leveraging the equation presented in Figure 5. This iterative process of action selection, state transition, reward computation, and Q-value update forms the core dynamic underlying the reinforcement learning process.

Figure 5: Computation of new Q-value
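The equation in Figure 5 is presumably the standard Q-Learning update; a code sketch of that update, with assumed values for the learning rate and discount factor, is shown below.

```python
ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.9   # discount factor (assumed value)

def update_q(q_table: dict, state_key: str, action_key: str,
             reward: float, next_state_key: str) -> None:
    """Standard Q-Learning update:
    Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    """
    current = q_table.get(state_key, {}).get(action_key, 0.0)
    next_actions = q_table.get(next_state_key, {})
    best_next = max(next_actions.values()) if next_actions else 0.0
    q_table.setdefault(state_key, {})[action_key] = current + ALPHA * (
        reward + GAMMA * best_next - current)
```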

Throughout training and simulation, we tracked these key metrics:

  1. Aggregate training rewards, recorded at the end of each iteration.
  2. Mean simulation rewards, evaluated at every 100th iteration by running 100 simulations with the current Q-Table; this gauges how much intelligence the agent has acquired so far (see the sketch after this list).
  3. The average ratio of exploration to exploitation in each training iteration, checked for alignment with the predefined explore/exploit ratio.
  4. The allocation of assets, checked to ensure they are optimally assigned to tasks (compared against asset-mission fit).
  5. The cumulative duration for which each task remained unresolved, enabling a comparison between runs with and without training.
  6. The percentage of mission fulfillment at the start and end of the simulation, capturing the overall efficacy of the implemented approach.
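As one concrete example, the periodic evaluation in metric 2 could be implemented along these lines, where run_simulation is a hypothetical helper that plays out one greedy episode (epsilon = 0) with the given Q-Table and returns its total reward.

```python
def evaluate(q_table: dict, run_simulation, n_runs: int = 100) -> float:
    """Mean reward over n_runs greedy simulations using the current Q-Table."""
    rewards = [run_simulation(q_table, epsilon=0.0) for _ in range(n_runs)]
    return sum(rewards) / n_runs

# Inside the training loop (sketch):
# if (iteration + 1) % 100 == 0:
#     mean_sim_rewards.append(evaluate(q_table, run_simulation))
```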

Monitoring Dashboard

Figure 6: Monitoring Dashboard (partial)

The monitoring dashboard, developed utilizing the Dash framework, effectively translates the collated metrics from both training and simulation phases into insightful graphical representations. Stakeholders can leverage this dashboard to critically assess the validity of outcomes and ascertain the proficiency of the reinforcement learning (RL) algorithm’s performance. An excerpt of this monitoring dashboard is depicted in Figure 6.
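As a minimal illustration of how such a dashboard can be wired up with Dash, the sketch below plots one hypothetical metric series; the metric names, values and layout are assumptions rather than the project's actual dashboard.

```python
from dash import Dash, dcc, html
import plotly.graph_objects as go

# Hypothetical metric series collected during training.
iterations = list(range(0, 10_000, 100))
mean_sim_rewards = [0.0] * len(iterations)   # placeholder values

app = Dash(__name__)
app.layout = html.Div([
    html.H2("RL Training Monitor"),
    dcc.Graph(
        id="mean-simulation-rewards",
        figure=go.Figure(
            data=[go.Scatter(x=iterations, y=mean_sim_rewards, mode="lines")],
            layout=go.Layout(xaxis_title="Iteration", yaxis_title="Mean simulation reward"),
        ),
    ),
])

if __name__ == "__main__":
    app.run(debug=True)
```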

Within this visualization, specific aspects warrant consideration. In particular, the graph labeled (a) in Figure 6 shows notable oscillations in training rewards as iterations progress. This is expected, since the agent alternates between exploration and exploitation at each time step during training, with a pronounced inclination towards exploration.

Conversely, graph (b) in Figure 6 shows the trajectory of average simulation rewards. These rewards rise rapidly during the initial iterations before plateauing around the 1300th iteration. This pattern indicates the agent's growing proficiency as it explores more of the state and action space over additional training iterations, and the curve also serves as a useful reference point for future enhancements and refinements.

Challenges

Given the intricate nature of this problem, numerous obstacles emerged. The key challenges are described below.

Q-Table Representation. Conventionally, the Q-Table is initialized as a two-dimensional (2D) array, with values like 0 or arbitrary assignments for each state-action combination. This Q-Table setup is typically viable due to the modest dimensionality of states and the compact, unchanging state-action space. However, in our case, states take on high-dimensional attributes due to the inclusion of tasks and asset attributes. Furthermore, both the state and action spaces are substantial and dynamic due to the presence of continuous-type attributes and an immense array of action combinations. The 2D array form, resembling a sparse matrix, hampers space efficiency. Consequently, adopting a 2D dictionary format for the Q-Table was the suitable alternative. Here, the state and action operate as keys, and the corresponding Q-value serves as the associated value.

State and Action Representation. Guaranteeing the uniqueness of each state hinges on a judicious selection of asset attributes, which are pivotal for state updates and reward computation. At the same time, the state definition should remain succinct to keep processing efficient and training times short. Similarly, each action must be distinct while carrying enough information for precise state updates. Misrepresentation could lead to several problems, including:

  • Actions valid during exploration might not hold during exploitation, potentially encompassing invalid actions.
  • The state might be erroneously updated.

A closely intertwined issue involved the data schema and the data ingestion and processing pipeline. The datasets had to be carefully designed around their intended functions to avoid duplication and processing confusion, and merging them into a comprehensive state representation without data loss was a pivotal step.

Handling State Updates. Unforeseen state shifts could emerge, spanning alterations in task prerequisites and even changes in asset capabilities over time. Consequently, the agent must meticulously accommodate these external shifts. A promising resolution involves triggering state updates at each time step’s onset during both training and simulation.
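One way to realize this, sketched with the assumed state structure from earlier, is an update hook applied at the onset of every time step:

```python
def apply_external_updates(state: dict, updates: list[dict]) -> dict:
    """Apply externally driven changes (e.g. revised task prerequisites or
    degraded asset capabilities) to the state at the start of a time step."""
    for upd in updates:
        cell = state.get((upd["mission"], upd["task"]))
        if cell is not None:
            cell.update(upd["changes"])   # e.g. {"capability": 1}
    return state

# At the start of each time step (sketch):
# state = apply_external_updates(state, pending_updates_for(t))
```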

Crafting the Rewards Function. Critical to the rewards function is its portrayal of policy efficacy — essentially the set of selected actions. This formulation necessitates weighing several factors that contribute to an asset’s suitability for a given task, alongside the assignment of corresponding weights and the scaling of rewards from these factors.
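One common way to express such a rewards function is a weighted sum of suitability factors scaled to a fixed range; the factors and weights below are illustrative assumptions only, not the weights used in the project.

```python
# Illustrative suitability factors and weights.
WEIGHTS = {"capability_match": 0.5, "proximity": 0.3, "availability": 0.2}

def immediate_reward(asset: dict, task: dict) -> float:
    """Score one asset-task assignment as a weighted sum of factors, scaled to [0, 1]."""
    factors = {
        "capability_match": 1.0 if asset["capability"] >= task["required_capability"] else 0.0,
        "proximity": 1.0 - min(asset["distance_km"] / 100.0, 1.0),
        "availability": 1.0 if asset["available"] else 0.0,
    }
    return sum(WEIGHTS[name] * value for name, value in factors.items())

# Example: an advanced, available asset 40 km away, assigned to a task needing
# intermediate capability, scores 0.5 + 0.18 + 0.2 = 0.88.
```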

Potential Future Directions

One promising avenue for further exploration involves enhancing the modeling of unpredictable changes. Requesting user input for state modifications would be inefficient during training, particularly if the problem scales up or more training iterations are conducted. An interesting observation from the current implementation is that asset allocation is heavily concentrated in the first few time steps. This pattern likely stems from a lack of accounting for changes in asset capabilities, which significantly inflates the cost of allocating assets to tasks that demand advanced capabilities. An alternative strategy for introducing uncertainty could involve modeling changes internally, using metrics like the asset replacement rate.
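For instance, uncertainty in asset capabilities could be modeled internally as a simple stochastic degradation-and-replacement process; the probabilities below are placeholder assumptions.

```python
import random

DEGRADATION_PROB = 0.05   # chance an asset loses one capability level per time step (assumed)
REPLACEMENT_PROB = 0.02   # chance a degraded asset is replaced at full capability (assumed)

def evolve_asset(asset: dict, max_capability: int = 2) -> dict:
    """Randomly degrade or replace an asset's capability at each time step."""
    if asset["capability"] > 0 and random.random() < DEGRADATION_PROB:
        asset["capability"] -= 1
    elif asset["capability"] < max_capability and random.random() < REPLACEMENT_PROB:
        asset["capability"] = max_capability
    return asset
```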

A related line of future work concerns improving the simulator's performance. Hyperparameter tuning could yield higher rewards, while additional training iterations would produce a more comprehensive Q-Table. Due to existing resource constraints, training was capped at 10,000 iterations; with access to greater computational resources, extending training could further improve the learned policies. Furthermore, investigating the incorporation of neural networks or transformers could expedite the training process.


Conclusion

Overall, working on this project has been an enriching and enjoyable experience. Unlike the AI projects I have worked on previously, this project involved building a simulator from scratch, which was a novel experience for me. Through this process, I also gained a more in-depth understanding of RL and acquired new technical skills such as building dashboards with Dash. Furthermore, it was interesting to pick up some domain knowledge before implementing the various components.

Megan Seah (second from left) with her team from Digital Hub

References

Simonini, T. (n.d.). An introduction to Q-learning part 2/2. Hugging Face — The AI community building the future. https://huggingface.co/blog/deep-rl-q-part2

Icons in figures were obtained from https://icons8.com/.
