Deep Reinforcement Learning for Network Design in Marine Transportation

Timothe Boulet
InstaDeep
Mar 6, 2023

As the international economy faces new challenges and markets continue to grow, designing efficient maritime routes has become a major logistical issue, involving huge costs for shipping companies¹, and has been extensively studied in the literature². Network Design in Marine Transportation is an example of an essential but complex optimization problem.

This article is structured as follows. We first explain the three interdependent problems that build on each other: Empty Container Repositioning (ECR), Fleet Deployment (FD), and finally Network Design (ND). We then present the Configure & Conquer approach and extend it to the Network Design setting. Next, we show how Reinforcement Learning can be used for Network Design and what its unique benefit is compared to other optimization methods. Finally, we present the experimental results.

Work conducted during an internship at InstaDeep

In 2021, an estimated 90% of the world’s goods were transported by sea, 60% of that being packaged in large steel containers³. This makes the choice of routes, the deployment of vessels on those routes, and the repositioning strategy adopted by these vessels very important for shipping companies. There are three hierarchical levels of optimization for this problem⁴:

  • Network Design (ND): Given a set of ports (with some demand for goods between ports that varies over time), how do I create a network n of maritime routes?
  • Fleet Deployment (FD): Given a network of routes between ports and a fleet of vessels, how do I choose the assignment R of the vessels to the routes?
  • Empty Container Repositioning (ECR): Given a network of ports, routes, and vessels, what repositioning strategy π do I choose for my vessels to maximize the fulfilment percentage f? The repositioning strategy dictates how many empty containers our ships load/unload when they arrive at a port.

These three interdependent problems can be formalized as follows:

The three-stage Marine Transportation optimization problem
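Written with the notation above, one consistent way to express the formula in the figure (the exact notation in the original image may differ) is the nested problem:

max over n ( max over R ( max over π f(π, n, R) ) )

where the outer maximization is Network Design (the choice of the network n of routes), the middle one is Fleet Deployment (the assignment R of vessels to routes), and the inner one is the choice of the ECR repositioning strategy π.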

1. The Empty Container Repositioning problem

The efficient allocation of empty containers across the ports of a marine transportation network (having the right amount of containers in the right place at the right time) is vital for fulfilling orders of goods, since large empty steel containers are needed to transport the ordered goods. This is called the Empty Container Repositioning (ECR) problem, which is difficult to deal with due to the complexity of its intertwined processes (market demand and supply of goods, weather conditions, shipping technology, etc.) and its unpredictable environment, as illustrated in March 2021 when the Suez Canal was blocked for six days after the grounding of the Ever Given ship.

An instance of the ECR problem is an environment configured by a network, i.e. a set of ports connected by the vessels’ routes (denoted n), and by the deployment of vessels on the routes (denoted R). The objective of the ECR agent A is to minimize the shortages of goods due to imperfect repositioning of empty containers, i.e. to maximize the fulfilment ratio:

The fulfilment percentage for the ECR problem

where “shortages” denotes the shortages obtained on network n with assignment R and ECR strategy A.
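One consistent way to write this fulfilment percentage, given the definitions above (the original figure may use a slightly different form), is:

f(A, n, R) = 1 − shortages(A, n, R) / total orders

so that a fulfilment of 1 means every order of goods was satisfied.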

This repositioning is handled by an ECR algorithm, which we denote A. There are various algorithms we can use for this:

  • Random: takes a random number of containers to load/discharge each time a vessel arrives at a port
  • No Repositioning: does not load/discharge any empty containers
  • Import Export: loads or discharges a random quantity depending on whether the port is an exporting port (discharge, because the port will need those containers) or an importing one (load, because the port has too many containers); a minimal sketch of this rule is given after the list.
  • Operations Research: Traditionally, ECR problems were solved with Operations Research (OR) approaches. Linear Programming (LP) is used to model the constraints and objectives of the repositioning problem, such as the number of empty containers available or the demand for containers at different locations. LP can then find the plan that minimizes the total cost of repositioning.
  • Operations Research Iterative (ORI): We apply the OR plan for a certain number of steps, then re-compute a new plan, and iterate like this until the end of the simulation. This prevents the plan generated by OR from drifting away from reality as time goes by.
  • Multi-agent RL: We can model the problem as a multi-agent environment and apply multi-agent RL to it (where the agents are either the ports or the vessels). This gives the advantage of generalization for the agents. This strategy was used during previous work at InstaDeep⁵. To avoid any confusion: in this article we use RL for Network Design, not as a multi-agent ECR strategy.
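To make the Import Export rule concrete, here is a minimal sketch in Python; the attribute names such as is_exporting and empty_containers are purely illustrative assumptions, not MARO's API:

import random

def import_export_action(port, vessel):
    """Toy Import Export rule. `port.is_exporting`, `port.empty_containers`
    and `vessel.empty_containers` are hypothetical attributes used only
    for illustration."""
    if port.is_exporting:
        # An exporting port will need empties to ship goods: discharge some.
        return "discharge", random.randint(0, vessel.empty_containers)
    # An importing port accumulates empties: load some back onto the vessel.
    return "load", random.randint(0, port.empty_containers)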

To conduct our experiments, we used MARO⁶, a simulator of Marine Transportation dynamics.

The framework, published by Microsoft, contains several appealing features for researchers and companies that want to experiment on a variety of problems. MARO provides the designer with different logistics networks on which to test their methods. Simple toy domains are available alongside global trade scenarios. More information about both the MARO simulator and the ECR formulation can be found in the previous article⁵.
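As an illustration of the simulator's interface, here is a minimal episode loop for MARO's container-inventory-management (CIM) scenario, adapted from MARO's public quick-start examples; the toy topology name and the exact signatures are assumptions and may differ between MARO versions:

from maro.simulator import Env

# A small toy CIM topology; larger global-trade topologies also ship with MARO.
env = Env(scenario="cim", topology="toy.5p_ssddd_l0.0", start_tick=0, durations=100)

metrics, decision_event, is_done = env.step(None)
while not is_done:
    # Passing None at every decision event corresponds to the No Repositioning baseline;
    # a real ECR strategy would build a CIM Action (how many empties to load/discharge)
    # from decision_event instead.
    metrics, decision_event, is_done = env.step(None)

print(metrics)  # contains the shipment/shortage statistics behind the fulfilment percentage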

This is the fulfilment percentage obtained by various ECR algorithms on WTT1, a large topology (46 vessels and 22 ports) provided by MARO itself and intended to represent actual maritime routes:

Fulfilment percentage for some ECR algorithms

Import Export and, in particular, ORI are costly algorithms but generally give better results (across all topologies). Random and No Repositioning are faster but give the worst results. OR is also fast but is hardly applicable online because it only finds an initial plan, which quickly drifts away from reality.

Note that it is not necessary for the reader to understand the ECR problem perfectly, as this article’s main purpose is to use RL on the two higher-level problems, and an ECR agent’s performance can be seen as a black-box function of a network n.

2. The Fleet Deployment problem

Global trade environments in which service routes among ports are highlighted

Even if you have a very good ECR agent, the assignment of vessels to routes is also decisive in our problem. You can have very smart vessels, but if almost none of them sail the most important routes, you won’t satisfy the orders. That is why we need to optimize the assignment of vessels to the network. For the previous ECR results, we used the assignment of the original MARO topology, which is supposed to reflect the reality of maritime networks, but it should be possible to assign vessels better than in this default topology.

The problem is known as Fleet Deployment (FD): given a network of ports and routes n, how do I assign the vessels of my fleet to these routes? Once you have assigned your vessels and obtained the final network n and deployment (denoted R), you run an ECR agent A on the network to obtain your score: f(A, n, R). The problem can be formalized the following way: if you have k routes numbered 1, 2, …, k and d vessels V1, V2, …, Vd, you aim to:

The Fleet Deployment (FD) problem
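In symbols, for a fixed network n and ECR algorithm A, this amounts to (one consistent way to write the formula in the figure):

R* = argmax over assignments R of the vessels V1, …, Vd to the routes 1, …, k of f(A, n, R)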

This problem can be solved as a one-stage optimization process using Mixed Integer Linear Programming⁷ or as a two-stage optimization process² using a metaheuristic at the top level and an ECR agent A at the low level.

This approach can be seen as a black-box optimization problem and can be solved by any local search algorithm (i.e. metaheuristic) such as a Genetic Algorithm (GA) or Tabu Search. In particular, this problem has been tackled with Reinforcement Learning (RL) as the metaheuristic in this paper⁸. The reasons for using RL instead of a classical metaheuristic are explained in section 5. In our case, we extend the use of RL as a metaheuristic not only to Fleet Deployment but also to Network Design, the top-level optimization problem.

3. The Network Design problem

Network Design is a generalization of the FD and ECR problems and consists of also optimizing the routes themselves, not only the assignment of vessels to those routes. We are only given a set of ports (without the routes between them this time), and we must create the routes, then assign the vessels, and finally choose the strategy of our vessels.

Similarly, it gives us a three-stage optimization process:

The ND problem

But this three-stage approach makes the global optimization process very slow and is not the most practical way to proceed. It is much more convenient to build the routes at the same time as we assign the vessels, i.e. to build a new route for each vessel. Therefore, we will denote this combined problem ND+FD, still call it Network Design (the problem of creating a route for each vessel so as to maximize the ECR performance), and denote the chosen assignment by n. Additionally, we won’t optimize at the ECR level but rather always use a fixed ECR algorithm denoted A.

This gives us a black-box optimization problem:

The ND+FD problem
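In symbols, for a fixed ECR algorithm A, this amounts to finding:

n* = argmax over networks n of f(A, n)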

An instance of Network Design is a set of ports and a set of vessels (a fleet). The ND algorithm will assign each vessel to a route in the network. A route is an ordered list of ports, starting from an initial port, that the vessel visits cyclically. For example, the route [A, B, C, D] means the vessel starts at A and follows A -> B -> C -> D -> A -> …

Once every vessel is assigned to a route, we obtain a network (denoted n) and use an ECR algorithm to compute the performance. The goal is to find the network n* that maximizes f(A, n), where A is the ECR algorithm used. From the perspective of the ND agent, the signal n -> f(A, n) appears as a black-box function to optimize, so we are reduced to the blind optimization problem of maximizing f(A, ·). Therefore, the natural family of methods to solve this is metaheuristics, i.e. local search methods. We can use any metaheuristic (denoted M), such as Hill Climbing, Simulated Annealing, or a Genetic Algorithm. However, we do not use a classical metaheuristic but rather a Reinforcement Learning process; the reasoning for that is explained below in section 5.

4. Configure and Conquer approach

The choice of the A algorithm is a trade-off between computational/sample efficiency and performance. Indeed, since A is used to train the ND algorithm, the ideal method should be fast to compute and provide good approximations of the performance of the optimal ECR strategy. The faster the method, the more networks we can evaluate in a reasonable amount of time to find n*; however, this usually comes at the cost of precision, which might harm the optimization of f(A, n). For these reasons, depending on the problem, one might use heuristics, mathematical programming, experts, multi-agent reinforcement learning agents (as vessels or as ports) trained on limited data, and so on.

The Configure and Conquer (CC) approach⁸ consists of using a fast ECR algorithm (denoted Ac) for the signal sent to the ND metaheuristic, so that it optimizes f(Ac, n), and a performant ECR algorithm (denoted Ae) for the deployment of the actual ECR policy on the network found by the metaheuristic M. Ac is therefore the signal algorithm and Ae the evaluation algorithm. We denote this general approach CC(M-Ac-Ae) and give its general pseudo-code here:

Require: ND metaheuristic M, ECR algorithms Ac & Ae

1) networks <- get_initial_networks(M)
2) fitnesses_Ac <- f(Ac, networks)
3) for n_iterations iterations:
   3.1) networks <- train_step(M, networks, fitnesses_Ac)
   3.2) fitnesses_Ac <- f(Ac, networks)
   3.3) if evaluation is needed:
            fitnesses_Ae <- f(Ae, networks)
4) network* <- best network in networks according to fitnesses_Ac
5) fitness* <- f(Ae, network*)

5. Reinforcement Learning as a Metaheuristic for Network Design

We first explain why RL gives a unique advantage for this problem in comparison to classical metaheuristics.

The main difference between RL and metaheuristics such as GA is that once the RL policy is trained, it can be used instantly to assign vessels to the network, whereas with GA you need to re-train from scratch each time you face a different instance of the ND problem (different ports or different vessels). Imagine now that the shipping company adds a new vessel to its fleet, or a new port to its network: it has to re-train GA, which can be very inconvenient since its training time scales very badly with the size of the network. With RL, however, you can train your agent not to assign a particular fleet to a particular network, but any fleet to any set of ports; once trained, you obtain a very robust policy that can be used almost instantly to give shipping companies a network along with its performance. This could also be used by shipping companies to evaluate how cost-effective it would be to purchase new vessels, and even to create an additional optimization level this way.

Additionally, defining the local search parametric space for the ND problem requires taking constraints into account and scales poorly. For example, we could define a parameter is_edge(vessel, port1, port2) for each vessel and each pair of ports. The parametric space would then be of dimension n_vessels*n_ports² = 22264 for the WTT1 topology. With RL, it is easier to build a valid network using action masking for illegal actions, as described below. The fact that the action-masked MDP is an extremely general framework makes RL a very good candidate for many optimization problems.

In summary, we are using Deep RL for Network Design because:

  • It can be trained on many ND instances, with different sizes, and then instantly deployed on any of those (unlike metaheuristics)
  • By building the network sequentially, it scales extremely well with the size of the problem
  • Constraints are easy to implement using action masking

Reinforcement Learning is both trainable and adapted to black-box functions. Additionally, because of its general nature, it is very easy to implement the environment for many optimization problems.

6. The RL environment for Network Design

We now describe how to create an RL environment for our problem in the OpenAI Gym⁹ framework. RL has already been used for Network Design on real-world graphs¹⁰; we use it here for our particular problem of Marine Transportation.

The transitions of the Network Design RL environment

In each episode, the agent has to assign a fleet of vessels, in a random order, to a set of ports. The idea is to start with an empty graph (a set of unconnected ports) and sequentially build the vessels’ routes. The first vessel is sampled randomly from the fleet, and the first action corresponds to choosing a starting port for this vessel. The next actions choose the next ports on the vessel’s route. An additional route-ending action ends the route, making it cycle back to its starting port. Once the episode ends, i.e. when all vessels are assigned, the Ac agent is used to compute the sparse final reward f(Ac, n).
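To make this concrete, here is a minimal sketch of what such an environment could look like with the classic Gym API; the class and attribute names, the placeholder observation, and the way the Ac signal and the reference score are injected are our own illustrative assumptions, not the exact code used in the experiments:

import numpy as np
import gym
from gym import spaces

class NetworkDesignEnv(gym.Env):
    """Sketch of the ND environment: one episode = building a route for every vessel.

    `ports`, `vessels`, `ecr_signal` (a function n -> f(Ac, n)) and
    `reference_score` (f(Ac, n_ref) of the MARO reference network) are assumed
    to be supplied by the surrounding experiment code."""

    def __init__(self, ports, vessels, ecr_signal, reference_score, obs_dim=128):
        super().__init__()
        self.ports, self.vessels = ports, vessels
        self.ecr_signal = ecr_signal
        self.reference_score = reference_score
        self.n_ports = len(ports)
        # One action per port, plus one route-ending action.
        self.action_space = spaces.Discrete(self.n_ports + 1)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(obs_dim,), dtype=np.float32)

    def _observe(self):
        # Placeholder: the real observation concatenates the encoded vessel,
        # port and partial-network features described below.
        return np.zeros(self.observation_space.shape, dtype=np.float32)

    def reset(self):
        self.order = list(np.random.permutation(len(self.vessels)))  # random assignment order
        self.routes = {v: [] for v in range(len(self.vessels))}
        self.current = self.order.pop(0)
        return self._observe()

    def step(self, action):
        if action < self.n_ports:
            # Choose `action` as the next port on the current vessel's route.
            self.routes[self.current].append(action)
            return self._observe(), 0.0, False, {}
        # Route-ending action: the route cycles back to its starting port.
        if self.order:
            self.current = self.order.pop(0)
            return self._observe(), 0.0, False, {}
        # All vessels assigned: sparse reward normalized by the reference network.
        reward = self.ecr_signal(self.routes) / self.reference_score
        return self._observe(), reward, True, {}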

During the episode, the observation must give information both about the episode characteristics (the ports and the fleet of vessels) and about the current state of the network being built (the remaining vessels, the current vessel, and the vessels already assigned in the network).

Actions

The action space is a discrete space of size n_ports + 1; each action corresponds to choosing the corresponding port as the next port on the current vessel’s route. The last action corresponds to ending the route by closing the port cycle. Illegal actions in a given state (such as selecting a port already present in the route) are masked by an action mask so that they are never taken.
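For instance, under the same assumptions as the environment sketch above (and assuming a route cannot be closed while it is still empty, which may differ from the exact constraints used), the mask could be computed as:

import numpy as np

def action_mask(current_route, n_ports):
    """True = action allowed, False = masked out."""
    mask = np.ones(n_ports + 1, dtype=bool)
    mask[list(current_route)] = False       # a port cannot appear twice in the same route
    mask[n_ports] = len(current_route) > 0  # the route-ending action needs a non-empty route
    return mask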

Reward

A sparse reward is obtained only at the end of the episode, once the whole network has been created; a reward of 0 is returned at every other step of the episode.

The final reward of the Network Design RL environment

To normalize the reward, we divide the Ac signal of the network built by the agent by the Ac signal of the reference network provided by MARO, which is supposed to correspond to the actual maritime network.

The reward in the RL environment
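Written out, the reward at the final step T is therefore (one consistent way to express the formula in the figure):

r_T = f(Ac, n_agent) / f(Ac, n_reference),   and   r_t = 0 for every t < T

where n_agent is the network built during the episode and n_reference is the default MARO network.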

The goal therefore becomes to obtain a reward of 1 or higher. Indeed, if we manage to build a network at least as good (in the sense of Ac) as the reference network, which reflects the networks used in the actual world, it means our RL agent generates networks that are at least as efficient, and it would show that RL can be used for Network Design, with all the advantages of robustness and potential generalization previously mentioned.

Observations

To have a fleet-agnostic agent, we must provide information about the current episode but also the state of the network. Relevant features that can be added to the observation are:

  • The vessels: a batch of vectors giving the characteristics (the vessel features, in blue) of each vessel in the fleet (e.g. speed, capacity…). This is useful if the fleet changes between episodes. This was not the case in our experiments (only the order of assignment changed), but it may be a generalization of our problem.
  • The remaining vessels: a batch of vectors corresponding to the vessels yet to be assigned. This gives information about how much the agent will be able to impact the network later.
  • The assigned vessels in the ports: this gives information about how the set of ports is affected by the already assigned vessels. This time we need to give not only the vessel features but also their status in the network, so for each vessel we have (in green):
    - The adjacency matrix of the vessel’s route
    - The vessel’s starting port, one-hot encoded
    - The vessel feature
  • The current vessel: the same three components as the previous feature, but only for the current vessel.
  • Port features: a batch of vectors describing the ports. This is useful if the set of ports is changing through episodes. Again, this was not the case for our experiments but could become the case for a more general setting.

These features are then flattened, concatenated, and given as input to the RL agent along with an action mask. The RL agent we used is PPO¹¹ with action masking. The network architecture is a classical dense neural network, with the policy and state-value heads sharing the same backbone.

The masking PPO agent for Network Design
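The article does not name the PPO implementation used; as one possible setup, building on the environment and mask sketches above (so NetworkDesignEnv, action_mask and the constructor arguments are our own illustrative names), sb3-contrib’s MaskablePPO handles exactly this kind of masked discrete action space:

from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker

def mask_fn(env):
    # Reuse the environment's own mask, e.g. the action_mask helper sketched above.
    return action_mask(env.routes[env.current], env.n_ports)

env = ActionMasker(NetworkDesignEnv(ports, vessels, ecr_signal, reference_score), mask_fn)
model = MaskablePPO("MlpPolicy", env, verbose=1)  # shared MLP backbone for the policy and value heads
model.learn(total_timesteps=1_000_000)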

There is one more difficulty with this way of creating an environment. You might have noticed that some features are not of constant shape. If you change the number of vessels, the size of the “Vessels” observation changes, for example. The “Remaining Vessels” and “Assigned Vessels” features even change shape during an episode. This is a problem because classical RL agents can’t train on observations of non-constant shape. We could define a maximum number of vessels and use zero padding to fill in the missing vessels, but this limits our framework (while the point of doing RL is to be able to generalize), and most importantly, it does not respect the fact that the way the agent sees the fleet should be permutation-invariant: if we swap two vessels, the observation should not change.

The way we treat the information of “batchable” features, such as the remaining vessels, should be:

  • Permutation invariant: if you switch two vessels, the agent should interpret it the same way
  • Size agnostic: the agent should be able to deal with any number of vessels

A solution is to find operations that are both permutation-invariant and whose output shape is fleet-size agnostic. Such operations include max, min, sum, and mean. We do not use these operations directly on the vessel features, but rather on the encodings of those vessels produced by a dense neural network. The idea is that the encoder will learn to transform the vessel features into something whose max, min, sum or mean extracts useful information.

A batch of V vessels to order agnostic and size-agnostic features

We use one encoder for each “batchable” feature, i.e. a feature that is meant to be taken as a batch of vessels (or of something else). In our setting, this means one encoder for the Vessels, one for the Remaining Vessels, and one for the Assigned Vessels.
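For illustration, here is what one such encoder could look like in PyTorch; the framework and layer sizes are not specified in the original work and are assumptions here:

import torch
import torch.nn as nn

class BatchableFeatureEncoder(nn.Module):
    """Encodes a variable-size batch of vessel feature vectors into a fixed-size,
    permutation-invariant summary by pooling over the vessel axis."""

    def __init__(self, vessel_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(vessel_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, vessels: torch.Tensor) -> torch.Tensor:
        # vessels: (num_vessels, vessel_dim); num_vessels may vary between calls.
        encoded = self.encoder(vessels)               # (num_vessels, hidden_dim)
        # max/mean/sum over the vessel axis are permutation-invariant and size-agnostic.
        return torch.cat([encoded.max(dim=0).values,
                          encoded.mean(dim=0),
                          encoded.sum(dim=0)], dim=-1)  # (3 * hidden_dim,)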

A more sophisticated architecture for the batchable features could be the transformer encoder and its attention mechanism¹². Indeed, transformers are an architecture that can take batches of any size as input (e.g. a sequence of words, or a fleet of vessels) and output a constant-size representation (e.g. for sentiment classification, or a vector describing the fleet). The observation features could also be improved: instead of flattening the observation into a vector, we could keep it as a graph and use a Graph Neural Network¹³ to process it.

7. Results

We show the Ae evaluation throughout the training of the RL agent, which is trained using the Ac signal as the reward. All of this is done on the realistic and complex WTT1 topology.

Remember our goal: reaching a reward of 1 for an algorithm A means our agent generates networks as efficient as the one used in reality.

Signal algorithm Ac = random — Evaluation algorithm Ae = OR(I)
Signal algorithm Ac = random — Evaluation algorithm Ae = Import Export
Signal algorithm Ac = no repositioning — Evaluation algorithm Ae = Import Export
Signal algorithm Ac = no repositioning — Evaluation algorithm Ae = OR(I)

We see that the RL agent manages to generate networks as good as the reference ones for both the Import Export and OR(I) ECR algorithms, using a signal as poor (but fast) as the Random agent or the No Repositioning (NR) agent. Only the NR signal appears somewhat insufficient in the Ae = ORI setting, mostly due to a lack of training time.

The results are summarized in the following table:

Results for RL-Ac-Ae

8. Conclusion

As we can see, it is possible to use Reinforcement Learning agents for such optimization problems. In our case of Marine Transportation, the agent was able to generate networks of the same quality as the default networks used in the actual world, and can even produce better networks if given enough training time. We have thus shown that RL can be used for Network Design just like any other metaheuristic, with the advantage of being trainable, and thus re-usable on variations of the problem. This supports the idea that RL is a very general framework that can adapt to many problems.

It should be possible to build an agent that adapts to any set of ports and any fleet, but in our work we restricted ourselves to a fixed set of ports and a fixed fleet (although vessels are assigned in a different order each episode). This is only the first step toward port- and fleet-agnostic RL agents. In future work, we aim to train a more general agent that can assign almost instantly any fleet to any set of ports. Eventually, this approach could be extended to optimization problems other than Network Design, and to domains other than Marine Transportation.

References

  1. Song, Dong-Ping, and Jing-Xin Dong. “Empty container repositioning.” Handbook of ocean container transport logistics (2015): 163–208.
  2. Marielle Christiansen, Erik Hellsten, David Pisinger, David Sacramento, Charlotte Vilhelmsen, Liner Shipping Network Design, European Journal of Operational Research (2019), doi: https://doi.org/10.1016/j.ejor.2019.09.057
  3. Our economy relies on shipping containers. This is what happens when they’re “stuck in the mud” https://www.weforum.org/agenda/2021/10/global-shortagof-shipping-containers/
  4. Rahime Neamatian Monemi, Shahin Gelareh, “Network design, fleet deployment and empty repositioning in liner shipping” Transportation Research Part E 108 (2017) 60–79
  5. Riccardo Poiani. “Multi-Agent Reinforcement Learning for Resource Balancing in Marine Transportation” (2021) https://medium.com/instadeep/multi-agent-reinforcement-learning-for-resource-balancing-in-marine-transportation-2c2b3513eee5
  6. Arthur Jiang et al. “MARO: A Multi-Agent Resource Optimization Platform” (2020). GitHub page.
  7. Huang, Y.-F., et al. “Liner services network design and fleet deployment with empty container repositioning”. Computers & Industrial Engineering (2015), http://dx.doi.org/10.1016/j.cie.2015.01.021
  8. Riccardo Poiani, Ciprian Stirbu et al. “Optimizing Empty Container Repositioning and Fleet Deployment via Configurable Semi-POMDPs” (2022) https://arxiv.org/abs/2207.12509
  9. Gymnasium (a maintained fork of OpenAI Gym): a standard API for reinforcement learning. https://gymnasium.farama.org/
  10. Darvariu, Hailes, and Musolesi. “Goal-directed graph construction using reinforcement learning” (2020). https://arxiv.org/pdf/2001.11279
  11. Schulman, John, et al. “Proximal policy optimization algorithms.” arXiv preprint arXiv:1707.06347 (2017).
  12. Vaswani et al. “Attention Is All You Need” (2017) https://arxiv.org/abs/1706.03762
  13. Dwivedi et al. “Benchmarking Graph Neural Networks” (2020) https://arxiv.org/abs/2003.00982

Many thanks to my supervisor Ciprian Stirbu and InstaDeep for providing me with the opportunity and resources to explore such a challenging and interesting topic.
