# AI: Linear Reward Inaction Applied to Vehicle Routing Problem

# Introduction

From April to September 2020, I focused my artificial intelligence research work on Urban Logistics with Paris Saclay University’s lab for intelligent smart cities, Laboratory DAVID. In collaboration with the startup Urban Radar, I have focused on a growing problem that cities are facing : congestion and air quality impact from Urban Freight. Given the growing need to improve operations in increasingly complex city logistics, it is worth comparing how different optimization and AI algorithms perform under real scenarios. The subject was then to apply and compare two methods of specific vehicle routing problem instances in order to improve delivery efficiency by identifying the optimal set of routes taking into account the total length of all routes. There was one method of optimization and one reinforcement learning method. I will present how I chose to implement linear reward inaction to a vehicle routing problem, I will make the journey simple and explain every step, complex or not.

# The Vehicle Routing Problem

The Vehicle Routing Problem or VRP is one of the most studied classes of combinatorial optimization problems since its introduction in 1959 by Dantziget and Ramser. The VRP is a complex nomination for a simple task: “Finding the optimal set of routes for vehicles” applied here to satisfy the customers’ demands or simpler delivery. Multiple variants such as capacity, time-windows or maximum length of a route exist.

In our model we decided to take into consideration a situation in which several vehicles from the same company :

- Are initially distributed among multiple warehouses
- Must deliver a set of parcels distributed among these depots

We then wanted to take into account the constraints of :

- Volumes and weight
- Also time-window constraints.

Furthermore, there is a possibility for any vehicle to pick up any package from any warehouse. However, a vehicle cannot transfer a package from a warehouse to another one.

# Reinforcement learning

Reinforcement learning (RL) methods are part of the machine learning methods. RL can be described as “the problem faced by an agent that must learn behaviours through trial-and-error interactions” which can be complexified with a dynamic environment.

There are many RL methods and we decided to implement the Linear Reward Inaction (LRI) method, described by Sastry et al. in 1994.

# Linear Reward Inaction

Linear Reward Inaction (LRI) is a reinforcement method based on a reward system in which several players (agents) are driven to minimize or maximize a common cost, each agent using a stochastic vector of strategy. Each strategy vector represents the possibilities of actions that a player has. All players learn at the same time and try strategies to achieve their objective. When a solution is found to be good given a strategy, its utility improves, and that strategy will be rewarded. However, if a strategy is wrong and doesn’t improve the utility, the vector is simply not updated (or only slightly). LRI requires a large number of iterations before converging to good results. Here is a representation of the algorithm I used:

At each iteration, each player draws a strategy among itsown strategy vectors. Then for each player, we calculate its gain (also called utility) which is dependent on the strategy used. Finally each player updates its strategy vector following the update rule of the LRI :

With :

# Implementation

We modeled our VRP as an N-players game, with N packages in the instance. At each iteration the players will pick a vehicle randomly. We initiate each value of each strategy vector by 1/K, with K the number of vehicles.

Each iteration begins with each player choosing a strategy randomly drawn according to the values of the strategy vector. We then solve for each vehicle a VRP with capacity, time window and time limit constraints. In order to do this for each vehicle, the best possible choice of addition to each iteration is added to the route of that vehicle. This process is repeated until all the deliveries that have chosen that vehicle have been added to its route. At the same time as each delivery is added to a route, a pass through its original depot is added if necessary and is taken into account when calculating the best possible choice. The addition of a passage through a depot is necessary if the vehicle’s route does not yet pass through this depot or if the capacity of the vehicle is already saturated.

Once all vehicle routes have been built, we have the total cost of the routes. This total is used to calculate the utility of the current LRI iteration that is necessary to update the vehicle strategy vectors according to the update rules described above. This process of relaying between the LRI and the resolution of these VRPs at each iteration, makes it possible to converge towards an equilibrium between the players.

# Results

We can easily plot the value found by our algorithm at each iteration of our LRI script. The value representing the total length in metres travelled by all vehicles and since we are in a minimizing problem here, we hope these values will converge to the lowest possible level.

Here is an example of a graph from an instance with 100 Deliveries spread initially among 3 warehouses. 5 vehicles are used to satisfy those deliveries. We then simulated 500.000 iterations of our script to let him the time to converge with a learning parameter b = 0.003.

We observe in red the evolution of the value of LRI through iterations. With the green line representing the best value found by LRI and the blue line is the value found by a simple descent method. It’s important to note that we’re registering the minimal value and not the convergence value. We can also take a comparison between LRI and an other method named simulated annealing :

Here, simulated annealing converges faster to a value similar to LRI but it’s not better.

It takes a little bit less than one hour to execute those 500 000 iterations of LRI but we could have stopped the execution when we saw that the values were already converging (between 300 000 and 400 000 iterations). This execution time is already the result of a few optimizations and I’ll be glad to talk about those in a future article.

Focusing back on the results, the LRI is already better than a simple method of descent and in the worst case equivalent to simulated annealing. It might be interesting to compare results to more methods on existing open data instances however it is hard to find exactly an instance matching our constraints.

# Conclusion

Nowadays, and especially in France, urban logistics of the last mile represents up to 50% of the delivery cost and represents up to 30% of the total cost of a product. Last mile deliveries are also responsible for 30% of greenhouse gas emissions in the cities. That’s why further research about VRP is important : There are lots of economical and environmental potential gains for cities and the logistics enterprises.

Also, the representation of the VRP as a multiplayer game seems to be a good intuition and will open this old problem to new methods using artificial intelligence such as deep learning or neural networks. Here the LRI got some good results but took a really long time to execute. When you’re solving instances of VRP, you really want to have the results in less than 10 minutes for any size of instances. In fact the goal of those methods are to be applied to real life instances to manage a huge fleet for multiple logistics actors. Fortunately this time can be optimized with further research and I stay positive about this method.