Reinforcement Learning in the Supply Chain

A proof-of-concept example for inventory management

Andy McMahon
streamba
5 min read · May 30, 2018

Supply chains are complicated. Very complicated. In fact, they aren’t even chains at all: they are intricate networks of information and material flows, constrained in a myriad of ways and containing actors who push against those constraints to maximise their own gain whilst still maintaining some sort of continuity in the system. The fact that these networks don’t constantly come crashing down is a miracle in itself, but in the age of machine learning, surely we can do better than helping them survive. How do we make them thrive?

At Streamba, we have recently been applying reinforcement learning to the supply chain to create artificially intelligent agents that can look at the state of the world and act in ways that optimise the performance of these networks, taking us from surviving to thriving. Here, we’ll briefly explain what reinforcement learning is and describe a simple supply chain application.

Reinforcement Learning

Reinforcement learning can be thought of as an agent interacting with an environment. The agent performs actions, receives rewards, and builds up a picture of the environment (and of how its actions affect it) through observations. The key element of reinforcement learning, and what makes it different from other machine learning methods, is that learning occurs through an interaction loop (Fig 1) in which the agent learns which actions lead to high or low rewards given its current state. In this way, learning equates to the agent attempting to maximise its expected future reward.

Fig 1: Reinforcement learning model interaction loop.
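
In code, this interaction loop boils down to something very simple. Here is a minimal sketch; the `env` and `agent` interfaces are illustrative placeholders, not our actual implementation:

```python
def run_episode(env, agent, n_steps=100):
    """Run one agent-environment interaction loop (Fig 1)."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(n_steps):
        action = agent.act(observation)              # agent chooses an action given its current state
        next_observation, reward = env.step(action)  # environment responds with a new state and a reward
        agent.learn(observation, action, reward, next_observation)  # agent updates its picture of the world
        observation = next_observation
        total_reward += reward
    return total_reward
```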

Inventory-Ordering Problem

To test the potential of reinforcement learning for supply chain problems, we started with a simple ‘toy problem’. Reinforcement learning has many pitfalls, and it is wise to introduce complexity slowly to make sure things are working as they should. At Streamba we call this ‘test-driven AI’, and we think it is an important part of taking reinforcement learning from concept to production on a feasible timescale.

Our simple problem is a variant of one often encountered in supply chains, especially in oil and gas: inventory management. Orders come in from a customer, and our agent’s role is to meet the customer’s demands, i.e. supply the customer with the number of items they want. These items can be anything (beer bottles, drill bits, or laptops, for example) as long as they pass through a link in the supply chain our agent can control. To meet the customer’s demand, our agent must ensure it has enough items in its inventory, and it does this by ordering a number of items from a supplier.

Fig 2: Supply chain inventory-management model.

Obviously, not meeting customer demand is bad, and our agent should be penalised (i.e. given a negative reward) when it fails to do so. One way to avoid this penalty would be for the agent to order lots of items, but this also carries a cost (another negative reward), as there will be excess inventory which, in real life, would sit there unused.

In supply chain jargon we would expect an intelligent agent to implement ‘just-in-time’ (JIT) delivery — order the stuff it needs at the last possible moment while still meeting customer demand. Now we just need to formulate this properly as a reinforcement learning problem.

Building an Environment

To capture all of the relevant aspects of our problem we engineered a simulation environment, a graphical representation of which is shown in Fig 3. We used the notion of effective inventory to capture both excess inventory (positive effective inventory) and unmet demand (negative effective inventory) in one variable. Items supplied by the supplier are added to the effective inventory, items ordered by the customer are taken away, and the agent’s negative reward is simply how far the effective inventory is from zero. A smart agent would want to keep its effective inventory close to zero, but our agent doesn’t know that yet; it needs to learn that for itself.

Fig 3: Implementation of a simulation environment for inventory management.
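
In code, an environment along these lines might look like the following sketch. The class name and details here are illustrative assumptions on our part, not our exact implementation; demand is sampled uniformly from {0, 1, 2}, as in the first experiment described below:

```python
import numpy as np

class InventoryEnv:
    """Toy inventory environment: one agent sitting between a supplier and a customer."""

    def __init__(self, max_demand=2):
        self.max_demand = max_demand
        self.effective_inventory = 0

    def reset(self):
        self.effective_inventory = 0
        return self.effective_inventory

    def step(self, order_quantity):
        # Customer demand: uniform over {0, ..., max_demand}.
        demand = np.random.randint(0, self.max_demand + 1)
        # Supplied items add to the effective inventory; customer orders subtract from it.
        self.effective_inventory += order_quantity - demand
        # The (negative) reward is how far the effective inventory is from zero.
        reward = -abs(self.effective_inventory)
        return self.effective_inventory, reward
```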

Training an Agent

So how do we actually build an agent that can learn a high-reward inventory management policy? Well, we explored both Q-learning and policy gradients, two common reinforcement learning algorithms. Both have their advantages and disadvantages, but the jury is still out on which algorithms are best for supply chain problems (more on this in future articles). We particularly liked policy gradients because they bring all the recent advances in supervised learning to bear on reinforcement learning problems and were super easy to implement. Essentially, we just seconded a stochastic gradient descent classifier from scikit-learn, set the classes to be the actions in our problem and the features to be our observations, and weighted samples according to a function of the reward. In the first iteration we generated customer demand by sampling uniformly from the set {0, 1, 2} and set our agent’s action space to be an order for a number of items, also from the set {0, 1, 2}. We ran it and it just worked (see Fig 4).

Fig 4: The left-hand plot shows the effective inventory over time for multiple rollouts. The right-hand plot shows the agent’s policy, i.e. the probability of ordering each number of items given the current effective inventory. The agent appears to learn to keep the effective inventory close to zero, achieving just-in-time delivery.
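
For the curious, here is a rough sketch of that trick, reusing the `InventoryEnv` sketched above. The reward-to-weight function, rollout length, and hyperparameters are illustrative guesses rather than our exact setup (note that older scikit-learn versions spell the loss "log" rather than "log_loss"):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

actions = np.array([0, 1, 2])            # order quantities the agent can choose between
policy = SGDClassifier(loss="log_loss")  # log loss gives predict_proba, i.e. a stochastic policy

env = InventoryEnv()
for episode in range(500):
    observation = env.reset()
    features, chosen_actions, rewards = [], [], []
    for _ in range(50):                   # one rollout
        x = [observation]                 # feature: effective inventory before acting
        if hasattr(policy, "classes_"):   # sample an action from the current policy...
            probs = policy.predict_proba([x])[0]
            action = np.random.choice(policy.classes_, p=probs)
        else:                             # ...or uniformly at random before the first fit
            action = np.random.choice(actions)
        observation, reward = env.step(action)
        features.append(x)
        chosen_actions.append(action)
        rewards.append(reward)
    # Shift rewards to be positive so they can serve as sample weights:
    # actions taken at high-reward steps are reinforced more strongly.
    weights = np.array(rewards, dtype=float)
    weights = weights - weights.min() + 1e-3
    policy.partial_fit(np.array(features), np.array(chosen_actions),
                       classes=actions, sample_weight=weights)
```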

OK, it’s a simple problem, but it’s cool that it was so easy to implement. At first our agent struggles to meet customer demand, but over time it learns to do nothing (i.e. order 0 items) when it has excess stock and to order 2 items (the maximum it can) when it has a backlog of orders. We explored other demand profiles (e.g. Poisson, Poisson + periodic trend) and expanded the action space, and the agent still worked (how cool is ML?).
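
For reference, those demand profiles can be generated in a few lines of numpy. The parameters below are illustrative assumptions, not the values we used; each function takes the time step t so they all share one interface, even though only the periodic variant needs it:

```python
import numpy as np

def uniform_demand(t, high=2):
    """Uniform demand over {0, ..., high}, as in the first experiment."""
    return np.random.randint(0, high + 1)

def poisson_demand(t, lam=1.0):
    """Poisson demand with a constant rate."""
    return np.random.poisson(lam)

def periodic_poisson_demand(t, lam=1.0, amplitude=1.0, period=20):
    """Poisson demand whose rate oscillates with the time step t (a periodic trend)."""
    rate = lam + amplitude * (1.0 + np.sin(2.0 * np.pi * t / period)) / 2.0
    return np.random.poisson(rate)
```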

What’s next?

We are looking to extend this work to cases involving more realistic demand profiles for customers in the oil and gas supply chain. We are also looking into the idea of re-using other machine learning models we have developed elsewhere (e.g. demand predictors) inside the agent — more on this soon!

Want to learn more?

If you’re excited about reinforcement learning, we recommend watching ‘AlphaGo’ on Netflix, or, for a view of machine learning more generally, we recently liked this article by Michael Jordan. And of course, give our next post a read!

Andy McMahon is Head of Data Science & Machine Learning @ Streamba and a PhD candidate in physics @ Imperial College London.