Optimal placement of boxes in a container: An implementation of Monte Carlo Tree Search in Python — Reinforcement Learning, Part 1

judopro
6 min read · Aug 15, 2019


Recently I came across a real-world business problem to solve. Having written software for the logistics industry for a long time, here is something we needed to optimize. Our clients provide shipping services across the world. Their customers call them with a number of boxes, packages or pallets to ship. After gathering multiple orders from many customers, our clients need to efficiently fill their container, truck or other vehicle for transportation, and they needed a way to figure out how to place the boxes so that the total volume stored is maximized. It is actually a complex problem; there are commercial solutions out there, but they use proprietary algorithms and are not available to the general public. Here I wanted to see if I could approach and solve this problem using Machine Learning and Reinforcement Learning, and share my findings.

Okay, but why Monte Carlo Tree Search? Why not a Markov Decision Process? This is where your domain knowledge comes in handy. We know that to set up a Markov Decision Process we need to provide our states, actions, transitions and rewards. At first I thought I could take that approach and take advantage of the MDP Toolbox in Python. But let's look at the nature of the problem first.

Suppose we have 200 boxes to place in an empty container. How would one do it with no optimization whatsoever, just randomly placing boxes one after another? We pick a box, put it in the container, then pick the next box and put it in the container, and so on. How many options do we have at the starting point? You can choose to place any 1 of the 200 boxes first, so you have 200 options to start with. Once you have placed that box, on the next step you can choose any 1 of the remaining 199 boxes, on the step after that any 1 of the remaining 198, and so on. So the number of different orders in which these boxes can be placed is 200!, that is 200 x 199 x 198 x 197 x ... x 3 x 2 x 1, a number with 375 digits. Calculating the total possible states, transitions and rewards for all of these options is not easy unless you have the entire Google Cloud or AWS at your disposal. So after thinking this through, Monte Carlo Tree Search (which I will refer to as MCTS) sounded like a better approach to this problem, at least given our hardware limitations. But how?
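To get a feel for how big that number is, here is a quick check in Python (just an illustration of the combinatorial explosion, not part of the solution):

```python
from math import factorial

orderings = factorial(200)   # number of distinct orders to place 200 boxes
print(len(str(orderings)))   # 375 -- the count of orderings has 375 digits
```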

I am not going to dive into the statistics and math behind it, as there are many great articles out there covering the probabilities involved (by all means, please read them and get a good understanding of the underlying concepts), but there are few example implementations in Python (see Resources for what I found), so I decided to write my own implementation to further improve my understanding of how MCTS works.

Here is how MCTS works: it runs through a certain number of iterations (training) that you specify to build a tree. Most articles start with Selection and then Expansion, but I will start the other way around, since we begin from an empty state and there is nothing to select yet. (A minimal code sketch follows the list below.)

  • Expansion: If the current node is not a leaf node, which is the case for the root node (meaning there are still more options to act on at that state), then we expand the tree by adding each available option to this node as a child node. That is how we get our first nodes under the root, the empty state.
  • Selection: At each iteration (training step), you start from the root node and pick one of its child nodes (which were added at the Expansion step) at random, then you pick one of that child's children, and so on, until you reach a leaf node where no more options are available.
  • Simulation: Once you have added a node, you run your normal policy from there. For example, if we created a couple of nodes under the root, we make those moves and then continue with our normal business logic (also called the rollout policy). In this example, it means that however many boxes our algorithm decided to place first, we place them, and then the rest are placed either randomly or sequentially until no more boxes fit in the container.
  • Backpropagation: At that point you calculate the reward accumulated so far, since you have reached a terminal state, and backpropagate that number to all parent nodes. Doing this a number of times, you end up with a tree of possible options and rewards. From there you can choose the best viable option calculated at each level of the tree.
Figure: Phases of the Monte Carlo tree search algorithm. Source: https://www.researchgate.net/profile/George_Konidaris2/publication/312172859/figure/fig1/AS:613954925625373@1523389669557/Phases-of-the-Monte-Carlo-tree-search-algorithm-A-search-tree-rooted-at-the-current.png
  • Note: Here is the interesting part, though. Once all the available options have been added under a node, at each iteration you choose the best-rewarding child of that node so far and continue from its children. But once you have determined a possible reward for every child at one level, you would normally always end up choosing the same child of that parent on later iterations. What MCTS brings to the table is randomness. When choosing the most rewarding child node from the available options (remember, the rewards at that point represent random choices taken from that node all the way to the end and backpropagated), we do NOT always grab the best-rewarding child; sometimes we grab a random child and start exploring it more (creating more tree paths), because that unvisited path may yield a better reward than the one we currently have. This is called Exploration vs. Exploitation, and we need a little bit of both to have a successful model.
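To make these four phases concrete, here is a minimal, generic MCTS sketch in Python. It is only an illustration of the loop just described, not the actual implementation from Part 2: the `Node` class, the `reward_of` callback and the `epsilon` exploration rate are placeholder names of my own, and for brevity it expands one untried child per visit instead of adding every option at once.

```python
import random


class Node:
    """One node of the search tree: a partial placement order."""
    def __init__(self, action=None, parent=None, remaining=None):
        self.action = action              # option chosen at this node (None for the root)
        self.parent = parent
        self.remaining = remaining or []  # options still available below this node
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

    def is_fully_expanded(self):
        return len(self.children) == len(self.remaining)


def mcts(root_options, reward_of, iterations=10_000, epsilon=0.2):
    """Toy MCTS loop following the four phases described above."""
    root = Node(remaining=list(root_options))
    for _ in range(iterations):
        # Selection: walk down through fully expanded nodes,
        # mostly exploiting the best child, sometimes exploring a random one.
        node, path = root, []
        while node.children and node.is_fully_expanded():
            if random.random() < epsilon:
                node = random.choice(node.children)       # explore
            else:
                node = max(node.children,                 # exploit
                           key=lambda c: c.total_reward / max(c.visits, 1))
            path.append(node.action)

        # Expansion: add one untried option as a new child, if any remain.
        tried = {c.action for c in node.children}
        untried = [a for a in node.remaining if a not in tried]
        if untried:
            action = random.choice(untried)
            child = Node(action=action, parent=node,
                         remaining=[a for a in node.remaining if a != action])
            node.children.append(child)
            node = child
            path.append(action)

        # Simulation (rollout): finish the sequence with the default policy (random here).
        rollout = list(node.remaining)
        random.shuffle(rollout)
        reward = reward_of(path + rollout)

        # Backpropagation: push the reward up to every ancestor, including the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return root
```

Here `reward_of` stands in for whatever scores a complete placement order; for our problem it would simulate placing the boxes in that order and return the total volume stored.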

How does MCTS relate to our problem?

In our case, the root node is the empty container. The options are all the boxes not yet placed at that moment, and a random option means a random box among those available. The reward in our case is the total volume of the boxes placed, so the more volume we are able to store, the more reward we get.

The rollout policy is the standard business policy, which is placing the boxes either sequentially as they appeared or randomly, and we want to see if we can improve on that. There could certainly be more optimized algorithms on the business-logic (rollout policy) side of things. Although improving the rollout policy might improve overall performance in both scenarios, spending more time on it would distract from the purpose of this article, so let's keep the rollout simple and see what our model can do with it instead.
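For illustration, here is what a deliberately simple rollout/reward function could look like. It is a hypothetical one-dimensional simplification (it only checks total volume and ignores the actual geometry of placing boxes), and the names `rollout_reward`, `box_volumes` and `container_volume` are placeholders of my own, not the code from Part 2:

```python
import random


def rollout_reward(placement_order, box_volumes, container_volume):
    """Place boxes in the given order as long as they fit (by volume only)
    and return the total volume stored, which is our reward."""
    used = 0.0
    for box in placement_order:
        if used + box_volumes[box] <= container_volume:
            used += box_volumes[box]
    return used


# Example: 200 boxes with random volumes and a container of volume 500.
box_volumes = {i: random.uniform(1.0, 10.0) for i in range(200)}
sequential_order = list(box_volumes)    # boxes in the order they appeared
random_order = random.sample(sequential_order, k=len(sequential_order))
print(rollout_reward(sequential_order, box_volumes, 500.0))
print(rollout_reward(random_order, box_volumes, 500.0))
```

Plugged into the earlier sketch, that could look like `mcts(range(200), lambda order: rollout_reward(order, box_volumes, 500.0))`, though again this is only a toy stand-in for a real packing simulation.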

So imagine we were to very quickly pick and place random boxes one after another, writing down each box in the order we placed it and the total volume we stored, until the container is full. Then we start all over again. Having done this, say, 10,000 or 100,000 times, we go back to our records, check which order ended up storing the most volume, and use that order to solve the problem. This is basically the approach of MCTS.

Let’s implement it in Part 2.

Resources:

https://sawcordwell.github.io/mdp/conservation/2015/01/10/possingham1997-1/
