Who wants to figure out how to pack anyway?

Decision Lab
14 min read · May 18, 2022

We’ve all been there. You have a wonderful long vacation planned, you’re counting down the days, you’ve selected the outfits you’re going to wear, and suddenly it’s the day before and you have to figure out how on earth you’re going to fit all this stuff into one 23kg piece of luggage. So you try to put the shorts in, then the shirts, and then the shoes on the other side. Oh, but there is no more space for the jackets. You suddenly remember that hack you saw on YouTube where the influencer rolls all their clothes into little cylinders and magically space appears. You give it a go, and wouldn’t you know it, you still have some clothes left over to pack. You try a few more configurations, painfully packing and unpacking, until eventually you resign yourself to the fact that you will have to pay for that extra luggage or give up that perfect look at the beach.

Wouldn’t it be great, though, if there was a way to avoid all this? A program that takes in a list of your belongings and their dimensions and fits at least 95% of them in your bag every single time, so you only need to pack it ONCE.

This is an example of the 3D Bin Packing Problem (3D BPP) which is critical in logistics.

Well, here’s the thing: in logistics, paying for that extra pallet or that extra truck hits hard. If you can use one pallet instead of two, that just means you can fit more stuff in your truck in the long run and hence make more of that sweet $$$. The problem is that a single misplaced item on a pallet can make an extra pallet necessary, which requires another truck and so doubles transport costs (unless there is already stuff going in the same direction, in which case that’s a whole other problem!). The butterfly effect is real!

Did you know that humans on average only pack to fill 50% to 60% of the available space, and that getting much higher usually requires multiple attempts? We have developed two agents which outperform humans, do it much quicker, and can even make correct snap decisions. How, you ask? We did this in TWO separate ways! These solutions form part of our LOGOS program (LOGOS | Decision Lab UK): Logistics Optimisation System. For the sake of completeness we also compare our methods to an established packing algorithm.

The Challenge

The challenge we are setting is to pack as many given items into as few containers as possible. (We could equally well consider loading items onto a pallet.) We’ll simplify the scenario and consider a 5x5x5 container, which we’re trying to pack with cuboid-shaped items. These cuboids can vary in size from 1x1x1 to 3x3x3, with variability in each dimension. (So if we imagine we have a 100cm x 100cm x 100cm container, example item sizes could be 10cm x 20cm x 20cm and 30cm x 20cm x 10cm, etc.) We aim to maximise the utility of each container, which we define as the total volume of the items packed into the container divided by the volume of the container, i.e. the packing density.
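
To make the definition concrete, here is a tiny worked example (the packed item sizes are made up for illustration): utility is simply the fraction of the container’s volume occupied by the items packed into it.

```python
# Worked example of container utility for the 5x5x5 container described
# above (the packed items listed here are illustrative).
container_volume = 5 * 5 * 5                                   # 125 unit cells
packed_items = [(1, 1, 1), (2, 2, 3), (3, 3, 3), (1, 2, 2)]    # (w, d, h) of each packed item
packed_volume = sum(w * d * h for w, d, h in packed_items)     # 1 + 12 + 27 + 4 = 44
utility = packed_volume / container_volume                     # 44 / 125 = 0.352
print(f"Container utility: {utility:.1%}")                     # -> 35.2%
```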

Method

We want to compare three different techniques, to find out the relative pros and cons of each:

· Mathematical optimisation, where we use mathematical search techniques to find the optimal packing solution.

· Reinforcement learning, where we use a machine learning based approach to learn good packing strategies through trial and error.

· A rules-based algorithm, where we use a set of procedural rules to pack the container.

Traditional optimisation works better when all information is available. In contrast, reinforcement learning is better able to adapt to unseen future information and deal with partial observations. Reinforcement learning is particularly proficient at dealing with stochasticity. A rules-based algorithm is easy to understand, and also provides a good baseline for comparison.

Let’s discuss each approach in turn.

Mathematical Optimisation

In mathematical optimisation we specify an objective function and constraints, and use a mathematical optimiser to find the solution that optimises the objective function within these constraints. We formulate it as a Mixed Integer Linear programming (MILP) problem, following Trivella and Pisinger (2021).

For the bin packing problem, we set the objective function as minimising unused container space. This is equivalent to maximising the utility function given above. We also minimise the number of containers used.

Our constraints are defined as:

· Each item can be assigned to exactly one used container.

· Each item must be within the limits of the container (so no poking out of the sides or top).

· Items cannot overlap each other.

· Items can be orthogonally rotated but some rotations are not permitted for some items.

· Vertical stability: the bottom side of each item needs to be supported by the top face of other items or by the base of the container. The stability of a pattern is ensured by forcing the four base vertices of each item to be supported.

We can include other constraints such as the total weight allowed in the container, fragile items cannot be placed beneath heavy items, etc. We can also consider a very wide range of item sizes, although for this particular study we have limited ourselves to the regular cuboid shapes described above.

We use the Gurobi commercial solver to search through the problem space to find the optimal solution to our formulation.
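
To give a flavour of what the formulation looks like in code, below is a minimal, illustrative gurobipy sketch rather than our full LOGOS-OPT model: it handles a single container with only the “stay inside the container” and “no overlap” constraints, omits rotations, multiple containers and the vertical-stability constraint for brevity, and uses made-up item sizes.

```python
# Illustrative single-container 3D packing MILP in gurobipy (a sketch, not
# the LOGOS-OPT formulation). Rotations, multiple containers and vertical
# stability are omitted; the item sizes below are made-up examples.
import gurobipy as gp
from gurobipy import GRB

W = H = D = 5                                         # container dimensions
items = [(1, 1, 1), (2, 2, 3), (3, 3, 3), (1, 2, 3)]  # (w, h, d) of each item
n = len(items)
M = max(W, H, D)                                      # big-M for the disjunctions

m = gp.Model("bin_packing_sketch")
x = m.addVars(n, lb=0, ub=W)                          # corner coordinates of each item
y = m.addVars(n, lb=0, ub=H)
z = m.addVars(n, lb=0, ub=D)
packed = m.addVars(n, vtype=GRB.BINARY)               # 1 if the item is placed
sep = m.addVars(n, n, 6, vtype=GRB.BINARY)            # relative-position indicators

for i, (w, h, d) in enumerate(items):
    # the item must lie fully within the container walls
    m.addConstr(x[i] + w <= W)
    m.addConstr(y[i] + h <= H)
    m.addConstr(z[i] + d <= D)

for i in range(n):
    wi, hi, di = items[i]
    for j in range(i + 1, n):
        wj, hj, dj = items[j]
        # if both items are packed, at least one separating direction must hold
        m.addConstr(gp.quicksum(sep[i, j, k] for k in range(6))
                    >= packed[i] + packed[j] - 1)
        m.addConstr(x[i] + wi <= x[j] + M * (1 - sep[i, j, 0]))  # i left of j
        m.addConstr(x[j] + wj <= x[i] + M * (1 - sep[i, j, 1]))  # i right of j
        m.addConstr(y[i] + hi <= y[j] + M * (1 - sep[i, j, 2]))  # i in front of j
        m.addConstr(y[j] + hj <= y[i] + M * (1 - sep[i, j, 3]))  # i behind j
        m.addConstr(z[i] + di <= z[j] + M * (1 - sep[i, j, 4]))  # i below j
        m.addConstr(z[j] + dj <= z[i] + M * (1 - sep[i, j, 5]))  # i above j

# maximise the packed volume, equivalent to minimising unused container space
m.setObjective(gp.quicksum(w * h * d * packed[i]
                           for i, (w, h, d) in enumerate(items)), GRB.MAXIMIZE)
m.optimize()
```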

We call our mathematical optimisation approach LOGOS-OPT.

Reinforcement Learning

Reinforcement learning is a control optimisation technique useful when the performance of a sequence of decisions can only be measured after a prolonged period of time. In a reinforcement learning scenario, the decision made in the present state will impact the decision made in the next.

It is also applicable in scenarios where context matters, which differs from a normal optimisation approach where the optimal solution is found unconditionally.

As is typical for any reinforcement learning problem, we need a state and action space as well as a reward function. The agent is given information on the environment (the state) and it takes an action. The effect of this action on the environment is measured and a reward (positive or negative) given. This process is repeated numerous times with the goal of maximising the reward value, and in this way the agent learns the optimal action to take for the given state.
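
In code, that interaction loop looks roughly like the sketch below. This is illustrative only, not the LOGOS training code: it assumes a Gym-style environment with reset() and step() methods and a hypothetical agent object exposing act() and learn().

```python
# Generic agent-environment training loop (illustrative sketch only).
# `env` is assumed to follow the classic Gym interface; `agent` is a
# hypothetical object exposing act() and learn().
def train(env, agent, episodes=10_000):
    for _ in range(episodes):
        state = env.reset()                 # observe the initial state
        done = False
        while not done:
            action = agent.act(state)       # choose an action for this state
            next_state, reward, done, info = env.step(action)      # apply it
            agent.learn(state, action, reward, next_state, done)   # update from the reward
            state = next_state
```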

To make it suitable for reinforcement learning, we have to re-pose our problem, which actually makes it more challenging than it is for the other methods.

For reinforcement learning, we have a conveyor belt which delivers items to the packing area. The next item on the conveyor belt must be dealt with on arrival and be placed in the container before dealing with the next item. The item should also be placed such that:

· The item does not topple over and is supported by the items below it.

· The item does not overlap the boundaries of the container.

We do allow the decision-making agent to see one item ahead, so that it can do some limited planning. Even so, this is much more challenging than the mathematical optimisation method, which has complete knowledge of all the items to be packed; our RL agent only sees one item ahead in a randomly ordered sequence.

We call our RL decision making agent LOGOS-RL.

We use Microsoft Bonsai to create our RL decision-making agent. Microsoft Bonsai is a complete toolchain to build, train and deploy reinforcement learning agents (“Brains” in Microsoft speak). It allows subject matter experts without a machine learning background to program their expertise directly into an AI model and generate a controller that produces optimised control actions. It also works very well for a simulation expert looking to create something quite special, as various simulators including AnyLogic, Simulink, VP Link and standard Python Gym environments can be integrated with it.

All that is required from the user is to create the simulator, specify the reward function or goal of the agent and hit the train button. Another fantastic feature of Bonsai is that it will automatically choose the best reinforcement learning algorithm for the use case, choosing between SAC, Apex DQN and PPO. The agent is trained online using Microsoft Azure and scales using a managed simulator. This means it will parallelize training as much as possible and scale the number of simulators up and down as needed.

Finally, we can export our trained agent as a Docker container, which we can query directly for where to place an item given the appropriate state.

Bonsai allows for lessons to be created. These lessons allow us to progressively scale up the difficulty until the agent is adept at its tasks.

We developed the simulation in AnyLogic. Our sister company SimulAI is the sole distributor for AnyLogic in the UK and Italy, and so it made sense to use it to design our simulation. It also helps that our preferred RL training platform, Microsoft Bonsai, has easy integration with AnyLogic.

We defined our states, actions and reward structure as follows:

· State: We can vary parts of the simulation to test out different scenarios, namely the number of items on the conveyor belt, the dimensions of both items and containers, and the number of containers available. The observations themselves are a combination of the height map of each container (an N x N matrix giving the current height of items at each position in the container), a measure of the current utility of each container, the mean container utility, and the feasibility map for placing the current item in the container in its present orientation. The feasibility map is particularly useful as a sort of action mask, as Bonsai did not have action masking functionality built in at the time. It is represented as a matrix of zeros and ones, with a one present if the item in question can have its top-left-most corner placed there. The dimensions of the current item, the next item and the containers are also given.

· Action: We use multiple discrete action spaces: horizontal rotation, container placement, x-coordinate and y-coordinate. Rotation rotates the current item 90 degrees. Container placement specifies which container to place the item in. The x and y specify the coordinate in the chosen container at which to place the item.

· Reward: For the reward we simply used the feasibility map, conditioned on whether the item was rotated or not. The agent is rewarded by the change in container utility (multiplied by a factor of 10) from the item being placed, provided it is placed in a valid position as specified by the feasibility map. If it is not placed in a feasible location, the agent gets a penalty of -1 and the episode is terminated. This sped up training significantly over the simple reward of the delta in utility. (A rough sketch of this feasibility-and-reward logic is given below.)
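
The sketch below illustrates roughly how the feasibility map and reward fit together for a single container. It is a simplification rather than the LOGOS-RL implementation: in particular, the “flat, fully supported footprint” rule is our assumption for what makes a placement feasible.

```python
import numpy as np

def feasibility_map(height_map, item, container_height=5):
    """Ones wherever the item's top-left corner can legally be placed.
    Simplifying assumption: the whole footprint must sit on a flat surface
    and the item must not poke out of the top of the container."""
    w, d, h = item                                        # footprint w x d, height h
    n = height_map.shape[0]
    feasible = np.zeros((n, n), dtype=int)
    for i in range(n - w + 1):
        for j in range(n - d + 1):
            patch = height_map[i:i + w, j:j + d]
            flat = patch.max() == patch.min()             # fully supported base
            fits = patch.max() + h <= container_height    # stays inside the container
            feasible[i, j] = int(flat and fits)
    return feasible

def step_reward(height_map, item, i, j, container_volume=125):
    """10 x the change in container utility for a feasible placement,
    -1 and episode termination for an infeasible one."""
    w, d, h = item
    if feasibility_map(height_map, item)[i, j] == 0:
        return -1.0, True                                 # penalty, terminate episode
    height_map[i:i + w, j:j + d] += h                     # place the item
    delta_utility = (w * d * h) / container_volume
    return 10.0 * delta_utility, False
```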

The training and lesson plan we chose were relatively simple. We had a maximum of one 5x5x5 container to pack into and varied the sizes of the items that could be packed. Once the agent was adept at repeatedly packing up to 60 percent utility the lesson would progress to the next stage. We also note that when the container gets to 80 percent utility we “send it off” and replace it with a new empty container.

The first lesson was quite simple: pack 1x1x1 items in a 5x5x5 container. This lesson allowed for the agent to understand the simple behaviour of the feasibility map. It took around 318,000 training iterations (8,000 episodes) to achieve the 80% packing goal every time. Of course, with this lesson the rotate action has absolutely no effect — that is something it will learn in the next lesson.

In the second (and final) lesson, we allowed item dimensions to vary between 1 and 3, e.g. a 1x2x3 box or a 3x3x3 box or a 2x2x3 etc. This is a harder problem as the feasibility maps are more complicated and the agent now needs to understand how the rotation action works. However, as we can see in the plot below, the training works very well, being able to exceed the 80% utility goal every time after approximately 7 million iterations (500,000 episodes).

Rules-based Packing Algorithm

For comparison’s sake, we created a rules-based algorithm based on Dube and Kanavathy (2006). The algorithm is as follows (a simplified code sketch is given after the list):

1. If there are no items in the container, place the current item at the bottom-upper-left corner of the container

2. Go through the current items in the container and check their adjacent edges (such as on top of, to the left/right etc.) and see if the item can be placed there

3. If the item can be placed, place it; if not:

a. If there are other containers that can be packed, place item in the other container.

b. If there are no other containers, end packing.
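
For illustration, a heavily simplified sketch of this flow is given below. This is our own simplification rather than the Dube and Kanavathy (2006) implementation: it places the first item in the corner, then tries candidate positions adjacent to already-placed items (to the right, behind and on top), moving on to the next container when nothing fits.

```python
# Simplified rules-based packing sketch (not the Dube and Kanavathy (2006)
# implementation). Items and containers are axis-aligned cuboids.
from dataclasses import dataclass, field

@dataclass
class Placed:
    x: int
    y: int
    z: int
    w: int
    d: int
    h: int

@dataclass
class Container:
    size: int = 5
    items: list = field(default_factory=list)

    def fits(self, x, y, z, w, d, h):
        # inside the container?
        if x + w > self.size or y + d > self.size or z + h > self.size:
            return False
        # no overlap with any already-placed item?
        for p in self.items:
            if (x < p.x + p.w and p.x < x + w and
                    y < p.y + p.d and p.y < y + d and
                    z < p.z + p.h and p.z < z + h):
                return False
        return True

    def try_place(self, w, d, h):
        if not self.items:                          # rule 1: first item in the corner
            self.items.append(Placed(0, 0, 0, w, d, h))
            return True
        for p in self.items:                        # rule 2: try adjacent positions
            for x, y, z in ((p.x + p.w, p.y, p.z),      # to the right
                            (p.x, p.y + p.d, p.z),      # behind
                            (p.x, p.y, p.z + p.h)):     # on top
                if self.fits(x, y, z, w, d, h):
                    self.items.append(Placed(x, y, z, w, d, h))
                    return True
        return False                                # rule 3: try another container

def pack(items, n_containers=3):
    containers = [Container() for _ in range(n_containers)]
    for w, d, h in items:
        if not any(c.try_place(w, d, h) for c in containers):
            break                                   # no container can take it: stop
    return containers
```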

Results

Comparisons with Human Packers

We wanted to test LOGOS-OPT against human packers. We set up a trial where one person, left to their own devices, attempted to load a set of items into a container, while another person simultaneously packed an identical set of items using LOGOS-OPT. We had items of 40 different sizes, which we selected to create some packing challenges.

In our first experiment we had 17 items, with the challenge of packing them onto a single pallet. LOGOS-OPT found the solution in 13 seconds, achieving a packing density of 75.8%. The packing itself took approximately 40 seconds, with what we call the LOGOS Visual Assistant guiding the packer through the placement of the items. (Our LOGOS system is made up of an optimisation core linked with a Visual Assistant that projects instructions on item placement to guide packers and ensure the optimal placement is efficiently realised.) Contrast this with the unassisted human, who typically took about 90 seconds to pack between 12 and 15 items (multiple runs were done for the unassisted human), achieving a packing density of between 55% and 70%.

For the second experiment, we set a more difficult challenge: packing another set of 19 items onto one pallet. LOGOS-OPT found the solution in 105 seconds, which resulted in a packing density of 73.5%, and, with LOGOS guidance, the packer was able to pack all 19 items in 50 seconds. We allowed three different unassisted human packers to attempt to pack the container, leading to some interesting results. The first packer was unable to pack all the items and gave up after 300 seconds of trying. The second reached a packing density of approximately 60% with 14 items packed, leaving 5 items unpacked. The final packer was able to pack all 19 items, although he took 10 minutes, roughly 4 times longer than LOGOS-OPT when computation and packing time are combined. This more challenging case revealed a greater disparity between machine and human packing efficiency.

The key result is that LOGOS could pack more effectively than a human.

Comparison of the Three Computer-based Methods

We decided to compare our two brand new agents and the established rules-based algorithm. Giving the optimiser the full list of items that the RL agent (and the rules-based algorithm) would see, we compared their performance. As order of arrival is extremely important in the RL scenario, we ran each scenario 20 times to get an average.

We used the LOGOS-OPT result as the benchmark and compared the results of the runs with LOGOS-RL and the rules-based algorithm. We used three separate lists of items to run each experiment.

The table below summarises the results. We can see that across the three sets LOGOS-RL performs well. It achieves between about 80% and 83% packing density, which is around 90% of the mathematical optimisation gold standard in two cases and 86% in the other. It outperforms the rules-based approach. Given that LOGOS-RL is not shown the item list, but rather handles the packing on an item-by-item basis, its performance is remarkable. By its nature the RL approach is also exceptionally fast, giving a near-instant response as each item to pack is presented to it.

If we look into the 20 repeats of each experiment, we discover other interesting results. In experiment 1, we find that the RL agent was able to achieve the same packing density as LOGOS-OPT 20% of the time (which the rules-based approach never achieved). Even more impressively, with the second list of items the RL agent was able to achieve the same packing density as LOGOS-OPT a staggering 55% of the time, whereas the rules-based approach managed this 25% of the time. Experiment 3 provided a greater challenge, although the RL agent did achieve 99% of the maximum packing density on occasion. This was similar to the rules-based algorithm, although the RL agent was consistently better at packing across the repeats.

Discussion

The results from our experiments with the three solutions show that LOGOS-RL works generally well. It achieves about 90% of the LOGOS-OPT solution, and it does this while only seeing the current item and the next one, compared with the full list of items given to LOGOS-OPT. Both of our solutions comfortably outperform the established rules-based algorithm.

We did take a simplified scenario for our investigation, limiting it to simplified shapes. We have included greater flexibility in the mathematical optimisation approach, going to centimetre accuracy for item dimensions. We could extend this to the RL approach, although it would increase the learning times. However, it should also be noted that item sizes could be ‘rounded up’ to the nearest 5cm, which would keep training times quite reasonable. Although the packing densities would be lower, we can see that with our simplified scenario the mathematical optimisation achieves ~90% packing density, compared to 70%-75% when item dimensions are specified to high accuracy; as the RL achieves about 90% of the optimisation packing density, we can reasonably speculate that it could achieve 63% to 68% (0.9 × 70% to 0.9 × 75%), which is still good.

We could also improve performance. One way would be to increase the number of items the RL is given sight of. Although this increases the complexity of the state space, it should improve performance as it provides the RL with more options to choose from. It’s also worth noting that we set the RL the objective of achieving 80% packing density per container, and in fact we see that the RL routinely exceeds that in practice. We could train for longer and push this target to 85% or even 90% (going above that isn’t likely to provide a benefit, as the probability of there being no feasible solution would be high).

There is one important aspect that has not been discussed, and that is scalability. Mathematical optimisation does not scale linearly as we add more containers to be filled. Instead, we have to try to decompose the problem to stop the computation times exploding. This is not a problem for the RL: it will continue to pack as many containers as desired and will still be very fast. That is an important strength of the RL approach, along with its great flexibility for handling unexpected events, such as sudden changes in the orders or items being unavailable.

With the kinds of gains we’re seeing with the approaches we’ve developed, paying for excess luggage on your next long-haul flight could be a thing of the past.

References

Dube E. and Kanavathy L. (2006), “Optimising Three-dimensional Bin Packing through Simulation”. Proceedings of the Sixth IASTED International Conference on Modelling, Simulation and Optimisation, Gaborone, Botswana.

Trivella A. and Pisinger D. (2021) “Bin-packing problems with load balancing and stability constraints”, Core.ac.uk. Available: https://core.ac.uk/reader/154332720.

If you have any questions about this article, or would like to know more about LOGOS (LOGOS | Decision Lab UK), please contact us at hello@decisionlab.co.uk
