How we went from zero insight to predicting service time with a machine learning model — Part 2/2

Tarjei Bondevik
Published in Oda Product & Tech · Jun 23, 2022


In this article, we will explain the implementation of, and the results from, our machine learning model that predicts service time. If you haven’t already, make sure that you read Part 1 first to understand how we define service time, why it makes sense from a business perspective to predict it, and how we collect the data to measure it.

We define service time as the time our drivers spend parking, re-stacking the car, scanning the order, and carrying the groceries to our customers. In total, this accounts for roughly half of a driver's workday. We measure service time with geofence technology, shown in the sketch below.

A sketch of how we define service time, for customers A and B. For a more thorough explanation, please refer to Part 1 of this article series.
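As a rough illustration of the measurement (the function and event names here are hypothetical, not our actual data model), the service time for a single stop can be derived from the timestamps at which a driver's vehicle enters and exits the customer's geofence:

```python
from datetime import datetime

def service_time_seconds(geofence_enter: datetime, geofence_exit: datetime) -> float:
    """Service time for one stop: everything that happens between entering
    and leaving the customer's geofence (parking, re-stacking, scanning,
    carrying the groceries)."""
    return (geofence_exit - geofence_enter).total_seconds()

# Example: a driver who enters at 10:02 and leaves at 10:09 spent 420 s.
print(service_time_seconds(datetime(2022, 6, 1, 10, 2), datetime(2022, 6, 1, 10, 9)))
```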

To create an accurate machine learning (ML) model predicting the service time, we needed to find features that were likely to affect the service time. Luckily, we have a broad set of data on our deliveries, including two years of service time recordings. This meant the stage was set to take the step from manually predicting service time using business rules — referred to as the Business logic model — to using an ML-based model.

We used a range of features in our ML model; the most important were:

  • The order size (weight, number of items, number of boxes)
  • The area (relevant for parking difficulties)
  • Customer-related features (previously recorded service times for a customer, which floor they live on, whether they have an elevator, etc.)

We fed these features into our lightgbm gradient boosting model, and tuned its hyperparameters with Bayesian optimization using the optuna library.
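As an illustration, here is a minimal sketch of that setup. The data is synthetic and the search space is invented, but the mechanics (a lightgbm regressor whose hyperparameters are tuned by optuna's Bayesian-style TPE sampler) follow the approach described above:

```python
import lightgbm as lgb
import optuna
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for our real features (order size, area,
# customer history); the target plays the role of service time.
X, y = make_regression(n_samples=5000, n_features=10, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

def objective(trial):
    # Search over a few of lightgbm's key hyperparameters.
    params = {
        "objective": "regression_l1",  # train directly on absolute error (MAE)
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = lgb.LGBMRegressor(n_estimators=300, **params)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_valid, model.predict(X_valid))

# optuna's default sampler (TPE) is a form of Bayesian optimization.
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```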

Once we had an ML model that performed well on historical data, we wanted to test it in the real world. We ran our test for six weeks, starting in mid-November 2021 in the Sandvika region, an area west of Oslo. That area was chosen because its distribution of urban and rural deliveries is comparable to our entire delivery area, making it fairly representative. Roughly 10% of our deliveries were ordered in the test area: large enough to gather data on model performance, but small enough that we could fix most problems ad hoc if the model was off.

The new service time predictions were promising, and are shown in the plot below. Here, we compare the mean absolute error (MAE) of the ML model used in the Sandvika region with that of the Business logic model used everywhere else. Note that the comparison is somewhat flawed since these are different areas. Still, given that the ML model consistently performed ~30 seconds better than the Business logic model, and that the improvement in the test area occurred abruptly overnight, we were confident enough to roll out the ML model across our entire delivery area.

Comparison of the ML model used in the Sandvika test region with the Business logic model used in the rest of our delivery area. The test starts at the black dotted line; prior to this date, the Sandvika region used the Business logic model. For reference, we also show simulations of how a Naive model would have performed in our entire test area. The Naive model simply sets the service time to our historical average service time. Note how the ML model struggles in the days right before Christmas Eve, when our customers' shopping patterns are very different.
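To make the Naive baseline concrete, here is a minimal sketch (with made-up numbers) of how such a comparison works: every stop gets the historical average as its prediction, and we compare its MAE against the model's predictions.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Made-up service times, in seconds.
historical = np.array([420.0, 380.0, 510.0, 450.0])   # training period
actual = np.array([400.0, 470.0, 360.0, 520.0])       # test period
ml_pred = np.array([410.0, 455.0, 380.0, 505.0])      # hypothetical ML predictions

naive_pred = np.full_like(actual, historical.mean())  # always predict the average
print(f"Naive MAE: {mae(actual, naive_pred):.0f} s, ML MAE: {mae(actual, ml_pred):.0f} s")
```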

As we write this in June 2022, the model has been running in our entire delivery area since the start of January. At the same time, we've deployed a similar ML-based driving time model to ensure that the driving time between customers is also correctly set based on patterns in our historical data. These two models are used as input to our route planner so that, ideally, it optimizes routes based on correctly predicted service and driving times.
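Conceptually, the planner consumes two numbers per stop. Here is a hypothetical sketch (not our actual planner interface) of how those per-stop predictions roll up into a planned route duration:

```python
from dataclasses import dataclass

@dataclass
class StopEstimate:
    service_time_s: float  # predicted by the service time model
    driving_time_s: float  # predicted by the driving time model

def planned_route_duration_min(stops: list[StopEstimate]) -> float:
    """To a first approximation, a route's planned duration is the sum of
    the predicted driving and service times over all its stops."""
    return sum(s.service_time_s + s.driving_time_s for s in stops) / 60
```

This additive structure is why per-stop errors matter in aggregate: any bias in the individual estimates accumulates, or cancels, over the roughly 30 stops of a route.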

The big question then becomes: Does deploying these ML models improve our delivery precision and accuracy?

The somewhat disappointing answer is “A little, but not as much as we had hoped”.

The plot below shows the delay distribution in our entire delivery area, where the ML model data is from April and May 2022 and the Business logic model data is from September and October 2021. The delay is the registered delivery time minus the planned delivery time of an order; if it's negative, we're ahead of schedule. Our routes usually have around 30 stops, and the delay at a given stop depends on the accuracy of all the preceding service time predictions on that route.

Delay distribution, comparing the Business logic model (September and October 2021) with the ML model (April and May 2022). These months are chosen because they are comparable, with similar delivery conditions (no snow or heat, both of which may slow us down). The standard deviation, σ, is given in minutes. The increase in density at around -35 minutes occurs because some drivers are far ahead of schedule and have to wait for the customer's delivery time slot to open before they are allowed to complete the delivery. We have only included data where the route started on time from our fulfillment center, to eliminate delayed route starts as an explanatory variable.

With perfect route precision and accuracy, the delay distribution would be an infinitely sharp peak at 0 minutes delay. Rather surprisingly, we see that the distribution hasn't changed much in the ML era. The distribution's standard deviation has dropped by 10%, but this improvement is quite modest given that the MAE of the service time predictions has dropped by 23%.
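In code, the delay metric itself is straightforward (column names and values hypothetical):

```python
import pandas as pd

# Hypothetical delivery log with one row per completed stop.
deliveries = pd.DataFrame({
    "planned": pd.to_datetime(["2022-04-01 10:00", "2022-04-01 10:20", "2022-04-01 10:45"]),
    "registered": pd.to_datetime(["2022-04-01 10:04", "2022-04-01 10:15", "2022-04-01 10:52"]),
})

# Negative delay means the driver was ahead of schedule.
deliveries["delay_min"] = (
    (deliveries["registered"] - deliveries["planned"]).dt.total_seconds() / 60
)
print(deliveries["delay_min"].std())  # the σ we compare across model eras
```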

How can this be possible? Common sense tells us that if we send more precise service and driving time estimates into the route planner, the route planner should be able to plan more precise routes.

The surprisingly small improvement in overall route precision is likely explained by the fact that imprecise per-stop estimates may not matter much, as long as the errors even out over a route.

In a manually well-tuned system, like the previously used Business logic model, one might get the total time of an entire route more or less correct even if the time estimates of the subcomponents (i.e. each of the roughly 30 stops on the route) are imprecise.

This is shown in the map below, which compares the two models on the delay increase per stop, per area, in seconds. In the Business logic model era, drivers got too little time for the complicated deliveries in central areas of Oslo, typically seeing an average of 1.5 minutes of additional delay per stop. When delivering in rural areas, however, they got that time back.

Under the ML model, there is less variation in the delay per area, with no clear systematic bias between urban and rural areas.

Since most routes mix urban and rural areas, and there are usually around 30 stops on each route, you may end up with a roughly correct total time even if you greatly miscalculate each individual stop. Most likely, this is an important reason why the delay distribution has not improved much, even though our predictions per stop are significantly better.

Two maps showing the delay increase per stop, per area, in seconds, in the Oslo region, comparing the Business logic and the ML models. The mean delay increase is somewhat closer to 0 with the ML model, while the Business logic model gave more delay in the city center and less delay in the rural areas.
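A toy simulation can make this cancellation effect concrete. All numbers below are invented for illustration: in the "Business logic" scenario, each route mixes urban stops (systematically given too little time) with rural stops (systematically given too much), so the per-stop biases largely cancel in the route total; in the "ML" scenario, the per-stop errors are much smaller but unbiased.

```python
import numpy as np

rng = np.random.default_rng(42)
n_routes, n_stops = 20_000, 30

# "Business logic": 15 urban stops get a +90 s bias, 15 rural stops -90 s,
# plus noise. The biases cancel when summed over a mixed route.
bias = np.tile(np.r_[np.full(15, 90.0), np.full(15, -90.0)], (n_routes, 1))
old_err = bias + rng.normal(0, 60, (n_routes, n_stops))

# "ML": smaller, unbiased per-stop errors.
new_err = rng.normal(0, 55, (n_routes, n_stops))

for name, err in [("Business logic", old_err), ("ML", new_err)]:
    mae = np.abs(err).mean()                  # per-stop MAE, in seconds
    route_sigma = err.sum(axis=1).std() / 60  # route-level delay std, in minutes
    print(f"{name:14s} per-stop MAE {mae:5.1f} s, route-level σ {route_sigma:.1f} min")
```

In this toy example, the per-stop MAE drops by more than half while the route-level σ improves by less than 10%: the same qualitative pattern we observe in our real data.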

Even though the improvement in precision on the route level has been surprisingly modest, we still believe that our ML-based approach is useful for several reasons:

  • We have a system we can iterate on — by adding more data sources, for example. If we continue improving our predictions on each individual stop, we will likely see improvements on the route level as well.
  • The ML approach requires less oversight and manual adjustments, which becomes increasingly important as Oda is growing and operating in several markets.
  • The model gives us detailed insight into how the various features affect the service time; these insights can be used to improve our understanding of our operations.

For future work, we plan two major improvements:

First, we want to improve the filtering of our training data. A lot of messy, unpredictable things happen on the road, producing inaccurate data that our model is trained on. Removing erroneous training data while keeping the correct data is a continuous challenge in this project, and the potential for improving model performance here should be large.

Second, we want to make the model more responsive to drift in the environment. Currently, we have to adjust the service times manually during snowy weather; implementing a weather feature should be feasible. We also expect a lot of drift when we enter new markets, simply because our fleet of drivers will rapidly improve their skills in such a setting. We plan to add some exploration elements to our predictions to capture this type of drift.

Thanks for reading! If you have any questions, ideas, or comments, we'd love to hear from you!


Data scientist working at Oda.com, with a background in physics and nanotechnology.