How we implemented bike distance restrictions: A story

Alvaro Duran Tovar
Published in Quiqup Engineering
8 min read · Apr 30, 2018

Hey folks!

The Time Team has been working hard to change the restrictions for bikes on Generate Match (GM) from filtering based on estimated time of arrival (ETA) to filtering based on distance. It’s always cool to share what we do and why we do it. So here we go:

In the blog post below, I’ll explain what we did to improve the assignment of jobs to our bike Quiqees. These changes have already been working well in production for some time (full disclosure: we implemented the solution in early November 2017). As they say, better late than never to share this with you.

Problem

Some time ago (in a galaxy far far away…), CORs received complaints from bike Quiqees who used Google Maps’ estimated travel time to decide whether or not to accept a job. They knew we limited jobs with an estimated pickup or dropoff time of more than 20 minutes to scooters and cars, and they made sure to double-check it on their mobile phones. If Google Maps said it would take more than 20 minutes to reach a pickup or dropoff, some drivers would complain about the job and ask to be unassigned.

We use neural network models to predict times that are valid for our business (see previous Time Team posts), but the couriers compared our ETAs with Google’s and, naturally, saw different numbers from the ones we had predicted.

It goes without saying that we had checked in the past whether we could use Google to predict our job ETAs, but found that it isn’t fit for purpose. This isn’t a problem with Google or with our models; it’s just that our couriers’ behaviour cannot be predicted from the data set Google uses.

The time you need to go from one place to another depends on many variables: the same path may take different amounts of time due to weather conditions, traffic, transport mode, etc. As a result, we needed a more predictable method. Our solution was to use distance instead of time (ironic for the Time Team, huh?) to make the restriction of bikes from long-distance jobs more stable and reduce the number of complaints from drivers. After all, the distance is always the same. Whether we used distance or time, we knew our couriers would cross-check the job in mapping applications, so we had to calculate something close to the on-the-road distance predicted by the Almighty Google Maps.

We had a few options to choose from: a) querying Google Maps itself (or any similar paid third-party mapping service), b) resurrecting the open-source mapping service OSRM for GM purposes and retrieving the distance from this internal service, or c) modelling the actual on-the-road distance with simple mathematical formulas. Guess what we chose… option c, of course! It was the easiest, cheapest and fastest action we could take. After just a few short days of research, we had a no-cost, no-latency solution that was fit for purpose. Not too shabby. Below you’ll find more details about the modelling and evaluation processes, as well as the results.

Modelling

As mentioned above (and unsurprisingly), we chose ‘the cheapest’ option for solving the business problem. We aimed to experiment with different distance metrics and algorithms to come up with the most accurate estimate of the on-the-road distance, knowing only the courier’s current position and the coordinates of the destination locations.

Definitions of distance metrics from A to B

Using the straight-line distance between two coordinates isn’t a good option (you can try, but I don’t recommend it!) for obvious reasons: in the city, we have path restrictions due to one-way roads, buildings, bridges, etc.

Mathematically, there are a few different ways to calculate the distance between two points. In the data team, we use the Haversine distance. For simplicity’s sake, we can describe the Haversine distance as a straight line, but drawn over the surface of the earth (you know the earth isn’t flat, right?).

Another way to calculate the distance is to use what is called the Manhattan distance (or taxicab distance). This looks more similar to what you might do while driving in the city.

Figure 1: Manhattan distances

The blue and yellow lines show what you may do while driving, i.e. you can’t go straight from one point to another, but rather zigzag around buildings, one-way traffic, and other roadblocks. The same goes for the red line, though it’s a bit less realistic. Note that all those lines have the same length, even though their trajectories are very different. See the example below of the Manhattan vs. Haversine vs. actual on-the-road distance on a real map.

Figure 2: Manhattan vs. Haversine vs. Actual
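For the curious, here’s a minimal sketch of how these two metrics can be computed in Python (the function names and the example coordinates are mine, purely for illustration):

import math

EARTH_RADIUS_M = 6371000.0  # mean radius of the earth, in meters

def haversine_m(lat1, lon1, lat2, lon2):
    # "Straight line" over the surface of the earth (great-circle distance).
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def manhattan_m(lat1, lon1, lat2, lon2):
    # Taxicab distance: move along one axis first, then the other,
    # measuring each leg with the haversine formula.
    return haversine_m(lat1, lon1, lat2, lon1) + haversine_m(lat2, lon1, lat2, lon2)

# Example: London Bridge to Trafalgar Square
# haversine_m(51.5079, -0.0877, 51.5080, -0.1281) -> roughly 2800 meters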

Dataset and evaluation of success

Firstly, I used our internal dataset. It simply included pickup coordinates, dropoff coordinates and the drivers’ current locations at the moment they accepted jobs. This was the input data; the output data consisted of the on-the-road distance (the Ground Truth). I retrieved this Ground Truth by querying the Google Maps API for a limited number of our observations.

Secondly, every modelling attempt requires a way to measure success or, in other words, to quantify how well your predicted values correspond to reality (the Ground Truth). For this, I used the average error, measured in meters. For each observation, the error is calculated as abs(predicted distance in meters - actual distance in meters), where abs is the absolute value, i.e. the magnitude without regard to sign (+/-). I took the average of these errors over the thousands of observations we tested each model against to get the final evaluation metric, and repeated this procedure for every model. The larger the error, the worse the model represents reality.
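In code, this evaluation boils down to a couple of lines (a sketch; the function name is mine):

import numpy as np

def avg_error_m(predicted_m, actual_m):
    # Average of abs(predicted - actual) over all observations, in meters.
    return float(np.mean(np.abs(np.asarray(predicted_m) - np.asarray(actual_m))))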

Trialed approaches and their results

Raw Haversine distance

Using the straight-line distance between two points, we have:

Avg. error: 876 meters

As we can see, the average error is 876 meters, so this prediction model isn’t doing a very good job. This was our benchmark; all the subsequent models aimed to bring the average error down.

Linear regression using Haversine distance

Next, I tried a linear regression model with the Haversine distance as a feature. If we call the Haversine distance “X”, then we have the formula:

A*X + B

Where we have to calculate A and B to fit the data we have from Google Maps.
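As a rough sketch of what this looks like with scikit-learn (the variable names and the toy numbers are illustrative, not our actual data):

import numpy as np
from sklearn.linear_model import LinearRegression

hav_m = np.array([1200.0, 3400.0, 5100.0])   # Haversine distances (toy values)
road_m = np.array([1900.0, 4600.0, 6800.0])  # Google on-the-road distances (toy values)

X = hav_m.reshape(-1, 1)  # a single feature
model = LinearRegression().fit(X, road_m)
A, B = model.coef_[0], model.intercept_  # the A and B from the formula above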

Avg. error: 470 meters

This model was already looking much better than just a straight line.

Linear regression using Haversine distance + Manhattan distance

We often have hypotheses that we wish to try in order to improve our models. The Manhattan distance is not as direct as the straight-line distance, so we hoped it would account for the obstacles drivers have to navigate around. So we experimented with using it as a feature. If we call the Haversine distance “X” and the Manhattan distance “Y”, then we have the following formula:

A*X + B*Y + C

Where we have to calculate A, B and C (the machine does this for us, easy peasy!).
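Extending the sketch above with the second feature is a small change (again, the Manhattan values are toy numbers):

man_m = np.array([1500.0, 4100.0, 6300.0])  # Manhattan distances (toy values)
X2 = np.column_stack([hav_m, man_m])        # two features now
model2 = LinearRegression().fit(X2, road_m)
A, B, C = model2.coef_[0], model2.coef_[1], model2.intercept_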

Avg. error: 470 meters

That’s interesting!! As we can see, adding the Manhattan distance isn’t useful at all; there is no significant improvement whatsoever. In hindsight this makes sense: for any pair of points, the Manhattan distance is strongly correlated with the Haversine distance, so it brings almost no new information to the model.

Gradient boosting algorithm

I also tried gradient boosting, one of the most advanced methods, to push the boundaries of our knowledge. However, using more advanced and more accurate methods isn’t always the best option; there is a trade-off. Complexity comes with a cost, and sometimes it’s not possible (or very difficult) to implement in production.
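Continuing the sketch from above, the gradient boosting attempt might look like this (default hyperparameters, and I’m assuming the same distance features for illustration; the post doesn’t specify either):

from sklearn.ensemble import GradientBoostingRegressor

# Fit on the same two distance features as before (an assumption for
# illustration), with default, untuned hyperparameters.
gbm = GradientBoostingRegressor().fit(X2, road_m)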

Avg. error: 393 meters

This is about 100 meters better on average. But as I said, complexity is bad, really bad. A mere 100-meter improvement isn’t significant enough to warrant implementing gradient boosting in production.

Choosing the model

After considering all the data, I chose the linear regression with the Haversine distance as a feature. The model is good enough while keeping complexity to a minimum, with an average error of 470 meters. Another advantage is that it’s easy to understand what is happening. Consider our formula (remember, it was calculated by the computer!), where X is the straight-line distance:

1.27 * X + 306

So, what does this really mean? It means that we have to multiply the straight-line distance by 1.27 and then add 306 meters to get a distance close to what Google Maps would return.
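As a plain function, with a worked example:

def road_distance_m(straight_m):
    # The chosen model: coefficients come from the linear regression fit.
    return 1.27 * straight_m + 306

road_distance_m(5000)  # a 5 km straight line -> 6656 meters, about 6.7 km on the road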

Impact

We’re near the finish line! We just need to measure what impact to expect.

Here come the plots:

This plot shows the error (the distance calculated by our model compared with Google’s). We can see that the average signed error is 119 meters and 80% of the data falls between +931 meters and -776 meters of error.

Positive values mean we overpredicted and negative values mean we underpredicted.

One more:

It’s not an easy one to read, but we can see that our predictions (red) are pretty close to Google’s (green), meaning that we don’t have to use Google the way our Quiqees do (thereby reducing costs). Good stuff!

Blue shows the performance of the previous restriction, which was based on time rather than distance. That prediction was calculated using neural networks. Though they sound cool, they aren’t always the best option. The best option is, quite simply, whatever solves the problem best; in this case, that means reducing the number of jobs rejected by couriers because of their length.

The graph above tells us how many bikers will be rejected by GM. Remember, too low a rejection rate is not desirable, because couriers will end up with overly long jobs and complain about inaccurate predictions!

Conclusion

In data science, we have to find the most parsimonious model that accurately predicts our target variable. As previously discussed, we achieved this with something as simple as multiplying the straight-line distance by a coefficient. We always prefer simple solutions over complex ones. If a simple solution provides good enough results, we shouldn’t waste 80% of our time to achieve an extra 20% of improvement. In this case, we don’t need accuracy down to the meter, so there’s no point going after it. We all have better things to do!

After some time, we can now confirm that the number of long-run complaints from bike Quiqees went down. So, moral of the story: KISS (keep it simple, stupid)!
