Improving the Matching Performance for Carpooling

Tim Gallois
BlaBlaCar
Published Jun 11, 2024 · 9 min read

A little background…

Zen is the door-to-door product of BlaBlaCar. Instead of both parties meeting at a common point, such as a station, passengers request a ride from point A to point B.
Drivers post the trips that they intend to make, and if a trip passes close to a passenger’s itinerary, the driver can pick up the passenger, making a small detour and being compensated for it.

On Zen, we don’t let the passenger choose a trip from an inventory: instead, we match them with what we believe to be the best drivers for their needs. We call the intent of a passenger a “ride request”, and the inventory of drivers’ trips “trip offers”. Trip offers are either one-off (on the 3rd of May) or recurring (every Friday).

The passenger front page
We found a driver for our passenger! This cat looks trustworthy.

Anytime a passenger publishes a ride request, we want to see if there are any trip offers that would work to fulfill that ride request. Anytime a driver publishes a trip offer, we want to see if there are any ride requests that don’t have a driver yet that could be fulfilled by this new trip offer. If a trip and a ride request work well together, we have a “match”.

We also do “dry runs”, in order to display to a passenger how many drivers they should expect on a given ride request, allowing them to change it before publication.

All of that is a lot of calculations, which we want to optimize. But first, let’s talk about how it is done.

Calculating a match

To calculate whether we have a match or not, we have a two-step algorithm:

First, for the relevant dates, we calculate the distance between the passenger’s departure and arrival points and the line of the driver’s journey, using PostGIS, a PostgreSQL geospatial extension. This is gross-matching (as the crow flies), which represents the best-case scenario. We only keep the matches where the detour (in time and distance) is acceptable to the driver, according to their preferences.

Then, using map services, we calculate the actual length of the detour needed to fulfill the request according to average traffic, closed off roads, etc.

If the length and time of the detour are acceptable given the driver’s flexibility, we have a match! Great news. Obviously, calculating all of this is resource-intensive, and we want our service to keep working even if we had 10, 100 or even 1000 times as many users.
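The gross-matching step above can be sketched as a single PostGIS query. The table and column names (trip_offers, route_geom), the parameters and the 5 km threshold are assumptions for illustration, not our actual schema:

```sql
-- Step 1 sketch, gross-matching (hypothetical schema): keep trip
-- offers whose route passes within ~5 km, as the crow flies, of both
-- the passenger's departure and arrival points.
SELECT t.id
FROM trip_offers t
WHERE t.departure_date = :requested_date
  -- ST_DistanceSpheroid returns the distance in meters between
  -- two geometries, computed on the given spheroid
  AND ST_DistanceSpheroid(t.route_geom, :passenger_departure,
        'SPHEROID["WGS 84",6378137,298.257223563]') < 5000
  AND ST_DistanceSpheroid(t.route_geom, :passenger_arrival,
        'SPHEROID["WGS 84",6378137,298.257223563]') < 5000;
```

The survivors of this cheap filter are then handed to the map services for the second step, the real road-network detour calculation.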

All of this is also done on a materialized view, which does some pre-transformation of the trip data to make matching faster and easier to manage. In essence, a materialized view is a stored snapshot of a query’s result, which behaves like a read-only table. In order to keep this view up-to-date, it needs to be refreshed every so often, re-running the query and rebuilding the view. While this is clearly not ideal from a performance point of view, it allows us to change our search model easily and without fear of breaking changes, so we are keeping it for now.
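As a minimal sketch (all names are hypothetical, not our actual schema), the view and its refresh look like this; note that a concurrent refresh requires a unique index on the view:

```sql
-- Hypothetical materialized view over pre-transformed trip data
CREATE MATERIALIZED VIEW trip_search_view AS
SELECT t.id,
       t.departure_date,
       (t.route_polyline_geom)::geography AS route_geog
FROM trip_table t
WHERE t.status = 'PUBLISHED';

-- A unique index is mandatory for CONCURRENTLY refreshes
CREATE UNIQUE INDEX ON trip_search_view (id);

-- Re-run the query and swap in the result without
-- locking out readers
REFRESH MATERIALIZED VIEW CONCURRENTLY trip_search_view;
```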

In order to both find axes of improvements and to know where we stand, we decided to run a load test.

Load testing Zen: making sure the methodology is right

While it’d be tempting to simply duplicate a trip 100 000 times and a ride 100 000 times, our matching is temporal (when is the trip?) and geospatial (where is the trip?); that means that for our results to be representative, the temporal and geospatial spread of our test data must be representative too.

The following methodology was thus applied:

  1. 3 axes between cities were selected: Vannes–Sarzeau, Grenoble–Chamrousse and Dax–Hossegor.
  2. For each axis, between 8h and 20h, one non-recurring trip and one passenger ride were created per hour. This was repeated every day, with 3 entries instead of one on Fridays, Saturdays and Sundays, and the whole set was then repeated 300 times.
  3. The trips each have 1 hour of flexibility both ways, meaning that the driver would accept any ride that starts up to one hour before or after their trip.
  4. This results in 150 000 trips and 150 000 rides being created for the week. The test data was created by generating a SQL script and executing it.

Over this week, the dataset has more trips and rides on weekends, which is accurate to real-life. We spread out the spatial data over three axes, so that we can see how efficient our database would be at eliminating data that isn’t geographically close to the ride request/trip offer.

We also need to know what we’re going to be looking at: we want to know the performance (latency, CPU usage, I/O, …) of the matching (upon ride request creation), the reverse matching (upon trip offer creation), and finally the refresh of the materialized view which contains our trip data.

Results and interpretation

First, for matching and reverse matching, every single publication resulted in between 2000 and 6000 matches. This is obviously an insanely high number of matches, which would represent users in the hundreds of thousands or millions range. In fact, we currently have 150 000 pending trips (BlaBlaCar has 135 000 daily trips with 20 million users in France), all spread out over only 6 cities.

Despite this extremely high load, we have very encouraging results, with matching (a passenger looking for a driver) taking between 600 ms and 2.5 s:

With reverse matching (driver looking for a passenger), we see some scary results:

However, they come with a silver lining: 87% of the time is spent sorting the returned rows to get the 10 “best” passengers for a trip. With ride requests more spread out across France, we’d expect perhaps 10 or 20 potential passengers per trip, not 4000. As such, we can confidently expect this figure to drop almost 10-fold with activity that is just as busy but more geographically spread out.

Finally, the refresh of the materialized view is more dire: it takes 2 to 3 minutes to do a concurrent refresh (refreshing the view without locking read access to the data). This isn’t a problem per se, but it consumes all of the database’s resources while doing so, making any concurrent activity impossible. This is not too surprising, for two reasons:

  1. We still have a very small server, even after a few tweaks: 3.75 GiB of memory, a small CPU and, most importantly, 40 GiB of storage. Since allowed IOPS depends directly on storage size, we had a small I/O allotment, which is directly linked to the slow-downs. Should Zen reach hundreds of thousands or millions of users, we should definitely upgrade that storage. PostgreSQL is configured with its defaults, and there is plenty of room to explore how to tune it better for our purposes.
  2. We have already identified the materialized view as the weak link: multiple ideas have been floated on how to improve it (using a table updated via events, having a partitioned materialized view per day, etc.).

In summary, here are the conclusions interpreted from the testing:

Matching and reverse matching are not expected to cause any issues until we reach at least a million users. The materialized view could start causing issues in the 10 000s to 100 000s of users range. We’ll see it coming, but it should not be neglected.

This load testing didn’t take high QPS (queries per second) into account; that is quite important, as very high QPS on our database could lead to cascading failures. However, the publication use-case receives only a small amount of traffic: even with 50 000 publications per day, all done between 8h and 20h, that’s about 1 QPS.

How we made sure matching wouldn’t stall: indexes

Before running this load test, a lot of work had been done on geospatial and temporal database indexes, ensuring that the matching would be as quick as practically possible. PostGIS has very powerful indexes, but the way to use them is not straightforward: many operators are not compatible with indexes, and will simply never use them. This was the case for the geospatial function we were using to check matches, ST_DistanceSpheroid, which calculates the distance in meters between two coordinates. Thankfully, there is a workaround:

ST_DWithin is another PostGIS function that can tell whether two geographic objects are within a certain distance of one another. It lets us reverse the question from “is the distance between the objects less than X?” to “are the objects less than X apart?”, with the same effect. However, there’s a catch: the index lookup for ST_DWithin doesn’t support a varying argument, and X is variable! Here, X is what we call non-sargable: it can’t be used by an index. This brings us back to the original problem, which can be fixed with an interesting quirk of the database: using a functional index, we can persist X as a known, sargable argument!

CREATE INDEX IF NOT EXISTS my_index ON trip_table 
USING GIST(_st_expand((route_polyline_geom)::geography, (X)));

ST_Expand takes our route, and expands it in every direction by X. This creates a “thicker” line, where the route of the driver has been expanded by X, and thus X is no longer an issue for the index lookup.

Imagine if this line was 5km thick; that’s what we just created.

The final piece of the puzzle is that the actual condition in our query needs to incorporate the ST_Expand operation (in order to use the index) and the ST_DWithin operation (in order to actually check the distance!):

SELECT * 
FROM my_view
WHERE […]
AND _st_expand((route_polyline_geom)::geography, X) && <Departure point of passenger>
AND ST_DWithin( <Departure point of passenger>, (route_polyline_geom)::geography, X)

The && operator is an overlap (bounding-box intersection) operator; it looks strange to use both the overlap and ST_DWithin, but this is the only way for PostgreSQL to recognise and use the index!
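One way to verify that the pairing works is to ask the planner. This is a hypothetical sketch (view, column and parameter names are assumptions), with X replaced by a concrete 5000 m:

```sql
EXPLAIN
SELECT *
FROM my_view
WHERE _st_expand((route_polyline_geom)::geography, 5000)
      && :passenger_departure
  AND ST_DWithin(:passenger_departure,
        (route_polyline_geom)::geography, 5000);
-- Look for an index scan on the functional index in the plan;
-- if the _st_expand expression (or the constant) doesn't match the
-- index definition exactly, the planner silently falls back to a
-- sequential scan.
```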

This index allowed us to divide the cost of matching by a factor of 34, and thus guarantee good results in the load testing!

What’s next?

This load test helped us confirm the positive impact of our indexes, as well as general improvements to the matching query. It gave us the confidence that we can provide our users with a fast response even if the product were to scale quickly.

We have also identified some next steps:

First of all, work on the materialized view (whether to replace or improve it) should not be neglected: we will see the problem ramp up, but the rework could take time, and it’s better to come up with a solution before it is desperately needed.

Secondly, indexing and query-improvement work continues, as we continually find ways to improve matching and reverse-matching performance. Things such as partial indexes (indexes created with a condition) or simply improving the query itself can have sizeable impacts on performance.
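For example, a partial index only covers the rows matching its WHERE clause, which keeps the index small and cheap to maintain. A sketch under assumptions (the status column and its value are hypothetical):

```sql
-- Hypothetical partial index: only index trips that can still
-- produce a match, ignoring drafts and cancelled trips entirely
CREATE INDEX IF NOT EXISTS published_trip_route_idx
ON trip_table
USING GIST (_st_expand((route_polyline_geom)::geography, (X)))
WHERE status = 'PUBLISHED';
```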

Finally, this test should be re-done in the future, with two improvements:

  1. The data should be far more spread out geographically, to more precisely resemble business conditions: we won’t have thousands of matches available for a given trip for a long time, so this isn’t what we should be testing.
  2. The database was left to its own devices; we could introduce toxics (deliberately injected faults that degrade conditions) with the help of our database reliability engineers, to see how we would perform with timeouts, higher latencies, and so on.

Thanks for reading, and any feedback is very welcome!

A special thanks to Tony Demol, Antoine Sauray, Thomas Désert, Victor Rubin, Mickaël Ayeng, Aurélien Beltrame and Denis Wernert for your help in writing this article!
