Fixing the N+1 Query Problem
A few months ago, one of our platform engineers was looking through our Datadog traces, keeping an eye on performance, when they noticed that the flame graph for the VoucherCodes homepage was unusually busy.
Here is what they saw:
The bottom row caught their attention: it contained hundreds of colourful spans, indicating hundreds of system interactions; so many, in fact, that they likely exceeded Datadog’s payload limits, resulting in an abrupt cutoff.
The Investigation
Zooming into the graph, we can see that all of these spans are in fact repeated calls to Redis and PDO (the PHP extension for interfacing with databases). The repeating nature of the spans suggests a possible N+1 scenario.
N+What?
N+1 is a query-fetching problem in which your application makes a variable (and potentially large) number of calls to fetch a record and its related records.
The “+1” part describes the fetch to get the record in question, for example a movie from a “movies” database table.
The “N” part describes the variable number of related records, such as actors from an “actors” database table that appear in the aforementioned movie.
Ideally, we would like to make only two database fetches: the first for the movie and the second for the list of actors. However, if we are not careful and end up fetching each actor one by one, we will be executing many fetch requests per film.
A small independent film might result in only a few dozen fetches as it is likely to have a smaller cast; a blockbuster movie, however, could result in hundreds of fetches due to its much bigger cast!
As you can see, the variable number of related records means that some resources will be slower to fetch than others, and the problem is amplified as the number of related records grows.
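To make the cost concrete, here is a minimal, self-contained sketch of the two fetching strategies. The in-memory “actors table” and both function names are invented for illustration; the counter stands in for database round trips.

```php
<?php
// A self-contained sketch of the movie/actors example above.
// The in-memory table and fetch counter stand in for a real database.

$actorsTable = [
    1 => 'Actor A',
    2 => 'Actor B',
    3 => 'Actor C',
];

// One round trip per actor ID: the "N" part of N+1.
function fetchActorById(array $table, int $id, int &$fetchCount): string
{
    $fetchCount++;
    return $table[$id];
}

// One round trip for all IDs at once, as an IN() query would do.
function fetchActorsByIds(array $table, array $ids, int &$fetchCount): array
{
    $fetchCount++;
    return array_map(fn ($id) => $table[$id], $ids);
}

$castIds = [1, 2, 3];

// N+1 style: one fetch per actor.
$fetchCount = 0;
foreach ($castIds as $id) {
    fetchActorById($actorsTable, $id, $fetchCount);
}
echo $fetchCount, "\n"; // 3

// Batched style: one fetch regardless of cast size.
$fetchCount = 0;
fetchActorsByIds($actorsTable, $castIds, $fetchCount);
echo $fetchCount, "\n"; // 1
```

The N+1 version scales linearly with the cast size, while the batched version stays at a single fetch no matter how big the blockbuster.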
N+1 MySQL Queries
Let’s begin by looking at the MySQL queries, as executing hundreds of them in a single request is a red flag for performance.
Datadog is very helpful in grouping the queries and counting the number of executions for us. Looking at the table below, we can see there is a query that is being executed 663 times with each request!
Finding the source of this query in the codebase, I’m presented with something along the lines of this:
foreach ($offers as $offer) {
// ...
$offer->artwork = $this->getArtworkForMerchant($offer->merchantID);
// ...
}
This code loops over a list of offers and fetches additional artwork data one offer at a time. This is a definite bottleneck: database calls are not cheap, so we want to minimise the number of calls we need to make.
Fortunately, we can use the SQL IN() operator to fetch only the records with the specific IDs we are interested in. We can rewrite the query so that, rather than fetching a single record by ID, it fetches multiple records by multiple IDs:
# old
SELECT * FROM merchant_artwork WHERE id = :merchant_id;
# new
SELECT * FROM merchant_artwork WHERE id IN (:merchant_id_1, :merchant_id_2, ...);
Our PHP code can then be updated to utilise this new logic:
// old
foreach ($offers as $offer) {
// ...
$offer->artwork = $this->getArtworkForMerchant($offer->merchantID);
// ...
}
// new
$merchantIDs = array_map(
fn ($offer) => $offer->merchantID,
$offers,
);
$allArtwork = $this->getArtworkForMerchants($merchantIDs);
foreach ($offers as $offer) {
$offer->artwork = $allArtwork[$offer->merchantID];
}
This new code is a bit more involved as it requires some work upfront: we extract the list of merchant IDs from the list of offers, retrieve all of the artwork data in a single query, and finally stitch the artwork data onto each offer. Despite the added steps, this code solves the N+1 issue and fetches all artwork records in a single query rather than 663!
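The getArtworkForMerchants() method itself isn’t shown in this post. A sketch of how it might build the IN() query with PDO looks like this; the table name comes from the query above, but the column names and PDO wiring are assumptions:

```php
<?php
// Hypothetical sketch of a batched artwork fetch using PDO.
// Builds one placeholder per merchant ID and keys the results by ID.

function getArtworkForMerchants(PDO $pdo, array $merchantIDs): array
{
    if ($merchantIDs === []) {
        return [];
    }

    // Build a "?, ?, ?" placeholder list, one per merchant ID.
    $placeholders = implode(', ', array_fill(0, count($merchantIDs), '?'));

    $statement = $pdo->prepare(
        "SELECT * FROM merchant_artwork WHERE id IN ($placeholders)"
    );
    $statement->execute(array_values($merchantIDs));

    // Key the rows by ID so each offer can look up its artwork directly.
    $artworkByID = [];
    foreach ($statement->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $artworkByID[$row['id']] = $row;
    }

    return $artworkByID;
}
```

Generating positional placeholders keeps the query parameterised however many IDs we pass in, avoiding both SQL injection and a hard-coded ID count.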
On top of addressing the N+1 performance issue, we were sure to cache this data in Redis so that subsequent requests wouldn’t hit the database again.
N+1 Redis Calls
Next up are the calls to Redis. Similar to the MySQL calls, we want to limit the number of individual calls and batch fetch records where possible.
Let’s look at the Datadog report again:
We’re seeing 330 MGET calls on each request, which is too many.
Running Redis’ MONITOR command in my terminal and hitting the webpage in the browser allowed me to watch the Redis calls come in in real time:
...
1677660510.869751 [0 172.18.0.12:55858] "MGET" "offer-7962672"
1677660510.875843 [0 172.18.0.12:55858] "MGET" "offer-last-used-7962672"
1677660510.876687 [0 172.18.0.12:55858] "MGET" "offer-7949258"
1677660510.876891 [0 172.18.0.12:55858] "MGET" "offer-last-used-7949258"
1677660510.877358 [0 172.18.0.12:55858] "MGET" "offer-7937533"
1677660510.877526 [0 172.18.0.12:55858] "MGET" "offer-last-used-7937533"
1677660510.877874 [0 172.18.0.12:55858] "MGET" "offer-7929644"
1677660510.878048 [0 172.18.0.12:55858] "MGET" "offer-last-used-7929644"
1677660510.878389 [0 172.18.0.12:55858] "MGET" "offer-7928439"
1677660510.878607 [0 172.18.0.12:55858] "MGET" "offer-last-used-7928439"
1677660510.878941 [0 172.18.0.12:55858] "MGET" "offer-7912763"
1677660510.879118 [0 172.18.0.12:55858] "MGET" "offer-last-used-7912763"
1677660510.879497 [0 172.18.0.12:55858] "MGET" "offer-7901126"
1677660510.879718 [0 172.18.0.12:55858] "MGET" "offer-last-used-7901126"
...
It was obvious where the issue lay: the repeated calls to the offer-* and offer-last-used-* keys indicate another foreach loop and another N+1 scenario.
Digging up the code that hits those cache keys, I find the foreach loop as expected:
$offers = [];
foreach ($offerIDs as $offerID) {
$offers[] = $this->offerService->findOfferForDisplay($offerID);
}
This code is simply looping over a list of offer IDs and fetching the cached record for each one. The findOfferForDisplay() method only fetches one offer at a time, so it was wrapped in a loop to support fetching multiple offers. The solution is to refactor findOfferForDisplay() to accept multiple IDs and forward them to Redis to fetch in bulk.
One thing I should mention is that this application uses Symfony’s Cache component to abstract away the calls to Redis, and findOfferForDisplay() calls the CacheItemPoolInterface::getItem() method behind the scenes. What we want instead is the CacheItemPoolInterface::getItems() method, which accepts multiple keys and takes advantage of the batch-fetching nature of Redis’ MGET command.
// old
function findOfferForDisplay(int $offerID) {
$offer = $this->cache->getItem("offer-$offerID");
$lastUsedDate = $this->cache->getItem("offer-last-used-$offerID");
// ...
$offer->lastUsed = $lastUsedDate;
// ...
return $offer;
}
// new
function findOffersForDisplay(array $offerIDs) {
$offerKeys = array_map(fn ($offerID) => "offer-$offerID", $offerIDs);
$lastUsedKeys = array_map(fn ($offerID) => "offer-last-used-$offerID", $offerIDs);
$offers = $this->cache->getItems($offerKeys);
$lastUsedDates = $this->cache->getItems($lastUsedKeys);
// attach last used dates to each offer ...
return $offers;
}
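The “attach last used dates” step elided above might be sketched as follows. It assumes, as with Symfony’s component, that getItems() returns cache items keyed by cache key, each exposing the PSR-6 isHit()/get() methods; the function name is invented for illustration.

```php
<?php
// Hypothetical sketch of the elided "attach" step. $offers and
// $lastUsedDates are maps of cache key => cache item, as returned
// by CacheItemPoolInterface::getItems().

function attachLastUsedDates(array $offerIDs, array $offers, array $lastUsedDates): array
{
    $result = [];
    foreach ($offerIDs as $offerID) {
        $offerItem = $offers["offer-$offerID"] ?? null;
        if ($offerItem === null || !$offerItem->isHit()) {
            continue; // cache miss: this offer needs fetching elsewhere
        }

        $offer = $offerItem->get();

        $lastUsedItem = $lastUsedDates["offer-last-used-$offerID"] ?? null;
        if ($lastUsedItem !== null && $lastUsedItem->isHit()) {
            $offer->lastUsed = $lastUsedItem->get();
        }

        $result[] = $offer;
    }
    return $result;
}
```

Pairing by offer ID rather than relying on result order keeps the stitching robust even if some keys miss the cache.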
Having updated the code, when I rerun the MONITOR command and hit the page again, this is what we see:
1677660388.754528 [0 172.18.0.12:40518] "MGET" "offer-7962672" "offer-7949258" "offer-7937533" ...
1677660388.759681 [0 172.18.0.12:40518] "MGET" "offer-last-used-7962672" "offer-last-used-7949258" "offer-last-used-7937533" ...
This is exactly the result we are after: there are only two MGET calls requesting the same records as before, with each call providing a list of keys to fetch in one go.
Results
With those simple refactors deployed, let’s look at the results of our changes.
Latency
After deploying these changes, our latency measurements stabilised at, on average, about a third of their former values.
Time To First Byte (TTFB)
The time to first byte graph halved from ~650 ms to ~350 ms.
User Happiness
SpeedCurve estimates user happiness based on performance metrics, and its reports show that our users should experience “happier page views” as a result of these changes.
Summary
This was a short write-up of what we went through to identify and fix some performance issues on our homepage. As you can see, the code changes were simple, but the return on investment was far greater.
It’s worth investing in some decent monitoring tools like Datadog to help with analysing and investigating performance issues like these. And remember to always approach loops in your code with caution and ensure you fetch your data in bulk!
Have you heard? We’re hiring at VoucherCodes! Check out our careers page here.