Fixing the N+1 Query Problem

Nikush Patel
VoucherCodes Tech Blog
7 min read · Jul 26, 2023
Photo by Stephen Dawson on Unsplash

A few months ago, one of our platform engineers was looking through our Datadog traces and keeping an eye on performance when they noticed that the flame graph report for the VoucherCodes homepage was looking pretty busy.

Here is what they saw:

The bottom row caught their attention, as it contained hundreds of colourful spans indicating hundreds of system interactions: so many that the trace likely exceeded Datadog’s payload limits, resulting in an abrupt cutoff.

The Investigation

Zooming into the graph, we can see that all of these spans are in fact repeated calls to Redis and PDO (a PHP extension to interface with databases). The repeating nature of the spans implies a possible N+1 scenario.

N+What?

N+1 is a query-fetching problem in which your application makes a variable (and potentially large) number of calls to fetch a record and its related records.

The “+1” part describes the fetch to get the record in question, for example a movie from a “movies” database table.

The “N” part describes the variable number of related records, such as actors from an “actors” database table that appear in the aforementioned movie.

Ideally, we would like to make only two database fetches: the first for the movie and the second for the list of actors. However, if we are not careful and end up fetching each actor one by one, we will be executing many fetch requests per film.

A small independent film could result in only a few dozen fetches, since it is likely to have a smaller cast; a blockbuster movie, however, could result in hundreds of fetches due to its much bigger cast!

As you can see, the variable number of related records means that some resources will be slower to fetch than others, and the problem is amplified as the number of related records grows.
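
To make this concrete, here is a minimal sketch of the anti-pattern in PHP using PDO. The table names, columns, and the $movieId and $castIds variables are assumptions for illustration only; they are not from a real codebase.

// Hypothetical illustration of the N+1 pattern for the movie/actors example.
// $pdo is a PDO connection; $movieId and $castIds are assumed to be available.

// The “+1”: a single query for the movie itself.
$stmt = $pdo->prepare('SELECT * FROM movies WHERE id = ?');
$stmt->execute([$movieId]);
$movie = $stmt->fetch(PDO::FETCH_ASSOC);

// The “N”: one query per actor in the cast (this is the N+1 anti-pattern).
$actors = [];
foreach ($castIds as $actorId) {
    $stmt = $pdo->prepare('SELECT * FROM actors WHERE id = ?');
    $stmt->execute([$actorId]);
    $actors[] = $stmt->fetch(PDO::FETCH_ASSOC);
}

// Ideally we would make just two fetches in total: one for the movie and one
// for the entire cast, which is the approach applied to the homepage below.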

N+1 MySQL Queries

Let’s begin with the MySQL queries, as executing hundreds of them in a single request is a red flag for performance.

Datadog is very helpful in grouping the queries and counting the number of executions for us. Looking at the table below, we can see there is a query that is being executed 663 times with each request!

Finding the source of this query in the codebase, I’m presented with something along these lines:

foreach ($offers as $offer) {
    // ...
    $offer->artwork = $this->getArtworkForMerchant($offer->merchantID);
    // ...
}

This code loops over a list of offers and fetches the additional artwork data one offer at a time. This is a definite bottleneck: database calls are not cheap, so we want to minimise the number of calls we make.

Fortunately, we can use the SQL IN() operator to fetch only the records with the specific IDs we are interested in. We can rewrite the query so that, rather than fetching a single record by one ID, it fetches multiple records by a list of IDs:

# old
SELECT * FROM merchant_artwork WHERE id = :merchant_id;

# new
SELECT * FROM merchant_artwork WHERE id IN (:merchant_id_1, :merchant_id_2, ...);

Our PHP code can then be updated to utilise this new logic:

// old
foreach ($offers as $offer) {
    // ...
    $offer->artwork = $this->getArtworkForMerchant($offer->merchantID);
    // ...
}

// new
$merchantIDs = array_map(
    fn ($offer) => $offer->merchantID,
    $offers,
);
$allArtwork = $this->getArtworkForMerchants($merchantIDs);
foreach ($offers as $offer) {
    $offer->artwork = $allArtwork[$offer->merchantID];
}

This new code is a bit more involved: we first extract the list of merchant IDs from the list of offers, then retrieve all of the artwork data in a single query, and finally stitch the artwork data onto each offer. Despite the added steps, this code solves the N+1 issue and fetches all artwork records in a single query rather than 663!
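
For reference, here is a minimal sketch of what a getArtworkForMerchants() method could look like with PDO. The method name matches the calling code above, but everything else ($this->pdo, the exact schema, and keying rows by the id column) is an assumption for illustration.

// Sketch of a batched getArtworkForMerchants(); implementation details assumed.
private function getArtworkForMerchants(array $merchantIDs): array
{
    if ($merchantIDs === []) {
        return [];
    }

    // Build one positional placeholder per ID: IN (?, ?, ?, ...)
    $placeholders = implode(', ', array_fill(0, count($merchantIDs), '?'));

    $stmt = $this->pdo->prepare(
        "SELECT * FROM merchant_artwork WHERE id IN ($placeholders)"
    );
    $stmt->execute(array_values($merchantIDs));

    // Key the rows by merchant ID so the caller can do $allArtwork[$offer->merchantID].
    $artworkByMerchant = [];
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $artworkByMerchant[$row['id']] = $row;
    }

    return $artworkByMerchant;
}

Keying the result by merchant ID keeps the stitching loop in the calling code a simple array lookup.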

On top of addressing the N+1 performance issue, we were sure to cache this data in Redis so that subsequent requests wouldn’t hit the database again.

N+1 Redis Calls

Next up are the calls to Redis. Similar to the MySQL calls, we want to limit the number of individual calls and batch fetch records where possible.

Let’s look at the Datadog report again:

We’re seeing 330 MGET calls on each request, which is too many.

Running Redis’ MONITOR command in my terminal and hitting the webpage in the browser allowed me to watch the Redis calls arrive in real time:

...
1677660510.869751 [0 172.18.0.12:55858] "MGET" "offer-7962672"
1677660510.875843 [0 172.18.0.12:55858] "MGET" "offer-last-used-7962672"
1677660510.876687 [0 172.18.0.12:55858] "MGET" "offer-7949258"
1677660510.876891 [0 172.18.0.12:55858] "MGET" "offer-last-used-7949258"
1677660510.877358 [0 172.18.0.12:55858] "MGET" "offer-7937533"
1677660510.877526 [0 172.18.0.12:55858] "MGET" "offer-last-used-7937533"
1677660510.877874 [0 172.18.0.12:55858] "MGET" "offer-7929644"
1677660510.878048 [0 172.18.0.12:55858] "MGET" "offer-last-used-7929644"
1677660510.878389 [0 172.18.0.12:55858] "MGET" "offer-7928439"
1677660510.878607 [0 172.18.0.12:55858] "MGET" "offer-last-used-7928439"
1677660510.878941 [0 172.18.0.12:55858] "MGET" "offer-7912763"
1677660510.879118 [0 172.18.0.12:55858] "MGET" "offer-last-used-7912763"
1677660510.879497 [0 172.18.0.12:55858] "MGET" "offer-7901126"
1677660510.879718 [0 172.18.0.12:55858] "MGET" "offer-last-used-7901126"
...

It was obvious where the issue lay: the repeated calls to the offer-* and offer-last-used-* keys indicate another foreach loop and another N+1 scenario.

Digging up the code that hits those cache keys, I find the foreach loop as expected:

$offers = [];
foreach ($offerIDs as $offerID) {
    $offers[] = $this->offerService->findOfferForDisplay($offerID);
}

This code is simply looping over a list of offer IDs and fetching the cached record. The findOfferForDisplay() method only fetches one offer at a time, so it was wrapped in a loop to support fetching multiple offers. The solution is to refactor findOfferForDisplay() to accept multiple IDs and forward them to Redis to fetch in bulk.

One thing I should mention is that this application uses Symfony’s Cache component to abstract away the calls to Redis, and findOfferForDisplay() calls the CacheItemPoolInterface::getItem() method behind the scenes. What we want instead is the CacheItemPoolInterface::getItems() method, which accepts multiple keys and takes advantage of the batched nature of Redis’ MGET command.

// old
function findOfferForDisplay(int $offerID) {
    $offer = $this->cache->getItem("offer-$offerID");
    $lastUsedDate = $this->cache->getItem("offer-last-used-$offerID");
    // ...
    $offer->lastUsed = $lastUsedDate;
    // ...
    return $offer;
}

// new
function findOffersForDisplay(array $offerIDs) {
    $offerKeys = array_map(fn ($offerID) => "offer-$offerID", $offerIDs);
    $lastUsedKeys = array_map(fn ($offerID) => "offer-last-used-$offerID", $offerIDs);

    $offers = $this->cache->getItems($offerKeys);
    $lastUsedDates = $this->cache->getItems($lastUsedKeys);

    // attach last used dates to each offer (see the sketch below) ...

    return $offers;
}
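
The elided “attach last used dates” step could look roughly like this. It is only a sketch: it assumes the items returned by getItems() are keyed by their cache key (as PSR-6 pools do), that each cached offer is an object we can assign lastUsed to, and it builds a new array rather than returning the raw cache items.

// Sketch of the elided stitching step; $offers and $lastUsedDates are the
// item collections returned by getItems() above, keyed by cache key.
$lastUsedByKey = [];
foreach ($lastUsedDates as $key => $item) {
    $lastUsedByKey[$key] = $item->isHit() ? $item->get() : null;
}

$result = [];
foreach ($offers as $key => $item) {
    if (!$item->isHit()) {
        continue; // cache-miss handling omitted for brevity
    }
    $offer = $item->get();
    // Map "offer-1234" onto its "offer-last-used-1234" counterpart.
    $lastUsedKey = str_replace('offer-', 'offer-last-used-', $key);
    $offer->lastUsed = $lastUsedByKey[$lastUsedKey] ?? null;
    $result[] = $offer;
}

return $result;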

Having updated the code, I rerun the page and watch the MONITOR output again; this is what we see:

1677660388.754528 [0 172.18.0.12:40518] "MGET" "offer-7962672" "offer-7949258" "offer-7937533" ...
1677660388.759681 [0 172.18.0.12:40518] "MGET" "offer-last-used-7962672" "offer-last-used-7949258" "offer-last-used-7937533" ...

This is exactly the result we are after: there are only two MGET calls requesting the same records as before, and each call provides a list of keys to fetch in one go.

Results

With those simple refactors in place, let’s look at the results of our changes.

Latency

After deploying these changes, our latency measurements stabilised at (on average) about a third of their former values.

The above line graph shows latency measures dropping, from an average of around 1 second to just a third of a second.

Time To First Byte (TTFB)

The time to first byte graph halved from ~650 ms to ~350 ms.

The above line graph shows the time to first byte dropping from about 650 milliseconds to about 350 milliseconds.

User Happiness

SpeedCurve estimates user happiness based on performance metrics, and it reports that our users should experience “happier page views” because of these changes.

The above graph shows “user happiness” scores as percentages; as you can see, the values for happy users increased and the values for unhappy users decreased.

Summary

This was a short write-up of what we went through to identify and fix some performance issues on our homepage. As you can see, the code changes were simple, but the return on investment was far greater.

It’s worth investing in some decent monitoring tools like Datadog to help analyse and investigate performance issues like these. And remember: always approach loops in your code with caution, and make sure you fetch your data in bulk!

Have you heard? We’re hiring at VoucherCodes! Check out our careers page here.
