n+1 problem


One of the more frequent performance mistakes developers make is also perhaps one of the simplest to avoid and fix. The n+1 problem is best illustrated by code. Consider a very generic (and very common) way to retrieve a list of items and their properties from the database.

# select a list of items
$sth = $dbh->prepare("select id from items_table where ....");
$sth->execute();
$items = $sth->fetchAll();
foreach ($items as &$item) {
    # one extra query per item to fetch its properties
    $sth_props = $dbh->prepare("select * from item_props
                                where item_id = ?");
    $sth_props->execute([$item["id"]]);
    $item["props"] = $sth_props->fetchAll();
}

Items and properties can be anything: blog posts and comments, products and SKUs, restaurants and reviews, users and history, etc. The actual representation doesn’t matter. What matters is how long it takes to retrieve the data that’s needed. And the above is the most common, yet least performant, way of accomplishing the task.

Why n+1?

The reason it is called n+1 is that the code executes n+1 queries to get the data: one query to get the list of items (1), followed by a number of queries equal to the number of items returned (n), one to fetch the properties of each. In concrete terms: if you have 1 item, the code will execute 2 queries to return the information requested. If you have 10 items, the code will execute 11 queries. And if you have 1000 items … you will never get your results back, because the page will time out waiting for 1001 queries to execute.
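The query count is easy to see by measuring it directly. Since the article’s PHP snippets need a live database, here is a minimal stand-in sketch using Python’s built-in sqlite3 with hypothetical items and item_props tables; the query counting is the point, not the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table items (id integer primary key);
    create table item_props (item_id integer, prop text);
""")
n = 10
for i in range(1, n + 1):
    conn.execute("insert into items (id) values (?)", (i,))
    conn.execute("insert into item_props values (?, ?)", (i, "color"))

# n+1 pattern: one query for the item list, plus one query per item
queries = 0
items = conn.execute("select id from items").fetchall()
queries += 1
for (item_id,) in items:
    conn.execute(
        "select * from item_props where item_id = ?", (item_id,)
    ).fetchall()
    queries += 1
print(queries)  # 11 queries for 10 items

# join pattern: the same data in a single query
rows = conn.execute(
    "select i.id, p.prop from items i "
    "left outer join item_props p on p.item_id = i.id"
).fetchall()
print(len(rows))  # same 10 items' worth of data, 1 query
```

Bump n to 1000 and the loop version executes 1001 queries while the join version still executes one.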

But it runs fast in my environment!

Perhaps. Perhaps it will continue running fast for some time. But the trickery of the n+1 problem is the continuous degradation of performance as you accumulate more data, and the rate of degradation depends on how quickly new data comes in. If you are just getting started on the noble path of a professional blogger, the number of queries will grow with every great new post, slowly degrading performance to the point of unusability. Conversely, if you are an online reseller for multiple brands, receiving a large batch of new products may bring your site to its knees immediately.

How to avoid it?

Simple. Never execute queries in a loop, especially if the number of iterations is not static. If you don’t know how many queries you’ll run, you will never be able to predict user-perceived performance.

Wait … but I still need information from different tables!

Use SQL joins to get all the information you need in a single query. Specifically, for this example, a LEFT OUTER JOIN: it returns every row from the items table, paired with any matching rows from the properties table (items with no properties still come back).

The right way

# select a list of items WITH all properties
$sth = $dbh->prepare("select i.*, p.* from items i
                      left outer join item_props p on p.item_id = i.id
                      where ....");
$sth->execute();
$r_items = $sth->fetchAll();
$items = [];
foreach ($r_items as $row) {
    $curr_item_id = $row["id"];
    $items[$curr_item_id]["name"] = $row["name"];
    ...
    $items[$curr_item_id]["props"][] = $row;
}

Above is what the same logic/result would look like using joins. Now, no matter how many items you have, only one query will be executed each time. Yay!

Now we can move on to the next potential performance bottleneck — the quantity of data returned and processed. But that’s a story for another day.
