Optimize the way you fetch relationships with PostgreSQL

Clément Grimault
8 min read · Dec 12, 2022


Value your relationships. Photo by Duy Pham on Unsplash

TL;DR: Use the JSON_AGG function to fetch an entity with all of its relationships in a single query. This approach works best when fetching the relationships of a small number of entities; it doesn’t scale well to larger result sets.

We use relational databases (RDBMS) to store relational data (entities with their relationships), and they do that efficiently. But things get more complicated when we want to pull these rows and map them to the objects our application uses (models). Indeed, once we start joining several tables, we get one row per tuple, so we need to group and de-duplicate rows in our application code. This is why many people reach for ORMs by default: they provide a nice abstraction to load rows with their relationships and map them to the right entity objects.

But ORMs have some drawbacks:

  • They are black boxes: it’s hard to know what’s happening under the hood (they can generate some pretty bad queries if you don’t pay attention).
  • They can lack flexibility: they are fine for simple CRUD operations but struggle when we need to perform more complex queries.
  • Each ORM is different: no two ORMs define and query entities the same way, so what you learn with one ORM may not carry over if you switch to another. SQL is already an abstraction layer (you write a declarative query saying what you want, and the DB decides how to retrieve it: which index to use, etc…). So in some cases, it may make more sense to rely on raw SQL than on third-party libraries that abstract the database’s functionality even further.

Whether you use ORMs or raw SQL queries, I’d like to share an efficient way to fetch an entity with all its relationships with a single SQL query.

Database schema

First, let’s use a simple schema for our relational DB as an example. Let’s say we want to create an online CV platform, in which we’ll store some basic information about the user (languages, skills, experiences, …):

The entities we’ll store in our RDBMS

Let’s explore the options we have to retrieve a user with all its relationships (languages, skills, and experiences with their organization).

Option 1: one query per relationship

This option is simple and straightforward: we first fetch the user, then fetch its relationships.

That’s the default ORM behavior: relationships are “lazy-loaded”, meaning we fetch them only when they are accessed programmatically.

// Load the user entity
const user = await User.findOne(id);

// Load the experiences before we can use them...
const experiences = await user.getExperiences();

// Load the languages...
const languages = await user.getLanguages();

The issue with this approach is that we send many queries to the DB (N + 1, N being the number of relationships; 3 in our case, so 4 queries in total). That’s kind of okay when we load a single entity, but it becomes painful when you load the relationships of several entities at once (or the relationships of another relationship, …).
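To make the query count concrete, here’s a small sketch with a mock ORM (the `findUsers`/`getX` names are hypothetical, not a real library) that simply counts the queries issued when relationships are lazy-loaded for several entities:

```javascript
// Hypothetical sketch (mock ORM, not a real library): count the queries
// issued when relationships are lazy-loaded for several entities at once.
let queryCount = 0;
const db = { query: (sql) => { queryCount++; return []; } };

function findUsers(n) {
  db.query(`SELECT * FROM users LIMIT ${n}`); // 1 query for the entities
  return Array.from({ length: n }, (_, id) => ({
    id,
    getExperiences: () => db.query(`SELECT * FROM experiences WHERE user_id = ${id}`),
    getLanguages: () => db.query(`SELECT * FROM languages WHERE user_id = ${id}`),
    getSkills: () => db.query(`SELECT * FROM skills WHERE user_id = ${id}`),
  }));
}

const users = findUsers(10);
for (const u of users) {
  u.getExperiences(); // 1 query per relationship, per entity...
  u.getLanguages();
  u.getSkills();
}
console.log(queryCount); // 31 queries (1 + 10 entities x 3 relationships)
```

Ten entities with three relationships each already cost 31 round-trips to the database.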

Option 2: Join everything and transform on your side

We send one big query to the DB asking it to join all the tables, and on the application side, we de-duplicate the rows and map the columns to the right objects. This is basically what is done when you “eager-load” your entities with an ORM (example with Sequelize):

const user = await User.findOne(id, { include: [Experience, Language, Skill] });

It will generate this kind of query:

SELECT u.id as u_id,
       u.firstname as u_firstname,
       u.lastname as u_lastname,
       u.title as u_title,
       s.name as s_name,
       l.language as l_language,
       l.level as l_level,
       e.id as e_id,
       e.title as e_title,
       e.contract_type as e_contracttype,
       e.start_date as e_startdate,
       e.end_date as e_enddate,
       o.id as o_id,
       o.name as o_name
FROM users u
LEFT JOIN skills s ON u.id = s.user_id
LEFT JOIN languages l ON u.id = l.user_id
LEFT JOIN experiences e ON u.id = e.user_id
LEFT JOIN organizations o ON e.organization_id = o.id
WHERE u.id = $1

In our example, if a user has 3 languages, 2 experiences, and 5 skills, we’ll load 3 x 2 x 5 = 30 rows.
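The result size is a cartesian product of the relationship row counts; the arithmetic can be sketched with a hypothetical helper:

```javascript
// With LEFT JOINs, the result size is the product of the relationship row
// counts (an empty relationship still contributes one all-NULL row).
const joinedRowCount = (counts) =>
  counts.map((c) => Math.max(c, 1)).reduce((a, b) => a * b, 1);

console.log(joinedRowCount([3, 2, 5])); // 30 (3 languages x 2 experiences x 5 skills)
console.log(joinedRowCount([3, 2, 0])); // 6 (no skills: one NULL row per combination)
```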

Here’s an example for a user with 4 skills, 3 languages, and 1 experience (4 x 3 x 1 = 12 rows):

[
  {
    "u_id": 4000001,
    "u_firstname": "Peter",
    "u_lastname": "Test",
    "u_title": "Software Engineer",
    "s_name": "Java",
    "l_language": "EN",
    "l_level": "NATIVE",
    "e_id": 2,
    "e_title": "Software Engineer",
    "e_contracttype": "fulltime",
    "e_startdate": "2020-03-17T23:00:00.000Z",
    "e_enddate": null,
    "o_id": 1,
    "o_name": "The Fake Company"
  },
  {
    "u_id": 4000001,
    "u_firstname": "Peter",
    "u_lastname": "Test",
    "u_title": "Software Engineer",
    "s_name": "Java",
    "l_language": "ES",
    "l_level": "FLUENT",
    "e_id": 2,
    "e_title": "Software Engineer",
    "e_contracttype": "fulltime",
    "e_startdate": "2020-03-17T23:00:00.000Z",
    "e_enddate": null,
    "o_id": 1,
    "o_name": "The Fake Company"
  },
  {
    "u_id": 4000001,
    "u_firstname": "Peter",
    "u_lastname": "Test",
    "u_title": "Software Engineer",
    "s_name": "Java",
    "l_language": "FR",
    "l_level": "NOTIONS",
    "e_id": 2,
    "e_title": "Software Engineer",
    "e_contracttype": "fulltime",
    "e_startdate": "2020-03-17T23:00:00.000Z",
    "e_enddate": null,
    "o_id": 1,
    "o_name": "The Fake Company"
  },
  {
    "u_id": 4000001,
    "u_firstname": "Peter",
    "u_lastname": "Test",
    "u_title": "Software Engineer",
    "s_name": "PostgreSQL",
    "l_language": "EN",
    "l_level": "NATIVE",
    "e_id": 2,
    "e_title": "Software Engineer",
    "e_contracttype": "fulltime",
    "e_startdate": "2020-03-17T23:00:00.000Z",
    "e_enddate": null,
    "o_id": 1,
    "o_name": "The Fake Company"
  },
  // 8 more...
]

This approach has a few drawbacks:

  • We have to list all the columns manually and alias them to prevent collisions in column names (not an issue with an ORM, as it generates the query for us).
  • We load a lot of duplicated data, which means more data going through the network that we’ll need to download and process.
  • It makes pagination more complex to handle as you cannot use the LIMIT clause in the main query. We have to change the FROM clause to something like: FROM (SELECT * FROM users WHERE id > $1 LIMIT 10) u
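The application-side grouping mentioned above can be sketched like this (the `mapRowsToUser` helper is hypothetical; the row shape follows the aliased columns of the query above):

```javascript
// Hypothetical sketch of the application-side work option 2 requires:
// collapse the flat aliased rows into one user object, de-duplicating each
// relationship on a unique column (skill name, language code, experience id).
function mapRowsToUser(rows) {
  if (rows.length === 0) return null;
  const skills = new Map(), languages = new Map(), experiences = new Map();
  for (const r of rows) {
    if (r.s_name != null) skills.set(r.s_name, r.s_name);
    if (r.l_language != null)
      languages.set(r.l_language, { language: r.l_language, level: r.l_level });
    if (r.e_id != null)
      experiences.set(r.e_id, {
        id: r.e_id,
        title: r.e_title,
        organization: { id: r.o_id, name: r.o_name },
      });
  }
  const { u_id, u_firstname, u_lastname, u_title } = rows[0];
  return {
    id: u_id, firstname: u_firstname, lastname: u_lastname, title: u_title,
    skills: [...skills.values()],
    languages: [...languages.values()],
    experiences: [...experiences.values()],
  };
}

// Two rows of the joined result collapse back into one user:
const user = mapRowsToUser([
  { u_id: 4000001, u_firstname: "Peter", u_lastname: "Test", u_title: "Software Engineer",
    s_name: "Java", l_language: "EN", l_level: "NATIVE",
    e_id: 2, e_title: "Software Engineer", o_id: 1, o_name: "The Fake Company" },
  { u_id: 4000001, u_firstname: "Peter", u_lastname: "Test", u_title: "Software Engineer",
    s_name: "Java", l_language: "ES", l_level: "FLUENT",
    e_id: 2, e_title: "Software Engineer", o_id: 1, o_name: "The Fake Company" },
]);
console.log(user.skills.length, user.languages.length, user.experiences.length); // 1 2 1
```

This is roughly the work an ORM does for you when it hydrates eager-loaded entities.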

Option 3: Use the JSON_AGG function

The JSON_AGG function is an aggregation function from PostgreSQL:

Collects all the input values, including nulls, into a JSON array. Values are converted to JSON as per to_json or to_jsonb

SELECT
  u.*,
  json_agg(to_jsonb(l.*)) as languages,
  json_agg(to_jsonb(e.*)) as experiences,
  json_agg(s.name) as skills
FROM users u
LEFT JOIN (
  SELECT e.*, to_json(org) as organization
  FROM experiences e
  INNER JOIN organizations org ON e.organization_id = org.id
) e ON u.id = e.user_id
LEFT JOIN skills s ON u.id = s.user_id
LEFT JOIN languages l ON u.id = l.user_id
WHERE u.id = $1
GROUP BY u.id

We use PostgreSQL’s to_json/to_jsonb functions to convert each row to a JSON object (column => value). The GROUP BY clause is required because we’re aggregating.

This query works and returns a single row, but we can notice 2 problems:

  • When the entity has no relationships, it returns [null]
  • We still have duplicates within the array
{
  "id": 4000002,
  "firstname": "John",
  "lastname": "Doe",
  "title": "S. Software Engineer",
  "languages": [
    {
      "language": "EN",
      "member_id": 4000002,
      "level": "NATIVE"
    },
    {
      "language": "EN",
      "member_id": 4000002,
      "level": "NATIVE"
    },
    {
      "language": "FR",
      "member_id": 4000002,
      "level": "NOTIONS"
    },
    {
      "language": "FR",
      "member_id": 4000002,
      "level": "NOTIONS"
    }
  ],
  "skills": [
    "TypeScript",
    "Java",
    "TypeScript",
    "Java"
  ],
  "experiences": [
    null,
    null,
    null,
    null
  ]
}

We can use DISTINCT to de-duplicate each aggregation and FILTER to remove NULL values. Note that DISTINCT requires an equality operator, which jsonb has but json does not; that’s why we aggregate to_jsonb values:

SELECT
  u.*,
  -- coalesce to '[]': json_agg returns SQL NULL when FILTER removes every row
  coalesce(json_agg(DISTINCT to_jsonb(l.*)) FILTER (WHERE l.id IS NOT NULL), '[]') as languages,
  coalesce(json_agg(DISTINCT to_jsonb(e.*)) FILTER (WHERE e.id IS NOT NULL), '[]') as experiences,
  coalesce(json_agg(DISTINCT s.name) FILTER (WHERE s.id IS NOT NULL), '[]') as skills
FROM users u
LEFT JOIN (
  SELECT e.*, to_json(org) as organization
  FROM experiences e
  INNER JOIN organizations org ON e.organization_id = org.id
) e ON u.id = e.user_id
LEFT JOIN skills s ON u.id = s.user_id
LEFT JOIN languages l ON u.id = l.user_id
WHERE u.id = $1
GROUP BY u.id

Now we get exactly what we wanted:

{
  "id": 4000002,
  "firstname": "John",
  "lastname": "Doe",
  "title": "S. Software Engineer",
  "languages": [
    {
      "language": "EN",
      "member_id": 4000002,
      "level": "NATIVE"
    },
    {
      "language": "FR",
      "member_id": 4000002,
      "level": "NOTIONS"
    }
  ],
  "skills": [
    "TypeScript",
    "Java"
  ],
  "experiences": []
}

This approach allows us to fetch our entities and their relationships in a shape that is very similar to our application models, so we need less work to transform the query results. The query response is also way smaller, so less data needs to go through the network.
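For example, assuming a driver that parses json/jsonb columns into JavaScript values (node-postgres does this by default), almost no mapping is left; the mock row below stands in for a real query result:

```javascript
// With option 3 the row already matches the model shape. The mock object
// below stands in for a row returned by a driver that parses json columns.
const row = {
  id: 4000002,
  firstname: "John",
  lastname: "Doe",
  title: "S. Software Engineer",
  languages: [
    { language: "EN", member_id: 4000002, level: "NATIVE" },
    { language: "FR", member_id: 4000002, level: "NOTIONS" },
  ],
  skills: ["TypeScript", "Java"],
  // json_agg with FILTER yields SQL NULL when every row is filtered out,
  // so guard with ?? [] unless the query already coalesces to '[]'.
  experiences: null,
};

const user = { ...row, experiences: row.experiences ?? [] };
console.log(user.experiences); // []
```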

Note that another approach is to use subqueries, which avoids the need to de-duplicate and filter out null values (let’s call it option 3bis):

WITH experiences_with_org as (
  SELECT e.*, to_json(org) as organization
  FROM experiences e
  INNER JOIN organizations org ON e.organization_id = org.id
)
SELECT
  u.*,
  (SELECT coalesce(json_agg(languages), '[]') FROM languages WHERE user_id = u.id) as languages,
  (SELECT coalesce(json_agg(skills.name), '[]') FROM skills WHERE user_id = u.id) as skills,
  (SELECT coalesce(json_agg(experiences_with_org), '[]') FROM experiences_with_org WHERE user_id = u.id) as experiences
FROM users u
WHERE u.id = $1

Performance

Here’s the benchmark on a PostgreSQL database running on my machine.

Specs:

  • PostgreSQL 14 (Alpine on Docker)
  • macOS 12.6 (Apple M1)
  • 4m users / 1 experience per user / 6 skills per user / 3 languages per user

Fetching a single entity with its relationships:

LEFT JOINS without aggregation (Option 2):
Nested Loop Left Join (cost=1.27..1985.16 rows=1 width=8217) (actual time=0.127..34.841 rows=18 loops=1)
AVG 35ms

LEFT JOINS with aggregation (Option 3):
GroupAggregate (cost=1.27..1985.21 rows=1 width=1922) (actual time=4.518..4.519 rows=1 loops=1)
AVG 17.5ms

Subqueries (Option 3bis):
Index Scan using members_pkey on members m (cost=0.43..1985.10 rows=1 width=1922) (actual time=13.802..13.810 rows=1 loops=1)
AVG 17.5ms

Fetching relationships for 10 entities:

LEFT JOINS without aggregation (Option 2):
Nested Loop Left Join (cost=9.60..2145.18 rows=10 width=8217) (actual time=0.073..8.267 rows=180 loops=1)
AVG 27.5ms

LEFT JOINS with aggregation (Option 3):
GroupAggregate (cost=2514.53..2514.98 rows=10 width=1922) (actual time=43.554..45.864 rows=10 loops=1)
AVG 49.5ms

Subqueries (Option 3bis):
Index Scan using members_pkey on members m (cost=0.43..19775.16 rows=10 width=1922) (actual time=4.035..42.874 rows=10 loops=1)
AVG 62.5ms

Fetching relationships for 100 entities:

LEFT JOINS without aggregation (Option 2):
Nested Loop Left Join (cost=14.53..3482.81 rows=96 width=8217) (actual time=0.167..10.880 rows=1800 loops=1)
AVG 33.5ms

LEFT JOINS with aggregation (Option 3):
GroupAggregate (cost=3852.06..3856.38 rows=96 width=1922) (actual time=35.072..52.790 rows=100 loops=1)
AVG 67.5ms

Subqueries (Option 3bis):
Index Scan using members_pkey on members m (cost=0.43..189770.99 rows=96 width=1922) (actual time=27.606..432.147 rows=100 loops=1)
AVG 475ms

Conclusion

Option 2 (ORM-style eager loading) is the least performant when we retrieve the relationships of a single entity. The response is also much bigger, which costs bandwidth, and it pushes the data processing onto the client. On the other hand, it seems to deliver better results when we fetch relationships for a large result set, though we should also account for the time spent downloading and transforming the data.

Option 3 (left join + json_agg) offers very good performance for loading a single entity with its relationships. It’s a bit slower than the previous option when we fetch more entities at once, but once we account for data transfer and transformation, it should be roughly equivalent.

Option 3bis (subqueries) is as efficient as option 3 for fetching a single entity and has the benefit of being less verbose. On the other hand, it doesn’t scale, so we shouldn’t use it to fetch many entities.

I hope this was useful and that you’ve learned something. I personally enjoy using option 3bis to fetch individual entities with all their relationships.
