MTM on arrays in PostgreSQL

AlekseyL
AlekseyL
Sep 30, 2017 · 5 min read

Since the time I discover arrays in PostgreSQL as a data type, I keep on wondering what would happend with a many to many relationship done through arrays instead of a mtm intermediate table. Of course at a scale of 100K it looks like nothing interesting, but this is a wrong assumtion, even at a scale of 100K * 100K with intence linking between objects we can see some interesting trends.

Many to many

In systems analysis, a many-to-many relationship is a type of cardinality that refers to the relationship between two entities[1] A and B in which A may contain a parent instance for which there are many children in B and vice versa.

For example, think of A as Authors, and B as Books. An Author can write several Books, and a Book can be written by several Authors.

In a relational database management system, such relationships are usually implemented by means of an associative table (also known as cross-reference table), say, AB with two one-to-many relationships A -> AB and B -> AB. In this case the logical primary key for AB is formed from the two foreign keys (i.e. copies of the primary keys of A and B).

Simple data model

Assume we have private docs and users. Each doc could be shared to many users, and each user could have access to many shared docs. Lets connect them through a table of granted accesses. Lets assume that each doc shared accross 1K users.

In numbers:

If we have 100K docs and 100K users then total granted accesses will equal 100M records.

Size of table granted_accesses in bytes will b: 100M * 3 * sizeof(bigint) bytes ~ 2.4Gb, where 3 is not the magic number but columns count: user_id, doc_id and primary key. ( This calculation misses the iternal rows info, see “UPD” sections below )

If we need bidirectional searches we will need at least two indexes [doc_id, user_id] and [user_id, doc_id], each will bring +2.4Gb on it’s own ( cause it will contain same amount of data as pure row data )+ a primary key index as usual ~ +0.8Gb ( actually I see 2Gb at my test table, but I recreated it couple of times, so lets assume lowest predictable value).

UPD: actually every index also contain additional system info per record, so this would be even more than above numbers.

Total size for cross-table + indexes will be near 8Gb!

UPD: Thanks to Maciej Bąk comment below! Actually this calculation doesn’t includes tuple headers, which brings +23 bytes per row! Meaning that table size would be more than 10 Gb!

Adding two array columns, with 1K of bigints per row, to users and docs tables, may bring 0.8 Gb each, but since a pg_toast compress such data it may be less than 0.8 Gb ( I see ~400Mb+ growth on my set, but it may not be an issue in your cases ).

So total size of needed data will be ~ 1.6Gb or less.

Perfomance

Related records without JOIN

--with arrays
SELECT "users".*
FROM "users"
WHERE "users"."id" IN (
SELECT unnest(user_ids)
FROM "docs"
WHERE "docs"."id" = ID
)
--with mtm table
SELECT "users".*
FROM "users"
WHERE "users"."id" IN (
SELECT "granted_accesses"."user_id"
FROM "granted_accesses"
WHERE "granted_accesses"."doc_id" = ID
)

1.25 — 1.5 times faster when using array.

JOINS

--mtm-table
SELECT * FROM cards
JOIN granted_accesses ON doc_id = docs.id
JOIN users ON users.id = user_id
WHERE docds.id = ID;
--arr
SELECT * FROM cards
JOIN users ON users.id = ANY( docs.user_ids )
WHERE docs.id = ID;

1.4–1.7 times in favor for array representation

This is a simple cases and still arrays shows better results, though they both did it in a couple miliseconds.

Don’t forget to run ANALYZE after populating intermediate table otherwise postgres may skip Index Only scans! And yeah, this numbers are for IndexOnly scans, not bad for array!

Complex searches

In some cases an intermediate table may bring a better result, for example in this query:

--mtm
SELECT * FROM docs
JOIN granted_accesses ON card_id = docs.id
WHERE doc_id IN (?) AND user_id IN(?)
--arr
SELECT * FROM docs
WHERE id IN (?) AND user_ids && ARRAY[?]::bigint[];

With arrays, pg planner cannot bring nothing other than a docs.id index scan with a filter on user_ids. In contrary mtm search can optimize this case and win 1.7 times boost on medium sets (100–1000 elements), but still mtm is slower on small and large data sets: ≤ 10, ≥ 2000.

Although the only way to overperfom array-based queries for join is to stay in doc_id + user_id conditions only, adding just one more indexed column restriction may easily restore arrays supremacy on whole range of cases. Such query may look something like this:

--mtm
SELECT * FROM docs
JOIN granted_accesses ON doc_id = cards.id
WHERE doc_id IN (?) AND user_id IN(?) AND indexed_rare_key = ?
--arr
SELECT * FROM docs
WHERE id IN (?) AND user_ids && ARRAY[?]::bigint[] AND indexed_rare_key = ?;

There is one more case when array wins not by a split decision, but with KO: it’s a disproportional or asymmetric relation. In one of my test data sets there were 1K elements on one side of relation and 3–5 elements on the other, and I needed a full text search combined with restrictions on this many-to-many relation. I get 20–30 times perfomance boost in a favor of an array over a mtm table, with an appropriate GIN index of course.

Small remark: GIN doesn’t look good on a scale of thousands (i.e. don’t use it when arrays in a column contain thousands elements ) and useless for the first example, but works fine on tens.

Restrictions/problems

There are two main restrictions I can think of: a referntial integrity and an ORM support.

If you want to be sure, and keep a referential integrity — you must do it yourself by creating triggers in your DB. Less safe but usually cheaper: callbacks on ORM models.

--Don't forget to use array functions, for example:
UPDATE cards
SET user_ids = array_remove(user_ids, ?::bigint)
WHERE ...
#or
user.docs.update_all('user_ids = array_remove(user_ids,?)', user.id)

Another problem is that your ORM most likely will not support array based mtm relation out of the box, so you’ll need to build custom workarounds for all functionality you need.

One more thing to mention — you must have arrays on both sides of a mtm relation, you can’t just store user_ids on a doc, without storing doc_ids on a user, it will make unacceptably slow any reversed operations.

Summary

Clear JOIN is fast and perfomance tradeoff on simple queries is a relatively small (1.25–1.7 times in favor of array) this worth nothing at milisecond scale. But on more complex searches even 1.5 times perfomance boost of an array solution may become noticeable, not to mention special cases when an array just KO a mtm table with 20–30 times boost.

Memory achievement is huge, 6 times less memory in a favor for arrays, this may not be an issue when relations between tables is relatively rare and a referential table has an appropriate size, but when it’s not, size of an intermediate table and its indexes will go through the roof.

Resume

On a small or even medium data scale, a memory overgrowth and perfomance loses may not take over of the usablity and comfort of an ORM mtm support. But if you going BIG or COMPLEX or asymmetric — I strongly suggest you to consider arrays as a mtm basis over the intermediate table, at least fork your data and code, and test your top slowest query on a different mtm basis.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade