LATERAL, a hidden gem in PostgreSQL
LATERAL is a very useful tool that every SQL user should have in his toolbelt. However, it is normally not explained in introductory courses and many get to miss it.
Suppose that we want to find the perfect match between two tables. For example, we could have a table with all students, a table with all schools, and we would like to find the school that is closest to each student. Another example, we have a table of users with some preferences and another table with products, and we want to find the product that is most similar to what they want. Then LATERAL is a lot of times the right tool for you.
Let’s try out to solve the problem with mock data.
First, we start by creating the tables:
CREATE TABLE test_table1 (id1 int PRIMARY KEY, property DOUBLE PRECISION);
CREATE TABLE test_table2 (id2 int PRIMARY KEY, property DOUBLE PRECISION);
INSERT INTO test_table1 (id1, property)
SELECT generate_series, RANDOM() from generate_series (1, 100000);
INSERT INTO test_table2 (id2, property)
SELECT generate_series, RANDOM() from generate_series (1, 100000); CREATE INDEX ON test_table1(property);
CREATE INDEX ON test_table2(property);
Now, as we said, we want to find for each id1 of test_table1, the id2 of test_table2 which is closest, in our case that is going to be ABS(test_table1.property — test_table2.property).
One possible way to do it is using distinct on:
SELECT DISTINCT ON (t1.id1) t1.id1, t2.id2
FROM test_table1 t1, test_table2 t2
ORDER BY t1.id1, ABS(t1.property - t2.property)
But the performance is terrible, after 20min it hadn’t finished, if we do explain on the query:
EXPLAIN SELECT DISTINCT ON (t1.id1) t1.id1, t2.id2 FROM test_table1 t1, test_table2 t2 ORDER BY t1.id1, ABS(t1.property - t2.property) LIMIT 10
'Limit (cost=2177764254.44..2177769254.44 rows=10 width=16)' '
-> Unique (cost=2177764254.44..2227764254.44 rows=100000 width=16)' '
-> Sort (cost=2177764254.44..2202764254.44 rows=10000000000 width=16)' ' Sort Key: t1.id1, (abs((t1.property - t2.property)))' '
-> Nested Loop (cost=0.00..175003332.00 rows=10000000000 width=16)' '
-> Seq Scan on test_table1 t1 (cost=0.00..1541.00 rows=100000 width=12)' '
-> Materialize (cost=0.00..2041.00 rows=100000 width=12)' '
-> Seq Scan on test_table2 t2 (cost=0.00..1541.00 rows=100000 width=12)'
We see that it has to sort a collection of size(test_table1) * size (test_table2) so that is no good.
Lateral is the perfect tool for this kind of problems, it allows to do a query for each row of the first table:
SELECT t1.id1, t2.id2
FROM test_table1 t1,
LATERAL (SELECT tt.id2 FROM test_table2 tt ORDER BY ABS(tt.property - t1.property) LIMIT 1 )t2
It takes 293ms!! More than 6000x faster!! Let’s examine a bit the query inside the lateral clause. We are doing a query for test_table2 where we can reference t1. Wait, that sounds a lot like subqueries, and in fact using one gives exactly the same time.
SELECT t1.id1, (SELECT t2.id2 FROM test_table2 t2 ORDER BY ABS(t1.property - t2.property) LIMIT 1)
FROM test_table1 t1
Also 293ms. However, LATERAL is way more powerful. Suppose we want the two closest locations, we simply need to change the LIMIT 1 for LIMIT 2, a subquery will raise an error because we cannot return more than one row. Suppose that apart of the id2 we also want the value of the difference, we just need to add the value to the lateral clause.
SELECT t1.id1, t2.id2, t2.diff FROM test_table1 t1, LATERAL (SELECT tt.id2, ABS(tt.property - t1.property) diff
FROM test_table2 tt
ORDER BY ABS(tt.property - t1.property) LIMIT 2 )t2
That is impossible with a subquery, you can only return one column.
In summary, whenever you need to match two tables and try to optimize some criterion you should give it a try to LATERAL.
By Gabriel "Fusten" Furstenheim