PostgreSQL: one experiment on comparison of MergeJoin, HashJoin and NestLoop
Discovering data skew and behaviour of the PostgreSQL optimizer, I found strange EXPLAINs. Just look at this.
Here is a simple query:
Find all women with the ‘Tractor Driver’ occupation from the Chelyabinsk region. In Russia, a tiny fraction of women usually work as truck or tractor drivers. So, querying such data we provoking data skew, not clear for the planner.
-- Generate data on tractor drivers gender in Russian homesteads.
DROP TABLE IF EXISTS people,address CASCADE;
CREATE TABLE people (
id serial,
occupation_id integer,
is_woman boolean
);
INSERT INTO people (occupation_id, is_woman)
SELECT x % 100,
CASE
-- Woman has this occupation with veeeery tiny probability
WHEN (x % 100 = 13) THEN
CASE WHEN (random()<0.001) THEN true ELSE false END
ELSE
-- As a rule, we have a lot of women working on farms
CASE WHEN (random()<0.98) THEN true ELSE false END
END
FROM generate_series(1,1E5) AS x;
-- At least one such woman has to exist.
INSERT INTO people (occupation_id, is_woman) VALUES (13, true);
-- Create dummy table to make a join
CREATE TABLE address (
id serial,
region_id integer,
payload text
);
INSERT INTO address(region_id,payload)
SELECT x % 89, ‘aaaaaaaaaaaaaaaaaaaaaa’
FROM generate_series(1,1E6) AS x;
-- This index adds parameterized NestLoop into account
-- and increases chance of a MergeJoin
CREATE INDEX ON address(id);
ANALYZE people,address;
explain analyze
SELECT * FROM people t1
JOIN address t2 ON (t1.id = t2.id)
WHERE occupation_id = 13 AND is_woman = true AND region_id = 74;
Here I anticipate seeing an overestimation of scanning the table of people and getting a HashJoin plan (Keep in mind, parameterized NestLoop should be much better). But I’ve got the next:
Merge Join (cost=1841.78..5444.30 rows=11 width=40) (actual time=62.755..62.756 rows=0 loops=1)
Merge Cond: (t2.id = t1.id)
-> Index Scan using address_id_idx on address t2 (cost=0.42..35836.43 rows=10667 width=31) (actual time=0.044..44.171 rows=1124 loops=1)
Filter: (region_id = 74)
Rows Removed by Filter: 98897
-> Sort (cost=1841.36..1843.88 rows=1009 width=9) (actual time=18.444..18.446 rows=1 loops=1)
Sort Key: t1.id
Sort Method: quicksort Memory: 25kB
-> Seq Scan on people t1 (cost=0.00..1791.01 rows=1009 width=9) (actual time=18.436..18.437 rows=1 loops=1)
Filter: (is_woman AND (occupation_id = 13))
Rows Removed by Filter: 100000
Planning Time: 0.347 ms
Execution Time: 62.792 ms
Why is the cost of the top-level Merge Join node less than the underlying Index Scan? Ok, go further: disable MergeJoin path and try again:
SET enable_mergejoin = ‘off’;
Nested Loop (cost=0.42..9442.11 rows=11 width=40) (actual time=23.726..23.727 rows=0 loops=1)
-> Seq Scan on people t1 (cost=0.00..1791.01 rows=1009 width=9) (actual time=23.682..23.683 rows=1 loops=1)
Filter: (is_woman AND (occupation_id = 13))
Rows Removed by Filter: 100000
-> Index Scan using address_id_idx on address t2 (cost=0.42..7.57 rows=1 width=31) (actual time=0.037..0.037 rows=0 loops=1)
Index Cond: (id = t1.id)
Filter: (region_id = 74)
Rows Removed by Filter: 1
Planning Time: 0.351 ms
Execution Time: 23.773 ms
We have a much better plan now. But where is the HashJoin node hidden? Let me be more dogged, disable NestLoop and try again:
SET enable_nestloop = ‘off’;
Hash Join (cost=1803.62..21696.74 rows=11 width=40) (actual time=194.681..194.683 rows=0 loops=1)
Hash Cond: (t2.id = t1.id)
-> Seq Scan on address t2 (cost=0.00..19853.00 rows=10667 width=31) (actual time=0.034..169.033 rows=11236 loops=1)
Filter: (region_id = 74)
Rows Removed by Filter: 988764
-> Hash (cost=1791.01..1791.01 rows=1009 width=9) (actual time=23.663..23.664 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on people t1 (cost=0.00..1791.01 rows=1009 width=9) (actual time=23.655..23.656 rows=1 loops=1)
Filter: (is_woman AND (occupation_id = 13))
Rows Removed by Filter: 100000
Planning Time: 0.191 ms
Execution Time: 194.729 ms
What is happening with this bloody planner? HashJoin here is the worst plan because of the slow SeqScan operation. But why theSeqScan node executes 169s in comparison with IndexScan’s 44s? They both scan the same table and get the same number of tuples. Moreover, the IndexScan should see an index relation and, afterwards, go to the heap. What is the problem here? But anyway, we can quickly improve this situation by adding an index on the column region_id and trying again.
CREATE INDEX ON address(region_id);
ANALYZE people,address;
Still disabling MergeJoin and NestLoop paths, we have:
Hash Join (cost=1931.51..9802.97 rows=11 width=40) (actual time=45.292..45.295 rows=0 loops=1)
Output: t1.id, t1.occupation_id, t1.is_woman, t2.id, t2.region_id, t2.payload
Hash Cond: (t2.id = t1.id)
-> Bitmap Heap Scan on public.address t2 (cost=128.26..7957.11 rows=11333 width=31) (actual time=6.546..19.210 rows=11236 loops=1)
Output: t2.id, t2.region_id, t2.payload
Recheck Cond: (t2.region_id = 74)
-> Bitmap Index Scan on address_region_id_idx (cost=0.00..125.42 rows=11333 width=0) (actual time=3.540..3.541 rows=11236 loops=1)
Index Cond: (t2.region_id = 74)
-> Hash (cost=1791.01..1791.01 rows=979 width=9) (actual time=23.295..23.296 rows=1 loops=1)
Output: t1.id, t1.occupation_id, t1.is_woman
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on public.people t1 (cost=0.00..1791.01 rows=979 width=9) (actual time=23.287..23.288 rows=1 loops=1)
Output: t1.id, t1.occupation_id, t1.is_woman
Filter: (t1.is_woman AND (t1.occupation_id = 13))
Rows Removed by Filter: 100000
Planning Time: 0.238 ms
Execution Time: 45.394 ms
Finally, HashJoin is much faster than MergeJoin, but the cost is still worse. What is the problem?
In toto, we have two issues:
- IndexScan is much faster than SeqScan on the same data.
- MergeJoin total cost is less than its subtree total cost.