Explain SQL command in PostgreSQL

Rain Wu
Random Life Journal
5 min read · Sep 29, 2020


Driven by curiosity about database tuning, I took some time this week to study how the database planner chooses a query strategy. This gave me great inspiration and made me pay more attention to how I construct complex SQL.


Here I will write down some notes about several useful commands for displaying the query plan and its details in PostgreSQL.

SQL Command

Whenever we enter a SQL command, it goes through the steps below:

  • parse
    check whether the command is lexically and syntactically correct; this makes sure it can be understood by the following steps.
  • plan
    generate several candidate query plans for the SQL command, and pick the one the planner considers best.
  • execute
    run the query plan chosen by the planner, modifying the database or returning the requested data.
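The three stages can be observed even from client code. Here is a minimal sketch using Python's built-in sqlite3 module (not PostgreSQL, but the pipeline is the same): a syntax error is rejected at parse time, EXPLAIN exposes the plan without running anything, and only a valid statement reaches execution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruit (fruit_id INTEGER, name TEXT, price INTEGER, state TEXT)")

# 1. parse: a lexically/syntactically broken command never reaches planning.
try:
    conn.execute("SELEC * FROM fruit")  # typo in SELECT
except sqlite3.OperationalError as err:
    print("parse error:", err)

# 2. plan: EXPLAIN QUERY PLAN shows the chosen strategy without executing it.
for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM fruit"):
    print("plan node:", row)

# 3. execute: the chosen plan runs and returns the data.
print("rows:", conn.execute("SELECT count(*) FROM fruit").fetchone()[0])
```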

Most performance issues arise in the planning step. If we write our SQL commands without any awareness of it, or just concatenate clauses the way an ORM does, the resulting query plan may not perform well.


But the official PostgreSQL documentation does not show how its example tables were created, so I decided to use the data from the SQLite documentation for my practice :)))

practice=# SELECT * FROM fruit;
 fruit_id |    name    | price | state
----------+------------+-------+-------
        1 | orange     |    85 | FL
        2 | apple      |    45 | NC
        3 | peach      |    60 | SC
        4 | grape      |    80 | CA
        5 | lemon      |   125 | FL
        6 | strawberry |   245 | NC
        7 | orange     |   105 | CA
(7 rows)
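To follow along without a PostgreSQL instance, this sketch rebuilds the same sample table with Python's sqlite3 module; the table and column names are taken from the output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fruit (
        fruit_id INTEGER PRIMARY KEY,
        name     TEXT,
        price    INTEGER,
        state    TEXT
    )
""")
rows = [
    (1, "orange", 85, "FL"), (2, "apple", 45, "NC"),
    (3, "peach", 60, "SC"), (4, "grape", 80, "CA"),
    (5, "lemon", 125, "FL"), (6, "strawberry", 245, "NC"),
    (7, "orange", 105, "CA"),
]
conn.executemany("INSERT INTO fruit VALUES (?, ?, ?, ?)", rows)
print(conn.execute("SELECT count(*) FROM fruit").fetchone()[0])  # 7
```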

Explain

The EXPLAIN command describes how PostgreSQL constructs the plan and the expected cost of execution.

practice=# EXPLAIN SELECT * FROM fruit;
QUERY PLAN
----------------------------------------------------------
Seq Scan on fruit (cost=0.00..13.80 rows=380 width=184)
(1 row)

It will extract all records from the table fruit through a sequential scan, followed by some estimated statistics. The first number is the startup cost (0.00), an estimate of how much preparation is needed before the plan can start emitting rows; the second (13.80) is the total cost, which the planner tries to keep as low as possible.

rows=380 indicates that the plan expects 380 rows of output if all nodes execute successfully, and width=184 is the estimated average width, in bytes, of the returned rows.

Now append a WHERE clause:

practice=# EXPLAIN SELECT * FROM fruit WHERE name = 'lemon';
QUERY PLAN
--------------------------------------------------------
Seq Scan on fruit (cost=0.00..14.75 rows=2 width=184)
Filter: ((name)::text = 'lemon'::text)
(2 rows)

Although the number of output rows decreases, a sequential scan is still needed. The total cost is a little higher this time, which reflects the extra work of string comparison.
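In PostgreSQL, adding an index on name could let the planner replace the Seq Scan with an index scan; SQLite's planner behaves analogously, which this self-contained sketch demonstrates (table contents abbreviated, plan-detail strings are examples and vary by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruit (fruit_id INTEGER, name TEXT, price INTEGER, state TEXT)")
conn.executemany("INSERT INTO fruit VALUES (?, ?, ?, ?)",
                 [(1, "orange", 85, "FL"), (5, "lemon", 125, "FL")])

query = "SELECT * FROM fruit WHERE name = 'lemon'"

# Without an index, the planner can only scan the whole table.
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# With an index on name, the planner switches to an index search.
conn.execute("CREATE INDEX fruit_name_idx ON fruit (name)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(before[0][-1])  # e.g. "SCAN fruit"
print(after[0][-1])   # e.g. "SEARCH fruit USING INDEX fruit_name_idx (name=?)"
```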

A more realistic situation

Obviously, a production table is unlikely to hold fewer than ten records. For the simulation, we need a tool that injects data into our tables programmatically. You can use any ORM or just write shell scripts; I picked Django as the data-injection tool.

Here are two simple model classes, which will become the table schemas for the simulation.

And here is the record-injection script; we can run external Python scripts via a module called django-extensions. After execution, each table has 11,000 rows, which we can check with the following SQL:

practice=# SELECT count(*) AS extract_count FROM inject_fruit;
 extract_count
---------------
         11000
(1 row)
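The original injection script used Django models and django-extensions (the embedded gist is not reproduced here). As a self-contained stand-in, this sqlite3 sketch injects 11,000 random rows per table; the id/name/price schema is an assumption consistent with the plan output below.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
names = ["banana", "apple", "orange", "lemon", "peach"]

for table in ("inject_fruit", "inject_vegetable"):
    conn.execute(
        f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, name TEXT, price INTEGER)"
    )
    # Inject 11,000 rows of random produce per table.
    rows = [(random.choice(names), random.randint(10, 300)) for _ in range(11000)]
    conn.executemany(f"INSERT INTO {table} (name, price) VALUES (?, ?)", rows)

print(conn.execute("SELECT count(*) FROM inject_fruit").fetchone()[0])  # 11000
```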

If the SQL command spans multiple tables, a join node is used:

practice=# EXPLAIN SELECT * FROM inject_fruit f, inject_vegetable v WHERE f.name IN ('banana', 'apple') AND f.price < v.price;
Nested Loop (cost=0.00..595221.51 rows=13218333 width=37)
Join Filter: (f.price < v.price)
-> Seq Scan on inject_vegetable v (cost=0.00..179.00 rows=11000 width=18)
-> Materialize (cost=0.00..226.53 rows=3605 width=19)
-> Seq Scan on inject_fruit f (cost=0.00..208.50 rows=3605 width=19)
Filter: ((name)::text = ANY ('{banana,apple}'::text[]))

The structure of the plan nodes is described completely: the root is a nested loop that traverses the cross-table comparison; its inner side materializes the result of a sequential scan on inject_fruit, whose filter checks whether the value of the name field is inside the given set.
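Conceptually, the Materialize node caches the filtered inner rows so the nested loop can rescan them cheaply for every outer row. A rough Python analogue of the plan above, with toy data standing in for the injected tables:

```python
# Toy stand-ins for inject_fruit and inject_vegetable: (name, price) pairs.
fruits = [("banana", 50), ("apple", 120), ("peach", 60)]
vegetables = [("carrot", 40), ("kale", 130)]

# Seq Scan + Filter on inject_fruit: keep only the requested names ...
# Materialize: ... and cache the result in a list so it can be rescanned.
materialized = [f for f in fruits if f[0] in ("banana", "apple")]

# Nested Loop with Join Filter (f.price < v.price): for every outer row,
# rescan the cached inner rows instead of re-reading the table.
result = [(f, v) for v in vegetables for f in materialized if f[1] < v[1]]
print(len(result))  # 2
```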

Image from Unsplash

But the structure of the plan nodes is not always the same; it is simply the best strategy picked by the query planner, which means even minor changes to a SQL command can make the entire plan tree completely different.

practice=# EXPLAIN SELECT * FROM inject_fruit f, inject_vegetable v WHERE f.name IN ('banana', 'apple') AND f.price = v.price;
Hash Join (cost=316.50..56620.39 rows=4958836 width=37)
Hash Cond: (f.price = v.price)
-> Seq Scan on inject_fruit f (cost=0.00..208.50 rows=3605 width=19)
Filter: ((name)::text = ANY ('{banana,apple}'::text[]))
-> Hash (cost=179.00..179.00 rows=11000 width=18)
-> Seq Scan on inject_vegetable v (cost=0.00..179.00 rows=11000 width=18)

After changing the last comparison from less-than to equality, the planner selects a Hash Join as the root node. This is definitely reasonable: hashing one side up front and probing for matching values saves the effort of materializing the cross-table relationship.
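The intuition behind the planner's choice can be sketched in plain Python: an equality predicate lets us probe a hash table (a dict) in roughly constant time per outer row, instead of comparing every pair as a nested loop must.

```python
from collections import defaultdict

# Toy (name, price) rows standing in for the two tables.
fruits = [("apple", 120), ("banana", 50), ("apple", 40)]
vegetables = [("carrot", 40), ("kale", 120), ("pea", 120)]

# Nested loop join: compares every (fruit, vegetable) pair -> O(n * m).
nested = [(f, v) for f in fruits for v in vegetables if f[1] == v[1]]

# Hash join: build a hash table on the inner side's join key once,
# then probe it for each outer row -> roughly O(n + m).
buckets = defaultdict(list)
for v in vegetables:
    buckets[v[1]].append(v)          # Hash node: hash the inner rows on price
hashed = [(f, v) for f in fruits for v in buckets[f[1]]]  # probe for each outer row

assert sorted(nested) == sorted(hashed)  # same result, different cost
```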

However, it spends some startup cost (316.50 in the output above) on preparation, mainly the hashing process, I assume. But the total cost decreases dramatically (56620.39 versus 595221.51 before), which demonstrates the value of query planning.


For statistics from a real execution, PostgreSQL provides the EXPLAIN ANALYZE command, which actually runs the SQL and reports what happened.

practice=# EXPLAIN ANALYZE SELECT * FROM inject_fruit f, inject_vegetable v WHERE f.name IN ('banana', 'apple') AND f.price = v.price;
Hash Join (cost=316.50..56620.39 rows=4958836 width=37) (actual time=155.553..33169.077 rows=4956566 loops=1)
Hash Cond: (f.price = v.price)
-> Seq Scan on inject_fruit f (cost=0.00..208.50 rows=3605 width=19) (actual time=0.020..31.075 rows=3605 loops=1)
Filter: ((name)::text = ANY ('{banana,apple}'::text[]))
Rows Removed by Filter: 7395
-> Hash (cost=179.00..179.00 rows=11000 width=18) (actual time=155.294..155.300 rows=11000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 681kB
-> Seq Scan on inject_vegetable v (cost=0.00..179.00 rows=11000 width=18) (actual time=0.019..75.273 rows=11000 loops=1)
Planning Time: 0.131 ms
Execution Time: 64030.084 ms

Some bias may exist in the estimated numbers, such as time and row counts; it is simply a question of whether we are willing to spend the time actually executing the SQL command to obtain accurate information.

Conclusion

The query plan plays an important role in performance, but it also depends greatly on how we construct our SQL commands. Understanding how the query planner selects strategies and how the plan nodes are constructed can make our database-tuning workflow much more effective.

