Update: we have now put together a more comprehensive glossary.
There are lots of guides out there to the basics of execution plans, but a lot are quite scarce on the details — how to interpret particular values, what they really mean, and where the pitfalls are.
We’ve spent a lot of time over the last 18 months learning, clarifying, and downright misinterpreting how each of these fields work — and there’s still further for us to go on that.
But we have come a long way, and I’d like to share the guide that I wish had existed when we started out — a glossary of the most common fields you’ll see on the operations in a query plan, and a detailed description of what each one means.
If you work with query plans often, hopefully you’ll still learn a thing or two by reading this guide start-to-finish, but for people who do less performance analysis, you may want to use it as a reference, to look up fields as and when you need them.
I’ve broken down this glossary into sections, based on which flag to EXPLAIN causes the field to be shown, to make it easier to find your way around:
- Query Structure Fields — always present.
- Estimate Fields — present when the COSTS flag is set.
- Actual Value Fields — present when the ANALYZE flag is set.
- Buffers Fields — present when the BUFFERS flag is set.
- Verbose Fields — present when the VERBOSE flag is set.
Query Structure Fields
These fields represent what the plan will actually do: how the database will process the data and return the results for your query. When applicable, they’ll be present whatever flags you use to generate the query plan.
The operation the node is performing. The best guides I’ve read on what different operation types actually do are Depesz’s series on different operations and the annotations to the code itself — both in the code for node execution and the planner nodes.
The child operations executed to provide input to this operation.
A guide as to why this operation needs to be run in order to facilitate the parent operation. There are six different possibilities:
- Outer is the value you’ll see most often. It means “take in the rows from this operation as input, process them and pass them on”.
- Inner is only ever seen on the second child of join operations, and is always seen there. This is the “inner” part of the loop. ie, for each outer row, we look up its match using this operation.
- Member is used for all children of append and modifyTable nodes, and on bitmap processing nodes like BitmapAnd and BitmapOr operations.
- InitPlan: Used for calculations performed before the query can start, eg a constant referred to in the query or the result of a CTE scan.
- Subquery: The child is a subquery of the parent operation. Since Postgres always uses subquery scans to feed subquery data to parent queries, only ever appears on the children of subquery scans.
- SubPlan: Like a Subquery, represents a new query, but used when a subquery scan is not necessary.
When present, this is a filter used to remove rows.
The important thing to note is that this is a filter in the traditional sense: these rows are read in (either from a data source or another operation in the plan), inspected, and then approved or removed based on the filter.
Although similar in purpose to the “Index Cond” that you see on Index Scans or Index Only Scans, the implementation is completely different. In an “Index Cond”, the index is used to select rows based on their indexed values, without examining the rows themselves at all. In fact, in some cases, you might see an “Index Cond” and a “Filter” on the same operation. You can read more about the difference between Index Conditions and Filters in my post about index efficiency (focusing on multi-column indexes).
Whether or not the operation will be run in a special mode to support parallelism. Some operations need to be aware that they are running in parallel, for example sequential scans need to know they only need to scan a smaller proportion of the table. Other operations can be run on several threads without each one having any knowledge of the others.
The data source being read/written from. Almost always a table name (including when the data is accessed via an index), but can also be a materialised view or foreign data source.
The alias used to refer to the Relation Name object.
These fields are added to nodes whenever the
COSTS flag is set. It is on by default, but you can turn it off.
The estimated total cost of this operation and its descendants. The Postgres query planner often has several different ways it could resolve the same query. It calculates a cost — which is hopefully correlated with the amount of time taken — for each potential plan, and then picks the one with the smallest cost. It’s worth bearing in mind that the costs are unit-free — they’re not designed to convert into time, or disk reads. They’re just supposed to be bigger for slower operations, and smaller for faster ones. You can get a taste of what sort of things they consider by looking at the cost calculation code.
The estimated amount of overhead necessary to start the operation. Note that, unlike “Actual Startup Time”, this is a fixed value, which won’t change for different numbers of rows.
The number of rows the planner expects to be returned by the operation. This is a per-loop average, like “Actual Rows”.
The estimated average size of each row returned by the operation, in bytes.
Actual Value Fields
When you run
EXPLAIN with the
ANALYZE flag set, the query is actually executed — allowing real performance data to be gathered.
The number of times the operation is executed. For a lot of operations it will have a value of one, but when it is not, there are three different cases:
- Some operations can be executed more than once. For example, “Nested Loops” run their “Inner” child once for every row returned by their “Outer” child.
- When an operation that would normally only consist of one loop is split across multiple threads, each partial operation is counted as a Loop.
- The number of loops can be zero when an operation doesn’t need to be executed at all. For example if a table read is planned to provide candidates for an inner join, but there turns out to be no rows on the other side of the join, the operation can be effectively eliminated.
Actual Total Time
The actual amount of time in milliseconds spent on this operation and all of its children. It’s a per-loop average, rounded to the nearest thousandth of a millisecond.
This can cause some odd occurrences, particularly for “Materialize” nodes. “Materialize”operations persist the data they receive in memory, to allow for multiple accesses, counting each access as a loop. They often get that data from a data read operation, which is only executed once, like in this example:
"Node Type": "Materialize",
"Actual Loops": 9902,
"Actual Total Time": 0.000,
"Node Type": "Seq Scan",
"Actual Loops": 1,
"Actual Total Time": 0.035}]}
As you can see, the total time spent across all loops on the “Materialize” node and its child “Seq Scan” is less than 9902 × 0.0005 = 4.951ms, so the per-loop “Actual Total Time” value is less than 0.0005, and is rounded to zero.
So we’re in an odd situation where the “Materialize” node and its children seem to take 0ms in total, while it has a child that takes 0.035ms to execute, implying that the “Materialize” operation on its own somehow takes negative time!
Because the rounding is to such a high degree of accuracy, these issues usually occur only with very fast operations, rather than the slow ones that are often your focus. Still, in the future, I’d love to see query plans also include a value stating how much the “total total” (not the per-loop total), to avoid any of these annoying niggles with rounding, which can be noticable when the number of loops is high.
Actual Startup Time
When we started out, I thought that this was a constant for any operation, so if you halved the number of rows on a linear operation, then this would remain the same, while the “Actual Total Time” would decrease by half the difference between it and the “Actual Startup Time”.
In reality, this is the amount of time it takes to get the first row out of the operation. Sometimes, this is very close to the setup time — for example on a sequential scan which returns all the rows in a table.
Other times, though, it’s more or less the total time. For example, consider a sort of 10,000 rows. In order to return the first row, you have to sort all 10,000 rows to work out which one comes first, so the startup time will be almost equal to the total time, and will vary dramatically based on how many rows you have.
The number of rows returned by the operation per loop.
The important thing to notice about this is that it’s an average of all the loops executed, rounded to the nearest integer. This means that while “Actual Loops” × “Actual Rows” is a pretty good approximation for total number of rows returned most of the time, it can be off by as much as half of the number of loops. Usually it’s just slightly confusing as you see a row or two appear or disappear between operations, but on operations with a large number of loops there is potential for serious miscalculation.
Rows Removed by Filter
The rows removed by the “Filter”, as described above. Like most of the other “Actual” values, this is a per-loop average, with all the confusion and loss of accuracy that entails.
These values describe the memory/disk usage of the operation. They’ll only appear when you generate a plan with the
BUFFERS flag set, which defaults to off - although it’s so useful that there is a movement afoot to turn it on by default when
ANALYZE is enabled.
Each of the ten buffers keys consist of two parts, a prefix describing the type of information being accessed, and a suffix describing how it has been read/written. Unlike the other actual values, they are total values, ie not per-loop, although they still include values for their child operations.
There are three prefixes:
- Shared blocks contain data from normal tables and indexes.
- Local blocks contain data from temporary tables and indexes (yes, this is quite confusing given that there is also a prefix “Temp”).
- Temp blocks contain short-term data used to calculate hashes, sorts, Materialize operations, and similar cases.
There are four suffixes:
- Hit means that the block was found in the cache, so no full read was necessary.
- Read blocks were missed in the cache and had to be read from the normal source of the data.
- Dirtied blocks have been modified by the query.
- Written blocks have been evicted from the cache.
So for example “Shared Hit Blocks” is the number of blocks read from cached indexes/tables.
I/O Write Time, I/O Read Time
These are also obtainable through buffers if you set a config option to enable them. If you are a superuser, you can collect these values during the current session by executing
SET track_io_timing = on;
VERBOSE is a bit of a funny flag — it’s a grab-bag of different extra pieces of information that don’t necessarily have much in common.
A summary of the operation detail broken down by thread. Note that, like the “Workers Planned” and “Workers Launched” values for a parallelised operation, these entries don’t include the main thread (although you can work out the values for the main thread by subtracting the thread totals from the values of the field on the operation).
The columns returned by the operation.
The schema of the “Relation Name” object.
Hopefully this has gone some way to explaining what those mysterious values on query plans really mean. There are plenty more operation-specific fields which you’ll see more or less often, but these are the core ones you’ll see every time you look at a plan.
Did I miss anything? Make any terrible mistakes? Anything you’d like to hear more about? Let me know, in the comments or at dave[at]pgmustard[dot]com!