Row count estimates in Postgres

Why the query planner makes estimates

The query planner has many choices to make, especially when you consider all the different operations and data structures available to Postgres. For example, the documentation for EXPLAIN discusses at some length how the same information might be read differently depending on what it will be used for and how much of it is needed. Making good row count estimates is therefore core to what the query planner does.


How the query planner makes estimates

Postgres collects statistics about your table’s contents whenever you run the ANALYZE command, or run VACUUM with the ANALYZE option (confusingly, these are completely different to the ANALYZE option of EXPLAIN, and share absolutely nothing but the name). ANALYZE is also run automatically by the autovacuum daemon, which is enabled by default.

These statistics include things like the most common values in a column and the distribution of values, which are calculated from a sample of the table’s rows and stored in the database (check out the pg_stats view if you’re curious). They’re used when the query planner needs to calculate how many rows from your data will pass a given filter.
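If you’re curious what the planner has to work with, you can query pg_stats directly. As a minimal sketch, using hypothetical table and column names:

SELECT most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'table_1' AND attname = 'column_2';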

What happens when they are bad

Fortunately, it’s easy to work out when the planner’s row estimates were inaccurate. Just get a copy of the query plan. For each operation it will list the estimated (“plan”) number of rows, and how many there actually were:

[
  {
    "Plan": {
      "Node Type": "Seq Scan",
      ...
      "Plan Rows": 208,
      "Actual Rows": 40154,
      ...
    },
    ...
  }
]
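To get a plan in this format yourself, wrap your query in EXPLAIN with the ANALYZE and FORMAT JSON options. A minimal sketch, using a hypothetical table and filter:

EXPLAIN (ANALYZE, FORMAT JSON)
SELECT * FROM table_1 WHERE column_2 = 42;

Bear in mind that the ANALYZE option actually executes the query, so take care with statements that modify data.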

In pgMustard, we flag up cases where the row count estimates are out by a factor of 10 or more. That doesn’t mean an inaccurate estimate is always catastrophic; the query planner might still have made a sensible decision. But if the estimates are out by more than an order of magnitude and you can’t work out why your query is taking so long, it might be worth a look.

How not to fix your plan

Although it’s possible to achieve a short-term performance improvement by explicitly specifying aspects of the query plan, it can leave your code fragile. The query planner will be less able to adjust to changes in the structure or volume of your data, and may even be unable to take advantage of performance improvements in future versions of Postgres.
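For example, one blunt short-term fix is to discourage a whole category of plan node for the current session, something like:

SET enable_seqscan = off;

This strongly discourages sequential scans, which might get you an index scan today, but it applies to every query in the session and won’t adapt as your data changes.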

I’m not ruling out the idea that it might sometimes be necessary to restrict the query planner’s choices to achieve optimal performance. However, in general it’s more robust to guide the query planner to the right decision by improving the quality of the information it works from. The query planner was written by some terrifyingly bright people with, as you’d expect, a pretty decent understanding of the ins and outs of Postgres performance. As long as it has the right information, it will usually make the correct decisions.

How to help the query planner out

A good first step is to run VACUUM with the ANALYZE option on the tables involved. This will clean up any dead rows (VACUUM) and trigger a recalculation of the statistics (ANALYZE). If the statistics are out of date, perhaps due to recent writes or deletes, then hopefully the refreshed values will be more accurate.
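For our hypothetical table, that looks like:

VACUUM ANALYZE table_1;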

If recalculating the statistics doesn’t work, then you can try increasing the amount of information gathered; the increased level of detail can lead to better estimates.

You can do this for all tables by increasing the default_statistics_target configuration parameter from its default value of 100. Or you can target a particular column using ALTER TABLE, something like:

ALTER TABLE table_1 ALTER column_2 SET STATISTICS 400;
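The new target only applies from the next time statistics are gathered for that table, so follow it up with:

ANALYZE table_1;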

Multivariate statistics

By default, statistics are gathered on each column individually, so the planner assumes that the values in different columns are independent. Consider a query like:

SELECT *
FROM people
WHERE vegetarian = true AND favourite_dip = 'hummus';

Say 14% of people are vegetarian, and 25% of people prefer hummus. Then the query planner would expect 25% of 14% = 3.5% of people to be vegetarians who like hummus. But if vegetarians are significantly more likely to prefer hummus than the population in general, then this number could be much higher.


Multivariate statistics are an attempt to solve this problem. Statistics are gathered on the columns as a group, so that Postgres can understand correlations or relationships between columns.

Multivariate statistics are not computed automatically, but if you suspect that your query plan estimates are poor because of correlations between your columns, you can read more about creating multivariate statistics in the Postgres documentation.
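As a rough sketch for the hummus example above, you could create a statistics object on the two columns (the object name here is arbitrary) and then re-gather statistics:

CREATE STATISTICS people_veg_dip ON vegetarian, favourite_dip FROM people;
ANALYZE people;

With those in place, the planner can spot that the two columns are correlated, rather than simply multiplying their selectivities together.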
