Row count estimates in Postgres

Published in

pgMustard

5 min readDec 14, 2018

Why the query planner makes estimates

When Postgres plans how it’s going to execute your query, it has a bunch of implementation choices to make. Some of these depend on just how much data it’s going to be dealing with — for example, there are several different sort algorithms available, and one of the major factors that affects which one is the fastest is how many things you’re sorting.

There are even more choices to be made when you consider all the different operations and data structures available to Postgres. For example, the documentation for EXPLAIN talks at some length about how the same information might be read differently based on what it will be used for and how much of it might be needed. So making good row count estimates is core to what the query planner does.

An example with four row…ers. Picture by **Matteo Vistocco**

How the query planner makes estimates

Sometimes this is simple — if the operation is going to read a whole table, then the output will be the number of rows. But when you throw filters into the mix by adding a WHERE clause, things can get a bit trickier. How many rows in your table obey the constraint WHERE cost > 10? And how many WHERE f1 @> polygon ‘(0.5,2.0)’? It can be pretty hard to guess!

Postgres collects some statistics about your table’s contents whenever you run the ANALYZE command, or run VACUUM with the ANALYZE option (confusingly, these are completely different to the ANALYZE option to EXPLAIN,and share absolutely nothing but the name). This is triggered by the autovacuum daemon, which is enabled by default.

These statistics include things like the most common values in a column and the distribution of values, which are calculated from a sample of the table’s rows and stored in the database (check out the pg_stats view if you’re curious). They’re used when the query planner needs to calculate how many rows from your data will pass a given filter.

What happens when they are bad

As we discussed earlier, the query planner uses row count estimates to choose between different query implementations with very different performance profiles. If those estimates are a long way out, then the query planner can make some bad choices, leaving your query running very slowly indeed.

Fortunately, it’s easy to work out when the planner’s row estimates were inaccurate. Just get a copy of the query plan. For each operation it will list the estimated (“plan”) number of rows, and how many there actually were:

[
  {
    "Plan": {
      "Node Type": "Seq Scan",
      ...
      "Plan Rows": 208,
      "Actual Rows": 40154,
      ...
    },
  ...
  }
]

In pgMustard, we flag up instances where the row count estimates are out by a factor of 10 or more. That doesn’t mean that it will always be catastrophic — the query planner might still have made a sensible decision. But if the row estimates are out by more than an order of magnitude and you can’t work out why your query is taking so long, it might be worth a look.

How not to fix your plan

Sometimes you’ll see people suggesting that you should override the query planner, to get the plan you want. For example, you can force the query planner to use an index when it would otherwise have chosen a sequential scan.

Although it’s possible to achieve a short-term performance improvement by explicitly specifying aspects of the query plan, it can leave your code fragile. The query planner will be less able to adjust to changes in the structure or volume of your data, and may even be unable to take advantage of performance improvements in future versions of Postgres.

I’m not ruling out the idea that it might sometimes be necessary to restrict the query planner’s choices to achieve optimal performance. However, in general it’s more robust to guide the query planner to the right decision by improving its data quality. The query planner was written by some terrifyingly bright people with, as you’d expect, a pretty decent understanding of the ins and outs of Postgres performance. As long as it has the right information, it will usually make the correct decisions.

How to help the query planner out

A good first attempt at improving the planner’s statistics (and therefore its estimates) is to run a VACUUM ANALYZE command. If you want to target a particular table, say table_1, then you can run VACUUM (ANALYZE) table_1, or VACUUM (ANALYZE) to operate on all tables.

This will cause a cleanup of any deleted data (VACUUM) and trigger a recalculation of statistics values (ANALYZE). If the statistics values are out of date, perhaps due to recent writes or deletes, then hopefully the refreshed values will be more accurate.

If recalculating the statistics doesn’t work, then you can try increasing the amount of information gathered— the increased level of detail can lead to better estimates.

You can do this for all tables by increasing the default_statistics_target configuration from its default value of 100. Or you can target a particular column using the ALTER command, something like:

ALTER TABLE table_1 ALTER column_2 SET STATISTICS 400;

Multivariate statistics

Postgres 10 also introduced multivariate statistics. Normally, estimates assume that column values are independent, but naturally this is often not the case. Suppose your query contains something like:

SELECT *
FROM people
WHERE vegetarian = true AND favourite_dip = 'hummus';

Say 14% of people are vegetarian, and 25% of people prefer hummus. Then the query planner would expect 25% of 14% = 3.5% of people to be vegetarians who like hummus. But if vegetarians are significantly more likely to prefer hummus than the population in general, then this number could be much higher.

I reckon a lot more than 25% of vegetarians prefer hummus. Photo by **Ojashri Basnyat**

Multivariate statistics are an attempt to solve this problem. Statistics are gathered on the columns as a group, so that Postgres can understand correlations or relationships between columns.

Multivariate statistics are not computed automatically, but if you suspect that your query plan estimates are poor because of correlations between your columns, you can read more about creating multivariate statistics in Postgres here.