An introduction to optimizing PostgreSQL Databases

David Varga
Machine Learning Reply DACH
8 min read · Sep 8, 2022

An enterprise-ready approach

[Image: elephant silhouettes — Photo by Jordan Heinrichs on Unsplash]

In the following few lines I attempt to give an overview of steps that can help with optimizing any PostgreSQL database, including AWS Aurora with the PostgreSQL engine, without the use of external tools.

This post has two major sections. The first focuses on investigating structural issues that lead to suboptimal query execution. In certain cases, though, it can make sense to challenge the data access pattern or the query itself. This is especially true in big enterprises, where multiple stakeholders have a vast number of use cases. In these environments it is also common that obsolete (and duplicated) data stays in the database, taking space and other resources away from the data that is actually used and from the automatic housekeeping processes.

The second major section touches on a few of the many configuration options through which PostgreSQL provides the flexibility to accommodate a diverse set of use cases.

1 Investigating structural issues

I assume the dear reader ended up on this article because the database at hand could or should perform better. I can also imagine users of the database complaining about slow execution times.

First things first, one needs to identify bottlenecks, and a working definition of “slow” comes in handy here, as every database is different: every load profile is different, and databases also vary in data volume. Nonetheless, there are three standard approaches: querying system tables for clues, using the built-in logging capabilities, and the pg_stat_statements extension.

1.1 Identifying slow queries

A best-practice approach is to start with the low-hanging fruit. Do not try to optimize what is not even a problem; the issues that cause sluggish performance need to be tackled first. To do that, one can start by utilizing certain logging functionality of PostgreSQL.

1.1.1 log_min_duration_statement and co

There are many different configuration parameters available, but we primarily focus on `log_min_duration_statement` (https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-MIN-DURATION) and the `auto_explain` ones.

Database administrators should be reluctant to increase logging verbosity on production databases, as it quickly starts filling up log files. Host machines could then easily run out of temporary storage space, causing database restarts.

Hence, there is the option to log only statements that take longer than a specified threshold. One can govern this threshold with `log_min_duration_statement`. If there is a huge volume of such queries, it might be a better idea to set `log_min_duration_sample` instead, so that PostgreSQL logs only a sample of those queries rather than every single one.
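
A minimal sketch of what this could look like (the thresholds are purely illustrative, the sampling parameters require PostgreSQL 13 or newer, and on managed services such as Aurora the same parameters are typically set via the parameter group instead):

-- Log every statement that runs longer than 500 ms (value is illustrative).
ALTER SYSTEM SET log_min_duration_statement = '500ms';
-- Alternatively, log only a fraction of the statements above a lower threshold.
ALTER SYSTEM SET log_min_duration_sample = '100ms';
ALTER SYSTEM SET log_statement_sample_rate = 0.1;
-- Reload the configuration so the changes take effect without a restart.
SELECT pg_reload_conf();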

Once there is a list of slow queries, the optimization can start by investigating each query. `EXPLAIN ANALYZE [QUERY]` is your friend here: it provides priceless insights into the query plan and displays the problematic parts. More about those in sections 1.3, 1.4, and 1.5.
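
For illustration, a call could look like the sketch below, with a hypothetical table and filter; keep in mind that `EXPLAIN ANALYZE` actually executes the statement, so data-modifying queries should be wrapped in a transaction that is rolled back:

-- The plan output shows actual row counts, timings, and whether sequential
-- or index scans were used; BUFFERS adds shared-buffer hit/read statistics.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.order_date
FROM orders o
WHERE o.customer_id = 42;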

1.1.2 auto_explain

Dissecting slow queries can be automated. The `auto_explain` module provides two additional configuration settings of interest, namely `auto_explain.log_min_duration` and `auto_explain.log_analyze`. With the first, one can set a minimum duration above which statements are automatically ‘explained’ in the log files, while setting the latter to `TRUE` changes the standard explain output to the extended `EXPLAIN ANALYZE` output. More information can be found here: https://www.postgresql.org/docs/current/auto-explain.html.
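
A minimal sketch of switching it on for the current session only (for instance-wide use the module would typically go into `shared_preload_libraries` or `session_preload_libraries`; the threshold is illustrative):

-- Load the module for this session and explain everything slower than 250 ms.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '250ms';
SET auto_explain.log_analyze = true;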

1.1.3 pg_stat_statements

This extension is probably one of the best to have in your toolbox. The information available once the module is enabled (check that it is included in the `shared_preload_libraries` configuration parameter) provides immensely useful details about what your database is actually working on. By running the following simple query, you can identify the top 10 queries causing the most load. They are not necessarily the slowest queries, but the ones where the number of execution calls multiplied by the average execution time is the highest.

SELECT
    queryid,
    left(query, 20) AS query,
    round(total_time::numeric / 3600000) AS total_time_spent_in_h,
    round(total_time::numeric / 60000) AS total_time_spent_in_min,
    calls AS no_calls,
    round(mean_time::numeric, 2) AS avg_time_spent_in_ms
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

To get the full query instead of only the first 20 characters, just change `left(query, 20)` to `query`. Please note that the module aggregates statistics for queries with the same structure but variable parts (e.g. a varying condition on the same attribute in the `WHERE` clause). Also note that starting with PostgreSQL 13 the `total_time` and `mean_time` columns are called `total_exec_time` and `mean_exec_time`.

As soon as you have the list of queries, you can start analyzing and optimizing them. Please note that even a small improvement can have massive overall performance implications if the given query is executed many times.
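
For completeness, a minimal sketch of making the view available in a database (assuming `pg_stat_statements` is already listed in `shared_preload_libraries` and the server has been restarted; on Aurora this is configured via the parameter group):

-- Create the extension objects in the current database.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Optionally start the measurement window from scratch.
SELECT pg_stat_statements_reset();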

1.2 Querying system tables

One might wonder whether there is a straightforward way to identify tables that are accessed primarily by sequential scans rather than index scans. And yes, there is one:

SELECT
    schemaname,
    relname,
    seq_tup_read,
    seq_scan,
    seq_tup_read / seq_scan AS avg_seq_scan_size
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY 5 DESC;

The query above sorts all user tables in descending order by the average size of the sequential scans happening on them. This hints at tables where data access happens in a sub-optimal way; the tables at the top definitely need a closer look.
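
To put those numbers into perspective, the same view also records how often indices were used on each table, so a rough sketch of an index-usage ratio could look like this:

-- Share of index scans per table; low values on large, frequently read
-- tables are candidates for new indices. idx_scan is NULL for tables
-- without any index, hence the coalesce.
SELECT
    schemaname,
    relname,
    seq_scan,
    coalesce(idx_scan, 0) AS idx_scan,
    round(coalesce(idx_scan, 0)::numeric
          / (coalesce(idx_scan, 0) + seq_scan), 2) AS idx_scan_ratio
FROM pg_stat_user_tables
WHERE coalesce(idx_scan, 0) + seq_scan > 0
ORDER BY idx_scan_ratio;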

1.3 Missing primary key indices

Continuing from the previous section, if tables are accessed with sequential scans, then there is a good chance adding indices can help improve the performance. So let us find tables that don’t have a primary key index defined:

SELECT
    t.table_schema,
    t.table_name
FROM information_schema.tables t
WHERE t.table_type = 'BASE TABLE'
  AND t.table_schema NOT IN ('information_schema', 'pg_catalog')
  AND NOT EXISTS (
      SELECT kcu.table_name
      FROM information_schema.key_column_usage kcu
      WHERE kcu.table_schema = t.table_schema
        AND kcu.table_name = t.table_name)
ORDER BY t.table_schema, t.table_name;

By utilizing the command above (https://www.postgresonline.com/journal/archives/65-How-to-determine-which-tables-are-missing-indexes.html), one gets all tables that are not part of the `information_schema` or `pg_catalog` schemas and at the same time have no entry in the `key_column_usage` system view, i.e. no primary key, unique, or foreign key columns at all. Obviously, this command can easily be modified to limit it to certain schemas or even just a list of tables.
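
Once such a table is identified, adding a key is usually straightforward; the table and column names below are purely illustrative:

-- Adding a primary key implicitly creates a unique index on the column.
ALTER TABLE customer_orders
    ADD CONSTRAINT customer_orders_pkey PRIMARY KEY (order_id);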

1.4 Missing foreign key indices

After identifying tables without primary key indices, it usually makes sense to also check foreign keys, if `JOIN`s are a common development pattern in the organization. Creating indices for the columns used to build datasets can deliver amazing performance improvements. Also, in connection with `DELETE` and `UPDATE` operations, checking for referential integrity using indices makes a big difference.
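
Unlike primary and unique keys, PostgreSQL does not automatically index the referencing columns of a foreign key, so these indices usually have to be created by hand; a sketch with made-up names:

-- Index the referencing column so JOINs and the referential-integrity checks
-- triggered by DELETE/UPDATE on the parent table can use an index scan.
-- CONCURRENTLY avoids blocking writes, but cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS customer_orders_customer_id_idx
    ON customer_orders (customer_id);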

1.5 JOINs, CTEs, over fetching, partitioning

This part of the article mentions a few more general topics to watch out for that can cause sluggish database performance.

Queries with multiple `JOIN`s should be investigated with care. Reviewing the order in which tables are joined and checking the purpose of each join (is it only there to filter?) is crucial. Always aim for combining smaller datasets, and try to execute operations only on data that will not be thrown away afterwards. There is no reason to over-fetch: read only the necessary attributes. If only a few attributes are needed, it might even make sense to create a dedicated index covering them, so that the table itself does not need to be touched at all.
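
As an illustration of that last point, a covering index (available since PostgreSQL 11) can serve a narrow query entirely from the index via an index-only scan; the names below are hypothetical:

-- SELECT * forces reading the whole row; listing only the needed columns
-- combined with a covering index can enable an index-only scan.
CREATE INDEX customer_orders_customer_id_date_idx
    ON customer_orders (customer_id) INCLUDE (order_date);

SELECT customer_id, order_date
FROM customer_orders
WHERE customer_id = 42;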

By understanding the database structure and the related access patterns, big tables are worth being prepared by utilizing `partitioning`. Please note that a regular table cannot be converted into a partitioned one in place; partitioning has to be set up when the table is created. The topic of `partitioning` is big enough to fill multiple posts, so interested readers can dive deeper here: https://www.postgresql.org/docs/current/ddl-partitioning.html.
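
A minimal sketch of a declaratively range-partitioned table (the names and ranges are made up):

-- Range partitioning by timestamp; queries filtering on measured_at
-- only need to touch the relevant partitions (partition pruning).
CREATE TABLE measurements (
    sensor_id   int         NOT NULL,
    measured_at timestamptz NOT NULL,
    value       numeric
) PARTITION BY RANGE (measured_at);

CREATE TABLE measurements_2022_09 PARTITION OF measurements
    FOR VALUES FROM ('2022-09-01') TO ('2022-10-01');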

CTEs gained a bad reputation in the past for the dire performance caused by materializing each step. Nonetheless, starting with version 12, if a CTE is not recursive, has no side effects, and is referenced only once, it is automatically inlined (https://www.postgresql.org/docs/release/12.0/) to forgo materialization (this behavior can be overridden).
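
The override is done with the MATERIALIZED / NOT MATERIALIZED keywords; a small sketch with a hypothetical table:

-- Force the pre-12 behavior: compute the CTE once and reuse the result.
WITH recent_orders AS MATERIALIZED (
    SELECT *
    FROM customer_orders
    WHERE order_date > now() - interval '7 days'
)
SELECT count(*)
FROM recent_orders
WHERE customer_id = 42;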

1.6 Removing obsolete tables, indices

Dropping data that is not needed anymore yields multiple positive outcomes. First of all, no resources are spent on its occasional maintenance. Secondly, you regain storage space that can be used for meaningful data, and/or you can downscale the infrastructure to reduce your costs and consequently lower your CO2 footprint. Generally speaking, companies should start focusing on cleaning up data trash also from a sustainability perspective.
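
A good starting point for such a cleanup is to look for indices that have never been used since the statistics were last reset; be careful with indices that back primary key or unique constraints, and keep in mind that usage statistics are tracked per instance:

-- Indices with zero recorded scans, ordered by the space they occupy.
SELECT
    schemaname,
    relname,
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;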

2 Investigating configuration

The default configuration settings of PostgreSQL are defined to accommodate most usage patterns and to provide a solid foundation as a starting point. However, as every database is used in its own specific way, it is worth investigating individual configuration settings and whether tweaking them would yield performance improvements.

In the following two short sections, I attempt to provide a glimpse into a tiny fraction of such parameters that should be looked at first.

2.1 Vacuuming and analyzing

Vacuuming and analyzing are extremely important PostgreSQL housekeeping and maintenance jobs, responsible for garbage collection and for keeping tables, indices, and planner statistics in an up-to-date state. They have a direct influence on query planner efficiency. The default settings are, again, a good starting point; however, as workloads and access patterns differ from database to database, one might need to tweak the frequency of these jobs. Generally speaking, if vacuuming and analyzing is too painful, then it is not done frequently enough.
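
For large, frequently updated tables it is often enough to tighten the autovacuum thresholds per table instead of changing the global defaults; the table name and values below are only an illustration:

-- Trigger autovacuum after roughly 1% of the rows changed instead of the default 20%.
ALTER TABLE customer_orders SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.02
);
-- A manual run is always possible, e.g. after a large bulk load.
VACUUM (ANALYZE, VERBOSE) customer_orders;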

2.2 work_mem

The topic of the `work_mem` setting is quite a delicate one, so here is a snippet from the official documentation with the definition:

Sets the base maximum amount of memory to be used by a query operation (such as a sort or hash table) before writing to temporary disk files. If this value is specified without units, it is taken as kilobytes. The default value is four megabytes (4MB). Note that for a complex query, several sort or hash operations might be running in parallel; each operation will generally be allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. (https://www.postgresql.org/docs/14/runtime-config-resource.html#GUC-WORK-MEM)

To put it more simply, this is the per-operation limit after which the database starts writing to temporary files on disk instead of using memory. However, multiple operations are conducted in parallel, especially for more complex queries, hence a multiple of the specified amount needs to be available to the database. So adjust this setting with care and understand in depth the queries that are getting executed.
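
Instead of raising the global default, `work_mem` can also be increased only for the session or transaction that actually runs the heavy query; whether a sort spilled to disk is visible in the `EXPLAIN ANALYZE` output (e.g. “Sort Method: external merge Disk: …”). The values below are illustrative:

-- Raise the limit only for this session.
SET work_mem = '64MB';
-- Or scope it to a single transaction.
BEGIN;
SET LOCAL work_mem = '128MB';
-- ... run the expensive reporting query here ...
COMMIT;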

Conclusion

Optimizing a PostgreSQL database is a relatively complex topic with quite a few pitfalls, some of which I hope to have spared you. There is also no one-size-fits-all solution, as every database might have a unique use case. In this article, a few general points were outlined that are worth taking a look at, especially for databases that have grown over time in structure as well as size. I wish everyone happy inefficiency hunting!
