Strategies for Scaling with PostgreSQL

Cory Tucker · Windfall · Jun 23, 2018

At Windfall, our customers rely on our data to help them better understand the affluence profiles and characteristics of the individuals within their own datasets. In order to do this, we must first establish a link between their datasets and our own. This process is known as record linkage.

When we receive data from our customers, it typically ends up in a Postgres database for analysis and for performing the record linkage steps. As we have grown, both in the number and the size of our customers, this Postgres database’s performance began to suffer. All customer data was essentially stored in one massive table, and the links for records that match a Windfall household were stored in another equally massive table. As the tables grew, so did the indexes and everything else that goes with them, to the point where the size of these tables was quickly becoming a burden that affected all customers equally.

This post outlines steps you can take when looking to scale your own database, as well as the solutions we came up with to handle our next phase of growth.

Identifying Problem Areas

When looking to scale any database architecture, you must first know where your problems are. Oftentimes a company will naively scale the database vertically (aka, “throw more hardware at it”), but this usually just delays the inevitable. Of course, I’m also a believer in the philosophy that you generally can’t solve scaling problems more than 10x beyond whatever you currently have.

The first step in identifying your problem areas is to monitor your query performance. You may have some low-hanging fruit that a few simple indexes might solve. You may also be able to redesign your queries in a way that reduces the load on the database. If so, definitely do those things first before looking at something more drastic like a schema redesign. Tools like the awesome Postgres EXPLAIN Visualizer can help you understand your query plans and save a ton of time and resources.
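For example, running a suspect query under EXPLAIN ANALYZE is usually the quickest way to see where the time is going, and the pg_stat_statements extension (if it is installed) surfaces the most expensive queries across the whole server. The sketch below uses hypothetical table and column names, not our actual schema:

    -- Inspect the plan and actual runtime of a suspect query.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM customer_records
    WHERE email = 'someone@example.com';

    -- With the pg_stat_statements extension enabled, find the costliest queries overall.
    SELECT query, calls, total_time, rows
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;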

It may be that you have squeezed all the performance you’re likely to get out of your database without redesigning or re-architecting the way data is stored in your system. Ultimately, the best way to improve query performance is to reduce the operational dataset size — a query that operates on 100,000,000 rows will always perform worse than a query that operates on 100 rows.

Methods of Reducing Operational Dataset Sizes

There are a few ways to go about reducing your operational dataset size (in terms of number of rows as well as disk / memory footprints). Not every approach is applicable in every situation and you definitely need to know your workload. For example, OLTP vs OLAP type operations are very different beasts and generally cannot be solved using the same methods.

Here are a few strategies:

  1. Reduce the total number of rows in your table. This will almost always give you the best performance wins across the board. Unfortunately, in many situations this simply is not possible for an ordinary table (more on this later).
  2. Have fewer columns in your tables. Unless your queries are served by covering indexes, the database will always have to go back to the whole table row on disk in order to complete your query. The more columns you have in your table, the fewer rows will fit into a page on disk, and it’s those pages that are cached in memory.
  3. Get clever with your indexing strategies. Postgres supports a varied array of indexing techniques. However, there is a tradeoff between the number and size of your indexes and query performance. Also, each index you have will impose an added write penalty whenever you modify your table (inserts, updates, and deletes).
  4. Pre-aggregate your data. If your data does not change very frequently, and you also have a lot of (any?) OLAP-style queries, then you should really consider pre-aggregating the data in some form. Whether this is an OLAP cube style transformation, a materialized view, or some other mechanism, the performance gains can be orders of magnitude better than you would see otherwise. A few sketches of the indexing and pre-aggregation ideas above follow this list.
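To make items 2 through 4 a bit more concrete, here are a few sketches. The table and column names (customer_records, record_links, account_id, household_id) are hypothetical stand-ins, not our actual schema:

    -- A multicolumn index that lets a query touching only these two columns
    -- be answered with an index-only scan (a covering index in practice).
    CREATE INDEX idx_links_account_household
        ON record_links (account_id, household_id);

    -- A partial index stays small by covering only the rows a query cares about.
    CREATE INDEX idx_records_unmatched
        ON customer_records (account_id)
        WHERE household_id IS NULL;

    -- Pre-aggregation with a materialized view; refresh it when the source data changes.
    CREATE MATERIALIZED VIEW account_match_counts AS
    SELECT account_id, count(*) AS matched_records
    FROM record_links
    GROUP BY account_id;

    REFRESH MATERIALIZED VIEW account_match_counts;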

Data Partitioning

Data partitioning is a way of creating many smaller sets of data out of the larger total set. In theory you could be explicitly doing this already, but the query management would likely get a little out of control.

Prior to Postgres 10, you could partition your tables using table inheritance and insert rules, but it was fairly gnarly. With the release of Postgres 10, declarative partitioning greatly simplified both the semantics and the maintenance of partitioned tables.

At Windfall, we decided to utilize Postgres 10 declarative partitioning to isolate each customer’s data into its own partition. This allowed us to keep our data models effectively the same (including the queries), but still get the performance wins gained by having much smaller tables to operate on.
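As a rough sketch of what this looks like (the table and column names here are illustrative, not our real schema), the parent table is partitioned by a customer account ID and each customer gets its own partition:

    -- Parent table, partitioned by the customer account that owns each row.
    CREATE TABLE customer_records (
        account_id  bigint      NOT NULL,
        record_id   bigint      NOT NULL,
        attributes  jsonb,
        created_at  timestamptz NOT NULL DEFAULT now()
    ) PARTITION BY LIST (account_id);

    -- One partition per customer account.
    CREATE TABLE customer_records_acct_42
        PARTITION OF customer_records
        FOR VALUES IN (42);

    -- Queries are written against the parent table; a filter on the
    -- partition key lets the planner touch only the matching partition.
    SELECT count(*) FROM customer_records WHERE account_id = 42;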

Besides having smaller tables, and therefore better average performance across all customers, there are several other benefits to having partitioned tables:

  • The queries did not need to change. The parent table could still be queried exactly like before and the query planner’s partition-pruning mechanism would route the query to the single table that is targeted using the partition key.
  • Only the most recent and active customers’ datasets will be in memory at any one time, instead of portions of all customers.
  • Dropping customer data is as easy as dropping tables. The impact of deleting rows from the shared tables and then vacuuming what remains is completely eliminated.
  • It allows for easier isolation of sensitive customer datasets using role-based access control restrictions at the table level. (Both of these are sketched just after this list.)
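For example, locking down or removing a single customer’s partition looks roughly like this, reusing the hypothetical partition name from the sketch above (the role names are also hypothetical):

    -- Sensitive partitions can be restricted with ordinary table-level grants.
    REVOKE ALL ON customer_records_acct_42 FROM analyst_role;
    GRANT SELECT ON customer_records_acct_42 TO trusted_role;

    -- When a customer leaves, their data disappears with a single DDL statement;
    -- no bulk DELETE and no VACUUM of a huge shared table.
    DROP TABLE customer_records_acct_42;

    -- Or detach the partition first if the data needs to be archived before removal:
    -- ALTER TABLE customer_records DETACH PARTITION customer_records_acct_42;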

Declarative Partitioning: Not All Sunshine and Roses

There were, however, a couple of problems that we ran into with the new declarative partitioning implementation.

  1. During benchmarking, we discovered that query planning times would rise almost linearly with the number of partitions we had. Our tests showed that by the time we got to around 3,000 partitions, the time it took just to plan the query was about 9 seconds, and we needed to increase the max_locks_per_transaction setting, otherwise we’d get errors saying we were out of shared memory (the setting change is shown just after this list).
  2. Queries that affected all partitions turned out to be very dangerous. The majority of our queries are focused on a single customer account, where every query gets routed to a single partition based off the account ID as the partition key. These performed great. However, a few of the queries we had for data export and analysis tasks touched the entire dataset. These types of queries may need to be reworked; otherwise, they might crash the server.
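For reference, max_locks_per_transaction is an ordinary server setting (it defaults to 64 and requires a restart to change). The value below is purely illustrative, not the one we settled on:

    -- Raise the lock table size so a query touching many partitions does not
    -- exhaust shared memory; requires a server restart to take effect.
    ALTER SYSTEM SET max_locks_per_transaction = 256;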

It’s my understanding that Postgres 11 will come with a lot of performance and stability improvements targeted specifically at partitioning, so my hope is that these issues will go away with the next major release.
