Architecting databases — PostgreSQL (1)

Gustavo Sanchez
Published in Ubuntu AI

This is a series of blog posts on basic-to-intermediate concepts around system design trade-offs for production-grade SQL database systems, specifically taking PostgreSQL as a concrete example.

Who can benefit from these blogs?

Software developers, solutions architects and anyone interested in quickly skilling up on databases.

Why?

We often spend most of our time thinking about software development, design patterns, DevOps, backend technology stacks, microservices, and so forth. But we forget that most of our systems actually revolve around our databases.

Think about it for a second — what would be the impact on your system and your business if your database suddenly crashed?

Whether we want to achieve high availability with tons of 9s, high performance with low latency, or high durability with close to zero data loss, we need to shift our attention towards our database design and architecture.

How should we approach this?

I will be presenting recurring topics of discussion we have encountered when designing solutions for our customers. For each topic I will explain some base definitions, questions that need to be asked, and design decisions that need to be made, each with its own pros and cons depending on the use case constraints and requirements.

You can read through the full article, or simply jump into the topic you are interested in.

Disclaimer

When we refer to “database architecture” we are not referring to database application design concepts like database schema design, entity-relationship diagrams, normalization, etc. Rather, we are referring to a system design approach to databases, meaning discussing concepts like high availability, replication, backup strategies, table sharding, system configurations and tunables, etc.

PostgreSQL

Before jumping straight into system design considerations, a 10,000-foot overview of PostgreSQL looks like this:

It is loved by developers and enterprises

It is the most used database engine among professional developers according to the Stack Overflow Developer Survey 2023.

One of the reasons it has become so popular is its capacity to install multiple extensions. This allows database administrators to maintain a single database system with a wide variety of features. For instance, if security is a business concern, you can add extensions like pgaudit for session auditing or pgcrypto for data encryption at the column level. Likewise, if you are starting to explore specific use cases, you can add extensions like postgis to enable PostgreSQL as a spatial database, pgvector as a vector database, or timescaledb as a time-series database. Evidently, when scaling these specific use cases to millions of users you might need a highly specialized database (like Chroma or OpenSearch) rather than trying to make PostgreSQL accommodate all your use cases.
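To make this concrete, here is a minimal sketch of how extensions are enabled per database. It assumes the psycopg2 driver, a reachable local server with a hypothetical "app" database, and that the extension packages are already installed on the host; it is an illustration, not a hardening guide.

```python
# Sketch: enabling optional extensions in a database (connection details are hypothetical).
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres host=localhost")
conn.autocommit = True
with conn.cursor() as cur:
    # Each extension only has to be created once per database.
    cur.execute("CREATE EXTENSION IF NOT EXISTS pgcrypto;")  # column-level encryption helpers
    cur.execute("CREATE EXTENSION IF NOT EXISTS pgaudit;")   # auditing; also needs shared_preload_libraries
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")    # pgvector's extension name is 'vector'
    # List what is now installed.
    cur.execute("SELECT extname, extversion FROM pg_extension ORDER BY extname;")
    for name, version in cur.fetchall():
        print(name, version)
conn.close()
```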

It is a complex beast

Databases seem rather simple and compact from the outside; however, as you start zooming in you will find they are actually fairly complex and entail a full-blown list of subcomponents and processes dedicated to different tasks. For instance, you will see processes dedicated to client connections, others dedicated to data replication, and even some dedicated only to database garbage cleanup and maintenance.
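You can see some of these dedicated processes for yourself by asking a running server what backends it has. A minimal sketch, assuming psycopg2 and a reachable local server:

```python
# Sketch: listing the kinds of processes a PostgreSQL server runs (connection details are hypothetical).
import psycopg2

with psycopg2.connect("dbname=postgres user=postgres host=localhost") as conn:
    with conn.cursor() as cur:
        # backend_type distinguishes client backends from the checkpointer, walwriter,
        # autovacuum launcher, WAL senders used for replication, and so on.
        cur.execute("""
            SELECT backend_type, count(*)
            FROM pg_stat_activity
            GROUP BY backend_type
            ORDER BY backend_type;
        """)
        for backend_type, count in cur.fetchall():
            print(f"{backend_type}: {count}")
```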

We will not deep dive into PostgreSQL's inner workings, but for those who wish to learn more I will refer you to this low-level, detailed article: PostgreSQL Process Architecture by Hussein Nasser (Medium).

It is highly customizable and tunable

Now, each of these subcomponents within PostgreSQL can be configured and tuned depending on its actual usage. Questions arise like: How much RAM should I allocate to each process? How many pages should I store in my disk buffers? Should I enable huge pages? How many disks do I need? How much performance out of my disks do I actually need? Which filesystem type should I use? And many, many more…

Some of these you can try to answer yourselves following configuration guides like https://postgresqlco.nf/

However, it actually takes experience and proper system monitoring to get these tunables right.
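A good first step is simply to inspect what the current values are and where they came from before touching anything. A minimal sketch, assuming psycopg2; the parameters listed are just common examples:

```python
# Sketch: inspecting a few common tunables and where they were set (parameter choice is illustrative).
import psycopg2

PARAMS = ("shared_buffers", "work_mem", "effective_cache_size",
          "huge_pages", "max_connections", "wal_level")

with psycopg2.connect("dbname=postgres user=postgres host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT name, setting, unit, source FROM pg_settings WHERE name IN %s ORDER BY name;",
            (PARAMS,),  # psycopg2 adapts the tuple for the IN clause
        )
        for name, setting, unit, source in cur.fetchall():
            print(f"{name} = {setting}{unit or ''} (set from: {source})")
```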

High Availability

Scaling out

A first consideration for high availability is to scale the system out. That way, if one instance fails, you have at least another one in place to take over the work. This is called failover.

Having a secondary machine in case the first one fails sounds great. However, default failover mechanisms in PostgreSQL are manual and require intervention from an operator, meaning that even though the second machine might have all the required data in it, it needs “a little push” to take over. This creates system downtime while the second machine gets fully ready.

Auto-Failover

To reduce this downtime we can add extra tooling for consensus management and heartbeat communication, like Patroni, repmgr, Pgpool-II, or Pacemaker with PAF, which will detect machine failures and manage failover automatically rather than manually.
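These tools do far more than this, but the core idea can be illustrated with a toy health check and promotion call. This is only a sketch: the hosts are hypothetical, the logic ignores quorum and fencing, and production setups should rely on Patroni or similar rather than hand-rolled scripts.

```python
# Toy illustration of what auto-failover tooling automates: detect a dead primary,
# then promote a healthy replica. Use Patroni/repmgr in real deployments.
import psycopg2

NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical cluster members

def is_primary(host):
    """Return True for a primary, False for a replica, None if unreachable."""
    try:
        with psycopg2.connect(host=host, dbname="postgres", user="postgres",
                              connect_timeout=2) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT NOT pg_is_in_recovery();")
                return cur.fetchone()[0]
    except psycopg2.OperationalError:
        return None

states = {host: is_primary(host) for host in NODES}
if True not in states.values():
    # No reachable primary: promote the first healthy replica (toy decision logic).
    candidate = next(h for h, s in states.items() if s is False)
    with psycopg2.connect(host=candidate, dbname="postgres", user="postgres") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_promote();")  # available since PostgreSQL 12
```

Real tools add quorum decisions, fencing of the old primary, and automatic reconfiguration of the remaining replicas, which is exactly why you should not roll this yourself.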

Split Brain

You might think that only one extra machine (two machines in total) should be good enough. However, you can end up in an undesired situation called split brain.

In a distributed system, the network can also fail, and with only two instances it is hard for instance “A” to differentiate between instance “B” failing and the network failing. You can thus get into a state where each instance believes the other one failed, creating data inconsistency problems.

Cluster of 3 instances

The simplest way to avoid this in distributed systems is to have a cluster of at least three instances in total. This way, if one instance fails, you have confirmation from at least one other instance that it has indeed failed and it is not just the network flapping.

Cluster with Witness

A slightly more complex way to avoid split brain, but beneficial in reducing the total amount of resources needed, is to use a “witness” machine. You still need three machines in total; however, the witness machine does not need to run a database service or copy data over, therefore requiring less disk space, CPU and RAM. It is only there to take part in consensus decisions about which machine is considered down.

Primary and Replicas

Now, an important limitation of PostgreSQL clustering is that only one instance at a time can write information into the system (the primary), while the rest can only be used to read information from (the replicas).

This was intentionally designed this way to prioritize consistency over availability in terms of the CAP theorem.

Thus:

  • Only primary instance can write (INSERT, UPDATE, DELETE)
  • Replicas are read only (SELECT)
  • Replication is unidirectional
  • Scaling out enhances read performance
  • Scaling up enhances write performance

If the primary instance fails, auto-failover tools will promote one of the replicas to become the new primary instance. When the failed machine is recovered, it will come back into the cluster as a replica.

Load Balancing

Assuming we have a cluster of multiple instances, how will a client application know which IP to point to?

For this we use load balancing tools like PgBouncer, HAProxy or Pgpool-II, so that the client points to a single IP, and so that read queries are smartly distributed across the replicas while write queries are correctly redirected to the primary instance.

To avoid creating a single point of failure (SPOF) by introducing a load balancer, the load balancer should be highly available on its own.
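If you do not run a dedicated load balancer, the same read/write split can be approximated in application code. The sketch below uses hypothetical hosts and an illustrative "events" table; HAProxy or Pgpool-II handle this far more robustly, including health checks and failover awareness.

```python
# Simplified read/write routing sketch (hypothetical hosts and table).
import random
import psycopg2

PRIMARY = "10.0.0.1"
REPLICAS = ["10.0.0.2", "10.0.0.3"]

def connect(readonly: bool):
    # Reads can go to any replica; writes must always hit the primary.
    host = random.choice(REPLICAS) if readonly else PRIMARY
    return psycopg2.connect(host=host, dbname="app", user="app")

# Writes go to the primary...
with connect(readonly=False) as conn, conn.cursor() as cur:
    cur.execute("INSERT INTO events (payload) VALUES (%s)", ("hello",))

# ...while reads are spread across the replicas.
with connect(readonly=True) as conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM events")
    print(cur.fetchone()[0])
```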

Connection Pooling

Another interesting tool we can add to our toolbox is a connection pooler like PgBouncer or Pgpool-II (correct, you have seen these before; this is because they provide multiple features for high availability, not just one).

The main benefit of adding one to your design is reducing the connection overhead created by opening a new connection for each SQL query coming from your client. This can significantly increase the transactions per second (TPS) your database can perform.
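Pooling can also be done on the client side; for example, psycopg2 ships a simple built-in pool. A minimal sketch with hypothetical connection details follows; a server-side pooler like PgBouncer does the same job for every client at once, which scales better.

```python
# Sketch: client-side connection pooling with psycopg2's built-in pool.
from psycopg2 import pool

db_pool = pool.SimpleConnectionPool(
    minconn=1, maxconn=10,
    dbname="app", user="app", host="localhost",  # hypothetical connection details
)

conn = db_pool.getconn()      # reuse an already-open connection
try:
    with conn.cursor() as cur:
        cur.execute("SELECT now()")
        print(cur.fetchone()[0])
finally:
    db_pool.putconn(conn)     # return it to the pool instead of closing it
db_pool.closeall()
```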

If you would like to dive deeper into this concept, here is a short video tutorial on how to manually set it up and compare the performance with and without a connection pool.

Replication

Another important factor when designing a database solution is data durability, which can be enhanced through replication.

We briefly touched on this topic while discussing “Primary and Replicas”; however, there are different ways to replicate data from one instance to another.

We can group replication approaches as follows:

Transfer method

This refers to the format in which data is transferred during replication.

Streaming replication transfers, bit by bit, the information written to disk on the primary over to the replica.

  • Pros: Data will be identical
  • Cons: Requires both primary and replica to use the same PostgreSQL version.

Logical replication transfers data as SQL statements (see the setup sketch after this list).

  • Pros: You can use different PostgreSQL versions for primary and replicas, thus being more flexible to perform upgrades and migrations.
  • Pros: Less bandwidth is required
  • Pros: Replication can be scoped. For instance, just replicate some databases or some tables, or maybe just replicate INSERTs and UPDATEs but ignore DELETEs.
  • Cons: It does not replicate DDL commands (CREATE TABLE, ALTER TABLE ..)
  • Cons: It does not replicate schema objects like sequences, indexes or views.
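For illustration, logical replication is configured with a publication on the source and a subscription on the target. The sketch below uses hypothetical hosts, credentials and a "payments" table, and assumes wal_level = logical on the publisher and that the table already exists on both sides.

```python
# Sketch: setting up scoped logical replication for a single table (hypothetical details).
import psycopg2

publisher = psycopg2.connect(host="10.0.0.1", dbname="app", user="postgres")
publisher.autocommit = True
with publisher.cursor() as cur:
    # Scope the publication: only this table, and only INSERTs and UPDATEs.
    cur.execute("""
        CREATE PUBLICATION payments_pub FOR TABLE payments
        WITH (publish = 'insert, update');
    """)

subscriber = psycopg2.connect(host="10.0.0.2", dbname="app", user="postgres")
subscriber.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction block
with subscriber.cursor() as cur:
    cur.execute("""
        CREATE SUBSCRIPTION payments_sub
        CONNECTION 'host=10.0.0.1 dbname=app user=replicator password=secret'
        PUBLICATION payments_pub;
    """)
```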

Durability modes

This refers to how long the primary instance will wait until it considers the data copied over to the replica.

Async replication will not even wait for a connection to be made from the primary to the replica; the primary will simply write data into its WAL and then continue with other tasks. Data will be transferred later in the background.

  • Pros: Lower latency on writes
  • Cons: Higher risk of data loss
  • Commonly used for cross region replication
  • Supports N-1 failures in the cluster

Sync replication forces the primary instance to wait until the data has reached the replica in some form. Data is transferred in the foreground, so the primary waits longer for each SQL statement to be processed before processing the next one.

This type of replication actually has some subtypes; you can see them as a spectrum of how deeply the data is processed and stored on the replica:

  • sync / remote_write: Primary waits until the replica receives information on the WAL receiver.
  • sync / on: This is the default sync subtype. Here the primary waits a bit more, until the replica stores the information on its WAL disk.
  • sync / remote_apply: Here the primary waits the longest, until the data is not just stored on the replica's disk but also visible to SELECT queries made on the replica.

The longer it is configured to wait:

  • Cons: The higher latency on writes
  • Pros: The less the risk of data loss
  • Commonly used for replication within the same region
  • Supports N-2 failures; after that, the cluster turns into read-only mode

Hybrid sync-async replication refers to executing sync or async replication depending on the data values. For example, having a table of payments and replicating every row by default through async replication to get its performance benefits, but if a payment is over $1000, replicating that row through sync replication to take advantage of its lower risk of data loss. A sketch of this per-transaction approach follows below.
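In PostgreSQL, one way to implement this pattern is to keep synchronous_commit relaxed by default and override it per transaction. The sketch assumes a synchronous standby is already configured (via synchronous_standby_names) and uses a hypothetical payments table mirroring the example above.

```python
# Sketch of hybrid durability: asynchronous by default, synchronous for "important" rows.
import psycopg2

def record_payment(conn, amount: float) -> None:
    with conn.cursor() as cur:
        if amount > 1000:
            # Wait until the standby has applied the change before the commit returns.
            cur.execute("SET LOCAL synchronous_commit TO remote_apply;")
        else:
            # Do not wait for the standby: lower latency, higher loss risk.
            cur.execute("SET LOCAL synchronous_commit TO off;")
        cur.execute("INSERT INTO payments (amount) VALUES (%s)", (amount,))
    conn.commit()  # SET LOCAL only lasts until this commit

with psycopg2.connect(host="10.0.0.1", dbname="app", user="app") as conn:
    record_payment(conn, 49.99)    # replicated asynchronously
    record_payment(conn, 2500.00)  # waits for the synchronous standby
```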

Other replication techniques

Delayed replication refers to waiting a certain amount of time after data has been copied over to a replica but before actually committing it there. This is a useful technique to fix human-error problems like an accidental DROP TABLE, or to quickly recover from a cybersecurity attack without resorting to a backup.

How much should you delay?

You will need to answer this first: How much time does your database operations team need to detect an incident?

Some teams might work on a 24/5 schedule in only one time zone. This means that if an incident happens late on Friday night (like a DROP DATABASE), it will not be detected until Monday morning. So you need roughly at least 72 hours for the database operations team to detect the issue and skip committing the offending SQL statement on the replica.

A trade-off to consider is that the more you delay committing, the more time your team will have to detect an incident, but also the more storage you will need to hold the pending data. For example, if you generate 1 MB/s of changes from your primary to your replica and you delay committing data for 72 hours, you will need around 250 GB of extra storage on your instances.
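The arithmetic is worth keeping handy when sizing a delayed replica. A minimal sketch with the numbers from the example above (the rate and delay are illustrative, and the setting mentioned in the comment is the standby-side knob that controls the delay):

```python
# Back-of-the-envelope sizing for a delayed replica (illustrative numbers).
def delayed_replica_storage_gb(wal_rate_mb_per_s: float, delay_hours: float) -> float:
    """Extra WAL the delayed replica must retain before applying it."""
    return wal_rate_mb_per_s * delay_hours * 3600 / 1024

if __name__ == "__main__":
    # 1 MB/s of changes delayed by 72 hours -> roughly 250 GB of extra storage.
    print(f"{delayed_replica_storage_gb(1, 72):.0f} GB")
    # On the delayed standby itself, the delay is configured with, for example:
    #   recovery_min_apply_delay = '72h'   (in postgresql.conf on the standby)
```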

Not all replicas need to be delayed. You can configure one replica to stay in close to real-time sync with the primary, so if the primary fails there is somewhere to quickly fail over to with the latest information committed into the system, while another replica holds delayed data to account for human-error incidents.

Cascade replication refers to changing the default data flow of replicating from the primary directly to every replica. The default is usually fine when having a cluster of only three instances; however, the load on the primary, as well as its required network bandwidth, will increase as more replicas are added into the cluster.

There are many ways to do cascade replication; the main goal is to offload some of the pressure from the primary instance and give it to another replica.

Bi-directional replication means enabling a cluster to have multiple primary instances that can write data. It can increase write performance, since it is no longer just one instance that can write data, at the cost of data consistency. There are some proprietary PostgreSQL offerings with this feature; however, it is still a fairly dangerous configuration to run with open-source tooling because of the complex extra layer of concurrent write management on top of PostgreSQL's consistency-oriented design.

If performance actually becomes a bottleneck for your database, there are other approaches that can be taken, like scaling up, table partitioning and sharding, etc., which we will discuss in another blog post of this series.

Get started with PostgreSQL

If you want to quickly get started with PostgreSQL in a highly available configuration without the hassle of setting up replication, load balancing and such on your own, but rather have it “auto-magically” done for you, you can get started with our Charmed PostgreSQL documentation and deploy it anywhere: AWS, GCP, Kubernetes, bare metal, etc.

If you would like to have a custom PostgreSQL or any other Data Platform solution with enterprise support from the publishers of Ubuntu OS, you can contact us through https://canonical.com/data
