Postgres: Not Your Grandfather’s RDBMS
Over the last two decades, Postgres has evolved into one of the most powerful, flexible, and popular database systems in existence. On the surface it is strictly a relational system, yet, unbeknownst to most, it achieves near parity with NoSQL stores thanks to support for unstructured data via indexable JSON, alongside query parallelism, data partitioning, streaming replication, triggers, full-text search, and async pub/sub notifications. In addition, courtesy of foreign data wrappers, you can directly query data in other sources, including other RDBMSs (MySQL, Oracle, etc.), NoSQL stores (Cassandra, HBase, etc.), files (CSV, XML, etc.), and even niche sources like Twitter and git. All these features, coupled with its robust optimiser and planner, enterprise deployments, and strong SQL standards compliance, prompted DB-Engines to declare Postgres the DBMS of the Year in 2017.
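As a quick taste of the NoSQL side, here is a minimal sketch of storing and indexing semi-structured data with JSONB; the table and column names are illustrative:

```sql
-- Store semi-structured documents in a JSONB column.
CREATE TABLE events (
    id      bigserial PRIMARY KEY,
    payload jsonb NOT NULL
);

-- A GIN index makes JSON containment queries fast.
CREATE INDEX events_payload_idx ON events USING gin (payload jsonb_path_ops);

-- Find all events whose payload contains {"type": "click"}.
SELECT id FROM events WHERE payload @> '{"type": "click"}';
```

The `jsonb_path_ops` operator class trades generality (it only supports the `@>` containment operator) for a smaller, faster index than the default GIN class.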
While all of the features mentioned above can stand on their own, Postgres also has a number of third-party extensions and packages that help put it on steroids. These range from adding support for time series data to allowing columnar storage. In this article, we will look at 10 extensions and how they can help you soup up your applications using Postgres.
Time series data is generated by a plethora of sources, including IoT sensors, Web 2.0 applications, and scientific experiments. Vanilla relational stores have a tough time dealing with the time-varying nature, high ingest rates, and temporal querying required by these use-cases. Enter TimescaleDB. TimescaleDB adds support for time series data through the concept of a hypertable, which simply chunks and stores the data by time in standard tables. From the user's perspective, a hypertable behaves just like any other table, which gives users access to all standard Postgres features, including referential integrity via foreign keys. Users can even perform joins between a hypertable and a normal relational table.
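A minimal TimescaleDB sketch, assuming the extension is installed; the table and column names are illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE conditions (
    time        timestamptz NOT NULL,
    device_id   text,
    temperature double precision
);

-- Turn the plain table into a hypertable, chunked by the time column.
SELECT create_hypertable('conditions', 'time');

-- Inserts and queries then work exactly as on a normal table.
INSERT INTO conditions VALUES (now(), 'sensor-1', 21.4);

SELECT device_id, avg(temperature)
FROM conditions
WHERE time > now() - interval '1 day'
GROUP BY device_id;
```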
Databases ofttimes require operations that need to be triggered on a schedule: VACUUMing a table, archiving data, killing idle transactions, or rolling data up periodically, for instance. Traditionally, Linux cron in a single-node environment, or the likes of Metronome/Chronos in Mesos or CronJob in Kubernetes in a distributed environment, are employed to this end. All of these solutions add an extra layer of complexity and external dependencies to an otherwise simple task. pg_cron obviates the need for an external scheduler by enabling cron-based queries within the database itself, using the familiar Linux cron format.
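A sketch of pg_cron usage, assuming the extension is installed and loaded via `shared_preload_libraries`; the table names are illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Vacuum a table every night at 03:00 (standard cron syntax).
SELECT cron.schedule('0 3 * * *', 'VACUUM my_table');

-- Delete week-old rows every Saturday at 03:30.
SELECT cron.schedule(
    '30 3 * * 6',
    $$DELETE FROM events WHERE event_time < now() - interval '1 week'$$
);

-- cron.schedule returns a job id, which can later be passed
-- to cron.unschedule to remove the job.
```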
A large class of applications are streaming in nature with real-time requirements. These are enabled by stream processing systems like Spark Streaming, Flink, Akka Streams, and their ilk. For instance, you can do streaming aggregations in Flink and then periodically spit out the aggregated data to a database table for downstream consumption. This requires an entire pipeline of different systems, each of which brings its own latency and operational complexity. Postgres, on the other hand, like most databases, is batch-oriented, i.e. data needs to be inserted or loaded before it can be queried. PipelineDB adds streaming capabilities (including windowing semantics) to Postgres through continuous SQL queries. These queries can be used to roll up data in the form of an incrementally updated materialised view.
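A sketch using the continuous-view syntax from the standalone PipelineDB releases; the exact DDL has changed across versions, so treat the following as illustrative:

```sql
-- Raw events arrive in a stream rather than a table.
CREATE STREAM page_views (url text, latency int);

-- A continuous view is an incrementally updated materialised view
-- over the stream; only the aggregate state is stored, never the
-- raw events themselves.
CREATE CONTINUOUS VIEW view_counts AS
    SELECT url, count(*) AS hits, avg(latency) AS avg_latency
    FROM page_views
    GROUP BY url;

-- Writing to the stream updates the view incrementally.
INSERT INTO page_views (url, latency) VALUES ('/home', 12);

SELECT * FROM view_counts;
```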
Production environments need to be highly available, while Postgres itself is a single-server solution. Postgres does have support for various replication options; replication keeps the data between multiple replicas in sync, but the non-trivial tasks of leader election and client proxying are left to the application code. Stolon is an opinionated, one stop shop HA solution that relies on streaming replication (for data consistency), an external key/value store such as Consul or etcd (for leader election), and three custom components (keeper, sentinel, and proxy) for failure detection and client proxying.
Solutions like Stolon provide HA but not scalability, because they rely on master/slave replication: every replica holds a full copy of the data, so the only option is to scale vertically, which becomes untenable beyond a certain point. In addition, queries do not take advantage of distributed parallelism. Citus comes to the rescue here, as it enables automatic sharding and replication atop standard Postgres. In Citus, one Postgres instance in the cluster is designated the master node while the rest act as workers. The master is in charge of metadata and query distribution. Client interaction always takes place through the master, which distributes sharded queries across the workers. Citus can also withstand worker failure thanks to its sharding and replication model.
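A minimal Citus sketch, run on the master node with workers already added to the cluster; the table and column names are illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS citus;

CREATE TABLE clicks (
    user_id bigint,
    url     text,
    clicked timestamptz DEFAULT now()
);

-- Shard the table across the worker nodes by user_id.
SELECT create_distributed_table('clicks', 'user_id');

-- Inserts and queries go through the master, which routes them
-- to the shard(s) owning the relevant user_id values.
INSERT INTO clicks (user_id, url) VALUES (42, '/pricing');
SELECT count(*) FROM clicks WHERE user_id = 42;
```

Choosing a good distribution column matters: queries that filter on it can be routed to a single shard, while others fan out across all workers.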
Postgres is a row-oriented data store, which means that rows are physically stored one after the other in sequence. As a result, Postgres in general is great for write-heavy, transactional workloads. Column-oriented databases, on the other hand, store columns together. This enables them to optimise for efficient aggregation, wherein only a subset of the columns needs to be accessed. Column stores also have better compression ratios, since all values in a column share the same data type. cstore_fdw adds column-oriented tables to Postgres. Based on the Optimized Row Columnar (ORC) format, cstore_fdw achieves excellent compression rates and loads only the columns relevant to a particular query, without requiring any explicit indexes.
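A cstore_fdw sketch, assuming the extension is installed; the names and options are illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- A columnar table, compressed on disk.
CREATE FOREIGN TABLE sales_facts (
    sold_at  timestamptz,
    product  text,
    quantity int,
    amount   numeric
)
SERVER cstore_server
OPTIONS (compression 'pglz');

-- Aggregations read only the columns they reference,
-- skipping the rest of the table entirely.
SELECT product, sum(amount) FROM sales_facts GROUP BY product;
```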
The old school way of backing up and restoring a Postgres database is by using pg_dump and pg_restore, respectively. These work reasonably well for small deployments, but at scale a full logical dump and restore becomes prohibitively slow and resource-hungry. WAL-E is a solution that leverages base backups and WALs to enable continuous archiving. The base backup contains a snapshot of the data directory, which does not interfere with any transactions, while the WAL contains incremental changes to the base backup. WAL-E can send these to a number of external storage options, such as AWS S3, GCS, Azure Blob Store, and the local filesystem. Furthermore, it also performs compression and encryption.
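A sketch of the WAL shipping side, adapted from WAL-E's documented usage; the paths and the `envdir` prefix (which supplies storage credentials) are illustrative of a typical setup:

```
# postgresql.conf: ship each completed WAL segment off-box.
archive_mode    = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

# Run periodically (e.g. nightly) from cron to push a fresh
# base backup of the data directory:
#   envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/data
```

Restoring then means fetching the latest base backup and replaying the archived WAL segments on top of it, which also gives you point-in-time recovery for free.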
Postgres out of the box is capable of performing full-text search across multiple languages. In fact, it can even be used to implement a rudimentary search engine. For advanced use-cases though, systems like Elasticsearch are a better match. In such setups the primary datastore is still Postgres, but textual search is offloaded to Elasticsearch. This obviously necessitates keeping data in sync between the two systems, which leads to consistency problems. ZomboDB simplifies this by using Elasticsearch as an index type within Postgres. The gory details of Elasticsearch and data integration are masked away by ZomboDB, while clients only interact with Postgres.
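A ZomboDB sketch; the index DDL has varied between ZomboDB versions, so this follows the newer form and assumes an Elasticsearch cluster is reachable at the given URL:

```sql
CREATE EXTENSION zombodb;

CREATE TABLE products (
    id    serial PRIMARY KEY,
    name  text,
    descr text
);

-- The zombodb index type hides an Elasticsearch index behind a
-- normal Postgres index; ZomboDB keeps the two in sync.
CREATE INDEX products_zdb
    ON products
 USING zombodb ((products.*))
  WITH (url = 'http://localhost:9200/');

-- Full-text search with Elasticsearch query syntax, via ==>.
SELECT * FROM products WHERE products ==> 'descr:waterproof';
```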
One widely applicable use-case for Postgres is its usage as the persistence layer for a RESTful API service. The setup typically consists of a web server, an ORM, JSON serialisers, and a ton of boilerplate to handle data mapping, security, database interaction, and so on. Conceptually, these APIs are just a RESTful view atop underlying database tables. PostgREST simplifies this task by automatically stitching together a RESTful API directly from an existing database. It uses a Haskell server, which brings all the good features of the language, such as its legendary type safety, to the table. It also achieves scalability by delegating as much as possible to the database. For instance, JSON responses are serialised directly within the database.
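Conceptually, PostgREST turns tables into endpoints. A sketch, with the HTTP-to-SQL mapping shown as comments; the todos table is illustrative:

```sql
-- Given an ordinary table...
CREATE TABLE todos (
    id   serial PRIMARY KEY,
    task text NOT NULL,
    done boolean DEFAULT false
);

-- ...PostgREST exposes it over HTTP without further code, e.g.:
--   GET  /todos                 ->  SELECT * FROM todos;
--   GET  /todos?done=is.false   ->  SELECT * FROM todos
--                                   WHERE done IS FALSE;
--   POST /todos {"task": "..."} ->  INSERT INTO todos (task) ...;
```

Access control is likewise pushed down into the database, via Postgres roles and row-level security, rather than reimplemented in middleware.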
No article these days is complete without a mention of machine learning. Real-world machine learning pipelines consist of fetching data from some data store and training a model in sklearn, Spark ML, TensorFlow, R, etc. For scoring, another variant of the same pipeline with the trained model is employed. This entire flow involves data materialisation across multiple systems, which affects the end-to-end latency of the application. What if you could do machine learning in situ in the database? MADlib leverages standard SQL constructs (SELECT, FROM, WHERE, GROUP BY) to enable machine learning and statistical analysis within Postgres. Under the hood, it combines a driver UDF in Python, which controls iteration, with C++ UDFs for the mathematical operations. It currently supports classification, regression, clustering, neural networks, graph analytics, and several other techniques.
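A sketch of in-database linear regression with MADlib, assuming the extension is installed under the `madlib` schema; the tables and columns are illustrative:

```sql
-- Training data: predict price from size and bedrooms.
CREATE TABLE houses (price float8, size float8, bedrooms float8);

-- Fit a linear regression; coefficients land in houses_model.
SELECT madlib.linregr_train(
    'houses',                    -- source table
    'houses_model',              -- output table
    'price',                     -- dependent variable
    'ARRAY[1, size, bedrooms]'   -- features (leading 1 = intercept)
);

-- Score rows with the trained model, entirely inside Postgres.
SELECT madlib.linregr_predict(
           m.coef, ARRAY[1, h.size, h.bedrooms]
       ) AS predicted_price
FROM houses h, houses_model m;
```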
So the next time you have a project that requires any of the features listed above, consider defaulting to Postgres. This list is also by no means exhaustive: there are plenty of other really cool Postgres extensions and packages out there. Do yourself a favour and explore them.