From CITUS Postgres, WITH love

Background

So, sometime in 2016, Chartbeat tried to go all BACK TO THE FUTURE on our customer base to power an interactive web-based historical dashboard. To that point, we were all about surfacing real-time, real-time and more real-time analytics. This shift would give our client publishers a peek at their audience data for the last day, week, month or several months. We figured publishers would be interested in seeing their top articles over a period of time, as well as an hourly breakdown of traffic pivoted on a few useful properties, since that’s the gluten-free bread and butter of our very popular real-time dashboard.

The problem with doing this is that our raw dump of every visit by every audience member from each of their devices for every one of our clients would be impossible to query responsively. In fact, in raw form, Chartbeat ingests records of 30–40K visits per second. Thats over 50B page views a month! If I had a nickel for every page view, I wouldn’t be writing blog posts about slow database queries.

Historical Audience Data Pipeline

The compromise was to roll-up our visitor data into hourly buckets for each publisher article, while separately storing roll-ups that attribute articles to particular authors or publication sections.

But, we couldn’t break out the champagne just yet. Even the roll-ups aren’t cheap.

During peak traffic periods for our nearly 5000 Chartbeat Publishing clients, we, from our Amazon Redshift visitor store, import hourly roll-ups to our Postgres cluster of about 1.2M publisher article engagement records, 1.6M section tags along with 800K author tags for publisher articles and 130K overall publisher site engagement records.

Hourly Postgres Visitor Data Imports

Le Probleme

So, for our 2 biggest Postgres tables, we are ingesting over 28M rows per day. In general, we are importing over 2.6B rows a month into the entire cluster.

Writing and running naive join queries of our sections or authors with our publisher article data in search of top publisher articles would feel like waiting for a crosstown bus in Union Square during rush hour.

So, how could we promise our publishers month-long queries for their top articles that returned in under 5 seconds? And, for the real bacon, how could we promise them sub-second queries of a day’s worth of data?

Strike 1: An average-sized single-instance Postgres Cluster using basic indices

Our first set of queries were pretty straight forward: sort all the things for a given publisher, get the top N articles, and then try to join them to find the authors for those articles. We used a simple index that uniquely identified each entry for each publisher article for each hour. We tried a standard Amazon RDS Postgres instance of the r3.2x large variety.

Bases-loaded upper-deck homerun, game over and we win. Right? Not exactly. Not even close.

Our month-long query response times were averaging 40 seconds which was a long drive from the 5 second response times we wanted for our publishers.

Strike 2: A bigger single-instance Postgres Cluster using the same indices.

Ok. We’re smart. Right? We needed a 5x improvement in our response times. So, trying a database cluster with 4x the amount of memory had to get us a lot closer right? This wasn’t just a shot in the dark. We knew all about work_mem, the amount of memory Postgres allocates to operations involving sorting, hashing or merging. And, the crux of our problem was trying to sum a host of article engagement records and then sort them before joining them on our author data table. So, here we were willing to spend 3x the monthly hosting cost for 4x the amount of memory on a r3.8x large. This was the walk-off win we were ready to celebrate. Right?

Again, just a bit outside. Our response times were still averaging 30 seconds.

Covering Indices: The only way to make the playoffs.

After staring long enough at the romantic poetry of the Postgres EXPLAIN ANALYZE output for one of our top articles queries, it was pretty clear that fast query kryptonite is BITMAP HEAP Scan. Yes, the query planner is fixin’ to search for select rows and not sequentially scan the actual table, but no, a month’s worth of hourly article records for the average publisher site isn’t a quick fetch. We had to limit the number of table rows we read from disk or there would be no playoffs for Team Chartbeat.

The trick, as always, was to break down the problem into smaller pieces. Here, we did it using common table expressions (CTEs). On the streets, they call them WITH clauses. Because of a Postgres quirk, these mini-queries gave us explicit control over the query plan. A CTE in Postgres is what’s known as an “optimization fence.” Most SQL flavors will push query optimization through CTEs, but in Postgres they chill hard when they see CTEs. Put plainly, CTEs give us a way to tell the query planner “I got you” instead of letting the know-it-all query planner engine decide how to run the query.

By using covering indices in our CTEs or indices that contain all the columns our mini-queries needed to sum and rank article engagement metrics, we were able to avoid the dreaded disk-access DETENTION. There were still metadata things we needed to read from good ‘ol disk like article titles and url. But, getting metadata for the top N articles from disk is a bit faster than trying to fetch the same data for each of the hundreds or thousands of a given publisher’s articles that would be thrown away anyway by LIMIT N. In the end, we fixed our queries by borrowing the same playbook for people who quickly get in and out of Trader Joe’s: figure out what you need for dinner before you ask a cheerful store associate to help you search all the things in all the aisles.

But, that was only part of the secret sauce. The other part was a <drum roll> </drum roll> distributed Postgres solution.

Just CITUS Baby: Fast queries need a rockin’ fast cluster

We successfully limited the number of rows our queries had to fetch from disk which made them a lot faster, but they still weren’t fast enough for a month, and we wanted longer range queries to not be impossible as well.

We needed scale. We needed shards.

We thought about rolling our own sharded Postgres solution, but that seemed like way too much of a last-mile effort. For one, we’d have to hand-write code to manage shard and tenant placement, to say nothing of the good fun of implementing error-free atomic writes across a distributed cluster. Oh, and lastly, there would be the serious migraine every time we needed to re-size our cluster. Plus, who feels like doing any of that when hungry product managers want hot fresh product yesterday?

So, we went with Citus Data, a cloud database company whose mission, as they put it, is to make developers never worry about scaling. Citus is an open-source extension to PostgreSQL that distributes the database and queries across multiple machines. With their scaled-out Postgres solution, Citus was neatly positioned to address our multi-tenant (or, in our case, multi-publisher) database access patterns.

Not only was the setup pretty easy. CITUS DEVS HELPED US with our queries. Let me say that again. CITUS DEVS HELPED US with our queries and were a big reason why we fixed our index game.

Citus On-boarding

Our Python-based importers had to be changed to import to a distributed cluster and not a single-instance one. Basically, three main changes were needed: schema changes, ingestion changes and access query changes.

  • We needed a hash key on all our tables so that the citus plugin could know which shard to query for a particular publisher’s data. We went with a hash of publisher ids. We could have hashed on time or a combination of time and publisher_id. But, we figured a publisher-based hash would be the easiest to reason on later. While we had the hood up our schema, we had to tweak how we generated table indices. Among other things, Citus 5.1 didn’t support CREATE INDEX CONCURRENTLY, for example.
  • Then, we needed to migrate all our SQL writes to use COPY and not INSERT, since Citus 5.1 didn’t support multi-line inserts to distributed tables. (This didn’t end up being too much of a drag because it was on our backlog to do anyway, since it made imports at least 20% faster.)
  • Lastly, we needed to slightly tweak all of our queries to be shard-aware. Simply put, every clause of every query had to be loud and proud about which publisher id it was referencing. This was mostly because the query planner waves the ‘no mas’ flag on any subqueries where the shard isn’t specified, and the shard is a function of the distribution column or, in our case, the publisher id.

Citus Gotchas

  • Citus 5.1 and the versions, we’ve updated to since then, don’t provide transactional support for distributed table ingestion. So, yeah, our product has paged our on-call engineers engineers yelling about failed imports that weren’t properly rolled back a few times, but we’ve stayed off the most-likely-to-page list for quite a few months. (Citus has addressed the transaction issue in version 7)
  • Upsizing or upgrading clusters dependent on those game-changing indices can be time-consuming. From all-night vacuum analyzes after a citus version upgrade and reboot to baseball-game sloooow shard rebalances, there were moments over the last 9 months where we’ve definitely felt like we had a cranky newborn. (Citus has promised zero-downtime shard rebalancing in version 7.1)
  • #RealTalk: The Amazon RDS Dashboard isn’t exactly cash money, but the Citus Dashboard is, by no stretch of your cloudy imagination, an upgrade.

In sum, choosing Citus was the smart choice. Sure, its never fun to need to tweak parts of a pipeline just before shipping a product, but the lifting here was reasonable: a few days of work where a lot less than the weeks (or months) we’d spend building, re-building and unbreaking our own app-level sharded solution. Plus, Citus Data’s solid technical support made our trips to the data dentist a lot less painful.

The other souvenir from the experience has been the Citus team’s willingness to take 100% of our feedback from our migration and maintenance pain points and address them in future versions of their extension. That didn’t mean much in the very short-term of ship-or-die, but has meant a lot in the time since launch.

Oh, and by the way, our queries got way faster. A healthy citus cluster has returned month-long queries in under 5 seconds pretty routinely which has Chartbeat’s historical dashboard pretty #lit.

Citus Postgres for the scaled-out win!

Summary

Covering indices are everything, especially when you need to limit disk access.

Oh, and the next time you think you feel like rolling your own sharded PostgreSQL at the last-minute, get your therapist on the phone and then give Citus a serious look.