How we’re solving a latency issue on a 3-billion-record Postgres table

Evgeny Kharitonov · Published in Tailor Tech · 7 min read · Aug 7, 2023

No one likes getting paged in the middle of the night, and even less so when the alert reads “Database latency”. That’s never a good sign.

Sporadic slowdowns of SQL queries on an unpartitioned table with more than 3 billion records, recurring every few months: that’s the issue we recently faced at Tailor Brands. This post is about how we handled it and some interesting conclusions we drew along the way.

‘Brands’ Table

Before diving into the issue, a few words about the brands table and the tech stack.
We’re using AWS RDS Aurora PostgreSQL clusters, and in one of them we have a database with a table called brands, which recently reached 3 billion records.
Our users can create multiple brands, and for historical reasons the table isn’t partitioned. Every day there are about 800k INSERT/UPDATE operations on that table.

As you can imagine, that many update operations made vacuum runs both more frequent and more resource-intensive.

The Sporadic Issue

Query latency on the brands table suddenly increased from ~5ms to ~300ms, and the CPU utilization of our RDS cluster increased by ~20%.
We call it “sporadic” because it wasn’t a one-time incident: it happened a couple of times over the last few months, and our assumptions about the root cause turned out to be inaccurate.
So after the third time it happened, we dove in.

You might be asking yourself at this point why we didn’t dive head first into research the first time it happened. That’s a fair question.

The answer is that when we prioritize our work and do cost/value calculations, one-off incidents get a low value rank, since we don’t immediately assume they’ll repeat.
We also take into account how much work is required to resolve an issue when it occurs, even if only at the symptom level rather than the root cause. In this case, we managed to resolve the immediate symptom quickly and without much effort.

Another aspect was that we thought we knew why it was happening; we were close, but not 100% there.

(Lack of) Detection

We’re using Datadog to monitor all our microservices and infrastructure metrics.
Each microservice and infrastructure component is covered by pre-defined monitors (managed with Terraform) with static thresholds or anomaly detection, which notify us via Slack, and via PagerDuty if it’s a critical issue.

Since the issue affected a single table, the combined latency of the one service accessing it didn’t rise enough to trigger an alert, so it took a while until we noticed the increase in one of our dashboards.
That meant the effects lasted longer than they needed to.

This drove us to improve our monitoring: for our main endpoints, we now set up specific monitors that are separate from the monitors of the service running them.

Investigation

Once the issue was detected, we started investigating it as a high-priority incident due to its impact on the user experience.

We quickly observed that most HTTP request latencies to the relevant microservice had spiked from ~100ms to 1–3 seconds.
Further investigation with Datadog’s flame graphs revealed that the cause was a sudden surge in SQL query latency; since most requests perform several SQL queries on the brands table, this significantly slowed down all requests to this critical core service.
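Datadog pointed us at the application side, but you can cross-check from the database side too. If the pg_stat_statements extension is enabled on the cluster, a query along these lines lists the slowest statements touching the table (a sketch only; mean_exec_time and total_exec_time exist on PostgreSQL 13+, older versions expose mean_time and total_time instead):

-- Top statements by average execution time that mention the brands table
-- (requires the pg_stat_statements extension)
SELECT
    calls,
    mean_exec_time,   -- mean_time on PostgreSQL 12 and older
    total_exec_time,  -- total_time on PostgreSQL 12 and older
    query
FROM pg_stat_statements
WHERE query ILIKE '%brands%'
ORDER BY mean_exec_time DESC
LIMIT 10;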

We started looking for a root cause. First we checked our RDS Aurora (PostgreSQL) cluster metrics (the cluster is monitored by Datadog as well) to verify that the cluster’s instances were healthy and that there was nothing unusual in the metrics or events.
Finding no clues there, we looked at the Slack release channels for any recent releases of the service (we practice continuous deployment and release to production many times a day) and saw nothing there either.
We then checked whether the issue was coming from a single pod (our microservices run on AWS EKS) or a single node (we use Karpenter for node selection and auto-scaling), whether the pods had enough resources allocated (memory/CPU), and went through the service’s logs.
All of these checks came up empty.

You might note that checking the pod/service behavior shouldn’t help here, since changes there shouldn’t affect DB performance as long as the queries stay the same. But that’s not always the case: a release that adds many new queries, for example, can.
Once we eliminate the immediate suspects, we always try to cover a wide breadth of areas when checking for root causes.

You never know where you’ll find your next clue!

ANALYZE and DB stats

After clearing all the above potential root causes, we began to suspect that the issue might be related to the table itself since we knew that the table is huge and that it’s updated very frequently.

We decided to check when the last VACUUM and ANALYZE operations were executed on the table brands.

ANALYZE collects statistics about the contents of tables in the database, and stores the results in the pg_statistic system catalog. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries.

We executed the following query:

SELECT
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM
    pg_stat_user_tables
WHERE
    relname = 'brands';

We noticed that the last VACUUM had run a few days earlier, but ANALYZE hadn’t been executed for 5 months. That means the statistics for the brands table weren’t updated for a very long time, and during that period there were about 120 million INSERT/UPDATE operations on it.

A table becomes a candidate for auto-analyze when the number of tuples inserted, updated, or deleted since the last ANALYZE exceeds:

TotalRecords * autovacuum_analyze_scale_factor + autovacuum_analyze_threshold

Default values are:

autovacuum_analyze_scale_factor: 0.1
autovacuum_analyze_threshold: 50

In our case that means the table would only be auto-analyzed after ~300 million write operations on the brands table (3 billion × 0.1), and we had “only” about 120 million.
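To see how far a table is from that trigger, you can compare the modification counter that autovacuum tracks against the computed threshold. A sketch using the default values above (n_mod_since_analyze is available in pg_stat_user_tables since PostgreSQL 9.4):

-- Compare the auto-analyze trigger with the modifications since the last ANALYZE
SELECT
    c.reltuples::bigint              AS estimated_rows,
    (50 + 0.1 * c.reltuples)::bigint AS auto_analyze_trigger,
    s.n_mod_since_analyze            AS modifications_since_last_analyze
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relname = 'brands';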

After this revelation about the state of ANALYZE, we manually executed:

ANALYZE brands;

And it instantly restored latency to normal levels.

Same Issue but a lot worse


We decided to update the values so that auto-analyze would run roughly every 5M write operations (about once a week):

ALTER TABLE brands
SET (
    autovacuum_analyze_scale_factor = 0,
    autovacuum_analyze_threshold = 5000000
);
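If you want to confirm that per-table settings like these actually took effect, they are stored as storage parameters in pg_class.reloptions; a quick check looks like this:

-- Shows per-table overrides set via ALTER TABLE ... SET
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'brands';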

The first time auto-analyze was triggered after this change, we saw the same symptoms again, which surprised us. To mitigate the issue quickly, we manually ran ANALYZE brands once more.
But this time, instead of solving the issue, the CPU utilization of RDS spiked to 100% and a very large queue of queries piled up. We canceled the ANALYZE query, but it didn’t help.
We had to scale the microservice down completely; only then were we able to run ANALYZE brands successfully again, scale the service back up, and have everything work. WEIRD 😵!
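When queries pile up like that, pg_stat_activity is a quick way to see what is running or waiting, and on what. A sketch of that kind of check (wait_event_type and wait_event exist on PostgreSQL 9.6+):

-- Currently running or waiting queries, oldest first
SELECT
    pid,
    now() - query_start AS runtime,
    state,
    wait_event_type,
    wait_event,
    left(query, 80)     AS query_snippet
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start;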

Investigation (again)

Our hypothesis: by default, ANALYZE samples 300 × default_statistics_target rows (300 × 100 = 30,000 rows with the default setting) to build the statistics the PostgreSQL query planner uses to choose execution plans. On a 3-billion-row table, that’s about 0.001% of the records. A sample that small can end up with statistics that don’t reflect the table’s data correctly, and running it once a week turns into a game of roulette: sometimes the stats are good and sometimes they’re not.
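One way to sanity-check whether the sampled statistics reflect reality is to look at pg_stats for the columns your queries filter on and compare estimates such as n_distinct against what you know about the data. A sketch; user_id is a hypothetical column name, not necessarily one on our table:

-- Planner statistics for a column of the brands table
-- (user_id is a placeholder; use the columns your queries filter on)
SELECT
    attname,
    n_distinct,
    null_frac,
    correlation,
    most_common_vals
FROM pg_stats
WHERE tablename = 'brands'
  AND attname = 'user_id';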


Mitigation plan

First off, we reverted the settings back to the default values. We suspect the first incident happened because of badly outdated stats (~5 months old), so the defaults give us much more than a week’s headroom until the potential next incident.
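For reference, reverting per-table overrides back to the cluster defaults is a RESET on the same storage parameters, something along these lines:

-- Drop the per-table overrides so brands falls back to the defaults
ALTER TABLE brands
RESET (
    autovacuum_analyze_scale_factor,
    autovacuum_analyze_threshold
);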

We also added a new dedicated Datadog monitor that alerts us on increased latency of SQL queries on the brands table.
Additionally, we updated our Terraform monitor module with a new monitor that alerts on unusual SQL query latency, so we no longer rely only on the overall HTTP request latency of each microservice.

What’s next

The microservice that uses the brands table does not currently use the RDS cluster’s read replicas, so we’re planning to route this service’s reads through the replica as well, using pgpool2, which will do the query parsing and routing for us.

We’re also planning to increase the value of default_statistics_target so the sample size is bigger, to try to avoid incorrect statistics. If changing the value helps, we’ll switch auto-analyze back to more frequent executions.
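On RDS the global default_statistics_target is changed through the cluster’s parameter group, but the sample size can also be raised per column without touching the global setting. A sketch (user_id is again a hypothetical column, and the larger sample only takes effect after the next ANALYZE):

-- Raise the statistics target for a hot column, then refresh the stats
ALTER TABLE brands
    ALTER COLUMN user_id SET STATISTICS 1000;

ANALYZE brands;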

You can expect an update once we know more.
