I am not going to say — we were caught off-guard at the last minute, but during preparation for soak testing, one of our applications became unresponsive for no apparent reason. After a short investigation, we found that a single AWS RDS instance had run out of space. At first glance, we had no idea why 🤭!
Luckily the impacted environment was only used for testing and had limited access from the outside world. When the environment went down, no load tests were being run, and we were pretty confident that there will not have been any significant amount of data processed and stored that day.
One might be tempted to throw more 💰💰💰 at it by increasing the storage space and move on. But what used up all the disk space? No exciting cliff-hanger here — it was a database log file!
Investigating into the root cause, we were taught a lesson by AWS about its product internals. For others to quickly benefit from our learnings, I’ve outlined below a TLDR version (disclaimer: for brevity, we will be discussing a single AZ setup)
How Amazon RDS works?
Amazon RDS is basically an EC2 instance that AWS manages for you (updating it when needed, applying security patches, etc.). It runs a database engine (in our case PostgreSQL server) and a utility tool (that we will call AWS Agent) for connecting it to the AWS console.
Under-the-hood, your managed EC2 instance mounts an EBS volume for data storage (volume mounting makes features like snapshots possible). Not everyone knows that this drive is also used for storing all database log files (in alignment with your data retention policies). In our case, the log files took up all the free space and were the reason for our service going down.
Why there was so much log data?
We were able to pin down the root cause of the service outage. It was a single SQL script file being used to bulk insert test data. In development, we usually prepare fixtures using SQL scripts that get executed on the database. Most of the time, they are short (inserting one or two records at a time), so they don’t take long to complete and are an excellent way to set up the environment for testing against known data inputs/outputs. The bulk insert of test data, in this case, was slightly larger than usual but still within a reasonable size range. A couple of thousands of data inserts. Something that even lower capacity instances should be able to handle easily.
Unfortunately, our system was impacted by running large SQL script files because of the PGAudit extension:
The PostgreSQL Audit Extension (or pgaudit) provides detailed session and/or object audit logging via the standard logging facility provided by PostgreSQL. The goal of PostgreSQL Audit to provide the tools needed to produce audit logs required to pass certain government, financial, or ISO certification audits.
For every SQL statement that modifies our underlying data, PGAudit creates a corresponding entry in the database log file. The oversight was that for every data modification being tracked, PGaudit saves the entire content of the executed script file (for each change). Practically — this meant our data volume suddenly got populated with a few thousand copies of the bulk data loading script. The space occupied by log statements grew exponentially 🚀. For a single INSERT added to the file, we would get an additional copy of the (thousand line+) script file in the log, creating a compounding effect.
This explained how we were able to run out of space with a single bulk INSERT. How did we fix it? After getting service back up and running (restoring from a snapshot and re-allocating storage), we’ve ensured that any test data scripts dealing with bulk inserts, disable database logging at the beginning and re-enable it at the end (to return the configuration to its previous state, minus a multiple-GB log file). That was it!
This story was a great reminder about understanding how security features can very quickly have functional impacts, so should be fully understood before subscribing to the mindset of “log everything, all the time, forever….”. It’s also a nice reminder to make sure you have metrics and alarms in place for your RDS instances so you can react quickly to any unforeseen side-effects that might only show themselves in test environments! 😁.
If you are interested in data-related tips & tricks we have recently written about the role of SQL in Big Data — give it a try and let me know what do you think