Say Goodbye to incidents with Qonto’s database monitoring framework

Vincent MERCIER
Published in The Qonto Way · 6 min read · Dec 21, 2023

It’s the first day of the month and our customers’ monthly billing cycles have begun. In the afternoon, the on-call SRE receives a message from the development team: our web services are suffering from significant latency.

All our applications are instrumented with tracing, so we quickly identify that the SQL query response times have suddenly increased.

The on-call engineer checks AWS CloudWatch for the state of the AWS RDS instance, but nothing seems out of the ordinary. They open AWS RDS Performance Insights, where the slow queries reported by the developers are clearly identifiable. What remains unclear is why we’re suddenly experiencing these performance issues.

This drop in performance risks delaying our customers’ monthly invoicing processes. At this point, the on-call SRE decides to escalate and reach out for help. Several engineers are mobilized to diagnose and fix these performance issues.

One staff engineer identifies higher IOPS consumption than usual. With basic math and AWS knowledge, he discovers that we’ve reached the maximum provisioned IOPS for the RDS instance. SQL queries have been slowed down because the disk is saturated.

Initially, our thoughts turn to reducing our abnormal IOPS consumption, but that’s a tricky operation that could take some time. Of course, slowing down the billing cycle might help. But that would be detrimental to our users who need their invoices.

We are on the cloud, right? Then, we should be able to mitigate the saturation by increasing the provisioned IOPS. An engineer launches the operation to increase from 18,000 to 30,000 IOPS.

After a nail-biting wait, the number of provisioned IOPS increases, and the team breathes a sigh of relief.

But once the read/write IOPS metrics appear in CloudWatch, it becomes clear that the incident is far from over. Consumption hasn’t budged despite the increase in provisioned IOPS (and the related costs!).

[Figure: Grafana dashboard showing RDS instance disk IOPS]

Later, in the incident post-mortem, we will learn that the usable number of IOPS is in fact limited by instance type, regardless of what’s billed.
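
The arithmetic behind that lesson is simple, and can be sketched in a few lines of Go. This is only an illustration: the 18,000 and 30,000 figures come from the incident above, while the instance-type ceiling of 18,000 and the function name `EffectiveIOPS` are assumptions made up for the example.

```go
package main

import "fmt"

// EffectiveIOPS returns the IOPS a database can actually use: the lower of
// what is provisioned on the storage and what the instance type can drive.
func EffectiveIOPS(provisioned, instanceMax int64) int64 {
	if provisioned < instanceMax {
		return provisioned
	}
	return instanceMax
}

func main() {
	// Hypothetical instance-type ceiling, chosen only for illustration.
	const instanceMax = 18000

	// Provisioned values before and after the change made during the incident.
	for _, provisioned := range []int64{18000, 30000} {
		fmt.Printf("provisioned=%d usable=%d\n",
			provisioned, EffectiveIOPS(provisioned, instanceMax))
	}
}
```

Under this assumption, raising the provisioned value changes the bill but not the usable figure, which matches what we observed in CloudWatch.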

We eventually managed to reduce IOPS consumption at the application level, but the incident highlighted a lack of control.

Our monitoring alerts were not up to scratch, and our reaction wasn’t appropriate because we had no visibility over these AWS RDS limitations.

Does all this sound familiar? Well, now we can help! We’re proud to announce the open-source release of our Database Monitoring Framework (DMF), which includes:

  1. an advanced Prometheus exporter for AWS RDS,
  2. a set of 30 alerts that all AWS RDS users should have,
  3. runbooks on how best to react to alerts.

If you’re running RDS on AWS at scale, try it and let us know how it goes! Follow the Getting Started guide or read on to find out how we built DMF and how it helps us avoid stressful situations.

Building the metrics exporter of our dreams

CloudWatch’s metrics are essential but, unfortunately, incomplete. You can generally access consumption values but not the limits — you’re missing context.

Over time, we have identified one key takeaway: there are three usual suspects for database performance problems, namely CPU, IOPS usage, and locks.

So, how could we have reduced mitigation time? Well, it turns out that to efficiently monitor IOPS consumption and limits, you have to gather information from several AWS APIs. This led us to develop our own exporter that can give us the full consolidated picture.

To effectively monitor RDS instances, we combine information from no fewer than four different AWS APIs:

  • AWS RDS to collect instance inventory and settings,
  • AWS CloudWatch to collect instance consumption metrics,
  • AWS EC2 to collect physical instance capacity (e.g., number of vCPU, max IOPS, etc.),
  • AWS Service Quotas to be aware of AWS quotas (e.g., available storage).

[Figure: architecture diagram; the Prometheus RDS exporter uses four AWS APIs]
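
To make the idea of a consolidated picture concrete, here is a minimal Go sketch of a record that merges one value from each source and relates consumption to the usable limit. The type `InstanceView`, its field names, and the sample values are all hypothetical; they are not taken from the exporter's source code.

```go
package main

import "fmt"

// InstanceView is a hypothetical consolidated record. Field names are
// illustrative and are not taken from the exporter's source code.
type InstanceView struct {
	// AWS RDS: inventory and settings.
	Identifier      string
	ProvisionedIOPS int64

	// AWS CloudWatch: consumption metrics.
	ReadIOPS  float64
	WriteIOPS float64

	// AWS EC2: capacity of the underlying instance type.
	InstanceMaxIOPS int64

	// AWS Service Quotas: account-level limits.
	StorageQuotaGB int64
}

// UsableIOPS is the lower of the provisioned and instance-type limits.
func (v InstanceView) UsableIOPS() int64 {
	if v.ProvisionedIOPS < v.InstanceMaxIOPS {
		return v.ProvisionedIOPS
	}
	return v.InstanceMaxIOPS
}

// IOPSUsageRatio relates consumption to the usable limit.
// It returns false when no positive limit is known.
func (v InstanceView) IOPSUsageRatio() (float64, bool) {
	limit := v.UsableIOPS()
	if limit <= 0 {
		return 0, false
	}
	return (v.ReadIOPS + v.WriteIOPS) / float64(limit), true
}

func main() {
	// All values below are made up for illustration.
	v := InstanceView{
		Identifier:      "example-db",
		ProvisionedIOPS: 30000,
		ReadIOPS:        9000,
		WriteIOPS:       7200,
		InstanceMaxIOPS: 18000,
		StorageQuotaGB:  100000,
	}
	if ratio, ok := v.IOPSUsageRatio(); ok {
		fmt.Printf("%s uses %.0f%% of its usable IOPS\n", v.Identifier, ratio*100)
	}
}
```

The point of the sketch is that no single API provides both the numerator and the denominator of that ratio, which is why the data has to be merged.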

We must also be familiar with advanced AWS technical details, such as calculating the available IOPS, which depends on the instance’s storage type.

Here is the logic used to determine the IOPS and storage throughput available on an RDS instance:

    switch storageType {
    case "gp2":
        // gp2: IOPS scale with allocated storage, bounded by a minimum and a maximum.
        iops = ThresholdValue(gp2IOPSMin, allocatedStorage*gp2IOPSPerGB, gp2IOPSMax)
        // gp2: throughput depends on whether the volume is above a size threshold.
        if allocatedStorage >= gp2StorageThroughputVolumeThreshold {
            storageThroughput = gp2StorageThroughputLargeVolume
        } else {
            storageThroughput = gp2StorageThroughputSmallVolume
        }
    case "gp3":
        // gp3: throughput is the value reported for the instance.
        storageThroughput = rawStorageThroughput
    case "io1":
        // io1: throughput depends on the provisioned IOPS tier.
        switch {
        case iops >= io1HighIOPSThroughputThreshold:
            storageThroughput = io1HighIOPSThroughputValue
        case iops >= io1LargeIOPSThroughputThreshold:
            storageThroughput = converter.KiloByteToMegaBytes(iops * io1LargeIOPSThroughputValue)
        case iops >= io1MediumIOPSThroughputThreshold:
            storageThroughput = io1MediumIOPSThroughputValue
        default:
            storageThroughput = converter.KiloByteToMegaBytes(iops * io1DefaultIOPSThroughputValue)
        }
    }
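
The excerpt above relies on constants and a `ThresholdValue` helper that are not shown. As a rough, self-contained illustration of the gp2 branch, the sketch below assumes that `ThresholdValue` clamps a value between a minimum and a maximum, and it assumes gp2 values of 3 IOPS per GB, a floor of 100, and a ceiling of 16,000. Both the helper body and the constants are assumptions for the example and should be checked against the current AWS documentation and the exporter's source.

```go
package main

import "fmt"

// Assumed gp2 characteristics, for illustration only.
const (
	gp2IOPSMin   int64 = 100
	gp2IOPSMax   int64 = 16000
	gp2IOPSPerGB int64 = 3
)

// ThresholdValue clamps value to the range [lower, upper].
// This is a hypothetical reimplementation of the helper used in the excerpt.
func ThresholdValue(lower, value, upper int64) int64 {
	if value < lower {
		return lower
	}
	if value > upper {
		return upper
	}
	return value
}

// GP2BaselineIOPS estimates the IOPS of a gp2 volume from its size in GB.
func GP2BaselineIOPS(allocatedStorageGB int64) int64 {
	return ThresholdValue(gp2IOPSMin, allocatedStorageGB*gp2IOPSPerGB, gp2IOPSMax)
}

func main() {
	for _, sizeGB := range []int64{20, 1000, 10000} {
		fmt.Printf("%d GB -> %d IOPS\n", sizeGB, GP2BaselineIOPS(sizeGB))
	}
}
```

With these assumed constants, a small volume is lifted to the floor, a mid-sized one scales linearly, and a very large one is capped at the ceiling.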

We’re happy to announce that we’ve consolidated all these metrics and knowledge into Qonto’s Prometheus RDS exporter open-source project.

Each collected metric stems from lessons learned from production incidents or operational needs. If we had had the right metrics five years ago, we would have avoided a lot of troubleshooting time and restored degraded services to our customers much faster. This is why we invested in tooling: to increase confidence in the system’s reliability. Now, new joiners and people outside the SRE storage team reap the rewards.

For a comprehensive experience, the project also includes our Grafana dashboards, which leverage all these metrics to surface issues with your RDS instances at a glance.

[Figure: Grafana dashboard showing RDS instance metrics]

The challenge of raising relevant alerts

We solved the metrics part, but what alerts should we set up to prevent incidents?

We identified 30 mandatory alerts to prevent RDS incidents and released them as Prometheus alerts in our Database Monitoring Framework open-source initiative.

As a result, we have alerts on resource saturation (e.g., disk space usage, IOPS saturation) and a way to detect issues before they occur. Traps like maximum storage autoscaling, pending RDS maintenance, or unapplied configurations are also covered.
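
To illustrate what detecting an issue before it occurs can look like, here is a small Go sketch that linearly extrapolates two free-space samples to estimate when a disk would fill up. It is not how the framework's alerts are written (those are Prometheus alerting rules); the function name `HoursToFull` and the sample values are made up for the example.

```go
package main

import "fmt"

// HoursToFull linearly extrapolates two free-space samples taken
// elapsedHours apart and estimates the hours left before free space
// reaches zero. It returns false when free space is not shrinking
// or when the interval is not positive.
func HoursToFull(freeThenGB, freeNowGB, elapsedHours float64) (float64, bool) {
	if elapsedHours <= 0 || freeNowGB >= freeThenGB {
		return 0, false
	}
	consumedGB := freeThenGB - freeNowGB
	return freeNowGB * elapsedHours / consumedGB, true
}

func main() {
	// Made-up samples: 100 GB free an hour ago, 90 GB free now.
	if hours, ok := HoursToFull(100, 90, 1); ok {
		fmt.Printf("disk full in about %.1f hours\n", hours)
	}
}
```

An alert built on this kind of projection can fire while there is still time to act, rather than when the disk is already full.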

Mitigating the stress of database alerts

Receiving an on-call alert about production databases is always stressful.

We mitigate this stress by writing a runbook for each alert, explaining how to evaluate the impact, analyze the situation, and mitigate it.

[Figure: runbook to handle the RDS disk autoscaling limit alert]

If you are using Kubernetes, this should sound familiar. It’s the same runbook layout as the Prometheus Operator’s, but for storage systems!

Our RDS and PostgreSQL runbooks are publicly available on GitHub.

What’s next? Build a community!

Interested? Check out our Getting Started guide — you can deploy it in your environment in under 15 minutes!

We’ll continue to learn, improve the system, and publish PostgreSQL dashboards. We’re also looking to build a similar exporter for AWS MSK.

Feel free to contribute to the projects on GitHub or apply to work with Qonto’s SRE storage team!

About Qonto

Founded in 2016 by Steve Anavi and Alexandre Prot, Qonto is a finance solution designed for SMEs and freelancers. Since our launch in July 2017, Qonto has made business financing easy for more than 350,000 companies.

Business owners save time thanks to Qonto’s streamlined account set-up, an intuitive day-to-day user experience with unlimited transaction history, accounting exports, and a practical expense management feature.

They stay in control while being able to give their teams more autonomy via real-time notifications and a user-rights management system.

They benefit from improved cash-flow visibility by means of smart dashboards, transaction auto-tagging, and cash-flow monitoring tools.

They also enjoy stellar customer support at a fair and transparent price.

Interested in joining a challenging and game-changing company? Consult our job offers!
