Start monitoring the cause, not the consequences

How a few queries were compromising the whole operation of a business.

This situation happened to one of our customers and it shows how important it is to know exactly what is running in your database servers and to keep track of that.

“MeusPedidos is a SaaS application that handle the processing of thousand of orders per second. Those Orders made by Industries, Sales Representative and Distributors can be done by users in their Web App, users of Mobile Apps, third party integrations of a public API and finally by some huge XLS Sheets imported directly in their Web App.”

Overview

Initially, the RDS was running on a “m3.2xlarge” instance. After executing some tests, MeusPedidos’ team found some limitations in the “m3.2xlarge” instance with Multi-AZ (you can see more info about that here). They got in touch with the AWS Support and were told to upgrade to a “m4.2xlarge”.

After the migration, there was a reduction in the frequency and intensity with which instabilities occurred in the system but it was still possible to identify Write Throughput and High DiskQueueDepth spikes with CloudWatch’s monitoring. As you can see in the picture below:

Write throughput and high DiskQueueDepth spikes (from CloudWatch)

Monitoring with Nazar

Then we started monitoring their application with Nazar to identify the offensive queries responsible for the spikes. After that we could then identify some infrequent queries with a high execution time.

By crossing the time these queries were run and the peak times presented in CloudWatch, we identified the relationship between them.

Write throughput and DiskQueueDepth spikes (from CloudWatch)

Cause

When these queries were executed, a relatively large amount of data was manipulated, which generated the need for disk write operations specially for the “Creating sort index” and “Sending data” stages in the execution plan.

Example:

Duration 3.58268150
Creating sort index 2.843240
Sending data 0.721862

Solution

With Nazar it was very easy to identify the 3 queries that were presenting this characteristics and they were optimized by taking the following actions:

  • Index creation;
  • SQL rewritten;
  • Reducing the number of columns in the SELECT list;

Results

After the implementation (04/18/2018) of the optimizations in the 3 queries identified, there was no more Write Throughput and High DiskQueueDepth spikes.

No spikes after the optimizations (04/18/2018)

Conclusion

The secret to anticipate that daily performance problems will scale and become critical issues is to observe them continuously and to monitor the cause, not the consequences. Even after upgrading to a higher instance the queries continued to be executed and to compromise the application’s performance.

"The partnership with Nazar was really helpful to tackle down write throughput problems that were haunting us for a very long time."
Israel Fonseca — Senior Software Developer

About MeusPedidos:

Software development company focused on sales management for industries and sales representatives.

About Nazar:

NAZAR is an offensive query monitoring SaaS. It helps teams to solve performance issues, democratizing the access to the BAD SQL commands and enabling collaboration in order to fix the issues faster. Its a productivity tool to save time and money. NAZAR is an AWS Advanced Technology Partner.