Performance degradation in production — what’s to blame?

Genadi Tsvik
skai engineering blog
8 min read · Jul 1, 2018

Part 1: Disk latency and MySQL failures

As a performance engineer, I’m responsible for the stability and performance of our production environment.
In my 12 years of experience, I’ve seen a wide range of degradation issues coming from every layer of the system. In many cases, these issues don’t necessarily originate in the application code.

In this post I’ll share our troubleshooting methodology, 3 common degradation use cases, and tips for solving them.

Troubleshooting methodology

At Kenshoo, we use the pyramid-shaped troubleshooting methodology shown in figure 1 to investigate production incidents and figure out which part of the infrastructure is to blame.

Figure 1: Kenshoo troubleshooting methodology

My investigation will go either up or down the pyramid, depending on how close we are to release deployment:

  • If we’re not too close to a release deployment, I investigate the infrastructure starting with storage, the base of the pyramid, and moving up towards the application. I use this approach as a rule of thumb because only 20–30% of degradations are the result of code changes or of legacy code that isn’t compatible with our system’s continuous expansion.
  • If the degradation starts close to a release deployment, I investigate in reverse, starting with the application and moving down the pyramid towards storage.

Use cases

In this blog series, I’ll review the 6 most common production issues we encounter in various layers of the Kenshoo infrastructure, and share best practices for efficient investigation.

Use case 1: High disk latency (storage layer)

At Kenshoo, we use many storage systems such as HP 3Par Storage, Kaminario, and Ceph. All storage systems have the same bottlenecks: IOPS, latency and throughput. Let’s take a look at latency and IOPS.

Application performance is directly linked to storage latency, which can increase due to:

  • Maintenance
  • Running cross-production ALTER TABLE statements in the DB, especially those generating large I/Os
  • Storage data backup
  • Bad blocks on SSD disks
  • Kernel issues (operating system)
  • Physical component failures

Issue

Our application has many instances. At one point, our monitoring system (we use Hosted Graphite for this) alerted us about performance degradation in many processes across production. At the same time we received high disk latency alerts.

Figure 2: High SVCTM in Grafana

How to identify it

To verify that storage latency was high and was the root cause of the degradation, we connected to the specific machine via SSH and used iostat, a very useful Linux utility for evaluating disk I/O activity. To measure the disk latency we ran “iostat -x 2”.

Figure 3: Disk I/O activities in iostat

svctm measures how long the storage took to respond once a command leaves the I/O scheduler and the I/O is no longer controlled by the kernel. In my experience, svctm on SSD storage should be less than ~5ms, and short peaks of up to ~10ms are acceptable. As you can see in figure 3, the svctm during our production issue was about 70ms, which indicates poor storage performance.

await measures how long a given I/O spends in the entire I/O scheduler. Different people and vendors have different definitions of a good await time; I’d say that, like svctm, the average await should be ~5ms, with short peaks of up to ~30ms acceptable. Figure 3 shows that the await during our production issue was about 50–90ms, again indicating poor storage performance.
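For reference, this is roughly what it looks like at the command line. The numbers below are invented to illustrate the pattern, and the exact column layout depends on your sysstat version, but the shape matches what we saw: latency in the tens of milliseconds while the device sits near 100% utilization.

    $ iostat -x 2
    Device:  rrqm/s wrqm/s   r/s   w/s  rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
    sdb        0.00   2.00  6.00  9.00 240.00 1080.00   176.00     1.28  84.70   78.20   89.10  66.40  99.60

On a healthy SSD-backed volume you’d expect both svctm and await to stay in the low single-digit milliseconds.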

After further investigation we realized that the issue had started on a specific storage cluster, so only servers in that cluster were affected. We ran a “show full processlist” SQL command on those servers and found that an ALTER TABLE statement, part of an online schema change (OSC) initiated by one of our teams, was running on each server.
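If you’d rather not scan the full processlist by eye, a filtered query against information_schema gives the same answer (the column selection here is just one way to slice it):

    mysql> SELECT id, user, time, state, LEFT(info, 80) AS query
        ->   FROM information_schema.PROCESSLIST
        ->  WHERE info LIKE 'ALTER TABLE%'
        ->  ORDER BY time DESC;

A long-running row with an ALTER TABLE in the query column, often in a state like “copy to tmp table” for a plain table-copy ALTER, is a strong hint that a schema change is hammering the storage.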

How to solve it

If the issue is really down to storage, it might be caused by an insufficient number of drives that service the I/Os, contention in the controller, or inadequate bandwidth in the array’s front-end or back-end ports.

There are many ways to boost SSD storage performance, such as:

  • Migration between storage arrays in different clusters
  • Adding memory to controller cache or adding drives
  • Disabling the dedup and compression mechanism to reduce CPU usage
  • Adding additional storage arrays
  • Tuning the environment parameters according to the vendor’s best practices

The issue might also originate in the Virtualization layer, in which case there’s a lot to inspect. I’ll describe this in detail in the next blog in this series.

Tips

  • If your physical storage is slow, you’ll probably see a high svctm and a high await. If the kernel’s I/O is slow, you’ll see a high await and a low svctm.
  • Create the relevant storage performance metrics on your monitoring system. We use collectd to send stats to Hosted Graphite, and Grafana to display them; a minimal shell sketch for pushing these numbers into Graphite follows this tips list.
  • Create svctm and await alerts with clear thresholds and make sure all the relevant teams receive them. At Kenshoo, the storage, SRE, and performance teams all get these alerts.
  • Educate all the relevant dev and production teams on these metrics and the importance of monitoring them. Teams are often unaware of the impact a small change in a system metric can have on the entire production environment, and might dismiss it as irrelevant. We learned this the hard way when the svctm went up from 2ms to 8ms during Kenshoo’s most sensitive sales season, due to an online maintenance process. This spike was overlooked by our teams until it caused massive degradation in several processes.
  • Create the following views to better monitor your storage performance metrics:
    All servers in a specific storage cluster
    All clusters
    All servers in all storage systems
    Top 10 latency servers in all clusters
  • Use these helpful metrics to investigate storage degradation: I/O size, read/write latency, IOPS, and throughput.
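If collectd isn’t wired up yet, even a small cron-driven script can get svctm and await into Graphite while you set up something more permanent. The sketch below is a minimal example rather than our production setup: the device, host name, metric paths, and Graphite endpoint are placeholders, the awk field numbers ($10 for await, $13 for svctm) depend on your sysstat version, and “nc -q0” assumes a netcat build that supports -q.

    # Take the second iostat report (the first one averages since boot)
    # and push two gauges to Graphite's plaintext port (2003).
    ts=$(date +%s)
    iostat -dx sda 5 2 | awk -v ts="$ts" '
      /^sda/ { await = $10; svctm = $13 }   # keep the most recent sample
      END {
        printf "servers.db01.sda.await %s %d\n", await, ts
        printf "servers.db01.sda.svctm %s %d\n", svctm, ts
      }' | nc -q0 graphite.example.com 2003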

Use case 2: MySQL index cardinality (database layer)

Cardinality refers to the uniqueness of data values contained in a particular column (attribute) of a database table. Low cardinality means fewer unique values in a column. SQL databases use cardinality to help determine the optimal query plan for a given query.
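As a rough illustration (the table, columns, and numbers are hypothetical), the optimizer leans on these per-index cardinality estimates when choosing between candidate indexes:

    mysql> SHOW INDEX FROM mydb.orders;
    -- idx_status   on (status)      -> Cardinality:         4
    -- idx_customer on (customer_id) -> Cardinality: 2,318,411

    mysql> EXPLAIN SELECT * FROM mydb.orders
        ->  WHERE status = 'PENDING' AND customer_id = 42;
    -- With accurate statistics the optimizer picks idx_customer, which narrows the
    -- search to a handful of rows; with bad statistics it may pick idx_status and
    -- read a large slice of the table instead.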

Issue

We were alerted about a huge degradation in one of our processes. After a deep investigation (using the JVisualVM profiling tool) we discovered that an SQL query was the source of the problem. The query was running more slowly than before even though it hadn’t been changed in years, so we started drilling down into the table configuration.

We discovered a bug in the MySQL InnoDB engine that caused the statistics on this table to be incorrect, which in turn produced bad execution plans and significant degradation in our application queries.

How to identify it

Run a “SHOW INDEX FROM mydb.mytable;” command and check your index cardinality. In our case the degradation was due to a key in column X whose cardinality should have been in the millions, but was dramatically lower at 27 (see figure 4).

Figure 4: Index cardinality
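A quick way to confirm the statistics are off is to compare the cardinality MySQL reports with the real number of distinct values (the second query does a full scan, so run it off-peak). The output below is illustrative:

    mysql> SHOW INDEX FROM mydb.mytable WHERE Column_name = 'X';
    -- Cardinality: 27          <-- suspiciously low for a near-unique column

    mysql> SELECT COUNT(DISTINCT X) FROM mydb.mytable;
    -- 8,412,903                <-- what the cardinality should be close to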

How to solve it

Our solution was to increase the number of pages sampled for this table (rather than relying on the default) and then run ANALYZE to make the statistics more accurate. The caveat is that ANALYZE then takes longer to run, and when MySQL runs it in the background it creates a high load on the database.

We created our own analysis script with more sampled pages per table (disabling the internal engine’s analysis), and scheduled it to run once a week on maintenance day. The script monitors any LOCKs related to ANALYZE and kills ANALYZE commands if necessary.
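For InnoDB on MySQL 5.6+ with persistent statistics, the per-table knobs look roughly like this (the sample-page count is an example value; as noted above, our real script also watches for ANALYZE-related locks):

    ALTER TABLE mydb.mytable
      STATS_PERSISTENT   = 1,    -- keep statistics across restarts
      STATS_AUTO_RECALC  = 0,    -- stop the engine's automatic background recalculation
      STATS_SAMPLE_PAGES = 256;  -- sample more pages than the default (20) for accuracy

    -- Then refresh the statistics during the weekly maintenance window:
    ANALYZE TABLE mydb.mytable;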

Tips

  • Create metrics and alerts on the active time of user queries running in the DB, and investigate each big increase.
  • Create metrics and alerts on queries per second (QPS) and investigate each decrease. A couple of starting points are sketched below.
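Two simple starting points, assuming you don’t already export these numbers (metric names, thresholds, and the exact filters are up to you):

    # Queries per second: print the per-interval delta of the Questions counter
    mysqladmin --relative --sleep=10 extended-status | grep -w Questions

    -- Active user queries and the longest-running one, suitable for gauges and alerts
    mysql> SELECT COUNT(*) AS active_queries, COALESCE(MAX(time), 0) AS longest_seconds
        ->   FROM information_schema.PROCESSLIST
        ->  WHERE command <> 'Sleep' AND user <> 'system user';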

Use case 3: Enabling BinLog on the MySQL TokuDB engine (database layer)

The binary log contains events that describe database changes such as table creation operations or table data changes. The binary log serves two important purposes:

  • For replication, the binary log on a master replication server provides a record of the data changes to be sent to slave servers.
  • Certain data recovery operations require the use of the binary log.
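For context, this is how you would typically check whether the binary log is enabled and where it points. The file path and server id below are placeholders, and on the MySQL versions we were running, turning log_bin on or off requires a restart.

    mysql> SHOW VARIABLES LIKE 'log_bin';   -- ON / OFF
    mysql> SHOW MASTER STATUS;              -- current binlog file and position

    # my.cnf
    [mysqld]
    server-id     = 1
    log-bin       = /var/lib/mysql/mysql-bin
    binlog_format = ROW                     # or STATEMENT / MIXED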

Issue

One of our biggest customers wanted to increase their activity on our platform. This meant we needed to support a higher throughput.

While investigating which code changes would enable this, we detected a large performance degradation in one of our central, critical processes on 5% of our servers. We realized the degradation had been going on for about a month, starting when we enabled the binlog on some of our production servers to set up MySQL replication. Processes issuing many short concurrent updates to the DB showed a dramatic drop in performance.

How to identify it

Use your metrics and dashboards! If you aren’t monitoring query throughput yet, use the tips below to get started without delay. In our case, the metrics showed no degradation even though data wasn’t being processed on time, so we compared the server logs to the metrics and found that the metrics were wrong. Analyzing the logs, we saw that the degradation had started the same day we enabled the binlog. This was the most critical long-running degradation we’ve ever had at Kenshoo.

Through some Googling we discovered that TokuDB had a critical bug that causes a large degradation on short update queries when the binlog is enabled.

How to solve it

We decided to optimize how we update the DB in our critical process, replacing our batch updates with a bulk load based on LOAD DATA INFILE.
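A minimal sketch of the pattern (the file, table, and column names are hypothetical, and the real process has more columns and error handling): instead of firing thousands of short UPDATE statements, the application writes the pending changes to a tab-separated file, bulk-loads it into a staging table, and applies everything with one join.

    LOAD DATA LOCAL INFILE '/tmp/pending_updates.tsv'
      INTO TABLE staging_updates
      FIELDS TERMINATED BY '\t'
      (entity_id, new_value);

    UPDATE target_table t
      JOIN staging_updates s ON s.entity_id = t.id
       SET t.value = s.new_value;

    TRUNCATE TABLE staging_updates;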

Figure 5: Performance improved upon infile deployment

Tips

  • Use these metrics to identify degradation: process rate, process run time, and process query run time.
  • Compare your metrics to the actual processing time you see in logs or by other methods.
  • Never change DB parameters without trying them in a test environment first. We enabled the binlog directly in production because we trusted documentation that promised at most 10% degradation. That number proved wrong because it didn’t account for all the components in our environment.
  • Use mysqlslap, MySQL’s load-emulation tool, to benchmark each of your parameter changes; an example invocation follows this tips list.
  • If you plan to use a binlog, it might be best to use a different MySQL engine instead of TokuDB. InnoDB works great for us.
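For example, something along these lines lets you run the same update workload once with the binlog off and once with it on, and compare the reported average run times (the host, schema, and query are placeholders, and the accounts table is assumed to already exist):

    mysqlslap --host=db-test --user=bench -p \
      --create-schema=benchdb \
      --concurrency=50 --iterations=10 --number-of-queries=10000 \
      --query="UPDATE accounts SET balance = balance + 1 WHERE id = 12345"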

That’s it for now! I hope you find these tips helpful. Stay tuned for the next blog in the series about virtualization issues and how to solve them.


As Kenshoo's Performance Team Lead, I work with various teams to optimize the performance, scale, stability, and availability of our systems.