Exploring Scylla Disk Performance and Optimizations at Agoda

Agoda Engineering & Design
Mar 5, 2024

by Andrew Lynch

Introduction

During a recent failover event from AMS to ASH, we noticed degraded performance and increased latency for significant periods following the failover event. Preliminary analysis indicated large increases in disk activity during this period, as the Scylla in-memory cache is inactive, and many rows would result in disk reads. Interestingly, the issue appeared insignificant in the other direction: a failover from ASH to AMS.

Benchmarking the performance for a cold(ish) case scenario, we found a difference in throughput of more than 10x between ASH and AMS. Following up on differences between the environments, we found that the most significant difference between the machines in the respective data centers was that the disks in ASH were SATA, and the drives in AMS were NVMe.

NVMe Drives vs SATA

NVMe drives are now the industry standard, while SATA technology has largely stopped advancing. More than five years ago, NVMe drives already accounted for nearly 90% of the data center market, and for new purchases they are no more expensive than SATA SSDs; most new drives sold are NVMe.

An article from ScyllaDB states, “So as fast storage NVMe drives become commonplace in the industry, they practically relegate SATA SSDs to legacy status; they are becoming ‘the new HDDs’.”

Moving from a SATA SSD to an NVMe drive can often be a more significant gain than moving from a spinning HDD to a SATA SSD, especially for latency. The above numbers can be coupled with our real-world results from the Scylla calibration between the environments.

Scylla Calibration Results:

ASH

AMS

So here, we see roughly a 10x increase in total throughput and a 10x increase in the number of IOPS the disks can perform.

What Else can be Done?

We have initiated the process of acquiring NVMe drives and deploying them with servers capable of housing them. However, recognizing that this transition will take time, we’ve also critically reviewed our current issues.

To evaluate the status quo and the possible performance gains that can be made without changing the hardware, we constructed a new load test that ensures a cold read from the disk every time. This is worse than typical conditions, since in reality the cache fills up gradually, but it gives us a lower bound on the number of entities we can serve.
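As an illustration, a minimal cold-read load test might look like the following sketch, which uses ScyllaDB's BYPASS CACHE extension to keep the row cache out of the picture; the schema, column names, and contact points here are placeholders rather than our real setup.

# Sketch of a cold-read load test, assuming a simple key-value style schema.
# The table/column names (metis.feature_store, entity_id) and contact point
# are placeholders; BYPASS CACHE is used as one possible way to force reads
# to go to disk instead of the Scylla row cache.
import random
import time

from cassandra.cluster import Cluster

cluster = Cluster(["scylla-node-1"])          # hypothetical contact point
session = cluster.connect("metis")

# BYPASS CACHE tells Scylla to read from sstables and not populate the cache.
lookup = session.prepare(
    "SELECT * FROM feature_store WHERE entity_id = ? BYPASS CACHE"
)

N = 10_000                                    # number of lookups in this run
start = time.time()
for _ in range(N):
    entity_id = str(random.randrange(7_000_000_000))   # spread over ~7B partitions
    session.execute(lookup, [entity_id])
elapsed = time.time() - start
print(f"{N / elapsed:.0f} cold lookups/s")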

Load Test Results:

Here, we cap out at approximately 5000 entities per second when served entirely from disk.
The following graphs show the disk activity during this period:

We have saturated the SATA throughput at this point, though we still have IOPS headroom available for potential increases.

5K requests per second seems relatively low, and the disk bandwidth consumed is far more than should be required when each entity is only a few hundred bytes spread across a few keys, so we started an investigation to find out why.
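As a quick back-of-envelope check (the SATA ceiling and per-entity payload below are rough, illustrative figures), serving 5K entities per second should only need a few MB/s of disk bandwidth, yet at the roughly 70 KB average read size shown in the next chart we end up close to the SATA limit:

# Back-of-envelope: how much disk bandwidth should 5K lookups/s actually need?
# The SATA ceiling and per-entity payload are illustrative assumptions;
# the ~70 KB average read size comes from the chart discussed just below.
requests_per_s = 5_000
payload_bytes = 1_000          # a few hundred bytes to ~1 KB per entity (assumed)
observed_read_bytes = 70_000   # average bytes actually read from disk per lookup
sata_limit_bytes = 550e6       # rough SATA III sequential ceiling (assumed)

print(f"expected:  {requests_per_s * payload_bytes / 1e6:.0f} MB/s")        # ~5 MB/s
print(f"observed:  {requests_per_s * observed_read_bytes / 1e6:.0f} MB/s")  # ~350 MB/s
print(f"SATA cap: ~{sata_limit_bytes / 1e6:.0f} MB/s")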

The following shows the typical number of bytes per disk read:

The numbers are inconsistent, but we can see an average of roughly 70 KB per disk read, with spikes as high as 130 KB or even more.

Investigating the issue through tracing provided via Scylla, we can get an idea of the typical amount of disk activity performed per read.

1. ... mc-3044056-big-Index.db: finished bulk DMA read of size 67515 at offset 0, successfully read 67584 bytes
2. ... mc-3044056-big-Data.db: finished bulk DMA read of size 4138 at offset 13312, successfully read 4608 bytes
3. ... mc-3114662-big-Index.db: finished bulk DMA read of size 69161 at offset 0, successfully read 69632 bytes
4. ... mc-3114662-big-Data.db: finished bulk DMA read of size 2146 at offset 1536, successfully read 2560 bytes

Another example trace of a slightly more optimal case:

... me-48-big-Index.db: scheduling bulk DMA read of size 32768 at offset 69632 [shard 3]
... me-54-big-Data.db: scheduling bulk DMA read of size 5004 at offset 12288

Here, we can note two things:
1) We read approximately 70 KB from disk per lookup when we only needed 1 KB of data or less.
2) In some circumstances, we hit multiple index files for a single read.

The challenge was to reduce the number of index files touched and the amount of data read from each index during a lookup.

Investigations and Changes

1) Data Partitioning: The idea here was that, since we have a single keyspace with around 7 billion partitions in it, isolating certain feature sets of hot data into their own table would significantly improve performance, as less data would need to be scanned.

One thought was that cleaning up the old table might yield significant performance improvements due to the large amount of unused and now obsolete data in the table.

The possible impact of this is fairly easy to test without first performing a large-scale data cleanup and migration, which could be time-consuming and costly: simply create a brand-new table and insert a load-testing data set of significant size directly into it.
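A minimal sketch of this test might look like the following; the schema and types are placeholders, as the real feature_store layout is not shown here.

# Sketch of the isolation test: copy a load-test data set into a brand-new table.
# The schema is a guess at a key-value style layout; treat names and types as
# placeholders for the real feature_store table.
from cassandra.cluster import Cluster

cluster = Cluster(["scylla-node-1"])
session = cluster.connect("metis")

session.execute("""
    CREATE TABLE IF NOT EXISTS feature_store_loadtest (
        entity_id text PRIMARY KEY,
        features  blob
    )
""")

insert = session.prepare(
    "INSERT INTO feature_store_loadtest (entity_id, features) VALUES (?, ?)"
)

# Copy ~50K rows from the existing table to use as the load-test data set.
for row in session.execute("SELECT entity_id, features FROM feature_store LIMIT 50000"):
    session.execute(insert, [row.entity_id, row.features])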

We found no significant improvement after re-running the load test against an isolated table containing a copy of our load-test data set. Traces were run against the new table to see how much disk I/O was involved and why this is still an issue. Only 50K entities were inserted into the new table, yet the excessive disk reads can still be observed.

Snippets from the tracing are below:

/var/lib/scylla/data/metis/feature_store-b9475f400f6511eea4aff3faa8de9c1d/me-48-big-Index.db: scheduling bulk DMA read of size 32768 at offset [shard 3]
/var/lib/scylla/data/metis/feature_store-b9475f400f6511eea4aff3faa8de9c1d/me-48-big-Index.db: scheduling bulk DMA read of size 2749 at offset 102400 [shard 3]

Here, we can see excessive disk reads even with a minimal amount of data in the new table: clearly, the issue isn't confined purely to very large datasets and is reproducible against much smaller tables. This lends some credence to the theory that the issue is unrelated to the total number of keys within the table.

Reducing the table size and cleaning up unused data will save disk space, make maintenance faster, and reduce the additional memory required for the summary tables (see section below). Still, it is unlikely to impact typical disk throughput directly.

2) Compaction Strategy Change: Currently, on production, we use size-tiered compaction: one of the disadvantages of this strategy is the potential for multiple SSTables that need to be read when searching for data.
We recreated the same set of data in the new table but with the compaction strategy set to leveled:

Leveled compaction strategy (LCS): the system uses small, fixed-size (by default 160 MB) SSTables divided into different levels, lowering both read and space amplification.
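Switching a table to leveled compaction is a one-line schema change. A minimal sketch (table name illustrative, with 160 MB being the default target size mentioned above, shown only to make the knob explicit) might look like this:

# Minimal sketch of moving the test table to leveled compaction.
# The table name is a placeholder; 160 MB is the default target sstable size.
from cassandra.cluster import Cluster

cluster = Cluster(["scylla-node-1"])
session = cluster.connect("metis")

session.execute("""
    ALTER TABLE feature_store_loadtest
    WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'sstable_size_in_mb': 160
    }
""")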

Leveled Compaction Benefits

The leveled compaction strategy offers several key advantages:

  • Efficient SSTable Reads: Despite the presence of many small SSTables, searching for a key is efficient, because the SSTables at each level have disjoint key ranges. At most one SSTable per level needs to be searched, and typically only one SSTable needs to be read in total.

Results with leveled compaction strategy:

We gained a roughly 50% improvement in throughput from this change, which is roughly in line with expectations, as we reduced the number of index reads per lookup from around 1.5 or 1.6 to closer to 1.

However, this result is still significantly lower than what we would like, so additional investigation of optimizations is still required.

3) Tuning sstable_summary_ratio
This involves a parameter that is not widely documented, found only in Scylla’s source code, and sporadically mentioned in GitHub issues or discussion groups.

An example can be seen in a discussion thread where a user delves into issues of large disk reads with Scylla: ScyllaDB Users Group Discussion. A Scylla developer explains that:

“There is a server config option called “sstable_summary_ratio” (in scylla.yaml), or --sstable-summary-ratio (cmdline param), which you can use to make summary more dense. It’s a ratio of summary size to data file size. By default, it’s 0.0005, which means 1 byte of summary per 2KB of data file.”

Looking at this, we can see that Scylla sizes the summaries as a fixed proportion of the data actually written.
Examining the files on disk for our current tables paints the full picture:

Before:

After:

A few observations can be made here:

1) Our index files are bigger than our data files!

2) The summary indexes used by Scylla to navigate into the index file are tiny, so Scylla has very little information about which ranges of the index file it needs to scan to find the required data.

3) The default values do not align well with our data model for the feature store. Scylla is built around the assumption of being a very wide column store, while we (specifically, our clients) use it much more like a key-value store. Scylla's defaults and access patterns are suboptimal for this use case, and performance degrades significantly under these conditions.

So, the sstable_summary_ratio was increased significantly, and then some example traces and a fresh set of load tests were run.

Example traces after:

Generally, our adjustments mean each lookup now requires either a 1 KB or 2 KB disk read in the index (rounded up to the nearest multiple of 1024 bytes), followed by one or two reads from the data file. This brings us to a best case of about 2 KB and an average of about 3 KB in total disk reads per lookup. Compared to the initial readings, which varied from 60 KB to over 130 KB, we have reduced disk read requirements by a factor of 30–50x.

Before:

After:

Even under a full load test, we can see that the disk throughput barely moves during the run, and most disk reads come from the regular production traffic.

Bytes per read confirms the reduction in data:

We can observe a massive reduction in average disk read size during this interval. However, the number in this chart is still heavily inflated by the regular production traffic, which is still performing 70 KB+ reads per operation.

To fully understand what is happening here, we need to understand Scylla's indexing model.

Scylla has a multi-level index strategy:

The first lookup occurs against the index summary files, controlled by sstable_summary_ratio. This setup functions similarly to a traditional database index: the summaries, stored in memory, provide a sparse sample of the keys/rows.

These summaries guide the system to specific disk offsets within the index for efficient searching (Read more at Scylla SSTable Summary Documentation). Adjusting this ratio significantly reduces the disk I/O needed for lookups in Scylla, decreasing the requirement from roughly 70 KB per lookup to just 2-3 KB.
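To make the lookup path concrete, here is a small toy model of the summary-to-index-to-data flow; this is illustrative Python, not Scylla's actual implementation, and the entry sizes and sampling intervals are invented purely to mirror the before/after numbers above.

# Toy model of the multi-level lookup described above (not Scylla's code).
# The summary is an in-memory sample of the on-disk index; the fewer summary
# entries there are (smaller sstable_summary_ratio), the wider the slice of
# Index.db that must be read from disk before the data offset is known.
import bisect

# index: sorted (partition_key, data_offset) pairs, conceptually living in Index.db
index = [(f"key{i:07d}", i * 1024) for i in range(100_000)]

def build_summary(index, interval):
    # Sample every `interval`-th index entry; this is what the ratio effectively controls.
    return [(index[i][0], i) for i in range(0, len(index), interval)]

def lookup(key, index, summary, entry_size=64):   # entry_size is illustrative
    # 1) In memory: find the summary slot covering the key.
    slot = bisect.bisect_right([k for k, _ in summary], key) - 1
    lo = summary[slot][1]
    hi = summary[slot + 1][1] if slot + 1 < len(summary) else len(index)

    # 2) One disk read: the Index.db slice between two summary entries.
    index_bytes_read = (hi - lo) * entry_size
    pos = bisect.bisect_left(index, (key, 0), lo, hi)   # search within the slice
    data_offset = index[pos][1]

    # 3) One more disk read: the row itself from Data.db at data_offset.
    return data_offset, index_bytes_read

# A sparse summary (like the default ratio on our small partitions) forces a large
# index read; a denser one (higher sstable_summary_ratio) shrinks it dramatically.
print(lookup("key0012345", index, build_summary(index, 1024)))  # ~64 KB index slice
print(lookup("key0012345", index, build_summary(index, 16)))    # ~1 KB index slice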

This has to be configured globally in scylla.yaml and requires a restart, so the ratio cannot be altered dynamically or per table. Additionally, it only applies to newly written data, so you will not see significant performance changes immediately without a re-ingest.

This method is based on a probabilistic approach, so the outcomes are not entirely predictable. Due to our small amount of data per partition, it was necessary to increase the ratio globally by quite a lot to optimize Scylla's read performance; otherwise, it would still have to scan very large disk ranges to locate the desired partitions.
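As a rough back-of-envelope, the default ratio already implies index reads of tens of KB per lookup for data shaped like ours. The summary entry size and the index-to-data size ratio below are assumptions for illustration, and the increased ratio value shown is just an example, not the exact value we deployed:

# Rough arithmetic on what sstable_summary_ratio implies for data shaped like ours.
# Only the 0.0005 default comes from the quote above; everything else is assumed.
summary_entry_bytes = 32       # assumed average size of one in-memory summary entry
index_to_data = 1.0            # our Index.db files are roughly as large as Data.db (see above)

for ratio in (0.0005, 0.01):   # default vs an illustrative "increased significantly" value
    # summary bytes per data byte = ratio, so one summary entry covers roughly
    # entry_size / ratio bytes of data file, and about as much of the index file.
    index_bytes_per_lookup = summary_entry_bytes / ratio * index_to_data
    print(f"ratio={ratio}: ~{index_bytes_per_lookup / 1024:.0f} KB of Index.db read per lookup")
# ratio=0.0005 gives ~62 KB, in the ballpark of the ~70 KB reads we traced;
# ratio=0.01 gives ~3 KB, in line with the post-change traces.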

Throughput Numbers Obtained

Despite the significant reduction in the size of disk reads performed, we should not expect to experience a full 25–50x increase in performance due to these changes.

Observing the IOPS for the disk:

After:

We are now capped by the maximum number of operations the disk can perform per second: we have moved from being bandwidth-bound to being bound by random access, which delivers far less throughput than the disk can sustain for large sequential reads.
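A quick sanity check with illustrative numbers (the IOPS budget below is a typical SATA SSD figure, not a measured spec for our drives) shows why the ceiling now sits in the tens of thousands of lookups per second rather than hundreds of thousands:

# Why the gain is ~4x rather than 25-50x: the bottleneck moved from bandwidth to IOPS.
# Both figures below are illustrative assumptions, not measured specs for our drives.
sata_ssd_read_iops = 90_000    # assumed random-read IOPS budget for a SATA SSD
ios_per_lookup = 2.5           # one small index read plus one or two data reads

print(f"IOPS-bound ceiling: ~{sata_ssd_read_iops / ios_per_lookup:.0f} lookups/s")   # ~36,000
# versus the old bandwidth-bound ceiling at ~70 KB per lookup:
print(f"bandwidth-bound ceiling: ~{550e6 / 70_000:.0f} lookups/s")                   # ~7,900

Tens of thousands of lookups per second is a few times the old bandwidth-bound figure, which lines up with the roughly 4x improvement we measured rather than the full reduction in bytes read.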

Results with Summary Tuning (and Compaction)

So, the combination of leveled compaction strategy and summary tuning resulted in a 4x improvement in total throughput.

NVMe Drives:
At its core, operating Scylla without NVMe drastically limits the potential performance, especially for random access disk reads.

The changes we have made so far are significant but cannot compare to the raw benefits of NVMe drives, which dramatically improve throughput, bandwidth, IOPS, and latency.

Installing the NVMe drives should give us another 10x or so improvement in performance, raising us to 200K requests per second, which should be more than adequate for our immediate future requirements.

Performance Update:
Load tests were rerun on the AMS environment and yielded an improvement greater than 10x, up to 15x for some scenarios, for a total of 300K requests per second or higher. At this point, we are hitting other bottlenecks, so the total throughput possible in Scylla might be higher.

Summary of Results

Cold disk serving capacity:

* These results may somewhat understate performance, as greater throughput was sometimes observed when there was no external load from production.

** At this point, we also appear to be hitting non-disk-I/O bottlenecks; further investigation into the maximum possible results is needed.

Other Approaches Attempted

We considered bulk-loading strategies as a means to prime the cache efficiently. Since normal serving either performs large disk reads that discard most of the data (before the change) or small random disk reads (after the change), the idea was to exploit the full sequential bandwidth of the disks to read the data in bulk and prime the Scylla cache.

By doing bulk reads of rows into memory, we could re-populate the cache at a rate of 10 to 20 times our current throughput. However, this has significant disadvantages for general-case serving throughput and latency.

Because of how Scylla's caching model works, this was evaluated and decided against in practice. To make this approach effective, we would not only need each bucket to hold a substantial number of entities, but we would also need to select all values without additional WHERE clauses, as Scylla will not cache the rest of the partition otherwise. This is quite wasteful and introduces a lot of additional overhead for the normal serving path, so the drawbacks outweigh the benefits.
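For illustration, the two read shapes under discussion might look like the following; this is a minimal sketch assuming a hypothetical bucketed schema, since the real feature_store layout is not shown here.

# Sketch of the two read shapes discussed above, assuming a bucketed key-value
# schema (all names are placeholders).
from cassandra.cluster import Cluster

cluster = Cluster(["scylla-node-1"])
session = cluster.connect("metis")

# Bulk prime: read the whole bucket so Scylla can cache the entire partition.
prime_bucket = session.prepare(
    "SELECT * FROM feature_store_bucketed WHERE bucket_id = ?"
)

# Normal serving read: restricting on the clustering key returns less data,
# but the rest of the partition is not cached as a side effect.
single_entity = session.prepare(
    "SELECT * FROM feature_store_bucketed WHERE bucket_id = ? AND entity_id = ?"
)

session.execute(prime_bucket, ["bucket-0042"])               # cache-priming path
session.execute(single_entity, ["bucket-0042", "entity-1"])  # regular lookup path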
