Improving Amazon Redshift Performance: Our Data Warehouse Story
It is challenging to manage an optimized Amazon Redshift cluster. Scaling the cluster to meet your growing storage and computation needs while maintaining optimized performance is a downright daunting task. Bad queries, poor distribution style and sort key selections, stale statistics and row cruft, insufficient resources, and concurrency and queue configuration issues all affect performance.
In this post we will share our history with Amazon Redshift and highlight some of the pitfalls and lessons learned along the way. We will also discuss some of the tools we use to ensure that we’re getting the most out of our data warehouse.
Beginnings: From MySQL to Redshift
We launched our first Redshift cluster in May 2014. It was modest: only two nodes (dw1.xlarge). Prior to Redshift, all of our analytics workflows were processed in our production MySQL databases. This was a problem for numerous reasons, the most concerning being that a bad query could negatively impact site performance or, worst-case, bring the site down. Migrating the analytics workflows from MySQL to Redshift provided us with a scalable data warehouse that we could use for data ingestion and processing without disrupting our production systems. We also started using it to back our Chartio dashboards. This was a dynamic time at Udemy. Growth was accelerating, ETL pipelines and dashboards were proliferating, and our internal users of Redshift numbered in the hundreds.
Organizing Data for Performance in Redshift
We learned many valuable lessons in our early experience working with Redshift. We learned that storing your data efficiently is critical for performance. Efficient data storage in Redshift allows for faster scanning. It also results in more efficient IO, which is paramount when working with large datasets.
Let’s take a closer look at the internals of Redshift’s storage and retrieval mechanisms. In our experience, taking the time up front to develop an intuitive understanding of Redshift’s underpinnings will pay dividends when deciding how best to store and process your data.
To understand the performance characteristics of Redshift, it’s important to understand the core architecture of the system. Unlike traditional relational databases, which store data in rows, Redshift is a columnar store; it stores data in columns. Whereas relational databases are designed for storing and retrieving smaller datasets, typically in a transactional context, columnar stores like Redshift are designed to support operations on larger datasets where aggregations can be used to derive statistical insights from the data.
The benefit of having data stored in columns is two-fold. First, columnar stores provide better compression than row-based stores. Since all columns are stored as a single data type, we can use type-specific algorithms for increasing compression ratios. Similarly, additional optimizations such as tokenization can be applied to empty or repeating data to reduce the number of bytes stored. This is particularly effective when the data in the column is sorted. Additionally, columnar stores provide the ability to retrieve only those columns we’re seeking. The increased efficiency in both storing and retrieving our data is ideal for the larger datasets common to analytics processing.
So, now we have a high-level understanding of how Redshift uses columnar storage to more efficiently store and retrieve data. But how does Redshift find your data? It’s important to understand that Redshift doesn’t support indices like traditional relational databases. Instead, fast query performance relies on parallelism and efficient data storage. These two facets of performance are managed in Redshift with distribution and sort keys.
Let’s discuss distribution first. The concept is simple: for optimal performance you must ensure your queries are executing evenly across all nodes in the cluster. Effectively, we’re optimizing for parallelism. Redshift provides three different “styles” for distributing your data: EVEN, ALL, and KEY. EVEN distribution leverages a round-robin algorithm for distributing your data evenly. ALL distribution copies the entirety of the table to every node in the cluster. If you don’t explicitly select a style in your table schema, Redshift will default to ALL or EVEN, depending on the size of the data. KEY distribution uses a hash function applied to the key for determining which node to write the data.
Here are some tips when considering KEY distribution style.
- Use KEY distribution style when joining tables with the same distribution keys (dist keys). Performance is optimized as the co-located data avoids unnecessary shuffling when merging the datasets.
- Choose your dist key wisely. If your dist key is inherently skewed, your data will be stored as such. This will reduce parallelism and can result in your queries grinding to a halt.
- Be mindful to use dist keys with high cardinality, and avoid keys with potentially empty or null values, as they can cause skew.
- Consider the source of your data. Bot traffic repeatedly writing the same dist-key value to a table will invariably create data skew.
Next, we need to sort. Optimized compression, parallelism, and efficient data IO are central to Redshift performance, but all are for naught if your data isn’t sorted. We briefly touched on sort and its impact on compression above, but let’s take a closer look at the importance of sorted data. Redshift stores data in blocks. These blocks are composed of data sorted by the key(s) configured in the table schema and are written to disk contiguously. Redshift also provides a secondary data structure called a zone map that maps ranges of sort-key values to their physical location on the cluster. If the data on the cluster is sorted, Redshift can leverage the zone maps to quickly and efficiently locate your data. If the sort is corrupted, your queries result in degraded performance due to inefficient block scans.
Redshift provides two types of keys to help sort your data: compound and interleaved. Compound sort keys are composed of one or more columns and are the most commonly used type in our cluster. If you require multiple columns in your key, make sure to list them in ascending order based on cardinality. If your primary key already has high cardinality, subsequent keys will have little impact on query performance. Similarly, compound sorting provides little benefit when primary sort keys aren’t included in the filter. Interleaved keys are provided to help with the limitations of compound keys. They are designed to weigh each column in the key evenly, allowing improved performance regardless of which columns in the key you’re filtering. Our advice is to proceed with caution. Query performance degrades significantly as you filter fewer of the columns in your key. You may be doing yourself a disservice by not using a compound key, especially if the majority of your queries filter only on the prefix of the sort key. If you’re not sure which key type is best, test both.
Maintaining Table Hygiene
Okay, now your data is stored efficiently. It’s sorted, compressed, and distributed evenly across all nodes in the cluster. How are you going to keep it that way? If you don’t routinely service your tables, entropy will insidiously sneak in and render your cluster useless. How, you might ask. Let’s look at some examples.
Due to the nature of our business, we perform a lot of updates on our data. In Redshift, updates are performed by a combination of INSERT and DELETE statements. The DELETE statements don’t actually delete the data but instead mark it for future deletion. These ghost rows aren’t removed until a vacuum delete-only operation is run (automated delete-only vacuuming coming soon). This ghost data compromises your query performance by scanning rows that will never be returned in your results. It also wastes disk space. Routinely removing this cruft from your cluster is a good first step toward maintaining proper table hygiene and improved performance.
Unless certain conditions are met (see Loading Your Data in Sort Key Order), sort order degrades any time a DML operation is executed on the table. Maintaining sorted data is not trivial, but Redshift provides you with vacuum commands for this. Vacuum operations are expensive, increasingly so as you use more exotic keys. Making matters worse, Redshift doesn’t support running parallel vacuum operations. In short, the fewer columns you use in your key, the easier it is to maintain the sort. For interleaved keys, keep in mind that the performance benefits increase with the size of the table, but the associated vacuum operations for these keys are extremely expensive and near-impossible to execute on very large tables.
The difference in performance between optimally and poorly maintained tables is stark. If you haven’t already, focus your efforts on ensuring this hygiene is properly maintained. I’ve spoken with Redshift administrators who find that weekly vacuuming is enough. To get the most out of our cluster, we do daily vacuuming and analyzing of our tables. We leverage scripts from AWS Labs to assist with this.
Concurrency, Our Last Hurdle
Even with optimal table hygiene you can still suffer from poor performance. In early 2017, we started to see an uptick in timeouts and slow-loading dashboards. Our table hygiene was pristine, so what gives? We knew the data growth was considerable. We were also processing over 500 ETL jobs per day. In March we decided to expand the cluster to 16 nodes. After the expansion, the expected bump in performance never materialized.
We identified concurrency as the issue. We noticed that Chartio was issuing 100 queries at a time. This was their default setting. We immediately lowered this to 25 queries and that helped but not nearly enough. We discussed the issue with the Redshift team. Their only recommendation was to use Spectrum. We also tried leveraging query monitoring rules to no avail. We knew we had to optimize our queue configuration, we just didn’t know how. It was clear to us that we needed better instrumentation of our data platform — something that could give us visibility across our ETL jobs, Redshift cluster and dashboards, ideally in one place. That’s when we came across intermix.io, a product that provides performance insights for data.
Providing Better Tooling for our Data Team
We started using intermix.io in May of 2017. intermix.io is “a real-time analytics platform that collects metrics from your data infrastructure and transforms them into actionable insights about your data pipelines, apps, and users who touch your data.” After instrumenting our cluster, we determined it was under-resourced, regardless of the known concurrency bottleneck. We immediately added six nodes, increasing the cluster size to 22. This helped but queries were still spending considerable time in queue. Concurrency issues with Redshift are difficult. You can isolate resources (memory, concurrency) on a per-queue basis, but in the end it’s a zero sum game. Any resources allocated to one queue negate them from being available to the others. You need a strategy that optimizes throughput and ideally allows you to prioritize queries. Thankfully, we had intermix to help us understand the strategy and techniques needed to achieve both. Here’s what we learned:
How to Solve Redshift Concurrency Bottlenecks
At some point, you will most likely need to adjust your queue configuration to get the most out of your cluster. intermix provides a great article for explaining the details of queue configuration.
- Create separate queues, each with its own purpose and performance characteristics. It’s helpful to think of your queues as serving a specific purpose: data ingress/egress, transformations, and ad hoc usage. Tailor the concurrency and memory settings of the queues to suit their purpose. We created three queues. The load queue has lower memory and concurrency settings and is specifically for COPY/UNLOAD operations. We also created a transform queue with the highest memory allocation to better facilitate update operations. Finally, we created an ad hoc queue with less memory allocated but higher concurrency to meet the higher volume of queries.
- Iterate on queue configuration with the goal of minimizing disk-based queries and queue time. A disk-based query is one in which insufficient memory resources result in the operation spilling over to disk. Using disk resources kills performance. The tools provided by intermix are ideal for this iterative process.
Avoid Disk-Based Queries
Avoid Long Queue Times
Strive for Zero Queue Time
An important note on concurrency: The overall concurrency limit for Redshift is 50 slots, but the docs recommend staying under 15 slots. With intermix we were able to be much more aggressive and make the most of the 50-slot maximum. For example, our ad hoc queue has been running with 18 slots for months now with no issues.
During this process you may find that you don’t have the resources to meet your memory demands. intermix is great at helping you determine what size cluster you need.
Our Cluster Today
Through the process of optimizing our queue configuration, we determined that 32 nodes were needed to minimize queue time and disk-based queries. We made this final expansion in September 2017 and haven’t had to increase since.
There were other important lessons learned during this queue-configuration process. Queues can help to isolate ad hoc queries from your production workflows. This allowed us to focus first on ensuring our workflows were running smoothly and then iterate on ad hoc query performance. Finally, having an optimally running cluster makes debugging and troubleshooting queries easier. Knowing the cluster as a whole is healthy allows you to drill down and focus performance-tuning exclusively at the query level.
Great! Your cluster is now performing optimally. How are you going to keep it that way? As stated earlier, relying on Redshift tooling alone is insufficient. This is where intermix really shines.
- Monitor your table hygiene daily. Proper table hygiene is the first step in optimal performance. intermix allows for configurable alerts, daily updates on general cluster health, and the ability to drill into table-specific stats including sort and skew.
- Nip queue time in the bud as soon as it rises. We monitor the throughput dashboard daily to help course-correct any time significant queueing occurs.
- Schedule your workflows effectively. With intermix dashboards, it’s easy to identify when resources are available to schedule your workflows. Doing so helps ensure you are getting the most out of your cluster and your workflows are running smoothly.
What’s Next In Your Redshift Story?
We hope you’ve found this post helpful, and if it enables you to side-step some of the issues we’ve faced, even better. Redshift can quickly and easily empower your organization with scalable processing power and is a logical first addition to your Big Data toolbelt. It’s a powerful tool when you know how to use it effectively. But Redshift isn’t going to solve all of your processing needs. Eventually you’ll want decoupled storage and compute along with autoscaling. We’re solving these problems at Udemy now. Stay tuned here for the lowdown.
Have you hit the wall with Redshift? What have you done about it? Please leave a comment if you have Redshift questions or any stories to tell.