10 lessons learnt scaling Redshift Cluster at Box

Salman Ahmed · Box Tech Blog · Oct 29, 2019
Illustration by Jeremy Nguyen / Art direction by Sarah Kislak

Every day millions of enterprise users around the globe rely on Box systems to execute their most critical business workflows. To understand the customer journey through the Box platform, we leverage the Amazon Redshift data warehouse as part of the back-end data infrastructure that runs mission-critical Business Intelligence dashboards.

The Redshift cluster is widely used across Box by many tenants, notably the Engineering, Finance, Customer Success, and Business Analytics teams. At a high level, Box's data backend collects event data from Box servers, streams it through a Kafka pipeline, and lands it in S3 buckets for persistence. We then use compute clusters running Hive and Spark to ETL the event data and load aggregates into Redshift.

Figure 1. Conceptual Box Data Architecture

Like most technical outfits, we started leveraging Redshift in 2014 to offload our on-premise DWH to the cloud at a scale of a few TB. Fast forward to 2019: the Redshift cluster has now scaled up to 43 TB and serves a variety of scheduled and ad-hoc workloads, averaging around 550K weekly queries.

The journey to scale Redshift has not been entirely smooth, and we learnt some hard lessons rethinking scale along the way. In this blog, we discuss the important lessons learnt while scaling Redshift at Box. We hope these recommendations provide insight into how to scale Redshift in a cost-effective and performant manner while providing a responsive customer experience.

1. Select the right instance type to meet scale needs

The cluster shape and size you start with will likely not be the shape that meets tomorrow's business needs. As usage scales up, users will notice that performance no longer meets business needs. This generally manifests as a) query runtimes outside accepted thresholds and b) queries getting aborted if you have Query Monitoring Rules enabled. Many factors impact performance, including limited cluster shape, poorly designed tables, queries that spill to disk, heavy query workloads, and high concurrency. We have learnt that a common denominator for performance uplift tends to be selecting the correct cluster instance node type and count. We initially started leveraging Redshift for analytics in 2014 with 4 dc1.8xlarge nodes and under 6 TB of data. Fast forward to 2019: we have changed the cluster instance type and node count a few times and have settled on the ds2.8xlarge node type with the cluster size at 30 TB. As use cases expand, periodically check whether you are using the optimal instance type for your cluster.
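One simple signal that the current node type or count no longer fits is sustained high disk utilization. As a rough sketch of how we might watch for that (assuming access to the standard STV_PARTITIONS system table), the cluster-wide fill percentage can be tracked with:

```sql
-- Approximate overall disk utilization of the cluster.
-- STV_PARTITIONS reports used/capacity per disk partition in 1 MB blocks.
SELECT SUM(used)::FLOAT / SUM(capacity) * 100 AS pct_disk_used
FROM stv_partitions;
```

If this number trends persistently upward, it is usually time to revisit the node type or count rather than tune individual queries.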

2. Leverage Reserved Instances to optimize total spend

Redshift Reserved Instances can provide tremendous savings, as large as 75% over On-Demand rates. The key is to know which instance type is right for your use case. The three reservation types, No Upfront, Partial Upfront, and All Upfront, carry differing commitments and pricing benefits, and can yield substantial run-the-business savings if the spend is planned according to business and warehouse growth. If you are not entirely sure of demand and growth, you can always purchase RIs in increments. Over the years we have moved to a more RI-heavy commitment for Redshift instances for predictable cost.

3. Adjust WLM periodically based on your workload(s)

Running multiple types of workloads on a cluster requires fine-grained tuning of tenants, memory, connections, and queues. We use WLM (Workload Management) to manage the cluster workload, mapping incoming queries to the correct queues via user groups. Analytics queries may include multiple joins or may need to scan massive amounts of data to filter the result. These queries can come from dashboards or from individual users. Based on priority, we assign the higher-priority queues to the ETL service user group and the reporting user group, as their work is critical for business SLAs. In addition, consider enabling Short Query Acceleration to improve short query performance.

We leverage query monitoring rules to abort queries that hog resources or run too long. If such a query is important to the business and needs to run, we work with the business to optimize it and map it to the appropriate WLM group to meet the necessary SLA. We keep one small queue with no abort rule specified so that specific use cases can still execute. Redshift also provides automatic WLM to dynamically adjust resources and concurrency for queries; however, that constrains the flexibility to control queries per queue or user group. There is an ongoing effort by the Redshift team to improve WLM, including query priority, and we will be assessing it.
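To verify that queue assignments are actually working, a minimal sketch like the following (assuming the standard STL_WLM_QUERY system table) compares queue wait and execution time per WLM service class:

```sql
-- Average queue wait vs. execution time per WLM service class (queue)
-- over the last 7 days. Times in STL_WLM_QUERY are in microseconds.
SELECT service_class,
       COUNT(*)                          AS query_count,
       AVG(total_queue_time) / 1000000.0 AS avg_queue_sec,
       AVG(total_exec_time)  / 1000000.0 AS avg_exec_sec
FROM stl_wlm_query
WHERE service_class > 5  -- user-defined queues start at service class 6
  AND queue_start_time > DATEADD(day, -7, GETDATE())
GROUP BY service_class
ORDER BY service_class;
```

A queue with high average wait but short execution times is a candidate for more slots or memory; the reverse suggests heavyweight queries that may belong in a different queue.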

4. Enable Concurrency Scaling for bursty queries

When concurrency scaling is enabled, Amazon Redshift automatically adds additional cluster capacity when you need to process an increase in concurrent read queries. Write operations continue as normal on your main cluster. Users always see the most current data, whether the queries run on the main cluster or on a concurrency scaling cluster. We’ve deployed this feature on our Redshift cluster to gives us the flexibility to handle the variance in our workloads to account for end-of-quarter reporting and bursty analysis, without the team having to intervene. Under the hood, AWS Redshift spins another managed cluster to run the queries. The first 60 minutes is free in a given 24-hour period. Set up monitoring to understand bursty behavior and do TCO analysis to see if it’s more economical to have a larger cluster as compared to having concurrency scaling alone.
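As a rough sketch of that monitoring (assuming the SVCS_CONCURRENCY_SCALING_USAGE system view is available on your cluster), daily burst usage can be summarized like this:

```sql
-- Daily concurrency scaling usage: how many queries ran on burst
-- capacity and how many billable seconds were consumed.
SELECT TRUNC(start_time)     AS usage_day,
       SUM(queries)          AS queries_on_burst,
       SUM(usage_in_seconds) AS burst_seconds
FROM svcs_concurrency_scaling_usage
GROUP BY TRUNC(start_time)
ORDER BY usage_day;
```

Comparing burst_seconds against the free 60 minutes per day gives a first-order input to the TCO analysis mentioned above.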

Figure 2. Concurrency Scaling Metrics

5. Leverage Redshift Spectrum

Certain use cases may need to join tables in Redshift with Hive tables stored in S3. Traditionally this requires copying S3 data into Redshift via an ETL pipeline so users can run analytics queries. Redshift Spectrum, an offering from AWS, can query external tables stored in S3 without the ETL pipeline that would otherwise be needed to consolidate the data. This can reduce data duplication and the computational load on the Redshift cluster, leading to concurrency improvements. Amazon Redshift Spectrum minimizes the data processed with dynamic partition pruning. You can further improve query performance by reducing the data scanned; you do this by partitioning and compressing data and by using a columnar format for storage.
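A minimal sketch of the setup, with hypothetical schema, table, and IAM role names, looks like this:

```sql
-- Register the Glue/Hive catalog as an external schema (one-time setup).
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'events_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';

-- Join a local Redshift table with an external S3-backed table;
-- the filter on the partition column lets Spectrum prune partitions.
SELECT u.account_id, COUNT(*) AS event_count
FROM spectrum.raw_events e
JOIN user_accounts u ON u.user_id = e.user_id
WHERE e.event_date = '2019-10-01'
GROUP BY u.account_id;
```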

6. Vacuum/Analyze to optimize for deletes and sorting

If large numbers of rows remain unsorted on disk for a given table, query performance will degrade. Keeping table rows sorted optimizes query performance by reducing disk I/O operations and improving zone maps. In addition, range-restricted scans become more efficient, and the Redshift query engine doesn't need to sort rows on demand during query execution. If the unsorted section of the table is large, you can perform a deep copy instead of a vacuum sort. A vacuum delete operation reclaims table storage by removing the rows marked for deletion; vacuum delete is performed automatically in the background when there is minimal activity on the system. We regularly vacuum tables when the percentage of unsorted data in a table exceeds 10%; depending on the size of the table, you can gain additional benefit from a lower threshold like 5%.
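A minimal sketch of that routine, assuming the standard SVV_TABLE_INFO system view and a hypothetical fact_events table:

```sql
-- Find tables whose unsorted fraction exceeds the 10% threshold.
SELECT "schema", "table", unsorted
FROM svv_table_info
WHERE unsorted > 10
ORDER BY unsorted DESC;

-- Then sort and refresh statistics for an offending table.
VACUUM SORT ONLY my_schema.fact_events;
ANALYZE my_schema.fact_events;
```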

7. Optimize Distribution and Sort Keys for faster queries

Distribution keys are used to evenly distribute data throughout the cluster for parallel processing. Determine whether any column has good data distribution and high cardinality and whether queries apply selective filters on it; if so, that column is a DISTKEY candidate. The best way to check data distribution is to create a single-column temp table with the candidate column as DISTKEY and inspect the resulting distribution, as shown in the sketch below. Compare the skewed rows, storage skew, and percentage populated for each candidate DISTKEY column under test, and select the best DISTKEY based on those three characteristics. If the table contains at least one potential DISTKEY column that also participates in join conditions, and the query patterns utilize that column in joins, select it as the DISTKEY column. If the table doesn't contain any potential DISTKEY column and the query patterns don't utilize a DISTKEY column in join conditions, select DISTSTYLE ALL. If the table isn't usually involved in join conditions and doesn't have any potential DISTKEY column, select DISTSTYLE EVEN.
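A rough sketch of that distribution check, with a hypothetical events table and customer_id candidate column:

```sql
-- Create a single-column temp table distributed on the candidate key.
CREATE TEMP TABLE dist_check DISTKEY(customer_id) AS
SELECT customer_id FROM my_schema.events;

-- Inspect how rows landed on each slice; large variance means skew.
SELECT slice, SUM(rows) AS rows_on_slice
FROM stv_tbl_perm
WHERE TRIM(name) = 'dist_check'
GROUP BY slice
ORDER BY rows_on_slice DESC;
```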

Sort keys control the physical ordering of the data based on the sort type and the columns selected in the sort key definition. The benefits of adding a sort key include reduced disk I/O through more effective zone maps for read operations, and reduced compute overhead and I/O by reducing or avoiding the costly sort steps required by operations like ORDER BY, PARTITION BY, and GROUP BY. When sorting enables merge joins and improves zone maps, and query patterns use repetitive filters on a similar group of columns, a compound sort key is the better choice. When sorting improves zone maps and query patterns utilize them, but queries do not use repetitive filters on similar column groups, an interleaved sort key can be considered instead; note that interleaved sort keys require a vacuum reindex (performed before a vacuum full if needed), which is an expensive operation since it rearranges more rows than a compound sort key. Review explain plans to investigate the join conditions used by the tables involved. If only a few blocks exist per column per slice, or a column has a single or few distinct values, a SORTKEY won't improve zone maps.
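As an illustrative sketch (hypothetical table and column names), the two sort key styles are declared like this:

```sql
-- Compound sort key: best when filters repeatedly hit the leading
-- column(s), e.g. time-series queries filtered on event_date.
CREATE TABLE my_schema.fact_events_compound (
    event_date  DATE,
    user_id     BIGINT,
    event_type  VARCHAR(64)
)
DISTKEY (user_id)
COMPOUND SORTKEY (event_date, event_type);

-- Interleaved sort key: weighs each column equally for varied filter
-- patterns, at the cost of a more expensive VACUUM REINDEX.
CREATE TABLE my_schema.fact_events_interleaved (
    event_date  DATE,
    user_id     BIGINT,
    event_type  VARCHAR(64)
)
DISTKEY (user_id)
INTERLEAVED SORTKEY (event_date, event_type);
```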

8. Disable STATUPDATE and COMPUPDATE for COPY commands

Skipping compression analysis during the COPY operation improves copy runtimes by avoiding the compression analysis of the table each time the copy runs. As a best practice, we standardized skipping compression analysis on core ETL jobs and significantly improved copy runtimes. We also set STATUPDATE OFF on core ETL jobs to minimize copy runtime and resolve blocking issues with other sessions concurrently trying to access the same table. Since we turned STATUPDATE off on COPY statements, we refresh all of the statistics in a scheduled weekly ANALYZE job, with no adverse impact to table stats.
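A minimal sketch of such a COPY, with hypothetical table, bucket, and IAM role names:

```sql
-- COPY with compression analysis and automatic stats updates disabled;
-- statistics are refreshed later by a scheduled ANALYZE job.
COPY my_schema.fact_events
FROM 's3://my-etl-bucket/events/2019-10-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
DELIMITER '\t' GZIP
COMPUPDATE OFF
STATUPDATE OFF;

-- Scheduled stats refresh (e.g., weekly):
ANALYZE my_schema.fact_events;
```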

A recent change to skip compression analysis and turn STATUPDATE off for COPY operations on one of our primary schema tables improved COPY execution performance by roughly 70%: for the samples taken, COPY execution time dropped from 70.75 minutes to 21.67 minutes. Meanwhile, the scheduled jobs performed the necessary stats refresh, keeping query performance intact.

9. Develop a good Observability story

Track important metrics such as cluster storage utilization, active users, query volumes, counts of tactical, long-running, and short queries, queries evicted from WLM, table counts, query execution times, scanned row counts, CPU and disk I/O used by evicted queries, database connections, query throughput and query duration by WLM queue, concurrency scaling activity, and scaling usage in minutes, to name a few of the metrics we track daily and week-over-week. As a result of tuning based on some of these metrics, we have been able to uplift the performance of the Redshift cluster tremendously.
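As a rough sketch of one such daily metric, query volume and aborted-query counts can be pulled from the standard STL_QUERY system table:

```sql
-- Daily query volume and aborted-query count over the trailing week.
SELECT TRUNC(starttime) AS query_day,
       COUNT(*)         AS total_queries,
       SUM(aborted)     AS aborted_queries
FROM stl_query
WHERE starttime > DATEADD(day, -7, GETDATE())
GROUP BY TRUNC(starttime)
ORDER BY query_day;
```

Feeding queries like this into a dashboard makes week-over-week regressions visible before users report them.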

10. Invest in Automation early

Last but not least, it's common to start out managing a Redshift cluster manually, but as the cluster's workloads and scale grow, it is difficult to keep things in a good, stable state without quality automation. We highly recommend a) automating user management via a job to maintain proper access, b) automating routine schema/table maintenance, including reviewing missing statistics, which helps optimize tactical query workloads (under 60 seconds runtime), c) investing properly in custom monitoring, which is essential for table-level insights, and d) using Redshift Advisor to find database-level and workload-level issues. Monitor the long-running, short, medium, and tactical workloads on a regular cadence to make sure system performance is stable with consistent runtimes. A sketch of one such maintenance check appears below.
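For item b), a minimal sketch that flags tables with stale or missing statistics, assuming the standard SVV_TABLE_INFO system view:

```sql
-- Tables whose statistics are stale. stats_off is the percentage by
-- which statistics are estimated to be off; 100 means never analyzed.
SELECT "schema", "table", stats_off, tbl_rows
FROM svv_table_info
WHERE stats_off > 10
ORDER BY stats_off DESC;

-- A scheduled maintenance job can then run ANALYZE on each flagged table.
```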

Conclusion

This blog should provide a glimpse into the best practices we have leveraged to scale the Redshift cluster at Box. Special thanks to Rong Qiao and Chandra Bandi for their invaluable contributions to this blog and for scaling the Redshift cluster. Special thanks to Jeremy Nguyen for the illustration and Sarah Kislak for the art direction.
