Amazon Redshift for Data Warehousing: Migration, Implementation and Improvements
At TrueCar, data is in our DNA. To make better business decisions, making accurate data easily accessible to our business users is critical. Automotive data is complex and exists in large volumes. Our business users rely heavily on clickstream, dealer, pricing, and inventory data, as well as lead and sales information. We ingest data from a variety of source systems, including PostgreSQL, Amazon S3, FTP/SFTP, APIs, SQL Server, and BigQuery, in formats including, but not limited to, CSV, JSON, and Avro.
Before 2017, our data warehouse was a Microsoft SQL Server database in excess of 35 terabytes, and we were consistently dealing with disk space and performance issues. Log files would fill up while ETL was running and we had to manually shrink the logs to reclaim space. There were external drives on the server which were challenging to maintain, as the database had to be put in offline mode to move data/log files to a different drive. Eventually, we even ran out of drive bays for additional drives.
SSIS packages and Talend were used for ETL pipelines. Server performance and timely data for reporting became serious concerns.
So, in an effort to address these problems as part of our larger technology replatforming initiative (Capsela), we made the decision in 2016 to move TrueCar’s data warehouse to Amazon Redshift. For our team, this migration meant more than just moving existing data to Redshift. We would need to build new data ingestion pipelines as well, using our in-house ETL tool called Armatron.
Amazon Redshift seemed like a solution to our disk space and performance problems. It is a fully managed, scalable, fast, and cost-effective columnar data warehouse. Amazon Redshift's columnar data storage, massively parallel processing (MPP), and efficient integration with other AWS services promised a reliable and effective foundation for our business intelligence and analytics platform.
The switchover wasn’t without its challenges. After we configured our main BI tools, Tableau and MicroStrategy, to point to our new Redshift-powered data warehouse, the first cracks in our Redshift solution appeared. As analysts across the company stopped using legacy data sources and began reporting from Redshift, overall usage grew and we started to see performance issues.
Before we knew it, we were dealing with complaints regarding long-running queries, query wait time, and disk usage hitting 100%, which caused ETL failures. Because we had chosen Redshift specifically to overcome these kinds of issues, we were worried that we had made the wrong decision.
Working toward Solutions
We didn’t give up, though; we reached out to AWS for advice. By carefully following AWS’s recommendations and changing how we ran our ETL and maintained the Redshift cluster, we were able to work through the issues and significantly improve the cluster’s performance.
Vacuum and Analyze Large Tables
The first step we took involved a strategy for vacuuming our Redshift tables. One of the largest datasets in our data warehouse is clickstream data, which consumes approximately 70% of our disk space. When we moved our clickstream pipeline to Redshift, we also made a lot of changes to the table structure: adding new columns, updating business logic, and backfilling data for the new columns. Before we realized it, we were generating many ghost rows: rows marked for deletion whose disk space is not reclaimed until the table is vacuumed. To fix this, we decided to vacuum our clickstream table, but the process spiked disk usage to 100% and failed.
Later, we handled the issue with the VACUUM DELETE ONLY command, which is the fastest way to reclaim unused disk space; that freed up around 40% of the disk. Next, we ran VACUUM FULL and ANALYZE on the table, which sorted the data. The ANALYZE command updates the table’s statistics, which helps the query planner choose an optimal execution plan.
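The sequence we ran looked roughly like this (the table name is illustrative):

```sql
-- Reclaim space from deleted rows without sorting (fastest option;
-- needs far less temporary disk space than a full vacuum)
VACUUM DELETE ONLY clickstream;

-- Once space is available, fully vacuum to reclaim and re-sort
VACUUM FULL clickstream;

-- Refresh planner statistics after the data layout changes
ANALYZE clickstream;
```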
We learned that performing VACUUM and ANALYZE regularly is critical to query performance and efficient CPU and disk usage.
Perform Deep Copy
On further analysis using system tables, we observed data skew (uneven data distribution across slices) in our clickstream table. The solution was to perform a deep copy. Another advantage of a deep copy is that it gives us the opportunity to correct or improve compression encodings, the distribution style, and the sort keys, which can enhance reporting and ad hoc query performance.
There are multiple ways we can perform Deep Copy:
- Using the original table DDL
- Using CREATE TABLE LIKE
- Creating a temporary table and truncating the original table
In all three approaches, we use an INSERT INTO … SELECT statement to copy data from the original table into a staging table that has the updated column compression encodings, distribution style, and sort keys. Then we rename the staging table to replace the original.
*NOTE: Make sure you carefully handle any dependencies on the original table.
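A deep copy using the original table’s DDL might look like the sketch below. Table names, columns, encodings, and keys are all illustrative, not our actual schema:

```sql
BEGIN;

-- Stage a copy with corrected encodings, distribution style, and sort key
CREATE TABLE clickstream_staging (
    event_id  BIGINT        ENCODE zstd,
    user_id   BIGINT        ENCODE zstd,
    event_ts  TIMESTAMP     ENCODE zstd,
    page_url  VARCHAR(2048) ENCODE zstd
)
DISTKEY (user_id)
SORTKEY (event_ts);

-- Rewrite the data: the copy lands fully sorted, with no ghost rows
INSERT INTO clickstream_staging SELECT * FROM clickstream;

-- Swap the staging table in for the original
ALTER TABLE clickstream RENAME TO clickstream_old;
ALTER TABLE clickstream_staging RENAME TO clickstream;
DROP TABLE clickstream_old;

COMMIT;
```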
Use Better Compression Encodings
We analyzed compression, recreated tables using the right compression encodings, and performed deep copies. This multi-month effort gave us much better overall performance in terms of both disk space and query time.
We used the following SQL statement to get the recommended column compression:
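(The original screenshot of the statement is not reproduced here; what we ran was along the lines of Redshift’s built-in ANALYZE COMPRESSION command, shown with an illustrative table name.)

```sql
-- Sample rows from the table and report the estimated best encoding
-- and space savings for each column
ANALYZE COMPRESSION clickstream COMPROWS 100000;
```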
Use Distribution Style ALL for Small Tables
To understand query execution and performance better, we started investigating the EXPLAIN plans of long-running queries and noticed that some dimension tables were being redistributed to all nodes during execution. This was causing a lot of performance issues, so we revisited AWS’s recommendations on choosing the best distribution style. Based on one of the recommendations, we decided to change the distribution style to ALL for many dimension tables with fewer than 2M rows. Changing to DISTSTYLE ALL reduced data I/O and eliminated the redistribution of data when joining tables.
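With DISTSTYLE ALL, a full copy of the table is stored on every node, so joins against it never need to move data at query time. A sketch with a hypothetical dimension table:

```sql
-- A small dimension table replicated to every node; joins against it
-- no longer trigger DS_BCAST or DS_DIST steps in the explain plan
CREATE TABLE dim_vehicle_make (
    make_id   INTEGER,
    make_name VARCHAR(64)
)
DISTSTYLE ALL;
```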
After redesigning, vacuuming, and analyzing the clickstream table, we brought disk usage down to 21%. Queries ran faster, and disk and CPU usage were minimal.
Smarter Workload Management (WLM)
We managed to improve performance by tweaking Workload Management settings. We did this by testing various WLM settings for day vs night. We discovered we could reduce WLM memory settings for ETL during the day and increase them for BI reporting and analysts. We also experimented with concurrency settings for each WLM group until we got something that fit our BI and ad hoc reporting needs.
Daytime (ETL memory reduced):
- BI APP group — 10 concurrency with 60% memory
- Analyst group — 5 concurrency with 30% memory
- ETL group — 10 concurrency with 10% memory
Nighttime (ETL memory increased):
- BI APP group — 10 concurrency with 30% memory
- Analyst group — 5 concurrency with 30% memory
- ETL group — 10 concurrency with 40% memory
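In the cluster parameter group, a daytime configuration along these lines would be expressed in the `wlm_json_configuration` parameter. This is a sketch with illustrative user-group names, and a real configuration also needs a trailing default queue:

```json
[
  {"user_group": ["bi_app"],  "query_concurrency": 10, "memory_percent_to_use": 60},
  {"user_group": ["analyst"], "query_concurrency": 5,  "memory_percent_to_use": 30},
  {"user_group": ["etl"],     "query_concurrency": 10, "memory_percent_to_use": 10}
]
```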
This significantly reduced our query wait time, as the illustration below shows.
Enable Concurrency Scaling
Though we had a concurrency level set for each WLM group, queries were still waiting in the queue for resources during peak reporting times. Enabling concurrency scaling at the WLM group level further reduced query wait time, and it was also very cost effective, as Amazon credits the cluster with an hour of free concurrency scaling for each day it runs. Read-only queries, and queries hitting tables without interleaved sort keys, benefit from this feature.
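Concurrency scaling is switched on per queue in the same WLM JSON configuration, via the `concurrency_scaling` property (user-group name illustrative):

```json
[
  {"user_group": ["bi_app"],
   "query_concurrency": 10,
   "memory_percent_to_use": 60,
   "concurrency_scaling": "auto"}
]
```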
Update Table Statistics
Several large tables were missing statistics. Now we run ANALYZE on all tables at the end of ETL, so that stats are up to date before reporting starts for the day. This gave us noticeably better performance on some of the long-running queries.
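An end-of-ETL statistics refresh can be as simple as the following; the 5% threshold is an illustrative choice, not necessarily the one we use:

```sql
-- Skip tables whose stats are less than 5% stale (Redshift's default is 10%)
SET analyze_threshold_percent TO 5;

-- With no table specified, ANALYZE covers all tables in the current database
ANALYZE;
```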
Enable Elastic Resize
When we heard about Elastic Resize (adding nodes to a cluster in minutes), we knew it would be very useful. We analyzed our peak usage times and came up with a time window for the resize. Comparing query and reporting performance stats, we found that monthly (first week of the month) and weekly reports hit Redshift with at least 1,500 processes, including BI reports/dashboards and ad hoc analysis. We decided to resize our cluster at the end of each month for monthly reporting (downsizing after four days) and every Monday for weekly reporting. Reports and queries showed a 60% performance improvement after resizing.
Below is a screenshot of how the table count has grown over the past year in our Redshift data warehouse.
We originally moved to Redshift to address performance and disk space issues. However, at first, we ended up with performance issues that nearly drove us away from using Redshift. Fortunately, we took AWS’s recommendations and, by experimenting and tweaking things, managed to get our cluster performant with the steps below:
- Amazon Redshift’s columnar data storage, massively parallel processing (MPP) and easy integration with other AWS services make it the best platform for data warehousing and reporting.
- Choosing the best Distribution Style and Sort Style is very critical to the overall performance of queries.
- Explore long-running queries, their explain plans, and data redistribution to determine DIST and SORT styles.
- Performing VACUUM and ANALYZE improves query performance and ETL throughput, and makes CPU and disk usage more efficient.
- Column compression reduces the size of data and disk I/O, which helps improve query performance.
- Based on tables’ size and structure, you can decide to fully vacuum or perform a deep copy.
- WLM settings played a key role for us in allocating sufficient resources for each group based on the usage.
Our data warehouse is performing much better now, and we are happy with Redshift’s performance. We continue to see opportunities for improvement, so by the time you read this article, we will probably have improved things further.
If you liked this article, you might like our in-depth description of our ETL framework, Armatron.
We are hiring! If you love solving problems please reach out, we would love to have you join us!