Redshift Elastic Resize at TrueCar

Published in Driven by Code · Jun 18, 2019

By: David Wang

Our Redshift cluster occasionally experiences spikes in demand. At the beginning of each month, automatic reports and ad hoc analyses converge on our data warehouse in order to get timely insights out to our partners, dealers, and various internal stakeholders. Every Monday, the load increases as analysts across the company start the week with a fresh look at the data.

Naturally, as the load on our cluster increases, query performance decreases. To maintain a good level of service for all users, we tried a few things to improve performance. After a certain point, though, increased load needs increased raw capacity.

Instead of permanently increasing the cluster size (which we ruled out because of the added cost and downtime), we looked for a way to scale our performance up and down based on demand.

A Promising Solution

In November 2018, AWS announced Elastic Resize, a feature that adds or removes nodes on a cluster in a matter of minutes. Once we compared the benefits of Elastic Resize to our own performance metrics, we realized this new Redshift feature could potentially give us the added power we needed while keeping costs under control.

  • Our heavy queries utilized our worker nodes fully and evenly, which indicated they would benefit from parallelization and could take advantage of extra nodes.
  • We could scale the cluster up and down with demand, and find an efficient balance between cost and performance.
  • Resizing in minutes was a time window we could work with, given the schedule of our automated reports and ETL processes.

Because of the obvious potential for increased performance, we decided to test Elastic Resize against our regular monthly and weekly reporting workload.

Testing Elastic Resize

To measure whether or not Elastic Resize would help us out in practice, we needed to test it against our regular workflow. Our plan to test out Elastic Resize was to:

  1. Create a test cluster from a recent snapshot
  2. Run test queries and measure their runtimes (before resizing)
  3. Initiate an Elastic Resize operation
  4. Run test queries and measure their runtimes (after resizing)

Create a test cluster from a snapshot

To avoid impacting our main cluster, we spun up an identically configured test cluster from the latest backup.
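Spinning up such a test cluster can be done from the console or with a couple of boto3 calls. A minimal sketch, with placeholder identifiers rather than our real cluster and snapshot names:

import boto3

redshift = boto3.client("redshift")

# Restore an identically configured test cluster from a recent snapshot of the
# production cluster. Identifiers here are placeholders.
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="redshift-resize-test",
    SnapshotIdentifier="rs:prod-cluster-2019-06-01-00-00",
)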

Run test queries

Before triggering a resize, we first extracted a representative sample of nine of our heaviest reporting queries, ran them concurrently on the test cluster, and recorded their execution times. This gave us a baseline from which we could reasonably judge the impact of Elastic Resize.

Our approach was to replay these nine heavy reporting queries exactly as they appeared in our production cluster. Running them concurrently meant that we also introduced the same kind of contention for cluster resources as we saw in our production cluster.
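As a rough illustration of the replay harness, something like the following works: a thread pool fires the saved queries at the test cluster concurrently and records wall-clock runtimes. The connection string, file paths, and driver choice here are assumptions, not our actual tooling.

import time
from concurrent.futures import ThreadPoolExecutor

import psycopg2  # Redshift speaks the Postgres wire protocol

# Placeholder connection details for the test cluster.
DSN = (
    "host=redshift-resize-test.example.us-west-2.redshift.amazonaws.com "
    "port=5439 dbname=analytics user=tester password=..."
)

def run_query(path):
    """Run one saved reporting query and return its wall-clock runtime."""
    with open(path) as f:
        sql = f.read()
    start = time.perf_counter()
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(sql)
        cur.fetchall()
    return path, time.perf_counter() - start

# Nine representative heavy reporting queries, run concurrently.
query_files = [f"queries/report_{i}.sql" for i in range(1, 10)]
with ThreadPoolExecutor(max_workers=len(query_files)) as pool:
    for path, seconds in pool.map(run_query, query_files):
        print(f"{path}: {seconds:.0f}s")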

Run Elastic Resize

After obtaining baseline numbers for the queries, we triggered the Elastic Resize. This gave us a general idea of how long the resize took and what would happen to the cluster.

The resize process includes four main checkpoints: resize request received, resize process begins, resize process completed, and data transfer operation completed. Approximate timings from one of our elastic resizes, along with a more detailed explanation of each stage, are given below.

Resize request received

The “resize request received” event occurs when a request is sent to the Redshift cluster. However, the resize operation does not actually begin until some time after the request is sent; we’ve seen anywhere from one minute to 45 minutes elapse between sending a resize request and the resize actually beginning, a surprisingly large range. This is because AWS needs time and resources for the prep work and logic behind the resize. Officially, AWS suggests creating a snapshot of the cluster right before starting the resize to shorten this wait, but we saw inconsistent results. You can try that, but regardless, you will need to plan for this delay in the resize process.

Resize process begins

Once AWS finishes the prep work, the resize itself begins. When the resize process starts, all running queries are killed, and no queries can be run on the database. This stage usually takes only a few minutes (3–5 minutes on average).

Resize process completed

Once the resize process is finished, data still needs to be moved onto the new nodes. This transfer usually takes around 30 minutes, and queries can be run during it, albeit with slightly degraded performance.

Data transfer operation completed

After the data has finished transferring, the entire Elastic Resize process is complete. Queries can now be run normally with increased computing resources.
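These checkpoints surface in the console and in cluster events, but they can also be polled programmatically. A small sketch using the describe_resize API, with a placeholder cluster identifier:

import time
import boto3

redshift = boto3.client("redshift")

# Poll until the resize and data transfer have finished.
while True:
    resize = redshift.describe_resize(ClusterIdentifier="redshift-resize-test")
    print(resize["Status"], resize.get("DataTransferProgressPercent"))
    if resize["Status"] in ("SUCCEEDED", "FAILED", "CANCELLING"):
        break
    time.sleep(60)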

Run test queries again

After the Elastic Resize was fully complete and data redistributed to the new nodes, we ran the same queries again. The following table shows our query run times before and after the resize:

Table: Performance increases based on node size

These are some impressive performance increases, even taking into account the contention we intentionally introduced by running all of these queries concurrently. We didn’t expect a doubling in performance with double the nodes since distributed environments tend to lose some performance due to overhead. All in all, the improvements were significant and gave us enough confidence to move forward in applying Elastic Resize to our production cluster.

Implementation in Our Production Environment

With the testing done and consensus that Elastic Resize would benefit us, we took the following steps to roll it out to our production environment.

Decide what time to resize

Because there is a notable lag between the initial resize request and the actual start of the resize, the timing of the resize is important. There are ways to mitigate the risk with a more sophisticated retry strategy, but for now, we decided to designate an hour-long window for the resize, during which any running queries would be subject to being cut off.

Decide what days to resize

To be the most cost-effective, we wanted to size up when demand was highest and size down when demand was lower. As we mentioned at the beginning of this article, our highest-demand days were at the start of the month and the start of the week. Reaching this conclusion took some work: we met with our self-service analytics team and analyzed our historical Redshift usage patterns and demands.

Once we finalized a sizing schedule, we could have triggered Elastic Resize manually at the scheduled times. Of course, being engineers, we opted to automate this work.

Create a script to do the resizing

After some trial and error, we landed on using a simple Lambda function (written in Python 3.6) to handle the resizing. The Lambda-based solution approaches the problem by taking the following steps:

Generate a “calendar” of how many nodes should be active each day.

For example:

  • On the 1st through 4th of every month, there should be 32 nodes.
  • Every Monday, there should be 32 nodes.
  • On all other days, there should be 16.

This approach also made it easy to keep the cluster upsized for longer. If we needed more time at the larger size, we simply turned off the Lambda; when the need for extra capacity passed, we turned the Lambda back on, and it automatically resumed the regular sizing schedule.

A simplified sketch of the Lambda’s resizing logic, with placeholder names for the cluster identifier and node counts, looks roughly like this:
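import datetime
import boto3

# Placeholder values -- not our production settings.
CLUSTER_ID = "reporting-cluster"
UPSIZED_NODES = 32
BASELINE_NODES = 16

redshift = boto3.client("redshift")


def target_node_count(today):
    """Return how many nodes the cluster should have on a given date."""
    if today.day <= 4:           # 1st through 4th of the month
        return UPSIZED_NODES
    if today.weekday() == 0:     # Monday
        return UPSIZED_NODES
    return BASELINE_NODES


def resize_if_needed():
    """Trigger an elastic resize when the current size differs from the calendar."""
    desired = target_node_count(datetime.date.today())
    cluster = redshift.describe_clusters(ClusterIdentifier=CLUSTER_ID)["Clusters"][0]

    if cluster["NumberOfNodes"] == desired:
        return

    # Classic=False requests an elastic resize rather than a classic resize.
    redshift.resize_cluster(
        ClusterIdentifier=CLUSTER_ID,
        NumberOfNodes=desired,
        Classic=False,
    )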

Once the calendar was set, we had the function execute at 10 pm PST and 11 pm PST. At 10 pm, the function generates a cluster snapshot. At 11 pm, the function triggers the resize.
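One possible way to wire up the two scheduled invocations, reusing redshift, CLUSTER_ID, and resize_if_needed from the sketch above (the “action” key and snapshot naming are illustrative, not our exact setup):

import datetime

def lambda_handler(event, context):
    # Two scheduled rules invoke this handler: the 10 pm rule passes
    # {"action": "snapshot"}, the 11 pm rule passes {"action": "resize"}.
    if event.get("action") == "snapshot":
        redshift.create_cluster_snapshot(
            SnapshotIdentifier="pre-resize-" + datetime.date.today().isoformat(),
            ClusterIdentifier=CLUSTER_ID,
        )
    else:
        resize_if_needed()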

We also send alerts to Slack, which gives us more visibility into the resizing of the cluster, by using Redshift event notifications (specifically, the “Management” category).
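Setting up those notifications is a one-time step: route the cluster’s management events to an SNS topic, then have a small subscriber on that topic post the messages to Slack. A sketch with boto3, where the topic ARN and identifiers are placeholders:

import boto3

redshift = boto3.client("redshift")

# Route the cluster's "management" events (which include resize progress)
# to an SNS topic. A subscriber on that topic forwards them to Slack.
redshift.create_event_subscription(
    SubscriptionName="redshift-resize-alerts",
    SnsTopicArn="arn:aws:sns:us-west-2:123456789012:redshift-alerts",
    SourceType="cluster",
    SourceIds=["reporting-cluster"],
    EventCategories=["management"],
    Enabled=True,
)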

Some caveats

  • There are some sizing limitations based on the types of nodes in the cluster. We use ds2.xlarge so we are limited to only doubling or halving our original node count.
  • Pay attention to disk space usage. With double the nodes, you also have double the storage. If you fill your double-sized cluster with too much data, you might not be able to size back down, because your original cluster size would no longer be able to hold it (a lightweight check for this is sketched after this list).
  • We currently use Python 3.7 for our Lambda functions and have to include our own boto3 package. The one included by default in the Lambda execution environment is somewhat out of date and does not have the “Redshift Elastic Resize” function.
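On the disk-space point, one guard (illustrative, not our exact implementation) is to check the cluster’s PercentageDiskSpaceUsed metric in CloudWatch before requesting a downsize; the threshold and identifiers here are assumptions:

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

def safe_to_downsize(cluster_id, threshold_percent=45.0):
    """Return True if recent disk usage would still fit on half the nodes."""
    now = datetime.datetime.utcnow()
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Redshift",
        MetricName="PercentageDiskSpaceUsed",
        Dimensions=[{"Name": "ClusterIdentifier", "Value": cluster_id}],
        StartTime=now - datetime.timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Maximum"],
    )
    datapoints = stats["Datapoints"]
    # Staying under ~45% on the doubled cluster leaves headroom at half the nodes.
    return bool(datapoints) and max(d["Maximum"] for d in datapoints) < threshold_percent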

Success

For an extra $320 per day, Elastic Resize gives us significant improvements to performance and efficiency, greatly benefiting our reporting team. Some reports even saw superlinear improvements — in particular, one that used to take two hours now takes 15 minutes.

By leveraging the flexibility of the cloud through Elastic Resize, we achieved our goal of improving performance while keeping costs under control.

We are hiring! If you love solving problems, please reach out. We would love to have you join us!
