What we learned about migrating Redshift
This is a story about how Chad Lagore and I planned and migrated a couple of large tables in AWS Redshift. Chad is the data engineer and data analyst who did most of the work. I just asked the right questions and helped execute it.
Problem: Too much waiting
Like many other companies, we collect analytics at Mobify. We have a data pipeline that involves many AWS services, including Amazon Redshift. Our daily operations revolve around doing big data analysis for our customers, so having an efficient system to support that work is one of our top priorities.
In late 2017, we started to notice that our Redshift cluster was not performing as well as we would have liked. It was somehow deadlocking and no queries were getting through. We reached out to Amazon support and they increased our vacuum buffer size. That did resolve the deadlock at the time, but we ran into the same issue only months later. We knew our data volume was growing, and eventually we would have to do something about our disk space. It wasn’t until 4 months later that the consequences of leaving our Redshift cluster as it was became apparent: our cluster was taking more than 48 hours to perform vacuum full tasks. To be more specific, our tables were so large that it took more than a day to finish a single vacuum full. While vacuum is running, it impacts our daily operations in the following ways:
- Querying is slow
- Table migration isn’t possible
- Querying a table that is being vacuumed can deadlock the entire Redshift cluster as queries compete for memory
Oh yes. We spent a lot of time .. just waiting.
To better understand just how much data we are looking at, here are some metrics:
- 6.4 TB total disk space usage
- 1 table at 3.76 TB disk space usage
- 1 table at 1.12 TB disk space usage
- Daily task that marks data older than 100 days for deletion
- Vacuum full is run daily
There’s got to be a better way! So, we started by asking ourselves:
- What is vacuum doing?
- Can we get rid of vacuum?
- What’s eating disk space?
- Why do we have a table that’s taking so much disk space for just 100 days worth of data?
- What can we do to reduce table size?
- What are our options?
So many questions.
Understanding our Problem
What is vacuum full doing? As explained in the AWS documentation, vacuum full does the following:
- Reclaim disk space from rows marked for deletion
- Sort “un-sorted” table region
- Merge the sorted data blocks in sorted order
Okay. That doesn’t help much. We knew that already. In fact, we knew vacuum was always taking forever at sorting and merging. So, let’s understand why that is the case. To do that, we need to understand the nature of our own data and how it is inserted into the table.
Our data is monotonically increasing by time. Every minute, tasks write minute-batched, real-time data into Redshift tables. This means that every day, we have roughly 60 million rows of unsorted data that vacuum needs to process.
Problem — Our table sort key is not optimal
Our tables had sort keys in the following order: project id, date. To understand the impact of this, let’s do a vacuum full simulation.
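For reference, here is a minimal sketch of what such a table definition looks like. The column names and types are hypothetical; our real tables carry many more columns.

CREATE TABLE analytics_events (
    project_id VARCHAR(64) NOT NULL,
    date       DATE        NOT NULL,
    event_type VARCHAR(64)
    -- ... many more columns
)
COMPOUND SORTKEY (project_id, date);  -- project id first, then date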
Before vacuum full executes, here is an illustration of what the table looks like:
Our table schema has sortkey by project id then date.
Vacuum delete — for every project id, a day’s worth of rows marked for deletion is reclaimed
Did we .. just paper-shred our table? .. Yes.
Vacuum sort — sort the unsorted region by project id then date
.. I have a bad feeling about this ..
Vacuum merge — connect the memory blocks in sort order
It’s Franken-memory! .. We just created the worst possible conditions for merging.
Our table schema created the following problems:
- We broke disk continuity on sorted regions
- We made it so that sorted regions needed to be merged in blocks
- We then had to merge in the newly sorted blocks from the newly inserted data
We finally identified our main problem: we needed to fix our table sort keys! If we consider that the majority of our queries are scoped to a particular project id, it made sense to have it as the primary sort key. However, we never took into account how vacuum would behave with this schema. Our data volume was too small back then for the problem to show up.
Evaluating Options
Now that we know our main problem, it’s time to understand what options are available to solve it.
First, we eliminated the option of increasing our cluster nodes. We knew that simply adding computation power wouldn’t solve our vacuum problem in the long run; it would only push the problem into the future. That meant our options had to fall under optimizing Redshift.
Option 1 — Alter table sort keys
One option was to change our sort keys so that they were ordered by date, then project id. This would help the vacuum task. Let’s simulate the vacuum full task again with the new sort key order.
Before Vacuum Task
Vacuum Delete
Vacuum Sort
Vacuum Merge
Now that is much cleaner.
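Redshift could not change an existing table’s sort key in place at the time, so “altering” the sort keys really means rebuilding the table, which is exactly what our migration plan does later. A minimal sketch of that rebuild, using the same hypothetical columns as before:

-- Same columns, but date becomes the leading sort key
CREATE TABLE analytics_events_new (
    project_id VARCHAR(64) NOT NULL,
    date       DATE        NOT NULL,
    event_type VARCHAR(64)
    -- ... remaining columns unchanged
)
COMPOUND SORTKEY (date, project_id);

-- Deep copy of the data; the ORDER BY makes the intended sort explicit
INSERT INTO analytics_events_new
SELECT * FROM analytics_events
ORDER BY date, project_id;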
Option 2 — Rebuild our tables with Time-Series tables
We started looking around at how other developers in the world solve this problem and came across this best practices guide: Amazon Redshift Best Practices for Loading Data
If your data has a fixed retention period, we strongly recommend that you organize your data as a sequence of time-series tables. In this sequence, each table should be identical but contain data for different time ranges.
✅ Our data has a fixed retention period
A time-series table setup is essentially a view that stitches together tables with exactly the same schema. Each table represents a block of time, and when that block expires, we just drop the table. A table drop is much cheaper than a mass delete. With this, our new table structure would look like the following:
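A rough sketch of the idea, assuming daily tables and the same hypothetical schema as earlier (our real time blocks may differ):

-- One table per time block, all with identical schemas
CREATE TABLE analytics_events_20180101 (LIKE analytics_events);
CREATE TABLE analytics_events_20180102 (LIKE analytics_events);

-- A view stitches the blocks together for querying
CREATE VIEW analytics_events_all AS
    SELECT * FROM analytics_events_20180101
    UNION ALL
    SELECT * FROM analytics_events_20180102;

-- When a block ages out of the retention window, drop it;
-- far cheaper than a mass DELETE followed by vacuum.
-- (With a standard view, the view must be recreated before the drop.)
DROP TABLE analytics_events_20180101;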
Then we continued reading the guide ..
If you use time-series tables with a timestamp column for the sort key, you effectively load your data in sort key order. Doing this eliminates the need to vacuum to resort the data.
At this point, we were a bit skeptical but unable to confirm our suspicions. There wasn’t a lot of documentation around this particular method and there wasn’t much social proof either. However, we did find some indications that going this route is not advised:
- SILOTA — 4 Reasons why it’s time to rethink database views on Redshift
- A thread on SNOWPLOW Analytics about Redshift views
On top of that, we had first-hand experience of Redshift views preventing table alterations, even when the view itself didn’t use the column being altered.
So after some research, this is what we gathered: querying against Redshift views would be slower.
However, we still didn’t know how much it would impact us. Should we accept possibly lower query performance in exchange for a database that doesn’t require vacuum? We didn’t have an answer for that either.
We still didn’t understand how it would be possible to not vacuum at all, so we kept reading the Amazon documentation. Then we came across this document: Loading Your Data in Sort Key Order.
If you load your data in sort key order using a COPY command, you might reduce or even eliminate the need to vacuum.
✅ Our ETL workflow uses COPY command to move data from Amazon S3 to Redshift
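The exact options depend on the pipeline, but the shape of such a load is roughly the following (bucket, path, and IAM role are placeholders):

COPY analytics_events
FROM 's3://our-bucket/minute-batches/2018-01-01-00-00/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
FORMAT AS CSV
GZIP;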
Then, we looked at the requirements for this work:
- ❌ The table uses a compound sort key with only one sort column
- ✅ The sort column is NOT NULL
- ✅ The table is 100 percent sorted or empty
- ❌ All the new rows are higher in sort order than the existing rows, including rows marked for deletion
Our table schema had 2 sort columns, and we needed both to get the most efficient query performance. The only way to get a 100% sorted table is with the first COPY operation into an empty table. Subsequent copy operations would not be in sort order, because project id is our other sort key and it does not increase over time. This confirmed that we couldn’t eliminate vacuum from our process.
Evaluation
So, let’s look at the area of impact of each option:
Alter table sort keys — These are completely isolated operations inside Redshift. We were not expecting any changes to how external operations query or insert into Redshift.
Time-series tables — This change extended beyond Redshift. Our minute insert tasks would have to change to support the time-series tables.
After much thought and validation with many other engineers, we came to the conclusion that regardless of the option we chose, we would still have to rebuild the tables with proper sort keys. Given the time allotted for the migration, we decided to move forward with altering the table sort keys and leave time-series tables as an option to evaluate in the future. With that, we started to devise a plan.
Plan: Create new table and copy data from the old table to the new table.
- Create a new table
- Stop all writes to the old table
- Copy data from the old table to the new table
- Rename the old table to a temp table
- Rename the new table to the old table
- Resume writes to the new table
- Delete the temp table, or roll back if a problem arises
Side quest: We are re-creating tables anyway. Let’s add column encoding to address our disk space issue.
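In SQL terms, the create-and-copy steps look like the rebuild sketched earlier, and the swap itself is just a pair of renames, which is also what gives us a cheap rollback point (table names are hypothetical):

BEGIN;
-- Swap names so downstream queries keep working unchanged
ALTER TABLE analytics_events     RENAME TO analytics_events_old;
ALTER TABLE analytics_events_new RENAME TO analytics_events;
COMMIT;

-- Once everything looks healthy; otherwise rename back and resume writes
DROP TABLE analytics_events_old;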
FUD — Fear, Uncertainty and Doubt
Great! We had a plan! But it was still a giant database migration we were looking at. How could we have the confidence that what we were doing was safe and would actually make a difference? To get there, we came up with a bunch of questions to answer before we even tried to migrate:
We need to copy all data rows of the old table to the new table in a single transaction:
- How much disk space will the copy take?
- How long will the copy take?
- How long will it put our data pipeline on hold?
- How long does our system take to recover?
How much performance gain are we looking at for each optimization we make:
- What query performance will we gain?
- How fast will vacuum be on a proper sort key table?
What’s our rollback plan for every step of our migration plan?
Have we minimized our risk of failure?
We had answers to some of these questions, but not all. Forget Googling; it was time to put on our Bill Nye hats and experiment!
Know your systems
There was no way we would be comfortable executing this migration plan unless we had a solid understanding of our own system.
- Does stopping the Luigi workflow prevent our real-time raw data from being stored? No, that happens in a completely separate system.
- How does our system recover? Our Luigi workflow has update-marker tables that let it detect and backfill any missing data.
- Have we stress tested our system? Yes! We were confident our systems would function normally even at 95% Redshift disk usage, and we knew our Luigi workers were capable of executing 2000 tasks per 15 minutes (the normal rate is 350).
- Will the migration affect production data, and if so, can we recover? If the migration runs over a period in which our daily aggregate workflows execute, it will affect production data, but we can backfill that data since nothing is actually lost.
Gathering metrics
To answer our FUD, we decided to get some estimates before we actually ran the migration. What better way to estimate a copy operation than actually copying a subset of the data! On average, we collect 54 million rows of data a day, which is a significant sample set to work with.
While we were at it, we also wanted to know what performance gain we would get from a new table with proper sort keys and column encoding. Then we thought, why not run the same evaluation on the time-series table setup too? We were going to run these tests anyway, so we might as well determine whether our future optimization was viable.
Data Copy Duration Estimation
We wanted to know how long Redshift would take to complete a copy. To do this, we ran a mock copy of a small subset of data into a new table: we copied 300 GB worth of data and timed it. The process took 2 hours to complete. From this, we could extrapolate how much time we would need to copy all the data in a table.
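To size the sample and the target tables, a query against Redshift’s svv_table_info system view is enough; something along these lines (size is reported in 1 MB blocks):

SELECT "table",
       size     AS size_mb,       -- 1 MB blocks
       tbl_rows,
       unsorted AS pct_unsorted
FROM svv_table_info
ORDER BY size DESC;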
Disk Space Optimization Estimation
We knew the majority of our tables weren’t using any kind of data compression. Since we were rebuilding the tables anyway, we were confident we could improve our disk usage by adding compression. Redshift provides a handy command to analyze how much compression we could get by specifying column encoding:
ANALYZE COMPRESSION tablename;

With the above command, Redshift will analyze the current data in the table and suggest an encoding that you can use to maximize compression. When we analyzed our tables, Redshift returned something like this:
+---------+------------+----------+-----------------------+
| Table | Column | Encoding | Estimated Reduction % |
+---------+------------+----------+-----------------------+
| Table_1 | project_id | zstd | 99.94 |
| Table_1 | date | zstd | 99.88 |
| ... | | | |
+---------+------------+----------+-----------------------+

A good majority of our columns came back with a zstd encoding recommendation. That got us wondering: what is this magical zstd encoding that can provide almost 99% reduction in disk space? zstd (Zstandard) is a real-time compression algorithm that builds a dictionary from the values it sees. This explained why our columns could be compressed so efficiently; there were only a finite number of project ids, so there was no point in keeping them as plain varchars.
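Applying the suggestions is then just a matter of declaring an encoding on each column when the new table is created; a sketch with the same hypothetical columns:

CREATE TABLE analytics_events_new (
    project_id VARCHAR(64) NOT NULL ENCODE zstd,
    date       DATE        NOT NULL ENCODE zstd,
    event_type VARCHAR(64)          ENCODE zstd
    -- ... remaining columns, using whatever ANALYZE COMPRESSION suggested
)
COMPOUND SORTKEY (date, project_id);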
Query Performance Comparison
Initially, we just wanted to know the performance gain between the old sort key schema and the new sort key schema. Out of curiosity, we also compared the table with the correct sort keys against the same table split into a time-series setup.
To do this, we ran a series of queries against sample tables. We prepared 2 new table schema setups:
Table Schema 1
- 3 days worth of data copied into the new table with correct sort-key defined
Table Schema 2 — Time-series tables
- 3 tables, same table schema as table schema 1, each with 1 day worth of data copied in
- 1 view that stitches the 3 tables together
We chose query patterns that made the most of the sort keys and that were also common query patterns our system executes on Redshift. They looked something like this:
SELECT
{DIMENSIONS},
COUNT(*)
FROM {TABLE}
WHERE project_id = {PROJECT_ID} AND date = {DATE}
GROUP BY {DIMENSIONS}

We ran about 1000 queries on each table schema by swapping in a different project_id for each run. For each query, we collected the time it took to complete, then calculated the min, max, and mean with a 95% confidence interval over the result set.
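One way to collect those timings is straight from Redshift’s own query log rather than from the client; for example (stl_query keeps recent query history):

SELECT query,
       DATEDIFF(ms, starttime, endtime) AS elapsed_ms
FROM stl_query
WHERE querytxt LIKE '%analytics_events%'
ORDER BY starttime;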
When we did the first performance run, we got stats showing the single-table setup performing way better than the time-series setup, but subsequent runs didn’t reproduce those numbers. This is when we realized that Redshift caches query plans. A cold cache is when the cluster sees a query pattern for the first time, so Redshift needs additional time to compile the query execution plan before executing the query. A warm cache is when the same query pattern runs after the cold-cache run, and Redshift reuses the cached execution plan. To simulate a cold cache, we had to rebuild the table from scratch.
In summary, on a cold cache, queries took 5–22% longer on the time-series setup than on the single-table setup. On a warm cache, queries took 1–5% longer on the time-series setup than on the single-table setup.
+--------------------+----------+----------+-------------+
| | Min Diff | Max Diff | Medium Diff |
+--------------------+----------+----------+-------------+
| Cold cache | 5% | 22% | 13% |
| Warm cache 1st run | 1% | 5% | 3% |
| Warm cache 2nd run | 1% | 3% | 2% |
| Warm cache 3rd run | 1% | 2% | 2% |
+--------------------+----------+----------+-------------+

Note: Diff stats = time-series query execution times minus single-table query execution times

The above stats are a normalized comparison between the single-table schema and the time-series schema. In all cases, the time-series schema performed worse than the single-table schema. In our case, our queries almost always hit a cold cache, because we query newly inserted data every day. This confirmed that we should not change our table schema to a time-series setup.
Low-Hanging Fruit
Amongst the tables we had in Redshift, we knew some hadn’t been updated in over a year. Identifying and confirming that these tables could be safely removed was an easy win, and it freed up disk space for our migrations. In the process, we freed up 1.27 TB of disk space, which is 10% of our cluster’s disk space.
Mitigating Risks
We had multiple tables in Redshift to migrate, and instead of doing them all in one go, we planned the migration for each table individually. Starting with the smallest tables kept the risk down and let us build up our understanding with each migration.
First Migration Execution
First, we started off with our 1.3 TB table. Based on the metrics we gathered beforehand, here are the estimates we had:
Estimated time for copy operation: 8 hrs
Estimated temporary disk usage: 1.3 TB
Estimated system recovery time: Unknown
Table disk usage: 1.3 TB
Available disk space: 6.5 TB

Plan:
- Pause new writes into old table
- Create new table with proper sort-keys and column encoding
- Copy data from old table to new table
Rollback plan up to this point: drop the new table and resume writes to the old table
- Rename the old table to a temporary name
- Rename the new table to the old table name
- Resume writes, which now go to the new table under the old name
- Drop the renamed old table
This seemed foolproof. We had a plan in place if anything went wrong.
First Migration Retrospectives
The migration was a success. Here are the actual stats:
Actual time for copy operation: 7 hrs
Actual temporary disk usage: 2.95 TB
System recovery time: 2.5 hrs
Final table disk usage: 300 GB

What did we learn from this migration?
The copy operation took a lot more temporary disk space than we expected
It took Redshift:
- 2 hours to plan and allocate space for copy
- 3 hours to do the actual copy
- 2 hours to merge all the memory blocks and release space
This was good to know since we wouldn’t be able to use the same migration plan for our 3.76TB table.
We found a vacuum pattern that we weren’t aware of
Before our migration started, vacuum was already running. We knew it was going to take about 2 days to complete, so we manually killed it. In the process, we realized that vacuum was running with the default sort threshold of 95%. This means that when you run the Redshift vacuum full command, the sort step is skipped if the table is already more than 95% sorted. For our tables that receive data every day, repeatedly skipping the sort let a huge unsorted region build up.
The above graph is how we noticed this trend: roughly every fifth day, the total vacuum time spiked and took more than a day to complete.
Our follow-up was to make sure vacuum full runs with a 99% sort threshold on tables that receive large volumes of daily writes.
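Redshift lets you set that threshold directly on the command, so the follow-up amounts to running something like:

-- Sort unless the table is already at least 99 percent sorted
VACUUM FULL analytics_events TO 99 PERCENT;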
Other Migrations
With the first migration completed, we had the confidence to migrate any tables that had disk sizes smaller than 1TB. We went ahead and migrated 3 other tables. Of course, we left the old tables hanging around to do some query performance benchmarking.
Using the same query performance technique described in the previous section, we compared the old sort key table schema and the new sort key table schema. We got amazing performance gains against the old schema. Each table had slightly different gains, but all in all, it was a huge success.
+---------+------------+------------+
| Table | Cold Cache | Warm Cache |
+---------+------------+------------+
| Table_1 | 80 ~ 153% | 46 ~ 204% |
| Table_2 | 65 ~ 88% | 62 ~ 90% |
| Table_3 | 2 ~ 10% | 3 ~ 5% |
| Table_4 | -22 ~ 349% | 55 ~ 92% |
| Table_5 | -88 ~ -65% | -90 ~ -62% | <- 🤨 Huh?
+---------+------------+------------+

Interestingly, some of the tables we migrated with column encoding performed worse than the old schema on a cold cache. The only common trait these tables shared was that they were small enough to fit inside a single node’s memory. Our theory is that compression allows more data to fit into memory, which lowers I/O and improves query performance; but when the table itself is small enough to fit in memory anyway, compression just introduces a decompression penalty. Unfortunately, we don’t have enough samples to confirm this, so we decided to roll back any tables that weren’t near the memory-size threshold to ensure we keep the best query performance.
3.7 TB Migration
We knew we couldn’t migrate this massive table with our original plan, but at the very least, we freed up enough Redshift disk space to make this migration a bit more flexible.
Approach 1 — Rebuild table from raw data
We had a good idea of how much disk space the final table would take (about 1 TB). Why not simply rebuild the data from our data lake using a Luigi workflow?
Plan:
- Create new table
- Set up a new Luigi workflow that backfills the new table at a lower task priority
It didn’t take long before we abandoned this path. Here’s what was wrong with this approach:
- It took 2 hours for 1 day’s worth of data (roughly 9 days to complete)
- The backfill completion rate was highly dependent on what other tasks were running
- The rows would not be sorted on insert, so the initial vacuum would take forever
Approach 2 — Multi-batch Copy into New Table
We knew Redshift could sort data on insert when it is loaded with a copy, so we decided to try rebuilding the table in multiple copy batches.
Plan:
- Copy data into new table in 3 batches
- At the last batch, stop writes to old table and complete migration as before
- Resume writes to table
This approach didn’t work out either, and here’s what we learnt from it:
We were hoping that, by using copy operations, our subsequent batches would be sorted on insert. That didn’t happen, and we ended up with a 66% unsorted table that would have taken days to vacuum.
We also tried copying data from one column-encoded table to another. This used more than 3 times the disk space during the copy operation, because Redshift decompresses the data to copy it from one table to another.
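Because of surprises like this, it is worth watching overall cluster disk usage while a large copy runs; one simple check is the stv_partitions system view, along the lines of the documented example:

-- Cluster-wide disk usage in GB (stv_partitions reports 1 MB blocks)
SELECT SUM(used) / 1024     AS used_gb,
       SUM(capacity) / 1024 AS capacity_gb
FROM stv_partitions
WHERE part_begin = 0;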
Final Approach — Copy only 50% of data
We got to the point where we knew we needed to compromise on the solution. So we decided on the following:
Plan:
- Copy only the last 50% of data into the new table that will be getting daily writes
- Copy the other 50% of data into another table that we will label as historical data
- Create a view that joins these 2 tables together for querying
Queries that reached back more than 50 days would be about 20% slower, but this penalty would go away within 50 days.
Results
+-------------------------------+-----------+---------+
|                               | Before    | After   |
+-------------------------------+-----------+---------+
| Total disk space usage        | 6.4 TB    | 1.7 TB  |
| Average vacuum time           | 360 mins  | 60 mins |
| Maximum vacuum time           | 1598 mins | 90 mins |
| Query performance gain (cold) | -         | +115%   |
| Query performance gain (warm) | -         | +75%    |
+-------------------------------+-----------+---------+

It wasn’t easy, but we learnt a lot from the migrations we performed over these 2 weeks, and we met the criteria we set out when planning and executing them:
- The migration itself must minimize impact to downstream systems
- A rollback plan must be defined for any high impacting tasks
- The impact of each migration feature must be positive
None of these migrations would have been possible if we weren’t confident in our own systems and, most importantly, comfortable making mistakes.
References
- Amazon — Top 10 Performance Tuning Techniques for Amazon Redshift
- Amazon Redshift Best Practices for Loading Data
- Amazon — Loading Your Data in Sort Key Order
- Amazon Redshift — Managing Volume of Merged Rows
- Amazon Redshift — Using Time Series Tables
- Amazon Redshift — Vacuuming Tables
- Amazon Redshift — Choosing Sort Keys
- Amazon Redshift — Resizing Clusters in Amazon Redshift
- Amazon Redshift — Data Warehouse System Architecture
- SILOTA — 4 Reasons why it’s time to rethink database views on Redshift by Ganesh Swami
- SNOWPLOW Analytics — A thread about Redshift views
- Amazon Forum — A thread about a bug with Redshift views
