Best practices: Data migration from Teradata to AWS, Snowflake

With the increasing adoption of the cloud and of new architecture patterns and players (e.g., lake house architectures, Snowflake), we are seeing more customers move their historical data to the cloud to gain the advantages of cost, agility, scalability, and durability. Some customers, for regulatory reasons, want to keep legacy application data as-is in the cloud and make the data sets accessible in times of need (e.g., audits). This is quite significant in the healthcare industry.

We had the opportunity to help multiple customers with their historical data migration journey over the last quarter. Here we collate our experience from one typical use case: migrating data warehouse data from Teradata to Snowflake.

Context: The customer has a VPN connection to AWS, more than a hundred terabytes of data, and limited time to move the data to the cloud.

Top 10 Migration prerequisites:

Below are the top 10 precursor activities recommended for a successful cloud migration.

  1. Collect the object inventory: This exercise also involves working with stakeholders to eliminate redundant, unused, temporary, and user-created objects. You also need to come up with matching target data types for the source objects.
  2. Gather storage statistics: Determine the amount of storage associated with the above objects. This is an important step in estimating the data volume and the time the transfer may take (a query sketch follows this list). If the storage is more than 50 TB and time is of the essence, consider using the Snowball family of services.
  3. Network speed: Work with your platform team to determine the bandwidth and connectivity available between your on-premises servers and AWS (e.g., Direct Connect vs. VPN, region/location of the source and target, etc.).
  4. Know compression rates: Different databases use different compression techniques. This is tricky, and it is also an important input for sizing the on-premises staging servers you need to procure.
  5. Access patterns: Understand who will use this data, how frequently, and how quickly they expect to see it, in order to classify the data as hot, warm, or cold.
  6. Type of retrieval: Understand how data retrieval requests will arrive. In many cases (at least in the initial years after migration), the data steward community still talks in the language of the existing on-premises system, e.g., "Can we get the Payer_Plan_reltn table?" Use this to determine the best access layer.
  7. Lead time for staging servers: Most of the available solutions need somewhere to store the extracted data. Because this is on-premises, we all know the pain of procuring an additional server; the time required has to be baked into the plan.
  8. Timelines: Understand the timelines for the migration. Sometimes (at least in our case) there are hard deadlines to decommission the source database, driven by cost savings, data center shutdowns, and so on.
  9. Ideal time for extracts: Check with your DBA or support team for an ideal window to extract the data without impacting business operations, and look for alternatives such as a sandbox environment.
  10. Security approvals: Last but not least, security teams have strict policies on moving data out of their landscape to protect its privacy. Lead time for security approvals needs to be planned well in advance.
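To make items 1 and 2 concrete, here is a minimal sketch of how the object inventory and storage statistics can be pulled from Teradata's data dictionary. It assumes the open-source teradatasql Python driver and placeholder connection details; the DBC views queried (TablesV, TableSizeV) are standard, but the database filter will differ per landscape.

```python
# Sketch: collect object inventory and storage statistics from Teradata's
# data dictionary (DBC views). Host, user, and database filter are placeholders.
import teradatasql

INVENTORY_SQL = """
SELECT t.DatabaseName,
       t.TableName,
       t.TableKind,                        -- 'T' = table, 'V' = view, etc.
       SUM(s.CurrentPerm) / 1024**3 AS SizeGB
FROM   DBC.TablesV t
LEFT JOIN DBC.TableSizeV s
       ON s.DatabaseName = t.DatabaseName
      AND s.TableName    = t.TableName
WHERE  t.DatabaseName NOT IN ('DBC', 'SYSLIB', 'SystemFe')   -- skip system databases
GROUP BY 1, 2, 3
ORDER BY SizeGB DESC
"""

con = teradatasql.connect(host="td-prod.example.com",   # placeholder host
                          user="migration_user",
                          password="********")
cur = con.cursor()
cur.execute(INVENTORY_SQL)
rows = cur.fetchall()
con.close()

total_tb = sum(r[3] or 0 for r in rows) / 1024
print(f"{len(rows)} objects, ~{total_tb:.1f} TB of permanent space")
```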

Note: When you use a Snowball Edge to import data into S3, all data transferred to the device is protected by SSL encryption over the network. To protect data at rest, AWS Snowball Edge uses server-side encryption (SSE).

Solution Tenets:

We evaluated multiple solutions to meet our customers' needs. One distinguishing requirement that drove our custom solution was the ability to make any of the historical data available to end users for compliance and regulatory purposes; the target database in our case is Snowflake, and we could not find an off-the-shelf solution that met this requirement. The limited timeline to complete the migration was another important factor.

  1. Direct upload of the data to AWS: This is the easiest and most straightforward approach. However, with the available bandwidth (the data transfer rate versus hundreds of terabytes of data), it was estimated to take months. It is suitable only for small data volumes.
  2. AWS SCT/DMS: This AWS service is efficient in most cases; however, we could not fully leverage it due to the following limitations:
  • Supports only an AWS Redshift cluster as the target/destination.
  • No control over field delimiters, date formats, resource allocation for extraction, customized/user-friendly structure for the extracted data, encoding style, etc.
  • The extracted data in the S3 bucket is difficult to understand without loading it into a Redshift cluster, and the effort grows with the volume of data migrated.
  • A schema is the minimum task scope; individual tables or subsets of tables cannot be migrated when using Snowball Edge.
  • The LOB mode limits LOB size to 32 KB; LOBs larger than 32 KB aren't migrated.

TPT with Snowball Edge solution: The high-level solution consists of the following steps.

  • Step 0: Developed a utility to convert the DDLs from Teradata to Snowflake and created the required tables in Snowflake (a type-mapping sketch follows this list). Designed a framework consisting of driver tables in RDS to drive the migration.
  • Steps 1, 1.1 (Extraction): TPT scripts were developed to extract the data from Teradata to NFS storage mounted on the staging server(s). The extracted data on NFS is then compressed.
  • The jobs were divided into batches, and the batches were run in parallel to speed up the extraction (see the parallel-extraction sketch after this list).
  • The metadata required to drive the jobs (source location, target location, table, execution status, etc.) was stored in MySQL.
  • Step 2: Extracted files were copied from NFS to the Snowball Edge device. This step assumes that the Snowball has been ordered and connectivity to the on-premises Linux staging servers has been established (see the copy sketch after this list).
  • Step 3: Small data sets were copied directly over the Internet using the AWS Command Line Interface.
  • Steps 4, 5: The Snowball was shipped back to AWS, and AWS imported the data into S3. Use the automatically generated Snowball import job report to verify the status of the objects imported into the given S3 bucket, and take action on any failed objects if required.
  • Step 6: The tables identified as needing to remain available to end users were created in Snowflake as external tables (see the external-table sketch after this list).
  • Step 7: Lifecycle policies were created to ensure that cold data is transitioned to Glacier for cost savings (a lifecycle-policy sketch follows this list). This step also includes developing automated mechanisms to quickly fetch data back from Glacier to S3 and on to Snowflake.
  • Step 8: Metadata was updated with completion status, record counts, etc. This acts as a one-stop mechanism to trace the lineage from a Teradata table to its S3 folder location and to the Snowflake external table (where applicable).
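The DDL-conversion utility from Step 0 is not reproduced in this article; the sketch below only illustrates the core idea, mapping Teradata column types to Snowflake equivalents, with a hypothetical convert_column helper and a deliberately incomplete type map.

```python
# Sketch of the Step 0 idea: translate Teradata column types to Snowflake
# equivalents. Only a handful of common types are mapped; the real utility
# also has to handle CHARACTER SET, MULTISET, PPI, identity columns, etc.
import re

TD_TO_SF = {
    "BYTEINT":   "NUMBER(3,0)",
    "SMALLINT":  "NUMBER(5,0)",
    "INTEGER":   "NUMBER(10,0)",
    "BIGINT":    "NUMBER(19,0)",
    "DECIMAL":   "NUMBER",        # precision/scale carried over below
    "FLOAT":     "FLOAT",
    "DATE":      "DATE",
    "TIMESTAMP": "TIMESTAMP_NTZ",
    "CHAR":      "CHAR",
    "VARCHAR":   "VARCHAR",
    "CLOB":      "VARCHAR",       # Snowflake has no CLOB; VARCHAR max is 16 MB
}

def convert_column(td_type: str) -> str:
    """Map a Teradata column type such as 'DECIMAL(18,2)' to a Snowflake type."""
    m = re.match(r"^([A-Z_]+)\s*(\(.*\))?", td_type.strip().upper())
    base, args = m.group(1), m.group(2) or ""
    sf_base = TD_TO_SF.get(base, "VARCHAR")   # default to VARCHAR for unmapped types
    # Keep precision/scale or length arguments where the target type accepts them
    return sf_base + args if args and sf_base in ("NUMBER", "CHAR", "VARCHAR") else sf_base

print(convert_column("DECIMAL(18,2)"))   # -> NUMBER(18,2)
print(convert_column("VARCHAR(200)"))    # -> VARCHAR(200)
```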
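For Steps 1 and 1.1, here is a minimal sketch of the batching approach: each batch of tables is extracted by launching tbuild (the TPT job launcher) in parallel and compressing the output with pigz. The TPT script name, job-variable names, table list, and NFS path are placeholders, and the TPT script itself (Export operator plus DataConnector consumer) is not shown.

```python
# Sketch of Steps 1/1.1: run a batch of TPT extract jobs in parallel and
# compress the resulting files with pigz. The .tpt script, job variables,
# table list and NFS paths below are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NFS_DIR = "/mnt/nfs/extracts"   # NFS mount shared by the staging servers
TABLES = ["claims.payer_plan_reltn", "claims.member", "claims.provider"]  # one batch

def extract_and_compress(table: str) -> tuple[str, int]:
    schema, name = table.split(".")
    # Launch the TPT job; export_template.tpt reads the SourceTable/TargetFile job variables
    job = subprocess.run(
        ["tbuild", "-f", "export_template.tpt",
         "-u", f"SourceTable='{table}', TargetFile='{NFS_DIR}/{schema}/{name}.dat'",
         "-j", f"extract_{name}"],
        capture_output=True, text=True)
    if job.returncode == 0:
        # pigz = parallel gzip; much faster than gzip on multi-core staging servers
        subprocess.run(["pigz", "-p", "8", f"{NFS_DIR}/{schema}/{name}.dat"], check=True)
    return table, job.returncode

# Run the batch with a bounded number of concurrent TPT sessions
with ThreadPoolExecutor(max_workers=4) as pool:
    for table, rc in pool.map(extract_and_compress, TABLES):
        status = "OK" if rc == 0 else f"FAILED (rc={rc})"
        print(f"{table}: {status}")
```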
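For Steps 2 and 3, a sketch of the copy from the NFS mount to the Snowball Edge device through its S3-compatible interface, assuming boto3 and placeholder values for the device endpoint, credentials (obtained via AWS OpsHub or the Snowball Edge client), and bucket. Removing the endpoint_url override turns the same loop into the direct-to-S3 copy used for the small data sets.

```python
# Sketch of Steps 2-3: push compressed extracts from the NFS mount to the
# Snowball Edge device through its S3-compatible endpoint. The device IP,
# port, credentials and bucket are placeholders.
import pathlib
import boto3

snowball = boto3.client(
    "s3",
    endpoint_url="https://192.0.2.10:8443",   # placeholder Snowball Edge S3 endpoint
    aws_access_key_id="SNOWBALL_ACCESS_KEY",
    aws_secret_access_key="SNOWBALL_SECRET_KEY",
    verify=False)                              # device typically uses a self-signed certificate

BUCKET = "historical-archive"                  # bucket chosen when the Snowball job was created
NFS_DIR = pathlib.Path("/mnt/nfs/extracts")

for path in NFS_DIR.rglob("*.dat.gz"):
    key = str(path.relative_to(NFS_DIR))       # keep the schema/table folder structure as the S3 prefix
    snowball.upload_file(str(path), BUCKET, key)
    print(f"copied {key}")
```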
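For Step 6, a sketch of exposing one imported S3 prefix as a Snowflake external table using the snowflake-connector-python driver. The storage integration, bucket, column list, and pipe delimiter are placeholders chosen to match the way the TPT extracts were assumed to be written.

```python
# Sketch of Step 6: expose an imported S3 prefix to end users as a Snowflake
# external table. Stage URL, storage integration, columns and delimiter are placeholders.
import snowflake.connector

ddl = [
    # Stage over the S3 prefix the Snowball import landed the files in
    """CREATE STAGE IF NOT EXISTS archive_stage
         URL = 's3://historical-archive/claims/payer_plan_reltn/'
         STORAGE_INTEGRATION = s3_archive_int
         FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' COMPRESSION = GZIP)""",
    # External table: for CSV files Snowflake exposes the columns as value:c1, c2, ...
    """CREATE EXTERNAL TABLE IF NOT EXISTS payer_plan_reltn_ext (
         payer_id VARCHAR AS (value:c1::VARCHAR),
         plan_id  VARCHAR AS (value:c2::VARCHAR),
         eff_date DATE    AS (value:c3::DATE)
       )
       LOCATION = @archive_stage
       FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' COMPRESSION = GZIP)
       AUTO_REFRESH = FALSE""",
]

con = snowflake.connector.connect(account="xy12345",      # placeholder account locator
                                  user="migration_user",
                                  password="********",
                                  warehouse="MIGRATION_WH",
                                  database="ARCHIVE_DB",
                                  schema="CLAIMS")
try:
    cur = con.cursor()
    for stmt in ddl:
        cur.execute(stmt)
finally:
    con.close()
```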
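For Step 7, a sketch of the lifecycle policy using boto3; the bucket, prefix, and 90-day threshold are illustrative. For the reverse path (bringing archived objects back when an audit request arrives), S3's RestoreObject API is the building block to automate.

```python
# Sketch of Step 7: transition cold extracts to Glacier after a grace period.
# Bucket name, prefix and the 90-day threshold are placeholders; prefixes that
# back Snowflake external tables should be excluded (or kept under a separate rule).
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="historical-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "cold-extracts-to-glacier",
                "Filter": {"Prefix": "claims/"},   # only the archived extracts
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"}
                ],
            }
        ]
    },
)
```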

Best Practices/Watch outs:

  • Staging servers: During the design stage, the team needs to do capacity planning to determine the number of staging servers, the configuration required (CPU, memory), the expected number of parallel jobs, etc.
  • It is recommended to use NFS storage sized to your data volumes and to mount it across more than one staging server; this protects against server failures and allows more parallelism, meaning you can trigger multiple parallel jobs across servers.
  • Teradata capacity: The Teradata system allocates resources statically to regular workloads and to TPT jobs separately, meaning neither category of jobs can consume the complete system resources at any point in time; the maximum for each is the quota defined per 'Teradata Database limits & Teradata Warehouse Manager limits.' Ref: https://docs.teradata.com/reader/j9~8T4F8ZcLkW7Ke0mxgZQ/vvRM4OXOayBt3IbWhvMmmg
  • Work with your Teradata system administrator to adjust those limits to your needs and get more parallelism for Export, Load, and Update operator TPT jobs.
  • Compression: We noticed that regular compression packages like gzip take more time and recommend packages like pigz, a parallel implementation of gzip. In most cases we observed compression ratios of roughly 8-9x relative to the original data size (see the estimation sketch after this list).
  • Monitoring: Use the nload utility to monitor network traffic and bandwidth. It is useful for measuring the data transfer rate between the staging Linux server and the Snowball Edge device and for estimating the time required to finish the data copy.
  • It is recommended to use the AWS OpsHub client tool to interact with the device and perform data copy operations through a GUI.
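A back-of-the-envelope way to combine the compression ratio and the throughput observed in nload into a copy-time estimate; all numbers below are illustrative, not measurements from this migration.

```python
# Back-of-the-envelope estimate: turn the raw data volume, the observed
# compression ratio and the throughput seen in nload into a copy-time figure.
raw_tb            = 120   # uncompressed Teradata data, in TB (illustrative)
compression_ratio = 8     # pigz typically achieved ~8-9x on this kind of data
mb_per_second     = 400   # staging server -> Snowball Edge throughput from nload (illustrative)

compressed_tb = raw_tb / compression_ratio
copy_hours = compressed_tb * 1024 * 1024 / mb_per_second / 3600
print(f"~{compressed_tb:.1f} TB compressed, ~{copy_hours:.0f} hours to copy")
```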

Note: I’m currently a Senior resident solutions architect at Snowflake. Opinions expressed are solely my own and do not represent the views or opinions of my employer or training department.

Originally published at https://www.linkedin.com.
