Accelerating Netezza to Snowflake Migrations

Using Hashmap Data Migrator (hdm)

Jhimli Bora
Hashmap, an NTT DATA Company
7 min read · Jan 11, 2021


by Jhimli Bora and John Aven

Hashmap Data Migrator (hdm) makes it easy to migrate data from one data platform to another. It supports both on-prem and cloud data warehouse migration. hdm is designed to be very flexible in terms of data source & destination, state management system, and staging location.

Note: This blog post is part of the Hashmap Data Migrator series. Refer to the previous posts in the series for more detail on hdm (Hashmap Data Migrator) terminology.

Let's look at a use case for hdm. Below we will see how easy it is to configure and start migrating data from Netezza to Snowflake (a high-demand migration combination among our client base), using the local file system to stage data files for manipulation (e.g., chunking large files) and Azure Blob Storage as the storage stage for loading data into Snowflake.

(Architecture diagram: Netezza → local FS → Azure Blob Storage → Snowflake; FS = file system)

We can use either a JDBC or an ODBC driver to connect to Netezza. There are also two ways to offload data: a Netezza source (execute a SQL query and load the result into a pandas DataFrame) and a Netezza external table source (execute a SQL query that creates an external table and writes the data to a file).
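To make the two offload paths concrete, here is a minimal sketch of how each source might be declared in a pipeline stage. The key names and type spellings are illustrative assumptions, not the exact hdm schema; check the hdm documentation for the real field names.

```yaml
# Option 1: query-based Netezza source - hdm runs the SQL and loads the
# result into a pandas DataFrame before handing it to the next stage.
source:
  type: netezza_source            # illustrative type name
  conn: netezza_jdbc              # connection profile from the profile YAML
  query: SELECT * FROM SALES.PUBLIC.ORDERS
---
# Option 2: external-table Netezza source - hdm issues a CREATE EXTERNAL
# TABLE statement so Netezza itself writes the data straight to a file.
source:
  type: netezza_external_table_source   # illustrative type name
  conn: netezza_odbc
  query: SELECT * FROM SALES.PUBLIC.ORDERS
```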

To get started we just need to create 2 configuration files:

  • profile YAML: holds the connection information
  • pipeline YAML: holds the stage information

Profile YAML

This file stores the connection information for the source, stage, sink, and the database used for state management. It is stored on the local FS, and its path is set in the environment variable "HOME". The YAML format below is based on Netezza to Snowflake data transport using Azure Blob staging:
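A minimal sketch of what such a profile YAML might contain for this Netezza to Azure Blob to Snowflake setup is shown below. The connection names (netezza_jdbc, azure_blob, snowflake, state_manager) and all key names are illustrative assumptions; consult the hdm documentation for the exact schema.

```yaml
# Profile YAML (stored under $HOME) - connection profiles keyed by environment.
# All key names below are illustrative, not the exact hdm schema.
prod:
  netezza_jdbc:                    # Netezza source connection (JDBC driver)
    host: netezza.example.internal
    port: 5480
    database: SALES
    user: nz_user
    password: ${NZ_PASSWORD}       # e.g. injected from a secret store (illustrative)
  azure_blob:                      # storage stage on the way into Snowflake
    account_name: myblobaccount
    account_key: ${AZURE_BLOB_KEY}
    container: data
  snowflake:                       # sink
    account: xy12345.east-us-2.azure
    warehouse: LOAD_WH
    database: ANALYTICS
    schema: PUBLIC
    user: sf_user
    password: ${SF_PASSWORD}
  state_manager:                   # database used for state management (SQLite here)
    connection: hdm_state.db
```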

Pipeline YAML

Until a front end gets built, this is the file the user will focus on. It is merely a configuration file: the parameters given by the user flow to the relevant classes, which move the data from source to sink.

Let's discuss the sections in this file.

Orchestrator:

The orchestrator (an internal hdm concept) orchestrates the execution of your pipeline. The options are:

  • declared_orchestrator — for manual or fully specified execution
  • batch_orchestrator — for when orchestration is defined in a fully specified batch
  • auto_batch_orchestrator — for when the execution is across all tables in specified combinations of databases and schemas

It is formatted in the YAML by specifying the orchestrator type: declared_orchestrator, batch_orchestrator, or auto_batch_orchestrator, as sketched below.
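A minimal sketch of the three variants, using an assumed type key (the real hdm key names may differ):

```yaml
# declared_orchestrator - manual / fully specified execution
orchestrator:
  type: declared_orchestrator
---
# batch_orchestrator - orchestration defined in a fully specified batch
orchestrator:
  type: batch_orchestrator
---
# auto_batch_orchestrator - runs across all tables in the given databases/schemas
orchestrator:
  type: auto_batch_orchestrator
```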

State Manager:

Next, the State Manager is specified; this should be consistent across all of the pipelines. It is the glue that couples the otherwise independent portions of the pipeline together. The options are:

  • SqLiteStateManager — indicates that SQLite is used for state management
  • MySQLStateManager— indicates that MySQL is used for state management
  • SQLServerStateManager — indicates that SQL Server is used for state management
  • AzureSQLServerStateManager — indicates that Azure SQL Server is used for state management

It is formatted in the YAML as such:
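A minimal sketch, again with illustrative key names:

```yaml
# State manager - couples the otherwise independent pipeline stages together.
state_manager:
  type: SqLiteStateManager    # or MySQLStateManager, SQLServerStateManager,
                              # AzureSQLServerStateManager
  conn: state_manager         # connection profile defined in the profile YAML
```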

Declared data links:

In this section, we define the declared data links.
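As an illustration (key and type names assumed, not the exact hdm schema), a declared data link fully spells out where a stage reads from and writes to:

```yaml
declared_data_links:
  stages:
    - source:
        type: netezza_source
        conn: netezza_jdbc
        query: SELECT * FROM SALES.PUBLIC.ORDERS
      sink:
        type: local_fs_sink            # stage the rows as local files
        directory: /tmp/hdm/orders
```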

Template data links:

In this section, we define the templated data links. These are used along with declared_data_links when using the batch orchestrator.
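A sketch of a templated link, with assumed placeholder syntax and key names; the batch orchestrator would expand it once per table supplied through the batch definition:

```yaml
template_data_links:
  stages:
    - source:
        type: netezza_source
        conn: netezza_jdbc
        query: SELECT * FROM {database}.{schema}.{table}   # placeholders filled per table
      sink:
        type: local_fs_sink
        directory: /tmp/hdm/{table}
```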

Pipeline YAML example:

Below is the final pipeline YAML with declared orchestration, SQLite for state management, a chunk size of 200 rows per file, a Snowflake storage stage named TMP_HDM, an Azure Blob container named data, and netezza_jdbc as the Netezza environment.
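The sketch below strings together three declared links (Netezza to local FS, local FS to Azure Blob, Azure Blob to Snowflake). The key names, type names, and table identifiers are illustrative assumptions; only the chunk size, stage name, container, and connection name come from the description above.

```yaml
# pipeline.yml - Netezza -> local FS -> Azure Blob -> Snowflake (illustrative keys)
orchestrator:
  type: declared_orchestrator

state_manager:
  type: SqLiteStateManager
  conn: state_manager

declared_data_links:
  stages:
    # 1. Offload from Netezza to the local file system, chunked at 200 rows per file.
    - source:
        type: netezza_source
        conn: netezza_jdbc              # Netezza environment from the profile YAML
        query: SELECT * FROM SALES.PUBLIC.ORDERS
      sink:
        type: local_fs_sink
        directory: /tmp/hdm/orders
        chunk_size: 200                 # rows per staged file
    # 2. Push the staged files into the Azure Blob container.
    - source:
        type: local_fs_source
        directory: /tmp/hdm/orders
      sink:
        type: azure_blob_sink
        conn: azure_blob
        container: data
    # 3. Load from blob storage into Snowflake through the named storage stage.
    - source:
        type: azure_blob_source
        conn: azure_blob
        container: data
      sink:
        type: snowflake_sink
        conn: snowflake
        stage: TMP_HDM                  # Snowflake storage stage
        table: ANALYTICS.PUBLIC.ORDERS
```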

Now, let's discuss how to run hdm and the prerequisites.

Catalog

Before we run our code, when we are migrating data from one database to another, we must:

  1. Catalog the existing assets.
  2. Map the assets in the source system to the target system.

Run

Now that the environment is specified, pipeline defined, and so on, all that remains is to run the code. The code is executed from bash (or at the terminal) through:
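An illustrative invocation, assuming the entry point is named hdm and the parameters below map directly to command-line flags (the exact command and flag spellings may differ; check the hdm README):

```bash
hdm --manifest pipeline.yml --log_settings log_settings.yml --env prod
```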

The parameters are:

  • manifest — path of the manifest to run
  • log_settings — log settings path, default value = "log_settings.yml"
  • env — environment from which to take connection information, default value = "prod"

Watch a demo of hdm in this Hashmap Megabyte video:

Final Thoughts

Above we showed how easy it is to start migrating data from one data warehouse to another using a Netezza to Snowflake use case.

Hashmap Data Migrator is very flexible. Here’s why:

  • It allows the migration of data from one type of data warehouse to another (cloud to cloud, on-prem to cloud, cloud to on-prem).
  • It allows various types of databases for state management (SQLite, MySQL, SQL Server, Azure SQL Server, PostgreSQL, MongoDB).
  • It allows various staging environments (local file system, Azure Blob Storage, AWS S3, Google Cloud Storage).

Ready to Accelerate Your Digital Transformation?

Hashmap, an NTT DATA Company, offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud service offerings. We would be glad to work through your specific requirements. Reach out to us here.

Hashmap’s Data & Cloud Migration and Modernization Workshop is an interactive, two-hour experience for you and your team to help understand how to accelerate desired outcomes, reduce risk, and enable modern data readiness. We’ll talk through options and make sure that everyone has a good understanding of what should be prioritized, typical project phases, and how to mitigate risk. Sign up today for our complimentary workshop.

Other Tools and Content You Might Like

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap's podcast Hashmap on Tap on Spotify, Apple, Google, and other popular streaming apps.

Jhimli Bora is a Cloud and Data Engineer with Hashmap, an NTT DATA Company, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Connect with her on LinkedIn.

John Aven, Ph.D., is the Director of Engineering at Hashmap, an NTT DATA Company, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure to connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.
