Accelerating Netezza to Snowflake Migrations

Using Hashmap Data Migrator (hdm)

Jhimli Bora
Hashmap, an NTT DATA Company
7 min read · Jan 11, 2021


by Jhimli Bora and John Aven

Hashmap Data Migrator (hdm) makes it easy to migrate data from one data platform to another. It supports both on-prem and cloud data warehouse migration. hdm is designed to be very flexible in terms of data source & destination, state management system, and staging location.

Note: This blog post is part of the Hashmap Data Migrator series. Refer to the previous posts in the series for more detail on hdm (Hashmap Data Migrator) terminology.

Let's look at a use case for hdm. Below we will see how easy it is to configure and start migrating data from Netezza to Snowflake (a high-demand migration combination among our client base), using the local file system to stage data files for manipulation (e.g., chunking large files) and Azure Blob Storage as the storage stage for loading data into Snowflake.

(Architecture diagram: Netezza → local FS → Azure Blob Storage → Snowflake; FS = file system)

We can use either a JDBC or an ODBC driver to connect to Netezza. There are also two ways to offload data: a Netezza source (execute a SQL query and load the result into a pandas DataFrame) and a Netezza external table source (execute a SQL query that creates an external table and writes the data to a file).
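To make the two offload paths concrete, here is a minimal sketch of how each source might be declared in a pipeline stage. The key names and type spellings are illustrative assumptions, not the exact hdm schema; check the hdm documentation for the real field names.

```yaml
# Option 1: query-based Netezza source - hdm runs the SQL and loads the
# result into a pandas DataFrame before handing it to the next stage.
source:
  type: netezza_source            # illustrative type name
  conn: netezza_jdbc              # connection profile from the profile YAML
  query: SELECT * FROM SALES.PUBLIC.ORDERS
---
# Option 2: external-table Netezza source - hdm issues a CREATE EXTERNAL
# TABLE statement so Netezza itself writes the data straight to a file.
source:
  type: netezza_external_table_source   # illustrative type name
  conn: netezza_odbc
  query: SELECT * FROM SALES.PUBLIC.ORDERS
```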

To get started we just need to create 2 configuration files:

  • profile YAML: holds the connection information
  • pipeline YAML: holds the stage information

Profile YAML

This file stores the connection information for the source, stage, sink, and the database used for state management. It is stored on the local FS, and its path is set in the environment variable "HOME". The YAML format below is based on Netezza to Snowflake data transport using Azure Blob staging:
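A minimal sketch of what such a profile YAML might contain for this Netezza to Azure Blob to Snowflake setup is shown below. The connection names (netezza_jdbc, azure_blob, snowflake, state_manager) and all key names are illustrative assumptions; consult the hdm documentation for the exact schema.

```yaml
# Profile YAML (stored under $HOME) - connection profiles keyed by environment.
# All key names below are illustrative, not the exact hdm schema.
prod:
  netezza_jdbc:                    # Netezza source connection (JDBC driver)
    host: netezza.example.internal
    port: 5480
    database: SALES
    user: nz_user
    password: ${NZ_PASSWORD}       # e.g. injected from a secret store (illustrative)
  azure_blob:                      # storage stage on the way into Snowflake
    account_name: myblobaccount
    account_key: ${AZURE_BLOB_KEY}
    container: data
  snowflake:                       # sink
    account: xy12345.east-us-2.azure
    warehouse: LOAD_WH
    database: ANALYTICS
    schema: PUBLIC
    user: sf_user
    password: ${SF_PASSWORD}
  state_manager:                   # database used for state management (SQLite here)
    connection: hdm_state.db
```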

Pipeline YAML

Until a front end gets built, this is the file the user will focus on. It is merely a configuration file: the parameters given by the user flow to the relevant classes, which move the data from source to sink.

Let's discuss the sections in this file.

Orchestrator:

The orchestrator (an internal hdm concept) orchestrates the execution of your pipeline. The options are:

  • declared_orchestrator — for manual or fully specified execution
  • batch_orchestrator — for when orchestration is defined in a fully specified batch
  • auto_batch_orchestrator — for when the execution is across all tables in specified combinations of databases and schemas

It is formatted in the YAML by specifying the orchestrator type: declared_orchestrator, batch_orchestrator, or auto_batch_orchestrator, as sketched below.
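A minimal sketch of the three variants, using an assumed type key (the real hdm key names may differ):

```yaml
# declared_orchestrator - manual / fully specified execution
orchestrator:
  type: declared_orchestrator
---
# batch_orchestrator - orchestration defined in a fully specified batch
orchestrator:
  type: batch_orchestrator
---
# auto_batch_orchestrator - runs across all tables in the given databases/schemas
orchestrator:
  type: auto_batch_orchestrator
```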

State Manager:

Next, the State Manager is specified; this should be consistent across all of the pipelines. It is the glue that couples the otherwise independent portions of the pipeline together. The options are:

  • SqLiteStateManager — indicates that SQLite is used for state management
  • MySQLStateManager— indicates that MySQL is used for state management
  • SQLServerStateManager — indicates that SQL Server is used for state management
  • AzureSQLServerStateManager — indicates that Azure SQL Server is used for state management

It is formatted in the YAML as such:
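A minimal sketch, again with illustrative key names:

```yaml
# State manager - couples the otherwise independent pipeline stages together.
state_manager:
  type: SqLiteStateManager    # or MySQLStateManager, SQLServerStateManager,
                              # AzureSQLServerStateManager
  conn: state_manager         # connection profile defined in the profile YAML
```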

Declared data links:

In this section, we define the declared data links.
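As an illustration (key and type names assumed, not the exact hdm schema), a declared data link fully spells out where a stage reads from and writes to:

```yaml
declared_data_links:
  stages:
    - source:
        type: netezza_source
        conn: netezza_jdbc
        query: SELECT * FROM SALES.PUBLIC.ORDERS
      sink:
        type: local_fs_sink            # stage the rows as local files
        directory: /tmp/hdm/orders
```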

Template data links:

In this section, we define the templated data links. These are used along with declared_data_links when using the batch orchestrator.
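A sketch of a templated link, with assumed placeholder syntax and key names; the batch orchestrator would expand it once per table supplied through the batch definition:

```yaml
template_data_links:
  stages:
    - source:
        type: netezza_source
        conn: netezza_jdbc
        query: SELECT * FROM {database}.{schema}.{table}   # placeholders filled per table
      sink:
        type: local_fs_sink
        directory: /tmp/hdm/{table}
```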

Pipeline YAML example:

Below is the final pipeline YAML with declared orchestration, SQLite for state management, a chunk size of 200 rows per file, a Snowflake storage stage named TMP_HDM, an Azure Blob container named data, and netezza_jdbc as the Netezza environment.
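The sketch below strings together three declared links (Netezza to local FS, local FS to Azure Blob, Azure Blob to Snowflake). The key names, type names, and table identifiers are illustrative assumptions; only the chunk size, stage name, container, and connection name come from the description above.

```yaml
# pipeline.yml - Netezza -> local FS -> Azure Blob -> Snowflake (illustrative keys)
orchestrator:
  type: declared_orchestrator

state_manager:
  type: SqLiteStateManager
  conn: state_manager

declared_data_links:
  stages:
    # 1. Offload from Netezza to the local file system, chunked at 200 rows per file.
    - source:
        type: netezza_source
        conn: netezza_jdbc              # Netezza environment from the profile YAML
        query: SELECT * FROM SALES.PUBLIC.ORDERS
      sink:
        type: local_fs_sink
        directory: /tmp/hdm/orders
        chunk_size: 200                 # rows per staged file
    # 2. Push the staged files into the Azure Blob container.
    - source:
        type: local_fs_source
        directory: /tmp/hdm/orders
      sink:
        type: azure_blob_sink
        conn: azure_blob
        container: data
    # 3. Load from blob storage into Snowflake through the named storage stage.
    - source:
        type: azure_blob_source
        conn: azure_blob
        container: data
      sink:
        type: snowflake_sink
        conn: snowflake
        stage: TMP_HDM                  # Snowflake storage stage
        table: ANALYTICS.PUBLIC.ORDERS
```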

Now, let's discuss how to run hdm and the prerequisites.

Catalog

Before we run our code, when we are migrating data from one database to another, we must:

  1. Catalog the existing assets.
  2. Map the assets in the source system to the target system.

Run

Now that the environment is specified, pipeline defined, and so on, all that remains is to run the code. The code is executed from bash (or at the terminal) through:
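An illustrative invocation, assuming the entry point is named hdm and the parameters below map directly to command-line flags (the exact command and flag spellings may differ; check the hdm README):

```bash
hdm --manifest pipeline.yml --log_settings log_settings.yml --env prod
```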

The parameters are:

  • manifest — path of the manifest to run
  • log_settings — log settings path, default value = "log_settings.yml"
  • env — environment from which to take connection information, default value = "prod"

Watch a demo of hdm in this Hashmap Megabyte video:

Final Thoughts

Above we showed how easy it is to start migrating data from one data warehouse to another using a Netezza to Snowflake use case.

Hashmap Data Migrator is very flexible. Here’s why:

  • It allows the migration of data from one type of data warehouse to another (cloud to cloud, on-prem to cloud, cloud to on-prem).
  • It allows various types of databases for state management (SQLite, MySQL, SQL Server, Azure SQL Server, PostgreSQL, MongoDB).
  • It allows various staging environments (local file system, Azure Blob Storage, AWS S3, Google Cloud Storage).

Ready to Accelerate Your Digital Transformation?

Hashmap, an NTT DATA Company, offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud service offerings. We would be glad to work through your specific requirements. Reach out to us here.

Hashmap’s Data & Cloud Migration and Modernization Workshop is an interactive, two-hour experience for you and your team to help understand how to accelerate desired outcomes, reduce risk, and enable modern data readiness. We’ll talk through options and make sure that everyone has a good understanding of what should be prioritized, typical project phases, and how to mitigate risk. Sign up today for our complimentary workshop.

Other Tools and Content You Might Like

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap's podcast Hashmap on Tap on Spotify, Apple, Google, and other popular streaming apps.

Jhimli Bora is a Cloud and Data Engineer with Hashmap, an NTT DATA Company, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Connect with her on LinkedIn.

John Aven, Ph.D., is the Director of Engineering at Hashmap, an NTT DATA Company, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure to connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.
