How We Use Hashmap Data Cataloger (hdc) in Cloud Data Migrations: Part 1

Chinmayee Lakkad
Hashmap, an NTT DATA Company
May 10, 2021 · 5 min read

This article is the first of a two-part series showcasing how we use the Hashmap Data Cataloger (hdc), a Python library created to assist in migrating on-premises data sources to the cloud.

Background

Modernizing one’s data infrastructure is a journey of a thousand miles. The goal is to build a more efficient system to serve existing business cases, or to go a step further and create new ones driven by data insights. No matter the end goal, though, modernization will either start from or culminate with that single step called data migration.

The current data migration trend shows a general beeline toward cloud platforms and away from on-premises infrastructure. In the cloud, the data can be organized in any of the following:

  • A data lake
  • A data warehouse
  • A combination of the two: a lake and a warehouse

Still, irrespective of the target data infrastructure, the migration process itself typically involves the following high-level activities (sketched in code just after the list):

  • Crawl or discover the data assets at the source that qualify as candidates for transfer
  • Map and replicate the container structures at the target to hold the data that’s moving in
  • Transfer the data into the newer platform
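
To make the pattern concrete, here is a minimal, runnable sketch of those three stages in Python. The function names, the toy in-memory “source,” and the data structures are illustrative assumptions only; they are not hdc’s actual API:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Asset:
        """One discovered data asset: a table plus its column definitions."""
        schema: str
        name: str
        columns: List[Tuple[str, str]]  # (column_name, data_type)

    def crawl(source: dict) -> List[Asset]:
        """Stage 1: discover the assets at the source that qualify for transfer."""
        return [Asset(s, t, cols) for (s, t), cols in source["tables"].items()]

    def map_assets(assets: List[Asset]) -> List[str]:
        """Stage 2: replicate container structures at the target as DDL statements."""
        return [
            f"CREATE TABLE {a.schema}.{a.name} ("
            + ", ".join(f"{col} {dtype}" for col, dtype in a.columns)
            + ")"
            for a in assets
        ]

    def transfer(assets: List[Asset]) -> None:
        """Stage 3: move the data itself into the newly created structures."""
        for a in assets:
            print(f"-- rows for {a.schema}.{a.name} would be copied here")

    # A toy catalog standing in for a real source system.
    source = {"tables": {("HR", "EMPLOYEES"): [("ID", "NUMBER"), ("NAME", "VARCHAR(100)")]}}
    assets = crawl(source)
    for ddl in map_assets(assets):
        print(ddl)
    transfer(assets)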

Opportunity

These are simple activities, but each contains many repetitive actions, both within itself and in the way it hands off to its logical successor, that can be abstracted out and automated.

Take, for example, the case of migrating data from Oracle to Snowflake. Given these two endpoints, a Data Engineer would likely work through the phases pictured below:

[Image: Data Migration Phases]

Once this is worked out, any future migration between the same endpoints can reuse the same methodology, logic, and processes to accelerate the task.

The same idea can then be extended to other combinations of endpoints. This makes the entire migration process repeatable and pattern-based, which lends itself well to automation. And with automation comes acceleration.
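
To make this concrete for the Oracle-to-Snowflake case, the Crawl and Map phases could be sketched as below. This is a simplified illustration, not hdc’s implementation: the type map covers only a handful of cases, the data-dictionary query ignores views and constraints, and the connection details are placeholders.

    import oracledb  # the python-oracledb driver

    # Simplified Oracle -> Snowflake type map; a real migration needs many more cases.
    TYPE_MAP = {
        "VARCHAR2": "VARCHAR",
        "NVARCHAR2": "VARCHAR",
        "CHAR": "CHAR",
        "NUMBER": "NUMBER",
        "FLOAT": "FLOAT",
        "DATE": "TIMESTAMP_NTZ",  # Oracle DATE carries a time component
        "CLOB": "VARCHAR",
    }

    def crawl_oracle(conn, owner: str) -> dict:
        """Crawl: read table and column metadata from Oracle's data dictionary."""
        sql = """
            SELECT table_name, column_name, data_type
            FROM all_tab_columns
            WHERE owner = :owner
            ORDER BY table_name, column_id
        """
        tables: dict = {}
        with conn.cursor() as cur:
            for table, column, dtype in cur.execute(sql, owner=owner):
                tables.setdefault(table, []).append((column, dtype))
        return tables

    def map_to_snowflake(tables: dict, schema: str) -> list:
        """Map: emit Snowflake DDL replicating the crawled structures."""
        statements = []
        for table, columns in tables.items():
            cols = ", ".join(f"{c} {TYPE_MAP.get(t, 'VARCHAR')}" for c, t in columns)
            statements.append(f"CREATE TABLE {schema}.{table} ({cols});")
        return statements

    # Usage (credentials are placeholders):
    # conn = oracledb.connect(user="app", password="***", dsn="dbhost/orclpdb1")
    # for stmt in map_to_snowflake(crawl_oracle(conn, "HR"), "MIGRATED"):
    #     print(stmt)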

Accelerator

There are excellent commercial cloud applications in the Data Integration space designed to tackle this. However, to fill the temporary gap when such tooling is not immediately available, our team at Hashmap developed an open-source suite of tools called the Hashmap Data Suite (hds) to assist in the data migration journey. The philosophy behind the suite is to provide an open-source, extendable collection of libraries that users (Data Engineers) can customize to their specific needs.

Under the hds umbrella, the Hashmap Data Cataloger (hdc) is one such library, designed to automate the Crawl and Map phases of a data migration journey.

[Image: Hashmap Data Suite]

The overall goal is for hdc to be used as an integrated part of a companion tool in the hds suite called the Hashmap Data Migrator (hdm).

hdc lays the groundwork by identifying the data assets in a source system and then replicating their structures in the target system (crawling the data assets and mapping their structures onto the target).

Once this stage is set, hdm takes over to orchestrate the actual data transport into the newly minted structures. Put together, hdc and hdm enable the complete process of data migration from point A to point B.

Furthermore, hdc has a CLI that allows it to be used independently to crawl and/or map data assets from a source to a target.

Up next

Part 2 of this series will take a deeper dive into hdc’s mechanics and explain the endpoints it can currently handle, its usage from the CLI or as a library, and the future roadmap.

Stay tuned!

Ready to Accelerate Your Digital Transformation?

At Hashmap, we work with our clients to build better together.

If you are considering moving data and analytics products and applications to the cloud, or if you would like help, guidance, and a few best practices for delivering higher-value outcomes in your existing cloud program, please contact us.

Hashmap, an NTT DATA Company, offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud service offerings. We would be glad to work through your specific requirements.

Hashmap’s Data & Cloud Migration and Modernization Workshop is an interactive, two-hour experience for you and your team to help understand how to accelerate desired outcomes, reduce risk, and enable modern data readiness. We’ll talk through options and make sure that everyone understands what should be prioritized, typical project phases, and how to mitigate risk. Sign up today for our complimentary workshop.

Chinmayee Lakkad is a Regional Technical Expert and Cloud/Data Engineer at Hashmap, an NTT DATA Company, providing Data, Cloud, IoT, and AI/ML solutions and expertise across industries alongside a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
