How We Use Hashmap Data Cataloger (hdc) in Cloud Data Migrations: Part 2
This is the second post in a two-part series on Hashmap Data Cataloger, a Python library created to assist in migrating data from on-premises sources to the cloud.
About
Hashmap Data Cataloger, or hdc, is an open-source Python library created to replicate schematic structures from system A to system B during a data migration between those endpoints. It is Apache 2.0 licensed and is available from PyPI as ‘hashmap-data-cataloger.’
To quickly recap from the previous blog post, data migrations from source to target systems usually follow a pattern of actions that can be broadly categorized as:
- Crawl: Connect with a source data system, search through the collection of data assets, and put together an inventory list of assets to move over.
- Map: With the assets discovered, map (translate) their structures to those made available by the target system. This may include data type mapping, structure creation, etc.
- Transfer: Once the underlying structures are created in the target system, i.e., the skeleton built, it's time to pour in the concrete — move the actual data in. This activity will include orchestrating the processes that extract the actual data from older structures and load it into the new structures.
hdc helps automate the first two phases above.
Originally, hdc was designed purely as a library exposing the APIs needed by a complementary utility, hdm, which performs the actual data extract and load within the larger Hashmap Data Suite collection of utilities. It has since evolved to include a CLI that lets it be used on its own, either to get a quick look at the data assets discovered or to perform the full structural mapping as preparation for imminent data movement.
As of this writing, hdc v0.2 supports migration from sources such as Netezza, Oracle, Hive, and HDFS to Snowflake.
Architecture
hdc has an Object-Oriented design and tries to follow the SOLID principles for a robust, modular, and extensible application architecture.
It is composed of the following elements. At its very core, hdc has three primary abstractions:
- Crawler: The interface that provides the search & discovery of data assets within a given source endpoint.
- Mapper: The interface that converts the discovered assets' schematic structures from the source format to their target-compliant counterparts.
- Creator: The interface that actually creates the mapped structures in the target endpoint.
Built on top of these core abstractions are two API-level abstractions:
- Cataloger: Provides the ‘obtain_catalog’ API. The Cataloger instantiates a concrete Crawler based on the given configuration and, through that object, retrieves the list of available data assets.
- Asset Mapper: Provides the ‘map_assets’ API. The Asset Mapper instantiates concrete Crawlers, Mappers, and Creators based on the given configuration and, through those, runs the end-to-end process of discovering data assets, mapping their structures for a target, and re-creating the mapped structures in that target.
These same library APIs are also exposed via the CLI module, which can be invoked as follows:
usage: hdc [-h] -r {catalog,map} -s source [-d destination] [-c config] [-l log_settings]
optional arguments:
-h, Show this help message and exit
-r {catalog,map}, One of ‘catalog’ or ‘map’
-s source, Name of any one of sources configured in hdc.yml
-d destination, Name of any one of destinations configured in hdc.yml
-c config, Path to application config (YAML) file if other than default ‘hdc.yml’
-l log_settings, Path to log settings (YAML) file if other than default
For example:
hdc -r catalog -s oracle
hdc -r map -s oracle -d snowflake
hdc -r map -s netezza -d snowflake
Another important aspect of hdc is that it is configuration-driven: it externalizes the metadata it needs as YAML configuration files, which it auto-discovers from the location pointed to by the HDS_HOME environment variable or from paths provided at the CLI.
Overall, there are three types of configuration files, each with a different purpose, explained in the following sections.
Application Configuration
The starting point for the tool is a file called hdc.yml, which defines the different endpoints as ‘sources’ and ‘destinations.’ Each source/destination is defined with a ‘type’ (class) and other relevant key-value properties under ‘conf.’ In addition to ‘sources’ and ‘destinations,’ there is a third section called ‘mappers,’ which captures the properties required to map a named source to a named destination.
The default version comes with pre-configured sources, destinations, and mappers that can be used as-is.
The user only needs to update the ‘conf.profile’ property for each source/destination. This is a reference property: it points to one of the connection profiles specified in the second configuration file used by the tool, ‘profile.yml.’
hdc.yml can be placed in the HDS_HOME directory or overridden from CLI with the ‘-c’ option.
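To make this concrete, a stripped-down hdc.yml along these lines might look like the sketch below. The class names, property values, and mapper keys shown are illustrative assumptions rather than the shipped defaults:

sources:
  oracle:                      # name passed to the -s option
    type: OracleCrawler        # assumed crawler class for this source
    conf:
      profile: oracle_dev      # points to a connection profile in profile.yml
destinations:
  snowflake:                   # name passed to the -d option
    type: SnowflakeCreator     # assumed creator class for this destination
    conf:
      profile: snowflake_dev
mappers:
  oracle_to_snowflake:         # assumed layout: ties a named source to a named destination
    source: oracle
    destination: snowflake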
Connection Profile Configuration
The second type of config file used is called profile.yml, and it contains the relevant connection details for each of the source and destination systems configured in the hdc.yml file.
It looks like this:
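(Illustrative sketch only; the profile names and connection properties below are assumptions, and the actual keys depend on the connectors being configured.)

oracle_dev:                    # hypothetical profile referenced by 'conf.profile' in hdc.yml
  host: oracle.example.internal
  port: 1521
  user: migration_user
  # passwords/keys and service or database names would also be captured here
snowflake_dev:
  account: xy12345.us-east-1   # hypothetical Snowflake account identifier
  user: migration_user
  warehouse: MIGRATION_WH
  database: ANALYTICS

Because this file carries credentials, it should also be kept out of version control.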
No default version of this file is provided, as the details it captures are environment-specific connections and credentials, so the user needs to supply this information for the actual endpoints.
Furthermore, since this file is intended to contain credentials, it is not accepted as a command-line argument but is instead discovered from the directory pointed to by HDS_HOME. The recommended approach, for now, is to store all configuration files in a hidden directory and capture its path in this environment variable.
Log Settings Configuration
log_settings.yml is a typical logging configuration consumed by Python’s logging module and spells out the logging properties for each class hierarchy:
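(A minimal sketch in the dictConfig schema that Python’s logging module accepts; the ‘hdc’ logger name is an assumption standing in for the application’s class hierarchy.)

version: 1
formatters:
  standard:
    format: '%(asctime)s %(name)s %(levelname)s %(message)s'
handlers:
  console:
    class: logging.StreamHandler
    formatter: standard
    level: INFO
loggers:
  hdc:                         # assumed top-level logger for the class hierarchy
    handlers: [console]
    level: INFO
    propagate: false
root:
  handlers: [console]
  level: WARNING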
The default version of this file provides a basic configuration for all the major classes in the hierarchy and can be extended further as needed.
This file can be placed in the HDS_HOME directory or overridden from CLI using the ‘-l’ option.
Resources
For more information, check out the below resources:
GitHub: https://github.com/hashmapinc/hdc
GitLab: https://gitlab.com/hashmapinc/oss/hdc
PyPI: https://pypi.org/project/hashmap-data-cataloger/
Ready to Accelerate Your Digital Transformation?
At Hashmap, we work with our clients to build better together.
If you are considering moving data and analytics products and applications to the cloud, or if you would like help, guidance, and a few best practices for delivering higher-value outcomes in your existing cloud program, please contact us.
Hashmap, an NTT DATA Company, offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud service offerings. We would be glad to work through your specific requirements.
Hashmap’s Data & Cloud Migration and Modernization Workshop is an interactive, two-hour experience for you and your team to help understand how to accelerate desired outcomes, reduce risk, and enable modern data readiness. We’ll talk through options and make sure that everyone understands what should be prioritized, typical project phases, and how to mitigate risk. Sign up today for our complimentary workshop.
Chinmayee Lakkad is a Regional Technical Expert and Cloud/Data Engineer at Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.