How to Discover Your SAP HANA Assets in Google Data Catalog
Learn how to ingest SAP HANA metadata into Google Data Catalog and extend it to fit your users' needs
Google Data Catalog has community-maintained tools, like the open-source connectors, to ingest metadata from different data sources:
This article assumes a basic understanding of Data Catalog features and the available connectors. It focuses on the SAP HANA connector and a new feature that enables the existing RDBMS connectors to ingest metadata from any SQL statement into Data Catalog.
For more context on the existing connectors, please check the previous post about the RDBMS connectors.
Disclaimer: All opinions expressed are my own, and represent no one but myself. They come from the experience of participating in the development of fully operational sample connectors, available on GitHub.
What is SAP HANA?
SAP HANA is a high-performance in-memory database that speeds data-driven, real-time decisions and actions. As the data foundation of SAP’s Business Technology Platform, it provides advanced analytics on multi-model data, on premise and in the cloud. From SAP official docs.
SAP HANA data masking
One of its interesting features is data masking and obfuscation on columns that hold sensitive values, like a social security number:
In SAP HANA, when you create Tables you can specify the type of data mask used: inline expressions or user-defined Functions.
So users without the proper access, like in the image above, won’t be able to view the sensitive data.
Here is a sample of how it works. First, we create the Function:
Then we can use the Function we’ve created in a Table:
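The two statements above can be sketched as follows, held as Python strings (for example, to send through SAP HANA's `hdbcli` driver). The column names and mask format are illustrative assumptions; only the `credit_mask` name comes from this article, so treat the DDL as a sketch rather than the exact statements.

```python
# Sketch of the two masking DDL steps, kept as Python strings.
# The table/column names are illustrative assumptions.

# Step 1: a scalar SQLScript function that masks a credit card
# number, keeping only the first 4 digits visible.
CREATE_MASK_FUNCTION = """
CREATE FUNCTION credit_mask(value VARCHAR(19))
RETURNS masked VARCHAR(19)
LANGUAGE SQLSCRIPT AS
BEGIN
    masked := LEFT(:value, 4) || '-XXXX-XXXX-XXXX';
END;
"""

# Step 2: a table whose credit_card column is masked with that
# function for users lacking the unmasked-read privilege.
CREATE_MASKED_TABLE = """
CREATE TABLE customers (
    id          INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    credit_card VARCHAR(19)
)
WITH MASK (credit_card USING credit_mask(credit_card));
"""

if __name__ == "__main__":
    print(CREATE_MASK_FUNCTION)
    print(CREATE_MASKED_TABLE)
```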
The main idea here is, besides ingesting the basic metadata like Schemas, Tables, and Columns, to also ingest metadata about those Functions and discover all Tables that depend on them and could contain sensitive data.
The connector is available on GitHub: google-datacatalog-saphana-connector. Following the usual structure of the RDBMS connectors, it ingests the basic metadata from the source SQL database:
Now, with the SQL Objects component, it also supports ingesting metadata from any SQL statement into Data Catalog:
We are going to use it to ingest the Functions type from SAP HANA, looking at each file required to enable it.
SQL Objects component
This component is set up by creating 3 files:
- Metadata Definition
The Metadata Definition file contains the mapping between the SQL Query and the Data Catalog Entry and Tag types to be ingested.
- SQL Query
The SQL Query file contains the SQL statement with the fields queried from the source database.
- Feature Flag
And finally, the Feature Flag lets users enable or disable the SQL Object type according to their needs.
When an RDBMS connector implements those files, it will ingest that SQL Object type. At the time of this writing, two RDBMS connectors use the component:
But it can be extended to all the other RDBMS connectors just by providing the configuration files demonstrated above.
Execution in 4 steps
At last, let's ingest the SAP HANA metadata with the Functions SQL Object config, following 4 steps:
1. Install the Connector
2. Set up the Service Account
If you are running the connector in a GCP environment, like a Compute Engine VM, Cloud Functions, Cloud Run, or App Engine, you don't need to download the service account file.
3. Set up environment variables
4. Run Connector
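The 4 steps above could be driven from Python roughly as below. The PyPI package name is real (google-datacatalog-saphana-connector), but the environment variable names follow the naming pattern of the other RDBMS connectors and are assumptions, so always confirm them against the connector's README.

```python
# Sketch of the 4 execution steps. Variable names below are
# assumptions modeled on the other RDBMS connectors' conventions.
import os

# Step 1: install the connector (normally done once, in a shell):
#   pip install google-datacatalog-saphana-connector

# Step 2: point to the service account key file. On GCP runtimes
# (Compute Engine, Cloud Run, ...) this step can be skipped.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/credentials.json"

# Step 3: set the connection environment variables (illustrative
# names and values).
os.environ.update({
    "SAPHANA2DC_DATACATALOG_PROJECT_ID": "my-gcp-project",
    "SAPHANA2DC_SAPHANA_SERVER": "saphana.example.com",
    "SAPHANA2DC_SAPHANA_USERNAME": "dc_reader",
    "SAPHANA2DC_SAPHANA_PASSWORD": "secret",
    "SAPHANA2DC_SAPHANA_DATABASE": "HXE",
})

# Step 4: run the connector CLI installed by the package (shown
# here as the command string; run it in a shell or subprocess).
command = "google-datacatalog-saphana-connector"

if __name__ == "__main__":
    print(command)
```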
In case you want a quick overview, I’ve put together a demo video showing the execution:
After running the connector, we are able to discover the main assets from an SAP HANA database, like Tables, Schemas, and Functions.
1 Search results
Data Catalog indexes the Table and Column metadata, and we have access to the Functions that mask data on each Column and may work with sensitive data:
2 Ingested Table
We can even look at the Function metadata, and Data Curators can enrich it with additional Tags, such as whether the Function is responsible for anonymizing data.
3 Ingested Function
Finally, we can discover the Functions that work with anonymized data and the Tables that use them:
4 Finding dependent table
Having that metadata discoverable can help us in many ways, like when we think about data lineage. If we need to change that credit_mask function, how do we know which data assets will be affected?
A good start is by making it easily discoverable :).
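One way to start is with Data Catalog's search syntax, querying for the Function by name and for the Tables tagged as depending on it. The `name:` and `tag:` qualified predicates exist in Data Catalog search, but the tag template and field names below are hypothetical, so verify the exact predicate form against the search reference.

```python
# Sketch: composing Data Catalog search queries to find the masking
# function and its dependent tables. The tag template/field names
# are hypothetical assumptions about how the connector tags entries.

def function_query(function_name: str) -> str:
    """Search for the ingested Function entry by name."""
    return f"name:{function_name}"

def dependents_query(function_name: str) -> str:
    """Search for Table entries whose (assumed) masking tag field
    references the function."""
    return f"tag:masked_columns.function:{function_name}"

if __name__ == "__main__":
    print(function_query("credit_mask"))
    print(dependents_query("credit_mask"))
```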
The sample connector
All topics discussed in this article are covered in a sample connector, available on GitHub: google-datacatalog-saphana-connector.
It is also available on PyPI, google-datacatalog-saphana-connector, if you don’t want to bother cloning the GitHub repo.
Feel free to get it and run it according to the instructions. Contributions are welcome, by the way!
It’s licensed under the Apache License Version 2.0, distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
In this article, we have covered how to ingest metadata from SAP HANA into Google Data Catalog. We also went through the SQL Objects component, which helps us ingest user-defined Functions into Data Catalog. By doing so, we can improve our decision making, knowing which data assets depend on those Functions and could possibly contain sensitive data. That's it, cheers!