How to extend your Apache Atlas metadata to Google Data Catalog

From design decisions to step-by-step execution, learn how to ingest Apache Atlas metadata into Google Data Catalog with full and incremental runs.

Image created on Canva.

Disclaimer: All opinions expressed are my own and represent no one but myself. They come from my experience participating in the development of fully operational sample connectors, available on GitHub.

If you missed the previous post on how Apache Atlas and Data Catalog structure their metadata, please check a-metadata-comparison-between-apache-atlas-and-google-data-catalog.

The Dress4Victory company

In this article, we start with a fictional scenario featuring the Dress4Victory company. They help their users get the best deals when buying clothes, and over the years they have grown from a few servers to several hundred.

Dress 4 Victory company logo

The company runs many analytics workloads to handle its user data, and to support them, its tech stack is composed mostly of Hadoop components.

Generally speaking, their metadata management was a mess, so last year their CTO added Apache Atlas to the tech stack to better organize their metadata and visualize the enterprise's data structures.

Improving their metadata management helped them solve many problems, such as:

  • Analysts taking a long time to find meaningful data.
  • Customer data spread everywhere.
  • Issues with access controls on their data.
  • Compliance requirements being ignored.

Now they are migrating some workloads to Google Cloud Platform, and the CTO is worried that managing their metadata will become much harder, just as Apache Atlas has started to work well for them.

He found out about Google Data Catalog and would love to use it, since it's fully managed and would reduce operational costs, but they can't migrate everything to GCP at the moment.

Luckily for him, there's a connector for apache-atlas, and he wants to test it out right away.

Full ingestion execution

Let's help Dress4Victory by looking at the Apache Atlas connector's full ingestion architecture:

Full Ingestion Architecture

On each execution, it’s going to:

  • Scrape: connect to Apache Atlas and retrieve all the available metadata (see the curl example after this list).
  • Prepare: transform it into Data Catalog entities and create Tags with extra metadata.
  • Ingest: send the Data Catalog entities to the Google Cloud project.
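To give a feel for what the Scrape phase retrieves, here is a rough equivalent using Apache Atlas's public v2 REST API with curl. The connector uses its own client internally; the host, port, credentials, and the hive_table type below are just placeholders for illustration:

# List all type definitions (Entity Types, Classification Types, ...)
curl -u my-user:my-pass \
  "http://localhost:21000/api/atlas/v2/types/typedefs"
# Fetch the entities of a given type, e.g. Hive tables
curl -u my-user:my-pass \
  "http://localhost:21000/api/atlas/v2/search/basic?typeName=hive_table"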

Currently, the connector supports the following Apache Atlas asset types:

  • Entity Types
    Each Entity Type is converted into a Data Catalog Tag Template containing its attribute metadata. Since Google Data Catalog has pre-defined attributes, we create an extra Template to enrich the Apache Atlas metadata.
  • Classification Types
    Each Classification Type is converted into a Data Catalog Tag Template, so users can create Tags using the same Classifications they were used to working with in Apache Atlas (see the gcloud sketch after this list). If Classifications are attached to Entities, the connector also migrates them as Tags.
  • Entities
    Each Entity is converted into a Data Catalog Entry. Since Google Data Catalog does not have a Type structure, all Entries of the same type share the same Template, so users can search in a way similar to what they would do in Apache Atlas.
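As an illustration of that mapping, here is roughly what creating one of those Tag Templates by hand could look like with gcloud. The connector calls the Data Catalog API directly; the template id, location, and fields below are made up for this example, and the exact flag syntax may vary between gcloud versions, so check gcloud data-catalog tag-templates create --help:

# Hypothetical template mirroring a "PII" Classification from Apache Atlas
gcloud data-catalog tag-templates create pii_classification \
  --location=us-central1 \
  --display-name="PII Classification" \
  --field=id=pii_type,display-name=pii_type,type=string \
  --field=id=is_sensitive,display-name=is_sensitive,type=bool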

Since even Columns are represented as Apache Atlas Entities, the connector allows users to specify, as a command-line argument, the list of Entity Types to be considered in the ingestion process.

At the time this was published, Data Catalog does not support Lineage, so this connector does not use the Lineage information from Apache Atlas. We might consider updating this if things change.

Running it

After setting up the connector environment by following the instructions in the GitHub repo, let's execute it using its command-line args:

#Environment variables
export GOOGLE_APPLICATION_CREDENTIALS=datacatalog_credentials_file
export DATACATALOG_PROJECT_ID=google_cloud_project_id
export APACHE_ATLAS2DC_HOST=localhost
export APACHE_ATLAS2DC_PORT=21000
export APACHE_ATLAS2DC_USER=my-user
export APACHE_ATLAS2DC_PASS=my-pass
google-datacatalog-apache-atlas-connector sync \
--datacatalog-project-id $DATACATALOG_PROJECT_ID \
--atlas-host $APACHE_ATLAS2DC_HOST \
--atlas-port $APACHE_ATLAS2DC_PORT \
--atlas-user $APACHE_ATLAS2DC_USER \
--atlas-pass $APACHE_ATLAS2DC_PASS

Results

Turn the subtitles on for step-by-step guidance when watching the video.

Full run Apache Atlas demo

Now we have all Apache Atlas Classifications and Entity Types inside Google Data Catalog as Tag Templates:

Apache Atlas ingested classifications and entity types

And Entities as Data Catalog custom entries:

Apache Atlas ingested entities

Remember the Hadoop components from Dress4Victory? We even have some Hive LoadProcess entities ingested, like loadsalesdaily and loadsalesmonthly.
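If you prefer the terminal to the UI, a quick search along these lines should surface the ingested entries. I am quoting the gcloud search flags from memory, so double-check gcloud data-catalog search --help before relying on them:

gcloud data-catalog search "loadsalesdaily" \
  --include-project-ids=$DATACATALOG_PROJECT_ID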

Dress4Victory really liked that, but they concluded that with hundreds of servers, running a full ingestion every once in a while won't work for them.

In case you are wondering how long a full ingestion would take with 4048 Apache Atlas entries, go to the Execution Metrics section at the end of this post.

Luckily for us, there's an option to do incremental ingestions; next, we will look at how that option works.

Incremental ingestion execution

If you are not familiar with Apache Atlas, this image shows their architecture:

Atlas Architecture from Atlas docs

So we will leverage that Kafka messaging event bus in the connector's incremental run mode:

Incremental Ingestion Architecture

Now that we have done a full run, we can execute incremental ingestions.

On each execution, it’s going to:

  • Scrape: listen for change events on the Apache Atlas event bus, which, as we saw, is Kafka, and retrieve the metadata for each event. Since Kafka works with a pull model, the connector polls for metadata at pre-configured intervals, listening to the ATLAS_ENTITIES topic.
  • Prepare: transform it into Data Catalog entities and create Tags with extra metadata.
  • Ingest: send the Data Catalog entities to the Google Cloud project.

The main difference here is a much faster execution, since we only need to deal with a small piece of the whole metadata. And if we have a high event throughput, we can even run multiple instances of the connector, assigning them different partitions (see the consumer example below).
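If you want to see the raw events the incremental mode consumes, you can tail the ATLAS_ENTITIES topic with Kafka's own console consumer. The bootstrap server address below is a placeholder, and I am deliberately using a separate consumer group so this debugging session does not steal messages from the connector; starting the same command in two terminals with the same group also shows how partitions get split between instances:

# Tail the Atlas entity-change events (debug consumer group, placeholder broker)
kafka-console-consumer.sh \
  --bootstrap-server my-event-server:9092 \
  --topic ATLAS_ENTITIES \
  --group atlas-event-sync-debug \
  --from-beginning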

Running it

After setting up the connector environment by following the instructions in the GitHub repo, let's execute it using its command-line args:

#Environment variables
export GOOGLE_APPLICATION_CREDENTIALS=datacatalog_credentials_file
export DATACATALOG_PROJECT_ID=google_cloud_project_id
export APACHE_ATLAS2DC_HOST=localhost
export APACHE_ATLAS2DC_PORT=21000
export APACHE_ATLAS2DC_USER=my-user
export APACHE_ATLAS2DC_PASS=my-pass
export APACHE_ATLAS2DC_EVENT_SERVERS=my-event-server
export APACHE_ATLAS2DC_CONSUMER_GROUP=atlas-event-sync
google-datacatalog-apache-atlas-connector sync-event-hook \
--datacatalog-project-id $DATACATALOG_PROJECT_ID \
--atlas-host $APACHE_ATLAS2DC_HOST \
--atlas-port $APACHE_ATLAS2DC_PORT \
--atlas-user $APACHE_ATLAS2DC_USER \
--atlas-pass $APACHE_ATLAS2DC_PASS \
--event-servers $APACHE_ATLAS2DC_EVENT_SERVERS \
--event-consumer-group-id $APACHE_ATLAS2DC_CONSUMER_GROUP

Results

Turn the subtitles on for step-by-step guidance when watching the video.

Incremental run Apache Atlas demo

With the incremental execution, within a few minutes we can see the column-level Tag we added in Apache Atlas:

Log Data Tag on address column

Awesome, now Dress4Victory is happy. Let’s wrap up by looking at some execution metrics.

Execution Metrics

Finally, let’s look at some metrics generated from a full execution. Metrics were collected by running an Apache Atlas 1.0.0 instance populated with 1013 Tables, 1 StorageDesc, 3026 Columns, 2 Views, 3 Databases, and 3 LoadProcesses, resulting in 4048 entities.

The following metrics are not a guarantee; they are approximations that may change depending on the environment, network, and execution.

Metrics summary

For reference, Google Data Catalog provides a free tier of 1 million API calls per month, and charges $10 per 100,000 API calls beyond the first million.
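As a back-of-the-envelope example based on that pricing (confirm the current numbers in the billing docs linked below), a month with 1,300,000 API calls would leave 300,000 calls above the free tier, costing about 3 x $10 = $30:

# 1,300,000 calls - 1,000,000 free = 300,000 billable, at $10 per 100,000
echo $(( (1300000 - 1000000) / 100000 * 10 ))   # prints 30 (dollars)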

Also, if we look at the execution time of 90 minutes, that execution ingested more than 4,000 assets, which is not a small number of cataloged assets. Good thing we have the option to do incremental runs afterward :).

For the most up-to-date info about Data Catalog billing, go to: Data Catalog billing docs.

The sample connector

All topics discussed in this article are covered in a sample connector, available on GitHub: apache-atlas-connector. Feel free to get it and run it according to the instructions. Contributions are welcome, by the way!

It’s licensed under the Apache License Version 2.0, distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Closing thoughts

In this article, we have covered how to ingest metadata from Apache Atlas into Google Data Catalog with full and incremental runs. We also went through a fictional company that would leverage Google Data Catalog to extend its metadata management; since it's a fully managed and serverless solution, it surely adds a lot of value! Cheers!
