A metadata comparison between Apache Atlas and Google Data Catalog
Learn how your metadata is structured on both systems.
Disclaimer: All opinions expressed are my own and represent no one but myself. They come from the experience of participating in the development of fully operational sample connectors, available on GitHub.
If you missed any of the latest posts on how to ingest metadata into Data Catalog, please check the following: Looker, RDBMS, Tableau, Hive.
The one million dollar question
A Data Catalog is usually defined by a collection of metadata, combined with data management and search tools. This enables organizations to quickly discover, understand, and manage all their data.
Now here’s the one million dollar question.
How do you structure your metadata?
Google Data Catalog
Defines their core metadata as:
Google Data Catalog comes with pre-defined structures to represent metadata. If by any chance the built-in attributes are not enough, users are able to work with Templates to add extra attributes to their assets.
Let’s understand each main component of that diagram.
- Entry Group
An entry group keeps related entries together. Using Cloud Identity and Access Management, we can also specify which users can create, edit, and view entries within that entry group.
It’s worth mentioning that Data Catalog automatically creates an entry group for BigQuery entries and Pub/Sub topics.
Here is one entry group as an example, showing some of the ingested entries that belong to the Tableau entry group:
For further details on those, please check the Tableau connector.
- Entry
The native Data Catalog entity, which represents an asset’s technical metadata. It comes with pre-defined fields that change according to its type.
This means some fields for a BigQuery table will not be the same as the ones representing a Pub/Sub topic, although most of them are common.
It even allows users to create their own entry types, using custom entries, like the Tableau ones we saw above.
Now let’s look at one entry from BigQuery:
We will detail later on what the Tags and Schema tabs are used for.
- Tag Template
Data Catalog provides a templating mechanism, where you can create representations of metadata. One quick example for clarification:
This template contains useful attributes for the discover, understand, and manage flow we talked about at the beginning of this post.
We can use them to classify our assets and, for example, search and troubleshoot all tables which have the failed status, or add some automation to our ETL pipeline that blocks jobs on tables with a data quality score lower than 5.
Stay with me to understand how we create tags with them later on.
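To make the template idea a bit more concrete, here is a minimal Python sketch of the troubleshooting search and the ETL gate mentioned above. It does not call the Data Catalog API: the `TagTemplate` class, the field ids (`etl_status`, `data_quality_score`), the table names, and the threshold are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TagTemplate:
    """Hypothetical in-memory stand-in for a tag template.

    `fields` maps a field id to the Python type its values must have.
    """
    template_id: str
    fields: dict

    def validate(self, values: dict) -> None:
        """Raise if a value uses an unknown field or has the wrong type."""
        for name, value in values.items():
            if name not in self.fields:
                raise KeyError(f"unknown field: {name}")
            if not isinstance(value, self.fields[name]):
                raise TypeError(f"{name} expects {self.fields[name].__name__}")


# Assumed template, loosely modelled on the ETL example in this post.
etl_template = TagTemplate(
    template_id="etl_governance",
    fields={"etl_status": str, "data_quality_score": float},
)

# Illustrative tag values per table (made-up data).
tags_by_table = {
    "customer_dim": {"etl_status": "succeeded", "data_quality_score": 8.5},
    "sales_fact": {"etl_status": "failed", "data_quality_score": 3.0},
    "time_dim": {"etl_status": "succeeded", "data_quality_score": 4.0},
}
for values in tags_by_table.values():
    etl_template.validate(values)


def failed_tables(tags: dict) -> list:
    """Tables whose ETL status is 'failed', for troubleshooting."""
    return sorted(t for t, v in tags.items() if v["etl_status"] == "failed")


def blocked_tables(tags: dict, min_score: float = 5.0) -> list:
    """Tables an ETL job should refuse to read: quality score below min_score."""
    return sorted(t for t, v in tags.items() if v["data_quality_score"] < min_score)
```

In a real pipeline the tag values would come from the catalog rather than a local dictionary; the point here is only that typed template fields make this kind of filtering straightforward.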
After a quick coffee break, let’s move on to Apache Atlas.
Apache Atlas
Defines their core metadata as:
Atlas allows users to define a model for the metadata objects they want to manage. The model is composed of definitions called types.
A type represents one or a collection of attributes that define the properties for the metadata object.
“Hey Marcelo, can we compare a type with any Google Data Catalog object?”
“Sorry! This is not a fair comparison. If you look at the mental models, the hierarchies are different! But let’s dig deeper and we will find some similarities.”
Two Composite Metatypes, Struct and Relationships, are out of the scope of this article. Google Data Catalog does not support lineage at the time of this writing, so we are not using Relationships.
And if you are using struct types, I’d love to know your use cases and perhaps improve this article.
Now let’s understand each main component on that diagram.
- Primitive and Enum Meta types
Think about any programming language: those are the most basic types that you can use when creating the attributes of your Entities and Classifications.
- Collection Meta types
This is where things get interesting: you can use array and map structures composed of the primitive and enum types.
Let’s say you have a Table in Atlas. That Table will surely contain some columns, so here you would represent the columns as an array meta-type.
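As a rough illustration of the columns case, the sketch below builds a reduced type definition in Python. Atlas type definitions name collection types with strings such as `array<...>` and `map<...,...>`; the `array_of` and `map_of` helpers, the `Column` type name, and the `parameters` attribute are assumptions made for this example, and the dictionary is a fragment rather than a complete definition.

```python
import json


def array_of(element_type: str) -> str:
    """Build the type name of an array attribute, e.g. array<string>."""
    return f"array<{element_type}>"


def map_of(key_type: str, value_type: str) -> str:
    """Build the type name of a map attribute, e.g. map<string,string>."""
    return f"map<{key_type},{value_type}>"


# Reduced, illustrative entity definition: a Table whose columns are an array.
table_typedef = {
    "name": "Table",
    "superTypes": ["DataSet"],
    "attributeDefs": [
        {"name": "columns", "typeName": array_of("Column"), "isOptional": True},
        {"name": "parameters", "typeName": map_of("string", "string"), "isOptional": True},
    ],
}

print(json.dumps(table_typedef, indent=2))
```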
- Composite Meta types
Here are the two most important units: Entities and Classifications.
- Entities
I told you we would find similarities: Entities in Atlas are close to what we call Entries in Google Data Catalog.
They represent an asset’s technical metadata. The difference is that there are no pre-defined fields.
Sounds scary, right? This gives users a lot of flexibility, but comes with complexity, so use with care.
Lucky for us, Atlas comes with some pre-defined entity types for various Hadoop and non-Hadoop metadata, and you can even ingest sample models and data by running their quick_start.
Also, entity types can extend other types, called superTypes, so you receive attributes from ancestors. Let’s look at one example:
This image shows the attributes for a Table entity called customer_dim.
If we look at the Table ancestors, we get this hierarchy: Table -> DataSet -> Asset -> Referenceable.
The attributes we are looking at are the combination of that hierarchy.
Bear in mind that DataSet is one of the most important types. According to the Atlas documentation, “DataSet can be expected to have a Schema”, which allows us to add classifications on them later on.
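The way attributes accumulate along that hierarchy can be sketched in a few lines of Python. This is a simplified model, not Atlas code: the `TYPES` registry and the attribute names listed for each type are illustrative and do not reproduce the full Atlas model.

```python
# Simplified registry: type name -> (superTypes, attributes declared on the type).
# Attribute names are illustrative only.
TYPES = {
    "Referenceable": ([], ["qualifiedName"]),
    "Asset": (["Referenceable"], ["name", "description", "owner"]),
    "DataSet": (["Asset"], []),
    "Table": (["DataSet"], ["columns", "tableType"]),
}


def all_attributes(type_name: str) -> list:
    """Return inherited attributes first, then the type's own, without duplicates."""
    super_types, own = TYPES[type_name]
    collected = []
    for parent in super_types:
        for attr in all_attributes(parent):
            if attr not in collected:
                collected.append(attr)
    for attr in own:
        if attr not in collected:
            collected.append(attr)
    return collected


def ancestors(type_name: str) -> list:
    """List the hierarchy by following the first superType at each level."""
    chain = [type_name]
    while TYPES[chain[-1]][0]:
        chain.append(TYPES[chain[-1]][0][0])
    return chain


print(" -> ".join(ancestors("Table")))
print(all_attributes("Table"))
```

A type with several superTypes would collect attributes from each of them in order; the `ancestors` helper only follows the first one, which is enough for a single-inheritance chain like this.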
- Classifications
Do you remember Google Data Catalog Templates? We can say Classifications are really similar to them.
Just as with Templates, entities can be associated with Classifications, enabling easier discovery and management.
As with Atlas Entities, we have the same attributes and superTypes capabilities when creating Classifications.
To show how similar Classifications are to Google Data Catalog Templates, we are going to create one named ETL Governance.
The difference here is that we are adding the PII superType.
Classifications and Tags
We talked about Classifications and Templates, but how do we apply them?
- Google Data Catalog
Google Data Catalog uses Tags to apply Templates to Entries.
This is what a Tag looks like:
If you remember the ETL example at the beginning, let’s search using it:
So using Tags, we get rich search capabilities, enhancing our metadata management process as a whole.
Next, we will see how the same features work within Apache Atlas.
- Apache Atlas
Apache Atlas does not create a different object like Tags; it uses the same Classification object and applies it to entities.
This is what a Classification attached to an Entity looks like:
Now let's do the same search:
That’s great! So at the end of the day, the core capabilities of Google Data Catalog and Apache Atlas are similar.
A final comparison
At last, we will put the metadata objects we saw in the article side by side.
- Entry Groups don’t have a correlated object in Apache Atlas. You could use Glossaries to group your assets in Atlas, but they are out of the scope of this article, and they serve a broader purpose.
- Entries are mapped to a combination of Entities and Attributes.
- Tag Templates are mapped to a combination of Classifications and Attributes.
- Tags are mapped to the Classifications when they are applied to Entities.
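The mapping above can be sketched as a small translation function. This is only an illustration of the correspondence, not how the Apache Atlas connector is implemented: the `AtlasEntity` and `AtlasClassification` classes, the dictionary layout of the result, and the sample attribute values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AtlasClassification:
    type_name: str    # corresponds to a Tag Template
    attributes: dict  # corresponds to the Tag field values


@dataclass
class AtlasEntity:
    type_name: str
    attributes: dict
    classifications: list = field(default_factory=list)


def to_data_catalog(entity: AtlasEntity, entry_group: str) -> dict:
    """Translate one entity into an entry-like dict plus tag-like dicts."""
    entry = {
        "entry_group": entry_group,  # no Atlas counterpart; chosen by the caller
        "type": entity.type_name,
        "name": entity.attributes.get("name"),
        "attributes": dict(entity.attributes),
    }
    tags = [
        {"template": c.type_name, "fields": dict(c.attributes)}
        for c in entity.classifications
    ]
    return {"entry": entry, "tags": tags}


# Made-up sample data for the customer_dim table used earlier in the post.
customer_dim = AtlasEntity(
    type_name="Table",
    attributes={"name": "customer_dim", "owner": "etl_team"},
    classifications=[
        AtlasClassification("ETL Governance", {"etl_status": "failed"}),
    ],
)

result = to_data_catalog(customer_dim, entry_group="apache_atlas")
print(result["entry"]["name"], [t["template"] for t in result["tags"]])
```

Note that the entry group has to be supplied from outside, which reflects the first point in the list: Atlas has no direct equivalent for it.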
“Hey Marcelo, now tell me which one is the best?”
Sorry! This is not the blog post for that, but what I can say is that Google Data Catalog is a fully managed and serverless product, whereas Apache Atlas is something you have to manage yourself.
Closing thoughts
In this article, we compared how Apache Atlas and Google Data Catalog structure their metadata. We could see that many concepts are similar, since they are a must for having a good metadata management process in place.
The assets you saw in Google Data Catalog were ingested using the Apache Atlas connector. Stay tuned for my next post, where I will show how to execute the connector doing both full and incremental ingestions! Cheers!
References
- Google Data Catalog docs: https://cloud.google.com/data-catalog
- Data Catalog connectors GitHub: https://github.com/GoogleCloudPlatform/datacatalog-connectors
- Apache Atlas installation guide: https://atlas.apache.org/#/Installation
- Apache Atlas Type system: https://atlas.apache.org/1.1.0/TypeSystem.html
- Apache Atlas connector: https://github.com/GoogleCloudPlatform/datacatalog-connectors-hive/tree/master/google-datacatalog-apache-atlas-connector