A metadata comparison between Apache Atlas and Google Data Catalog
Learn how your metadata is structured on both systems.
Disclaimer: All opinions expressed are my own, and represent no one but myself. They come from the experience of participating in the development of fully operational sample connectors, available on GitHub.
The one million dollar question
A Data Catalog is usually defined by a collection of metadata, combined with data management and search tools. This enables organizations to quickly discover, understand, and manage all their data.
Now here’s the one million dollar question.
How do you structure your metadata?
Google Data Catalog
Google Data Catalog defines its core metadata as:
Google Data Catalog comes with pre-defined structures to represent metadata. If by any chance the built-in attributes are not enough, users are able to work with Templates to add extra attributes to their assets.
Let’s understand each main component of that diagram.
- Entry Group
An entry group keeps related entries together. By using Cloud Identity and Access Management, we can even specify which users can create, edit, and view entries within that entry group.
It’s worth mentioning that Data Catalog automatically creates an entry group for BigQuery entries and Pub/Sub topics.
Here is one entry group as an example, showing some ingested entries that belong to the Tableau entry group:
For further details on those, please check the Tableau connector.
- Entry
The native Data Catalog entity represents an asset’s technical metadata. It comes with pre-defined fields that change according to its type.
This means some fields for a BigQuery table will differ from the ones representing a Pub/Sub topic, although most of them are common.
It even allows users to create their own entry types, using custom entries, like the Tableau ones we saw above.
Now let’s look at one entry from BigQuery:
We will detail later on what the Tags and Schema tabs are used for.
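To make the custom entry idea concrete, here is a minimal sketch of what such an entry could look like, written as a plain Python dict that mirrors the field names of the Data Catalog REST representation (`displayName`, `linkedResource`, `userSpecifiedSystem`, `userSpecifiedType`). The helper name `build_custom_entry` and all sample values are assumptions made for illustration, and nothing is sent to Data Catalog.

```python
import json


def build_custom_entry(display_name, linked_resource, system, entry_type):
    """Return a dict shaped like a Data Catalog custom entry (REST field names)."""
    return {
        "displayName": display_name,
        "linkedResource": linked_resource,
        # Custom entries carry a user-defined system and type
        # instead of one of the built-in ones.
        "userSpecifiedSystem": system,
        "userSpecifiedType": entry_type,
    }


entry = build_custom_entry(
    display_name="Sales dashboard",
    linked_resource="//tableau.example.com/dashboards/sales",  # hypothetical resource
    system="tableau",
    entry_type="dashboard",
)
print(json.dumps(entry, indent=2))
```

The two `userSpecified*` fields are what distinguish a custom entry from a native one such as a BigQuery table.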
- Tag Template
Data Catalog provides a templating mechanism, where you can create representations of metadata. One quick example for clarification:
This template contains useful attributes for the discover, understand, and manage flow we talked about at the beginning of this post.
We can use them to classify our assets and, for example, search and troubleshoot all tables that have the failed status, or add some automation to our ETL pipeline that blocks jobs on tables with a data quality score lower than a chosen threshold.
Stay with me to understand how we create Tags with them later on.
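As a rough illustration of the automation idea, the sketch below models a template as a plain dict that follows the Data Catalog REST shape for tag templates (`fields`, `primitiveType`, `enumType`), plus a small gate function for an ETL pipeline. The template name, the field ids `etl_status` and `data_quality_score`, and the threshold value are all assumptions, since the original template is only shown as an image.

```python
MIN_QUALITY_SCORE = 0.8  # hypothetical threshold, pick one that fits your pipeline

# Assumed template layout; field ids are illustrative only.
etl_template = {
    "displayName": "ETL Governance",
    "fields": {
        "etl_status": {
            "displayName": "ETL status",
            "type": {
                "enumType": {
                    "allowedValues": [
                        {"displayName": "SUCCEEDED"},
                        {"displayName": "FAILED"},
                    ]
                }
            },
        },
        "data_quality_score": {
            "displayName": "Data quality score",
            "type": {"primitiveType": "DOUBLE"},
        },
    },
}


def should_block_job(tag_values, min_score=MIN_QUALITY_SCORE):
    """Block when the last ETL run failed or the score is below the threshold."""
    status = tag_values.get("etl_status")
    score = tag_values.get("data_quality_score")
    return status == "FAILED" or (score is not None and score < min_score)


print(should_block_job({"etl_status": "SUCCEEDED", "data_quality_score": 0.5}))
```

A pipeline step could call `should_block_job` with the values read from a table’s tag before scheduling the next job.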
After a quick coffee break, let’s move on to Apache Atlas.
Apache Atlas defines its core metadata as:
Atlas allows users to define a model for the metadata objects they want to manage. The model is composed of definitions called types.
A type represents one or a collection of attributes that define the properties for the metadata object.
“Hey Marcelo, can we compare a type with any Google Data Catalog object?”
“Sorry! This is not a fair comparison. If you look at the mental models, the hierarchies are different! But let’s dig deeper and we will find some similarities.”
There are two Composite Metatypes, Struct and Relationships, that are out of the scope of this article. Google Data Catalog does not support lineage at the time of this writing, so we are not using Relationships.
And if you are using struct types, I’d love to know your use cases and perhaps improve this article.
Now let’s understand each main component on that diagram.
- Primitive and Enum Meta types
Think about any programming language: these are the most basic types, which you can use when creating attributes for your Entities and Classifications.
- Collection Meta types
This is where things get interesting: you can use array and map structures composed of other types.
Let’s say you have a Table in Atlas. That Table will surely contain some columns, so here you would represent the columns as an array of columns.
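The sketch below shows how a collection attribute could appear in an Atlas type definition, using plain Python dicts shaped like the JSON that Atlas accepts for typedefs (`entityDefs`, `attributeDefs`, `typeName`, `cardinality`). The type names `example_table` and `example_column` and the helper `attribute_def` are hypothetical, chosen only for this illustration.

```python
def attribute_def(name, type_name, cardinality="SINGLE", optional=True):
    """Build one item of a typedef's attributeDefs list."""
    return {
        "name": name,
        "typeName": type_name,
        "cardinality": cardinality,
        "isOptional": optional,
    }


# Hypothetical column type with a single primitive attribute.
column_def = {
    "name": "example_column",
    "superTypes": ["DataSet"],
    "attributeDefs": [attribute_def("dataType", "string")],
}

# Hypothetical table type using both collection metatypes.
table_def = {
    "name": "example_table",
    "superTypes": ["DataSet"],
    "attributeDefs": [
        # An array whose elements are entities of another type.
        attribute_def("columns", "array<example_column>", cardinality="SET"),
        # A map from primitive keys to primitive values.
        attribute_def("parameters", "map<string,string>"),
    ],
}

typedefs = {"entityDefs": [column_def, table_def]}
print([a["typeName"] for a in table_def["attributeDefs"]])
```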
- Composite Meta types
Here are the two most important units: Entities and Classifications.
I told you we would find similarities: Entities in Atlas are close to what we call Entries in Google Data Catalog.
They represent an asset’s technical metadata. The difference is that there are no pre-defined fields.
Sounds scary, right? This gives users a lot of flexibility, but it comes with complexity, so use it with care.
Lucky for us, Atlas comes with some pre-defined entity types for various Hadoop and non-Hadoop metadata, and you can even ingest sample models and data by running their quick_start.
Also, entity types can extend other types, called superTypes, so you receive attributes from ancestors. Let’s look at one example:
This image shows the attributes for a Table entity called
If we look at the Table ancestors, we would get this hierarchy: Table -> DataSet -> Asset -> Referenceable.
And the attributes we are looking at are the combined attributes of every type in that hierarchy.
Bear in mind that DataSet is one of the most important types — according to Atlas documentation: “DataSet can be expected to have a Schema” — allowing us to add classifications on them later on.
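The way attributes accumulate along the Table -> DataSet -> Asset -> Referenceable chain can be sketched with a small resolver. The registry below is a simplified stand-in written for this example: the attribute lists are assumptions and do not reproduce the full Atlas models.

```python
# Simplified, illustrative registry; real Atlas types carry more attributes.
TYPES = {
    "Referenceable": {"superTypes": [], "attributes": ["qualifiedName"]},
    "Asset": {"superTypes": ["Referenceable"], "attributes": ["name", "description", "owner"]},
    "DataSet": {"superTypes": ["Asset"], "attributes": []},
    "Table": {"superTypes": ["DataSet"], "attributes": ["columns"]},
}


def resolve_attributes(type_name, types=TYPES):
    """Collect attributes from the most distant ancestor down to the type itself."""
    resolved = []
    for parent in types[type_name]["superTypes"]:
        for attr in resolve_attributes(parent, types):
            if attr not in resolved:
                resolved.append(attr)
    for attr in types[type_name]["attributes"]:
        if attr not in resolved:
            resolved.append(attr)
    return resolved


print(resolve_attributes("Table"))
# ['qualifiedName', 'name', 'description', 'owner', 'columns']
```

Inherited attributes come first, so a Table ends up with everything its ancestors define plus its own additions.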
Do you remember Google Data Catalog Templates? We can say Classifications are really similar to them.
Just as with Templates, entities can be associated with Classifications, enabling easier discovery and management.
As with Atlas Entities, we have the same attributes and superTypes capabilities when creating Classifications.
To show how similar Classifications are to Google Data Catalog Templates, we are going to create one named ETL Governance.
The difference here, is we are adding the
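For readers who prefer to see the definition as data, here is a hedged sketch of how an ETL Governance classification could be expressed as an Atlas typedefs payload, using plain Python dicts. The internal name `etl_governance`, the attribute names, and the choice to restrict it to `DataSet` through `entityTypes` are assumptions for illustration, not a reproduction of the definition used in the screenshots. A payload of this shape is what the Atlas v2 typedefs REST endpoint accepts, but nothing is sent here.

```python
import json

# Assumed classification layout; names are illustrative only.
etl_governance_def = {
    "name": "etl_governance",
    "description": "ETL Governance",
    "superTypes": [],
    # Limits which entity types this classification can be attached to.
    "entityTypes": ["DataSet"],
    "attributeDefs": [
        {"name": "etl_status", "typeName": "string",
         "cardinality": "SINGLE", "isOptional": True},
        {"name": "data_quality_score", "typeName": "double",
         "cardinality": "SINGLE", "isOptional": True},
    ],
}

typedefs_payload = {"classificationDefs": [etl_governance_def]}
print(json.dumps(typedefs_payload, indent=2))
```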
Classifications and Tags
We talked about Classifications and Templates, but how do we apply them?
- Google Data Catalog
Google Data Catalog uses Tags to apply Templates to Entries.
This is what a Tag looks like:
If you remember the ETL example at the beginning, let’s search using it:
So using Tags, we get rich search capabilities, enhancing our metadata management process as a whole.
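To show what a Tag carries, the sketch below builds tags as plain dicts following the Data Catalog REST shape (`template`, `fields`, `enumValue`, `doubleValue`) and then filters them locally. This is only an in-memory imitation of the search idea, not the Data Catalog search API; the template path, field ids, and entry names are hypothetical.

```python
# Hypothetical template resource name.
TEMPLATE = "projects/my-project/locations/us-central1/tagTemplates/etl_governance"


def make_tag(template, etl_status, score):
    """Build a dict shaped like a Data Catalog tag (assumed field ids)."""
    return {
        "template": template,
        "fields": {
            "etl_status": {"enumValue": {"displayName": etl_status}},
            "data_quality_score": {"doubleValue": score},
        },
    }


def field_value(tag_field):
    """Unwrap the typed value stored in a tag field."""
    if "enumValue" in tag_field:
        return tag_field["enumValue"]["displayName"]
    for key in ("stringValue", "doubleValue", "boolValue"):
        if key in tag_field:
            return tag_field[key]
    return None


def entries_matching(tags_by_entry, field_id, expected):
    """Return entry names that have a tag whose field equals the expected value."""
    matches = []
    for entry_name, tags in tags_by_entry.items():
        for tag in tags:
            field = tag["fields"].get(field_id)
            if field is not None and field_value(field) == expected:
                matches.append(entry_name)
                break
    return matches


tags_by_entry = {
    "orders": [make_tag(TEMPLATE, "FAILED", 0.42)],
    "customers": [make_tag(TEMPLATE, "SUCCEEDED", 0.97)],
}
failed_entries = entries_matching(tags_by_entry, "etl_status", "FAILED")
print(failed_entries)  # ['orders']
```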
Next, we will see how the same features work within Apache Atlas.
- Apache Atlas
Apache Atlas does not create a different object like Tags. It applies the same Classification object directly to entities.
This is what a Classification attached to an Entity looks like:
Now let's do the same search:
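The same idea can be sketched for Atlas, where the classification sits inside the entity itself. The dicts below follow the general shape of an Atlas entity (`typeName`, `attributes`, `classifications`), and the filter runs locally as a stand-in for an Atlas search. The classification name `etl_governance`, its attributes, and the qualified names are hypothetical.

```python
def classification(type_name, **attributes):
    """Build a classification instance as it appears on an entity."""
    return {"typeName": type_name, "attributes": attributes}


def make_entity(type_name, qualified_name, classifications=()):
    """Build a dict shaped like an Atlas entity with attached classifications."""
    return {
        "typeName": type_name,
        "attributes": {"qualifiedName": qualified_name},
        "classifications": list(classifications),
    }


def entities_classified(entities, classification_name, **expected):
    """Return qualified names of entities carrying a matching classification."""
    matches = []
    for entity in entities:
        for item in entity["classifications"]:
            if item["typeName"] != classification_name:
                continue
            if all(item["attributes"].get(k) == v for k, v in expected.items()):
                matches.append(entity["attributes"]["qualifiedName"])
                break
    return matches


entities = [
    make_entity("Table", "sales.orders",
                [classification("etl_governance", etl_status="FAILED")]),
    make_entity("Table", "sales.customers",
                [classification("etl_governance", etl_status="SUCCEEDED")]),
    make_entity("Table", "sales.products"),
]
failed = entities_classified(entities, "etl_governance", etl_status="FAILED")
print(failed)  # ['sales.orders']
```

Unlike the Data Catalog case, there is no separate Tag object here: the classification instance and its attribute values live directly on the entity.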
That’s great! So at the end of the day, the core capabilities of Google Data Catalog and Apache Atlas are similar.
A final comparison
At last, we will put the metadata objects we saw in the article side by side.
Entry Groups don’t have a correlated object in Apache Atlas. You could use Glossaries to group your assets in Atlas, but they are out of the scope of this article, and they serve a broader purpose.
Entries are mapped to a combination of
TagTemplates are mapped to a combination of
Tags are mapped to the Classifications when they are applied to
“Hey Marcelo, now tell me which one is the best?”
“Sorry! This is not the blog post for that. What I can say is that Google Data Catalog is a fully managed and serverless product, whereas Apache Atlas is something you have to manage yourself.”
In this article, we compared how Apache Atlas and Google Data Catalog structure their metadata. We could see that many concepts are similar, since they are essential to having a good metadata management process in place.
The assets you saw in Google Data Catalog were ingested using the Apache Atlas connector. Stay tuned for my next post, where I will show how to run the connector for both full and incremental ingestions! Cheers!
- Google Data Catalog docs: https://cloud.google.com/data-catalog
- Data Catalog connectors GitHub: https://github.com/GoogleCloudPlatform/datacatalog-connectors
- Apache Atlas installation guide: https://atlas.apache.org/#/Installation
- Apache Atlas Type system: https://atlas.apache.org/1.1.0/TypeSystem.html
- Apache Atlas connector: