Propagating metadata across our data architecture

How we connected metadata from our systems to Apache Atlas.

Rafael Augusto Monteiro
Blog Técnico QuintoAndar
Oct 4, 2021



Introduction

In our previous article, we discussed the first steps of our Data Governance team, how we chose Apache Atlas as our governance tool, and how we needed to create a new service to propagate metadata from different systems to Atlas. If you haven’t read it, make sure to take a look!

In this article, we’ll dive more deeply into our data architecture, our use cases for Apache Atlas, and the solutions we developed to make everything work.

A (very) brief overview of our architecture

Before we discuss the tools in more detail, let’s take a look at our data architecture here at QuintoAndar:

Diagram 1: Data flowing in our architecture. We ingest data from various sources, run it through ETL pipelines, and store it in our data lakehouse.

We ingest data from SQL and NoSQL databases, S3 buckets, APIs, and spreadsheets. Then, it passes through ETL pipelines, which run in Databricks and are orchestrated using Apache Airflow. Finally, the data is stored in AWS S3, with its metadata managed by Hive metastore, so our data lovers can access it using Trino, just as if it were a regular data warehouse. This kind of data lake/data warehouse architecture is called a data lakehouse.

We also wrote articles about our Hive metastore and Trino deployments, so make sure to check them out if you’re interested:

Since we’re dealing with a lot of data, it is fairly easy for someone to not know what some tables or categories are used for. That’s why it is important to have a data catalog: a tool to organize information about our data (also called metadata). As we said in our previous article, we have an older data catalog that resides in spreadsheets and was populated manually, and now we’re replacing it with Apache Atlas.

Our first priority with Atlas was to catalog a few kinds of metadata coming from different sources: table schemas from Hive metastore, which are generated whenever tables are created or updated in the lakehouse; data lineage, which is defined by the transformations applied in the ETL pipelines; data classification, such as which information is sensitive or PII; and documentation of tables and categories, which was being written by data analysts and engineers in spreadsheets.

Here’s an example of each metadata type, defined in YAML format:
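The snippet below is an illustrative sketch rather than our exact production schema: the field names and layout are simplified, and it reuses the user_clean and source tables and the Name column that appear later in this article.

```yaml
# Illustrative sketch only -- field names and layout are assumptions,
# not the exact schema used by Metadata Propagator.

# Documentation for a table and its columns
documentation:
  table: user_clean
  description: Cleaned user data, ready for consumption
  owner: data-governance
  columns:
    - name: Name
      description: Full name of the user

# Lineage: where a column comes from and where it goes
lineage:
  inputs:
    - source.Name
  outputs:
    - user_clean.Name

# Classification: tags applied to source columns, propagated through lineage
classification:
  column: source.Name
  tags:
    - PII
```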

Since each metadata type comes from a different source, it was a good idea to create a service to propagate it across multiple systems. And we couldn’t come up with a funny name for it, so we decided to just call it Metadata Propagator.

Introducing Metadata Propagator

Metadata Propagator is a Python service created to propagate metadata across our systems. Users can interact with it through a web API or by publishing events directly to Kafka topics, and it can send metadata to different destinations.

Diagram 2: Metadata Propagator receives HTTP requests from Airflow DAGs and Drone tasks, as well as Hive events via Kafka, and propagates metadata definitions to Google Sheets and Apache Atlas.

Internally, it was implemented using AIOHTTP, which enables it to perform well under heavy load and allows us to process requests asynchronously. Requests made to the API create events in Kafka, which acts both as an internal queue of tasks to process and as an interface for push-based systems. For example, it can receive an HTTP request to create documentation for a table, or we can push the same event directly to a Kafka topic. Each event is then consumed by a chain of consumers that publish changes to different destinations, such as Google Sheets or Atlas.
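To make the pattern concrete, here is a minimal sketch of the API-to-Kafka flow, assuming a hypothetical /documentation endpoint and metadata-events topic (the real endpoint names, topics, and payloads differ):

```python
# Minimal sketch of the API-to-Kafka pattern described above.
# Endpoint path, topic name and payload shape are illustrative assumptions.
import json

from aiohttp import web
from aiokafka import AIOKafkaProducer

KAFKA_TOPIC = "metadata-events"  # hypothetical topic name


async def create_documentation_event(request: web.Request) -> web.Response:
    """Turn an HTTP request into an event on the internal Kafka queue."""
    payload = await request.json()
    producer: AIOKafkaProducer = request.app["producer"]
    await producer.send_and_wait(KAFKA_TOPIC, json.dumps(payload).encode("utf-8"))
    return web.json_response({"status": "event published"}, status=202)


async def init_app() -> web.Application:
    app = web.Application()
    producer = AIOKafkaProducer(bootstrap_servers="localhost:9092")
    await producer.start()  # must be started inside a running event loop
    app["producer"] = producer
    app.add_routes([web.post("/documentation", create_documentation_event)])
    return app


if __name__ == "__main__":
    web.run_app(init_app())
```

In this sketch the handler answers right away with 202 and leaves the actual propagation work to the consumers, which matches the asynchronous design described above.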

This architecture decouples the propagation logic from our Airflow DAGs and other scripts, and makes it easy to compose different actions whenever there’s an input event.

Let’s go through some of the use cases where Metadata Propagator comes into play:

Data Documentation

Our old data catalog was a spreadsheet in Google Sheets where we described what each table, column, and data category represented. While it was a very easy way to start documenting things, this approach has some shortcomings: no versioning, no schema validation, and poor scalability, since it was a manual process that was very prone to user error.

Since we were planning to move everything to Atlas, but we also had a lot of users relying on the spreadsheet catalog daily, we decided on a hybrid approach: migrating the documentation from Sheets to YAML files in a GitHub repo, which would be replicated to both the spreadsheet catalog and Atlas whenever new files are merged. In this new format, we could use CI/CD orchestration for schema validation and also publish those documentation changes to other systems.

And that’s where Metadata Propagator comes in: whenever a new PR is approved, a script runs in Drone (our CI/CD orchestrator), sends that documentation to an AWS S3 bucket, and calls an endpoint in Metadata Propagator, which creates events to update the documentation. Then, the events are consumed by the Atlas- and Sheets-specific consumers, which read the data from the S3 bucket and update the documentation definition in each destination.

Diagram 3: Documentation propagation flow
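As an illustration of the Drone-side step, a rough sketch could look like this. The bucket name, endpoint, and payload shape are assumptions, not our actual configuration:

```python
# Hypothetical sketch of the CI/CD step described above: upload the merged
# YAML files to S3 and notify Metadata Propagator. Bucket name, endpoint
# and payload shape are assumptions.
import glob
import os

import boto3
import requests

BUCKET = "metadata-definitions"  # hypothetical bucket
PROPAGATOR_URL = "http://metadata-propagator/documentation"  # hypothetical endpoint

s3 = boto3.client("s3")

for path in glob.glob("documentation/*.yaml"):
    key = f"documentation/{os.path.basename(path)}"
    s3.upload_file(path, BUCKET, key)  # store the definition in S3
    # Ask Metadata Propagator to create the documentation update events
    response = requests.post(PROPAGATOR_URL, json={"s3_key": key})
    response.raise_for_status()
```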

This architecture provides a lot of flexibility. For example, if we decide that we’re ready to stop using the spreadsheet, we just need to disable the Sheets events consumer. If we want to also push data to any other visualization tool, such as Looker, we just need to add another consumer that outputs data to it. Isn’t it easy?

Table schema updates

In our ETL pipelines, we’re constantly creating new tables and updating existing ones. To keep the schema definitions in the catalog up to date, we use a listener that intercepts changes made to Hive and publishes them directly to a Kafka topic. Those events are then consumed by Metadata Propagator, which updates the Atlas entities (“entities” are what Atlas calls databases, tables, columns, and so on). Here’s a flow diagram of this process:

Diagram 4: Hive table schema update propagation flow.
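A simplified sketch of the consumer side could look like the following. The topic name, event shape, and attribute mapping are assumptions, while POST /api/atlas/v2/entity is Atlas’s standard create/update endpoint:

```python
# Hypothetical consumer sketch: read Hive schema-change events from Kafka
# and upsert the corresponding hive_table entity via the Atlas v2 REST API.
# Topic name, event shape and attribute mapping are assumptions.
import json

import requests
from kafka import KafkaConsumer  # kafka-python

ATLAS_URL = "http://atlas:21000/api/atlas/v2/entity"  # Atlas create/update endpoint

consumer = KafkaConsumer(
    "hive-schema-changes",  # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for event in consumer:
    table = event.value  # e.g. {"db": "default", "name": "user_clean", ...}
    entity = {
        "entity": {
            "typeName": "hive_table",
            "attributes": {
                "name": table["name"],
                # "@lakehouse" cluster suffix is an assumption
                "qualifiedName": f"{table['db']}.{table['name']}@lakehouse",
            },
        }
    }
    # Create or update the entity in Atlas (matched by qualifiedName);
    # credentials are placeholders
    requests.post(ATLAS_URL, json=entity, auth=("admin", "admin")).raise_for_status()
```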

Using this push-based architecture, we’re able to react to every change happening in Hive and easily propagate it to Atlas (and anywhere else we decide to propagate those changes in the future).

Data lineage and tagging

Another important use case was data lineage and tagging. Atlas provides data lineage functionality, which not only lets us visualize where data comes from and where it goes, but also propagates tags to derived data. This means we can tag source columns as PII or sensitive information, and all derived columns in the data lake will be correctly classified.

Image 1: Data Lineage representation in Apache Atlas. A PII classification was added to an entity, and it is propagated through that entity lineage. Source
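Under the hood, this relies on Atlas classifications with the propagate flag enabled. As a rough sketch (the GUID, URL, and credentials below are placeholders), tagging a source column as PII through the REST API looks roughly like this:

```python
# Sketch of tagging a column as PII via the Atlas v2 REST API; the
# classification is then carried along the lineage by Atlas itself.
# GUID, URL and credentials are placeholders.
import requests

ATLAS_BASE = "http://atlas:21000/api/atlas/v2"
COLUMN_GUID = "<column-guid>"  # GUID of the source column entity, looked up beforehand

classifications = [{
    "typeName": "PII",
    "propagate": True,  # let Atlas propagate the tag to derived columns
}]

requests.post(
    f"{ATLAS_BASE}/entity/guid/{COLUMN_GUID}/classifications",
    json=classifications,
    auth=("admin", "admin"),
).raise_for_status()
```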

Currently, when we write the ETL code, we also write YAML files describing the data lineage and tags. When the code is merged, a CI/CD task sends the YAML files to an S3 bucket, just like in the documentation process. Then, whenever those pipelines run, they send a request to Metadata Propagator, creating events to update table lineage and tags. Finally, Metadata Propagator reads the YAML files and updates the definitions in Atlas.

Diagram 5: Lineage or tags propagation flow
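The pipeline side stays very thin. As a hypothetical sketch (the endpoint and payload shape are assumptions), the last step of a pipeline run could be something like:

```python
# Hypothetical sketch of the pipeline-side call: at the end of an ETL run,
# ask Metadata Propagator to update lineage and tags from the YAML files
# already stored in S3. Endpoint and payload shape are assumptions.
import requests

PROPAGATOR_URL = "http://metadata-propagator/lineage"  # hypothetical endpoint


def notify_metadata_propagator(pipeline_name: str) -> None:
    """Fire-and-forget request; the heavy lifting happens asynchronously."""
    response = requests.post(
        PROPAGATOR_URL,
        json={
            "pipeline": pipeline_name,
            "definitions_prefix": f"lineage/{pipeline_name}/",
        },
    )
    response.raise_for_status()


# e.g. called as the last task of an Airflow DAG (via a PythonOperator)
notify_metadata_propagator("user_clean_etl")
```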

This architecture keeps our ETL pipelines from being tightly coupled to the tagging and lineage process: all the pipelines do is tell Metadata Propagator to asynchronously update the metadata in Atlas. And since the lineage/tag definitions are versioned in our pipelines repository, it is easy to review changes, enforce schema validation in CI/CD, and roll back in case of errors.

Results

After propagating data to Atlas, we are able to quickly search for and find useful information about it. For example, here’s the visualization of the table user_clean we defined earlier:

Image 2: Table entity view in Apache Atlas for table user_clean. We can see the columns the table has, as well as the documentation with description and owner.

And if we look into a specific column, we can see its lineage:

Image 3: Data lineage of the column Name. The column Name from user_clean comes from a column with the same name from a table called source, and it is used to create the Name column in the table user_enrich. It also has a PII classification, which will be propagated to the next columns in the lineage.

Next Steps

While the initial results of integrating Atlas with our tools have been very solid, we expect to keep improving the integration.

We’re currently working on integrating Apache Ranger with Atlas to get better access control over our data. Since we already have tags describing which data is sensitive or PII, we can use Ranger to decide whether users should be able to access that data as is, or whether it should be anonymized.

We are also studying how our users work with Atlas and how we can improve their experience. In addition, we plan to generate metrics that measure how users interact with Atlas and how much of our data is correctly documented there. We currently have some rough estimates, but we’re going to implement automatic checks to monitor exactly what’s in Atlas and what isn’t.

Another thing on our radar for the future is automating the definition of data lineage and tags. Instead of relying on manual work, we can parse SQL files to infer data lineage and run some kind of profiling over the source data to infer PII or sensitive information, using tools such as BigID or by developing our own models.

Finally, we’re planning to integrate more data into Atlas, such as data quality metrics, data profiling metrics, or any other kind of metadata that might help users better understand our data and make Atlas adoption easier and more productive.

Final Thoughts

Integrating a new tool into a complex ecosystem, especially one as central as Atlas, is always a challenge. When designing this architecture, we wanted a simple, easy-to-maintain solution that didn’t tightly couple all of our tools to Atlas. We were able to fulfill those requirements by developing Metadata Propagator.

While these changes are going to help all of our data citizens at QuintoAndar, we couldn’t afford to put our business demands for data on hold while developing this project. That’s why we have dedicated analytics engineering teams that keep providing data to our business units, while other data engineering teams focus on implementing new tools and improving our current ones (such as the Data Governance team, which developed this project and which I’m a part of).

If you want to be part of an innovative team and contribute to top-notch projects like this one, check out our open roles!

Thanks to Juliana, Marcelo, Lucas and Adilson for being an awesome team in the development of this project!
