Creating a Metadata Architecture From the Ground Up

The first challenges faced by our new Data Governance team

Marcelo Alves Baratela
Blog Técnico QuintoAndar
Sep 27, 2021


Creation of data silos, lack of proper documentation, diverging metrics about the same topic, inadequate access privileges to sensitive data. Unless all the challenges that a company might face are crystal clear right from the beginning (which is very unlikely to happen in real life), it is natural that the items on this list start to pop up as the business grows. Natural but still undesirable. That is the reason data governance is such a relevant topic in current times and why QuintoAndar is looking at it with special care.

Photo by Danist Soh on Unsplash

A little bit of context

As companies often described as “data-driven” have matured, we have seen more and more initiatives to improve data governance. These initiatives help companies grow more sustainably, adapt to new legislation, and avoid data leaks and the fines that come with them.

At QuintoAndar it was no different. Despite our constant concern with how we manage our data, the rapid growth of recent years exposed some bottlenecks in our processes and kept some solutions from scaling. Situations where the data engineering team could not allocate enough resources to a given demand were not as rare as we would like, which led, for example, to the data analytics team building its own tools to catalog our data (a.k.a. spreadsheets) and run certain quality checks.

The arrival of the LGPD (the Brazilian equivalent of the GDPR) showed us that, while dealing with those urgent governance issues, we could not simply stop looking at our other demands. In this article, we present the pillars upon which our new Data Governance team was built and the first challenges we faced.

The Data Governance team

In the beginning, it was just a data engineering team inside our tech platform area, but we soon realized that we needed a better interface with the business units. We needed to be closer to the problems faced by other teams and, at the same time, we had to evangelize other areas so they would use our new solutions. This led us to bring in data analysts, who act almost as our product managers.

With the team now complete, we had to determine what objectives and responsibilities we would have, so we decided to focus on the following data governance pillars:

  • Data Quality Management: We must define our data quality processes, along with the architecture and tools needed to implement them;
  • Data Lake Management: It is our responsibility to govern how we store and model our data. We define who should deliver the data in each layer, as well as its owners and users;
  • BI Tools Management: We are responsible for maintaining and managing our BI and data visualization tools, as well as their data structures;
  • Reference & Master Data Management: We must define the source of truth for our most important KPIs, so everybody uses the same metrics. We also define the classification of our tables (e.g. gold, silver, bronze);
  • Metadata, Document, Record & Content Management: This pillar focuses mainly on documenting our data assets, building data lineages, and evangelizing other teams about data governance.

Since the LGPD is such an important issue, and the lack of a proper data catalog kept coming up in our conversations with the data analysts, we decided to prioritize the pillar related to metadata management and documentation.

With our pillars and priorities well defined, the starting point for this new team was to choose a governance platform that would make it easier to build data catalogs and allow better metadata management. This was a crucial step, because the choice made at that point would impact every other project in the future.

Choosing the right framework

When we started looking for a governance framework, we already had a list of urgent demands to tackle as soon as possible, so we based our final choice on two main points.

First, we had to pick a tool with good data catalog features. At that time, we cataloged our data in spreadsheets. It worked for a while, but the lack of versioning and schema validation, along with the overall scalability of the solution, soon became a problem.

The other point that guided our choice (maybe the most important one) was that the tool would have to meet some of our needs regarding the LGPD. It should give us the resources to better protect our data assets, such as masking sensitive information and controlling access to our data.

To narrow down our options, we began by considering only open-source solutions. This would give us the flexibility and extensibility we needed, besides avoiding vendor lock-in. Another very important aspect for us was the type of architecture supported by each option. As former LinkedIn engineer Shirshanka Das writes in his article about popular metadata architectures, the most flexible and scalable way to integrate, store, and process metadata is an event-based architecture: metadata flows through real-time, subscribable events, making it easier to build applications on top of this stream while keeping the data consistent and fresh.
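
To make the idea more concrete, here is a minimal sketch of what a push-based metadata event could look like, using kafka-python. The broker address, topic name, and payload are illustrative, not part of our actual setup.

    # Minimal sketch of a push-based metadata event, assuming a Kafka broker
    # reachable at "kafka:9092" and a hypothetical "metadata-events" topic.
    import json

    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers=["kafka:9092"],
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )

    # A schema-change event emitted as soon as a table changes, so any consumer
    # (a data catalog, an access-control tool, etc.) can react in near real time
    # instead of waiting for a nightly pull-based crawl.
    schema_change_event = {
        "event_type": "schema_change",
        "table": "listings.contracts",
        "columns": [
            {"name": "contract_id", "type": "bigint"},
            {"name": "tenant_document", "type": "string", "tags": ["PII"]},
        ],
    }

    producer.send("metadata-events", schema_change_event)
    producer.flush()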

With these two requirements, we were left with Apache Atlas and DataHub as the final contenders. We also decided to include Amundsen in the benchmark, given its current popularity and user-friendly UI, even though it relies on pull-based ETLs for metadata ingestion. A simplified comparison of the main aspects we took into account can be seen below.

Comparative matrix between Atlas, DataHub and Amundsen. Items marked with * are available as of this writing, but were not at the time of the assessment (January/2021).

As we can see, Amundsen does have important data discovery features, such as data preview, which lets us take a quick glimpse at the content of a table, column-level statistics for data profiling, and an indicator of how fresh the data really is. Although those are very nice characteristics for a data catalog, Amundsen does not give us any means to meet LGPD compliance. Besides, we treated architectural aspects as a priority: since we would have to pull data into Amundsen with some kind of batch job, which was something we wanted to avoid, we had to discard it.

Atlas and DataHub are very similar to each other. Comparing their features, both let us document our assets, define their ownership, create notifications when something is updated, and track the data lineage between our tables and columns.

Two aspects acted as tiebreakers between Atlas and DataHub: tag propagation and integration with Apache Ranger. These characteristics would let us implement LGPD demands, such as masking PII, much more easily. The general idea is that source columns are tagged as PII in Atlas and the tag is propagated to downstream tables through their data lineages. Apache Ranger, in turn, acts as the glue between Atlas, our Trino instance, and the LDAP, making tagged data available only to users with the right permissions.
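
As an illustration of the tagging side, the sketch below adds a PII classification to a column through the Atlas v2 REST API with propagation enabled. The host, credentials, and column GUID are placeholders, and the PII classification type is assumed to already exist.

    # Sketch: tag a source column as PII in Atlas with propagation enabled,
    # via the Atlas v2 REST API. Host, credentials and GUID are placeholders,
    # and the "PII" classification type is assumed to already exist.
    import requests

    ATLAS_URL = "http://atlas:21000/api/atlas/v2"
    AUTH = ("admin", "admin")  # placeholder credentials

    column_guid = "<guid-of-the-source-column>"

    # With "propagate" set to True, Atlas carries the classification to the
    # downstream entities connected to this column through its lineage, which
    # is what lets Ranger mask the derived columns as well.
    classifications = [{"typeName": "PII", "propagate": True}]

    response = requests.post(
        f"{ATLAS_URL}/entity/guid/{column_guid}/classifications",
        json=classifications,
        auth=AUTH,
    )
    response.raise_for_status()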

The architecture behind all this and how it was deployed is a post for another day, so let’s go back to our agenda. We chose Atlas as our governance framework. Now what?

Propagating Metadata

With our framework chosen, it was time to think about how to integrate all the moving parts of our environment with Atlas in the best way possible. The solution had to be flexible and avoid coupling with our other systems, such as our repository of Airflow DAGs.

The solution we came up with was a service called Metadata Propagator, responsible for, guess what… Yes, for propagating metadata across our systems. We will have a dedicated article explaining this service’s architecture and other details, but, generally speaking, it works like this: it exposes endpoints that can be called by all sorts of external sources and turns those requests into Kafka events. These events are then consumed, triggering actions that load the metadata into Atlas, or into any other tool we want.
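
A heavily simplified sketch of that “endpoint to Kafka event” flow could look like the snippet below. The real Metadata Propagator is an internal service, so the framework (FastAPI here), endpoint path, and topic name are just illustrative choices.

    # Simplified sketch of the "endpoint -> Kafka event" flow. The real
    # Metadata Propagator is an internal service; the framework, endpoint path
    # and topic name below are illustrative. Assumes FastAPI and kafka-python.
    import json

    from fastapi import FastAPI
    from kafka import KafkaProducer

    app = FastAPI()
    producer = KafkaProducer(
        bootstrap_servers=["kafka:9092"],
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )

    @app.post("/events/schema-change")
    def schema_change(payload: dict):
        # Turn the HTTP request into a Kafka event; the consumers downstream
        # decide whether it should be written to Atlas or to any other tool.
        producer.send("metadata-propagator.schema-change", payload)
        return {"status": "accepted"}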

This kind of architecture has benefited us in many ways. For example, through the API we could easily integrate the Metadata Propagator with all our DAGs without tightly coupling them. Any change to a table’s schema, or to the business rules used to create it, is detected right away, keeping our data catalog from becoming outdated.
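
For illustration, a DAG could notify the service with nothing more than an HTTP call at the end of a build, along the lines of the sketch below; the endpoint URL and table are hypothetical.

    # Sketch of a DAG task notifying the Metadata Propagator after a table is
    # (re)built. The endpoint URL and table are hypothetical; the only coupling
    # is a plain HTTP call, so the DAG knows nothing about Atlas itself.
    from datetime import datetime

    import requests
    from airflow.decorators import dag, task

    @dag(schedule_interval="@daily", start_date=datetime(2021, 9, 1), catchup=False)
    def contracts_pipeline():

        @task
        def build_contracts_table():
            # ... build or update the table, then describe the resulting schema.
            return {
                "table": "listings.contracts",
                "columns": [{"name": "contract_id", "type": "bigint"}],
            }

        @task
        def notify_metadata_propagator(schema: dict):
            requests.post(
                "http://metadata-propagator/events/schema-change",
                json=schema,
                timeout=10,
            ).raise_for_status()

        notify_metadata_propagator(build_contracts_table())

    contracts_pipeline()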

It is also very easy to add new sources of metadata, or to integrate different tools whose data is constantly being updated. For instance, a single schema-change event can have the Metadata Propagator writing metadata into Atlas, Google Sheets, and Looker at the same time. This helps us deliver new features much faster and also increases the company’s engagement with our solutions.
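
The fan-out itself can be pictured as a single consumer dispatching each event to a list of sinks, as in the sketch below; the sink functions are placeholders for the real integrations.

    # Sketch of the fan-out: one consumed schema-change event is dispatched to
    # several sinks. The sink functions are placeholders for the real
    # integrations (Atlas, Google Sheets, Looker).
    import json

    from kafka import KafkaConsumer

    def update_atlas(event):
        ...  # write entities and lineage to Atlas

    def update_sheets(event):
        ...  # refresh the catalog spreadsheet

    def update_looker(event):
        ...  # sync the metadata exposed in Looker

    SINKS = [update_atlas, update_sheets, update_looker]

    consumer = KafkaConsumer(
        "metadata-propagator.schema-change",
        bootstrap_servers=["kafka:9092"],
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )

    for message in consumer:
        for sink in SINKS:
            sink(message.value)  # adding a new tool is just adding a new sink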

What’s next

As a very recent team, we have lots of challenges in our backlog. The integration between Atlas and Ranger is a very important project for us and should start soon. Besides that, we will probably work on Atlas usability, since its UI has some issues that could stand in the way of wider adoption across the company.

The very next point we will tackle is data quality. We are working with other data engineering teams on a new framework that will help us track the quality of our data throughout the whole pipeline, along with alerts that will keep potential problems in our tables from being discovered only at the finish line by our Data Analysts. The outputs of the quality checks might even be fed into Atlas, so we can more easily see the consistency of our data assets.
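
One possible shape for those results, reusing the same event-based flow so they could eventually reach Atlas or an alerting channel, is sketched below; the check, endpoint, and payload are purely illustrative.

    # Hypothetical shape of a quality-check result flowing through the same
    # event-based pipeline, so it could eventually reach Atlas or an alerting
    # channel. The check, endpoint and payload are illustrative only.
    from datetime import datetime, timezone

    import requests

    def report_null_rate(table: str, column: str, null_rate: float, threshold: float = 0.01):
        result = {
            "event_type": "quality_check",
            "table": table,
            "column": column,
            "check": "null_rate",
            "value": null_rate,
            "passed": null_rate <= threshold,
            "checked_at": datetime.now(timezone.utc).isoformat(),
        }
        # Reuse the propagation service, so the pipeline emitting the check
        # does not need to know about Atlas or the alerting channel at all.
        requests.post(
            "http://metadata-propagator/events/quality-check",
            json=result,
            timeout=10,
        ).raise_for_status()

    report_null_rate("listings.contracts", "tenant_document", null_rate=0.002)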

Final thoughts

We know data governance is a topic that companies can easily set aside, because people tend not to give it much attention, or because teams are not structured in a way that lets it be a priority. Here at QuintoAndar we are lucky to have data engineers and analysts dedicated to this discipline.

We have been able to build high-quality, scalable solutions to urgent problems with the potential to impact different areas of the company. This would not be possible without people looking exclusively at these issues, leaving other teams free to handle other kinds of demands, such as data modeling, the development of new data pipelines, and so on.

Decisions made at the very beginning of a long-term project are hard to make and can have a huge impact on everything that follows. Our choice of Apache Atlas as our governance framework was based on all the information available to us, along with our main needs at that moment.

However, nothing prevents us from changing our minds in the future, because new requirements might appear, or we might find out that Atlas cannot handle all of our needs by itself. We can, though, be prepared for that possibility by developing solutions flexible enough to let us change course when we have to, without paying a high price for it.

Challenges like the ones mentioned here are great inputs for new articles, and we are full of them, so stay tuned to our tech blog for future posts!

Thanks to our awesome team members Juliana Freire, Rafael Augusto, Lucas Ribeiro, and to Adilson Atalla for putting this team together.
