Partnering for metadata management

Vlad Rișcuția
Data Science at Microsoft
6 min readDec 8, 2020

--

In my previous article, “Common data engineering challenges and their solutions,” I talked about metadata management and promised that we would have more to share soon. Similar to the collaboration I described in my other earlier article about our partnership for data quality, over the past year we worked closely with the Data Governance team in Azure to onboard our assets to the recently announced Azure Purview service. This article covers our journey from realizing we needed a metadata management solution, developing our own bespoke service, and finally to migrating to Azure Purview.

Metadata management

I covered the core concepts of metadata management in that previous article, but it’s worth a recap.

Metadata management is concerned with information that is not the data itself, but rather is about the data. A big data platform needs to provide this to enable better organizing, searching, and making sense of the data stored within. The two main components of a metadata management system are a data dictionary and a data glossary.

As the volume of datasets grows, it becomes more difficult to know what data is available. Our team uses a large Azure Data Explorer cluster that contains a couple dozen databases and hundreds of tables. At this scale, someone browsing the cluster would take a long time to see whether a dataset is already available there or not, which could lead to multiple people duplicating ETL and ingesting the same dataset in different databases. Understanding the schema would be another major challenge: What do all the columns in a table mean? A data dictionary helps make sense of the data by capturing descriptions of tables and columns and presenting them in a searchable UI.

Our team works in a complex business space: Modeling the Azure business with all its nuances is challenging. The various reports and metrics coming out of our platform must have clear definitions and canonical queries to produce them. We can’t have two team members produce different numbers for the same metric due to differences in understanding what the metric measures. The example I gave in the previous article concerns Azure Consumed Revenue, or ACR, which is a key metric the team tracks. ACR has a standard definition and a query everyone should use to produce it. A data glossary helps keep track of these definitions and associated queries, making them easy to find.

--

--