Partnering for metadata management
In my previous article, “Common data engineering challenges and their solutions,” I talked about metadata management and promised that we would have more to share soon. As with the data quality partnership I described in an earlier article, over the past year we worked closely with the Data Governance team in Azure to onboard our assets to the recently announced Azure Purview service. This article covers our journey: realizing we needed a metadata management solution, developing our own bespoke service, and finally migrating to Azure Purview.
Metadata management
I covered the core concepts of metadata management in that previous article, but it’s worth a recap.
Metadata management is concerned with information that is not the data itself but describes the data. A big data platform needs to provide this information so that people can organize, search, and make sense of the data stored within it. The two main components of a metadata management system are a data dictionary and a data glossary.
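To make the two components concrete, here is a minimal Python sketch. It assumes the common meaning of the terms: a data dictionary catalogs datasets (where they live, what they contain, who owns them), and a data glossary defines shared business terms. Every name in it (`DictionaryEntry`, `GlossaryTerm`, `search`, and the sample datasets) is hypothetical and made up for illustration; it is not how our service or Azure Purview models metadata.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    """One data dictionary record: describes a dataset, not its contents."""
    name: str
    location: str  # hypothetical "database.table" locator
    description: str
    owner: str
    tags: list[str] = field(default_factory=list)


@dataclass
class GlossaryTerm:
    """One data glossary record: a shared definition of a business term."""
    term: str
    definition: str


def search(entries: list[DictionaryEntry], keyword: str) -> list[DictionaryEntry]:
    """Return entries whose name, description, or tags contain the keyword."""
    needle = keyword.lower()
    return [
        e for e in entries
        if needle in e.name.lower()
        or needle in e.description.lower()
        or any(needle in t.lower() for t in e.tags)
    ]


# Sample metadata, invented for this sketch.
catalog = [
    DictionaryEntry("PageViews", "telemetry.PageViews",
                    "Daily website page views", "web-team", ["web", "traffic"]),
    DictionaryEntry("Invoices", "billing.Invoices",
                    "Customer invoices per month", "finance-team", ["billing"]),
]
glossary = [
    GlossaryTerm("Active user",
                 "A user with at least one page view in the last 28 days"),
]

matches = search(catalog, "page view")
print([e.location for e in matches])
```

The point of the sketch is that a keyword lookup over descriptive records can answer "does this dataset already exist, and where?" without anyone having to browse the storage itself.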
As the number of datasets grows, it becomes more difficult to know what data is available. Our team uses a large Azure Data Explorer cluster that contains a couple dozen databases and hundreds of tables. At this scale, someone browsing the cluster would need a long time to find out whether a dataset is already available there, which could lead to multiple people duplicating ETL and ingesting the same dataset into different databases…