Data catalogs. Part 3. Metadata quality observation

Ivan Begtin
2 min read · Jun 25, 2022


I don’t think I have written yet about how to compare the data quality metrics used in the Modern Data Stack with those used on open data portals. The approaches differ in many ways. I wrote about the differences between types of data catalogs in an earlier text on Medium [1].

In the Towards Data Science blog there is a helpful text by Prukalpa, co-founder of the startup Atlan, about the 5W1H methodology [2].

Source: https://medium.com/towards-data-science/data-documentation-woes-heres-a-framework-6aba8f20626c

5W1H is a list of data quality questions that need answers: What, Why, Where, Who, When, and How.

This is the metadata that should be collected about data to understand how it is organized and what can be done with it. In the corporate world, using this technique or a similar one is undoubtedly relevant and essential, especially when many teams work with the data. In the world of open data, things are somewhat different: data is often published simply as files, its owners are often unreachable, and there is a lot of historical data with little or no metadata at all.
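As a rough illustration, the 5W1H check can be automated as a completeness test over a dataset's metadata record. This is a minimal sketch; the mapping of questions to field names is a hypothetical example, not taken from any particular catalog schema:

```python
# Minimal 5W1H completeness check for a dataset metadata record.
# The question-to-field mapping below is illustrative only.
FIVE_W_ONE_H = {
    "What":  ["title", "description"],   # what the data is
    "Why":   ["purpose"],                # why it was collected
    "Where": ["source_url"],             # where it comes from
    "Who":   ["owner", "contact"],       # who maintains it
    "When":  ["created", "updated"],     # when it was produced or refreshed
    "How":   ["collection_method"],      # how it was gathered
}

def missing_answers(record: dict) -> list:
    """Return the 5W1H questions a metadata record fails to answer."""
    return [
        question
        for question, fields in FIVE_W_ONE_H.items()
        if not any(record.get(f) for f in fields)
    ]

record = {"title": "Budget 2022", "owner": "Ministry of Finance",
          "created": "2022-01-10"}
print(missing_answers(record))  # → ['Why', 'Where', 'How']
```

A check like this is trivial inside a corporation, where field names are controlled; on open data portals the fields themselves are often absent, which is exactly the problem described above.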

Nevertheless, the most thoughtful metadata quality monitoring standard is the European MQA (Metadata Quality Assurance) [3]. Its criteria are different: Findability, Accessibility, Interoperability, Contextuality, and Reusability.

Source: https://data.europa.eu/mqa/?locale=en
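The MQA approach can be sketched as a set of boolean indicator checks grouped by dimension, with each dimension scored by the share of checks passed. The indicators below are illustrative placeholders, not the official MQA indicator list or its point weights:

```python
# Simplified MQA-style scoring: each dimension is a set of boolean
# indicator checks; the score is the fraction of checks that pass.
# Indicator names here are assumptions, not the official MQA set.
MQA_DIMENSIONS = {
    "Findability":      ["has_keywords", "has_categories"],
    "Accessibility":    ["access_url_ok", "download_url_ok"],
    "Interoperability": ["has_format", "machine_readable"],
    "Contextuality":    ["has_rights", "has_file_size"],
    "Reusability":      ["has_licence", "has_contact_point"],
}

def mqa_score(checks: dict) -> dict:
    """Per-dimension share of passed indicator checks (0.0 to 1.0)."""
    return {
        dim: sum(checks.get(ind, False) for ind in inds) / len(inds)
        for dim, inds in MQA_DIMENSIONS.items()
    }

checks = {"has_keywords": True, "access_url_ok": True,
          "download_url_ok": True, "has_licence": True}
print(mqa_score(checks))
# → {'Findability': 0.5, 'Accessibility': 1.0, 'Interoperability': 0.0,
#    'Contextuality': 0.0, 'Reusability': 0.5}
```

The real MQA assigns weighted points per indicator and rates datasets from "sufficient" to "excellent", but the structure is the same: measurable checks per dimension rather than a single opaque grade.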

The list of metadata collected when aggregating descriptions according to the DCAT-AP standard for open data is even more extensive, but the quality of that metadata is much lower.

What is missing when analyzing metadata for open data?

Lacking:

  • information about the origin of the data (data lineage)
  • links between published data and its uses (applications, analytics, etc.)
  • structured information about the data source, beyond just a link to it

And quite a lot more. A significant difference is that corporate catalogs are much easier to automate, thanks to the control a company has over them and the comparatively smaller amount of data, though this is not always the case. Open data portals, by contrast, quickly turn into digital garbage dumps if there is no metadata quality monitoring.

References:
[1] https://medium.com/@ibegtin/data-catalogs-part-1-spectrum-of-data-catalogues-ba75d1dd06c9
[2] https://medium.com/towards-data-science/data-documentation-woes-heres-a-framework-6aba8f20626c
[3] https://data.europa.eu/mqa/?locale=en


Ivan Begtin

I am the founder of APICrafter. I write about Data Engineering, Open Data, the Modern Data Stack, and Open Government. Join my Telegram channel: https://t.me/begtin