Are you gatekeeping your metadata contributors?

Pieter Delaere
Published in dScribe data
4 min read · Jun 29, 2022

crowdsourcing metadata contributions

In a previous article, we discussed in which scenarios the discoverability of data assets (reports, datasets, etc.) might need to be restricted. Now that we’ve got a strategy in place to ensure that the visibility of highly sensitive assets is restricted, let’s talk about a broader question. How can we maximize contributions while also ensuring the high quality and trustworthiness of our metadata documentation?

Before we start, let’s point out a critical success factor in any metadata-driven documentation effort: automation. Manually maintaining a list of all potentially relevant data assets will not put you in a happy place. It is better to rely on connectors that automatically crawl the available metadata from various sources. Apart from sparing you a career as a full-time typist, this often surfaces a lot more information than you’d expect. The following insights are often available: stakeholders (asset creator, last updated by), recency (creation date, last update date) and context (location, dependencies on other assets).
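To make the idea of a connector a bit more tangible, here is a minimal sketch that crawls a folder of files and collects recency and context metadata. It assumes a plain file system as the source; the names `CrawledAsset` and `crawl_folder` and the file patterns are made up for illustration. Stakeholder details such as creator or last editor usually come from the source system itself and are left out here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class CrawledAsset:
    """Metadata collected for one asset (hypothetical structure)."""
    name: str
    location: str            # context: where the asset lives
    last_updated: datetime   # recency: last modification time
    size_bytes: int


def crawl_folder(root, patterns=("*.csv", "*.xlsx")):
    """Walk a folder tree and return metadata for every matching file."""
    assets = []
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            if not path.is_file():
                continue
            info = path.stat()
            assets.append(
                CrawledAsset(
                    name=path.name,
                    location=str(path.parent),
                    last_updated=datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
                    size_bytes=info.st_size,
                )
            )
    # Sort by name so repeated crawls produce a stable order.
    return sorted(assets, key=lambda asset: asset.name)
```

A real connector would talk to a BI tool or database rather than a folder, but the principle is the same: read what the source already knows instead of typing it in.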

Although automation can bring us a long way, further context added by human experts will always be valuable. That brings us back to our initial question. At face value, it’s a tricky balance to find. In a data context, you might assume that the more users you allow to create and edit documentation, the higher the chance of mistakes and thus the lower the overall quality of that documentation. On the other hand, restricting contribution permissions to the bare minimum doesn’t sound too great either. You wouldn’t want to be responsible for creating all that documentation on your own, would you?

Which model is right for your organization depends on a number of factors. To start this thought exercise, you can ask the following questions:

  • What does my organization structure look like? Are we organized by domain, geographic region, business unit…?
  • Which relevant roles are available within my organization? Which data-related roles do we have in place or are we planning to create?
    Note: data documentation and governance are often perceived as a lot of work. However, with or without a metadata initiative in place, many people in your organization inevitably have been (manually) doing the activities you are looking for. Whether it’s in an Excel list or an automation-enhanced data catalog makes no difference. What you’re trying to do is make their tasks easier and more effective.
  • When it comes to data, does your organization aspire to have an open-by-default or a restricted-by-default mentality? In terms of metadata policies, the former will likely result in a lighter touch than the latter.

The good news is that whatever your organizational model, you should be able to define and implement the policies that suit your needs. Below are 3 examples that might provide you with some inspiration:

Self-service > Domain Validation > Global Validation
This organization makes no distinction between glossary assets (field definitions, metric calculations, business terms, etc.) and catalog assets (reports and datasets). The bulk of all data assets exist in the context of self-service initiatives. These assets can freely be documented by anyone in the organization, but not validated. Assets linked to a specific domain can be validated by one of the appointed domain stewards, indicating their trust level. Cross-domain assets are assigned to a global data (governance) team, which verifies that these assets are accurate and comply with the organization’s guidelines and definitions.
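As a rough sketch, this three-tier model could be expressed as a pair of permission checks. Everything here is hypothetical: the `User` and `Asset` structures, the idea of deriving the tier from the number of linked domains, and the assumption that self-service assets cannot be validated by anyone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    steward_of: frozenset = frozenset()   # domains this user is steward for
    in_global_team: bool = False          # member of the global data team


@dataclass(frozen=True)
class Asset:
    name: str
    # Assumption: no domain = self-service, one = domain asset,
    # several = cross-domain asset.
    domains: frozenset = frozenset()


def can_document(user, asset):
    """Anyone in the organization may document any asset."""
    return True


def can_validate(user, asset):
    """Validation rights depend on the tier the asset belongs to."""
    if len(asset.domains) == 0:
        # Self-service assets are documented freely but not validated.
        return False
    if len(asset.domains) == 1:
        (domain,) = asset.domains
        return domain in user.steward_of
    # Cross-domain assets go to the global data (governance) team.
    return user.in_global_team
```

Whether the global team may also validate single-domain assets is a policy choice the sketch leaves out; here each tier has exactly one validating party.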

Centralized glossary + democratized catalog
This organization combines a more limited, but globally harmonized glossary with a freely evolving and dynamic catalog. The assumption is that when teams choose to take up ownership of a report or dataset, they are committed to the accuracy of its documentation. By allowing this, documentation becomes more complete with less top-down effort.

In this organization, glossary assets are managed by recognized domain experts. Catalog assets can by default be contributed to by everyone. Teams are free to take up ownership of any catalog asset, after which they can choose to validate it on their own authority.
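A minimal sketch of this split, under a few assumptions: the names `Kind`, `Asset`, `User`, `can_edit`, `claim_ownership` and `validate` are invented for illustration, each user belongs to a single team, and glossary validation is assumed to sit with the same domain experts who manage the glossary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    GLOSSARY = "glossary"   # field definitions, metrics, business terms
    CATALOG = "catalog"     # reports and datasets


@dataclass
class Asset:
    name: str
    kind: Kind
    owner_team: Optional[str] = None
    validated: bool = False


@dataclass(frozen=True)
class User:
    name: str
    team: str
    is_domain_expert: bool = False


def can_edit(user, asset):
    """Glossary edits are reserved for experts; the catalog is open to all."""
    if asset.kind is Kind.GLOSSARY:
        return user.is_domain_expert
    return True


def claim_ownership(user, asset):
    """Any team may take up ownership of a catalog asset."""
    if asset.kind is not Kind.CATALOG:
        raise PermissionError("only catalog assets can be claimed by teams")
    asset.owner_team = user.team


def validate(user, asset):
    """Owning teams validate catalog assets on their own authority."""
    if asset.kind is Kind.GLOSSARY:
        allowed = user.is_domain_expert   # assumption, see lead-in
    else:
        allowed = asset.owner_team == user.team
    if not allowed:
        raise PermissionError(f"{user.name} may not validate {asset.name}")
    asset.validated = True
```

The point of the design is that validation follows ownership: a team cannot vouch for a report it has not first committed to owning.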

Assigned ownership
This organization has a matrix structure, mainly domain-driven but with separate geographic structures in place for several domains. In addition, there is a strong project-driven approach, with many multiyear projects consisting of various stakeholders throughout the organization.

Rather than trying to translate this complex and evolving structure into metadata policies, the organization defines flexible rules to automatically assign assets to one of the existing teams. Rules make use of information such as source system, folder location and asset name prefixes (in those cases where naming conventions are in place) to define these assignments. Assets that could not be automatically assigned are reviewed by a central governance team. Once assigned, each team is responsible for the documentation and (where relevant) validation of each asset. People only have contribution permissions if they are a member of the assigned team.
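Such assignment rules could look roughly like the sketch below. The rule format, the first-match-wins ordering, the team names and the example prefixes are all assumptions made for illustration; a `None` result stands for "send to the central governance team for review".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Asset:
    name: str
    source_system: str
    folder: str


@dataclass(frozen=True)
class Rule:
    team: str
    source_system: Optional[str] = None
    folder_prefix: Optional[str] = None
    name_prefix: Optional[str] = None

    def matches(self, asset):
        # Every condition that is set must hold; unset conditions are
        # ignored, so a rule without conditions matches every asset.
        if self.source_system is not None and asset.source_system != self.source_system:
            return False
        if self.folder_prefix is not None and not asset.folder.startswith(self.folder_prefix):
            return False
        if self.name_prefix is not None and not asset.name.startswith(self.name_prefix):
            return False
        return True


def assign_team(asset, rules):
    """Return the team of the first matching rule, or None for central review."""
    for rule in rules:
        if rule.matches(asset):
            return rule.team
    return None


def can_contribute(user_teams, asset, rules):
    """Only members of the assigned team may contribute to an asset."""
    team = assign_team(asset, rules)
    return team is not None and team in user_teams


# Hypothetical rules: order matters, because the first match wins.
RULES = [
    Rule(team="finance", name_prefix="FIN_"),
    Rule(team="sales-emea", source_system="crm", folder_prefix="/emea/"),
]
```

Because rules are data rather than code, they can be adjusted as projects start and end without reworking the permission model itself.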

Here is a final piece of advice to set you on your way: start small. Rather than trying to cover every potential nook and cranny of your organization right from the start, you might be better off starting with generously assigned permissions. Assigning permissions too freely is much less damaging than being too restrictive. Documentation mistakes can always be corrected and will provide valuable coaching opportunities to increase data maturity. Keeping people engaged and enthusiastic to contribute is much harder. Focus on enabling as many contributions as possible and, with the proper resources and guidance in place, overall quality will be much higher as well.


Fascinated by the possibilities of data. On a mission to enable everyone to generate value through data as CEO of dScribe.