#2 Data Discovery: People, governance and processes

Anas El Khaloui
hipay-tech

We discussed in a first story why having a Data Discovery tool, or Data Catalog, helps companies make more efficient use of the data scattered across their systems (#1 Data Discovery: how to choose your Data Catalog and why you do need one).

Deploying a Data Catalog is great, but the final goal is to empower people and supercharge data usage for the long term. You need to make sure it comes with everything needed for people to actually use it and contribute to making it better over time: documentation, contribution rules, security, etc. You also need to automate and minimize operations as much as possible.

In this story, we share a few of the practices we used to deploy our data catalog and bring it to our users in a pleasant and efficient way.


I. Clear and precise guidelines for documentation

First, choose how people should document their data assets. The important point here is to push towards documenting data at the source rather than in the data catalog’s database. This is for two main reasons:

  • Data Catalog apps come and go, but company data sources last longer: data catalogs are still new, and the market keeps moving every day. Nothing guarantees that the tool you chose in 2022 will still be relevant in 2024. Metadata is valuable, so let’s keep it with us as long as possible.
  • Simple is better: keeping the metadata close to the data, managed by the same engine, is the logical way to avoid duplication and complex workflows. This way, the metadata is stored in a standard way and accessible to any app you want using plain SQL (see the sketch below)!
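To make this concrete, here is a minimal sketch of what documenting at the source can look like, assuming a PostgreSQL source and the psycopg2 driver (the schema, table and column names are made up for the example). Descriptions are written as plain SQL comments next to the data, and any client, the data catalog included, can read them back with plain SQL.

```python
# Minimal sketch, assuming a PostgreSQL source and psycopg2.
# Schema, table and column names are hypothetical examples.
import psycopg2

WRITE_COMMENTS = """
COMMENT ON TABLE payments.transactions IS
  'One row per payment attempt. Loaded by an hourly batch; change history is kept.';
COMMENT ON COLUMN payments.transactions.status IS
  'Lifecycle state of the transaction. Constraint: written only by the payment engine.';
"""

READ_COMMENTS = """
SELECT column_name,
       col_description('payments.transactions'::regclass, ordinal_position) AS description
FROM information_schema.columns
WHERE table_schema = 'payments' AND table_name = 'transactions'
ORDER BY ordinal_position;
"""

with psycopg2.connect("dbname=analytics") as conn, conn.cursor() as cur:
    cur.execute(WRITE_COMMENTS)   # the metadata lives with the data...
    cur.execute(READ_COMMENTS)    # ...and is readable by any SQL client, catalog included
    for column, description in cur.fetchall():
        print(f"{column}: {description}")
```

A catalog extractor then only has to pick these descriptions up at ingestion time; nothing lives solely in the catalog’s own database.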

Second, answer this question clearly: what is a good table or field description? What information must be mentioned in a description? We did this work and compiled it all in a short document for everyone to read.

Clear ownership is paramount. Data ownership is first declared in a big shared Google Sheets file, which is then ingested by the data catalog to fill the owners field in Amundsen. Afterwards, ownership is declared directly in the Data Discovery tool and saved periodically.
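For illustration, here is a hedged sketch of the kind of glue code involved; the file name ownership_sheet.csv and the columns table_key and owner_email are hypothetical, but the idea is simply to turn the shared sheet (exported as CSV) into table → owners pairs that the catalog ingestion can push into Amundsen’s owners field.

```python
# Hypothetical sketch: turn an exported ownership sheet into (table, owners)
# pairs for the catalog ingestion. File and column names are illustrative.
import csv
from collections import defaultdict

def load_ownership(path: str = "ownership_sheet.csv") -> dict[str, list[str]]:
    """Map each table key (e.g. 'schema.table') to its list of owner e-mails."""
    owners: dict[str, list[str]] = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            table_key = row["table_key"].strip().lower()
            owner = row["owner_email"].strip().lower()
            if table_key and owner:
                owners[table_key].append(owner)
    return dict(owners)

if __name__ == "__main__":
    for table, emails in load_ownership().items():
        print(table, "->", ", ".join(emails))
```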

Finally, give some technical guidance on how commenting should be done. We enabled Markdown (to allow embedding links to internal documents in descriptions) and gave some examples.

Overall, a good comment is:

  • Explicit: expressed in simple terms, avoiding mysterious acronyms and overly specific jargon.
  • Complete: describes the contents of the field or table in a precise yet concise way, giving as many details as possible that are not visible from simply looking at the table. For example, the list of possible values of a field can easily be computed, whereas the constraints on a field are harder to discover and deserve to be mentioned in a comment. It is also worth mentioning the primary key when relevant.
  • Durable: should have the longest possible lifespan. For example, avoid listing the possible values of a categorical variable if new values are often added.
  • Up-to-date: an outdated comment is worse than no comment at all, because it can cause errors. Data documentation must therefore evolve with the data sources.

For a table, a few interesting facts are: a business definition, notable technical characteristics (update mechanism, whether change history is kept, etc.), ingestion mode (real-time, daily or hourly batch, etc.), and indexing.

For a field, the business meaning, constraints, and the formula used to compute it are relevant pieces of information.

We are currently working on a comprehensive yet concise tag list. Tags will be used to label data and create categories. This system will be useful for differentiating data domains (cf. Data Mesh architecture) and products, and for helping with governance and with technical and compliance audits (PCI-DSS, GDPR, PSD2, etc.).
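To give an idea of the direction, here is a purely hypothetical sketch of what such a controlled tag vocabulary could look like, with a small validation helper to keep it concise; none of these values are final.

```python
# Purely illustrative sketch of a controlled tag vocabulary; the real list is
# still a work in progress and every value below is hypothetical.
ALLOWED_TAGS = {
    # data domains (cf. Data Mesh)
    "domain:payments", "domain:risk", "domain:finance",
    # compliance scopes
    "compliance:pci-dss", "compliance:gdpr", "compliance:psd2",
    # products
    "product:checkout", "product:reporting",
}

def validate_tags(tags: list[str]) -> list[str]:
    """Reject any tag outside the controlled vocabulary to keep the list concise."""
    unknown = [tag for tag in tags if tag not in ALLOWED_TAGS]
    if unknown:
        raise ValueError(f"Unknown tags: {unknown}")
    return tags

validate_tags(["domain:payments", "compliance:gdpr"])  # passes
```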

A new step will be added to our JIRA workflow. Any task that requires new stored data will have a documentation step.

II. Adding a new data asset to the catalog is easy and fast

We formalized clear guidelines on how to add a new data source to the catalog. Here are the few principles we used:

  • Data producers (developers) do most of the work: data teams are not responsible for maintaining the data discovery tool and providing support; engineering teams are responsible for making their data’s documentation available to all
  • Airflow to the rescue for making metadata ingestion simple and low-code: the connection to a new data source is defined in a JSON file, and the corresponding scheduled ingestion task is created automatically from there. We wrote a wrapper around the amundsen-databuilder code and dockerised it, so that it’s easily callable from Airflow DAGs (see the sketch below)
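The sketch below illustrates this pattern; it is not our exact code. It assumes that every source is described by a small JSON file in a configuration directory and that a hypothetical metadata-ingestion image contains the dockerised databuilder wrapper; one Airflow task is then generated per file.

```python
# Minimal sketch of the pattern, not our production code. Assumptions:
# every data source is described by a JSON file in CONFIG_DIR, and
# "metadata-ingestion:latest" is a hypothetical image containing the
# dockerised wrapper around amundsen-databuilder.
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

CONFIG_DIR = Path("/opt/airflow/metadata_sources")  # hypothetical location

with DAG(
    dag_id="metadata_ingestion",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One ingestion task per JSON file: adding a source to the catalog means
    # adding a small JSON file, not writing new DAG code.
    for config_file in sorted(CONFIG_DIR.glob("*.json")):
        source = json.loads(config_file.read_text())
        DockerOperator(
            task_id=f"ingest_{source['name']}",
            image="metadata-ingestion:latest",
            command=["--source-config", json.dumps(source)],
        )
```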

Simple is good. Simple things get done easily and cause little friction. On the other hand, people just stop doing intricate things if they are not absolutely necessary because they are painful.

III. Keep it safe

  • Use authentication: even metadata can contain sensitive information. As for any other internal app, you should protect your Data Catalog. We use OpenID Connect authentication with Google Cloud Platform as the provider. This called for some additional code changes to make all micro-services authentication-aware and secure (using flaskoidc). We use a VPN on top of that.
  • User profiles: every data catalog user has their own space
  • Minimal rights: use read-only access, and avoid connecting to production databases when pre-production is enough (see the sketch below).
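To make the last point concrete, here is a sketch of what a minimal, read-only account for the data catalog can look like on a PostgreSQL source; the role, password and schema names are placeholders.

```python
# Hypothetical example: a dedicated read-only role for the data catalog on a
# PostgreSQL source. Role, password and schema names are placeholders.
import psycopg2

GRANT_READ_ONLY = """
CREATE ROLE catalog_reader LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE analytics TO catalog_reader;
GRANT USAGE ON SCHEMA payments TO catalog_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA payments TO catalog_reader;
ALTER DEFAULT PRIVILEGES IN SCHEMA payments GRANT SELECT ON TABLES TO catalog_reader;
"""

with psycopg2.connect("dbname=analytics") as conn, conn.cursor() as cur:
    cur.execute(GRANT_READ_ONLY)  # the catalog can read everything, change nothing
```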

IV. Make management as simple as possible

Building a new tool is the moment to think about what it will cost you and your teams in terms of maintenance over time. It’s the perfect moment to invest in a robust setup and best practices.

  • Use Docker for deployment: simple, reproducible, powerful. Of course, CI/CD is also a must
  • Use an orchestration tool to schedule and monitor data ingestions
  • Back up your data catalog’s back-end database every week. We do this by saving snapshots of the Docker volumes to a remote machine (see the sketch after this list)
  • Use one service account key for all your cloud data assets, and give your data catalog minimal access rights. For example, for BigQuery, the account used by your data catalog should only have “Metadata Viewer” rights
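For the backup point, here is a hedged sketch of one common way to snapshot a Docker volume from a throwaway container and ship it to a remote machine; the volume name, paths and remote host are hypothetical, and a weekly cron or Airflow schedule triggers it.

```python
# Sketch of one common way to snapshot a Docker volume and copy it to a
# remote machine. Volume name, paths and host are hypothetical placeholders.
import subprocess
from datetime import date
from pathlib import Path

VOLUME = "amundsen_neo4j_data"            # hypothetical volume name
ARCHIVE = Path(f"/tmp/{VOLUME}-{date.today()}.tar.gz")
REMOTE = "backup@backup-host:/backups/"   # hypothetical remote destination

# Mount the volume read-only in a throwaway container and tar it out to the host.
subprocess.run(
    ["docker", "run", "--rm",
     "-v", f"{VOLUME}:/data:ro",
     "-v", "/tmp:/backup",
     "alpine", "tar", "czf", f"/backup/{ARCHIVE.name}", "-C", "/data", "."],
    check=True,
)

# Ship the snapshot to the remote machine.
subprocess.run(["scp", str(ARCHIVE), REMOTE], check=True)
```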

V. Communicate !

  • Communicate along the way with management and tech/product leaders to make sure the shipped product matches everyone’s needs. Even better if you can have a Product Owner/Manager aboard!
  • Show people the interfaces to let them get a feel of what’s cooking. This will ease adoption later on.
  • Get feedback, and iterate, as always.

That’s it! In a future post, we will talk about deployment and technical aspects.

Give us a little 👏 if you found this useful, or leave a comment if you want more focus on specific aspects in the future 😉

Thanks for reading !

By the way, we are always looking for bright people to join us, check out our open positions ! 🔥
