It’s well-known data is in a continuously changing state, it keeps evolving. Then, new concepts and ways of representing reality appear on stage and, last but not least, sometimes some of these concepts get deprecated.
What is more, not only data evolve, teams & organizations too. That means people joining teams with new knowledge & expertise, and, also the most important thing to care about, people leaving companies with undocumented wisdom.
Besides, I’ve seen in multiple occasions teams moving forward initiatives to start documenting data-sources, triggered by data consumers experiencing some troubles using data wrongly, or by someone leaving the company with a huge amount of data wisdom centralized in one individual.
Someone leaving the company with a huge amount of centralized and non-duplicated data wisdom across the team is quite common, as even in the best families, and this lost wisdom can and will affect the entire organization until somehow & someone recover it.
These initiatives usually start with joy and with good intentions. People start documenting in Confluence, Notion and sometimes in excels about how data is organized, structured, produced and consumed. Nevertheless, all of these initiatives used to end up with outdated & legacy documentation as a result.
All the initiatives consisting of updating documentation manually are mean to fail. Data Engineers prefer coding than documenting; Data Scientists modeling than documenting and Data Analysts & Data Viz playing with data than documenting. Everyone prefers being in the playground than being documenting. That’s why matters having the right process to keep data documentation alive with automated processes.
keep data documentation alive with automated processes
By implementing DataHub as our Data Catalogue we aim to empower all our Data Consumers and Producers by having a better understanding of how all our data is represented and interconnected.
DataHub is LinkedIn’s generalized metadata search & discovery tool.
At Typeform, our datasets are mainly stored in Redshift (two clusters) and S3. Then we have Looker for dashboarding and Airflow for pipelines. That means having 3 types of entities — Datasets, Dashboards, and Jobs. The first one (datasets) is already supported and implemented. Now, we just need to extend their entity to represent the missing ones (already in DataHub’s roadmap).
As we mentioned before, datasets should be documented automatically without requiring a tedious and manual process. Well, how could we do that?
For gathering all the tables available in Redshift and their descriptions, we created an Airflow Operator that queries the pg_catalog tables to fetch all these metadata and then send it to the DataHub MCE consumer by Kafka.
Everyone in the Data team has the responsibility & ownership of documenting the tables they create by providing a table & column level descriptions in Redshift using the comment statements, so our Airflow extractor operator can fetch this information from it later.
Besides, we will register the owner of the table as a producer in the Data Catalogue and the Airflow user as an owner. From another side, once we can register the Data Jobs/Pipelines in charge of modifying & ingesting data, we will add them to the owners’ property.
Redshift provides the Scan feature from where we can fetch which tables are being scanned to update the final one. From there we will generate all the downstream relationships.
The stl_scan table also provides which S3 files are being used, so we will extract these S3 paths from it to include also external datasets to the lineage dependency in the future.
Once we send the relationship metadata to the Data Catalogue, we can visualize their interactions properly.
There is still a long way until we have a fully operative Data Catalogue comprehending every piece of data we have. However, we are completely sure this is the way to follow unleashing the access and knowledge about data making everything visible and understandable.
Having all Redshift datasets documented is great. Even the Lineage between them is really powerful since it allows our Data Consumers to have a better understanding of how the data is flowing in our system.
The plan is to keep adding the S3 datasets, connecting them with the Redshift ones and then start adding all our Dashboards and Jobs information too,
By adding them it will allow us to know which jobs are the owners of these datasets and which dashboards are being populated from each of them.
Links of Interest
- How LinkedIn, Uber, Lyft, Airbnb and Netflix are Solving Data Management and Discovery for Machine Learning Solutions [August 19th, 2019].
- DataHub: A Generalized Metadata Search & Discovery Tool [August 14th, 2019].
- The Evolution of Metadata: LinkedIn’s Journey [Strata NYC September 25th, 2019].
- Open-sourcing DataHub [February 18th, 2020].