Work-Bench Snapshot: The Evolution of Data Discovery & Catalog
The Work-Bench Snapshot Series explores the top people, blogs, videos, and more shaping the enterprise around a particular topic we’re looking at from an investment standpoint.
As enterprise tech investors in the infrastructure and developer tooling space, we are seeing explosive growth in the number of data infrastructure startups. On one hand, this trend reflects the need for data tooling that promotes data-informed decision making and the massive investment in the broader space. On the other hand, data often lives in disparate sources across systems, making it hard for data users to gain visibility into their data pipelines, discover relevant assets, and derive value from them. Large enterprises that have been collecting data for years weren’t built from the start to facilitate data accessibility and now face a far greater challenge harnessing all of their data.
The rise of solutions for data warehousing, ingestion, and transformation has enabled users to work with massive datasets with greater ease. However, most data teams lack access to a centralized catalog, and so cannot fully understand the provenance of the metadata on which their reports are built. This not only limits their access to the right datasets, but also makes it hard for them to trust the data they are working with. Zhamak Dehghani puts it best:
“I personally don’t envy the life of a data platform engineer. They need to consume data from teams who have no incentive in providing meaningful, truthful and correct data. They have very little understanding of the source domains that generate the data and lack the domain expertise in their teams. They need to provide data for a diverse set of needs, operational or analytical, without a clear understanding of the application of the data and access to the consuming domain’s experts.”
Bigger organizations with large distributed datasets face similar challenges. For example, Netflix built Metacat, its own internal data management solution, to keep up with the increasing volume and complexity of its data warehouse, which grew to over 60 petabytes. Other tools such as Dataportal (Airbnb), Databook (Uber), Amundsen (Lyft), DataHub (LinkedIn), Marquez (WeWork), and Data Catalog (Google) were all developed internally to reduce the friction of converting data into actionable insights and improve the productivity of data scientists. But these tools serve as more than a search interface for data. Today, they are also being leveraged to build the foundational frameworks for data governance in order to increase transparency, fairness, and privacy, and improve internal operational efficiency.
With the recent activity and mounting interest in this space, we think the metadata catalog and data discovery space is ripe for investment, and we are excited about tools that specifically focus on giving users of all skill levels access to the data they need, when they need it.
Here’s a compilation of the top blog posts, videos, people to follow and projects to know to help you get up to speed on the topic:
Data Catalogue — Knowing your data by Albert Franzi
“All the initiatives consisting of updating documentation manually are meant to fail. Data Engineers prefer coding than documenting; Data Scientists modeling than documenting and Data Analysts & Data Viz playing with data than documenting. Everyone prefers being in the playground than being documenting. That’s why matters having the right process to keep data documentation alive with automated processes.”
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh by Zhamak Dehghani
“As more data becomes ubiquitously available, the ability to consume it all and harmonize it in one place under the control of one platform diminishes. Imagine just in the domain of ‘customer information’, there are an increasing number of sources inside and outside of the boundaries of the organization that provide information about the existing and potential customers. The assumption that we need to ingest and store the data in one place to get value from a diverse set of sources is going to constrain our ability to respond to proliferation of data sources.”
What do we talk about when we talk about “Data Exploration?” by Enrico Bertini
“One crucial, and often overlooked, aspect of this activity is “data semantics.” I personally find that understanding the meaning of the various fields and the values they contain is such a crucial and hard activity at the beginning. An activity that often requires many many back-and-forth discussions and clarifications with domain experts and data collectors.”
“The bottom line: everyone who plans to analyze data with a business intelligence tool should understand the fundamentals behind dataset organization, and where they need to go to find the correct data. This knowledge will help analysts and domain experts find the right data quickly, and effectively analyze it to generate insights that can improve their daily decisions. Fortunately, data modeling techniques and concepts are more accessible than ever with tools like Sigma, so everyone who needs to access data can dive in themselves and contribute to building a better data culture.”
In this video, the speaker, Raghu Murthy, CEO of Datacoral, shares his experience and the lessons learned building shared data infrastructure at big tech companies like Facebook, where he helped scale the company’s data infrastructure from 50 TB to over 100 PB.
This video serves as an introduction to Amundsen’s architecture and discusses how it gathers metadata from various sources, namely Hive, Presto, and Airflow, and exposes it in one central place.
This video highlights the need for data discoverability and data lineage and explains how the data team at WeWork built their own in-house tool to incorporate a metadata repository into their data platform for added visibility.
People to Follow on Twitter
Shirshanka is a software engineer at LinkedIn where he’s contributed to several data infrastructure projects including Apache Helix, Espresso, Databus and DataHub.
Willy is a software engineer at Datakin and a former engineer at WeWork who specializes in building modern data management and data lineage tools.
Data Discovery / Metadata — Projects to Know
Amundsen is an open source data catalog framework built by Lyft. From a security and data democratization standpoint, Amundsen provisions access to data in a programmatic way. It uses a graph database, which facilitates the discovery of datasets based on their relevance and popularity across the organization.
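Amundsen’s actual ranking runs over its graph database and search index, but the underlying idea — scoring datasets by usage signals recorded as graph edges — can be sketched in a few lines of plain Python. The class and method names below are illustrative assumptions, not Amundsen’s API:

```python
# Illustrative sketch (not Amundsen's code): rank datasets by
# popularity, approximated as the number of distinct users who
# have read each dataset in a simple in-memory graph.
from collections import defaultdict

class MetadataGraph:
    def __init__(self):
        # dataset -> set of users who have read it (the "edges")
        self._readers = defaultdict(set)

    def record_read(self, user: str, dataset: str) -> None:
        self._readers[dataset].add(user)

    def rank_by_popularity(self):
        # Most-read datasets first; ties broken alphabetically.
        return sorted(self._readers,
                      key=lambda d: (-len(self._readers[d]), d))

graph = MetadataGraph()
for user, dataset in [("ana", "events"), ("bob", "events"),
                      ("ana", "users"), ("cal", "events")]:
    graph.record_read(user, dataset)

print(graph.rank_by_popularity())  # -> ['events', 'users']
```

In a production catalog the same query would run in the graph store itself, with richer signals (query counts, recency, ownership) feeding the score.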
Databook is a data discovery tool built by Uber that takes an automated approach to search and exploration. Instead of fetching metadata in real time, Databook crawls its sources periodically and stores the metadata in its own architecture.
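The pull-based pattern behind this design can be sketched as follows. This is a simplified illustration under assumed names, not Uber’s implementation: a crawler snapshots metadata from each registered source on a schedule, and discovery queries hit the stored copy rather than the live sources:

```python
# Illustrative sketch of a pull-based metadata store: crawl
# snapshots metadata from every source; search serves queries
# from the stored copy, never touching the sources directly.
import time

def crawl(sources):
    """Collect one metadata snapshot from every registered source."""
    store = {}
    for name, fetch_tables in sources.items():
        store[name] = {"tables": fetch_tables(), "crawled_at": time.time()}
    return store

def search(store, keyword):
    """Serve discovery queries from stored metadata only."""
    return [t for meta in store.values()
            for t in meta["tables"] if keyword in t]

# Hypothetical sources; in practice these would be warehouses,
# lakes, dashboards, and so on.
sources = {
    "warehouse": lambda: ["trips", "trip_events"],
    "lake": lambda: ["drivers"],
}
store = crawl(sources)        # run on a schedule in a real system
print(search(store, "trip"))  # -> ['trips', 'trip_events']
```

The trade-off is freshness for speed and isolation: searches stay fast and the sources are never load-tested by user queries, at the cost of metadata being as stale as the last crawl.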
Dataportal was developed at Airbnb. It is a self-service system that centralizes tribal knowledge and employee-centric data to provide transparency into the organization’s complex data landscape.
Data Catalog is a scalable metadata management service, powered by Google search technology, that offers an auto-tagging mechanism for sensitive data.
DataHub is an open source metadata search & discovery tool created by LinkedIn. It supports both online and offline analyses that enable use cases such as access control and data privacy.
Metacat is Netflix’s open source data tool that makes it easy to discover and manage data at scale. It acts as an access layer for metadata, connecting Netflix’s data sources to its big data platform.
Marquez is an open source data exploration tool created by WeWork to collect and aggregate metadata for consumption. Marquez provides RESTful APIs that integrate with systems such as Airflow, Amundsen, and Dagster.
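Marquez models lineage as jobs that consume and produce datasets; the core upstream traversal behind a lineage view can be sketched in plain Python. The data model below is a simplified illustration under assumed names, not Marquez’s actual schema or API:

```python
# Illustrative lineage sketch (not Marquez's code): jobs declare
# input and output datasets; the upstream lineage of a dataset is
# found by walking producer jobs back to their inputs.
def upstream(dataset, jobs):
    """Return every dataset that `dataset` transitively depends on."""
    seen = set()
    frontier = [dataset]
    while frontier:
        current = frontier.pop()
        for job in jobs:
            if current in job["outputs"]:
                for source in job["inputs"]:
                    if source not in seen:
                        seen.add(source)
                        frontier.append(source)
    return seen

# Hypothetical pipeline: raw_events -> clean_events -> daily_report
jobs = [
    {"name": "clean", "inputs": ["raw_events"], "outputs": ["clean_events"]},
    {"name": "report", "inputs": ["clean_events"], "outputs": ["daily_report"]},
]
print(sorted(upstream("daily_report", jobs)))  # -> ['clean_events', 'raw_events']
```

Schedulers like Airflow can report these job runs as they happen, which is how a lineage store stays current without manual documentation.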