Introduction to OpenLineage
OpenLineage is an open-source specification for data lineage. The specification is complemented by Marquez, its reference implementation. Since its launch in late 2020, OpenLineage has been a presence at the BuzzWords Summit in Berlin and has been generating increasing interest. Having personally attended discussions among the developers contributing to this project, I will walk through the challenges and questions they face, the solutions they have chosen, the definition of the specification, and the ongoing developments.
What is lineage?
Lineage is the set of relationships between tables and the jobs that process them, drawn as lines connecting each job to its input and output tables. In Marquez, for example, it looks like this:
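Outside of a graphical view, the same relationships can be sketched as a small data structure. The Python example below is only an illustration and uses neither the OpenLineage nor the Marquez API: the job and table names are hypothetical, and the representation (each job mapped to its input and output tables) is an assumption made for the example. It shows how recording inputs and outputs per job is enough to answer the question "which tables does this table depend on?".

```python
from collections import defaultdict

# Hypothetical jobs (names invented for this sketch):
# each job maps to (input tables, output tables).
JOBS = {
    "ingest_orders": (["raw.orders"], ["staging.orders"]),
    "ingest_customers": (["raw.customers"], ["staging.customers"]),
    "build_report": (
        ["staging.orders", "staging.customers"],
        ["mart.sales_report"],
    ),
}


def build_parents(jobs):
    """Map each output table to the tables it is directly derived from."""
    parents = defaultdict(set)
    for inputs, outputs in jobs.values():
        for out in outputs:
            parents[out].update(inputs)
    return parents


def upstream(table, parents):
    """Return every table that `table` depends on, directly or indirectly."""
    seen = set()
    stack = [table]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


parents = build_parents(JOBS)
print(sorted(upstream("mart.sales_report", parents)))
# ['raw.customers', 'raw.orders', 'staging.customers', 'staging.orders']
```

Walking the same structure in the opposite direction (from a table to everything derived from it) would give the downstream view, which is the direction a tag such as "personal data" would need to travel.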
One of the goals is the identification of duplicates. This is a fundamental feature of any data ingestion architecture, not just in big data. When classifications (tags) such as “personal data” or “expiration date” propagate along the lineage, the processes that depend on them can be automated. Without lineage, a copy of a table would lose those tags, and the data it contains might be exposed. The primary use cases are:
- Reliability, for example, by identifying a…