Adaltas

All our publications about Open Source, Big Data, Data Engineering, DevOps and Data Science.

Introduction to OpenLineage

Adaltas
5 min read · Dec 19, 2023

OpenLineage is an open-source specification for data lineage. It is complemented by Marquez, its reference implementation. Since its launch in late 2020, OpenLineage has been a presence at the BuzzWords Summit in Berlin and has attracted growing interest. Having attended discussions among the developers contributing to the project, I will go through the challenges and questions they face, the solutions they have chosen, the definition of the specification, and the ongoing developments.

What is lineage?

Lineage is the set of relationships between tables and the data processing jobs that read or write them. It is usually drawn as lines connecting each job to its input and output tables. In Marquez, for example, it looks like this:

[Image: a lineage graph as displayed in Marquez]
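To make the definition concrete, here is a minimal sketch of lineage as a data structure. It assumes that each job simply declares the tables it reads and writes. The `Job` class, the `upstream_datasets` helper, and the table names are hypothetical illustrations, not part of OpenLineage or Marquez.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """A data processing job with the tables it reads and writes (hypothetical model)."""
    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def upstream_datasets(jobs, dataset):
    """Return every table that `dataset` is directly or transitively derived from."""
    # Index the jobs by the tables they produce.
    producers = {}
    for job in jobs:
        for out in job.outputs:
            producers.setdefault(out, []).append(job)

    # Walk the lineage backwards, from outputs to inputs.
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for job in producers.get(current, []):
            for src in job.inputs:
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
    return seen


# Illustrative jobs and table names (assumptions, not taken from the article).
JOBS = [
    Job("ingest_orders", inputs=("raw.orders",), outputs=("staging.orders",)),
    Job(
        "build_report",
        inputs=("staging.orders", "staging.customers"),
        outputs=("mart.sales_report",),
    ),
]

ancestors = upstream_datasets(JOBS, "mart.sales_report")
print(sorted(ancestors))
```

Walking the same structure in the other direction, from inputs to outputs, would answer the opposite question: which tables are affected when a given source changes.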

One goal of lineage is to identify duplicates, a fundamental need in any data ingestion architecture, not just in big data. Another is automation: when classifications (tags) such as “personal data” or “expiration date” propagate along the lineage, the processes that depend on them can be automated. Without lineage, a copy of a table carries no record of where its data came from, so making the copy might lead to exposing data. In terms of use cases, the primary ones are:

  • Reliability, for example, by identifying a…
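The tag propagation described above can be sketched as follows. This is an illustration under a simplifying assumption: every tag on an input table is inherited by every table derived from it. The `propagate_tags` function and the table names are hypothetical and do not come from OpenLineage or Marquez.

```python
def propagate_tags(edges, source_tags):
    """Propagate tags downstream along lineage edges.

    edges: list of (input_table, output_table) pairs taken from the lineage.
    source_tags: dict mapping a table name to the set of tags set on it by hand.
    Returns a dict mapping each table to its own tags plus the inherited ones.
    """
    tags = {table: set(values) for table, values in source_tags.items()}
    changed = True
    # Repeat until a full pass adds no new tag (a fixed point).
    while changed:
        changed = False
        for src, dst in edges:
            inherited = tags.get(src, set())
            current = tags.setdefault(dst, set())
            if not inherited <= current:
                current |= inherited
                changed = True
    return tags


# Illustrative lineage: a table is copied, then the copy feeds a report.
EDGES = [
    ("crm.customers", "staging.customers_copy"),
    ("staging.customers_copy", "mart.customer_report"),
]
SOURCE_TAGS = {"crm.customers": {"personal data"}}

result = propagate_tags(EDGES, SOURCE_TAGS)
for table, table_tags in sorted(result.items()):
    print(table, sorted(table_tags))
```

With the lineage available, the copy and the report both inherit the “personal data” tag without anyone tagging them by hand. A real system would likely need finer rules, for example dropping a tag when the sensitive column is removed or anonymized.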
