Enabling Data Lineage Using a Graph Database

In this article, we present our approach to supporting data lineage in our analytical environment using a graph database, namely Neo4j.

Amin
If Technology
Aug 31, 2021


Introduction

Data-driven decision-making is one of the key enablers for organizations to remain competitive and achieve their goals. Customers change frequently, and so do their needs. Thus, it is important that the means that enable data-driven decision-making can reflect these changes and be adapted quickly. Traditional data warehousing cannot fill this gap alone, as its development lifecycle is long and it may fail to address requirements on time.

Data mesh is a new approach that looks promising for filling this gap. It proposes decentralizing data ownership, implementing a distributed mesh instead of a monolithic warehouse, encapsulating data and code as one unit rather than storing data as the result of running code, shifting governance to a federated model rather than a top-down central one, and treating data as a product to share rather than an asset to collect.

Inspired by data mesh, we have designed and implemented a new environment that enables analysts to develop their data products within the Customer Analytics domain, sourcing data largely from If’s common data products, which you can read about here. The data products can be developed based on data available in different sources, i.e., operational systems, external sources, or enterprise data warehouses. The environment enables the rapid development of new data products, which can be shared with other analysts. It complements our data ecosystem by providing a foundation for developing data products that do not fit into the enterprise data warehouse. It also enables extending the enterprise data warehouse if a data product becomes widely used in the organization.

We named this environment Customer Analytics — Data Product Environment (CA-DPE). This new environment helped us improve the discovery, traceability, transparency, understandability, monitoring, governance, and compliance checking of our data products. It also enabled better communication among analysts, investigation of the ripple effects of a change in a data product, and prioritization of batch updates. Although it would be interesting to write about all the different aspects that we considered and faced in this project, this article aims only to shed light on how a graph database helped us deliver some of these values to our analytical ecosystem.

Therefore, we introduce a simplified version of our environment in the next section and elaborate on our approach to enabling data lineage using a graph database.

Approach

A data product is the core building block of a data mesh. It is defined within a context, it can receive data from upstream systems or other data products, and it can share data with other data products or systems. We designed and implemented CA-DPE to enable analysts to create their own data products in a specific domain. The results of these data products can be shared with other domains, and they can read data from other domains as well. Therefore, we designed our domain environment with three distinct layers: Input, Data Products Domain Environment, and Output. In this way, we can govern and trace the data products much more easily.

A simplistic overview of layers in the data product environment

In our environment, each data product can be implemented by several code modules, called views. There are three types of views: read, insert, and delete. A read view can retrieve data from other read views, transform it, and return the result. Insert and delete views are special kinds of views that insert data into, or delete data from, a table in the local storage.

Each view can read data from several sources, manipulate the data, store results locally, or delete data from previously stored results. A data product can read data from the input layer or from other data products. To support these functionalities, analysts can develop different kinds of modules for each data product: a module can read, insert, or delete data. In this way, the Create-Read-Update-Delete (CRUD) pattern can be supported for each data product.
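
To make these concepts concrete, the sketch below models layers, view types, and tables as plain Python objects. All class, field, and example names are hypothetical and purely illustrative; they are not taken from our actual implementation.

```python
# Requires Python 3.10+ for the "str | None" annotation syntax.
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    INPUT = "Input"
    DOMAIN = "Domain"   # Data Products Domain Environment
    OUTPUT = "Output"


class ViewKind(Enum):
    READ = "read"
    INSERT = "insert"
    DELETE = "delete"


@dataclass
class Table:
    """A table in the local storage owned by a data product."""
    name: str


@dataclass
class View:
    """A code module belonging to a data product."""
    name: str
    kind: ViewKind
    layer: Layer
    data_product: str
    reads_from: list[str] = field(default_factory=list)  # names of views it reads
    writes_to: str | None = None                          # table it inserts into or deletes from


# Example: an insert view in the domain layer that reads two read views
# and inserts the result into a local table (names are made up for illustration).
customer_insert = View(
    name="customer_insert",
    kind=ViewKind.INSERT,
    layer=Layer.DOMAIN,
    data_product="customer",
    reads_from=["policy_read", "claims_read"],
    writes_to="customer_table",
)
```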

Let’s look at a very simple example that contains all these relations to understand them better. The views in the Input and Output layers can only be read views; it is only possible to read data from these views.

A running example for creating data products

In the domain environment, it is possible to develop insert and delete views. An insert view can read data from read views and insert the result into a table. It can also read from the table into which it inserts data; this read is called read delta, as it enables inserting only the changes. In contrast to insert views, analysts can develop delete views to delete data from a table. A delete view is executed before the corresponding insert view, and it can also be configured to delete only a specific part of the data from the table. Note that analysts do not need to declare these relations explicitly; the environment identifies them automatically through naming conventions.
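
To illustrate how naming conventions could drive this automatic identification, the sketch below classifies a view and derives its target table purely from its name. The prefix scheme (r_, i_, d_) and the parsing rules are assumptions made up for this example; they are not our actual conventions.

```python
import re

# Hypothetical convention: views are named <kind>_<data_product>_<object>,
# where kind is "r" (read), "i" (insert), or "d" (delete).
VIEW_NAME_PATTERN = re.compile(r"^(?P<kind>[rid])_(?P<product>[a-z0-9]+)_(?P<obj>\w+)$")

KIND_MAP = {"r": "read", "i": "insert", "d": "delete"}


def classify_view(view_name: str) -> dict:
    """Derive a view's kind, data product, and target table from its name."""
    match = VIEW_NAME_PATTERN.match(view_name)
    if match is None:
        raise ValueError(f"View name does not follow the convention: {view_name}")
    kind = KIND_MAP[match.group("kind")]
    product = match.group("product")
    obj = match.group("obj")
    # Insert and delete views target the table with the same object name.
    target_table = f"{product}_{obj}" if kind in ("insert", "delete") else None
    return {"name": view_name, "kind": kind, "data_product": product, "table": target_table}


print(classify_view("i_customer_profile"))
# {'name': 'i_customer_profile', 'kind': 'insert', 'data_product': 'customer', 'table': 'customer_profile'}
```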

Our environment parses the developed data products and generates these relations in Neo4j, a graph database. In this way, it provides many benefits, including better understandability, transparency, traceability, prioritization of batch updates, what-if analysis, ripple-effect analysis of changes, monitoring, governance, compliance checking of development practices, better communication between analysts, and environment verification. Below, we visualize how one of our more complicated data products is built.
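
As a rough illustration of the loading step, the sketch below uses the official Neo4j Python driver to write views, tables, and their relations with idempotent MERGE statements. The node labels, relationship types, sample data, and connection details are assumptions for illustration only, not our actual schema.

```python
from neo4j import GraphDatabase  # official Neo4j Python driver

URI = "bolt://localhost:7687"    # assumed local instance
AUTH = ("neo4j", "password")     # placeholder credentials

# Relations extracted by parsing the data products (hypothetical sample).
views = [
    {"name": "r_policy", "kind": "read", "layer": "Input"},
    {"name": "i_customer_profile", "kind": "insert", "layer": "Domain"},
]
reads = [("i_customer_profile", "r_policy")]             # (view, view it reads from)
inserts = [("i_customer_profile", "customer_profile")]   # (insert view, target table)


def load_lineage(driver) -> None:
    """Write views, tables, and their relations into the graph."""
    with driver.session() as session:
        for v in views:
            session.run(
                "MERGE (n:View {name: $name}) "
                "SET n.kind = $kind, n.layer = $layer",
                name=v["name"], kind=v["kind"], layer=v["layer"],
            )
        for src, dst in reads:
            session.run(
                "MATCH (a:View {name: $src}), (b:View {name: $dst}) "
                "MERGE (a)-[:READS_FROM]->(b)",
                src=src, dst=dst,
            )
        for view, table in inserts:
            session.run(
                "MERGE (t:Table {name: $table}) "
                "WITH t MATCH (v:View {name: $view}) "
                "MERGE (v)-[:INSERTS_INTO]->(t)",
                view=view, table=table,
            )


with GraphDatabase.driver(URI, auth=AUTH) as driver:
    load_lineage(driver)
```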

An example of Data Lineage visualization using Neo4j

In this graph, the read views within the source layer are colored orange to distinguish them from the other views. Tables are visualized as green circles, and other nodes as blue ones. Note that analysts can change the colors. As can be seen, it would be very difficult, if not impossible, to understand such complex relations without this kind of support.
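
Beyond visualization, the same graph can be queried directly, for example to estimate the ripple effect of changing a view by finding everything downstream of it. The query below is a minimal sketch against the hypothetical schema from the previous sketch (View nodes connected by READS_FROM relationships); it is not our production query.

```python
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"    # assumed local instance
AUTH = ("neo4j", "password")     # placeholder credentials

# Find all views that depend, directly or transitively, on a given view,
# i.e., everything that could be affected if that view changes.
RIPPLE_EFFECT_QUERY = """
MATCH (downstream:View)-[:READS_FROM*1..]->(source:View {name: $name})
RETURN DISTINCT downstream.name AS affected_view
ORDER BY affected_view
"""

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    with driver.session() as session:
        result = session.run(RIPPLE_EFFECT_QUERY, name="r_policy")
        for record in result:
            print(record["affected_view"])
```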

Conclusion

In this article, we shared our experience of using a graph database to visualize the relations among code modules. It enabled us to provide better understandability, transparency, traceability, prioritization of batch updates, what-if analysis, ripple-effect analysis of changes, monitoring, governance, compliance checking of development practices, better communication between analysts, and environment verification.
