Combining OpenLineage, Marquez and Python’s Pandas for an Easy-to-Use Data Lineage Application

Data plays a crucial role in today’s world. It empowers individuals and organizations to make better decisions, drive innovation, and navigate complex challenges in an increasingly competitive landscape. As good, reliable data grows in importance, so does the need to manage and understand the journey that data takes. To address the complexity of managing and analyzing large volumes of data within organizations, we need to be able to trace data from its origin, across all transformations, to its destination or final state. This is known as data lineage.

Data lineage is a visual representation or tracking system that illustrates the end-to-end journey of data as it moves through various stages and processes within an organization’s data architecture. It offers an easy-to-understand, end-to-end view of how the data evolved over time and makes it possible to validate the accuracy and consistency of the data.

Data lineage can be applied to different aspects of the data lifecycle. One can track where the data is stored (on-premises, data warehouse, data lake, cloud), where the data comes from (source systems), who is responsible for the data, and who uploaded, changed or downloaded it. In this article, we introduce a solution for a data lineage application. We focus on column-level lineage, i.e., the metadata of column-wise operations performed on a dataset. Many available tools cover data lineage at a higher level (the origin of data and the transformation steps applied to data tables), but column-wise data lineage implementations are fairly hard to find. An easy-to-use data lineage solution also matters to organizations because it directly affects the efficiency, transparency, and overall effectiveness of data management.

Why is Data Lineage Important?

Data lineage is crucial for understanding and managing data flows. Here’s an example scenario where data lineage is essential. Consider a large retail company with stores across multiple regions. The inventory is managed using a complex system that involves data from various sources. The company wants to optimize its inventory management process to reduce stockouts and overstock situations. The company plans to implement a data analytics solution to forecast demand and automate replenishment orders. The importance of data lineage in this scenario is described below.

  1. Understanding Data Provenance: Data lineage helps trace the origin and transformation of data. For instance, knowing which data sources the inventory data used for analysis originates from helps ensure accuracy and reliability.
  2. Identifying Data Quality Issues: By tracing data lineage, the company can identify potential data quality issues. For example, discrepancies between the inventory levels reported by a source system and the actual stock in the warehouse may indicate data synchronization problems.
  3. Compliance and Audit Requirements: Data lineage provides transparency into data processes, facilitating compliance audits and regulatory reporting.
  4. Root Cause Analysis: When issues arise, such as incorrect inventory forecasts leading to stockouts, data lineage helps conduct root cause analysis. By tracing the flow of data from source to destination, the company can identify where errors occurred and take corrective actions.

Available Tools

There are several tools available in the market that facilitate data lineage tracking and visualization.

  • Apache Atlas is a scalable and extensible open-source metadata repository that provides governance capabilities for organizations to classify, manage, and govern their data assets.
  • Collibra provides a comprehensive data governance platform, including features for data lineage. It allows organizations to document, visualize, and understand the flow of data across their systems. Collibra is known for its user-friendly interface and robust governance capabilities.
  • Informatica Enterprise Data Catalog offers a metadata-driven solution for data cataloging and governance. It provides data lineage capabilities to help organizations understand how data is created, transformed, and consumed.
  • dbt (data build tool) is commonly used in the data analytics and data engineering space to transform raw data into analysis-ready datasets. It primarily focuses on data transformations but also plays a role in documenting and managing data lineage.
  • OpenLineage defines a standardized format for representing metadata related to data lineage. Using OpenLineage ensures consistency and interoperability across different tools and systems.
  • Marquez is an open-source metadata service for managing and representing the lineage of datasets. It integrates seamlessly with OpenLineage, which forms the backbone for tracking data lineage.

Differentiators

In the solution presented here, we have used OpenLineage and Marquez along with Python’s pandas library. This combination offers a powerful solution for managing data lineage in a data-driven environment. The advantages of using OpenLineage with Marquez and Python’s pandas for data lineage include:

  • Ease of Use: provides a simpler integration, especially for Python-centric environments. It is well-suited for data scientists and analysts familiar with Python.
  • Cost: open-source solutions generally have lower upfront costs. AWS costs depend on usage but can be optimized based on resource utilization.
  • Flexibility and Customization: offers flexibility and customization, especially in Python-centric environments. It can be extended and modified based on specific needs.
  • Integration with Cloud Services: well-integrated with AWS services, leveraging cloud-native capabilities.
  • Scalability: can scale with AWS services, suitable for different data processing volumes.
  • Tool Compatibility: OpenLineage and Marquez are designed to work with various tools and platforms. This interoperability ensures that your data lineage solution can be integrated with other data management tools, making it part of a broader ecosystem.

Open Source Data Lineage Solution

Our goal is to combine three open-source components — pandas’ core DataFrame, the OpenLineage standard, and Marquez — into an easy-to-use column-level data lineage application allowing us to track every change made to a pandas DataFrame and then store and visualize that change in Marquez.

The solution leverages the object-oriented nature of Python. We extend the pandas DataFrame class and override its data manipulation functions. Each overriding function posts the relevant data lineage information to the OpenLineage API and then forwards the request to the original function. This behavior is added by applying function decorators to the relevant functions. Not every pandas DataFrame operation needs to be overridden individually. Instead, overriding a few higher-level functions that pandas calls implicitly during data manipulation is enough to catch all of these operations.
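The decorator pattern described above can be sketched roughly as follows. This is not the code of the pandas_lineage library: the names `track_lineage`, `LineageDataFrame`, and `LINEAGE_LOG` are hypothetical, and an in-memory list stands in for the call to the OpenLineage API.

```python
import functools

import pandas as pd

# Hypothetical in-memory sink; a real implementation would post each entry
# to the OpenLineage API instead of appending it to a list.
LINEAGE_LOG = []


def track_lineage(method):
    """Record a lineage entry, then forward the call to the wrapped method."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        LINEAGE_LOG.append(
            {
                "operation": method.__name__,
                "input_columns": [str(c) for c in self.columns],
            }
        )
        return method(self, *args, **kwargs)

    return wrapper


class LineageDataFrame(pd.DataFrame):
    """DataFrame subclass (illustrative name) whose key methods are decorated."""

    @property
    def _constructor(self):
        # Make pandas return LineageDataFrame objects from its operations.
        return LineageDataFrame

    @track_lineage
    def __getitem__(self, key):
        return super().__getitem__(key)

    @track_lineage
    def join(self, other, *args, **kwargs):
        return super().join(other, *args, **kwargs)


df1 = LineageDataFrame({"key": ["a", "b"], "x": [1, 2]})
df2 = LineageDataFrame({"y": [3, 4]})
joined = df1[["key", "x"]].join(df2)
operations = [entry["operation"] for entry in LINEAGE_LOG]
```

Because pandas may call `__getitem__` internally, the log can contain more entries than the two explicit calls. That is the reason overriding a few central functions is enough to observe many operations.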

In the example shown in Figure 1, two DataFrames df1 and df2 are created and joined. When the join() function is called in line 13, the code first calls the __getitem__ function defined in our pandas_lineage library ①. Our __getitem__ function is decorated and overrides pandas' __getitem__ function, so that the corresponding lineage is parsed and posted ② ③ before pandas' __getitem__ function is executed ④. In this case, it gets the columns required for the join (df1[["key"]]).

Next, the join() function itself is called (.join(df2)) ⑤. Our join() function calls pandas' join() ⑥, which in turn calls __getitem__ several times ⑥ ⑦. This results in the parsing and posting of the corresponding lineage as well as the joining of the two DataFrames.

Figure 1: Creation and join of two DataFrames

The resulting lineage graph is displayed below.

Figure 2: Data lineage graph in Marquez
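For orientation, a lineage event for a join like this one might look roughly like the sketch below. The top-level fields and the `columnLineage` dataset facet follow the OpenLineage event model. The namespace, job name, dataset names, column names, and producer URI are hypothetical, and the spec versions in the schema URLs are assumptions that should be checked against the current OpenLineage specification.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical producer URI identifying the code that emits the event.
PRODUCER = "https://example.com/pandas-lineage-sketch"


def build_join_event(namespace="pandas-demo"):
    """Build an OpenLineage-style run event describing a join of df1 and df2."""

    def dataset(name):
        return {"namespace": namespace, "name": name}

    # Column-level lineage: each output column lists the input columns it
    # was derived from. Column names here are illustrative only.
    column_lineage = {
        "_producer": PRODUCER,
        # Spec version in this URL is an assumption.
        "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json",
        "fields": {
            "key": {
                "inputFields": [
                    {"namespace": namespace, "name": "df1", "field": "key"}
                ]
            },
            "value": {
                "inputFields": [
                    {"namespace": namespace, "name": "df2", "field": "value"}
                ]
            },
        },
    }

    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        # Spec version in this URL is an assumption.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": "join_df1_df2"},
        "inputs": [dataset("df1"), dataset("df2")],
        "outputs": [
            {**dataset("joined"), "facets": {"columnLineage": column_lineage}}
        ],
    }


event = build_join_event()
print(json.dumps(event, indent=2))
```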

Deployment on AWS

As many of our customers are seeking a column-level data lineage solution integrated with a lineage visualization tool on AWS, we have designed an architecture in which our data lineage solution using Python’s Pandas can run and communicate with Marquez to present data lineage. OpenLineage connects producers of data lineage (Python’s Pandas running on an AWS service) with consumers (Marquez) that support the OpenLineage standard.

Architecture Overview

Creating an AWS architecture for data lineage with OpenLineage, Marquez, and Python Pandas involves orchestrating a couple of AWS services to manage metadata, run Python scripts, and visualize the lineage. The image below shows a high-level architecture for the solution.

Figure 3: Deployment of the data lineage solution on AWS

This architecture aims to provide a scalable, secure, and managed environment for running Python Pandas jobs, tracking data lineage, and exposing APIs.

Amazon ECS for Pandas jobs and OpenLineage Integration: Containerize Python Pandas jobs and deploy them on ECS. Package the Pandas code and its dependencies, including the OpenLineage SDK and the Marquez Python client, into the container image. Set environment variables for the Marquez API endpoint and credentials. Python Pandas jobs are triggered by events, such as file uploads to S3 or API calls to API Gateway. Emit metadata using the OpenLineage SDK within Python Pandas jobs.
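As a rough illustration of the configuration step, the sketch below reads the endpoint and an optional API key from environment variables and prepares a POST to Marquez's lineage endpoint (`/api/v1/lineage`). To stay dependency-free it uses plain HTTP from the standard library rather than the OpenLineage SDK. The variable names `OPENLINEAGE_URL` and `OPENLINEAGE_API_KEY` mirror those read by the OpenLineage Python client, but treat them, the default URL, and the helper names `build_lineage_request` and `emit` as assumptions for your own deployment.

```python
import json
import os
import urllib.request


def build_lineage_request(event, env=os.environ):
    """Prepare (but do not send) a POST request carrying one lineage event."""
    # Assumed variable name; the fallback is Marquez's usual local API port.
    base_url = env.get("OPENLINEAGE_URL", "http://localhost:5000").rstrip("/")
    headers = {"Content-Type": "application/json"}

    # Only attach credentials when a key has been configured.
    api_key = env.get("OPENLINEAGE_API_KEY")
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"

    return urllib.request.Request(
        url=f"{base_url}/api/v1/lineage",
        data=json.dumps(event).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def emit(event, timeout=10):
    """Send the event to Marquez and return the HTTP status code."""
    request = build_lineage_request(event)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending makes the configuration logic easy to check without a running Marquez instance. Inside an ECS task, the two variables would typically be supplied through the task definition.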

Marquez Setup on AWS: Deploy Marquez on AWS, which can be done using EC2 instances or container services like Amazon ECS or EKS. The architecture shown here deploys Marquez on Amazon ECS. OpenLineage tracks data lineage metadata and emits it to Marquez. Marquez maintains a metadata store with information about data sources, transformations, and destinations.

Amazon RDS for Marquez: Use RDS for managing Marquez’s metadata database.

AWS Step Functions for Workflow Orchestration (optional): If your data processing involves multiple steps or dependencies, consider using AWS Step Functions for workflow orchestration. Define state machines that coordinate the execution of the individual processing steps (for example, ECS tasks or Lambda functions) and manage the flow of data.

Elastic Load Balancing (ELB): Use an Application Load Balancer (ALB) to distribute incoming traffic to Marquez ECS services. Configure the ALB to route traffic to different ECS services based on the path or domain.

API Gateway Exposure: API Gateway exposes RESTful APIs for accessing data lineage information. Implement AWS WAF to protect API Gateway from common web exploits.

User Authentication and Authorization: Cognito handles user authentication and authorization. API Gateway endpoints are secured using Cognito user pools.

Considerations

The suggested architecture aims at providing a scalable, secure and managed environment for running Python jobs, emitting metadata to Marquez and exposing APIs through an API Gateway. When implementing a similar solution, consider the following topics:

  • Security: Implement secure communication between services using HTTPS. Regularly review and update IAM roles and permissions.
  • Scalability: Utilize AWS services like ECS and Auto Scaling to handle varying workloads.
  • Monitoring and Logging: Enable CloudWatch logs and alarms for proactive monitoring. Implement centralized logging for effective troubleshooting.
  • Cost Optimization: Use AWS Cost Explorer to analyze and optimize costs. Consider leveraging serverless options (e.g., Fargate, Lambda) for cost efficiency.

Follow AWS best practices, especially regarding security and scalability, and adjust the details based on your specific requirements and performance considerations.

Limitations

While Marquez provides a basic level of visualization for data lineage, it might lack some advanced features available in commercial data lineage tools. Marquez cannot visualize column-level lineage properly; it only shows column-level lineage as JSON-formatted information. Moreover, the solution may have limitations in capturing real-time lineage, whereas specialized tools might offer better support for it.
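Until column-level lineage is visualized directly, the JSON can still be read programmatically. The sketch below flattens a `columnLineage` facet into source-to-target column pairs. The sample dataset, its column names, and the helper name `column_edges` are hypothetical.

```python
def column_edges(output_dataset):
    """Flatten a dataset's columnLineage facet into (source, target) pairs."""
    facet = output_dataset.get("facets", {}).get("columnLineage", {})
    edges = []
    for output_column, detail in facet.get("fields", {}).items():
        for source in detail.get("inputFields", []):
            edges.append(
                (
                    f'{source["name"]}.{source["field"]}',
                    f'{output_dataset["name"]}.{output_column}',
                )
            )
    return edges


# Hypothetical output dataset as it might appear in a lineage event.
sample = {
    "namespace": "pandas-demo",
    "name": "joined",
    "facets": {
        "columnLineage": {
            "fields": {
                "key": {
                    "inputFields": [
                        {"namespace": "pandas-demo", "name": "df1", "field": "key"}
                    ]
                },
                "value": {
                    "inputFields": [
                        {"namespace": "pandas-demo", "name": "df2", "field": "value"}
                    ]
                },
            }
        }
    },
}

for source, target in column_edges(sample):
    print(f"{source} -> {target}")
```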

Conclusion

By combining OpenLineage, Marquez, and Python Pandas, you can build a robust data lineage solution that not only captures metadata but also provides valuable insights into the flow and transformation of data within your organization. This can lead to improved data quality, better collaboration, and enhanced decision-making processes. Whether to choose OpenLineage with Marquez and Python Pandas or another solution depends on factors such as organizational preferences, specific use cases, integration requirements, budget constraints, and the expertise of the data team. It is advisable to perform a thorough evaluation and, if possible, conduct a proof of concept to assess how well a particular solution aligns with your organization’s needs.

You can contact us for further details.
