Murat Migdisoglu · Udemy Tech Blog · Aug 19, 2023

Data Quality at Udemy — Part 1

Data Lineage Demystified: Why it Matters and How to Leverage its Magic for Informed Business Success!

When I joined Udemy as a principal engineer on the data platform team in late 2020, my focus shifted toward enhancing data quality within the organization. My journey began with a comprehensive poll across the data organization, aimed at identifying the key pain points of data users.

The results of the poll were eye-opening, revealing that 78% of users considered the absence of data provenance/lineage to be a data quality issue. Furthermore, it was clear that for a vast majority of users, the inability to track data lineage was a major obstacle to impact analysis and to detecting unused columns or orphan tables in the system.

Inspired by these insights, I took the initiative to propose and launch two transformative projects. The first one, which is the subject of this article, is an ambitious initiative to establish a comprehensive end-to-end data lineage solution that will revolutionize our data ecosystem.

The second project centers around data monitoring, which will be explored in another post.

Udemy’s sophisticated data architecture revolves around a data lake fed by diverse pipelines: system logs, streaming data from services, CDC listeners for replicated service databases, and more. The backbone of data transformations lies in Hive and Spark, while Airflow takes charge of orchestrating thousands of these pipelines.

Unraveling Data Flow Complexity: Data Lineage in Growing Data-Driven Organizations

In the early stages of an organization’s data-driven journey, data lineage may not be deemed crucial. With just a few pipelines managed by a centralized team, the dependency tree of the workflow orchestration typically suffices to comprehend the relationships between data entities.

However, as the business scales up, relying on a single centralized team for all data flows becomes a bottleneck. Consequently, data organizations shift their focus toward developing tools and infrastructure to support self-service analytics.

While self-service analytics brings its benefits, it also introduces new challenges. The ownership of end-to-end data flows becomes distributed across multiple teams. This decentralization makes modifying the data flows/data entities more cumbersome, as there are now several teams responsible for downstream components.

Data lineage serves as a critical tool in facilitating change management within a self-service analytics company. By establishing a reliable lineage system, managing changes becomes significantly smoother. For instance, if a table owner wishes to modify a column, they can initiate a pull request (PR). Utilizing a custom GitHub action, the lineage services can identify all the downstream transformations impacted by the proposed change and add the corresponding teams as reviewers to the PR. This process enables relevant stakeholders to come together and engage in discussions about the modification before it is implemented, promoting a collaborative and well-informed decision-making approach.
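
To make this concrete, here is a minimal sketch of what such an automation might look like. It assumes a hypothetical lineage service endpoint that returns the owning teams of all downstream transformations for a changed table; the endpoint, the response shape, and the CHANGED_TABLE and PR_NUMBER environment variables are illustrative, while the reviewer request uses GitHub's standard REST endpoint for requesting reviewers on a pull request.

# request_reviewers.py -- sketch of the PR review automation described above.
# The lineage-service URL and its response shape are hypothetical.
import os
import requests

LINEAGE_URL = "https://lineage.internal.example.com"   # hypothetical endpoint
GITHUB_API = "https://api.github.com"

def downstream_teams(table: str) -> list[str]:
    """Ask the lineage service which teams own transformations downstream of `table`."""
    resp = requests.get(f"{LINEAGE_URL}/downstream", params={"table": table}, timeout=10)
    resp.raise_for_status()
    # Assumed response: [{"dataset": "...", "owner_team": "growth-analytics"}, ...]
    return sorted({entry["owner_team"] for entry in resp.json()})

def add_team_reviewers(repo: str, pr_number: int, teams: list[str]) -> None:
    """Request the owning teams as reviewers on the pull request via the GitHub REST API."""
    headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
    requests.post(
        f"{GITHUB_API}/repos/{repo}/pulls/{pr_number}/requested_reviewers",
        headers=headers,
        json={"team_reviewers": teams},
        timeout=10,
    ).raise_for_status()

if __name__ == "__main__":
    changed_table = os.environ["CHANGED_TABLE"]   # e.g. resolved from the PR diff
    teams = downstream_teams(changed_table)
    add_team_reviewers(os.environ["GITHUB_REPOSITORY"], int(os.environ["PR_NUMBER"]), teams)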

In addition to this, within a growing data organization, change is constant. The workforce experiences turnover with departures and new arrivals, projects face cancellations, and data sources are upgraded to ones with superior quality. In such a dynamic setting, challenges can arise concerning orphan workflows or tables. For example, modifications made to the logic of a flow responsible for generating the Gross Merchandise Volume (GMV) might have taken place months ago, yet the outdated flow could have been inadvertently neglected and left unremoved. These situations can lead to disorder and inefficiencies in the data ecosystem, highlighting the critical significance of maintaining vigilant data governance and meticulous lineage tracking.

Upon joining Udemy, I discovered that the company had already transitioned to the self-service analytics stage. However, the change management process for data was far from seamless. The conventional approach involved relying on my team, the data platform, to manually identify the impact and notify the relevant stakeholders. Unfortunately, without any dedicated tool, the only way to analyze the impact was by meticulously reviewing the source code, which proved to be both time-consuming and prone to errors.

Requirements

To alleviate the change management burden within the data organization, we embarked on the lineage project. After a series of brainstorming sessions, we outlined the key information that our lineage should encompass, including the items below (a sketch of such a lineage event follows the list):

- The source and destination of a transformation.

- The duration it took to execute the transformation.

- The Airflow DAG/task and the corresponding task run that triggered the transformation.

- The query or logical plan used in the transformation.

- The owner responsible for the transformation.
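
To make the list concrete, here is a sketch of what a single lineage event carrying these fields might look like; the dataclass and field names are illustrative rather than the exact schema we emit.

from dataclasses import dataclass, field

@dataclass
class LineageEvent:
    """One lineage event emitted per transformation run (illustrative field names)."""
    sources: list[str]        # fully qualified input tables
    destination: str          # fully qualified output table
    duration_ms: int          # how long the transformation took
    airflow_dag_id: str       # DAG that triggered the transformation
    airflow_task_id: str      # task within that DAG
    airflow_run_id: str       # concrete task run
    query: str                # SQL text or serialized logical plan
    owner: str                # team or person responsible
    extra: dict = field(default_factory=dict)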

Our Goal: An Adaptable and Extensible Data Lineage Solution

Our goal was to design an extensible lineage system capable of accommodating future growth and changes, while remaining independent of any specific data catalog or data orchestration tool. This approach ensures the project’s flexibility, making it effortless to integrate with a variety of tools and seamlessly adapt to new data catalog tools.

To achieve this level of independence, we avoid tightly coupling code with specific dependencies. Instead, we establish well-defined endpoint adaptors that act as intermediaries between the data lineage project and different data catalog and orchestration tools. As a result, supporting a new data catalog tool becomes a straightforward process, involving the addition of a custom adaptor tailored to it.

Moreover, we embrace adaptability by designing the project to gracefully handle transitions between data orchestration tools. Whether it’s switching from Airflow to Luigi or any other combination, the data lineage project remains untouched. There’s no need for any code changes, ensuring seamless migration and preserving the project’s stability.
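
As a rough illustration of this adaptor idea (the class and method names are hypothetical, not our actual interfaces), the lineage pipeline can depend only on a small abstract interface, and each catalog or graph backend gets its own concrete adaptor:

from abc import ABC, abstractmethod

class CatalogAdaptor(ABC):
    """Boundary between the lineage pipeline and any specific catalog/graph backend."""

    @abstractmethod
    def push_lineage(self, event: "LineageEvent") -> None:
        """Persist one lineage event in the backend's native format."""

class NeptuneRestAdaptor(CatalogAdaptor):
    def __init__(self, lineage_service_url: str):
        self.url = lineage_service_url

    def push_lineage(self, event):
        ...  # POST the event to the Lineage REST service backed by Neptune (Phase 1)

class DatahubAdaptor(CatalogAdaptor):
    def push_lineage(self, event):
        ...  # translate the event into Datahub's lineage model and emit it (Phase 2)

Swapping backends then amounts to instantiating a different adaptor, which is what made the later move to Datahub a matter of days rather than a rewrite.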

By prioritizing extensibility, we future-proof our data lineage solution, safeguarding it against the uncertainties of evolving data ecosystems. This approach not only enhances maintainability but also promotes the exploration of innovative tools and technologies to continuously improve our data management and analysis capabilities. With “extensibility” as our guiding principle, we embark on a journey of data empowerment and unlock the true potential of our data-driven endeavors.

Query Listeners

Despite the existence of a few promising open-source libraries available for capturing lineage information from Spark and Hive, we discovered that meeting our specific requirements, as mentioned earlier, would demand extensive customization. From my perspective, once the level of customization exceeds a certain threshold, it becomes impractical to continue working on a forked version of a project. The effort required to maintain a separate branch of an open-source project and address potential conflicts becomes more time-consuming compared to taking ownership of the entire project. In such cases, I find it preferable to utilize open-source projects as a reference to approach the problem, leverage open-source submodules as libraries if available, and ultimately develop a new project.

After several iterations and discussions with stakeholders, we concluded that it would be best to develop our own custom implementation tailored to our unique needs. This decision allowed us to have full control over the lineage capturing process and ensure that it aligns precisely with our project’s goals and objectives.

Presently, our listeners gather various metadata and performance metrics, including the details of the executed query; the Airflow DAG, task, and run ID responsible for triggering the transformation; and the transformation's owner.

Certain parameters, such as details about input and output tables or query-related information, are extracted from the query plan. Others (e.g., the owner, DAG ID, and task ID) are passed by Airflow to the Hive and Spark applications as context parameters or query variables.
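
For example, here is a rough sketch of how an Airflow task could forward those values to a Spark job through spark-submit configuration; the DAG, the spark.udemy.lineage.* property names, and the job path are illustrative rather than our actual setup, and the conf values rely on the operator's Jinja templating.

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG("daily_gmv", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    build_daily_gmv = SparkSubmitOperator(
        task_id="build_daily_gmv",
        application="/jobs/build_daily_gmv.py",
        conf={
            # Property names are hypothetical; they only need to match what the
            # query listener looks up in the Spark session configuration.
            "spark.udemy.lineage.owner": "data-platform",
            "spark.udemy.lineage.dag_id": "{{ dag.dag_id }}",
            "spark.udemy.lineage.task_id": "{{ task.task_id }}",
            "spark.udemy.lineage.run_id": "{{ run_id }}",
        },
    )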

Once this information is captured, it is asynchronously pushed to Kafka, where we maintain a single topic with three partitions. To ensure efficient consolidation of lineage events for the same output, we use the destination table identifier as the Kafka key. This strategic decision allows the lineage consumer to identify and select the latest lineage event within a time window for each table, even in cases of duplicate lineage generations.
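
On the producer side, a minimal sketch of that keying strategy could look like the following, using the confluent-kafka client; the topic name, broker address, and event shape are illustrative.

import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})

def publish_lineage(event: dict) -> None:
    """Key the message by the destination table so events for the same output
    land in the same partition and can be consolidated downstream."""
    producer.produce(
        topic="data-lineage",                    # illustrative topic name
        key=event["destination"],                # e.g. "warehouse.daily_gmv"
        value=json.dumps(event).encode("utf-8"),
    )
    producer.poll(0)   # serve delivery callbacks without blocking the query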

In case of any exceptional situation, the lineage libraries log the problem and swallow the exception so that lineage capture never fails the query itself.

Below, you can see the code blocks from our Ansible playbook responsible for enabling lineage capturing for Spark and Hive.

- name: Replace/Add spark lineage hook
  become: yes
  become_user: root
  lineinfile:
    path: /etc/spark/conf.dist/spark-defaults.conf
    regexp: '^spark.sql.queryExecutionListeners'
    line: "spark.sql.queryExecutionListeners com.udemy.dataplatform.lineage.spark.listener.SparkQueryListener"

- name: set lineage hooks for hive config
  become: yes
  become_user: root
  xml:
    path: /etc/hive/conf/hive-site.xml
    xpath: /configuration
    input_type: xml
    pretty_print: yes
    add_children:
      - "<property><name>hive.exec.post.hooks</name><value>com.udemy.dataplatform.lineage.hive.listener.LineageHook</value></property>"
  when: hive_hook_check.count == 0
  notify:
    - hive-restart

High-Level Design — Phase 1

The initial design of Udemy Lineage

In the initial stages of the project, lacking a data catalog tool capable of storing and visualizing lineage, we opted for a solution involving a graph database and a CRUD REST service. To achieve this, we employed a simple Spark structured streaming application to read and deduplicate the lineage events, subsequently pushing them to the Lineage REST Service hosted on EKS. This REST service served as the central point of access and modification for the lineage data stored in the AWS Neptune graph database.
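
A condensed sketch of what such a streaming consumer might look like is below; the topic, checkpoint path, REST endpoint, and schema handling are illustrative, and a real deployment would also need error handling and batching.

import json
import requests
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lineage-consumer").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "data-lineage")
       .load())

events = raw.selectExpr(
    "CAST(key AS STRING) AS destination",
    "CAST(value AS STRING) AS payload",
    "timestamp",
)

def push_batch(batch_df, batch_id):
    # Keep only the newest event per destination table within this micro-batch,
    # then forward it to the Lineage REST service (URL is illustrative).
    w = Window.partitionBy("destination").orderBy(F.col("timestamp").desc())
    latest = (batch_df
              .withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")
              .collect())
    for row in latest:
        requests.post("https://lineage.internal.example.com/events",
                      json=json.loads(row.payload), timeout=10)

query = (events.writeStream
         .foreachBatch(push_batch)
         .option("checkpointLocation", "s3://my-bucket/checkpoints/lineage")
         .start())
query.awaitTermination()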

Despite the absence of UI developers in our team, I took the initiative to learn React by watching lessons on Udemy.com. There are many great resources, but I followed the one from John Smilga. As a result, I successfully developed a basic React application with the ability to connect to the lineage service and visualize the data.

This design approach proved to be effective, enabling us to query the lineage through a simple web application. For more advanced operations, such as loop detection, we utilized Gremlin query language on Neptune Workbench, thereby fulfilling our requirements without the need for specialized UI developers.
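
For illustration, a cycle-detection traversal along those lines could be written with the gremlinpython client as follows; the Neptune endpoint, the table vertex label, and the feeds edge label are assumptions about the graph model rather than our actual schema.

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import P

# Neptune endpoint is illustrative; production access also needs TLS/IAM setup.
conn = DriverRemoteConnection("wss://my-neptune-cluster:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Find tables that transitively feed themselves, i.e. cycles in the lineage graph.
cycles = (g.V().hasLabel("table").as_("start")
           .repeat(__.out("feeds").simplePath())
           .until(__.out("feeds").where(P.eq("start")))
           .path()
           .limit(10)
           .toList())
print(cycles)
conn.close()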

High-Level Design — Phase 2

Udemy Lineage after Datahub

After reviewing our accomplishments, we were pleased to see that our work had been fruitful. However, we recognized that the lineage toward Redshift and BI tools was still incomplete. As we were searching for a suitable library to parse the SQL queries stored in Redshift's STL_QUERY table, an exciting development occurred: our dedicated Data Services team shared their progress with Datahub.

Thanks to our extensibility design principle, by merely replacing the adaptor of our Spark streaming application with a new one, we could seamlessly push our lineage data to Datahub. This transformation was remarkably swift, taking only a couple of days to implement.

Datahub already had plugins readily available for Looker and Redshift, which turned out to be a perfect fit for our needs. By merging the lineage we collected with the lineage obtained through Datahub plugins, we achieved a comprehensive view of the entire data lineage journey — right from the initial source, such as an RDS table, up to the final destination, such as a Looker dashboard. This integration has allowed us to achieve a complete and robust lineage tracking system, enhancing our data management capabilities significantly.

Final Words

Indeed, data lineage is an ongoing and continuous effort due to the dynamic nature of data pipelines and the ever-evolving data ecosystem. Implementing data lineage can be likened to a “Zero to One” project, as described by Peter Thiel, where the goal is to create something entirely new and transformative.

In the data-driven landscape, data is often considered as valuable as oil (I hate these cliches:) ), but it is essential to remember the principle of “Garbage in, garbage out.” The quality of the insights and decisions we derive from data depends on the quality of the data itself. Understanding the origins of data, how it flows through the system, and how it is transformed is vital to ensure accurate and reliable outcomes.

Data lineage serves as a powerful tool to track and trace data from its source to its destination, providing valuable insights into data movements, transformations, and dependencies. By leveraging data lineage, we gain a comprehensive understanding of the data’s journey, allowing us to identify potential issues, enhance data quality, and make informed decisions based on reliable information.
