Data Lineage: State-of-the-art and Implementation Challenges
Background
In this day and age, data is highly sought-after. From companies to individuals, everyone is actively gathering data. Indeed, with the right analysis techniques, data is more than a jumbled corpus of numbers. But what's the use of hundreds of petabytes of data if it's not accompanied by its metadata? Without metadata, you'll struggle to generate reports, clean the data, manipulate it, or even run a simple join query. Metadata is key to maximizing data utilization.
So what exactly is metadata? Well, there are several kinds, ranging from simple metadata such as data type, column name, row count, total size, and file format, to more complex metadata such as business context, business domain, semantics, annotations, column-level security, and lineage. The simplest explanation I've found is that metadata is "data about data". Metadata tells its users what they need to know to understand the information encapsulated inside the data.
While lineage is not really a novel concept, dating back to research by Yingwei Cui in 1997 [1], its adoption in industry isn't necessarily mature. I think there are several reasons why. Small or newly built organizations don't gain much value from having lineage metadata, especially compared to the financial expenses and resources incurred in building data provenance tracing. Individuals also rarely have any need for lineage, further lowering the adoption of this kind of metadata.
Lineage, also referred to as provenance, describes where data comes from, how it was derived, and how it was updated over time [2]. By that definition, it's now obvious why it hasn't been widely adopted yet. Individuals and smaller organizations usually have simple data pipelines, eliminating the need for such metadata. However, as a business grows, the amount of data being processed and analyzed grows with it. More complex pipelines that fetch data from hundreds of microservices and perform heavy transformations start to emerge. Only in that situation does lineage metadata become essential.
Lineage can be useful in a variety of settings. For example, it can be used to verify data originality by tracking the original sources. It can help fix faulty data in a data warehouse by back-tracking the transformation steps to figure out which step produced the error. A data analyst team can use lineage to understand the bigger picture, such as the business domain or context. And lineage can help a data engineering team investigate and potentially improve pipeline efficiency by exposing data dependencies and relationships.
Even with the numerous benefits of lineage, the industry still feels reluctant to integrate data provenance into its pipelines. Aside from the economics of development versus utility, the lack of an industry standard is also a big issue. And there are several implementation challenges that may occur during, or even before, developing a provenance tracker. In this post I'll try to explain all of this from my perspective, including my personal experience of developing a lineage tracker at Blibli.com.
Current State of Industry Standard Lineage Tracing
With more and more people paying attention to data lineage recently, several actively developed open source tools and libraries have begun to show up, making ground-breaking changes and tempting adoption.
Apache Atlas
The first one on the list is Apache Atlas, an open source project that provides open metadata management and governance capabilities, as stated on its homepage. Basically, Atlas is a tool to manage metadata, especially for the purpose of data cataloging. Atlas was originally intended to support Hadoop, but it also provides support for other systems such as Hive, Sqoop, and Kafka.
One big disadvantage of Atlas is the lack of hooks for relational databases such as Postgres and MySQL. Another thing to keep in mind is that Atlas is its own system, separate from your existing environment, meaning you need extra computing resources to run it. If you simply want to track lineage, Atlas is overkill. After researching this system, I found the lack of integration and the rigid design unfavorable.
Airflow Lineage Backend
Anyone familiar with ETL must, at some point, have come across Airflow. Airflow is a workflow executor in which you design tasks to be executed as a directed acyclic graph, also known as a DAG. Airflow is fairly popular for executing batch data processing. With lots of integrations, various plugins, and a growing community, Airflow is the best bet both for anyone just learning about data warehousing and for businesses with hundreds of services and data sources.
Starting from version 1.10.0, Airflow supports tracking lineage automatically via a lineage backend. This lineage backend is pluggable, meaning you can develop your own backend and integrate it with Airflow; Apache Atlas is supported as one such backend. To use this feature, you simply configure each task's inlets and outlets parameters. The inlets parameter can even be set to automatically pick up the outlets of upstream tasks, as the sketch below shows.
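For illustration, here is a minimal sketch of that configuration, based on the Airflow 1.10 lineage docs; the datasets, paths, and tasks are all hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.lineage.datasets import File
from airflow.operators.bash_operator import BashOperator

# Hypothetical input and output datasets, for illustration only.
f_in = File("/data/raw/orders.csv")
f_out = File("/data/staging/orders.csv")

dag = DAG(
    dag_id="lineage_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
)

extract = BashOperator(
    task_id="extract",
    bash_command="echo extracting",
    dag=dag,
    inlets={"datasets": [f_in]},
    outlets={"datasets": [f_out]},
)

load = BashOperator(
    task_id="load",
    bash_command="echo loading",
    dag=dag,
    # "auto" tells Airflow to pick up the outlets of upstream tasks.
    inlets={"auto": True},
)

extract >> load
```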
However, these advantages come with disadvantages. The lineage feature in the 1.10.x versions of Airflow is experimental, subject to change, and still lacking features. The data recorded by Airflow was also not rich enough for my needs.
Marquez
Next up is Marquez, a project in incubation at LF AI & Data. Marquez is an open source metadata service for the collection, aggregation, and visualization of metadata. It mainly supports data lineage, but it's also a powerful tool for data discovery and data governance. Marquez uses an open source data lineage standard called OpenLineage, through which it allows its users to define their own metadata structures using something called facets.
Marquez supports metadata versioning right out of the box, and it has a simple and abstract data model. All data storage in Marquez is referred to as a dataset; a dataset is produced by a job, and one execution of a job is called a run. A run contains execution-specific configuration, while a job defines the overall configuration. A dataset produced by a job is automatically assigned a version ID based on its schema.
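To make that model concrete, here is a hedged sketch of reporting one run to Marquez through its OpenLineage endpoint. I'm assuming a local Marquez instance on its default port, and the namespace, job, and dataset names are made up.

```python
import uuid
from datetime import datetime, timezone

import requests  # third-party HTTP client

MARQUEZ_URL = "http://localhost:5000/api/v1/lineage"  # assumed local instance

# An OpenLineage "COMPLETE" event: one run of a job that read one
# dataset and produced another. All names are purely illustrative.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-lineage-tracker",
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my-namespace", "name": "etl.orders_daily"},
    "inputs": [{"namespace": "postgres://warehouse", "name": "raw.orders"}],
    "outputs": [{"namespace": "postgres://warehouse", "name": "staging.orders"}],
}

requests.post(MARQUEZ_URL, json=event).raise_for_status()
```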
The biggest disadvantage of Marquez is its infancy. It is still a very immature project, and a lot of things are not supported yet. For example, Marquez's library has standards for naming a dataset and its source, and several data storage systems are natively supported, such as Postgres, MySQL, and MongoDB. However, if you use any other storage system, you'll have to create the naming rule yourself, and there is no clear guideline on this naming convention. Another disadvantage is the limited functionality of the current REST API, which felt almost useless in my experience.
OpenLineage
OpenLineage is an open source framework for sending lineage metadata between services. It is the standard used by Marquez and many other systems such as Apache Atlas, Amundsen, and Egeria. If you read the documentation, OpenLineage seems very tightly connected to Marquez. But don't worry: OpenLineage is meant to be a generalized framework for anyone to use.
By itself, OpenLineage doesn't provide many features, if any. But from the perspective of an industry standard, this is a good thing to have, because more and more data lineage tools will show up in the future and we need to standardize how they exchange metadata.
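The facets mentioned earlier are the extension point of this standard: a facet is just a JSON object attached to a run, a job, or a dataset. Here is a sketch of a dataset carrying a custom facet; the facet name and its fields are invented.

```python
# A custom facet lives under a "facets" key. By convention, "_producer"
# identifies who produced it and "_schemaURL" points at its schema;
# everything else is user-defined. All values below are invented.
output_with_facet = {
    "namespace": "postgres://warehouse",
    "name": "staging.orders",
    "facets": {
        "ownership": {
            "_producer": "https://example.com/my-lineage-tracker",
            "_schemaURL": "https://example.com/schemas/OwnershipFacet.json",
            "team": "data-engineering",
        }
    },
}
```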
Moving Forward
After analyzing several data lineage tracing tools and frameworks, one can't help but say that the ecosystem is a mess. There is a lot of inconsistency, especially in tracing methods, because of the variety of data movement techniques: some systems use query-based transformations, some use APIs, and some handle streaming data. Moving forward, I think we have to advocate more for standards and abstractions, on the basis of compatibility and applicability across multiple ETL executors.
Challenges
There are a lot of challenges you may face when developing a lineage tracker. Knowing these challenges will help you understand what you need to prepare before starting.
Let's start with granularity. Tracking data movement between tasks is easy; that's how Airflow is able to track lineage just by knowing task dependencies. But have you ever wondered how to track movement from one row to another? Or from one (row, column) cell to another? I think you get the idea: granularity is what makes developing a lineage tracker a headache. Therefore, it's always a good idea to set a hard limit on granularity before starting your tracker project. Personally, I've developed lineage capable of tracking table-level movement, and I think most of the industry only needs that much detail anyway.
Second is the lack of a standard. As I explained before, the lack of a standard among lineage tools is a big issue. Even if we are capable of developing our own in-house lineage tracker, we would still prefer an off-the-shelf system, since it frees us from the burden of updating and maintaining our own code. The currently available open source tools aren't ready for production, but who knows what we'll have one or two years down the line.
Next up is the variety of data sources and transformation techniques. If granularity already gives you a lot of trouble, get ready for this one. DBAs have the liberty to choose whatever database or data storage they want, and that's a great thing too: more choice creates more competition, which in turn fuels innovation. It is very rare for an organization to use just one type of database; you'll usually find numerous kinds of data storage systems, from relational to columnar, from document-based to graph. For a lineage tracker, the data source usually dictates the data transformation technique. So the more varieties of database an organization uses, the more difficult it is to fully track all of its data movement.
Data is moved from point A to point B by something, usually a script or a system, doing what is commonly referred to as ETL: extract, transform, and load. So another thing to keep in mind is the data movement, or ETL, technique used in your company, because different techniques will obviously require different provenance tracking methods.
Finally, one of the biggest issues in tracking lineage is keeping it updated. Timeliness is a metric that measures the recency of something. In lineage tracking, timeliness is important because you want to capture lineage changes as soon as possible to avoid incorrect or stale metadata. Lineage usually changes when an ETL script is modified, so running the lineage tracker with every execution of the ETL is wasteful, while running it periodically produces stale metadata. Currently at Blibli.com, lineage is refreshed on every ETL deployment. With that trick, even if we can't fully eliminate stale metadata, we can reduce it to a minimum.
Personal Experience of Developing Lineage Tracker
During my study at Brawijaya University, I got the opportunity to join an internship at Blibli.com. Hungry for a challenge, I chose the Data Engineering department despite having only a very basic understanding of data warehousing, ETL, and related technology. It was truly a challenge, but with the help of my supervisor, my four months working with the team were a very smooth journey.
During the internship program, I was given a big project. As you've guessed, I was tasked with developing a tool to track data movement in the company's data warehouse. At that time I had only just stumbled upon the concept of lineage, and I was immediately given a task related to it. Since I was up for the challenge, I accepted and soon began development.
The data lineage was needed for understanding data relationships and dependencies. The users of this tool inside Blibli.com are Data Engineers (DE) and Data Analysts (DA). For DEs, lineage enables them to understand data dependencies and potentially improve the existing ETL scripts. For DAs, lineage metadata helps them understand the relationships between data inside the data warehouse and its context on the operational or business side.
Inside the DE team, luckily, almost all data is transformed with the help of Airflow; the exceptions are streaming data, which are processed by a different system. Airflow is a workflow executor that manages workflows, or task dependencies, using a directed acyclic graph (DAG). The DAGs that exist inside the Airflow machine are parseable and are loaded into Airflow's internal backend database. By reading these DAG definitions, we technically already have a lineage. But this task-level granularity is obviously not enough, useless even, I ought to say.
At Blibli.com, the data warehouse consists of three layers. First is the raw data layer, where data from operational systems is stored as-is with minimal changes in schema. This raw layer shields the operational databases from the huge and complex warehousing queries issued by ETL tools. After the data is stored in the raw layer, it is moved by Airflow to the next layer, called staging. In this layer, data is cleaned of duplicates, and the schema is usually kept the same, although in some cases it is altered a bit. The final layer is the data mart, where ready-to-use data is stored for data analysts to access using BI tools or ad-hoc queries.
With that layering system in mind, I decided to increase the granularity to table level. Knowing which table is extracted from and which table the data is loaded into is easy: all tasks in a DAG are instantiated with all sorts of parameters, and through those params I had access to the transformation query and the destination table. The transformation query contains references to the input table names, so by parsing it we get the list of input tables. As long as we have the input and output (destination) tables, generating lineage is possible.
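A rough sketch of that idea, assuming a hypothetical operator that exposes the transformation query as a sql attribute and the target as a destination_table attribute (the real parameter names depend on the operators in use):

```python
from airflow.models import DagBag

def table_level_lineage(extract_input_tables):
    """Walk every DAG and map each destination table to its input tables."""
    lineage = {}
    for dag in DagBag().dags.values():
        for task in dag.tasks:
            # Hypothetical attribute names; real operators expose the query
            # and target table under operator-specific parameters.
            sql = getattr(task, "sql", None)
            destination = getattr(task, "destination_table", None)
            if sql and destination:
                lineage[destination] = extract_input_tables(sql)
    return lineage
```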
However, I want to highlight an issue I stumbled upon while integrating a parser to extract the list of input tables from a query. Because our queries are complex and rely heavily on temporary result sets (CTEs) created using the WITH clause, the parsers somehow failed to find the correct input tables: they either detected a CTE as a table, which it is not, or simply failed and returned nothing. I tried several different open source query parsers I could get my hands on, but they all failed to do the one thing they were built to do. I was certainly frustrated at the time, but having no other options, I did what an average software developer would do: jump into the code and do it myself. It was my first time learning about and working on a parser; luckily, it only took me about one work-week to get my code working. But for reference, if you're planning to parse SQL queries to get the list of input tables, be prepared to pour in extra work to make it work properly.
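To show why CTEs trip parsers up, here is a deliberately simplified sketch: collect every name that follows FROM or JOIN, then subtract the names defined in the WITH clause. A regex like this is nowhere near a real SQL parser (it ignores strings, comments, quoting, and subquery aliases), but it illustrates the core idea.

```python
import re

# Names defined in a WITH clause, e.g. "WITH recent AS (" or ", agg AS (".
CTE_DEF = re.compile(r"(?:\bWITH\b|,)\s*([A-Za-z_]\w*)\s+AS\s*\(", re.IGNORECASE)
# Names referenced after FROM or JOIN, possibly schema-qualified.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def input_tables(sql: str) -> set:
    """Toy input-table extraction: table references minus CTE definitions."""
    cte_names = {name.lower() for name in CTE_DEF.findall(sql)}
    referenced = {name.lower() for name in TABLE_REF.findall(sql)}
    return referenced - cte_names

query = """
WITH recent AS (SELECT * FROM raw.orders WHERE order_date > '2021-01-01')
SELECT r.*, c.name
FROM recent r JOIN raw.customers c ON r.customer_id = c.id
"""
print(input_tables(query))  # {'raw.orders', 'raw.customers'}
```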
One problem arose because Blibli's operational systems use several different databases. This creates several transformation pathways, all of which I had to be able to handle as well. To solve this issue, I abstracted the transformation parser to support different mechanisms. The abstraction also makes the parser easier to update and maintain whenever a new database or a new transformation method appears.
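That abstraction can be as simple as a common interface per transformation mechanism. A minimal sketch, with class and method names of my own invention (input_tables is the helper sketched above):

```python
from abc import ABC, abstractmethod

class TransformationParser(ABC):
    """Common interface: every transformation mechanism yields the same
    (input_tables, destination_table) pair, whatever the database."""

    @abstractmethod
    def parse(self, task) -> tuple:
        """Return (input_tables, destination_table) for one task."""

class SqlQueryParser(TransformationParser):
    """Handles query-based transformations, e.g. Postgres or MySQL."""

    def parse(self, task):
        return input_tables(task.sql), task.destination_table

class CollectionExportParser(TransformationParser):
    """Handles document stores that move whole collections instead."""

    def parse(self, task):
        return {task.source_collection}, task.destination_table
```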
Another issue I found after developing the lineage tracker is its timeliness. Currently, lineage is extracted twice a week, the day after each deployment day. However, in the near future the DE team may deploy on any day of the week, depending on requirements. This creates the stale metadata problem: the ETL is updated while the lineage is not.
For the future of data lineage here, I think it's a good idea to change the mechanism to enable near-realtime lineage updates as soon as the data pipeline is updated. Also, since lineage is currently only stored as a table and not visualized, it's a bit inconvenient for the users. In the future, it would be better to show it as a diagram, providing easier access and better understanding for the users.
Conclusion
Overall, the currently available open source lineage ecosystem is still immature. Even for tools that have been developed for years, like Atlas, the lack of support for common databases is a huge drawback. Hopefully, we'll have better tools for lineage in the future.
As for developers, several implementation challenges can be solved by setting clear goals and scope. Timeliness and ease of use should also be among the utmost priorities during development: keeping the lineage updated is important, especially in an environment that evolves rapidly, and ease of use is crucial for better adoption and understanding of the metadata by its users.
References
[1] Cui, Y., Widom, J., & Wiener, J. (1997). Tracing the lineage of view data in a data warehousing environment. Stanford InfoLab.
[2] Ikeda, R., & Widom, J. (2009). Data lineage: A survey. Stanford InfoLab.