How the CDAP team designed a tool to track the journey of data in enterprise production environments.
Part 1 of 2
This is a follow-up to a previous blog, in which Tony Hajdari described the current Field Level Lineage (FLL) feature in CDAP and its role in the Data Integration journey.
In this two-part blog, I will take you through the use cases and personas, as well as the challenges and trade-offs we considered when designing the updated version of FLL.
— — — — — — — —
Let’s start by talking about the data in the enterprise space.
It is well established that data is one of the most important assets of an enterprise, and its management requires special consideration.
It is critical for enterprise users to have the ability to track the architecture, processes, quality and security of the data throughout its journey.
Enterprise users need insights into how any dataset is created, processed and managed: from understanding which transformations have been applied by which pipelines, to tracking how the company’s internal policies around data and security are being implemented and maintained.
All the workflows and activities around gathering and managing information on data are commonly referred to as Data Governance, arguably one of the greatest challenges in the Big Data space.
Data Governance focuses on the processes and policies around the data journey to ensure the security, compliance and availability of the data.
Many of the needs of data governance can be met through data lineage, which can be described as the visualization of the journey of data and the transformations applied to it over time.
In that context, lineage provides a way to trace where the data came from and where it is going. Users can use that view to analyze:
- The root cause of a particular data event
- The impact of that event
To understand lineage from the perspective of a user, we need to look at the data journeys and the use cases around it.
In a typical production environment, data is processed and transformed by hundreds of pipelines. These pipelines have common dependencies, such as datasets, resources and schedules.
An example of a data journey may be the following: an unstructured dataset is read from data storage by pipeline A. Transformations to structure and clean the data are applied, and a new dataset is created. That new dataset is then used by pipelines B, C and D, which apply other transformations and write other datasets that in turn are used by other pipelines, and so forth.
This process can be even more complex when a single dataset is derived from multiple data sources.
In large production environments, understanding where the data originated and what its downstream impact is becomes crucial to tasks such as debugging production issues, predicting the impact of a change in the data, and detecting data quality issues.
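A data journey like the one above can be modeled as a directed graph: datasets are nodes, and each edge records the pipeline that derived one dataset from another. Root cause analysis is then a walk upstream, and impact analysis a walk downstream. The sketch below is a minimal, hypothetical illustration of that idea (the class, dataset and pipeline names are invented for this example, not part of the CDAP API):

```python
from collections import defaultdict, deque

class LineageGraph:
    """A minimal dataset-level lineage graph. Each edge points from a
    source dataset to a derived dataset, labeled with the pipeline
    that produced the derivation."""

    def __init__(self):
        self.downstream = defaultdict(list)  # dataset -> [(pipeline, derived dataset)]
        self.upstream = defaultdict(list)    # dataset -> [(pipeline, source dataset)]

    def record(self, source, pipeline, target):
        """Record that `pipeline` read `source` and wrote `target`."""
        self.downstream[source].append((pipeline, target))
        self.upstream[target].append((pipeline, source))

    def _walk(self, start, edges):
        # Breadth-first traversal collecting every (from, pipeline, to) hop.
        seen, queue, hops = {start}, deque([start]), []
        while queue:
            node = queue.popleft()
            for pipeline, neighbor in edges[node]:
                hops.append((node, pipeline, neighbor))
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return hops

    def impact(self, dataset):
        """Everything derived, directly or transitively, from `dataset`."""
        return self._walk(dataset, self.downstream)

    def root_cause(self, dataset):
        """Everything `dataset` was derived from, directly or transitively."""
        return self._walk(dataset, self.upstream)
```

With the journey from the example (pipeline A producing a clean dataset consumed by pipelines B and C), `impact("raw_events")` would surface every downstream dataset affected by a change to the raw data, while `root_cause("metrics")` would trace a suspect dataset back to its origin.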
Personas and use cases
After looking at how data journeys through a production environment, let’s take a look at the different roles within the enterprise world that interface with data.
As part of our design process, the CDAP team took into consideration five different roles, and focused on their use cases and pain points.
The goal was to create a design that answered relevant data-related questions, and solved real pain points.
Data Stewards
A person in the role of Data Steward is in charge of creating policies around data. Their role is to track how data is handled and verify that the data policies are correctly implemented.
In our data journey example, Data Stewards will want to track down how any of the datasets were processed, as well as be able to track the transformations of any given field.
Data Stewards worry about:
- Data Compliance: to know if the data is compliant with company standards and policies.
- Impact Analysis: to get insights on how any data changes affect the overall production environment.
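Tracking the transformations of a single field, as Data Stewards need to do, is what distinguishes field-level lineage from the dataset-level view. One way to picture it is a mapping from each output field to the operation and input fields it was derived from, which can then be traced recursively. This is a hypothetical sketch (the datasets, fields and operation names are invented for illustration, not CDAP’s internal representation):

```python
# Each (dataset, field) output maps to the operation that produced it
# and the (dataset, field) inputs it was derived from.
field_lineage = {
    ("customers_clean", "full_name"): {
        "operation": "concat",
        "inputs": [("customers_raw", "first_name"),
                   ("customers_raw", "last_name")],
    },
    ("customers_clean", "email"): {
        "operation": "lowercase",
        "inputs": [("customers_raw", "email")],
    },
}

def trace_field(dataset, field):
    """Recursively collect every (source dataset, source field, operation)
    hop that contributed to the given field."""
    hops = []
    entry = field_lineage.get((dataset, field))
    if entry:
        for src_dataset, src_field in entry["inputs"]:
            hops.append((src_dataset, src_field, entry["operation"]))
            hops.extend(trace_field(src_dataset, src_field))
    return hops
```

Tracing `full_name` in this toy example reveals both raw fields and the `concat` operation applied to them, which is exactly the kind of answer a Data Steward needs when verifying that a policy (say, around combining personal data) was implemented correctly.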
Data Compliance Officers
Data Compliance Officers hold the role of performing audits of the data to ensure its compliance. They validate how specific fields are computed and whether the appropriate transformations have been applied.
Data Compliance Officers focus on activities such as:
- Regulatory Reporting: to track the path the data, down to a single field, has taken.
- Data Compliance: to understand how specific data is calculated and handled.
Data Engineers
Data Engineers are in charge of building, maintaining and managing data pipelines. As part of their workflow, they may be tasked with enhancing a data pipeline, for example by adding enrichment to a field. They need to forecast the impact of such changes: which downstream pipelines may be affected, and how to prevent any downstream breakage.
In a troubleshooting scenario, where a pipeline is failing, Data Engineers need insights into any upstream activity, including changes in the schema and any transformations that might have affected the production environment.
In these scenarios, Data Engineers’ top priorities are:
- Impact and Root Cause Analysis: to get insights on how any data changes affect the overall production environment.
Data Scientists
Data Scientists build models to predict a number of behaviors. The accuracy of their predictions depends on the quality of the data they use, so selecting the right dataset to build and test their models is critical.
Data Scientists worry about:
- Trustworthiness of the data: to determine if a certain dataset is right for the planned purpose by looking at its origin and processing. Similarly, on a field level, Data Scientists want to understand how a single field has been generated, how recently it has been updated, and whether it is suitable for their consumption.
Data Analysts
Data Analyst is an umbrella term that describes many different roles. At a macro level, Data Analysts have to derive insights from the data. They are also responsible for interfacing with stakeholders and customers to deliver structured data for reporting purposes. If there are inconsistencies in the data delivered, those differences have to be reconciled.
For example, a stakeholder notices that one of the fields in a dataset delivered as part of a report doesn’t contain the expected data. The Data Analyst will want to trace back the data journey to understand the root cause of the discrepancy.
In this scenario, Data Analysts think about:
- Data quality triage: to understand whether a detected data quality problem originates in the raw, original data or in the processes that were applied to it.
In the enterprise world, the journey of the data is complex and often messy. It is common to have hundreds of pipelines in a production environment. These pipelines share data and transform it for a number of different purposes.
The complex architectures of these environments make it important to have a detailed view of all the relations and dependencies across pipelines and datasets.
As part of the design process, the CDAP team took into consideration enterprise roles and analyzed their workflow. We wanted to understand goals, questions and pain points for each of the roles and use that information to design features that addressed them.
In part 2 of this blog, I will describe the challenges we encountered and the considerations that went into the design of the new iteration of Field Level Lineage in CDAP.