How the CDAP team designed a tool to track the journey of data in enterprise production environments.
Part 2 of 2
In the first part of this blog, I discussed the complexity of a production environment and the complex journeys data takes. I also talked about the different enterprise roles whose jobs have dependencies around understanding those journeys.
In this second and final part, I will explain what CDAP is and how it currently supports data lineage, then take a deep dive into the considerations and challenges of designing Field Level Lineage.
— — — — — — — — — —
CDAP and data lineage
CDAP is a framework to create data analytics applications quickly and easily without having to worry about infrastructure and integration.
As data flows through the system, CDAP automatically captures metadata, including information related to lineage. Users are able to trace their data reliably, without having to build lineage tracking into each of their applications.
Currently CDAP provides the user with the ability to view the journey of the data from the perspective of a whole dataset.
Users can track the dependency between datasets and pipelines through a graphical representation of the dataset lineage. In this view, the user has a holistic representation of the interdependencies among datasets and pipelines. For example, users can quickly assess if a dataset is used by multiple pipelines and what datasets are impacted by it.
The graph is interactive and lets the user navigate up and down stream.
Users can browse and review the datasets and pipelines that are the root cause of, or are impacted by, a given dataset.
CDAP currently includes an alpha version of Field Level Lineage (FLL), which was reviewed in a previous blog by Tony Hajdari.
The new and improved design: CDAP Field Level Lineage
An upcoming release of CDAP will introduce a new powerful tool for FLL. The redesigned experience will give users the ability to access a granular representation of the data. In this view, the user can visually follow the journey of any given field, and easily identify the relationship between fields across different datasets.
Design process and challenges
The design process was driven by the use cases (outlined in part 1), with the goal of addressing enterprise users’ questions and pain points.
We also had to confront and manage practical challenges related to Big Data and the enterprise production environment, such as how to handle the complexity of the relations between fields and large schemas.
Creating designs that scale is hard, and it gets trickier when the design has to accommodate exponential growth.
How can you represent a schema with a hundred fields? How about a few thousand?
What if a field is derived from more than one field? What if that single field is derived from dozens of fields, each belonging to a separate dataset?
These scenarios made designing a consumable view that represents the complexity involved in lineage quite difficult.
Let’s go through some of the major challenges we had to confront and how we solved them.
Multiple lineage levels
The new FLL design had to take into account lineage structures where a dataset may have been derived from, and/or have an impact on, dozens of other datasets. The complexity is amplified when viewed at the field level.
The number of fields that are either cause or impact can grow dramatically as lineage gets into deeper levels. The complexity grows further when the originating dataset has several (possibly hundreds of) fields and the number of cause or impact datasets and related fields increases exponentially.
Tracking multiple levels of FLL felt impractical from both a design and a technical perspective. A lineage graph that represents multiple levels is too complex for most users to process. Early in the design process, we explored incorporating multiple levels (see example above). But to support real use cases, we couldn’t overlook the reality that production environments are a network of hundreds of pipelines and that lineage is very complex. The dependencies and number of fields that we needed to surface could grow dramatically from one level to the next. In fact, any field in one level could be derived from multiple datasets and fields, and each of those fields could in turn be derived from several others. This meant that four levels upstream or downstream, the list of datasets could be extremely long and complex. Limiting the number of datasets displayed made no sense, as the design would fall short of answering many of users’ questions.
Finally, there is no such thing as a “typical” number of levels in any given environment. No matter how many levels we were able to represent at one time, the view would always be incomplete.
That same complexity was also reflected in the number of computations required from the back end, which could slow down rendering and ultimately hurt the user experience.
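To make the growth concrete, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that every field is derived from the same number of fields at each level (the `fan_in` parameter is hypothetical; real lineage is far less regular).

```python
def fields_to_display(fan_in: int, levels: int) -> int:
    """Total number of upstream fields across `levels` levels for ONE target
    field, assuming each field is derived from exactly `fan_in` other fields.
    This uniform fan-in is a simplifying assumption, not a CDAP property."""
    return sum(fan_in ** level for level in range(1, levels + 1))


# One target field, each field derived from 5 others (illustrative numbers).
for levels in (1, 2, 4):
    print(levels, fields_to_display(fan_in=5, levels=levels))
```

Under these assumptions a single field has 5 related fields one level up, but 780 across four levels, and that is before multiplying by the number of fields in the target dataset.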
We had to trade off a comprehensive overview for a representation that was more focused and easily digestible.
The result is a design where the user is presented with one level upstream and one level downstream.
The view is organized from the perspective of the dataset the user is inspecting (aka the target dataset), and it displays one level of both cause and impact datasets.
The user can navigate up and down the data journey by selecting one of the non-target datasets. The selected dataset becomes the new target, and the root cause and impact lineage related to that dataset is displayed.
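The one-level-up, one-level-down view can be sketched in a few lines of Python. This is only an illustration of the idea, not CDAP's implementation: the `Field` and `LineageGraph` names, the edge format, and the sample datasets are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Field(NamedTuple):
    dataset: str
    name: str


class LineageGraph:
    """Hypothetical in-memory field lineage: edges run from source field to derived field."""

    def __init__(self, edges):
        self._incoming = defaultdict(set)   # derived field -> fields it comes from
        self._outgoing = defaultdict(set)   # source field  -> fields derived from it
        for src, dst in edges:
            self._outgoing[src].add(dst)
            self._incoming[dst].add(src)

    def view(self, target: str) -> dict:
        """Return only the edges one level upstream (cause) and one level
        downstream (impact) of the target dataset."""
        cause, impact = set(), set()
        for dst, sources in self._incoming.items():
            if dst.dataset == target:
                cause.update((src, dst) for src in sources if src.dataset != target)
        for src, dests in self._outgoing.items():
            if src.dataset == target:
                impact.update((src, dst) for dst in dests if dst.dataset != target)
        return {"target": target, "cause": sorted(cause), "impact": sorted(impact)}


# Illustrative data only.
graph = LineageGraph([
    (Field("personal_info", "name"), Field("employee_comp", "name")),
    (Field("performance", "rating"), Field("employee_comp", "bonus")),
    (Field("employee_comp", "bonus"), Field("payroll_report", "total_pay")),
    (Field("payroll_report", "total_pay"), Field("finance_summary", "payroll_cost")),
])

first = graph.view("employee_comp")      # finance_summary is two levels away, so it is not shown
second = graph.view("payroll_report")    # selecting a non-target dataset re-centers the view
```

Navigation is just a second call to `view` with the selected dataset as the new target, so the amount of data computed and drawn at any one time stays bounded by a single level in each direction.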
Visual representation of the relationships between fields
Field Level Lineage is about the relations between fields and creating a visual representation of the connections between them.
The design had to give a visual representation of the relationships between fields and datasets through a system of lines (aka edges).
Because of the potential number of fields, the number of edges can be significant, and therefore overwhelming to the user. The challenge was to create a design that allowed the user to easily identify the relation between a single field and other datasets’ fields. I wanted the design to let the user interact with the graph and easily make sense of the overall view as well as access details related to the relation between any given fields.
The goal was to create simple, intuitive interactions that guided users through the complexity of the edges and enabled them to organize their view.
The final design solves this problem by leveraging a progressive disclosure pattern. Users can investigate the relations between fields by starting from the overall view, drilling down to a more focused view, and then opening the details of a specific relation.
Large schema management
In the enterprise world, production environments are set up to handle, move and transform data. The complexity of the datasets handled varies depending on the maturity of the environment. In new environments, the visual representation and user interactions can be fairly straightforward, because of the (relatively) smaller number of processes. However, as the environments mature, the complexity of the processes and transformation grows. Datasets become larger, more complex, with larger schemas. In a mature production environment it is not uncommon for datasets to have hundreds (if not thousands) of fields.
The new FLL design had to support real-world scenarios and solve for:
- Target datasets with large schemas.
- Multiple non-target datasets that have large schemas themselves.
Target dataset and large schema
When we started thinking about datasets with large schema, I wanted to give users a view that was organized, digestible and intuitive.
From a user experience perspective, years of usability research and industry best practices have given us a good understanding of the limitations around long lists of text. The TL;DR is that long lists are hard for humans to read, scan and browse. The design had to strike a balance: the list could not be so long that it was hard to digest, nor so short that users needed extra clicks to browse it.
In the final design, the fields are paginated if there are more than 20. The user can browse them by selecting the links at the bottom of the target dataset. The view of non-target datasets updates dynamically as the user goes through the different pages. Non-target datasets are organized so that the user is presented with the most relevant information.
Users may not want to browse through several pages when they have a question about a specific field. In this scenario, users can narrow down the fields displayed through filtering. The design dynamically adjusts to display the non-target fields and datasets that are relevant to the new view.
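The pagination and filtering behavior described above could look roughly like this in Python. The page size of 20 comes from the design; the function names, the substring-matching rule, and the edge format are assumptions made for this sketch.

```python
PAGE_SIZE = 20  # the design paginates target fields when there are more than 20


def page_count(fields, page_size=PAGE_SIZE):
    """Number of pages needed to show all fields (at least one)."""
    return max(1, -(-len(fields) // page_size))  # ceiling division


def paginate(fields, page, page_size=PAGE_SIZE):
    """Return the fields on a 1-based page."""
    start = (page - 1) * page_size
    return fields[start:start + page_size]


def filter_fields(fields, query):
    """Case-insensitive substring match (an assumed matching rule)."""
    q = query.strip().lower()
    return [f for f in fields if q in f.lower()]


def visible_edges(edges, visible_fields):
    """Keep only the edges that touch a target field currently on screen, so
    the non-target side can be redrawn for the current page or filter."""
    visible = set(visible_fields)
    return [(other, target) for other, target in edges if target in visible]


# Illustrative data: a target dataset with 45 fields and two lineage edges.
fields = [f"field_{i:03d}" for i in range(1, 46)]
edges = [("personal_info.name", "field_001"), ("performance.rating", "field_041")]

page_one = paginate(fields, 1)
matches = filter_fields(fields, "field_04")
```

Both pagination and filtering reduce to the same thing: a smaller list of visible target fields, which in turn determines which non-target fields and datasets are worth showing.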
Multiple non-target datasets
A single target dataset can be derived from, and/or have an impact on, multiple datasets.
For example, a dataset containing employee compensation information is aggregated from multiple datasets. The fields related to name, age and address are derived from a dataset that contains personal information, while information related to performance and compensation may be combined from other datasets. The main dataset is then read by two pipelines. Each pipeline processes different fields and writes its output to a different dataset.
Complexity increases when each of these datasets has a large schema itself.
The solution was to create a design pattern that handled the number of non-target datasets as well as the size of their schema.
In the new FLL design, the most relevant information is prioritized.
Non-target datasets are organized according to their relation to the fields displayed in the target dataset. The non-target fields are displayed in subsets to give users an overview of lineage and relations between fields.
The collapsed view consists of a set number of fields and a set of ellipses to indicate that additional fields are currently hidden. Users can view all the fields related to a dataset either by drilling down into the fields in the target dataset (as described before) or by selecting the non-target dataset. Selecting it makes that dataset the new target, displaying all of its fields and a full view of the lineage related to them.
When there are multiple non-target datasets, only the collapsed views of the first few are displayed, while the remaining ones are paginated.
This pattern allows the design to provide an organized, high-level view of the system, while giving users enough flexibility and control to explore and navigate through the lineage.
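As a rough illustration of the collapsed view, the Python sketch below orders non-target datasets by how many of their fields relate to the visible target fields, shows a fixed number of fields per dataset, and leaves the rest for pagination. The two limits and the ordering rule are assumptions chosen for the example; the design only states that the most relevant information is prioritized.

```python
MAX_FIELDS_COLLAPSED = 3  # hypothetical number of fields shown before the ellipsis
MAX_DATASETS_SHOWN = 2    # hypothetical number of datasets shown before pagination


def collapsed_view(datasets, related):
    """datasets: dataset name -> list of field names (schema order)
    related:  dataset name -> set of fields linked to the visible target fields
    Returns (cards to display, dataset names left for later pages)."""
    # Assumed ordering: most related fields first, then alphabetical.
    ordered = sorted(datasets, key=lambda n: (-len(related.get(n, set())), n))
    cards = []
    for name in ordered[:MAX_DATASETS_SHOWN]:
        fields = datasets[name]
        rel = related.get(name, set())
        # Related fields come first so the collapsed card shows what matters.
        prioritized = [f for f in fields if f in rel] + [f for f in fields if f not in rel]
        shown = prioritized[:MAX_FIELDS_COLLAPSED]
        cards.append({"dataset": name, "fields": shown, "hidden": len(fields) - len(shown)})
    return cards, ordered[MAX_DATASETS_SHOWN:]


# Illustrative data only.
datasets = {
    "personal_info": ["id", "name", "age", "address", "phone"],
    "performance": ["id", "rating", "review_date"],
    "benefits": ["id", "plan"],
}
related = {"personal_info": {"name", "age", "address"}, "performance": {"rating"}}

cards, remaining = collapsed_view(datasets, related)
```

A non-zero `hidden` count is what the ellipsis on a card would stand for, and `remaining` holds the datasets that would appear on later pages.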
CDAP Field Level Lineage and use cases
FLL was designed to answer questions and support workflows of different enterprise users. After describing the challenges of managing and organizing Big Data, let’s take a final look at the use cases we set out to support and how the design maps to them.
Data Compliance: Users can view operations related to each field both upstream and downstream. Insight into how data is calculated and handled helps the user determine whether the data complies with company standards and policies.
Regulatory Reporting: Users can use the FLL graph to track the path the data has taken and use that information to generate reports.
Impact and Root Cause Analysis: FLL lets users navigate the data journey of a single field. Users can follow a single field across different transformations, both upstream and downstream. The ability to track single fields provides granular insight into how a data change may affect the overall production environment.
Trustworthiness of the data: Users can explore any given field using FLL to understand how it has been generated. By looking at the data upstream and the operations involved during all stages, the user can determine if a dataset is right for the desired purpose.
Data quality triage: The ability to explore activities upstream and downstream in a production environment lets the user understand whether a detected data quality problem originated in the source data or in the processes that were applied to it.
Last word: design thoughts
When tackling projects that are as complex as field level lineage (and even simpler ones), keep in mind these three tenets:
Understand the use cases and use them as a way to keep your design on track throughout the design process.
Know your users, but if your audience is small and difficult to find or research, leverage best practices. The discipline of UI/UX has been around for decades, and established design patterns have been tested by many, making them quite safe to use. If you are able to test your design, use the results to fine-tune it.
Design first and foremost for the experience of the user. Technological implementation and details should be secondary. Do right by the user by following common mental patterns and behaviors. Make adjustments in the design if there are major tech issues, but never forget which problems you are solving and who you are solving them for.
Field Level Lineage was a challenging and exciting project that required us to dive into the world of enterprise Big Data.
The CDAP team had to overcome several challenges, both technical and practical. By always keeping the needs and goals of the user in mind, we were able to create a design that meets real-world use cases.
The design is an interactive view of the lineage that dynamically changes based on the user actions. It supports the user workflow and activities around lineage by presenting views that are relevant to the stage of the user’s investigation.
This new feature will be released soon as part of CDAP 6.1. Stay tuned for follow-up blogs around the intricacies involved in implementing this design.