Visualizing Field Level Lineage

Published in cdapio, Nov 25, 2019

In previous blog posts on Field Level Lineage (FLL), Tony Hajdari described the current Field Level Lineage feature in CDAP and Lea Cuniberti-Duran walked through the use cases, personas, and design challenges for the updated FLL.

An upcoming release of CDAP will introduce a redesigned Field Level Lineage (FLL). The updated FLL feature will allow users to better visualize the journey of individual fields in datasets and see relations between fields, addressing the use cases defined in the previous blog posts.

While the alpha version of FLL is capable of this analysis, the updated design makes relationships even easier to see by connecting fields to each other in the UI. Datasets are represented as tables, with a row for each field.

Redesigned Field Level Lineage. Users can visualize how fields in a dataset are used to produce fields in a subsequent dataset.

While implementing this UI, drawing the edges between fields posed some interesting and unexpected challenges. In this post, we’ll describe what we learned from this project.

Connecting fields — what a tangled web we weave!

The UI has two main states: a default state and a pinned state. In the default view (shown above), the user can see relationships between all fields in datasets adjacent to the center dataset, which we call the target dataset. We also highlight edges touching the currently selected field in the target dataset. In the pinned state, we focus on a single field in the target dataset: we still show all fields of the target dataset, but in the other datasets we show only the fields adjacent to the pinned target field (see the following figure).

Users can pin a target field to only see the fields adjacent to the selected field.

Let’s consider how the edges between fields might be drawn. The drawing API, which we’ll discuss more below, needs the positions of the fields on the page. One way to locate the fields would be to track their coordinates in a dedicated data structure. However, we’d need to keep this data up to date as the UI changes, which adds extra complexity. A second idea is to give each field element on the page a unique id and find it with a selector on demand, which seemed cleaner. That leaves one question: how do we assign unique ids to the fields?

Try 1: UUIDs

In the first attempt, we used universally unique identifiers (UUIDs) such as 4f4b9117-2935-40cc-be18-571c8762f0cd because they are (nearly) guaranteed to be unique. One thing to keep in mind is that an id must begin with a letter in HTML4, while a UUID can start with a digit. This is easily fixed by prepending a letter to the generated UUID. A bigger downside, however, was that UUIDs are not human-readable or useful for anything other than their uniqueness. This made them less helpful for debugging during development.
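As a minimal sketch of that fix, the helper below puts a letter in front of a generated UUID so the result can serve as an HTML4 id. The function name uuidToElementId and the default prefix "f" are illustrative assumptions, not taken from the CDAP code base.

```typescript
// Hypothetical helper: make a UUID usable as an HTML4 element id by
// putting a letter in front of it. The default prefix "f" is arbitrary.
function uuidToElementId(uuid: string, prefix: string = "f"): string {
  if (!/^[A-Za-z]/.test(prefix)) {
    throw new Error("prefix must start with a letter");
  }
  return `${prefix}${uuid}`;
}

// The UUID below starts with a digit; the resulting id starts with a letter.
const prefixedId = uuidToElementId("4f4b9117-2935-40cc-be18-571c8762f0cd");
```

The id is still opaque, though: nothing in it says which namespace, dataset, or field the element belongs to.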

Try 2: Human-readable ids

We then set out to make unique-but-human-readable ids. A field in CDAP is uniquely identified by a namespace, a dataset, and the field name. Thus, we can combine these to obtain a unique id. For example, for the id field in the Employee_Data dataset in the default namespace, we could simply concatenate the names with an underscore as a delimiter to get default_Employee_Data_id. Unlike UUIDs, these ids carry information that is useful when debugging.

This approach also works, almost! One edge case occurs when a user’s pipeline both reads and writes the same dataset. In that case, we actually want to render the field in multiple positions on the page, and a simple namespace-dataset-field combination id does not uniquely identify the DOM element. To address this, we added information about the type of field (target, cause, or impact) to the id.

When a pipeline reads and writes the same dataset, the same field is rendered in multiple tables.

Finally, choosing appropriate delimiters to combine the field type, namespace, dataset, and field names helps readability. In the end, we use target_ns-default_ds-Employee_Data_fd-id as the id for the id field of the Employee_Data dataset in the default namespace when that dataset is the target dataset. This gives us human-readable ids that also uniquely identify each field element.
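The id format can be sketched as a small helper. The function name fieldElementId and the FieldRole type are assumptions made for illustration; only the resulting id format and the three field types come from the description above.

```typescript
// The three types of field relative to the target dataset.
type FieldRole = "target" | "cause" | "impact";

// Hypothetical helper that follows the id format described in the text:
// <type>_ns-<namespace>_ds-<dataset>_fd-<field>
function fieldElementId(
  role: FieldRole,
  namespace: string,
  dataset: string,
  field: string
): string {
  return `${role}_ns-${namespace}_ds-${dataset}_fd-${field}`;
}

// The same field rendered in two tables gets two distinct ids.
const idAsTarget = fieldElementId("target", "default", "Employee_Data", "id");
const idAsCause = fieldElementId("cause", "default", "Employee_Data", "id");
// idAsTarget is "target_ns-default_ds-Employee_Data_fd-id"
// idAsCause is "cause_ns-default_ds-Employee_Data_fd-id"
```

Because the field type is part of the id, a dataset that a pipeline both reads and writes no longer produces colliding ids when its fields appear in more than one table.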

Drawing curvy edges with d3

Once the field ids are guaranteed to be unique, the field elements can be located using the unique identifier and a selector. Then we draw each edge as an SVG path, a native DOM element. The path element is rendered according to a potentially complicated set of instructions contained in its path data attribute (d). For example, to draw a right triangle, we need to specify a path data string that essentially says, “start at point A, draw a vertical line for x units, then draw a horizontal line for y units, then return to the starting point.” In our UI, the shape and style of the edges are even trickier to describe.
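To make the triangle example concrete, here is a minimal sketch that builds such a path data string by hand. The function name trianglePathData and the sample numbers are made up for illustration; M, v, h, and Z are the standard SVG path commands for move to, relative vertical line, relative horizontal line, and close path.

```typescript
// Build the path data string for a right triangle: move to the start point,
// draw a vertical leg, draw a horizontal leg, then close the path, which
// draws a straight line back to the start point.
function trianglePathData(
  startX: number,
  startY: number,
  verticalLength: number,
  horizontalLength: number
): string {
  return `M${startX},${startY} v${verticalLength} h${horizontalLength} Z`;
}

const trianglePath = trianglePathData(10, 10, 40, 30);
// trianglePath is "M10,10 v40 h30 Z"
```

Setting that string as the d attribute of a path element inside an svg element renders the triangle. Writing such strings by hand gets tedious quickly once curves are involved.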

We used the d3 library to simplify drawing edges. Using d3’s line generator, it’s a breeze to generate a path data string that says, “start at the source field, go a third of the way toward the destination field, draw a line two thirds of the way to the destination field, then draw a straight line to the destination. Oh, and smooth out any pointy corners with nice smooth curves please!” The line generator takes an array of coordinates and style configurations, such as the shape of curve to use to interpolate between the points, and generates the appropriate path data string.
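The array of coordinates handed to the line generator might be computed along these lines. This is a sketch under assumptions: the names FieldBox, EdgePoint, and edgePoints are hypothetical, and so is the choice to start at the middle of the source field's right side and end at the middle of the destination field's left side, with the two inner points kept level with the source and destination respectively.

```typescript
interface EdgePoint {
  x: number;
  y: number;
}

// Minimal bounding box; a browser DOMRect provides these same properties.
interface FieldBox {
  left: number;
  right: number;
  top: number;
  height: number;
}

// Hypothetical helper: the four points of an edge running from the right
// side of the source field to the left side of the destination field.
function edgePoints(source: FieldBox, destination: FieldBox): EdgePoint[] {
  const from: EdgePoint = { x: source.right, y: source.top + source.height / 2 };
  const to: EdgePoint = {
    x: destination.left,
    y: destination.top + destination.height / 2,
  };
  const dx = to.x - from.x;
  return [
    from,
    { x: from.x + dx / 3, y: from.y }, // a third of the way, level with the source
    { x: from.x + (2 * dx) / 3, y: to.y }, // two thirds of the way, level with the destination
    to,
  ];
}

const samplePoints = edgePoints(
  { left: 0, right: 100, top: 0, height: 20 },
  { left: 400, right: 500, top: 100, height: 20 }
);
// samplePoints is [(100, 10), (200, 10), (300, 110), (400, 110)]
```

In the browser, the boxes could come from calling getBoundingClientRect() on the field elements found by id. The points could then be passed to a d3 line generator created with d3.line(), configured with x and y accessors and a curve such as d3.curveBasis, to get the smoothed path data string.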

Conclusion

The data journey of datasets and fields can be complex. The updated CDAP Field Level Lineage enables users to visualize and analyze metadata by interactively showing edges between related fields. We learned that considering all edge cases (pun intended) was necessary for generating unique ids for fields, and that d3 greatly simplified drawing the edges. Please stay tuned for more improvements to the CDAP user experience, and try out CDAP Field Level Lineage!