Imagine you are a business leader ready to start your day, but you wake up to find that your daily business report is empty: the data is late, and now you are flying blind.
Over the last year, multiple teams came together to build SLA Tracker, a visual analytics tool to facilitate a culture of data timeliness at Airbnb. This data product enabled us to address and systematize the following challenges of data timeliness:
- When should a dataset be considered late?
- How frequently are datasets late?
- Why is a dataset late?
This project is a critical part of our efforts to achieve high data quality, and building it required overcoming many technical, product, and organizational challenges. In this article, we focus on the product design: the journey of how we designed and built data visualizations that could make sense of the deeply complex data of data timeliness.
Yes, Data Can Be Late
To avoid blinding the business, it is critical to deliver data in a timely manner. However, this can be difficult to do because the journey from data collection to final data output typically requires many steps. At Airbnb — and anywhere with large-scale data processing pipelines — “raw” datasets are cleaned up, merged, and transformed into structured data. Structured data then powers product features and enables analytics to inform business decisions.
To ensure timeliness of the final output data, we aim to have owners of each intermediate step commit to Service Level Agreements (SLAs) for the availability of their data by a certain time. For example, the dataset owner promises that the “bookings” metric will have the latest data by 5 AM UTC, and if it is not available by this time, it is considered “late.”
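The lateness check described above can be sketched in a few lines. This is an illustrative model only, not SLA Tracker's actual API; the type and function names are assumptions, and times are simplified to fixed-width "HH:MM" UTC strings so they compare lexicographically.

```typescript
// Hypothetical sketch of an SLA lateness check. Names are illustrative.
interface SlaCheck {
  dataset: string;
  slaUtc: string;           // promised availability, e.g. "05:00"
  landedUtc: string | null; // actual landing time, null if not yet landed
}

// A dataset is late if it landed after its SLA, or if it still has not
// landed and the SLA time has already passed.
function isLate(check: SlaCheck, nowUtc: string): boolean {
  if (check.landedUtc === null) return nowUtc > check.slaUtc;
  return check.landedUtc > check.slaUtc;
}
```

For example, a "bookings" table with a 5 AM UTC SLA that lands at 5:12 AM would be flagged as late, while one that has simply not landed yet at 4:30 AM would not.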
How Often Are My Datasets Late?
As a first step, we set out to enable data producers to understand when data is landing and how frequently they meet SLAs in the Report view (Figure 1). In this view, producers can track real-time and historical trends across multiple datasets they own or care about. We also ensured that producers get value even when no formal SLA is set, by surfacing typical landing times. No SLAs were set when we first launched the tool, and SLAs may be unnecessary for datasets that are not widely consumed.
The Report view makes use of traditional lists of data entities, with embedded small visuals that concisely summarize typical and historical landing time data. Data producers can organize their datasets across lists and collaborate on lists with others (e.g., their team).
With this data-rich summary, understanding landing times and SLA performance became as simple as curating a list of datasets.
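One way to surface a "typical" landing time like those shown in the Report view is to take a quantile over historical landings. This is an assumed approach for illustration, not the exact product logic; landing times are simplified to minutes after midnight UTC.

```typescript
// Illustrative sketch: derive a "typical" landing time from history
// when no formal SLA exists. Assumed logic, not SLA Tracker's own.
// Landing times are minutes after midnight UTC.
function typicalLandingMinutes(history: number[], quantile = 0.5): number {
  const sorted = [...history].sort((a, b) => a - b);
  // Clamp the index so extreme quantiles stay within the array.
  const idx = Math.min(sorted.length - 1, Math.floor(quantile * sorted.length));
  return sorted[idx];
}
```

Using a median (or a higher quantile, for a conservative estimate) rather than a mean keeps one unusually slow run from skewing the "typical" time.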
Reporting Is the Tip of the Iceberg
While the Report view dramatically simplified understanding whether a dataset was late, it did not address two major challenges of SLAs:
- What is a reasonable SLA for a dataset?
- When a dataset is late, how do you understand why?
These questions are challenging because datasets are not independent from each other. Instead, datasets are derived stepwise in a specific sequence, where one or more transformations must happen before another (Figure 2).
Thus, the availability of one dataset is intrinsically linked to a complex hierarchical “lineage” of data ancestors. To set a realistic SLA for a dataset, one has to take into account its entire dependency tree — sometimes comprising 100s of entities — and their SLAs.
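A realistic SLA for a dataset is bounded by its full dependency tree: a job cannot start before its last parent lands. A minimal sketch of that bound, under the simplifying assumptions that each dataset has a single typical runtime and the graph fits in memory (all names and shapes here are illustrative):

```typescript
// Hedged sketch: estimate the earliest realistic landing time for a
// dataset by walking its full dependency tree. Illustrative only.
// Times and runtimes are in minutes; landing times are minutes after
// midnight UTC.
interface LineageNode {
  runtimeMinutes: number; // typical job duration
  parents: string[];      // upstream datasets this one waits for
}

function earliestLanding(
  graph: Record<string, LineageNode>,
  dataset: string
): number {
  const node = graph[dataset];
  // A job can start only after every parent has landed.
  const start = Math.max(0, ...node.parents.map((p) => earliestLanding(graph, p)));
  return start + node.runtimeMinutes;
}
```

For a chain like raw (60 min) → cleaned (30 min) → bookings (45 min), the earliest realistic landing for bookings is 135 minutes after midnight, so promising anything earlier would be an SLA the owner cannot keep.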
To add to the complexity, when things go wrong, trying to match up the hierarchical dependencies with the temporal sequence makes SLA misses hard to reason about without visual aid. Existing tooling at Airbnb enabled data engineers to identify problems within their own data pipeline, but it was far more difficult to do this across pipelines, which are often owned by different teams.
Why Is My Dataset Late?
To enable data producers to identify the root cause(s) of an SLA hit or miss across data pipelines, and to set realistic SLAs by taking into account full data lineage, we designed the Lineage view.
Early Design Attempt
To be successful, the Lineage view needed to enable data producers to reason about both dataset dependencies and the timelines of those dependencies. Since data lineages can include tens to hundreds of tables, each with 30 days of historical data, SLAs, and relationships between them all, we needed to concisely represent on the order of thousands to tens of thousands of individual data points.
In our initial explorations, we heavily emphasized lineage over landing time sequence (Figure 3). Although it was easy to understand dependencies for small lineages, it failed to highlight which dependencies caused delays in the overall pipeline on a given run, and it was difficult to understand where time was spent to produce the dataset overall.
Focus On Time With the Timeline View
We then pivoted to emphasizing temporal sequence over lineage. To do this, we designed a dependency-inclusive Gantt chart (Figure 4) with the following features:
- Each row represents a dataset in the lineage, with the “final” dataset of interest at the top.
- Each dataset has a horizontal bar representing the start, duration, and end time for its data processing job on the date or time selected.
- If a dataset has an SLA, it is indicated with a vertical line.
- Distributions of typical start and end times are marked to help data producers evaluate whether the data processing job is ahead of or behind schedule, putting its downstream datasets at risk.
- Arcs are drawn between parent and child datasets so data producers can trace the lineage and see if delays are caused by upstream dependencies.
- Emphasized arcs represent the most important “bottleneck” path (described below).
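The features above suggest a data shape for each chart row. The interface and the "red past SLA, yellow past typical" coloring below are assumptions sketched for illustration, not the production model:

```typescript
// Assumed data shape behind one Gantt-chart row (illustrative only).
// Times are minutes after midnight UTC.
interface TimelineRow {
  dataset: string;
  start: number;       // job start
  end: number;         // job end / landing time
  slaMinutes?: number; // vertical SLA line, if one is set
  typicalEnd: number;  // center of the historical landing distribution
  parents: string[];   // arcs are drawn to these upstream rows
}

// A row status rule like "red past SLA, yellow past typical" might be:
function rowStatus(row: TimelineRow): "late" | "behind" | "on-time" {
  if (row.slaMinutes !== undefined && row.end > row.slaMinutes) return "late";
  if (row.end > row.typicalEnd) return "behind";
  return "on-time";
}
```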
With this design, it became easy to find the problematic step — often the long, red bar — or to identify system-wide delays, where all steps just took longer than usual (lots of yellow bars, each past their typical landing time). This visualization is used by many teams today at Airbnb to debug data delays.
Finding the Needle in the Haystack — “Bottlenecks”
For datasets with very large dependency trees, we found it difficult to find the relevant, slow “bottleneck” steps that delay the entire data pipeline. We were able to drastically reduce noise and highlight these problematic datasets by developing the concept of a “bottleneck” path — the sequence of the latest-landing data ancestors which prevented a child data transformation from starting, thus delaying the entire pipeline (Figure 5).
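The bottleneck-path idea can be sketched as a walk from the final dataset upward, repeatedly following whichever parent landed last, since that parent is what gated the job's start. This is a simplified sketch under assumed data shapes; the production logic is surely more involved:

```typescript
// Sketch of the "bottleneck" path: from the final dataset, repeatedly
// follow the latest-landing parent. Hypothetical shapes, illustrative only.
interface Run {
  end: number;      // landing time, minutes after midnight UTC
  parents: string[];
}

function bottleneckPath(runs: Record<string, Run>, dataset: string): string[] {
  const path = [dataset];
  let current = runs[dataset];
  while (current.parents.length > 0) {
    // The latest-landing parent prevented this job from starting.
    const gating = current.parents.reduce((a, b) =>
      runs[a].end >= runs[b].end ? a : b
    );
    path.push(gating);
    current = runs[gating];
  }
  return path.reverse(); // root-most gating ancestor first
}
```

Everything off this path landed early enough not to matter for the final SLA, which is what lets the view suppress so much noise in large lineages.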
Is It Me or You? Diving into the Historical View
Once the bottleneck step was identified, the next important question became whether the delay in that step was due to long runtimes or delays in upstream dependencies. This helps data producers understand whether they need to optimize their own pipeline or instead negotiate with owners of upstream datasets for earlier SLAs. To enable this, we built a view of the detailed historical runtimes of a single dataset, showing both when each run happened and how long it took (Figure 6).
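The "is it me or you?" split amounts to separating the time a step spent blocked on upstream landings from its own runtime. A minimal sketch, with field names assumed for illustration:

```typescript
// Illustrative split of a step's delay into "blocked on upstream" vs
// "own runtime". Field names are assumptions for this sketch.
// Times are minutes after midnight UTC.
interface StepRun {
  start: number;
  end: number;
  parentEnds: number[]; // landing times of upstream dependencies
}

function attributeDelay(run: StepRun): { blockedUntil: number; ranFor: number } {
  // The job could not have started before its last parent landed.
  const blockedUntil = Math.max(0, ...run.parentEnds);
  return { blockedUntil, ranFor: run.end - run.start };
}
```

A step that is blocked until late but runs quickly points toward negotiating earlier upstream SLAs; a step that starts promptly but runs long points toward optimizing the job itself.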
By combining these complementary views in SLA Tracker, we were able to provide a full perspective of data timeliness (Figure 7).
Process and Tooling
We spent roughly 12 months conceptualizing, designing, prototyping, and productionizing SLA Tracker. Much of this time was spent developing the data APIs that power the UI, and iterating on the Lineage view.
For the simpler Report view, we leveraged static designs and click-through prototypes with generic mock data. Throughout alpha and beta product releases, we iterated on visual language and made data more visual to improve comprehension (Figure 8).
We used an entirely different approach in designing the Lineage view. Its information hierarchy is dictated by the shape of the data, which makes prototyping with real data samples critical. We built these prototypes in TypeScript using our low-level visx visualization component suite for React, which allows for partial code reuse during productionization (Figure 9).
After we were confident in our visualization, we refined the visual elements in static mocks in Figma before productionizing (Figure 10).
In this project, we have applied data visualization and UI/UX design — the interdisciplinary craft we refer to as “Data Experience” — to important data timeliness problems that require deep understanding of complex temporal and hierarchical information. This has enabled us to make data timeliness insights accessible, even in the complex data ecosystem of a large-scale company. It takes time and iteration to develop sophisticated visual analytics tools, but the resulting product can provide great value to your organization.
SLA Tracker was the culmination of the efforts of many people and teams. While we focus on the data visualization aspect in this article, there were other important challenges we had to overcome in order to make the analytical tool possible. Thanks to the entire team who worked on the frontend, backend, and data engineering to make this product possible: Conglei Shi, Erik Ritter, Jiaxin Ye, John Bodley, Michelle Thomas, Serena Jiang, Shao Xie, Xiaobin Zheng, and Zuzana Vejrazkova.
All trademarks are the property of their respective owners. Any use of them is for identification purposes only and does not imply sponsorship or endorsement.