Data Quality at Airbnb

Vaughn Quoss
Nov 24, 2020 · 11 min read

Part 2 — A New Gold Standard

Authors: Vaughn Quoss, Jonathan Parks, Paul Ellwood

Introduction

In the first post of this series, we shared an overview of how we evolved our organization and technology standards to address the data quality challenges faced during hyper growth. In this post we’ll focus on Midas, the initiative we developed as a mechanism to unite the company behind a shared “gold standard” that serves as a guarantee of data quality at Airbnb.

Defining the Gold Standard

We recognized the data quality challenges of hyper growth as a key opportunity to define a consistent “gold standard” for data quality at Airbnb.

A Multi-dimensional Challenge

  • Accuracy: Is the data correct?
  • Consistency: Is everybody looking at the same data?
  • Usability: Is data easy to access?
  • Timeliness: Is data refreshed on time, and on the right cadence?
  • Cost Efficiency: Are we spending on data efficiently?
  • Availability: Do we have all the data we need?

The scope of the problem meant that standards focused on individual data quality components would have limited impact. To make real headway, we needed an ambitious, comprehensive plan to standardize data quality expectations across multiple dimensions. As work began, we named our initiative Midas, in recognition of the golden touch we hoped to apply to Airbnb’s data.

End-to-end Data Quality

Many employees at Airbnb will never directly query a data warehouse table, yet use data on a daily basis. Regardless of function or expertise, data users of all types are accustomed to viewing data through the lens of metrics, an abstraction which does not require familiarity with the underlying data sources. For a data quality guarantee to be relevant for many of the most important data use cases, we needed to guarantee quality for both data tables and the individual metrics derived from them.

In Airbnb’s data architecture, metrics are defined in Minerva — a service that enables each metric to be uniquely defined in a single place — and broadly accessed across company data tools. A metric defined in Minerva can be directly accessed in company dashboarding tools, our experimentation and A/B testing framework, anomaly detection and lineage tools, our ML training feature repository, and for ad-hoc analysis using internal R and Python libraries.

For example, take Active Listings, a top-line metric used to measure Airbnb’s listing supply. An executive looking up the number of Active Listings in an Apache Superset dashboard, a data scientist analyzing the Active Listings conversion funnel in R, and an engineer reviewing how an experiment affected Active Listings in our internal experiment framework will all be relying on identical units for their analysis. When you analyze a metric across any of Airbnb’s suite of data tools, you can be sure you are looking at the same numbers as everybody else.

In Airbnb’s offline data architecture, there is a single source of truth for each metric definition shared across the company. This key architectural feature made it possible for Midas to guarantee end-to-end data quality, covering both data warehouse tables and the metric definitions derived from them.
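
To make the single-source-of-truth idea concrete, here is a minimal sketch in Python. It is not Minerva's API, which this post does not describe. The class names, the table name, and the aggregation expression are all hypothetical; the only point is that a metric name can be registered once, and every consumer resolves the same definition.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MetricDefinition:
    name: str          # e.g. "active_listings"
    source_table: str  # hypothetical warehouse table name
    expression: str    # hypothetical aggregation logic


class MetricRegistry:
    """Holds exactly one definition per metric name."""

    def __init__(self) -> None:
        self._definitions: Dict[str, MetricDefinition] = {}

    def register(self, definition: MetricDefinition) -> None:
        # Refuse a second definition so downstream tools cannot drift apart.
        if definition.name in self._definitions:
            raise ValueError(f"metric {definition.name!r} is already defined")
        self._definitions[definition.name] = definition

    def get(self, name: str) -> MetricDefinition:
        return self._definitions[name]


registry = MetricRegistry()
registry.register(
    MetricDefinition(
        name="active_listings",
        source_table="core.listings_daily",       # assumed name
        expression="COUNT(DISTINCT listing_id)",  # assumed logic
    )
)

# A dashboard and an experiment report both resolve the same definition.
dashboard_metric = registry.get("active_listings")
experiment_metric = registry.get("active_listings")
```

Rejecting duplicate registrations is the design choice that matters here: consistency is enforced when a metric is defined, rather than reconciled later across tools.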

The Midas Promise

In order to make this claim, the certification process needed to collectively address the multiple dimensions of data quality, guaranteeing each of the following:

  • Accuracy: certified data is fully validated for accuracy, with exhaustive one-off checks of all historical data, and ongoing automated checks built into the production pipelines.
  • Consistency: certified data and metrics represent the single source of truth for key business concepts across all teams and stakeholders at the company.
  • Timeliness: certified data has landing time SLAs, backed by a central incident management process.
  • Cost Efficiency: certified data pipelines follow data engineering best practices that optimize storage and compute costs.
  • Usability: certified data is clearly labeled in internal tools, and supported by extensive documentation of definitions and computation logic.
  • Availability: certification is mandatory for important company data.
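
As an illustration of the timeliness guarantee, a landing time SLA can be expressed as a simple deadline check. The sketch below assumes a daily partition that must land by a fixed time on the following day, with all timestamps in the same timezone. The function names and the 09:00 deadline are hypothetical; the post does not specify how SLAs are defined.

```python
from datetime import date, datetime, time, timedelta


def landing_deadline(partition_date: date, sla_time: time) -> datetime:
    """Deadline for a daily partition: the SLA time on the following day."""
    return datetime.combine(partition_date + timedelta(days=1), sla_time)


def missed_sla(partition_date: date, landed_at: datetime, sla_time: time) -> bool:
    """True if the partition landed after its deadline."""
    return landed_at > landing_deadline(partition_date, sla_time)


# Hypothetical SLA: each day's data lands by 09:00 the next morning.
sla = time(9, 0)
on_time = missed_sla(date(2020, 11, 23), datetime(2020, 11, 24, 8, 30), sla)
late = missed_sla(date(2020, 11, 23), datetime(2020, 11, 24, 9, 15), sla)
```

In this sketch, a missed deadline is the kind of event that would open an incident in a central incident management process.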

As a last step, once data was certified, that status needed to be clearly communicated to internal end users. Partnering with our analytics tools team, we ensured data that was “Midas Certified” would be clearly identified through badging and context within our internal data tools.

Fig 1: Midas badging next to metric and table names in Dataportal, Airbnb’s data discovery tool.
Fig 2: Midas badging for metrics in Airbnb’s Experimentation Reporting Framework (ERF).
Fig 3: Pop-up with Midas context in Airbnb’s internal data tools.

The comprehensive Midas quality guarantee, coupled with clear identification of certified data across Airbnb’s internal tools, became our big bet to guarantee access to high quality data across the company.

The Midas Certification Process

Fig 4: An overview of the nine steps in the Midas Certification process.

This certification process is followed on a project-by-project basis for individual data models, which comprise a set of data tables and metrics that correspond to a specific business concept or project feature. Example data models at Airbnb cover subjects such as Active Listings, Customer Service Tickets, and Guest Growth Accounting. While there is no perfect set of criteria to define the boundaries of a given data model, aggregating our data tables, pipelines, and metrics at this level of abstraction allows us to more effectively organize, architect, and maintain our offline data warehouse.

While this post won’t describe each step of the certification process in detail, the following sections provide an overview of the most important components of the process.

Broad Stakeholder Input

The certification process is set up to encourage participation from stakeholders across all teams that consume Midas models. A major goal of certification is ensuring the data models we build meet the data needs of users across the company, rather than just the needs of the team building the model. The process gives data consumers from every team the option to sign on as reviewers of new data model designs, and we have found that small requests or feedback early in the design process save substantial time by reducing the need for future revisions.

Prior to Midas, these cross-functional, cross-team partnerships were often difficult to form organically. The formal structure provided by a certification process helps streamline collaboration on data design across the company.

Design Specs

The contents of a design spec are best illustrated with examples. The following figures depict condensed and simplified examples from the design spec for Airbnb’s Active Listings data model.

The spec opens with a description of individual and team data model owners, as well as the relevant design reviewers.

Fig 5: Owners and reviewers are formalized in the heading for each Midas design spec.

The first section of the spec describes the headline metrics included in the data model, along with plain-text business definitions and specific details relevant to interpreting the metrics.

Fig 6: An example Metric Definitions section from a Midas design spec.

The following section provides a summary of the pipeline used to build the data tables included in the model. This summary includes a simple diagram of input and output tables, an overview of pipeline SLA criteria, context on how to backfill historical data, and a short disaster recovery playbook.

Fig 7: Example pipeline overview section from a Midas design spec.

The overview of the data pipeline is followed by documentation for the table schemas that will be built.

Fig 8: Example table schema details from a Midas design spec.

Finally, the spec provides an overview of the data quality checks that will be built into the data model’s pipeline for validation (as discussed further below).

Fig 9: Example data quality checks section from a Midas design spec.

The examples above cover the main design spec sections, but are shown in substantially condensed and simplified form. In reality, descriptions of metric and pipeline details are much longer, and some of the more complex design specs exceed 20 pages in length. While this level of documentation requires a large upfront time investment, it ensures data is architected correctly, provides a vehicle for design input from multiple stakeholders, and reduces dependency on the specialized knowledge of a handful of data experts.
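
The sections walked through above can also be pictured as a structured record. The following is a minimal sketch rather than the actual Midas template: the field names mirror the sections described in this post, while the class, the `missing_sections` helper, and all sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DesignSpec:
    data_model: str
    owners: List[str] = field(default_factory=list)
    reviewers: List[str] = field(default_factory=list)
    metric_definitions: List[str] = field(default_factory=list)
    pipeline_overview: str = ""
    table_schemas: List[str] = field(default_factory=list)
    data_quality_checks: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Names of the sections that are still empty, in spec order."""
        sections = {
            "owners": self.owners,
            "reviewers": self.reviewers,
            "metric_definitions": self.metric_definitions,
            "pipeline_overview": self.pipeline_overview,
            "table_schemas": self.table_schemas,
            "data_quality_checks": self.data_quality_checks,
        }
        return [name for name, value in sections.items() if not value]


# Hypothetical draft spec with two sections not yet written.
draft = DesignSpec(
    data_model="Active Listings",
    owners=["data-engineer@example.com"],
    reviewers=["data-scientist@example.com"],
    metric_definitions=["Active Listings: plain-text business definition"],
    pipeline_overview="Inputs, outputs, SLA, backfill, disaster recovery",
)
remaining = draft.missing_sections()
```

A completeness check like this only confirms that each section exists; judging whether the content is correct remains the job of the human reviewers.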

Data Validation

  1. Automated checks are built into the data pipeline by a Data Engineer, and described in the design spec. These checks are required for certified data, and cover basic sanity checks, definitional testing, and anomaly detection on new data generated by the pipeline.
  2. One-off validation checks against historical data are run by a Data Scientist and documented in a separate validation report. That report summarizes the checks performed, and links to shared data workbooks (e.g. Jupyter Notebooks) with code and queries that can be used to re-run the validation whenever a data model is updated. This work covers checks that cannot be easily automated in the data pipeline, including more detailed anomaly detection on historical time series, and comparisons against existing data sources or metrics expected to be consistent with the new data model.

As with the design specs, this level of validation and documentation requires a larger upfront investment, but substantially reduces data inaccuracies and future bug reports, and makes refreshing the validation easy when the data model evolves in the future.
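
For readers unfamiliar with pipeline checks, the sketch below shows what a basic sanity check and a simple anomaly check might look like. These are generic techniques, not Airbnb's actual checks: the function names, the z-score approach, and the default threshold of 3 standard deviations are assumptions chosen for illustration.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def has_unique_non_null_keys(keys: Sequence[Optional[int]]) -> bool:
    """Sanity check: every row has a key, and no key repeats."""
    return None not in keys and len(set(keys)) == len(keys)


def is_anomalous(history: Sequence[float], latest: float,
                 threshold: float = 3.0) -> bool:
    """Flag `latest` if it sits more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # too little history to estimate spread
    spread = stdev(history)
    if spread == 0:
        # Flat history: any change at all counts as an anomaly.
        return latest != history[0]
    return abs(latest - mean(history)) / spread > threshold


# Hypothetical daily row counts for a table produced by the pipeline.
recent_row_counts = [100, 102, 98, 101, 99]
normal_day = is_anomalous(recent_row_counts, 101)
suspicious_day = is_anomalous(recent_row_counts, 150)
```

In a real pipeline a failing check would typically block the new partition from being published; how Midas handles failures is not covered in this post.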

Certification Reviews

There are four distinct reviews in the Midas process:

  1. Spec Review: Review the proposed design spec for the data model, before implementation begins.
  2. Data Review: Review the pipeline’s data quality checks and validation report.
  3. Code Review: Review the code used to generate the data pipeline.
  4. Minerva Review: Review the source of truth metric definitions implemented in Minerva, Airbnb’s metrics service.

Collectively, these reviews cover engineering practices and data accuracy across all data assets, and ensure certified data models meet the Midas promise: a gold standard for end-to-end data quality.
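
One way to picture the four reviews is as a gate that a data model must clear before it is certified. The sketch below assumes all four approvals are required, which is an interpretation of the process rather than something stated explicitly here. The enum and helper names are hypothetical.

```python
from enum import Enum
from typing import Dict, List


class Review(Enum):
    SPEC = "Spec Review"
    DATA = "Data Review"
    CODE = "Code Review"
    MINERVA = "Minerva Review"


def outstanding_reviews(approvals: Dict[Review, bool]) -> List[str]:
    """Names of reviews not yet approved, in process order."""
    return [r.value for r in Review if not approvals.get(r, False)]


def is_certifiable(approvals: Dict[Review, bool]) -> bool:
    """Assumed rule: certification requires every review to be approved."""
    return not outstanding_reviews(approvals)


# Hypothetical status for a data model partway through the process.
status = {Review.SPEC: True, Review.DATA: True, Review.CODE: False}
pending = outstanding_reviews(status)
```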

Bugs and Change Requests

Conclusion

Midas certification does not come without challenges. In particular, quality takes time. Requirements for documentation, reviews, and input from a broad set of stakeholders mean building a data model to Midas standards is much slower than building uncertified data. Re-architecting data models at scale also requires substantial staffing from data and analytics engineering experts (we’re hiring!), and entails costs for teams to migrate to the new data sources.

Offline data is a key technology asset for Airbnb, and this investment is warranted. Certified data models serve as the shared foundation for all data applications, spanning business reporting, product analytics, experimentation, and machine learning and AI. Investing in data quality improves the value of each of these applications, and will improve data-informed decisions at Airbnb for years to come.

With special thanks to Aaron Keys, a key partner on the early design and vision of the Midas initiative.

Airbnb Engineering & Data Science
