Data Quality at Airbnb
Part 2 — A New Gold Standard
Authors: Vaughn Quoss, Jonathan Parks, Paul Ellwood
At Airbnb, we’ve always had a data-driven culture. We’ve assembled top-notch data science and engineering teams, built industry-leading data infrastructure, and launched numerous successful open source projects, including Apache Airflow and Apache Superset. Meanwhile, Airbnb has transitioned from a startup moving at light speed to a mature organization with thousands of employees. During this transformation, Airbnb experienced the typical growth challenges that most companies face, including those that affect the data warehouse.
In the first post of this series, we shared an overview of how we evolved our organization and technology standards to address the data quality challenges faced during hyper growth. In this post we’ll focus on Midas, the initiative we developed as a mechanism to unite the company behind a shared “gold standard” that serves as a guarantee of data quality at Airbnb.
Defining the Gold Standard
As Airbnb’s business grew over the years, the company’s data warehouse expanded significantly. As the scale of our data assets and the size of the teams developing and maintaining them grew, it became a challenge to enforce a consistent set of standards for data quality and reliability across the company. In 2019, an internal customer survey revealed that Airbnb’s data scientists were finding it increasingly difficult to navigate the growing warehouse and had trouble identifying which data sources met the high quality bar required for their work.
This was recognized as a key opportunity to define a consistent “gold standard” for data quality at Airbnb.
A Multi-dimensional Challenge
While all stakeholders agreed that data quality was important, employee definitions of “data quality” encompassed a constellation of different issues. These included:
- Accuracy: Is the data correct?
- Consistency: Is everybody looking at the same data?
- Usability: Is data easy to access?
- Timeliness: Is data refreshed on time, and on the right cadence?
- Cost Efficiency: Are we spending on data efficiently?
- Availability: Do we have all the data we need?
The scope of the problem meant that standards focused on individual data quality components would have limited impact. To make real headway, we needed an ambitious, comprehensive plan to standardize data quality expectations across multiple dimensions. As work began, we named our initiative Midas, in recognition of the golden touch we hoped to apply to Airbnb’s data.
End-to-end Data Quality
In addition to addressing multiple dimensions of data quality, we recognized that the standard needed to be applicable to all commonly consumed data assets, with end-to-end coverage of all data inputs and outputs. In particular, improving the quality of data warehouse tables was not sufficient, since that covered only a subset of data assets and workflows.
Many employees at Airbnb will never directly query a data warehouse table, yet use data on a daily basis. Regardless of function or expertise, data users of all types are accustomed to viewing data through the lens of metrics, an abstraction which does not require familiarity with the underlying data sources. For a data quality guarantee to be relevant for many of the most important data use cases, we needed to guarantee quality for both data tables and the individual metrics derived from them.
In Airbnb’s data architecture, metrics are defined in Minerva — a service that enables each metric to be uniquely defined in a single place — and broadly accessed across company data tools. A metric defined in Minerva can be directly accessed in company dashboarding tools, our experimentation and A/B testing framework, anomaly detection and lineage tools, our ML training feature repository, and for ad-hoc analysis using internal R and Python libraries.
For example, take Active Listings, a top-line metric used to measure Airbnb’s listing supply. An executive looking up the number of Active Listings in an Apache Superset dashboard, a data scientist analyzing the Active Listings conversion funnel in R, and an engineer reviewing how an experiment affected Active Listings in our internal experiment framework will all be relying on identical units for their analysis. When you analyze a metric across any of Airbnb’s suite of data tools, you can be sure you are looking at the same numbers as everybody else.
In Airbnb’s offline data architecture, there is a single source of truth for each metric definition shared across the company. This key architectural feature made it possible for Midas to guarantee end-to-end data quality, covering both data warehouse tables and the metric definitions derived from them.
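To make the "single source of truth" idea concrete, here is a minimal Python sketch of a registry that allows exactly one definition per metric name, so every consumer resolves the same logic. This is an illustration only and not Minerva's API: `MetricDefinition`, `MetricRegistry`, the `is_active` column, and the simplified Active Listings logic are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical illustration only: none of these names come from Minerva.
@dataclass(frozen=True)
class MetricDefinition:
    name: str
    description: str
    compute: Callable[[List[dict]], float]


class MetricRegistry:
    """Holds exactly one definition per metric name."""

    def __init__(self) -> None:
        self._definitions: Dict[str, MetricDefinition] = {}

    def register(self, definition: MetricDefinition) -> None:
        # Refuse a second definition so the metric cannot drift between teams.
        if definition.name in self._definitions:
            raise ValueError(f"metric {definition.name!r} is already defined")
        self._definitions[definition.name] = definition

    def evaluate(self, name: str, rows: List[dict]) -> float:
        return self._definitions[name].compute(rows)


registry = MetricRegistry()
registry.register(
    MetricDefinition(
        name="active_listings",
        description="Count of listings flagged active (simplified, assumed logic).",
        compute=lambda rows: float(sum(1 for r in rows if r["is_active"])),
    )
)

# Toy rows standing in for a certified warehouse table.
listings = [
    {"listing_id": 1, "is_active": True},
    {"listing_id": 2, "is_active": False},
    {"listing_id": 3, "is_active": True},
]

# Two different "tools" resolve the metric through the same registry,
# so they cannot disagree about what the metric means.
dashboard_value = registry.evaluate("active_listings", listings)
notebook_value = registry.evaluate("active_listings", listings)
```

The design choice worth noting is the rejected duplicate registration: consistency comes from making a second, competing definition impossible rather than from asking teams to coordinate.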
The Midas Promise
To build data to meet consistent quality standards, we created a certification process. The goal of certification was to make a single, straightforward promise to end users: “Midas Certified” data represents the gold standard for data quality.
In order to make this claim, the certification process needed to collectively address the multiple dimensions of data quality, guaranteeing each of the following:
- Accuracy: certified data is fully validated for accuracy, with exhaustive one-off checks of all historical data, and ongoing automated checks built into the production pipelines.
- Consistency: certified data and metrics represent the single source of truth for key business concepts across all teams and stakeholders at the company.
- Timeliness: certified data has landing time SLAs, backed by a central incident management process.
- Cost Efficiency: certified data pipelines follow data engineering best practices that optimize storage and compute costs.
- Usability: certified data is clearly labeled in internal tools, and supported by extensive documentation of definitions and computation logic.
- Availability: certification is mandatory for important company data.
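As one illustration of the timeliness guarantee, a landing time SLA can be expressed as a deadline per daily partition. The Python sketch below assumes a convention the post does not specify: a partition for a given date must land by a fixed time on the following day, with all timestamps in a single timezone. The function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta
from typing import Optional


def landing_deadline(partition_date: date, sla_time: time) -> datetime:
    """Deadline for a daily partition: the SLA time on the following day (assumed convention)."""
    return datetime.combine(partition_date + timedelta(days=1), sla_time)


def missed_sla(
    partition_date: date,
    sla_time: time,
    landed_at: Optional[datetime],
    now: datetime,
) -> bool:
    """Return True if the partition landed late, or has not landed and the deadline has passed."""
    deadline = landing_deadline(partition_date, sla_time)
    if landed_at is None:
        return now > deadline
    return landed_at > deadline
```

A check like this would typically feed an alerting or incident process; how Airbnb wires SLAs into incident management is not described in this post.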
As a last step, once data was certified, that status needed to be clearly communicated to internal end users. Partnering with our analytics tools team, we ensured data that was “Midas Certified” would be clearly identified through badging and context within our internal data tools.
The comprehensive Midas quality guarantee, coupled with clear identification of certified data across Airbnb’s internal tools, became our big bet to guarantee access to high quality data across the company.
The Midas Certification Process
The certification process we developed consists of nine steps, summarized in the figure below.
This certification process is followed on a project-by-project basis for individual data models, which comprise a set of data tables and metrics that correspond to a specific business concept or project feature. Example data models at Airbnb cover subjects such as Active Listings, Customer Service Tickets, and Guest Growth Accounting. While there is no perfect set of criteria to define the boundaries of a given data model, aggregating our data tables, pipelines, and metrics at this level of abstraction allows us to more effectively organize, architect, and maintain our offline data warehouse.
While this post won’t describe each step of the certification process in detail, the following sections provide an overview of the most important components of the process.
Broad Stakeholder Input
An important feature of the process is the cross-functional partnerships it formalizes. Every Midas model requires a Data Engineering and Data Science owner who share ownership of the data model design and provide expert input from their respective functions. Cross-functional input is pivotal to ensuring certification can address the full scope of data quality dimensions, which span technical implementation concerns as well as requirements for effective business usage and downstream data applications.
Furthermore, the process is set up to encourage participation from stakeholders across all teams that consume Midas models. A major goal of certification is ensuring the data models we build meet the data needs of users across the company, rather than just the needs of the team building the model. The certification process gives data consumers from every team the option to sign on as reviewers of new data model designs, and we have found that small requests or feedback early in the design process save substantial time by reducing the need for future revisions.
Prior to Midas, these cross-functional, cross-team partnerships were often difficult to form organically. The formal structure provided by a certification process helps streamline collaboration on data design across the company.
The first step in the Midas process is writing a design spec, which serves both as a technical contract describing the pipeline, tables, and metrics that will be built, and as the primary ongoing documentation for the data model. Design specs follow a shared template with standardized sub-sections. Collectively, these specs form a library of documentation for Airbnb’s offline data assets. This documentation represents a high-value deliverable, as it reduces dependency on data producers’ specialized knowledge, eases future iteration on existing data models, and simplifies transition of data assets between owners.
The contents of a design spec are best illustrated with examples. The following figures depict condensed and simplified examples from the design spec for Airbnb’s Active Listings data model.
The spec opens with a description of individual and team data model owners, as well as the relevant design reviewers.
The first section of the spec describes the headline metrics included in the data model, along with plain-text business definitions and specific details relevant to interpreting the metrics.
The following section provides a summary of the pipeline used to build the data tables included in the model. This summary includes a simple diagram of input and output tables, an overview of pipeline SLA criteria, context on how to backfill historical data, and a short disaster recovery playbook.
The overview of the data pipeline is followed by documentation for the table schemas that will be built.
Finally, the spec provides an overview of the data quality checks that will be built into the data model’s pipeline for validation (as discussed further below).
The examples above cover the main design spec sections, but are shown in substantially condensed and simplified form. In reality, descriptions of metric and pipeline details are much longer, and some of the more complex design specs exceed 20 pages in length. While this level of documentation requires a large upfront time investment, it ensures data is architected correctly, provides a vehicle for design input from multiple stakeholders, and reduces dependency on the specialized knowledge of a handful of data experts.
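The spec sections described above (owners and reviewers, metric definitions, pipeline summary, table schemas, and data quality checks) could be modeled as a structured template. The following Python sketch is one hypothetical way to do so: the `DesignSpec` class and its field names are assumptions for illustration, not the actual Midas template.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


# Hypothetical template: field names mirror the sections described in the text,
# but the real Midas spec format is not shown in this post.
@dataclass
class DesignSpec:
    model_name: str
    owners: List[str] = field(default_factory=list)
    reviewers: List[str] = field(default_factory=list)
    metric_definitions: Dict[str, str] = field(default_factory=dict)
    pipeline_summary: str = ""
    table_schemas: Dict[str, Dict[str, str]] = field(default_factory=dict)
    data_quality_checks: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """List the sections that are still empty, in template order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


spec = DesignSpec(
    model_name="active_listings",
    owners=["data-engineering-owner", "data-science-owner"],
    metric_definitions={"active_listings": "Plain-text business definition goes here."},
)
```

Calling `spec.missing_sections()` on this partially filled spec reports the sections a reviewer would still expect to see before a spec review begins.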
After a design spec has been written and the data pipeline built, the resulting data needs to be validated. There are two groups of data quality checks relied on for validation:
- Automated checks are built into the data pipeline by a Data Engineer, and described in the design spec. These checks are required for certified data, and cover basic sanity checks, definitional testing, and anomaly detection on new data generated by the pipeline.
- One-off validation checks against historical data are run by a Data Scientist and documented in a separate validation report. That report summarizes the checks performed, and links to shared data workbooks (e.g., Jupyter Notebook) with code and queries that can be used to re-run the validation whenever a data model is updated. This work covers checks that cannot be easily automated in the data pipeline, including more detailed anomaly detection on historical time series, and comparisons against existing data sources or metrics expected to be consistent with the new data model.
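The three kinds of automated checks named above (basic sanity checks, definitional testing, and anomaly detection) might look roughly like the following Python sketch. The specific rules are assumptions chosen for illustration: a uniqueness check on a key column, a subset-cannot-exceed-total rule, and a simple standard-deviation threshold. The post does not describe the actual checks Airbnb runs.

```python
from statistics import mean, stdev
from typing import List, Sequence


def sanity_check(rows: List[dict], key: str) -> List[str]:
    """Basic sanity checks: table is non-empty, key column is present and unique."""
    problems = []
    if not rows:
        problems.append("table is empty")
    keys = [r.get(key) for r in rows]
    if any(k is None for k in keys):
        problems.append(f"null {key}")
    if len(set(keys)) != len(keys):
        problems.append(f"duplicate {key}")
    return problems


def definitional_check(active: int, total: int) -> bool:
    """A subset metric can never exceed the population it is drawn from."""
    return 0 <= active <= total


def is_anomalous(history: Sequence[float], new_value: float, threshold: float = 3.0) -> bool:
    """Flag new_value if it lies more than `threshold` standard deviations from the trailing mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold
```

In a production pipeline, a failed check would usually block the new partition from being published; that wiring depends on the orchestration tool and is left out here.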
As with the design specs, this level of validation and documentation requires a larger upfront investment, but substantially reduces data inaccuracies and future bug reports, and makes refreshing the validation easy when the data model evolves in the future.
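A re-runnable historical comparison of the kind a validation notebook might contain can be sketched in a few lines of Python. This example assumes daily metric values keyed by date string and a 1% relative tolerance. Both the data layout and the tolerance are hypothetical choices, not Airbnb's.

```python
from typing import Dict, List, Tuple


def compare_series(
    new: Dict[str, float],
    legacy: Dict[str, float],
    rel_tol: float = 0.01,
) -> List[Tuple[str, float, float]]:
    """Return (date, new_value, legacy_value) for shared dates that differ by more than rel_tol."""
    mismatches = []
    for ds in sorted(set(new) & set(legacy)):
        a, b = new[ds], legacy[ds]
        denom = max(abs(b), 1e-12)  # guard against division by zero
        if abs(a - b) / denom > rel_tol:
            mismatches.append((ds, a, b))
    return mismatches


def missing_dates(new: Dict[str, float], legacy: Dict[str, float]) -> List[str]:
    """Dates present in the legacy source but absent from the new data model."""
    return sorted(set(legacy) - set(new))
```

Keeping the comparison as plain functions makes it straightforward to re-run against fresh data each time the data model changes.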
Certification reviews are a major component of the Midas process. These third-party reviews are performed by recognized data experts at the company, who are designated as either Data Architects or Metrics Architects. By performing Midas reviews, architects serve as gatekeepers of the company’s data quality.
There are four distinct reviews in the Midas process:
- Spec Review: Review the proposed design spec for the data model, before implementation begins.
- Data Review: Review the pipeline’s data quality checks and validation report.
- Code Review: Review the code used to generate the data pipeline.
- Minerva Review: Review the source of truth metric definitions implemented in Minerva, Airbnb’s metrics service.
Collectively, these reviews cover engineering practices and data accuracy across all data assets, and ensure certified data models meet the Midas promise: a gold standard for end-to-end data quality.
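The gating role of the four reviews can be illustrated with a small Python sketch in which a data model is certifiable only once every review is approved. The `Review` enum and helper functions are hypothetical and do not reflect Airbnb's internal tooling.

```python
from enum import Enum
from typing import Dict, List


class Review(Enum):
    SPEC = "Spec Review"
    DATA = "Data Review"
    CODE = "Code Review"
    MINERVA = "Minerva Review"


def outstanding_reviews(approvals: Dict[Review, bool]) -> List[str]:
    """Names of reviews that are missing or not yet approved."""
    return [r.value for r in Review if not approvals.get(r, False)]


def is_certifiable(approvals: Dict[Review, bool]) -> bool:
    """A data model can be certified only when all four reviews are approved."""
    return not outstanding_reviews(approvals)
```

Treating a missing review the same as a rejected one keeps the default conservative: nothing is certified by omission.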
Bugs and Change Requests
Lastly, though not part of the initial pipeline development process, the Midas initiative improved our ability to manage offline data bugs and change requests. Organizing offline data into discrete data models and clarifying ownership allowed us to formalize company-wide processes to address requests from data consumers. Employees can now use a simple form to file tickets for bugs and change requests, a system that was not previously feasible.
The Midas initiative has allowed us to define a comprehensive standard for data quality shared across the company. Midas-certified data assets are guaranteed to be accurate, reliable, and cost-efficient, with consistent operational support, and backed by detailed user documentation. As the size of the company and our data warehouse continue to grow at a rapid pace, the certification process ensures we are able to provide data consumers with a consistent guarantee for data quality at scale.
Midas certification does not come without challenges. In particular, quality takes time. Requirements for documentation, reviews, and input from a broad set of stakeholders mean building a data model to Midas standards is much slower than building uncertified data. Re-architecting data models at scale also requires substantial staffing from data and analytics engineering experts (we’re hiring!), and entails costs for teams to migrate to the new data sources.
Offline data is a key technology asset for Airbnb, and this investment is warranted. Certified data models serve as the shared foundation for all data applications, spanning business reporting, product analytics, experimentation, and machine learning and AI. Investing in data quality improves the value of each of these applications, and will improve data-informed decisions at Airbnb for years to come.
With special thanks to Aaron Keys, a key partner on the early design and vision of the Midas initiative.