Data Quality at Airbnb
Part 1 — Rebuilding at Scale
Authors: Jonathan Parks, Vaughn Quoss, Paul Ellwood
At Airbnb, we’ve always had a data-driven culture. We’ve assembled top-notch data science and engineering teams, built industry-leading data infrastructure, and launched numerous successful open source projects, including Apache Airflow and Apache Superset. Meanwhile, Airbnb has transitioned from a startup moving at light speed to a mature organization with thousands of employees. During this transformation, Airbnb experienced the typical growth challenges that most companies do, including those that affect the data warehouse. This post explores the data challenges Airbnb faced during hyper growth and the steps we took to overcome these challenges.
As Airbnb grew from a small startup into the company it is today, many things changed. The company ventured into new business areas, acquired numerous companies, and significantly evolved its product strategy. Meanwhile, the requirements on our data also changed. For instance, leadership set high expectations for data timeliness and quality, and increased its focus on cost and compliance. It was apparent that, to keep meeting these expectations, we needed to make sizable investments in our data. These investments centered on three areas: ownership, data architecture, and governance.
Prior to the Data Quality initiative described in this post, data asset ownership was distributed mostly among product teams, where software engineers or data scientists were the primary owners of pipelines and datasets. However, data ownership responsibilities were not clearly defined, which created a bottleneck when issues arose.
Most of the pipelines constructed during the company’s early days were built organically, without well-defined quality standards or an overarching strategy for data architecture. This led to bloated data models and placed an outsized operational burden on a small group of engineers.
In addition to needing to lay out an overarching strategy for data architecture, Airbnb also needed a centralized governance process to enable teams to adhere to the strategy and standards.
The Data Quality Initiative
In early 2019, the company made an unprecedented commitment to data quality and formed a comprehensive plan to address the organizational and technical challenges we were facing around data. Airbnb leadership signed off on the Data Quality initiative — a project of massive scale to rebuild the data warehouse from the ground up using new processes and technology.
In developing a comprehensive strategy for improving data quality, we first came up with 5 primary goals:
- Ensure clear ownership for all important datasets
- Ensure important data always meets SLAs
- Ensure pipelines are built to a high quality standard using best practices
- Ensure important data is trustworthy and routinely validated
- Ensure that data is well-documented and easily discoverable
The following sections detail the specific approach that was taken to move this effort forward, with specific focus on our data engineering organization, architecture and best practices, and the processes we use to govern our data warehouse.
Once momentum on the Data Quality initiative reached a critical point, leadership realigned the company’s limited data engineering resources to kickstart the project. This was sufficient to unblock progress on Airbnb’s most critical data; however, it became obvious that we needed to unite and substantially grow the data engineering community at Airbnb. Below are changes we made to facilitate progress.
Data Engineering Role
For several years, Airbnb did not have an official Data Engineer role. Most data engineering work was done by data scientists and software engineers who were recruited under a variety of different monikers. This misalignment made hiring for data engineering skill sets very challenging, and created some confusion with respect to career progression. To resolve these issues, we reintroduced the role “Data Engineer” as a specialization within the ranks of the Engineering organization. The new role requires Data Engineers to be strong across several domains, including data modeling, pipeline development, and software engineering.
We also committed to a decentralized organizational structure composed of data engineering pods reporting into product teams (as opposed to a single centralized Data Eng org). This model ensures data engineers are aligned with the needs of consumers and the direction of product, while ensuring a critical mass of engineers (3 or more). Team size is important for providing mentorship/leadership opportunities, managing data operations, and smoothing over staffing gaps.
To complement the distributed pods of data engineers, we founded a central data engineering team that develops data engineering standards, tooling, and best practices. The team also manages global datasets that don’t align well with any of the product teams.
To better connect the data engineering community and to establish a framework for making decisions across the organization, we created new communication channels in the form of the following groups:
- Data Engineering Forum — Monthly all-hands meeting for data engineers intended for cascading context and gathering feedback from the broader community.
- Data Architect Working Group — Composed of senior data engineers from across the company. Responsible for making major architectural decisions, and conducting reviews for Midas certification (see below).
- Data Engineering Tooling Working Group — Composed of data engineers from across the company. Responsible for developing vision for data engineering tooling and workflows.
- Data Engineering Leadership Group — Composed of data engineering managers and our most senior Individual Contributors. Responsible for organizational and hiring decisions.
We revamped our hiring process for data engineers, and allocated aggressive headcount towards growing our data engineering practice. We paid particular attention to bringing in senior leaders to provide direction as we make decisions that will affect the organization in the years to come. This is an ongoing effort.
Architecture and Best Practices
The next step was to align on a common set of architecture principles and best practices to guide our work. We provided comprehensive guidelines for data modeling, operations, and technical standards for pipeline implementation, which are discussed below.
The company’s initial analytics foundation, “core_data”, was a star schema data model optimized for ease-of-use. It was built and owned by a central team, and incorporated numerous sources — often across different subject areas. This model worked extremely well in 2014; however, it became more and more difficult to manage as the company grew. Based on this learning, it was clear that our future data model should be designed thoughtfully and avoid the pitfalls of centralized ownership.
Meanwhile, the company built Minerva, a widely-adopted platform that catalogs metrics and dimensions and computes joins across these entities (among other capabilities). Given its broad capabilities and wide-scale adoption, it was obvious Minerva should continue to play a central role in our data architecture, and that our data models should play to Minerva’s strengths.
Based on this context, we designed our new data models to follow 2 key principles:
- Tables must be normalized (within reason) and rely on as few dependencies as possible. Minerva does the heavy lifting to join across data models.
- Tables describing a similar domain are grouped into Subject Areas. Each Subject Area must have a single owner that naturally aligns with the scope of a single team. Ownership should be obvious.
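The two principles above can be checked mechanically. The following is a minimal sketch, not Airbnb's tooling: the `Table` and `SubjectArea` classes, the `violations` function, the `max_dependencies` threshold, and all table and team names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Table:
    name: str
    depends_on: Tuple[str, ...] = ()  # upstream tables this one reads from


@dataclass
class SubjectArea:
    name: str
    owner: str  # the single owning team
    tables: List[Table] = field(default_factory=list)


def violations(areas: List[SubjectArea], max_dependencies: int = 2) -> List[str]:
    """List every place the catalog breaks one of the two principles."""
    problems: List[str] = []
    claimed_by: Dict[str, str] = {}  # table name -> first subject area that listed it
    for area in areas:
        # Principle 2: every Subject Area has one obvious owner.
        if not area.owner.strip():
            problems.append(f"{area.name}: subject area has no owner")
        for table in area.tables:
            # A table belongs to exactly one Subject Area.
            if table.name in claimed_by:
                problems.append(
                    f"{table.name}: listed in both {claimed_by[table.name]} and {area.name}"
                )
            else:
                claimed_by[table.name] = area.name
            # Principle 1: rely on as few dependencies as possible.
            if len(table.depends_on) > max_dependencies:
                problems.append(
                    f"{table.name}: {len(table.depends_on)} dependencies, limit is {max_dependencies}"
                )
    return problems
```

A check like this could run in continuous integration against a catalog file, so that an unowned Subject Area or a heavily coupled table is flagged before it ships. The dependency limit is an arbitrary stand-in for "within reason".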
Normalized data and Subject Area based data models are not new ideas in the world of data modeling, and they have recently had a major resurgence (see recent blog posts from other organizations on the “Data Mesh” architecture). We found this philosophy particularly attractive, as it addresses our former challenges and aligns well with the structure of our data organization.
We also made sweeping changes to our recommendations for pipeline implementation. This is discussed below.
Spark and Scala
When we began the Data Quality initiative, most critical data at Airbnb was composed via SQL and executed via Hive. This approach was unpopular among engineers, as SQL lacked the benefits of functional programming languages (e.g. code reuse, modularity, type safety, etc.). Meanwhile, Spark had reached maturity and the company had growing expertise in this domain. For these reasons, we made the shift to Spark, and aligned on the Scala API as our primary interface. We also ramped up investment in a common Spark wrapper to simplify read/write patterns and integration testing.
Another area we needed to improve was our data pipeline testing. Gaps in testing slowed iteration speed and made it difficult for outsiders to safely modify code. We now require that pipelines be built with thorough integration tests that run as part of our Continuous Integration processes.
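To illustrate the kind of test this requirement implies, here is a minimal sketch in plain Python. It deliberately does not use Spark or the Scala API, so that it runs without a cluster. The transformation `nights_per_listing`, the row fields, and the fixture values are all hypothetical. The idea it shows is that a pipeline step written as a pure function over rows can be exercised end to end on a small in-memory fixture.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def nights_per_listing(bookings: Iterable[Mapping[str, Any]]) -> Dict[int, int]:
    """Sum booked nights per listing, counting confirmed bookings only."""
    totals: Dict[int, int] = defaultdict(int)
    for row in bookings:
        if row["status"] == "confirmed":
            totals[row["listing_id"]] += row["nights"]
    return dict(totals)


def test_nights_per_listing_ignores_cancellations() -> None:
    # Small hand-written fixture standing in for an input table.
    fixture = [
        {"listing_id": 1, "nights": 2, "status": "confirmed"},
        {"listing_id": 1, "nights": 3, "status": "confirmed"},
        {"listing_id": 2, "nights": 4, "status": "cancelled"},
        {"listing_id": 3, "nights": 1, "status": "confirmed"},
    ]
    assert nights_per_listing(fixture) == {1: 5, 3: 1}
```

Because the test depends only on the function's inputs and outputs, someone unfamiliar with the pipeline can change the logic and learn immediately whether the expected behavior still holds.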
Data Quality Checks
We also built new tooling for executing data quality checks and anomaly detection, and required their use in new pipelines. Anomaly detection in particular has been highly successful in preventing quality issues in our new pipelines.
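The post does not describe how these checks are implemented. As a rough sketch of the two ideas, the example below pairs a simple not-null check with a z-score test on daily row counts. The z-score rule is a stand-in technique chosen for brevity, not necessarily what Airbnb's tooling uses, and the function names, threshold, and sample values are hypothetical.

```python
from statistics import mean, stdev
from typing import Any, Iterable, Mapping, Sequence


def column_has_no_nulls(rows: Iterable[Mapping[str, Any]], column: str) -> bool:
    """Data quality check: every row carries a non-null value for `column`."""
    return all(row.get(column) is not None for row in rows)


def is_anomalous(history: Sequence[float], today: float, threshold: float = 3.0) -> bool:
    """Flag `today` if it lies more than `threshold` sample standard
    deviations from the mean of the historical values."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu  # history is constant, so any change is unusual
    return abs(today - mu) / sigma > threshold
```

For example, with recent daily row counts of `[1000, 1020, 980, 1010, 990]`, a count of 1005 passes while a count of 400 is flagged. A pipeline could refuse to publish a partition when either check fails.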
Data operations was another opportunity for improvement, so we set strict requirements in this area. All important datasets are required to have an SLA for landing times, and pipelines are required to be configured with PagerDuty.
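A landing-time SLA can be expressed as a deadline that is compared against the time a partition actually arrived. The sketch below assumes, purely for illustration, that data for a given day is due the following day at a fixed UTC time. The functions `sla_deadline` and `missed_sla` are hypothetical, and the alerting call itself is omitted.

```python
from datetime import date, datetime, time, timedelta, timezone
from typing import Optional


def sla_deadline(partition: date, land_by: time) -> datetime:
    """Data for `partition` is due on the following day at `land_by` (UTC)."""
    return datetime.combine(partition + timedelta(days=1), land_by, tzinfo=timezone.utc)


def missed_sla(
    partition: date,
    land_by: time,
    landed_at: Optional[datetime],
    now: datetime,
) -> bool:
    """True if the partition landed late, or has not landed and is overdue."""
    deadline = sla_deadline(partition, land_by)
    if landed_at is not None:
        return landed_at > deadline
    return now > deadline
```

A scheduler could evaluate `missed_sla` periodically for each important dataset and trigger an alert for the owning team when it returns `True`.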
As we set out to rebuild our data warehouse, it was clear that we needed a mechanism to ensure cohesion between data models and maintain a high quality bar across teams. We also needed a better way to surface our most trustworthy datasets to end users. To accomplish this, we launched the Midas certification process (depicted in the diagram below).
The Midas process requires stakeholders to first align on design specifications before building their pipelines. This is done via a Spec document that provides layman’s descriptions of metrics and dimensions, table schemas, and pipeline diagrams, and that describes non-obvious business logic and other assumptions. Once the spec is approved, a data engineer builds the datasets and pipelines based on the agreed-upon specification. The resulting data and code are then reviewed, and ultimately granted certification. The certification flags are made visible in all consumer-facing data tools, and certified data is prioritized in data discoverability tools.
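The sequence described above can be read as a small state machine in which a dataset cannot skip a step. The sketch below is only an interpretation of that paragraph: the stage names and the `advance` function are hypothetical, and the real Midas process may include stages or review loops not captured here.

```python
from enum import Enum


class Stage(Enum):
    SPEC_DRAFT = "spec_draft"        # stakeholders are aligning on the spec
    SPEC_APPROVED = "spec_approved"  # spec signed off, ready to build
    BUILT = "built"                  # datasets and pipelines implemented
    REVIEWED = "reviewed"            # data and code reviewed
    CERTIFIED = "certified"          # certification flag is shown to consumers


# Each stage has exactly one successor, and CERTIFIED is terminal.
_NEXT = {
    Stage.SPEC_DRAFT: Stage.SPEC_APPROVED,
    Stage.SPEC_APPROVED: Stage.BUILT,
    Stage.BUILT: Stage.REVIEWED,
    Stage.REVIEWED: Stage.CERTIFIED,
}


def advance(stage: Stage) -> Stage:
    """Move a dataset to the next stage, refusing to go past certification."""
    if stage not in _NEXT:
        raise ValueError(f"{stage.name} is terminal and cannot be advanced")
    return _NEXT[stage]


def is_certified(stage: Stage) -> bool:
    """Only fully certified data should carry the certification flag."""
    return stage is Stage.CERTIFIED
```

Modeling the stages explicitly makes it straightforward for a data tool to show the certification flag only when `is_certified` returns `True`.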
We will provide more details about the Midas Certification process in a future post.
Last, but not least, we created new mechanisms for ensuring accountability related to data quality. We refreshed our process for reporting data quality bugs, and created a weekly Bug Review meeting for discussing high priority bugs and aligning on corrective actions. We also require that teams incorporate data pipeline SLAs into their quarterly OKR planning.
As a company matures, the requirements for its data warehouse change significantly. To meet these changing needs at Airbnb, we successfully reconstructed the data warehouse and revitalized the data engineering community. This was done as part of a company-wide Data Quality initiative.
The Data Quality initiative accomplished this revitalization through an all-in approach that addressed problems at every level. This included bringing back the Data Engineering function, setting a high technical bar for the role, and building a community for this engineering specialty. A new team was also formed to develop data engineering-specific tools. The company also developed a highly opinionated architecture and technical standards, and launched the Midas certification process to ensure all new data was built to this standard. And finally, the company up-leveled accountability by setting high expectations for data pipeline owners, specifically for operations and bug resolution.
At this point in time, the Data Quality initiative is moving at full steam, but there is still plenty of work to be done. We’re accelerating investments into our data foundation, designing our next generation of data engineering tools and workflows, and developing a strategy that will shift our data warehouse from a daily batch paradigm to near real-time. We are aggressively hiring data engineering leaders who will develop these architectures and drive them to completion. If you want to help us achieve these goals, check out the Airbnb Careers page.