How Airbnb Achieved Metric Consistency at Scale

Part-I: Introducing Minerva — Airbnb’s Metric Platform

Data is the voice of our users at scale. In the midst of the COVID-19 pandemic, we saw that travel with Airbnb has become hyper-local.

Introduction

At Airbnb, we lean on data to inform our critical decisions. We validate product ideas through randomized controlled experiments, and we track our business performance rigorously to ensure that we maximize value for our stakeholders. To achieve these goals, we needed to build a robust data platform that serves our internal users’ end-to-end needs.

A Brief History of Analytics at Airbnb

Like many data-driven companies, Airbnb started its data journey humbly. Circa 2010, there was only one full-time analyst at the company working on data, and his laptop was effectively the company’s data warehouse. Queries were often run directly against the production databases, and expensive queries occasionally caused serious incidents and took down Airbnb.com. In spite of the pitfalls, this simple solution helped Airbnb identify many growth opportunities over the years.

Airbnb and the data that fuels it have grown substantially over the years.

Growing Pains

While `core_data` brought several step-function changes to Airbnb’s data capabilities, our success did not come without some significant cost. In fact, the proliferation of data and use cases caused serious growing pains, both for data producers and for data consumers.

Proliferation of derived tables built on top of `core_data` caused some serious growing pains.

Overcoming Our Growing Pains with Minerva

As these pain points worsened, Airbnb embarked on a multi-year journey to revamp its data warehouse with the goal of drastically improving data quality at the company. As a first step, our data engineering team rebuilt several key business data models from scratch, which resulted in a set of certified, lean, normalized tables that do not use unnecessary joins. These vetted tables now serve as the new foundation for our analytics warehouse.

Minerva, Airbnb’s metric platform, plays a central role in Airbnb’s new data warehouse architecture.
Adoption of Minerva at Airbnb has grown tremendously in the past two years.

Data Production in Minerva

From an infrastructure perspective, Minerva is built on top of open-source projects. It uses Airflow for workflow orchestration, Apache Hive and Apache Spark as the compute engines, and Presto and Apache Druid for consumption. From metric creation through computation, serving, consumption, and eventually deprecation, Minerva covers the full life cycle of a metric.

Minerva manages the entire lifecycle of metrics at Airbnb.
  • Metrics Definition: Minerva defines key business metrics, dimensions, and other metadata in a centralized GitHub repository that can be viewed and updated by anyone at the company.
  • Validated Workflow: The Minerva development flow enforces best data engineering practices such as code review, static validation, and test runs.
  • DAG Orchestration: Minerva performs data denormalization efficiently by maximizing the reuse of data and of intermediate joined results.
  • Computation Runtime: Minerva has a sophisticated computation flow that can automatically self-heal after job failures and has built-in checks to ensure data quality.
  • Metrics / Metadata Serving: Minerva provides a unified data API to serve both aggregated and raw metrics on demand.
  • Flexible Backfills: Minerva version-controls data definitions, so major changes to the datasets are automatically tracked and backfilled.
  • Data Management: Minerva has built-in capabilities such as cost attribution, GDPR selective deletion, data access control, and an auto-deprecation policy.
  • Data Retention: Minerva establishes usage-based retention and garbage collection, so expensive but infrequently utilized datasets are removed.
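
To make the definition and validation steps in the list above more concrete, here is a minimal Python sketch of what a centrally defined metric and a static validation pass could look like. It is an illustration only and not Minerva's actual configuration format: the `MetricDefinition` fields, the `validate` function, and the example names (`bookings`, `core_data.fct_bookings`, `market`, `device`) are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical schema for illustration; not Minerva's real config format.
ALLOWED_AGGREGATIONS = {"sum", "count", "count_distinct", "avg"}


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    owner: str
    description: str
    event_source: str  # table the metric is computed from
    aggregation: str
    dimensions: tuple = ()


def validate(metric, known_sources, known_dimensions):
    """Return a list of problems found; an empty list means the definition passes."""
    problems = []
    if not metric.owner:
        problems.append(f"{metric.name}: missing owner")
    if not metric.description:
        problems.append(f"{metric.name}: missing description")
    if metric.event_source not in known_sources:
        problems.append(f"{metric.name}: unknown source {metric.event_source!r}")
    if metric.aggregation not in ALLOWED_AGGREGATIONS:
        problems.append(f"{metric.name}: unsupported aggregation {metric.aggregation!r}")
    for dim in metric.dimensions:
        if dim not in known_dimensions:
            problems.append(f"{metric.name}: unknown dimension {dim!r}")
    return problems


# Made-up example definition and catalogs of known sources and dimensions.
bookings = MetricDefinition(
    name="bookings",
    owner="analytics-team",
    description="Number of bookings made.",
    event_source="core_data.fct_bookings",
    aggregation="count",
    dimensions=("market",),
)
print(validate(bookings, {"core_data.fct_bookings"}, {"market", "device"}))
```

A check like this can run before any data is computed, which is the point of static validation: a definition with a missing owner or an unknown dimension is rejected at review time rather than discovered in a dashboard.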

Data Consumption in Minerva

Minerva’s product vision is to allow users to “define metrics once, use them everywhere”. That is, a metric created in Minerva should be easily accessed in company dashboarding tools like Superset, tracked in our A/B testing framework ERF, or processed by our anomaly detection algorithms to spot business anomalies, just to name a few. Over the last few years, we have partnered closely with other teams to create an ecosystem of tools built on top of Minerva.

Minerva’s vision is “define once, use everywhere”.

Data Catalog

First, we partnered closely with the Analytics Product team to index all Minerva metrics and dimensions in the Dataportal, Airbnb’s data catalog. When a user searches for a metric in the Dataportal, Minerva metrics are ranked at the top of the search results. The Dataportal also surfaces contextual information, such as certification status, ownership, and popularity, so that users can gauge the relative importance of metrics. For most non-technical users, the Dataportal is their first entry point to metrics in Minerva.

Minerva metrics are indexed and catalogued in the Dataportal UI.

Data Exploration

Upon selecting a metric, users are redirected to Metric Explorer, a component of the Dataportal that enables out-of-the-box data exploration. On a metric page, users can see trends of a metric with additional slicing and drill-down options such as `Group By` and `Filter`. Those who wish to dig deeper can click into the Superset view to perform more advanced analytics. Throughout this experience, Metric Explorer surfaces metadata such as metric owners, historical landing time, and metric description to enrich the data context. This design balances the needs of both technical and non-technical users so they can uncover data insights in place.

Users can investigate trends and anomalies in Metric Explorer and Superset seamlessly.

A/B Testing

Historically, Airbnb’s Experimentation Reporting Framework (ERF) had its own experiment metrics repository called “metrics repo”. Experimenters could add any business metric to an experiment and compare the results of the control and treatment groups. Unfortunately, the metrics repo couldn’t be used for use cases beyond experimentation, so we decided to integrate Minerva with ERF so that all base events for A/B tests are defined in and sourced from Minerva. Using the same source across experimentation and analytics means data scientists can be confident in their understanding of how certain experiments could affect the top-line business metrics.
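
As a rough illustration of what comparing the control and treatment groups on a shared metric can involve, the sketch below computes the relative lift and a Welch t-statistic for per-user metric values. This is a generic textbook comparison, not a description of how ERF works internally; the function name `compare_groups` and the sample values are made up for the example.

```python
from math import sqrt
from statistics import mean, variance


def compare_groups(control, treatment):
    """Compare per-user metric values between two experiment groups.

    Returns the group means, the relative lift of treatment over control,
    and a Welch t-statistic (None when the standard error is zero).
    Each group needs at least two observations for the sample variance.
    """
    mean_c, mean_t = mean(control), mean(treatment)
    # Standard error of the difference in means, allowing unequal variances.
    std_err = sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    return {
        "control_mean": mean_c,
        "treatment_mean": mean_t,
        "relative_lift": (mean_t - mean_c) / mean_c if mean_c != 0 else None,
        "t_statistic": (mean_t - mean_c) / std_err if std_err > 0 else None,
    }


# Made-up per-user values of the same metric in each group.
result = compare_groups(control=[1, 2, 3], treatment=[2, 3, 4])
print(result)
```

The important property is less the statistics than the inputs: when both lists are produced from the same metric definition that feeds the dashboards, a lift measured in an experiment is directly comparable to the movement later seen in the top-line metric.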

Executive Reporting

Since Airbnb became a public company, we have adopted a practice of reviewing Airbnb’s business performance on weekly, monthly, and quarterly cadences. In these meetings, leaders across different functions meet and discuss the current state of the business. This type of meeting requires executive reports that are high-level and succinct. Data are often aggregated, trends are analyzed and plotted, and metric movements are presented as running aggregations (e.g., year to date) and time ratio comparisons (e.g., year over year).

Here is an example of the reporting configuration for the COVID-19 dashboard, built on top of Minerva.
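
The two kinds of calculation mentioned in this section, running aggregations (year to date) and time ratio comparisons (year over year), can be sketched in a few lines of Python. This is an illustrative sketch on made-up daily values, not Minerva's reporting code; the function names and the choice to map February 29 to February 28 in the prior year are assumptions.

```python
from datetime import date


def year_to_date(daily, as_of):
    """Sum daily values from January 1 of as_of's year through as_of (inclusive)."""
    start = date(as_of.year, 1, 1)
    return sum(value for day, value in daily.items() if start <= day <= as_of)


def same_day_last_year(day):
    """Map a date to the prior year, falling back to Feb 28 for Feb 29."""
    try:
        return day.replace(year=day.year - 1)
    except ValueError:
        return day.replace(year=day.year - 1, day=28)


def year_over_year(daily, as_of):
    """Relative change of the YTD total versus the same period one year earlier."""
    current = year_to_date(daily, as_of)
    previous = year_to_date(daily, same_day_last_year(as_of))
    if previous == 0:
        return None  # ratio is undefined without a prior-year baseline
    return current / previous - 1


# Made-up daily metric values keyed by date.
daily_values = {
    date(2019, 1, 1): 100,
    date(2019, 1, 2): 100,
    date(2019, 6, 1): 50,
    date(2020, 1, 1): 80,
    date(2020, 1, 2): 70,
}
ytd = year_to_date(daily_values, date(2020, 1, 2))
yoy = year_over_year(daily_values, date(2020, 1, 2))
print(ytd, yoy)
```

Keeping both calculations as functions of the same daily series means a report at any cadence is derived from one underlying metric rather than from separately maintained aggregates.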

Data Analysis

Last but not least, Minerva data is exposed to Airbnb’s custom R and Python clients through Minerva’s API. This allows data scientists to query Minerva data in a notebook environment with ease. Importantly, the data surfaced in the notebook environment is computed and served in exactly the same way as it is in the aforementioned tools, such as Superset and Metric Explorer. This saves data scientists an enormous amount of time, as they can choose the right tool for the job depending on the complexity of the analysis. Notably, this data API encourages lightweight prototyping of internal tooling, which can later be productionized and shared across the company. For example, data scientists have built a time series analysis tool and an email reporting framework using this API over the last two years.

A data scientist can use our Python client to retrieve aggregated data in Minerva and conduct analyses.

How We Responded To the COVID-19 Crisis with Minerva Data

As Minerva became a centerpiece of analytics at Airbnb, we saw again and again the power and productivity gains it has brought to the data community at Airbnb. In this last section, we want to give a concrete example of how Minerva aided the business during the COVID-19 crisis.

We were able to dramatically shorten the time from data curation to insight discovery and assess the impact of COVID-19 on Airbnb’s business because of Minerva!

Closing

In this post, we briefly summarized the history of Airbnb’s analytics journey, the growing pains we faced in the last few years, and why we built Minerva, Airbnb’s metric infrastructure. In particular, we covered how data is produced and consumed via Minerva. Toward the end of the post, we also highlighted a recent example of how Minerva helped Airbnb to react to the COVID-19 crisis.

Acknowledgments

Minerva is made possible only because of the care and dedication from those who worked on it. We would also like to thank Lauren Chircus, Aaron Keys, Mike Lin, Adrian Kuhn, Krishna Bhupatiraju, Michelle Thomas, Erik Ritter, Serena Jiang, Krist Wongsuphasawat, Chris Williams, Ken Chen, Guang Yang, Jinyang Li, Clark Wright, Vaughn Quoss, Jerry Chu, Pala Muthiah, Kevin Yang, Ellen Huynh, and many more who partnered with us to make Minerva more accessible across the company. Finally, thank you Bill Ulammandakh for creating the beautiful visualization so we can use it as our header image!
