Metis: Building Airbnb’s Next Generation Data Management Platform
How Airbnb evolved our data catalog into a platform for managing and governing our data warehouse at scale.
At Airbnb, millions of data assets exist in a complex ecosystem to inform our business and improve our products. The Data Management team’s mission is to empower the company to manage its data ecosystem at scale.
To do this, we need an accurate understanding of all of the assets in our ecosystem and how they relate to each other. In other words, we need accurate metadata. Our data management platform Metis, named for the Greek goddess of good counsel, is our solution to ensure that trustworthy metadata can be captured, managed, and consumed at scale.
From humble beginnings
Metis is an evolution of our existing foundation of metadata products within Airbnb.
Dataportal was our first effort towards democratizing data: successfully enabling data users to find trusted data. It was a huge boon to productivity and pretty ahead of its time.
As data reliability and compliance regulations became important, we needed a more comprehensive and detailed understanding of how data was transformed. This led to our adoption of Apache Atlas as our data lineage solution. Apache Atlas powers products like SLA Tracker (see Visualizing Data Timeliness at Airbnb), which combines landing time metadata and lineage to enable debugging of upstream data delays.
As our metadata requirements grew to cover more areas, such as cost management and data quality, our needs for a data catalog expanded to include:
- Ability to govern both the data and metadata describing it
- Guardrails and recommendations to improve data quality
- Auditability of a dataset’s history, both for debugging & governance purposes
We soon learned that data management had to be pursued as a discipline, so we built Metis as the one-stop shop for accessing all metadata about our data.
What we’ve built
Metis is made up of three core products: Dataportal, Unified Metadata Service (UMS), and Lineage Service. Together, this platform allows Airbnb to manage millions of data assets across many domains. A short list of the assets we support includes:
- Apache Hive and Trino datasets
- Metrics and Dimensions, powered by Airbnb’s Metric Platform: Minerva
- Charts and Dashboards from Apache Superset and Tableau®
- Data Models, including those certified by Midas
- Machine Learning features and models
- Teams and employees of Airbnb (not technically a data asset, but key to support high quality ownership and ensure metadata remains up to date for all the above data assets)
At a high level, Metis consists of the following components:
- Dataportal: serves as a catalog and management UI for human users.
- Viaduct: Airbnb’s in-house GraphQL API layer, modeling the offline data ecosystem.
- UMS Core service: a backend service holding the system schema and business logic needed for metadata management.
  - MySQL: primarily stores critical metadata that needs to be centrally managed.
  - Lineage Graph: a centralized service collecting and serving data lineage.
  - Elasticsearch: serves search and discovery use cases.
- Offline Component: runs outside the UMS Core service to perform offline tasks, e.g. metadata consistency checks and policy enforcement.
- Offline Dataset: an offline export of metadata for analytics use cases.
Dataportal serves as the UI for Airbnb’s data catalog and is a place for people to find and manage all the assets supported by Metis. It’s built as a Single Page Application using React and TypeScript, which makes it flexible enough to serve the large variety of workflows required for data management and governance. The frontend communicates with UMS and other services via a GraphQL API. This is especially important for a performant user experience, because it lets us avoid both sequential fetches of lineage information and over-fetching of large amounts of metadata.
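To make the GraphQL point concrete, here is a minimal sketch of how a client could ask for a dataset and several hops of upstream lineage in one request while selecting only the fields it needs. The schema is an assumption for illustration: the `dataset` field, the `upstream` relation, and the `name` and `owner` fields are hypothetical and are not taken from Airbnb’s actual API.

```python
import json


def build_lineage_query(depth: int, fields=("name", "owner")) -> str:
    """Build one GraphQL document covering `depth` hops of upstream lineage.

    The `dataset` and `upstream` fields are hypothetical schema names.
    """

    def selection(level: int, indent: int) -> str:
        pad = " " * indent
        # Select only the listed fields, which avoids over-fetching.
        lines = [f"{pad}{name}" for name in fields]
        if level > 0:
            # Nest the next hop instead of issuing a follow-up request.
            lines.append(f"{pad}upstream {{")
            lines.append(selection(level - 1, indent + 2))
            lines.append(f"{pad}}}")
        return "\n".join(lines)

    return (
        "query DatasetWithLineage($name: String!) {\n"
        "  dataset(name: $name) {\n"
        f"{selection(depth, 4)}\n"
        "  }\n"
        "}"
    )


def build_request_body(dataset_name: str, depth: int) -> str:
    """Serialize the query in the usual GraphQL-over-HTTP JSON shape."""
    return json.dumps(
        {"query": build_lineage_query(depth), "variables": {"name": dataset_name}}
    )
```

Calling `build_lineage_query(2)` yields a single document with two nested `upstream` selections, so two hops of lineage arrive in one round trip rather than three sequential ones.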
Search and Discovery
The Dataportal experience starts with search, so that both data consumers and data owners can find the assets they need. We’ve designed our search and discovery experience with a few principles in mind:
- Display relevant metadata directly in the search results to help people find the exact asset they’re looking for
- Uprank high quality and commonly used data assets, in case the user does not know the exact asset they need
As a result, search results tend to surface high quality, certified datasets, each shown with its description, recent user count, and last-modified time to help the user choose the right asset.
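One way to picture the upranking principle is a score that combines text relevance with quality and usage signals. The following is only a sketch under stated assumptions: the multiplicative certification boost, the logarithmic usage term, and both weights are illustrative choices, not Airbnb’s actual ranking function.

```python
import math
from dataclasses import dataclass


@dataclass
class SearchHit:
    name: str
    text_score: float   # relevance reported by the search engine
    certified: bool     # e.g. a certified data model
    recent_users: int   # distinct recent consumers


def rank_score(hit: SearchHit, certified_boost: float = 1.5,
               usage_weight: float = 0.2) -> float:
    """Combine relevance with quality and usage signals (hypothetical weights)."""
    score = hit.text_score * (certified_boost if hit.certified else 1.0)
    # log1p dampens usage so very popular assets do not drown out relevance.
    return score + usage_weight * math.log1p(hit.recent_users)


def rank(hits):
    """Return hits ordered from highest to lowest combined score."""
    return sorted(hits, key=rank_score, reverse=True)
```

With these weights, a certified, widely used table can outrank a scratch table whose name matches the query slightly better, which is the behavior the second principle describes.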
Once the desired asset is located, the user can visit the Entity Page to perform a large variety of consumption, management, and governance actions. We structure all the content on the entity page into tabs grouped by category of data or action:
Consumption and documentation related tabs make it easy for people to learn how to use a table: the Configuration tab holds column and table descriptions, the Points of contact tab lists owners and consumers, and the Documentation tab gives further details on how to use the table. Beyond that, these pages also allow users to carry out management activities, as seen in the screenshots below:
The screenshots above highlight only a subset of the ways we leveled up Dataportal from a searchable data catalog into the one centralized place to manage and govern all your data assets.
Unified Metadata Service
Unified Metadata Service, or UMS, is the backend core of our centralized data management platform. It provides:
- A centralized schema and a GraphQL API layer on top of it to access metadata
- A centralized relationship graph to connect siloed metadata
- Centralized metadata management capabilities to enable systems to meet compliance and governance requirements without reinventing the wheel
The centralization of metadata into UMS means that metadata providers and consumers no longer need to integrate with each other directly. Instead, every provider and consumer needs to integrate only with UMS.
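The benefit is easy to quantify with a little arithmetic. In the worst case, point-to-point integration needs one connection per provider and consumer pair, while a hub needs one connection per system. The counts below are illustrative numbers, not Airbnb figures.

```python
def point_to_point_integrations(providers: int, consumers: int) -> int:
    """Worst case: every consumer integrates with every provider."""
    return providers * consumers


def hub_integrations(providers: int, consumers: int) -> int:
    """With a central hub, each system integrates once, with the hub."""
    return providers + consumers


# Illustrative counts only.
print(point_to_point_integrations(10, 20))  # 200
print(hub_integrations(10, 20))             # 30
```

The gap widens as the ecosystem grows, since the first count is a product and the second is a sum.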
Metadata Integration Patterns
UMS plays various roles across metadata integrations and use cases. In a decentralized data ecosystem, we are very opinionated about what metadata should be stored in, replicated to, or served through UMS.
Unified presentation layer proxying requests
UMS supports proxying read requests to many data systems. This includes proxying read requests to:
- Hive Metastore for table schema and table properties.
- Lineage service for raw Hive table data lineage.
- Data Governance service for data governance status for datasets.
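The proxying role can be sketched as a thin routing layer: each metadata field is mapped to the system that owns it, and a read touches only the systems behind the requested fields. Everything here is hypothetical, including the class name, the field names, and the stub fetchers that stand in for real clients of the three systems listed above.

```python
from typing import Any, Callable, Dict, Iterable


class MetadataProxy:
    """Routes each requested metadata field to the system that owns it."""

    def __init__(self) -> None:
        self._fetchers: Dict[str, Callable[[str], Any]] = {}

    def register(self, field: str, fetch: Callable[[str], Any]) -> None:
        self._fetchers[field] = fetch

    def read(self, dataset: str, fields: Iterable[str]) -> Dict[str, Any]:
        # Only the requested fields trigger a call to their backing system.
        return {field: self._fetchers[field](dataset) for field in fields}


def build_proxy() -> MetadataProxy:
    """Wire up stub fetchers; real ones would call the owning services."""
    proxy = MetadataProxy()
    proxy.register("schema", lambda ds: [{"name": "id", "type": "bigint"}])
    proxy.register("upstream", lambda ds: ["raw.events"])
    proxy.register("governance_status", lambda ds: "compliant")
    return proxy
```

A caller such as `build_proxy().read("core.bookings", ["schema", "governance_status"])` gets one merged response and never needs to know which backing system answered which field.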
Metadata management service
UMS centrally manages a small set of critical business metadata, storing it in its own metadata database with the following management capabilities:
- Validation and authorization for updates
- Audit history
- Approval workflow for sensitive operations on critical metadata
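The three capabilities above can be sketched together in a small in-memory store. This is an illustration under assumptions, not the UMS implementation: the class, the `SENSITIVE_KEYS` set, the role sets, and the rule that only `owner` changes need approval are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

SENSITIVE_KEYS = {"owner"}  # hypothetical: keys whose changes need approval


@dataclass
class ManagedMetadataStore:
    editors: Set[str]
    approvers: Set[str]
    values: Dict[Tuple[str, str], str] = field(default_factory=dict)
    audit_log: List[dict] = field(default_factory=list)
    pending: List[dict] = field(default_factory=list)

    def update(self, user: str, asset: str, key: str, value: str) -> str:
        # Authorization: only known editors may propose changes.
        if user not in self.editors:
            raise PermissionError(f"{user} may not edit metadata")
        # Validation: reject empty values.
        if not value.strip():
            raise ValueError("metadata value must not be empty")
        change = {"user": user, "asset": asset, "key": key, "value": value}
        if key in SENSITIVE_KEYS:
            # Approval workflow: park the change until an approver signs off.
            self.pending.append(change)
            return "pending_approval"
        self._apply(change, approved_by=None)
        return "applied"

    def approve(self, approver: str) -> None:
        if approver not in self.approvers:
            raise PermissionError(f"{approver} may not approve changes")
        self._apply(self.pending.pop(0), approved_by=approver)

    def _apply(self, change: dict, approved_by: Optional[str]) -> None:
        slot = (change["asset"], change["key"])
        # Audit history: record who changed what, plus the previous value.
        self.audit_log.append(
            {**change, "old_value": self.values.get(slot), "approved_by": approved_by}
        )
        self.values[slot] = change["value"]
```

In this sketch a description edit is applied immediately, while an ownership change stays in `pending` until an approver calls `approve`; both end up in `audit_log`.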
Supporting online use cases for offline generated metadata
As part of Airbnb’s Data Quality Initiative, we implemented data quality scores that are directly tied to each data asset in the data warehouse. Data quality scores for datasets are generated offline and ingested into the UMS metadata database for online consumption.
Centrally managed search indexes powering data discovery
Similar to a traditional data catalog, UMS centrally manages indexes in an Elasticsearch cluster for different entities to power data discovery.
There are cases where metadata needs to be stored in or replicated into the Metis storage layer. UMS integrates with metadata providers through a variety of paved mechanisms that leverage Airbnb’s tech stack to ingest metadata. These include:
- Stream processing (Flink) jobs ingesting metadata change events.
- ETL (Airflow) jobs that run daily to pull from metadata providers and push to UMS.
- Direct calls to UMS API.
When we onboard a new metadata provider, the key work involved is identifying product requirements and aligning on the scope of metadata integration, followed by finalizing the actual integration mechanism.
The final major piece of Metis is our Lineage Service. We adopted Apache Atlas as Airbnb’s data lineage solution for our Data Warehouse back in 2020.
At Airbnb, Apache Atlas holds a large lineage graph containing over 100 million nodes and 300 million edges. The primary volume of lineage data comes from production Hive tables and a large volume of intermediate Hive tables in our Data Warehouse.
We have extensively customized and tuned Apache Atlas to handle the large scale of lineage events in our Data Warehouse:
- Applying a sharding strategy to lineage events to increase parallelism.
- Improving Atlas server code efficiency on top of a graph database.
- Fine-tuning the underlying storage systems backing the graph database for scalability and latency.
- Optimizing the read path and adding filtering support to access lineage data more efficiently.
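The post does not say how lineage events are sharded, so the following is only one plausible sketch of the first item: hash each event’s output table to a shard, so that events for the same table stay ordered on one shard while different tables are processed in parallel. The event shape and the choice of shard key are assumptions.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition_events(events: List[dict], num_shards: int) -> Dict[int, List[dict]]:
    """Group lineage events by shard, keyed on the (hypothetical) output table."""
    shards: Dict[int, List[dict]] = defaultdict(list)
    for event in events:
        # Same output table -> same shard, so per-table order is preserved.
        shards[shard_for(event["output"], num_shards)].append(event)
    return dict(shards)
```

A cryptographic hash is used instead of Python’s built-in `hash()` because the latter is salted per process for strings, which would route the same table to different shards on different workers.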
Atlas’s lineage-related components, including its Graph Engine (JanusGraph), Type System, Ingest (with Hook integrations), and lineage API, have allowed us to efficiently collect and serve lineage data, providing valuable insights into the relationships between various data assets and pipelines. The Lineage Service powers many critical data compliance, data reliability, and data quality products; see Visualizing Data Timeliness at Airbnb for one example.
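To illustrate what serving lineage with filtering might look like, here is a small sketch of a bounded upstream traversal over an in-memory graph. It is not the Atlas API: the adjacency-dictionary representation, the `max_depth` bound, and the `keep` predicate are assumptions chosen to show the idea of limiting and filtering a lineage read.

```python
from collections import deque
from typing import Callable, Dict, List


def upstream(
    graph: Dict[str, List[str]],
    start: str,
    max_depth: int,
    keep: Callable[[str], bool] = lambda table: True,
) -> List[str]:
    """Breadth-first walk over direct-upstream edges, at most `max_depth` hops.

    `graph` maps a table to the tables it reads from. Tables rejected by
    `keep` are omitted from the result but are still traversed through.
    """
    seen = {start}
    queue = deque([(start, 0)])
    result: List[str] = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested number of hops
        for parent in graph.get(node, []):
            if parent in seen:
                continue
            seen.add(parent)
            if keep(parent):
                result.append(parent)
            queue.append((parent, depth + 1))
    return result
```

Passing a predicate such as `lambda t: not t.startswith("tmp.")` hides intermediate tables from the response while still following lineage through them, which keeps responses small on a graph dominated by intermediate tables.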
Conclusion & Appreciations
As shown above, Airbnb’s approach to data management has evolved significantly over the past six years. We started building Dataportal with a goal to “democratize data” at Airbnb, and we now have Metis: a platform that enables anyone at Airbnb to search, discover, consume, and manage all the data and metadata in our offline warehouse. Metis serves critical roles across data compliance, data reliability, and data quality initiatives, and helps 1,000+ data users every week.
Our future work will involve two key priorities. First, we will focus on evolving our system architecture and underlying technology in order to keep pace with the rapid evolution of our data ecosystem. Second, we plan to expand our coverage to more systems and enable more advanced data management capabilities, reflecting our ongoing commitment to investing in data here at Airbnb.
Metis would not have been possible without the members of the data management team as well as our cross-functional and cross-org collaborators. They include, but are not limited to: Adam Kocoloski, Adam Wong, Cindy Yu, Dave Nagle, Erik Ritter, Jerry Wang, Jiaxin Ye, John Bodley, Jyoti Wadhwani, Liyin Tang, Michelle Thomas, Nathan Towery, Paul Ellwood, Sylvia Tomiyama, Vyl Chiang, Woody Zhou, Xiaobin Zheng, and Zuzana Vejrazkova.
Apache Airflow, Apache Atlas, Apache Hive, Apache Superset, Atlas, and Hive are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
All trademarks, service marks, company names and product names are the property of their respective owners. Any use of these is for identification purposes only and does not imply sponsorship or endorsement.