Automating Data Protection at Scale, Part 1
Part one of a series on how we provide powerful, automated, and scalable data privacy and security engineering capabilities at Airbnb.
Elizabeth Nammour, Wendy Jin, Shengpu Liu
Our community of hosts and guests trust that we will keep their data safe and honor their privacy rights. With frequent news reports of data security breaches, coupled with global regulations and security requirements, monitoring and protecting data has become an even more critical problem to solve.
At Airbnb, data is collected, stored, and propagated across different data stores and infrastructures, making it hard to rely on engineers to manually keep track of how user and sensitive data flows through our environment. This, in turn, makes it challenging for them to protect it. While many vendors exist for different aspects of data security, no single tool met all of our requirements when it came to data discovery and automated data protection, nor did any support all of the data stores and environments in our ecosystem.
In this three-part blog series, we’ll be sharing our experience building and operating a data protection platform at Airbnb to address these challenges. In this first post, we will give an overview of why we decided to build the Data Protection Platform (DPP), walk through its architecture, and dive into the data inventory component, Madoka.
Data Protection Platform (DPP)
Since no one tool was meeting our needs, we decided to build a data protection platform to enable and empower Airbnb to protect data in compliance with global regulations and security requirements. However, in order to protect the data, we first needed to understand it and its associated security and privacy risks.
Understanding Airbnb’s Data
At Airbnb, we store petabytes of data across different file formats and data stores, such as MySQL, Hive, and S3. Data is generated, replicated, and propagated daily throughout our entire ecosystem. In order to monitor and gain an understanding of the ever-changing data, we built a centralized inventory system that keeps track of all the data assets that exist. This inventory system also collects and stores metadata around the security and privacy properties of each asset, so that the relevant stakeholders at Airbnb can understand the associated risks.
Since a data asset may contain anything from sensitive business secrets to public information, understanding what type of data is stored within a data asset is crucial to determining the level of protection needed. In addition, privacy laws, such as the European Union General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA), have granted users the right to access and delete their personal data. However, personal data is a less-than-precise term that represents many different data elements, including email addresses, messages sent on the platform, location info, etc. In order to comply with these laws, we need to pinpoint the exact location of all personal data. To do this, we built a scalable data classification system that continuously scans and classifies our data assets to determine what type of data is stored within them.
Enabling Automated Data Protection
Based on the understanding of the data, the DPP strives to automate its protection, or enables and notifies teams across the company to protect it. This automation focuses on a few key areas: data discovery, prevention of sensitive data leakages, and data encryption.
Discovering personal data is the first step to privacy compliance. This is especially true as personal data needs to be deleted or returned to a user upon request. Our platform enables us to automatically notify data owners when new personal data is detected in their data stores and integrate this data with our privacy orchestration service to ensure it gets deleted or returned if needed.
A common cause of data breaches is when sensitive secrets, such as API keys or credentials, are leaked internally and then make their way into the hands of an attacker. This can come from an engineer logging the secret within their service or committing the secret to code. Our data protection platform identifies potential leaks from various endpoints and notifies the engineer to mitigate the leakage by deleting the secret from the code or log, rotating the secret, and then hiding the new secret with our encryption tool sets.
One of the most popular and important methods of data protection is encryption: even if attackers infiltrate our systems, they cannot read encrypted sensitive data without the corresponding keys. However, breaches due to unencrypted sensitive data are unfortunately a common occurrence within the industry.
Why does it still happen? Secure encryption with proper key management is technically challenging, and organizations do not always know where sensitive data is stored. The DPP aims to abstract away these challenges by providing a data encryption service and client library that engineers can use. It also automatically discovers sensitive data, so that we don’t rely on manual identification.
Platform Architecture
The DPP aims to discover, understand, and protect our data. It integrates the services and tools we built to tackle different aspects of data protection. This end-to-end solution includes:
- Inspekt is our data classification service. It continuously scans Airbnb’s data stores to determine what sensitive and personal data types are stored within them.
- Angmar is our secret detection pipeline that discovers secrets in our codebase.
- Cipher is our data encryption service that provides a transparent framework for developers across Airbnb to easily encrypt and decrypt sensitive information.
- Obliviate is our orchestration service, which handles all privacy compliance requests. For example, when a user requests to be deleted from Airbnb, Obliviate will forward this request to all necessary Airbnb services to delete the user’s personal data from their data stores.
- Minister is our third-party risk and privacy compliance service that handles and forwards all privacy data subject rights requests to our external vendors.
- Madoka is our metadata service that collects security and privacy properties of our data assets from different sources.
- Finally, we have our Data Protection Service, a presentation layer where we define jobs to enable automated data protection actions and notifications using information from Madoka (e.g., automating integrations with our privacy framework).
Madoka: A Metadata System
Madoka is a metadata system for data protection that maintains the security and privacy related metadata for all data assets on the Airbnb platform. It provides a centralized repository that allows Airbnb engineers and other internal stakeholders to easily track and manage the metadata of their data assets. This enables us to maintain a global understanding of Airbnb’s data security and privacy posture, and plays an essential role in automating security and privacy across the company.
Implemented by two different services, a crawler and a backend, Madoka has three major responsibilities: collecting metadata, storing metadata, and providing metadata to other services.
The Madoka crawler is a daily crawling service that fetches metadata from other data sources, including GitHub, MySQL databases, S3 buckets, Inspekt (data classification service), etc. It then publishes the metadata onto an AWS Simple Queue Service (SQS) queue. The Madoka backend is a data service that ingests the metadata from the SQS queue, reconciles any conflicting information, and stores the metadata in its database. It provides APIs for other services to query the metadata findings.
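As a rough illustration of the crawler-to-backend flow described above, the following Python sketch serializes metadata findings, passes them through a queue, and groups them per asset on the receiving side. Everything here is an assumption made for illustration: the `MetadataMessage` fields, the `publish` and `ingest` helpers, and the `InMemoryQueue` class, which stands in for the SQS queue so the example runs without AWS access. It is not Madoka's actual message format or code.

```python
import json
from collections import deque
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MetadataMessage:
    """One metadata finding emitted by the crawler (hypothetical shape)."""
    asset_id: str    # e.g. "mysql://cluster/db/table/column"
    source: str      # which data source produced the finding
    attribute: str   # e.g. "owner" or "classification"
    value: str


class InMemoryQueue:
    """Stand-in for the SQS queue so the sketch runs locally."""

    def __init__(self):
        self._messages = deque()

    def send(self, body: str) -> None:
        self._messages.append(body)

    def receive_all(self):
        # Yield and remove messages in the order they were sent.
        while self._messages:
            yield self._messages.popleft()


def publish(queue, messages):
    """Crawler side: serialize each finding to JSON and enqueue it."""
    for message in messages:
        queue.send(json.dumps(asdict(message)))


def ingest(queue, store):
    """Backend side: drain the queue and group findings per asset and attribute.

    Values from different sources are kept side by side so that a later
    step can reconcile any conflicts between them.
    """
    for body in queue.receive_all():
        message = MetadataMessage(**json.loads(body))
        key = (message.asset_id, message.attribute)
        store.setdefault(key, {})[message.source] = message.value
    return store
```

Keeping one value per source, rather than overwriting, is a design choice of this sketch: it leaves conflict resolution to a separate step instead of letting the last message win.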
The primary metadata collected by Madoka includes:
- Data assets list
- Ownership
- Data classification
For each of the above we handle both MySQL and S3 formats.
Data Assets List
The first type of metadata that needs to be collected is the list of all data assets that exist at Airbnb, along with their basic metadata such as the schema, the location of the asset, and the asset type.
For MySQL, the crawler collects the list of all columns that exist within our production AWS account. It calls the AWS APIs to get the list of all clusters in our environment, along with their reader endpoints. The crawler then connects to each cluster using JDBI and lists all the databases, tables, and columns, along with each column’s data type.
The crawler retains the following metadata information and passes it along to the Madoka backend for storage:
- Cluster Name
- Database Name
- Table Name
- Column Name
- Column Data Type
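One way to gather the fields listed above is to query MySQL's standard `information_schema.columns` view. The sketch below is in Python rather than the Java/JDBI setup the crawler uses, and it assumes a DB-API style cursor that is already connected to a cluster's reader endpoint. The `ColumnRecord` type and `crawl_cluster` function are hypothetical names introduced for this example.

```python
from typing import NamedTuple

# information_schema.columns is a standard MySQL metadata view.
# MySQL's own system schemas are excluded from the listing.
LIST_COLUMNS_SQL = """
SELECT table_schema, table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema NOT IN
      ('information_schema', 'mysql', 'performance_schema', 'sys')
ORDER BY table_schema, table_name, ordinal_position
"""


class ColumnRecord(NamedTuple):
    """Mirrors the metadata fields listed above."""
    cluster_name: str
    database_name: str
    table_name: str
    column_name: str
    column_data_type: str


def crawl_cluster(cluster_name, cursor):
    """List every column on one cluster.

    `cursor` is assumed to be a DB-API 2.0 cursor connected to the
    cluster's reader endpoint.
    """
    cursor.execute(LIST_COLUMNS_SQL)
    return [ColumnRecord(cluster_name, *row) for row in cursor.fetchall()]
```

Reading from the reader endpoint keeps this metadata scan off the writer instance, which matches the source's note that the crawler collects reader endpoints.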
For S3, the crawler collects the list of all objects that exist within all of our AWS accounts.
At Airbnb, we use Terraform to configure AWS resources in code, including S3 buckets. The crawler parses the Terraform files to fetch the S3 metadata.
The crawler first fetches the list of all AWS account numbers and names, which are stored in a configuration file in our Terraform repository. It then fetches the list of all bucket names, since each bucket configuration is a file under the account’s subrepo.
In order to fetch the list of objects within a bucket, the crawler uses S3 inventory reports, a tool provided by AWS. This tool produces and stores a daily or weekly CSV file of all the objects contained in the bucket, along with their metadata. This is a much faster and less costly way of getting the list compared to calling the List AWS API. We’ve enabled inventory reports on all production S3 buckets in Terraform, and the bucket configuration will specify the location of the inventory report.
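To make the inventory-report step concrete, here is a minimal Python sketch that turns the text of one CSV inventory file into per-object records. It rests on a few assumptions that should be checked against the AWS documentation: that the CSV files carry no header row and take their column names from the `fileSchema` string in the report's manifest, and that object keys are URL-encoded. Real reports are also gzip-compressed and stored in S3; downloading and decompressing them is left out. `parse_inventory_csv` is a hypothetical helper name.

```python
import csv
import io
from urllib.parse import unquote


def parse_inventory_csv(csv_text, file_schema):
    """Yield one dict per object listed in an inventory CSV file.

    `file_schema` is the comma-separated column list taken from the
    report's manifest, e.g. "Bucket, Key, Size".
    """
    columns = [name.strip() for name in file_schema.split(",")]
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        record = dict(zip(columns, row))
        # Assumption: keys are URL-encoded in CSV inventory files.
        record["Key"] = unquote(record["Key"])
        yield record
```

A usage example with made-up bucket and key names:

```python
sample = '"my-bucket","logs/dt%3D2021-01-01/part-0.csv","1024"\n'
for obj in parse_inventory_csv(sample, "Bucket, Key, Size"):
    print(obj["Bucket"], obj["Key"], obj["Size"])
```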
The crawler retains the following information and passes it along to Madoka backend for storage:
- Account Number
- Account Name
- Assume Role Name
- Bucket Name
- Inventory Bucket Account Number
- Inventory Assume Role Name
- Inventory Bucket Prefix
- Inventory Bucket Name
- Object key
Ownership
Ownership is a metadata property that describes who owns a specific data asset.
We decided to collect service ownership, which allows us to link a data asset to a specific codebase, and therefore automate any data protection action that requires code changes.
We also decided to collect team ownership, which is crucial to perform any data protection action that requires an engineer to do some work or that requires a stamp of approval. We chose to collect team ownership and not user/employee ownership since team members constantly change, while the data asset remains with the team.
At Airbnb, since we migrated to a service-oriented architecture (SOA), most MySQL clusters belong to a single service and a single team. To determine service ownership, the crawler fetches the list of the services that connect to a MySQL cluster and sets the service with the most connections within the last 60 days as the owner of all the tables within that cluster. There are many services that connect to all clusters for monitoring, observability, and other common purposes, so we created a list of roles that should be filtered out when determining ownership.
There are still some legacy clusters in use that are shared amongst many services, where each service owns specific tables within the clusters. For those clusters, not all tables will have the correct service owner assigned, but we allow for a manual override to correct these mistakes.
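The ownership heuristic for MySQL described above can be sketched in a few lines of Python. The role names in `IGNORED_ROLES`, the shape of the connection records, the alphabetical tie-break, and both function names are assumptions made for this example, not details from our implementation.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical names for roles that connect to every cluster.
IGNORED_ROLES = frozenset({"monitoring", "observability", "backup"})


def infer_cluster_owner(connections, now, window_days=60, ignored=IGNORED_ROLES):
    """Return the service with the most recent connections, or None.

    `connections` is an iterable of (service_name, connected_at) pairs.
    Connections older than the window or made by ignored roles are skipped.
    """
    cutoff = now - timedelta(days=window_days)
    counts = Counter(
        service
        for service, connected_at in connections
        if connected_at >= cutoff and service not in ignored
    )
    if not counts:
        return None
    # Break ties alphabetically so the result is deterministic.
    return min(counts, key=lambda service: (-counts[service], service))


def table_owner(cluster, table, inferred_owner, overrides):
    """A manual override for a specific table wins over the inferred owner."""
    return overrides.get((cluster, table), inferred_owner)
```

The override lookup is keyed by table so that shared legacy clusters can be corrected one table at a time while every other table keeps the cluster-level owner.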
The crawler uses service ownership to determine team ownership, since at Airbnb, team ownership is defined within the service’s codebase on Git.
At Airbnb, all S3 buckets have a project tag in their Terraform configuration file, which defines which service owns the bucket. The crawler fetches the service ownership from that file and uses it to determine the team ownership, as described above for MySQL.
Data Classification
Data classification is a metadata property that describes what type of data elements are stored within the asset — e.g., a MySQL column which stores email addresses or phone numbers would be classified as personal data. Gathering data classifications allows us to understand the riskiness of each data set so we can determine the level of protection needed.
The crawler fetches the data classification from two different sources. First, it fetches data classifications from our Git repositories, since data owners can manually set the classifications in their data schema. However, relying on manual classifications is insufficient. Data owners do not always know what an asset contains, or they may forget to change the classifications when the data asset is updated to store new data elements.
The crawler will then fetch data classifications from our automated data classification tool, called Inspekt, which we will describe in detail in a later blog post. Inspekt continuously scans and classifies all of our major data stores, such as MySQL and S3. It outputs what data elements were found in each data asset. This ensures that our data is constantly monitored, and classifications are updated as data changes. As with any automated detection tool, precision and recall are never 100%, so false positives and false negatives may occur.
Since the crawler fetches the data classifications from two different sources, some discrepancies may arise, where the manual classification contains data elements not found by Inspekt or vice versa. The crawler will forward all findings to the Madoka backend, which will resolve any conflicts. The status of the manual classification is marked as new by default and the status of the Inspekt classification is marked as suggested. If the manual classification aligns with the Inspekt result, the classification is automatically confirmed. If there is any discrepancy, we file tickets to the data owners through the data protection service. If the Inspekt classification is correct, the owners may update the data schema in the Git repository, or they can mark the Inspekt classification as incorrect to resolve the conflict.
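The reconciliation rules above can be expressed as a small set comparison. The Python sketch below uses the status names from the text (new, suggested, confirmed) but makes its own assumptions: that statuses are tracked per data element rather than per asset, that a ticket is needed whenever the two sets differ, and the names `Reconciliation` and `reconcile`.

```python
from dataclasses import dataclass, field


@dataclass
class Reconciliation:
    # Maps (source, data_element) to a status string.
    statuses: dict = field(default_factory=dict)
    needs_ticket: bool = False


def reconcile(manual, inspekt):
    """Compare the data elements each source reports for one asset."""
    manual, inspekt = set(manual), set(inspekt)
    result = Reconciliation()
    # Elements both sources agree on are confirmed automatically.
    for element in manual & inspekt:
        result.statuses[("manual", element)] = "confirmed"
        result.statuses[("inspekt", element)] = "confirmed"
    # Unmatched findings keep their default status.
    for element in manual - inspekt:
        result.statuses[("manual", element)] = "new"
    for element in inspekt - manual:
        result.statuses[("inspekt", element)] = "suggested"
    # Any discrepancy is escalated to the data owner.
    result.needs_ticket = manual != inspekt
    return result
```

For example, `reconcile({"email"}, {"email", "phone_number"})` confirms `email`, leaves `phone_number` as suggested, and sets `needs_ticket` so the owner can either update the schema or mark the Inspekt finding as incorrect.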
Other Security and Privacy Related Attributes
Madoka also stores how data assets have integrated with our security and privacy tools. For example, we may store whether or not the data asset is encrypted using Cipher or is integrated with our privacy compliance service, Obliviate, for data subject rights requests. We built Madoka to be easily extensible and are constantly collecting and storing more security and privacy related attributes.
Conclusion
In this first post, we provided an overview of why we built the DPP, described the platform’s architecture, and dove into the data inventory component, Madoka. In our next post, we will focus on our data classification system that enables us to detect personal and sensitive data at scale. In our final post we will deep dive into how we’ve used the DPP to enable various security and privacy use cases.
Acknowledgements
The DPP was made possible thanks to many members of the data security team: Pinyao Guo, Julia Cline, Jamie Chong, Zi Liu, Jesse Rosenbloom, Serhi Pichkurov, and Gurer Kiratli. Thank you to the data governance team members for partnering and supporting our work: Andrew Luo, Shawn Chen, and Liyin Tang. Thank you Tina Nguyen for helping drive and make this blog post possible. Thank you to our leadership, Marc Blanchou, Brendon Lynch, Paul Nikhinson and Vijaya Kaza, for supporting our work. Thank you to previous members of the team who contributed greatly to the work: Lifeng Sang, Bin Zeng, Alex Leishman, and Julie Trias.
If this type of work interests you, see our career page for current openings.