Automating Data Protection at Scale, Part 2
Part two of a series on how we provide powerful, automated, and scalable data privacy and security engineering capabilities at Airbnb
In Part 1 of our blog series, we introduced the Data Protection Platform (DPP), which enables us to protect data in compliance with global regulations and security requirements. We stressed that understanding our data, by keeping track of where personal and sensitive data is stored in our ecosystem, is a necessary building block for protecting it. In this blog post, we will discuss the challenges companies often face when trying to pinpoint the exact location of personal and sensitive data. As a stopgap, many companies rely on engineers to manually keep track of where and how personal and sensitive data flows within their environments. However, relying on manual data classification presents some challenges:
- Data is constantly evolving. This makes it challenging for engineers to have a global understanding of the data and how the data flows throughout a company’s infrastructure. Data can be replicated and propagated into different data stores. Also, new types of data can be collected as products arise or change.
- Manual classification is more prone to error. Engineers may forget whether a data asset contains personal data, or, as with freeform user entries, may not know what an asset contains.
- The set of security and privacy data elements keeps expanding. Engineers have to repeat the manual classification exercise for every new data element required by new privacy regulations and security compliance requirements, which incurs high costs and low labor efficiency for companies.
- Secrets can leak into our codebase and various data stores. Secrets, such as production API keys, vendor secrets, and database credentials, are commonly used by engineers. Secret leakage in a codebase is a known issue in the tech industry, usually caused by code accidentally committed by engineers, and such leaks are not always caught by reviewers. Once checked in, secrets become needles in a haystack and cannot be easily discovered.
To address these challenges, we built data classification tools to detect personal and sensitive data in our data stores, logs, and source code. Continue reading as we walk through the architecture of our data classification tools. Specifically, we will dive deep into the technical components of Inspekt, our data classification system for our data stores and logs, and Angmar, our secrets detection and prevention system for our codebase on Github Enterprise.
Inspekt: A Data Classification Service
Inspekt is an automated and scalable data classification tool that determines where personal and sensitive data is stored in our ecosystem. Inspekt consists of two services: the first service, called the Task Creator, determines what needs to be scanned, and the second service, called the Scanner, samples and scans the data to detect personal and sensitive data.
The task creation system is responsible for determining what to scan and splitting it into tasks for the scanning system to ingest.
The Inspekt Task Creator periodically calls Madoka, our metadata service described in our previous blog post, to get a list of the data assets that exist at Airbnb. For MySQL and Hive data stores, the service fetches a list of all tables. For AWS S3, the service fetches a list of buckets for each AWS account and their corresponding list of object keys. Due to the sheer volume of data, the Task Creator randomly samples a small percentage of object keys from each bucket. For application logs, the service fetches the list of all services at Airbnb and their corresponding Elasticsearch clusters that store the logs. The Task Creator then creates an SQS message for each table/object/application, referred to as a task, and adds it to the scanning SQS queue, which the Scanner consumes in a later stage.
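The fan-out from data assets into scan tasks can be sketched as follows. This is a minimal illustration using a stdlib in-process queue in place of SQS; the task fields and asset shapes are assumptions, not Inspekt's actual schema.

```python
import json
import queue

# Stand-in for the scanning queue; the real system uses AWS SQS.
task_queue = queue.Queue()

def create_tasks(assets, default_sample_size=100):
    """Fan each data asset out into one scan task (hypothetical task shape)."""
    for asset in assets:
        task = {
            "datastore": asset["datastore"],  # e.g. "mysql", "hive", "s3", "logs"
            "target": asset["target"],        # table name, object key, or service name
            "sample_size": asset.get("sample_size", default_sample_size),
        }
        task_queue.put(json.dumps(task))

# Example: assets as a metadata service such as Madoka might return them.
assets = [
    {"datastore": "mysql", "target": "users.profiles"},
    {"datastore": "s3", "target": "logs-bucket/2023/01/app.log"},
]
create_tasks(assets)
```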
The scanning system is responsible for sampling and scanning the data to detect personal information. Inspekt provides an interface to define scanning methods, i.e., algorithms that scan sampled data. For each data element, we define a “verifier” as a combination of one or multiple scanning methods.
Inspekt currently supports four types of scanning methods:
- Regular expressions (Regexes): Regexes are useful for data elements that follow a fixed format, such as longitude and latitude coordinates, birthdates, email addresses, etc. Inspekt allows us to define regexes to either match with the metadata (e.g. column name, object key name) of the data asset or with the content of the asset. Inspekt allows us to define regexes as both allowlists and denylists. For example, we can define a regex to detect data assets where the column name contains “birthdate”, or where the content contains the word “birthdate”, or where the content does not contain the word “birthdate”.
- Tries: Some data elements that we collect don’t follow a fixed pattern, such as first and last name, and cannot be detected using regexes. When the data element is stored in a known source data store, we can make use of the Aho-Corasick algorithm, which uses tries to detect substring matches of a finite sample of that data.
- Machine Learning (ML) models: A number of data elements cannot be detected accurately or effectively using regex-based scanning, for several reasons. First, some data elements, such as physical addresses, have varying formats or a non-exhaustive range of content. Second, as a global company that operates in more than 200 countries, Airbnb hosts data in many different languages. Third, some data, such as images, is not text-based and thus cannot be recognized using regular scanning methods. Machine learning-based algorithms are a natural fit for these challenges. We developed different machine/deep learning models, such as a multitask CNN, BERT-NER, and a WiDeText classification model, for the detection of several complex data elements. The models are trained either using data samples from our production database, such as user addresses from Airbnb’s listings tables, or using public datasets and models pre-trained on large text corpora. We host these models on Bighead, Airbnb's machine learning platform, which serves API endpoints for Inspekt to detect each data element using the machine learning scanning methods.
- Hardcoded methods: Some data elements that we collect follow a fixed pattern, but are either too complicated to describe in a regex, or there already exists an open-source solution that detects the data elements with high quality. Inspekt allows us to define a code block to detect a data element. For instance, we created an International Bank Account Number (IBAN) data element verifier leveraging a validator from an open-source library.
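The IBAN check mentioned above is a good example of a hardcoded method. The production verifier relies on an open-source validator; a minimal sketch of the same idea using the standard mod-97 algorithm:

```python
def is_valid_iban(iban: str) -> bool:
    """Minimal IBAN check: normalize, rearrange, map letters to numbers, mod-97 == 1."""
    iban = iban.replace(" ", "").upper()
    if not (15 <= len(iban) <= 34) or not iban.isalnum():
        return False
    # Move the country code and check digits to the end.
    rearranged = iban[4:] + iban[:4]
    # Map 'A'->10, 'B'->11, ..., 'Z'->35; digits map to themselves.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

A real validator would additionally check the country-specific length and structure tables, which is why leveraging an existing open-source library is attractive here.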
In Inspekt, verifiers are defined as JSON blobs and stored in a database that the Scanner reads from. This allows us to easily modify existing verifiers or add new verifiers to detect new data elements on the fly without redeploying the service.
Here is an example of a verifier configuration that aims to detect any column name that contains the word “birthdate”, or where the content contains the word “birthdate”:
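A configuration along these lines captures the idea (the field names here are illustrative, not Inspekt's actual internal schema):

```json
{
  "name": "birthdate_verifier",
  "data_element": "BIRTHDATE",
  "scanning_methods": [
    {"type": "metadata_regex", "target": "column_name", "pattern": ".*birthdate.*"},
    {"type": "content_regex", "mode": "allowlist", "pattern": "\\bbirthdate\\b"}
  ]
}
```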
Inspekt Scanner is a distributed system running on Kubernetes. Depending on the workload (i.e., the number of tasks in the queue), it can scale horizontally as needed. Each Scanner node picks up task messages from the SQS task queue. For scanning robustness, each message reappears in the queue up to N times until a Scanner node successfully processes and deletes it. A diagram of the Scanner architecture is shown below.
At startup, each Inspekt Scanner node fetches and initializes the verifiers from the database. The verifiers are periodically refreshed to pick up new verifiers or configuration changes. After verifier initialization, each Scanner node fetches a task from the task queue created by the Task Creator. Each task specifies the work to be executed, i.e., which data asset to scan, the sampling amount, etc. The node then submits each task to a thread pool that performs the sampling and scanning job. The scanning job runs as follows:
- Inspekt Scanner connects to the data store specified in the task and samples data from it. For MySQL, the Scanner node connects to the MySQL database and samples a subset of rows from each table. To scan a different set of rows each time without causing a full table scan, we randomly generate a value X smaller than the maximum value of the primary key and select a subset of rows where the primary key >= X. For Hive, we sample a subset of rows from each table's latest partition. For service logs, we sample a subset of logs for each service per day. To get better coverage over different logs, we query our Elasticsearch log clusters to select logs from distinct logging points. For S3, we generate a random offset smaller than the object size and sample a customizable number of bytes starting from that offset. We also support scanning across AWS accounts. If an object is in a different AWS account than the one where the scanner is running, Inspekt automatically uses the proper Assume Role IAM permissions to access and read the objects from the foreign account.
- For each piece of sampled data from the data stores, Inspekt Scanner runs each verifier against the sample to determine whether any match was found.
- Inspekt Scanner stores the matching results in a database. For each match, we store the metadata of the data asset where the match was found, the matched content, and the verifier it matched. We also store a subset of this information, containing just the data asset and the data element found, in a separate table. We periodically delete records from the matching results table to ensure the security and privacy of our data.
- Inspekt Scanner deletes the SQS message once the task completes successfully.
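The random-offset MySQL sampling step above can be sketched as follows, using SQLite in place of MySQL and illustrative table and column names:

```python
import random
import sqlite3

# Set up a toy table standing in for a production MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, email) VALUES (?, ?)",
    [(i, f"user{i}@example.com") for i in range(1, 1001)],
)

def sample_rows(conn, table, pk, limit=100):
    """Sample a window of rows starting at a random primary-key offset,
    so each scan sees different rows without a full table scan.
    (Identifiers are interpolated directly here only because they are
    trusted constants in this sketch.)"""
    (max_pk,) = conn.execute(f"SELECT MAX({pk}) FROM {table}").fetchone()
    start = random.randint(1, max_pk)
    return conn.execute(
        f"SELECT * FROM {table} WHERE {pk} >= ? LIMIT ?", (start, limit)
    ).fetchall()

rows = sample_rows(conn, "users", "id", limit=50)
```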
Inspekt Quality Measurement Service
As described in our previous blog post, our Data Protection Platform leverages the classification results to initiate protection measures. For downstream stakeholders to trust and adopt classification results from Inspekt, we need to continuously ensure that each data element is being detected with high quality. Too many false positives are disruptive to the teams alerted with the findings and would discredit the platform. Too many false negatives mean we aren’t successfully catching all occurrences of a data element, raising privacy and security concerns.
Quality Measurement Strategy
To continuously monitor and improve the quality of each data element verifier, we built an Inspekt Quality Measurement service to measure their precision, recall, and accuracy.
For each data element, we store the true positive data and true negative data as the ground truth in our Inspekt Quality Measurement database. We then run the verifier against the ground truth dataset. From the true positive data, we output the number of true positives (TP) and the number of false negatives (FN) the verifier generated. From the true negative data, we output the number of false positives (FP) and true negatives (TN) the verifier generated. We can then calculate precision, recall, and accuracy from the TP, FN, FP, TN counts.
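The metric computation above is straightforward; a minimal sketch:

```python
def quality_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Precision, recall, and accuracy from the four confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"precision": precision, "recall": recall, "accuracy": accuracy}

# e.g., a verifier that found 90 of 100 true samples and mistakenly
# flagged 10 of 400 negative samples:
m = quality_metrics(tp=90, fn=10, fp=10, tn=390)
# precision = 90/100 = 0.9, recall = 90/100 = 0.9, accuracy = 480/500 = 0.96
```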
Sampling and Populating Test Data
As discussed above, for each personal data element, we need to gather true positive and true negative data sets. For these metrics to be accurate, the data sets used for testing must be as comprehensive and similar to production data as possible. We populate this data set by periodically sampling data from the following sources:
- Known datasets in production: Some columns in our online databases or our data warehouse are known to contain and represent a specific data element, e.g., a MySQL column that is known to store email addresses. We can use these columns as true positives.
- Inspekt results: As Inspekt runs and generates results, each result represents either a true positive or a false positive. We can therefore use these results to populate the data set.
- Known freeform/unstructured data: Some columns in our online databases or our data warehouse represent freeform user-entered data or unstructured blobs, such as messages, JSON objects, etc. These columns could contain any type of data element and represent a good test data source to ensure our system detects data elements in unstructured formats and different edge cases.
- Generated fake, synthesized data: Some data elements, such as a user’s height and weight, are not present very often in our data stores, and there are no known source columns that store them. To have enough test data for these data elements, we generate fake data in the proper format and populate our test database with it.
After sampling the data, we need to determine whether each sample represents a true positive or true negative before storing it in our test data sets. To achieve this, we manually label the sampled data using Amazon SageMaker Ground Truth. For each data element, we’ve developed instructions and trained Airbnb employees to correctly label each sample as true or false. Our system uploads the raw sampled data for each data element to AWS S3 and creates a labeling job with the proper instructions in Ground Truth. Once the employees finish labeling the data, the labeled output is stored in an S3 bucket for our system to read. The Inspekt Quality Measurement service periodically checks the bucket to determine whether the labeled data is ready. If it is, the service fetches and stores the data in our test data sets, and then deletes the raw and labeled data from S3.
Re-Training ML Models
The labeled data from the Inspekt Quality Measurement service is also valuable for improving the performance of Inspekt verifiers. In particular, the labeled results are a useful source for reinforcing Inspekt's machine learning models: during periodic re-training, the newly labeled data is added to the training samples, yielding better models for the corresponding data elements.
Angmar: Secrets Detection and Prevention in Code
In the previous sections, we described how Inspekt detects personal and sensitive data in data stores. However, some sensitive data, such as business and infrastructure secrets, may also exist in the company codebase, which could lead to serious vulnerabilities if leaked to unintended parties. We therefore expanded our scope with a secrets detection solution called Angmar, which combines detection and prevention approaches to protect Airbnb secret data in Github Enterprise (GHE).
Angmar is built as two parts: a CI check that detects secrets pushed to GHE, and a Git pre-commit hook that prevents secrets from entering GHE in the first place.
We built a CI check to scan every commit pushed to the Airbnb GHE server. When a commit is pushed to GHE, a CI job, which is required to pass before a merge into the main branch is allowed, is kicked off. The CI job downloads an up-to-date, customized version of the open-source library Yelp/detect-secrets and runs a secret scanning job over each modified or added file in the commit. When the CI job detects secrets, it triggers the Data Protection Platform to create a JIRA ticket and automatically assign it to the code author for resolution. The details of the ticket generation will be discussed in part 3 of this blog series. We require all production secrets to be removed, rotated, and checked in again using our production secret management tool, Bagpiper, within SLA.
However, even with the CI check shepherding every push to the codebase, the window of secret exposure still poses security risks to the company infrastructure. Also, in some cases, secret rotation can be very costly in terms of engineering hours and expenses. Therefore, we took a proactive approach to prevent secrets from entering GHE in the first place, further reducing secret exposure and saving the effort of rotating secrets. We built an Angmar pre-commit tool using the same custom detection library, which blocks developers from committing secrets. When secrets are detected in a commit, the git commit command prompts an error and blocks the commit.
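At their core, tools in the detect-secrets family combine known-credential regexes with entropy heuristics on suspicious-looking tokens. A minimal sketch of that idea (the patterns and threshold here are illustrative, not detect-secrets' actual plugins):

```python
import math
import re

# Illustrative patterns; real plugins cover many more credential formats.
AWS_KEY_RE = re.compile(r"AKIA[0-9A-Z]{16}")        # AWS access key ID shape
CANDIDATE_RE = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")  # long token candidates

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character; random secrets score high."""
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs)

def find_secrets(line: str, entropy_threshold: float = 4.5):
    """Flag known key formats plus high-entropy tokens on a line of code."""
    findings = [m.group() for m in AWS_KEY_RE.finditer(line)]
    for token in CANDIDATE_RE.findall(line):
        if token not in findings and shannon_entropy(token) > entropy_threshold:
            findings.append(token)
    return findings
```

A pre-commit hook would run a check like this over each staged hunk and exit non-zero when `find_secrets` returns anything, which is what causes `git commit` to abort.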
We made a few customizations to the detect-secrets open-source library for Airbnb’s use case:
- We added a few secret data elements specific to Airbnb into the library plugins.
- Our analysis of false positive detections from the library showed that some test secrets, staging secrets, and placeholders were falsely detected as secrets. We therefore added path-filtering logic to skip certain files within a commit.
- We also implemented some deduplication logic that uses hashing to reduce repetitive tickets due to modifications on the same file in different commits.
- In rare cases when false positives occur, we allow developers to skip certain lines of code or certain files to avoid blocking emergency code merges into production. The security team reviews the skipped code regularly to make sure no actual secrets are bypassed.
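The hashing-based deduplication mentioned above can be sketched as follows; the fingerprint scheme is illustrative, not Angmar's actual implementation:

```python
import hashlib

# Fingerprints of findings that already have a ticket.
seen = set()

def is_duplicate_finding(file_path: str, secret: str) -> bool:
    """Hash the (file, secret) pair so repeated commits touching the same
    file do not generate repeated tickets for the same finding."""
    fingerprint = hashlib.sha256(f"{file_path}:{secret}".encode()).hexdigest()
    if fingerprint in seen:
        return True
    seen.add(fingerprint)
    return False
```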
We are continuously improving and expanding Inspekt and Angmar to scan more data sources and detect more privacy and sensitive data elements. A few initiatives we are currently exploring or working on include:
- Scanning Thrift interface description language API requests and responses to keep track of how personal and sensitive data flows between services.
- Scanning our third-party applications, such as Google Drive and Box, to understand data lineage: how data flows into third-party applications, and how it is accessed both internally and externally.
- Expanding our scanning capabilities to more data stores that are used at Airbnb, such as DynamoDB, Redis, etc.
There are several commercial solutions on the market that tackle data classification. Before building out our solution, we evaluated a few commercial vendors to see if we could leverage existing tools rather than building our own solution. We decided to build an in-house solution for the following reasons:
- Data store coverage: We needed a tool that could cover most of the data stores that exist in our ecosystem, since building a custom tool for a subset of data stores would require a very similar amount of effort as building it for all of our data stores. Most vendors only support scanning SaaS applications and S3 buckets.
- Customized scanning: We needed to be able to customize which scanning algorithms to run. This is important because we want to make sure we can scan for all personal and sensitive data elements and get the best performance (precision, recall, accuracy) for each of them. Many vendors support scanning against a custom regex, but none support scanning against a custom ML model.
- Cost efficiency: We found that for our purposes, building our solution would be much more cost-efficient than using a commercial solution.
In this second post, we dove deep into the motivations and architecture of our data classification system that enables us to detect personal and sensitive data at scale. In our next post, we will deep dive into how we’ve used the data protection platform to enable various security and privacy use cases.
Inspekt and Angmar were made possible by all members of the data security team: Shengpu Liu, Jamie Chong, Zi Liu, Jesse Rosenbloom, Serhi Pichkurov, and the PM team, Julia Cline and Gurer Kiratli. Thanks to Bo Zeng from the AI Labs team for helping develop the Inspekt machine learning models. Thanks to our leadership, Marc Blanchou, Joy Zhang, Brendon Lynch, Paul Nikhinson, and Vijaya Kaza, for supporting our work. Thanks to Aaron Loo and Ryan Flood from the Security Engineering team for their support and advice. Thank you to the data governance team members for partnering with and supporting our work: Andrew Luo, Shawn Chen, and Liyin Tang. Thank you to Tina Nguyen and Cristy Schaan for helping drive and make this blog post possible. Thank you to previous members of the team who contributed greatly to this work: Lifeng Sang, Bin Zeng, Prasad Kethana, Alex Leishman, and Julie Trias.
If this type of work interests you, see https://careers.airbnb.com for current openings.
All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.