Automating Data Protection at Scale, Part 2

Part two of a series on how we provide powerful, automated, and scalable data privacy and security engineering capabilities at Airbnb

Elizabeth Nammour, Pinyao Guo, Wendy Jin

Introduction

In Part 1 of our blog series, we introduced the Data Protection Platform (DPP), which enables us to protect data in compliance with global regulations and security requirements. We stressed that understanding our data, by keeping track of where personal and sensitive data is stored in our ecosystem, is a necessary building block to protecting the data. In this blog post, we will discuss the challenges companies often face when trying to pinpoint the exact location of personal and sensitive data. As a stopgap, many companies rely on engineers to manually keep track of where and how personal and sensitive data flows within their environments. However, relying on manual data classifications presents some challenges:

  1. Data is constantly evolving. This makes it challenging for engineers to have a global understanding of the data and how the data flows throughout a company’s infrastructure. Data can be replicated and propagated into different data stores. Also, new types of data can be collected as products arise or change.

To address these challenges, we built data classification tools to detect personal and sensitive data in our data stores, logs, and source code. Continue reading as we walk through the architecture of our data classification tools. Specifically, we will dive deep into the technical components of Inspekt, our data classification system for our data stores and logs, and Angmar, our secrets detection and prevention system for our codebase on Github Enterprise.

Inspekt: A Data Classification Service

Inspekt is an automated and scalable data classification tool that determines where personal and sensitive data is stored in our ecosystem. Inspekt consists of two services: the first, called the Task Creator, determines what needs to be scanned; the second, called the Scanner, samples and scans the data to detect personal and sensitive data.

Task Creator

The task creation system is responsible for determining what to scan and splitting it into tasks for the scanning system to ingest.

The Inspekt Task Creator periodically calls Madoka, our metadata service described in our previous blog post, to get a list of the data assets that exist at Airbnb. For MySQL and Hive data stores, the service fetches a list of all tables. For AWS S3, the service fetches a list of buckets for each AWS account and their corresponding list of object keys. Due to the sheer volume of data, the Task Creator randomly samples a small percentage of object keys from each bucket. For application logs, the service fetches the list of all services at Airbnb and their corresponding Elasticsearch clusters that store the logs. The Task Creator then creates an SQS message for each table/object/application, which is referred to as a task, and adds it to the scanning SQS queue, which will be consumed by the Scanner in a later stage.
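As a rough illustration, a Task Creator along these lines could emit one SQS message per MySQL table. The queue URL, message fields, and sampling amount below are assumptions for illustration, not Inspekt's actual schema:

```python
import json
import boto3

sqs = boto3.client("sqs")
# Hypothetical queue URL; the real queue name is internal.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/inspekt-scan-tasks"

def enqueue_mysql_table_tasks(tables):
    """Create one scan task per MySQL table returned by the metadata service."""
    for table in tables:
        task = {
            "data_store": "mysql",
            "database": table["database"],
            "table": table["name"],
            "sample_rows": 1000,  # illustrative sampling amount
        }
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(task))
```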

Scanner

The scanning system is responsible for sampling and scanning the data to detect personal information. Inspekt provides an interface for defining scanning methods: algorithms that scan the sampled data. For each data element, we define a "verifier" as a combination of one or more scanning methods.

Inspekt currently supports four types of scanning methods:

  • Regular expressions (Regexes): Regexes are useful for data elements that follow a fixed format, such as longitude and latitude coordinates, birthdates, and email addresses. We can define regexes that match either the metadata of a data asset (e.g., column name, object key name) or its content, and we can define them as either allowlists or denylists. For example, we can define a regex to detect data assets where the column name contains "birthdate", where the content contains the word "birthdate", or where the content does not contain the word "birthdate".

In Inspekt, verifiers are defined as JSON blobs and stored in a database that the Scanner reads from. This allows us to easily modify existing verifiers or add new verifiers to detect new data elements on the fly without redeploying the service.

Here is an example of a verifier configuration that detects data assets whose column name contains the word "birthdate" or whose content contains the word "birthdate":
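A minimal sketch of such a blob, with illustrative field names rather than Inspekt's actual schema:

```json
{
  "data_element": "birthdate",
  "scanning_methods": [
    {
      "type": "regex",
      "target": "metadata",
      "mode": "allowlist",
      "pattern": ".*birthdate.*"
    },
    {
      "type": "regex",
      "target": "content",
      "mode": "allowlist",
      "pattern": "\\bbirthdate\\b"
    }
  ],
  "match_if": "any"
}
```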

Inspekt Scanner is a distributed system running on Kubernetes. Depending on the workload (i.e., the number of tasks in the queue), it can scale horizontally as needed. Each Scanner node picks up task messages from the SQS task queue. For robustness, a message that is not successfully processed reappears in the queue (up to N times) until a Scanner node deletes it. A diagram of the Scanner architecture is shown below.

Figure 1: Inspekt Scanner Architecture

At startup, each Inspekt Scanner node fetches and initializes the verifiers from the database; the verifiers are periodically refreshed to pick up new verifiers or configuration changes. After initialization, each Scanner node fetches tasks from the queue created by the Task Creator. Each task message specifies the work to be done, i.e., which data asset to scan, the sampling amount, etc. The node then submits each task to a thread pool that performs the sampling and scanning job. The scanning job runs as follows:

  1. Inspekt Scanner connects to the data store specified in the task and samples data from it (a simplified worker sketch follows this list):
     • MySQL: the Scanner node connects to the MySQL database and samples a subset of rows from each table. To scan a different set of rows each time without causing a full table scan, we randomly generate a value X smaller than the maximum value of the primary key and select a subset of rows where the primary key >= X.
     • Hive: we sample a subset of rows from the latest partition of each table.
     • Service logs: we sample a subset of logs for each service per day. To get better coverage across different logs, we query our Elasticsearch log clusters to select logs from distinct logging points.
     • S3: we generate a random offset smaller than the object size and sample a customizable number of bytes starting from that offset. We also support scanning across AWS accounts: if an object lives in a different AWS account than the one the Scanner runs in, Inspekt automatically uses the proper Assume Role IAM permissions to access and read the object from the foreign account.
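Putting the pieces together, a heavily simplified sketch of a Scanner worker might look like the following. The queue URL, the `id` primary-key column, the verifier `scan` interface, and the `report_findings` helper are assumptions for illustration, not Inspekt's actual code:

```python
import json
import random
from concurrent.futures import ThreadPoolExecutor

import boto3
import pymysql

sqs = boto3.client("sqs")
# Hypothetical queue URL for illustration only.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/inspekt-scan-tasks"


def sample_mysql_rows(task, limit=1000):
    """Sample rows starting at a random primary-key offset to avoid a full table scan."""
    conn = pymysql.connect(host=task["host"], database=task["database"],
                           user="inspekt", password="***")
    try:
        with conn.cursor() as cur:
            cur.execute(f"SELECT MAX(id) FROM {task['table']}")
            max_pk = cur.fetchone()[0] or 0
            offset = random.randint(0, max_pk)  # random X below the max primary key
            cur.execute(f"SELECT * FROM {task['table']} WHERE id >= %s LIMIT %s",
                        (offset, limit))
            return cur.fetchall()
    finally:
        conn.close()


def report_findings(task, findings):
    """Stand-in for the downstream reporting described in Part 1 (illustrative only)."""
    print(task["table"], findings)


def process_task(message, verifiers):
    task = json.loads(message["Body"])
    rows = sample_mysql_rows(task)
    findings = [v.scan(rows) for v in verifiers]  # hypothetical verifier interface
    report_findings(task, findings)
    # Delete only after successful processing, so failed tasks reappear in the queue.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])


def run_worker(verifiers, pool_size=8):
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        while True:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                       MaxNumberOfMessages=10, WaitTimeSeconds=20)
            for message in resp.get("Messages", []):
                pool.submit(process_task, message, verifiers)
```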

Inspekt Quality Measurement Service

As described in our previous blog post, our Data Protection Platform leverages the classification results to initiate protection measures. For downstream stakeholders to trust and adopt classification results from Inspekt, we need to continuously ensure that each data element is detected with high quality. Too many false positives are disruptive to the teams alerted on the findings and erode trust in the platform. Too many false negatives mean we aren't catching all occurrences of a data element, raising privacy and security concerns.

Quality Measurement Strategy

To continuously monitor and improve the quality of each data element verifier, we built an Inspekt Quality Measurement service to measure their precision, recall, and accuracy.

For each data element, we store the true positive data and true negative data as the ground truth in our Inspekt Quality Measurement database. We then run the verifier against the ground truth dataset. From the true positive data, we output the number of true positives (TP) and the number of false negatives (FN) the verifier generated. From the true negative data, we output the number of false positives (FP) and true negatives (TN) the verifier generated. We can then calculate precision, recall, and accuracy from the TP, FN, FP, TN counts.
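Concretely, the three metrics are the standard ones computed from those four counts; the snippet below shows the calculation on illustrative numbers:

```python
def classification_metrics(tp, fn, fp, tn):
    """Precision, recall, and accuracy from the four counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return precision, recall, accuracy

# Example: a verifier that caught 90 of 100 true positives and flagged 5 of 200 true negatives.
print(classification_metrics(tp=90, fn=10, fp=5, tn=195))  # (~0.947, 0.9, 0.95)
```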

Figure 2: Inspekt Quality Calculation

Sampling and Populating Test Data

As discussed above, for each personal data element, we need to gather true positive and true negative data sets. For these metrics to be accurate, the data sets used for testing must be as comprehensive and as similar to production data as possible. We populate these data sets by periodically sampling data from the following sources:

  • Known datasets in production: Some columns in our online databases or our data warehouse are known to contain and represent a specific data element, e.g., a MySQL column that is known to store email addresses. We can use these columns as true positives.

Labeling

After sampling the data, we need to determine whether each sample represents a true positive or a true negative before storing it in our test data sets. To achieve this, we manually label the sampled data using AWS Ground Truth. For each data element, we've developed instructions and trained Airbnb employees to correctly label each sample as true or false. Our system uploads the raw data sampled for each data element to AWS S3 and creates a labeling job with the proper instructions in Ground Truth. Once the employees finish labeling the data, the labeled output is stored in an S3 bucket for our system to read. The Inspekt Quality Measurement service periodically checks the bucket to determine whether the labeled data is ready. If it is, the service fetches and stores the data in our test data sets, and then deletes the raw and labeled data from S3.
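A rough sketch of the check-and-fetch step, assuming a hypothetical bucket name and per-data-element prefix (not the actual pipeline code):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "inspekt-labeling-output"   # hypothetical bucket name
PREFIX = "birthdate/labeled/"        # hypothetical per-data-element prefix

def fetch_labeled_samples():
    """Fetch labeled output once Ground Truth has written it, then clean up S3."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    samples = []
    for obj in resp.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        samples.append(body)
        s3.delete_object(Bucket=BUCKET, Key=obj["Key"])  # delete after ingestion
    return samples
```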

Figure 3: Inspekt Test Labeling Pipeline

Re-Training ML Models

The labeled data from the Inspekt Quality Measurement service is also valuable for improving the verifiers themselves. In particular, it helps reinforce the performance of Inspekt's machine learning models: we add the newly labeled data to the existing training samples and re-train the models, obtaining better models for the corresponding data elements after each re-training.

Angmar: Secrets Detection and Prevention in Code

In the previous sections, we described how Inspekt detects personal and sensitive data in data stores and logs. However, some sensitive data, such as business and infrastructure secrets, may also exist in the company codebase, which could lead to serious vulnerabilities if leaked to unintended parties. We therefore expanded our scope with a secrets detection solution called Angmar, which combines detection and prevention to protect Airbnb secrets in Github Enterprise (GHE).

Architecture

Angmar is built in two parts: a CI check that detects secrets already pushed to GHE, and a Git pre-commit hook that prevents secrets from entering GHE in the first place.

We built a CI check to scan every commit pushed to the Airbnb GHE server. When a commit is pushed to GHE, a CI job, which is required to pass before a merge into the main branch is allowed, is kicked off. The CI job downloads an up-to-date, customized version of the open-source library Yelp/detect-secrets and runs a secret scanning job over each modified or added file in the commit. When the CI job detects secrets, it triggers the Data Protection Platform to create a JIRA ticket and automatically assigns the ticket to the code author for resolution. The details of ticket generation will be discussed in Part 3 of this blog series. We require all production secrets to be removed, rotated, and checked in again using our production secret management tool, Bagpiper, within SLA.
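For reference, the public Yelp/detect-secrets library exposes a simple scanning API; a CI job along these lines could scan each modified file. The sketch below uses the open-source library directly; the file names are placeholders, and Airbnb's customized fork may differ:

```python
from detect_secrets import SecretsCollection
from detect_secrets.settings import default_settings

def scan_changed_files(changed_files):
    """Scan the files modified in a commit and return any findings."""
    secrets = SecretsCollection()
    with default_settings():
        for path in changed_files:
            secrets.scan_file(path)
    return secrets.json()  # mapping of file -> detected secrets

# Placeholder file list; in CI this would come from the commit diff.
findings = scan_changed_files(["config/database.yml", "app/settings.py"])
if findings:
    raise SystemExit(f"Potential secrets detected in: {sorted(findings)}")
```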

Figure 4: Angmar Architecture

However, even with the CI check shepherding every push to the codebase, the window of secret exposure still poses security risks to the company infrastructure. In some cases, secret rotations can also be very costly in terms of engineering hours and expenses. We therefore took a proactive approach to prevent secrets from entering GHE in the first place, further reducing secret exposure and saving the effort of rotating secrets. We built an Angmar pre-commit tool, using the same custom detection library, that blocks developers from committing secrets. If secrets are detected in a commit, the git commit command prints an error and blocks the commit.
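A pre-commit hook in the same spirit could scan only the staged files and abort the commit on a finding. The sketch below is a minimal illustration built on the open-source library, not the actual Angmar tool:

```python
#!/usr/bin/env python3
"""Git pre-commit hook sketch: block commits that contain detectable secrets."""
import subprocess
import sys

from detect_secrets import SecretsCollection
from detect_secrets.settings import default_settings

def staged_files():
    """List files added or modified in the commit being created."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def main():
    secrets = SecretsCollection()
    with default_settings():
        for path in staged_files():
            secrets.scan_file(path)
    findings = secrets.json()
    if findings:
        print(f"Commit blocked: potential secrets found in {sorted(findings)}",
              file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```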

Customization

We made a few customizations to the detect-secrets open-source library for Airbnb’s use case:

  • We added a few secret data elements specific to Airbnb into the library plugins.

Future Work

We are continuously improving and expanding Inspekt and Angmar to scan more data sources and detect more personal and sensitive data elements. A few initiatives we are currently exploring or working on include:

  1. Scanning Thrift interface description language API requests and responses to keep track of how personal and sensitive data flows between services.

Alternatives

There are several commercial solutions on the market that tackle data classification. Before building out our solution, we evaluated a few commercial vendors to see if we could leverage existing tools rather than building our own solution. We decided to build an in-house solution for the following reasons:

  • Data store coverage: We needed a tool that could cover most of the data stores in our ecosystem, since building a custom tool for a subset of data stores would require nearly as much effort as building it for all of them. Most vendors only support scanning SaaS applications and S3 buckets.

Conclusion

In this second post, we dove deep into the motivations and architecture of our data classification systems, which enable us to detect personal and sensitive data at scale. In our next post, we will dive deep into how we've used the Data Protection Platform to enable various security and privacy use cases.

Acknowledgments

Inspekt and Angmar were made possible by all members of the data security team: Shengpu Liu, Jamie Chong, Zi Liu, Jesse Rosenbloom, Serhi Pichkurov, and the PM team, Julia Cline and Gurer Kiratli. Thank you to Bo Zeng from the AI Labs team for helping develop the Inspekt machine learning models. Thanks to our leadership, Marc Blanchou, Joy Zhang, Brendon Lynch, Paul Nikhinson, and Vijaya Kaza, for supporting our work. Thank you to Aaron Loo and Ryan Flood from the Security Engineering team for their support and advice. Thank you to the data governance team members for partnering with and supporting our work: Andrew Luo, Shawn Chen, and Liyin Tang. Thank you to Tina Nguyen and Cristy Schaan for helping drive and make this blog post possible. Thank you to previous members of the team who contributed greatly to this work: Lifeng Sang, Bin Zeng, Prasad Kethana, Alex Leishman, and Julie Trias.

If this type of work interests you, see https://careers.airbnb.com for current openings.

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.
