Airbnb’s Approach to Access Management at Scale

How Airbnb securely manages permissions for our large team of employees, contractors, and call center staff.

By: Paul Bramsen

Introduction

Airbnb is a company that is built on trust. An important piece of this trust comes from protecting the data that our guests and hosts have shared with us. One of the ways we do this is by following the principle of least privilege. Least privilege dictates that–in an ideal world–an employee has the exact permissions they need at the moment their job requires them. Nothing more, nothing less. Anything more introduces unnecessary risk–whether from a malicious employee, compromised laptop, or even just an honest mistake. Anything less inhibits productivity.

Not only has enforcing least privilege always been crucial for maintaining trust, it’s rapidly becoming a legal necessity. Airbnb operates in almost every country and region in the world, which requires us to comply with an ever-increasing set of data privacy regulations.

In a small company, where one person can keep track of what every colleague is working on, administrators can solve these problems effectively with minimal tooling. But as a company grows, this approach does not scale. In this post, we will explain how Airbnb uses a novel software solution to maintain least privilege while enabling our large team of employees, contractors, and call center agents to do our jobs effectively and efficiently.

Where We Started

In Airbnb’s early days, we implemented a combination of homegrown and vendor solutions, but the lack of a unifying architecture prevented us from scaling. The hodge-podge of systems used to control access made it difficult to hit either of our least privilege goals:

  • It was often unclear where employees could get needed permissions, hampering productivity.
  • Projects aimed at reducing unnecessary access (i.e., driving least privilege) required significant effort across many systems. Integrating access control with a new system took months of engineering effort when it should have taken one or two days.

Ultimately, these factors led to growing operational burden, reduced security, and increased hours required for compliance efforts. This led us to the following conclusion: we need a single place to manage employee access.

Clarifying Focus

Having determined the need for centralized access control, we worked to set guiding principles for the solution we would implement. Ultimately, we boiled down the requirements for our system to two goals:

  1. The access control system should manage the entirety of the processes and logic around a permission’s lifecycle. This includes:
    – Self-serve ways to request or revoke permissions.
    – Settings to control who has to approve new permissions.
    – Tools for managing groups of permissions.
    – Settings for automatic permission expiration.
    – Logging to meet operational and compliance requirements.
    – Notifications about relevant permission updates like upcoming expirations or when an approval is required.
    All of these features should be controlled declaratively for each available permission and the system should use these declarations to implement all necessary logic and actions.
  2. We wanted to build a system that could easily and robustly integrate with any permission store (e.g., AWS IAM, LDAP, Apache Ranger, MySQL, Himeji) without the need to modify it. To use a network analogy, the permission stores would be the data plane that enforces authorization, while our access control system would be the control plane that coordinates everything. This requirement led us to focus on providing the interface needed to efficiently synchronize permission changes from the central access control system into the permission stores. This would be accomplished using a little glue code for each store (allowing us to maintain the generality of the central system).
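
To make the first goal more concrete, here is a minimal sketch of what a declarative permission definition could look like. The `PermissionPolicy` class, its fields, and the two helper functions are hypothetical names chosen for illustration; the post does not describe the platform's actual configuration format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class PermissionPolicy:
    """Hypothetical declarative description of one requestable permission."""
    name: str
    approvers: Tuple[str, ...]            # roles that must approve a request
    expires_after: Optional[timedelta]    # None means the grant never expires
    notify_before_expiry: timedelta = timedelta(days=7)


def expiry_for(policy: PermissionPolicy, granted_at: datetime) -> Optional[datetime]:
    """Derive the expiration time purely from the declaration."""
    if policy.expires_after is None:
        return None
    return granted_at + policy.expires_after


def needs_expiry_notice(policy: PermissionPolicy, granted_at: datetime, now: datetime) -> bool:
    """True while `now` falls inside the notification window before expiry."""
    expiry = expiry_for(policy, granted_at)
    if expiry is None:
        return False
    return expiry - policy.notify_before_expiry <= now < expiry
```

The point of the sketch is that the permission owner only declares settings; generic logic like `expiry_for` then applies to every permission without per-permission code.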

We also clarified what the system would not be.

  • Not a hyper-reliable, hyper-low-latency way to answer online permission checks. The permission stores themselves would answer online authorization queries and, because we would sync with them, they could automatically act as a cache if the central access control system went down. While availability and performance are always important, our primary focus would be on the permission management logic.
  • Not a place to dump one-off authorization code. Some of the prior permission management systems had evolved into authorization code dumping grounds, incurring significant technical debt.
  • Not a place to store permissions for our guests and hosts. Public product permission management requirements are generally quite different from permissions we grant to employees to access our internal tooling and data to do their jobs. The scales also generally differ by many orders of magnitude. Additionally, internal permissions are usually significantly more complex. So it makes sense to handle each case separately.

If you could only take one thing away from this post, take away these goals. Clarifying our focus and using these two goals as our north star was the most critical step in building our centralized access control platform.

Build Vs Buy

We evaluated a number of products on the market, but none of them solved for our specific goals. Generally, permissions were managed by a small group of knowledgeable administrators, which turned the approval process into operational work and failed our first goal. Additionally, integrations usually required modifying the client. While some of the permission stores already had plugins (e.g., an LDAP plugin), others did not.

We hope that eventually a startup builds a solution that implements a centralized, self-serve, easy-to-plug-in model. We think this could provide a lot of value to other companies that don’t have the scale to justify building an in-house solution like ours.

Architecture

Each stage makes requests to the prior stage as updates flow through the system from left to right. Note that for the purposes of this post we are only considering stages 2 and 3. We can assume stages 1, 4, and 5 already exist.

We designed a system with a linear five-stage architecture. Changes flow from left to right. The architecture is linear in the sense that each stage can query the previous stage, but no others. For example, the Access Control Platform can query Employee Data Systems and can be queried by Connectors, but never communicates directly with Permission Stores.

A stage can also have limited communication with the immediately following stage through loosely coupled channels like queues or callbacks. For example, the Access Control Platform can enqueue an update message that will be consumed by a Connector to trigger a permission update.

  1. Employee Data Systems
    These are the HR systems (e.g., LDAP) that contain employee data (e.g., title, location, status, management chain). The Access Control Platform ingests this data to enable features like dynamic groups based on title and approval flows based on management chain. These systems are owned by the IT team.
  2. Access Control Platform
    This is the core system. This includes all the business logic to manage permissions as well as the UI that employees use to make and/or approve changes. The Access Control Platform is highly configurable but does not directly interact with any permission stores that integrate with it. The security team owns this system.
  3. Connectors
    Connectors are the glue code that connects the Access Control Platform to the Permission Stores. They serve two purposes. First, connectors tell the Access Control Platform what permissions should be available for request. For example, the data warehouse connector might make read access to the users and reservations Hive tables available for request. Second, connectors synchronize granted permissions into the appropriate permission store. For example, if user bob received access to read reservations, the data warehouse connector would synchronize this permission into Apache Ranger. Since connectors are simply responding to messages on a queue by making the appropriate API calls, they can run in whatever environment their owner deems best (e.g., Kubernetes job, AWS Lambda, Airflow DAG). They are owned and operated by the team that owns the corresponding permission store. For example, the storage team owns the MySQL connector.
  4. Permission Stores
    Permission stores are the systems that store the permissions and answer permission queries, for example AWS IAM, LDAP, Apache Ranger, MySQL’s built-in permission system, Himeji, or other internal systems. Note that in some cases Permission Stores may be built into clients, in which case stages 4 and 5 are combined, as is the case for MySQL.
  5. Clients
    The clients are all the systems that the end user needs — for example, SSH, Apache Superset, MySQL, internal customer support tools, Salesforce, etc.
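
The two responsibilities of a connector described in stage 3 can be sketched as a small interface. This is an illustrative assumption, not the platform's real API: the `Connector` base class, its method names, the permission identifiers, and the in-memory set standing in for Apache Ranger are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List, Set, Tuple


class Connector(ABC):
    """Hypothetical interface covering a connector's two purposes."""

    @abstractmethod
    def available_permissions(self) -> List[str]:
        """Permissions the platform should offer for request."""

    @abstractmethod
    def apply(self, user: str, permission: str, granted: bool) -> None:
        """Write the given state into the permission store."""


class InMemoryWarehouseConnector(Connector):
    """Toy data warehouse connector; a set stands in for the real store."""

    def __init__(self) -> None:
        self.store: Set[Tuple[str, str]] = set()

    def available_permissions(self) -> List[str]:
        return ["hive.users.read", "hive.reservations.read"]

    def apply(self, user: str, permission: str, granted: bool) -> None:
        if permission not in self.available_permissions():
            raise ValueError(f"unknown permission: {permission}")
        if granted:
            self.store.add((user, permission))
        else:
            # discard() is a no-op when the grant is already absent
            self.store.discard((user, permission))
```

Because the platform only sees this narrow interface, everything specific to a store stays inside the connector that the store's owning team maintains.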

Benefits Realized

Two years ago we implemented this architecture and since then we’ve integrated many systems into this centralized Access Control Platform. Here we highlight a few of the benefits we’ve realized.

Security

One of the biggest wins for security is having a single place where we can implement new least privilege features and then apply them across the board (as opposed to implementing once for AWS, once for MySQL, once for SSH, etc). A great case study is usage-based expiration. Usage-based expiration is a feature where permissions that have not been used for a significant period of time are automatically revoked. This approach is good for security because unnecessary permissions are quickly cleaned up. But it is also good for the user experience because employees can rest assured that the permissions being removed aren’t the ones they use regularly. Before the revocations happen, the Access Control Platform notifies impacted employees about the upcoming change and provides instructions on what to do if they need to keep the permissions. The notifications also provide links to low-friction ways to get the permissions back after the revokes happen should they realize later that the permissions were needed.
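
A minimal sketch of the decision logic behind usage-based expiration follows. The 90-day idle limit, the 14-day notice window, and the function name are assumptions for illustration; the post does not state the real thresholds. The sketch also assumes that `last_used` holds the grant time for permissions that have never been used.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Grant = Tuple[str, str]  # (user, permission)


def plan_expirations(
    last_used: Dict[Grant, datetime],
    now: datetime,
    idle_limit: timedelta = timedelta(days=90),  # assumed threshold
    notice: timedelta = timedelta(days=14),      # assumed warning window
) -> Tuple[List[Grant], List[Grant]]:
    """Return (grants to warn about, grants to revoke now)."""
    to_notify: List[Grant] = []
    to_revoke: List[Grant] = []
    for grant, used_at in sorted(last_used.items()):
        idle = now - used_at
        if idle >= idle_limit:
            to_revoke.append(grant)
        elif idle >= idle_limit - notice:
            # Close to the limit: warn the user before anything is removed.
            to_notify.append(grant)
    return to_notify, to_revoke
```

Since this logic only needs usage timestamps, it can live once in the central platform and apply to every integrated system.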

The chart below shows the relative change in users with access to a core production system we’ll call System X. After rolling out usage-based expiration at the end of April, users with access to System X hit a steady state of about one third of the peak. A significant least privilege win! We saw similar results in other systems where we rolled out usage-based expiration.

Users with System X access dropped by two thirds after enabling usage-based expiration in late April.

Another security benefit has been the ability to roll out consistent compliance changes across all systems as new regulations are introduced. For example, we could enable a rule that requires North American employees to get special approval from our European legal counsel in order to access certain protected data for European customers. This rule can be consistently applied across many systems such as online databases, offline datastores, and customer support tooling.
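
The example rule above could be expressed as a small, system-independent function. This is only a sketch: the `AccessRequest` fields, the region codes, and the `eu-legal-counsel` approver name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical shape of a permission request."""
    requester_region: str    # e.g. "NA" or "EU"
    data_region: str         # region of the customers the data belongs to
    data_is_protected: bool
    manager: str


def required_approvers(request: AccessRequest) -> List[str]:
    """Compute who must approve, independent of the target system."""
    approvers = [request.manager]
    if (
        request.data_is_protected
        and request.data_region == "EU"
        and request.requester_region == "NA"
    ):
        # Assumed approver identifier for European legal counsel.
        approvers.append("eu-legal-counsel")
    return approvers
```

Because the rule takes only request attributes as input, the same check applies whether the permission is for an online database, an offline datastore, or a support tool.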

Another win has been having a centralized database, against which we can create cross-system least privilege metrics and track our progress over time. The chart above was generated using this database.

Usability

Having a centralized access control platform has been a big win for usability. Because everything is consolidated, users no longer need to be aware of the N different places they would otherwise go to request access. Effectively, we’ve been able to create a one-stop shop for all access at Airbnb. Just search for what you need access to and we’ll guide you through the rest.

The self-serve features we’ve built into the platform have helped reduce operational overhead. Employees can request permissions without having to involve a support engineer. When managers go out of town, they can delegate a peer to approve changes on their behalf. Users can revoke their own permissions, their reports’ permissions, or permissions for systems they manage without outside help.

Providing good self-serve access control tooling has significantly cut support costs.

Developer Experience

We’ve put significant effort into making it as easy as possible for developers to build the connectors that link the Access Control Platform with permission stores. A large portion of this effort has been building great tools.

For example, one design decision that has proved extremely useful for the developer experience is notifying connectors about changes via an asynchronous message queue. Whenever a permission’s state changes, the Access Control Platform sends a message to the queue. The queue is processed by the connector that’s responsible for syncing the state of the updated permission into the permission store.

The permission’s state (granted or revoked) has to be fetched from the Access Control Platform. It is not included in the enqueued message.

The contents of the message are a critical part of the design. The message contains what the permission is and who it is for, but not whether the permission was granted or revoked. To get the current state (granted / revoked), the connector must query the platform. You can think of the message as a trigger to cause the system to resync permission X for user Y.
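
A minimal sketch of this message design, under stated assumptions: `SyncMessage`, `FakePlatform`, and `handle` are hypothetical names, and plain in-memory sets stand in for both the platform's database and the permission store.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class SyncMessage:
    """Names the permission and the user, but deliberately not the state."""
    user: str
    permission: str


class FakePlatform:
    """Stand-in for the Access Control Platform's source of truth."""

    def __init__(self) -> None:
        self._granted: Set[Tuple[str, str]] = set()

    def set_state(self, user: str, permission: str, granted: bool) -> None:
        if granted:
            self._granted.add((user, permission))
        else:
            self._granted.discard((user, permission))

    def is_granted(self, user: str, permission: str) -> bool:
        return (user, permission) in self._granted


def handle(message: SyncMessage, platform: FakePlatform, store: Set[Tuple[str, str]]) -> None:
    """Resync one (user, permission) pair from the platform into the store."""
    key = (message.user, message.permission)
    if platform.is_granted(*key):
        store.add(key)
    else:
        store.discard(key)
```

Handling the same message twice, or handling an old message after the state has changed, always leaves the store matching the platform's current state.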

This design has the following properties:

  • Because the latest state (granted / revoked) is always fetched, message processing is idempotent.
  • This allows us to use at-least-once delivery semantics, greatly simplifying the process of ensuring that the proper messages are sent every time a permission changes. If a permission changes but the process is killed (perhaps due to a deploy) before we’ve recorded that the platform triggered the necessary update message, we just trigger the message again in a clean-up process.
  • Replay attacks are nullified, so we let connector developers freely enqueue messages to aid in debugging. As a connector developer, this is quite useful when trying to determine why a permission sync is failing.
  • If updates do fail too many times, the message goes to a dead letter queue and the team responsible for the connector is alerted. Developers then use our tools to read the messages from the dead letter queue and debug the failing updates. Once issues are fixed, all failed messages can be re-enqueued which will bring all permissions back in sync.
  • We run regular offline jobs to do bulk permission diffs and identify any permission changes that need backfilling. Then we trigger resyncs by enqueuing update messages for these permissions. This means that connector developers only need to write code to support incremental sync rather than both backfill and incremental sync. The backfills are free!
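
The last two properties can be sketched together: a bulk diff that finds permissions needing a resync, and a queue consumer that retries failures before moving a message to a dead letter list. The function names, the retry limit of three, and the in-memory `deque` standing in for the real queue are all assumptions for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Set, Tuple

Key = Tuple[str, str]  # (user, permission)


def keys_needing_resync(platform_grants: Set[Key], store_grants: Set[Key]) -> List[Key]:
    """Bulk diff: anything present on only one side is out of sync."""
    return sorted(platform_grants ^ store_grants)


def drain(queue: Deque[Key], handler: Callable[[Key], None], max_attempts: int = 3) -> List[Key]:
    """Process messages, retrying failures; return the dead-lettered keys."""
    attempts: Dict[Key, int] = {}
    dead_letters: List[Key] = []
    while queue:
        key = queue.popleft()
        try:
            handler(key)
        except Exception:
            attempts[key] = attempts.get(key, 0) + 1
            if attempts[key] >= max_attempts:
                dead_letters.append(key)  # a real system would alert the owning team
            else:
                queue.append(key)         # safe to retry because handling is idempotent
    return dead_letters
```

A backfill is then just `drain(deque(keys_needing_resync(...)), handler)`: it reuses the same incremental handler the connector already has, which is why no separate backfill code is needed.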

Conclusion

Managing permissions and ensuring least privilege is a challenge at any company and especially difficult in large companies. Many companies come up with operationally heavy solutions that are expensive, insecure, and provide a negative user experience. At Airbnb, we’ve solved this challenge by implementing a centralized, self-serve access control platform. What made our investment such a success was meeting Airbnb’s unique goals in a cohesive and scalable way. What is very rare is the degree to which we’ve actually rolled this out in production: the majority of permissions at Airbnb are managed by our Access Control Platform. Our approach has enabled us to make huge strides in ensuring that we’re doing everything we can to keep our community’s data safe while at the same time enabling Airbnb’s employees to do our best work.

We’ve made a lot of progress in the access management space, but there is still a lot to do! If you’re interested in working on this or other efforts to protect Airbnb’s community, check out security and software engineering jobs on our careers page.

Acknowledgments

The Access Control Platform we’ve built was the result of hard work from many collaborators at Airbnb. Samuel Zhu, Alissara Rojanapairat, Kyler Mejia, Stephy Nancy, and Maryna Butovych built significant portions of the system and contributed to the architecture. Alan Yao and Abhishek Parmar provided invaluable feedback that influenced the architecture. Julia Cline ensured that we were building a product that would meet the needs of our customers. Brett Bukowski, Jacqui Watts, Julia Cline, Pat Moynahan, and Christopher B provided valuable feedback on this blog post. Tina Nguyen and Lauren Mackevich shepherded this blog post through the process. And many other colleagues contributed in small and large ways to make this possible.

****************

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.
