Box Cloud Management Framework: Multi-Cloud IAM Challenges (Part 2 of 3)

Garth Booth
Published in Box Tech Blog
12 min read · Aug 13, 2021
Illustrated by Navied Mahdavian

Welcome back to Part Two of our mini blog series on Multi-Cloud IAM. In Part One, we introduced and defined our Multi-Cloud IAM Meta-Model. In this post, we’ll take a deeper dive into the challenges we faced and how we leveraged our Meta-Model to resolve them. Let’s first recap the challenges we introduced in Part One.

Multi-Cloud IAM challenges

Before Box established a well-defined Multi-Cloud IAM model, we faced a number of the classic challenges any organization faces when trying to manage multiple Public Clouds to run their applications. This included typical account sprawl, unexpected costs due to overconsumption of cloud resources, security concerns due to lack of governance, and significant gaps and inefficiencies in meeting security and compliance requirements. In addition to those challenges, a number of other areas were inhibiting our ability to manage our IAM across multiple clouds in a consistent manner:

  • Inconsistent ownership of Organization Root administrator credentials
  • Inconsistent support for federated Identities across our clouds
  • Limited visibility into who has access to what resources
  • Lack of automated Identity and Access Management Life Cycle Management
  • Lack of consistent role definitions and role assignments
  • Limited input from Box Personas on Role Definitions
  • Lack of a common resource tagging mechanism
  • Inconsistent naming conventions
  • Lack of consistent process to secure and manage programmatic application identities
  • Lack of consistent and efficient process to provide critical evidence for compliance audits

Resolving the Challenges

In order to address the security, compliance, and other stakeholder requirements, we enumerated five key areas that we used to drive detailed discovery and specific recommendations on how to remediate the challenges above.

Organization Root Credentials and Resource Group Reclamation

Whether you already have one or more clouds under management or are just starting out, it is critical that you define the operational model for managing Organization Root credentials and for reclaiming Resource Groups that may have been provisioned outside of your primary Organization Root. For Organization Root credentials, this includes defining the ownership model, limiting access to break-glass scenarios, defining a clear process for break-glass access, and ensuring there is an audit trail for any authorized break-glass access by the designated owners. For Resource Group reclamation, this includes migrating those groups back into the main Organization so that you can provide consistent governance.

Let’s address the Organization Root credentials ownership model first. If you are already managing multiple clouds, you must reclaim those credentials and ensure that only one team is responsible for them throughout the organization. Once reclaimed, these credentials must also be rotated periodically to protect against accidental or malicious leakage. There should not be multiple individuals across multiple teams who own those credentials. Having a single designated team that is solely responsible for these credentials makes accountability easier to manage and ensures that a standard governance model can be defined to protect them.

Along with reclaiming and defining a clear ownership model for these credentials, there must be a well-defined process that limits access to break-glass scenarios. Normal operation of your cloud(s) does not require access to these credentials, so they should not be used for day-to-day operations. Instead, you should define subordinate credentials with much more narrowly scoped permissions to manage daily operations. Further, we recommend a process that notifies security prior to, and directly after, any break-glass access to these credentials. This must include automated security alerts that notify your security incident response team (SIRT) of any access to these credentials.

Auditors pay particular attention to the management of Organization Root Administrator credentials, so it is critical that, along with the ownership and access model, you have a clear audit trail for when those credentials are accessed. As noted earlier, these are the “keys to the kingdom” and must be treated with extreme care. You must have access logs that document who accessed those credentials and when.
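
As a minimal sketch of the audit trail and alerting described above, the Python below filters a list of access events for any use of a root credential and formats an alert for each one. The event fields, the `ROOT_PRINCIPALS` names, and both function names are hypothetical illustrations, not Box’s implementation; real cloud audit logs have their own schemas.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical identifiers for Organization Root credentials (assumption).
ROOT_PRINCIPALS = {"aws-org-root", "gcp-org-admin", "azr-global-admin"}


@dataclass(frozen=True)
class AccessEvent:
    principal: str       # credential that was used
    actor: str           # person who checked the credential out
    timestamp: datetime  # when the access happened
    action: str          # what the credential was used for


def find_break_glass_events(events):
    """Return the events that used a root credential, oldest first."""
    hits = [e for e in events if e.principal in ROOT_PRINCIPALS]
    return sorted(hits, key=lambda e: e.timestamp)


def format_alert(event):
    """Build alert text recording who accessed which credential, and when."""
    return (f"BREAK-GLASS: {event.actor} used {event.principal} "
            f"for {event.action} at {event.timestamp.isoformat()}")


events = [
    AccessEvent("svc-deployer", "ci",
                datetime(2021, 8, 1, 9, 0, tzinfo=timezone.utc), "deploy"),
    AccessEvent("aws-org-root", "alice",
                datetime(2021, 8, 2, 14, 30, tzinfo=timezone.utc), "update-billing"),
]
alerts = [format_alert(e) for e in find_break_glass_events(events)]
```

In practice the alert text would be sent to the SIRT channel rather than kept in a list; the point is only that every root access yields a record of who, what, and when.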

Most organizations have suffered from the typical Resource Group (aka account) sprawl and inconsistent governance of those Resource Groups prior to developing a standard IAM architecture and governance model. It is critical that you reclaim these Resource Groups and bring them under centralized management and ownership. This means that all Resource Groups must be migrated to the appropriate Organization Group and all Root access (i.e. admin access) must be transferred to the primary team responsible for managing IAM in your organization. This makes it possible to apply Organization Policies and to ensure the principle of least privilege is adhered to across all Resource Groups in the Organization. Later in this blog, we will provide some examples that will illustrate the importance of this process.

Federated Identities

In a hybrid, multi-cloud environment, it is important to define a consistent trust model for enabling and disabling service owner access to resources (both private and public). Whether your business is using Active Directory, LDAP, or another identity provider (IdP), developing a process for consistent management of service owner access to cloud resources is a fundamental business imperative to meet security and compliance requirements. Federated identities are intended to establish that consistent access control system between on-premises resources and cloud resources by managing access via a centralized IdP (controlled by your business).

Anyone who has participated in an audit of access control systems knows that being able to document and demonstrate a consistent process throughout your organization substantially reduces the toil needed to complete the audit. Fortunately, each of the leading cloud providers provides services to help you implement a federated identity model; some easier than others! Let’s explore how we’ve approached federation across our key cloud providers.

In general, there are three key components in the federation model: the individual users, an identity source to manage those users, and the service provider. In our case, we have individual Boxers (that’s what we call people who work at Box), PingFederate with LDAP, and our supported Cloud Providers.

We use LDAP as the authoritative source of truth for all our Box Identities (i.e. Boxers). It serves as our centralized system to manage access control to both on-premises and public cloud resources. We leverage LDAP Groups as the primary construct for managing access. This means that when Boxers need to be granted or denied access to resources, we simply add them to or remove them from the appropriate LDAP Group. We defined a naming convention for cloud specific LDAP Groups so that we could visually and programmatically distinguish between specific cloud access groups. This was done by simply including a cloud specific prefix (i.e. azr, gcp, aws) in the LDAP group name.
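
To illustrate how a prefix convention like this can be used programmatically, here is a minimal sketch that maps an LDAP group name to the cloud it governs. Only the three prefixes come from the text above; the `-` separator, the example group names, and the function name are assumptions made for illustration.

```python
# Prefixes named in the post, mapped to the cloud they identify.
CLOUD_PREFIXES = {"aws": "AWS", "gcp": "GCP", "azr": "Azure"}


def cloud_for_group(group_name, separator="-"):
    """Return the cloud a group grants access to, or None for non-cloud groups.

    Assumes (hypothetically) that names look like "<prefix>-<rest>".
    """
    prefix, sep, rest = group_name.partition(separator)
    if not sep or not rest:
        # No separator or nothing after it: not a cloud access group.
        return None
    return CLOUD_PREFIXES.get(prefix.lower())


def groups_by_cloud(group_names):
    """Bucket cloud access groups by cloud, skipping everything else."""
    buckets = {}
    for name in group_names:
        cloud = cloud_for_group(name)
        if cloud is not None:
            buckets.setdefault(cloud, []).append(name)
    return buckets
```

For example, `cloud_for_group("gcp-data-platform-admins")` returns `"GCP"`, while a group such as `"engineering-all"` returns `None` because its prefix is not a cloud prefix.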

In order to federate our Boxer identities, which are managed with PingFederate and LDAP, with our supported Cloud Providers, we leverage the specific solutions each provider offers. This model made it really simple to document and demonstrate to auditors how we maintain our access control systems. We were able to develop a service that continuously validates proper access by automating cross checks against our LDAP, corporate Human Resources directory, and the Cloud Provider. This process significantly improved our security posture while also letting us meet compliance requirements with easy-to-produce access reports. We’ll explore that and other similar services in a future blog post.
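
The cross check described above can be pictured as a comparison of three sets of user names. The sketch below is a hedged illustration of that idea only: the function name, the report keys, and the sample users are hypothetical, and the actual Box service is not described in this post.

```python
def find_access_violations(ldap_members, hr_active, cloud_members):
    """Compare three sources of truth and report where they disagree.

    ldap_members:  users in the cloud access LDAP group
    hr_active:     users the HR directory lists as active employees
    cloud_members: users who actually hold access at the cloud provider
    """
    ldap_members = set(ldap_members)
    hr_active = set(hr_active)
    cloud_members = set(cloud_members)
    return {
        # Access held by someone HR does not list as active.
        "not_in_hr": sorted((ldap_members | cloud_members) - hr_active),
        # Access granted in the cloud outside of the LDAP group.
        "cloud_without_ldap": sorted(cloud_members - ldap_members),
        # LDAP membership that has not reached the cloud yet.
        "ldap_not_synced": sorted(ldap_members - cloud_members),
    }


report = find_access_violations(
    ldap_members={"alice", "bob", "carol"},
    hr_active={"alice", "bob", "dave"},
    cloud_members={"alice", "carol", "erin"},
)
```

A report with three empty lists is the evidence an auditor wants to see; any non-empty list is a finding to investigate.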

Here’s a brief description of how we approach federation with each cloud provider:

  • AWS provides support using either AWS SSO or AWS Identity and Access Management. Our approach leverages the AWS SSO model: we federate an internal IdP with AWS via the AWS SSO solution. The basic process is to create a new AWS account or use an existing one, update the AWS SSO details with the appropriate role and policy to allow for authenticated users, and finally update our internal IdP to map the LDAP group to a defined SAML role. The user can then log in, assume the assigned role, and manage their account appropriately.
  • GCP provides Cloud Identity or Google Workspace accounts to manage users and groups. Our federation implementation leverages the Cloud Identity solution. In order to avoid manually managing users and groups with Cloud Identity, we also leverage a GCP service called Google Cloud Directory Sync (GCDS). The GCDS tool enables you to continuously sync the appropriate on-premises LDAP groups into GCP Cloud Identity. Once users and groups are synced into GCP, you can then assign roles to groups or users to perform authorized actions in the cloud.
  • Azure provides three services for establishing hybrid identities: Password hash synchronization (PHS), Pass-through authentication (PTA), and Federation (AD FS). There is some fairly good Azure hybrid identity documentation on how to decide which authentication method is best for your environment. Each of these solutions is described with the assumption that you are using an on-premises Active Directory solution for your hybrid identities. Our intended approach is to use Federation (AD FS). However, we are using LDAP and are continuing to evaluate the best way to implement this solution for federated identities on Azure.

Infrastructure as Code (IaC)

In order to deliver a secure, efficient, and repeatable set of processes for managing the lifecycle of IAM resources across multiple cloud providers, you must establish a well-defined model for automating your day-to-day operations. At Box, we defined a set of IaC standards and frameworks that drive not only our IAM operations, but all cloud resource lifecycle management, ranging from networks, to Kubernetes cluster platforms, to data analytics platforms, and much more.

For IAM operations, we manage all initial service owner onboarding and day-to-day operations via a centralized IaC pipeline that controls CRUD operations on organization groups, resource groups, roles, role bindings (i.e. role to LDAP group management), organization policies, application identities, and other IAM specific operations. In addition, we are building standard security and compliance checks into the IAM and other IaC pipelines to ensure the proper guardrails are in place.

This is a critical step towards delivering a true self-service model where service owners can more easily drive their own onboarding and day to day operations. In our next blog post in this series, Box Cloud Management Framework: Infrastructure as Code, we will take a much deeper look at the technology stack and overall architectural approach to how we built our IaC services.
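
One common way an IaC pipeline drives CRUD operations is to diff the declared (desired) state against what currently exists and then apply only the difference. The sketch below shows that idea for role bindings. It is an assumption about how such a pipeline could work, not a description of Box’s pipeline; the tuple layout, group names, and resource group names are hypothetical.

```python
def plan_role_bindings(desired, actual):
    """Diff desired vs. actual role bindings.

    Each binding is a (ldap_group, role, resource_group) tuple. The result
    lists what the pipeline would create and delete to reach the desired state.
    """
    desired, actual = set(desired), set(actual)
    return {
        "create": sorted(desired - actual),
        "delete": sorted(actual - desired),
    }


# Declared in code (what should exist).
desired = [
    ("gcp-data-admins", "roles/bigquery.admin", "analytics-prod"),
    ("gcp-data-readers", "roles/bigquery.dataViewer", "analytics-prod"),
]
# Discovered from the cloud provider (what exists today).
actual = [
    ("gcp-data-readers", "roles/bigquery.dataViewer", "analytics-prod"),
    ("gcp-old-team", "roles/editor", "analytics-prod"),
]
plan = plan_role_bindings(desired, actual)
```

Because the plan is computed before anything changes, it is also a natural place to run security and compliance checks on the proposed bindings.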

Audited Roles and Permissions (Least Privilege Model)

Achieving a Least Privilege model in the cloud is one of the most difficult challenges you’ll face. The difficulty comes from balancing two needs: giving service owners enough access to cloud resources to achieve high velocity, while maintaining the proper restrictions and guardrails to prevent accidental or malicious exposure to security threats. There are two key mechanisms we use here: Organization Policies and Audit of Roles and Permissions. Organization Policies let you put broad guardrails in place, either across the entire organization or within a specific Organization Group. We will cover both of these in more detail in the third and final blog post.

One of the first and most critical things we did was a role audit of the key personas required for operating in the cloud. The steps below describe our general approach:

  • Enumerate the key roles (in priority order)
  • Enumerate the use cases for each persona
  • Enumerate the cloud agnostic actions each needed to perform
  • Map the cloud agnostic actions to provider specific pre-defined and/or custom roles

This process was done in collaboration with the service owners, security, compliance, and cloud architects. What we found is that this process works well for some of the more straightforward cloud roles (e.g. security, finance, compliance), but not for many of the more complex roles. Network, Platform Owner (e.g. Kubernetes), and general service owner roles required a more iterative approach: performing the required use cases in a development or sandbox environment in order to arrive at Least Privilege role definitions. Development and sandbox environments therefore require more open permissions; otherwise there will be significant thrashing between security, compliance, legal, and the role owner to obtain approval for each role permission update as role owners learn what they don’t know (in terms of the permissions required to support their service).
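
The last step of the role audit, mapping cloud agnostic actions to provider specific roles, can be sketched as a lookup table. The action names and the persona below are hypothetical; the role names are pre-defined roles that exist at each provider, chosen purely as illustrations and not as Box’s actual role definitions.

```python
# Cloud agnostic action -> pre-defined role per provider (illustrative only).
ROLE_MAP = {
    "read_all_resources": {
        "aws": "ReadOnlyAccess",
        "gcp": "roles/viewer",
        "azure": "Reader",
    },
    "read_object_storage": {
        "aws": "AmazonS3ReadOnlyAccess",
        "gcp": "roles/storage.objectViewer",
        "azure": "Storage Blob Data Reader",
    },
}


def roles_for_persona(actions, cloud):
    """Map a persona's actions to roles on one cloud.

    Returns (sorted role names, actions with no pre-defined mapping).
    Unmapped actions are candidates for a custom role.
    """
    roles, unmapped = set(), []
    for action in actions:
        role = ROLE_MAP.get(action, {}).get(cloud)
        if role is None:
            unmapped.append(action)
        else:
            roles.add(role)
    return sorted(roles), unmapped


# Hypothetical compliance auditor persona.
auditor_actions = ["read_all_resources", "read_object_storage", "export_audit_report"]
gcp_roles, gcp_gaps = roles_for_persona(auditor_actions, "gcp")
```

The list of unmapped actions is where the iterative sandbox work described above comes in: those are the permissions that have to be discovered by actually exercising the use case.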

One best practice is to define an environment that provides more open access to developers so that they can move fast in building their application(s) without excessive permission restrictions on their roles. This will require sign-off from your security, compliance, and legal organizations, but it will significantly improve the velocity of role owners by removing the friction of updating role permissions in real time as the service is being developed.

Another best practice is to avoid the use of owner and editor roles in your stage and production environments. Although these basic roles make it very easy to grant privileges to users in a succinct manner, they also violate the principle of least privilege and can lead to security, compliance, and even cost-related issues. As noted earlier, it may be OK to grant these in sandbox environments to accelerate developer velocity, but care should be taken even there (due to the same concerns). In our development environment, we have defined custom roles that provide more freedom for service owners to work with approved services, but do not give the broad access that comes with the owner and editor roles.
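
A guardrail for this practice can be as simple as scanning role bindings for basic roles in restricted environments. The sketch below uses the GCP basic role IDs `roles/owner` and `roles/editor`; the binding fields, environment names, and function name are hypothetical.

```python
# GCP basic roles that should not appear in stage or production.
BASIC_ROLES = {"roles/owner", "roles/editor"}
# Hypothetical environment names where basic roles are disallowed.
RESTRICTED_ENVIRONMENTS = {"stage", "production"}


def find_basic_role_violations(bindings):
    """Return bindings that grant a basic role in a restricted environment."""
    return [
        b for b in bindings
        if b["environment"] in RESTRICTED_ENVIRONMENTS and b["role"] in BASIC_ROLES
    ]


bindings = [
    {"member": "gcp-payments-devs", "role": "roles/editor", "environment": "sandbox"},
    {"member": "gcp-payments-devs", "role": "roles/editor", "environment": "production"},
    {"member": "gcp-payments-ops", "role": "roles/compute.viewer", "environment": "production"},
]
violations = find_basic_role_violations(bindings)
```

Here only the production `roles/editor` binding is flagged; the sandbox binding passes, which matches the more open posture described for sandbox environments.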

Tagging and Naming Conventions

Consistent Tagging and Naming Conventions are both identified as best practices across all cloud providers. As your organization starts to manage one or more clouds, applying metadata to cloud resources lets you organize them so that you can more easily manage, search, and filter resources by various categories (e.g. service owner, environment, region, cloud provider). Naming Conventions for tags, resource groups, and resources support automation via an IaC pipeline, so that you can programmatically manage cloud resources without the need for manual intervention.

AWS, GCP, and Azure all provide the ability to tag resources. AWS and Azure use the term tags, while GCP uses the term labels for metadata set as key-value pairs on various types of Cloud Resources. GCP does have a concept called tagging, but it applies only to network tags, which are used to apply network policies to resources, typically compute instances. Each of the Cloud Providers has some very good documentation on Tagging/Labeling best practices.

Our approach to tagging is used to support automatic cost allocation and showback, assignment of network access policies, service owner mapping, audit report generation (by environment), data classification, and many other operational use cases. We are focusing on consistently tagging or labeling Resource Groups as well as tagging specific resources. One thing we have found is that you must standardize tagging through IaC, as this ensures the process is consistent and that agreed-upon standards for tag keys and values are always followed.
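
Enforcing agreed-upon tag keys and values is the kind of check an IaC pipeline can run before a resource is created. The following is a minimal sketch of such a check; the required keys and allowed values are hypothetical examples loosely based on the use cases above, not Box’s actual tagging standard.

```python
# Required tag keys and their allowed values (None means any non-empty value).
# These keys and values are illustrative assumptions.
REQUIRED_TAGS = {
    "service_owner": None,
    "environment": {"dev", "stage", "production"},
    "data_classification": {"public", "internal", "confidential"},
}


def validate_tags(tags):
    """Return a list of problems with a resource's tags (empty if compliant)."""
    errors = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key)
        if not value:
            errors.append(f"missing required tag: {key}")
        elif allowed is not None and value not in allowed:
            errors.append(f"invalid value for {key}: {value}")
    return errors


errors = validate_tags({"service_owner": "payments", "environment": "prod"})
```

In this example the check reports that `prod` is not an accepted environment value and that `data_classification` is missing, so the pipeline could reject the change before any resource is provisioned.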

Application Identities and credential rotation

Application identities allow services to programmatically access cloud provider APIs. Each provider defines a concept to enable this type of capability: AWS access keys, GCP service accounts, and Azure service principals. The need to manage the storage and lifecycle of application identities is just as important as the need to reclaim and manage Organization Root credentials. It is paramount that your organization define standards around how you securely store and periodically rotate application identities.

There are multiple choices for how your business can secure and rotate secrets across your organization. Each cloud vendor has cloud native solutions that provide these capabilities: AWS Secrets Manager, GCP Secret Manager, and Azure Key Vault. The primary question you need to answer is whether you want a cloud agnostic solution or a cloud native solution. Cloud native solutions have advantages, such as native integration with cloud vendor services, and disadvantages, such as the need to support multiple implementations across multiple clouds.

We developed a cloud agnostic approach that leverages HashiCorp Vault along with a Box service that leverages the Vault secrets engines for AWS, GCP, and Azure. All Cloud secrets are stored and managed in Vault, which provides a well-defined model for how all Cloud Provider secrets are stored. In addition, Box developed a service that allows us to rotate credentials either on demand or on a time-based schedule. This model significantly improves our security and compliance posture, as both of these capabilities are critical controls that auditors require. We also get the advantage of having a single solution that spans multiple cloud providers.
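
The two rotation triggers mentioned above, time-based and on demand, can be sketched as a single selection function. This is an illustration of the decision logic only: it does not call Vault, and the credential names, the 30-day maximum age, and the function name are assumptions rather than details of the Box service.

```python
from datetime import datetime, timedelta, timezone


def credentials_due_for_rotation(last_rotated, max_age, now, force=()):
    """Select credentials to rotate.

    last_rotated: mapping of credential name -> datetime of last rotation
    max_age:      timedelta after which a credential must be rotated
    now:          current time (passed in so the logic is easy to test)
    force:        names to rotate on demand regardless of age
    """
    due = {name for name, ts in last_rotated.items() if now - ts >= max_age}
    # On-demand requests only apply to credentials we actually manage.
    due |= set(force) & set(last_rotated)
    return sorted(due)


now = datetime(2021, 8, 13, tzinfo=timezone.utc)
last_rotated = {
    "aws/ci-deployer": datetime(2021, 5, 1, tzinfo=timezone.utc),
    "gcp/etl-service-account": datetime(2021, 8, 1, tzinfo=timezone.utc),
    "azr/backup-principal": datetime(2021, 7, 14, tzinfo=timezone.utc),
}
due = credentials_due_for_rotation(last_rotated, timedelta(days=30), now)
```

With a 30-day maximum age, the AWS and Azure credentials are selected and the GCP one is not; passing `force=["gcp/etl-service-account"]` would add it as an on-demand rotation.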

Conclusion

Managing multiple cloud providers is a huge challenge, and it should be clear that achieving a well-defined and operational governance model across them will not be an easy or short process. As we enumerated the various challenges and defined methods to resolve them, it was clear that we needed a multi-year roadmap that would allow us to prioritize and focus on the most critical areas first. Again, it is our hope that sharing these details with our customers, partners, and the general tech community will help you formulate your own approach by leveraging some of this information. In the final blog post of this mini series, we’ll illustrate some specific implementation examples across AWS, GCP, and Azure that leverage the Meta-Model and the details in this blog to develop a consistent governance model across these cloud providers.

If you are interested in joining us, please check out the open opportunities at Box.

Special thanks to the following people for their detailed reviews and comments:

  • Luis Hernanz, Principal Architect, Box
  • Matt Bowes, Staff Security Engineer, Box
  • Xaviea Bell, Senior Site Reliability Engineer, Box
