Box Cloud Management Framework: Our journey to delivering a Cloud Management Platform

Garth Booth
Published in Box Tech Blog
11 min read · Aug 22, 2019

Box delivers the world’s leading Cloud Content Management (CCM) Platform that powers 95K+ businesses and 70% of Fortune 500 companies. Our platform enables customers to securely manage their content, build workflows to automate predictable and repeatable tasks, and to collaborate across both internal and external teams. Further, it enables customers to meet Data Privacy, Security, Regulatory, Data Residency, and other country specific requirements with products such as Box Zones, Box Governance, and Box KeySafe.

To deliver all these capabilities, we’ve built a set of CCM Platform Services using a Hybrid, Multi-Cloud Architecture. This means we strategically consume and manage cloud infrastructure resources both in Box data centers and in the Public Cloud. This presents a number of operational challenges: securing access to our infrastructure, delivering efficient cloud resource management operations, ensuring our cloud infrastructure is in a healthy state, monitoring and validating access, and analyzing and optimizing costs. Addressing these challenges in a holistic way is critical to ensuring we have secure and compliant Cloud environments that are operationally efficient.

This will be the first in a series of posts that will describe our journey to delivering a Box Cloud Management Platform (CMP) to address these operational challenges. Before diving into the details, let’s first define some terms and concepts.

What is a Cloud Management Platform (CMP)?

The precise definition of a CMP can vary widely depending on who you talk to. We define it as an integrated product or piece of software that manages one or more of the capabilities required to provide visibility and control of public, private, and/or hybrid cloud environments. The capabilities also vary between CMPs, so we defined the five key capabilities we believe are required, based on our hybrid, multi-cloud architecture and the operational challenges we must address:

Multi-Cloud Identity and Access Management defines how we securely enable single sign-on and role-based access control for console access and API access key management. To secure access to our cloud infrastructure resources, we must enable the right access, to the right resource(s), at the right time. In other words, we adhere to the “Principle of Least Privilege”.
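To make "the right access, to the right resource(s), at the right time" concrete, here is a minimal sketch of a deny-by-default, time-bound access check. The `Grant` type, the `is_allowed` function, and the resource naming scheme are all hypothetical, chosen for illustration only; they are not part of any Box service or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    """A hypothetical time-bound permission: one action on one resource prefix."""
    action: str
    resource_prefix: str
    expires_at: datetime


def is_allowed(grants, action, resource, now):
    """Deny by default; allow only if an unexpired grant matches action and resource."""
    return any(
        g.action == action
        and resource.startswith(g.resource_prefix)
        and now < g.expires_at
        for g in grants
    )


# Illustrative data: a single read-only grant that expires after one hour.
now = datetime(2019, 8, 22, 12, 0, tzinfo=timezone.utc)
grants = [Grant("read", "storage/reports/", now + timedelta(hours=1))]

print(is_allowed(grants, "read", "storage/reports/q3.csv", now))    # True: right access, resource, time
print(is_allowed(grants, "delete", "storage/reports/q3.csv", now))  # False: action never granted
print(is_allowed(grants, "read", "storage/reports/q3.csv", now + timedelta(hours=2)))  # False: grant expired
```

The point of the sketch is the default: anything not explicitly granted, or granted but expired, is refused.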

Capacity Life Cycle Management defines core orchestration and automation of capacity across bare metal, OpenStack, Kubernetes, and supported Cloud Provider resources. This capability enables an efficient and extensible way to safely and reliably provision, repurpose, immutably update, and decommission resources in any of the supported private and public cloud providers. It also provides auditable records to demonstrate that we are following proper change control processes, including validation of the cloud infrastructure and resources used to support our CCM Platform services.

Platform Health covers 3 key areas:

  • Proactive Monitoring: How we proactively monitor real-time and non-real-time outage, health, and security related events across our cloud infrastructure.
  • Visibility: Visibility into the customer experience so that we can detect and mitigate issues impacting specific enterprises or users.
  • Infrastructure Control: Control over our infrastructure to facilitate capacity expansion and contraction, updates, and recovery from disasters or outages. This capability delivers immediate notifications or alerts and, most importantly, the ability to recover when there are outages, degraded services, or planned maintenance windows for any of the cloud infrastructure platforms we use.

Compliance and Security defines how we continuously monitor, validate, and re-evaluate our implementation of the “Principle of Least Privilege”, support our compliance requirements, and ensure our cloud resources are configured according to our defined security policies. Box is committed to providing best in class compliance, security, and data protection for our customers (see Box Compliance statement). This capability ensures we are delivering on that commitment.

Cost and Capacity Management defines how we accurately track, budget, and forecast consumption of cloud infrastructure Resources. This capability assures we have sufficient visibility across all supported Cloud Providers to enable showback, budgeting, and forecasting of consumed infrastructure and services across production, staging, and development environments, and track actuals at Project, Service, and Platform levels of granularity.

The Journey Begins…

Back in June of 2017, we embarked on our journey to define our strategy for managing such a diverse set of cloud infrastructure. One of the first steps we took was to survey a number of CMP vendors on the market, from large (all-inclusive) platforms to smaller, more niche-focused platforms, to get a sense of what capabilities were currently available. To keep the approach unbiased, thorough, and methodical, several stakeholders developed a set of use cases based on the five capabilities we needed to provide at Box. We then reviewed several analyst reports on Multi-Cloud Management, consulted directly with a number of analysts, and attended numerous vendor-specific Cloud conferences to collect additional information to help formulate our overall strategy.

We determined very early in the process that no single vendor platform or tool would support all of the capabilities we defined across our hybrid, Multi-Cloud Architecture. Most of the platforms and tools we evaluated provide either deep capabilities across one or two Cloud Platforms or limited capabilities across multiple Cloud Platforms, neither of which met the strategic cloud infrastructure resource consumption requirements we defined in our detailed use cases. We also identified a number of Open Source software solutions that would be ideal for addressing the requirements of a number of those use cases. So, ultimately, we decided to pursue a solution that does not rely on a single platform or vendor, but leverages solutions across niche CMPs, open source, and our own custom services.

Introducing the Box Cloud Management Framework (CMF)

Because there is no single CMP that completely addresses our requirements, we had to think about a framework into which different capabilities could be included to address our diverse set of use cases. So, we created the Box CMF!

The Box CMF is a set of guidelines, architectural patterns, and tools that enable delivery of services supporting a Cloud Agnostic approach to the five capabilities we enumerated earlier. This framework does not assume that Box will build out all of the components and capabilities; we are taking a buy vs. build approach. The figure below illustrates a simplified view of the Architectural Clouds that map to the capabilities we defined for the Box CMF.

Each of the blue clouds defines a strategic vision, a set of architectural patterns, and a set of services (either bought or built) that deliver the necessary features to enable the capability. The green oval represents the plug-in framework used to extend services for a particular capability across cloud infrastructure platforms. The purple clouds represent Cloud Resources that are managed, via the plug-in framework, by the services contained within each capability. And finally, the white Self-Service box represents the role-based access control portal through which stakeholders consume the services delivered by the Box CMF. Together, these services deliver a Box Cloud Management Platform.

Core Principles

We defined a set of eight core principles to guide our approach to delivering the key capabilities of the Box CMF:

  1. LDAP, SSO, and Multi-Factor support to deliver a consistent federated identity model across all cloud infrastructure.
  2. Microservice based Architecture to enable independent buy, leverage, and build decisions across the set of use cases defined for each of the Box CMF capabilities. This assures flexibility and reduces the likelihood of vendor lock-in to a specific platform.
  3. RESTful APIs that adhere to Box API Standards and Patterns.
  4. Infrastructure as Code to deliver efficient, consistent, and repeatable management of cloud infrastructure resources, and to provide change control management and auditability of cloud infrastructure.
  5. Plugin Based Architecture to enable extensibility of services across a number of cloud infrastructures.
  6. Common Hosting Environment to deliver operational consistency. There will be exceptions for 3rd party software that we buy to support specific use cases.
  7. Change Control via the GitHub pull request process to provide traceability of all approved cloud infrastructure changes used by production services.
  8. Containerized packaging to provide portability, deployment efficiency, and consistent, predictable environments for our services.

It is important to note that these are guiding principles and that some may or may not apply in specific cases, especially when evaluating 3rd party software options.

Box CMF: Buy vs Build

Let’s provide a little more context on the approach we took to determine whether we buy, leverage (open source), and/or build specific services. This is a critical part of the overall journey to delivering a Box CMP, as we need to be judicious in what we choose to buy vs. what we choose to build.

We are following a four-step process to evaluate CMPs and tools to deliver the required capabilities for the Box CMF.

  1. Identify Stakeholders: Identify the key stakeholders that will deliver and/or consume services from this platform. This includes each of the cloud infrastructure Platform owners, Compliance and Security, SREs, Service owners, and Architects.
  2. Define Use Cases: Define a set of detailed use cases across each of the key Capabilities. These use cases serve as a key mechanism to help drive our Buy vs. Build evaluation process.
  3. Identify Gaps: Map the use cases to specific CMPs and to Cloud provider specific and Cloud agnostic tools to determine gaps.
  4. Implement: Implement (via Buy, Leverage, or Build) the specific micro-service(s) needed to enable the capability.
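The "Identify Gaps" step can be sketched as a simple set difference between the use cases we need and the use cases each evaluated tool covers. The function name, capability names, use case labels, and tool names below are invented for illustration; they do not reflect the actual use cases or vendors evaluated.

```python
def find_gaps(use_cases, tool_coverage):
    """Return, per capability, the use cases that no evaluated tool covers."""
    # Union of everything any tool (bought or open source) claims to cover.
    covered = set().union(*tool_coverage.values())
    gaps = {}
    for capability, cases in use_cases.items():
        missing = sorted(set(cases) - covered)
        if missing:
            gaps[capability] = missing
    return gaps


# Hypothetical use cases, grouped by capability.
use_cases = {
    "Identity and Access Management": ["sso-console-access", "api-key-rotation"],
    "Cost and Capacity Management": ["showback-by-service", "forecast-by-platform"],
}

# Hypothetical coverage claimed by each evaluated tool.
tool_coverage = {
    "vendor-a": {"sso-console-access", "showback-by-service"},
    "open-source-b": {"api-key-rotation"},
}

print(find_gaps(use_cases, tool_coverage))
# {'Cost and Capacity Management': ['forecast-by-platform']}
```

Whatever remains in the result is a candidate for the Implement step, where the choice is between buying another tool, leveraging open source, or building a custom service.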

In a recently posted blog, titled “Box CMF: Our Process to deliver a Box CMP”, we do a deep dive into each of the steps. In particular, we describe our use cases, provide some guidance and input on the CMPs and tools we evaluated, describe our process for mapping use cases to CMPs and tools, and finally share our progress on closing the gaps we see for our business.

The remainder of this post will provide a brief introduction to each of the Architectural Clouds. Each is large enough that describing all of them in a single post would be impractical. Therefore, we will be doing a series of follow-up posts that will go into much more detail.

Multi-Cloud Identity and Access Management (IAM)

Identity and Access Management is a foundational capability that will enable appropriate control over who (users and applications) has access to which cloud resources, both in our private clouds and across our chosen public cloud providers. In addition, this capability will provide visibility into when, where, and why cloud resources were accessed, giving us a complete audit trail and helping us meet Box compliance and security requirements.

The Multi-Cloud Governance image below illustrates some of the key concepts we will review in a future post on this topic.

Compliance and Security

Box is committed to adhering to a variety of standards, controls, and processes to ensure our customers can meet and exceed compliance and security standards across industries and geographies. Fundamentally, our strategy for ensuring secure and compliant cloud infrastructure is based on three key National Institute of Standards and Technology (NIST) specifications: NIST 500-292, NIST 500-299, and the NIST Cybersecurity Framework. We are leveraging these standards to provide a methodology for how Box will build out a holistic model for both Compliance and Security.

We start with a high level description of the Box Cloud Computing Reference Architecture which covers all the core areas of our Box Infrastructure used to host our Box Application Services, followed by a focus on IAM, Compliance, and Security. Finally, we finish with our thoughts on our Box Infrastructure moat.

Capacity Life Cycle Management

As depicted in the Box Cloud Computing Reference image above, our Box Infrastructure hosts our Box Applications. It supports both private and public cloud infrastructure. Our Capacity Life Cycle Management process is used to procure, provision, de-provision, immutably update, and decommission cloud infrastructure Platforms and Hosted Services. Fundamentally, we are using Infrastructure as Code as a core principle to enable life cycle management of resources in these environments. However, although the basic workflow for provisioning private and public cloud infrastructure is similar, the tooling used for each is quite different.
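To illustrate the Infrastructure as Code idea in the abstract, here is a minimal sketch that compares a declared (desired) state against the observed (actual) state and derives a plan. It is not the tooling we use; the `plan` function, the resource names, and the spec fields are hypothetical. Changed resources are planned as replacements rather than in-place edits, in keeping with the idea of immutable updates.

```python
def plan(desired, actual):
    """Compare declared state with observed state and list the changes needed."""
    # Declared but not yet running: provision.
    create = sorted(desired.keys() - actual.keys())
    # Running but no longer declared: decommission.
    delete = sorted(actual.keys() - desired.keys())
    # Present in both but with a different spec: replace instead of mutating.
    replace = sorted(
        name for name in desired.keys() & actual.keys()
        if desired[name] != actual[name]
    )
    return {"create": create, "replace": replace, "delete": delete}


# Hypothetical declared state, as it might be reviewed in a pull request.
desired = {
    "web-1": {"image": "web:2.0", "size": "large"},
    "web-2": {"image": "web:2.0", "size": "large"},
}

# Hypothetical observed state of the environment.
actual = {
    "web-1": {"image": "web:1.9", "size": "large"},
    "batch-1": {"image": "batch:1.0", "size": "small"},
}

print(plan(desired, actual))
# {'create': ['web-2'], 'replace': ['web-1'], 'delete': ['batch-1']}
```

Because the desired state is plain data, it can live in version control, which is what makes each change reviewable and auditable.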

In an upcoming post titled “Box CMF: Capacity Life Cycle Management” we’ll describe our approach to both Box data center and public cloud infrastructure life cycle management using Infrastructure as Code principles.

Platform Health

Operating infrastructure in a Hybrid, Multi-Cloud Architecture is a complex and challenging effort. Cloud Platforms must be monitored to ensure they are healthy and performing as expected. We have adopted Platform Observability patterns to provide a holistic approach that includes metrics, logging, and distributed tracing mechanisms across both on-premises and public cloud infrastructure.

In an upcoming post titled “Box CMF: Cloud Platform Observability” we’ll describe our methodology and approach to enable immediate notifications or alerts when there are outages, degraded services, or planned maintenance windows for any of the cloud infrastructure being consumed by our services.

Cost and Capacity Management

Tracking and managing costs and resource utilization across our cloud infrastructure is critical to ensuring we are operationally efficient. We have defined several key standards that enable us to identify services, whether they are running on premises or in the public cloud, whether they are in a production, staging, or development environment, and how much capacity they are consuming.
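As a rough sketch of how such identification standards enable showback, the example below aggregates cost records by service and environment. The record fields (`service`, `environment`, `cost_usd`), the `showback` function, and the sample figures are assumptions made for illustration, not our actual standards or data.

```python
from collections import defaultdict


def showback(records):
    """Sum cost per (service, environment); unlabeled usage is grouped as 'untagged'."""
    totals = defaultdict(float)
    for record in records:
        key = (
            record.get("service", "untagged"),
            record.get("environment", "untagged"),
        )
        totals[key] += record["cost_usd"]
    return dict(totals)


# Hypothetical usage records from private and public cloud infrastructure.
records = [
    {"service": "search", "environment": "production", "cost_usd": 10.0},
    {"service": "search", "environment": "production", "cost_usd": 5.5},
    {"service": "search", "environment": "staging", "cost_usd": 2.0},
    {"cost_usd": 1.0},  # a resource missing the required labels
]

for (service, environment), cost in sorted(showback(records).items()):
    print(f"{service:10} {environment:12} {cost:8.2f}")
```

Surfacing the "untagged" bucket explicitly is a deliberate choice in this sketch: spend that cannot be attributed to a service is itself a signal that the standards are not being applied everywhere.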

Collecting cloud infrastructure metrics to analyze and identify performance optimizations is a key focus here at Box. Continuously monitoring and understanding how our services are performing on our cloud infrastructure opens up opportunities to optimize service placement, compute configurations, or both. These are fundamental steps towards enabling comprehensive cost and capacity management capabilities.

In an upcoming post titled “Box CMF: Cost and Capacity Management” we’ll describe our service standards, how we apply and consume them across our cloud infrastructure, and how we leverage our ongoing performance metrics to identify cost and capacity optimization opportunities.

Conclusion

This blog should provide a glimpse into what it takes to deliver our amazing value at scale. We’ve learned a lot thus far and we want to share our thoughts with others (to show our value and invite a discussion with customers and partners). When we started on our Box CMF journey back in 2017, we knew there was no quick fix, so we decided to focus on formulating a clear strategy that could be executed over time. Our hope is that the information in these blog posts can help serve as a template that could be used to help you formulate and implement your own Multi-Cloud Management strategy.

Illustrated by Jeremy Nguyen/Art directed by Sarah Kislak

References

  • Evaluation Criteria for Cloud Management Platforms and Tools, Gartner
  • A Guidance Framework for Selecting Cloud Management Platforms and Tools, Gartner
  • AWS re:Invent 2016: Architecting Security and Governance Across a Multi-Account Strategy (SAC319)
