Box Cloud Management Framework: Our Process to Deliver a Box CMP

Garth Booth
Published in Box Tech Blog
May 13, 2020 · 18 min read
“Box CMF: Buy vs. Build?” Illustrated by Jeremy Nguyen/art directed by Sarah Kislak

Buy vs. Build models are not new and can be applied to any number of decisions, ranging from home improvement projects to stock portfolio management to a plethora of software solutions and much more. Just do a quick Google search of popular blogging platforms such as medium.com and you'll find a number of examples, including one from a former Boxer, Joy Ebertz: “Finding The Perfect Solution: Build vs. Buy vs. Open Source.” This is not a science, and some would even assert that you should focus on Buy and Build, as opposed to one or the other.

Over the last 5 years or so, Hybrid Clouds have begun to emerge and have brought many new challenges with them. One set of challenges is around how to efficiently manage Hybrid Cloud platform capabilities such as Identity and Access Management, Security, Compliance, Cloud Resource Provisioning, and many others. All of the major Public Cloud providers, as well as Private Cloud solution providers, have created their own tools to support managing their platform(s). This has also created opportunities for a number of new startups, as well as open source developers, to build cloud management tools that help address these challenges. And, as with most challenges, Hybrid Cloud management has plenty of eager engineers across many different businesses looking to build DIY tools.

Given all the choices for managing Hybrid Clouds, how do we take the subjectivity out of deciding whether to buy, build, or leverage open source tool(s) to operationalize the management of these environments? In our first blog post, “Box Cloud Management Framework: Our Journey to Delivering a Cloud Management Platform,” we introduced a set of core principles along with a 4-step buy vs. build process to guide our decision making as we deliver key Box Cloud Management Framework (CMF) capabilities.

In this blog post, we'll use 3 Box CMF capabilities and the specific tools that we implemented to illustrate buy and build decisions on our journey to building out our Box Cloud Management Platform (CMP). We will then share two examples where we did not achieve the desired outcomes and what we learned from them. These examples should serve as lessons as you pursue your own buy, build, or leverage decisions. Finally, we'll summarize some key takeaways. It is worth noting that there are a number of security and compliance related considerations that we'll defer to a future blog post.

Box CMF Capability evaluations
Before we delve into each of these capabilities, let’s take a closer look at cost and time considerations and our general 4-step process for buy, build, and/or leverage decisions.

Cost and Time Considerations
Two of the most important aspects of any buy, build, and/or leverage decision are cost and time:

  • Is it less expensive to build, leverage open source, or just buy?
  • How quickly do we need to deploy a solution to support the defined use cases?
  • How much is it going to cost to support the solution?

Regardless of the project, one of the first things you need to do is determine if your team has the necessary skills to build (either from scratch or leveraging open source) a solution that can deliver the defined use cases in a timely manner. For the remainder of this part of the discussion, we will assume your team has the necessary skills and that you have identified at least one viable buy solution.

First, cost: the primary question to answer is what it will cost year over year to build and maintain a solution vs. simply buying a yearly license for an off-the-shelf software solution. Technically, it is not difficult to compute these costs, but care must be taken not to overlook any material costs that could negatively impact this part of the decision. In the build case, you must include the cost to build the initial solution (i.e. how many engineering months), deployment costs, and ongoing support and maintenance costs. The latter is often overlooked, but can amount to substantial cost over time. The good thing about the build case, in terms of cost, is that you get exactly the capabilities you pay to build.

The buy case is relatively straightforward: it is the negotiated yearly cost to install, maintain, support, and train the engineering team on the software being purchased. If you've chosen a SaaS solution, many of these costs will be included in the overall license fee. One thing to watch out for is how much unused functionality is built into the solution you're buying. Some of the larger multi-cloud software solutions on the market include a number of capabilities you may never use, so you may end up overpaying for the solution.
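To make this comparison concrete, here is a minimal sketch of the kind of year-over-year cost model described above. All of the functions, rates, and dollar figures are hypothetical placeholders rather than actual Box numbers; the point is simply that the build case front-loads engineering and deployment costs and then accrues maintenance, while the buy case accrues license fees.

```python
# Minimal sketch comparing year-over-year build vs. buy costs.
# All figures are hypothetical placeholders, not actual Box numbers.

def build_cost(years, eng_months_initial=18, eng_month_rate=20_000,
               deploy_cost=50_000, annual_maintenance=120_000):
    """Total cost of building: initial engineering + deployment + ongoing support."""
    initial = eng_months_initial * eng_month_rate + deploy_cost
    return initial + annual_maintenance * years

def buy_cost(years, annual_license=250_000, training=30_000):
    """Total cost of buying: yearly license plus one-time training."""
    return annual_license * years + training

for horizon in (1, 3, 5):
    print(f"{horizon} yr -> build: ${build_cost(horizon):,}  buy: ${buy_cost(horizon):,}")
```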

The timeliness aspect is important on two dimensions: one, how quickly you can get the initial solution deployed; two, how quickly you can get new enhancements or extensions added to the solution. Buying a solution will most likely be the quickest path to the initial deployment, as long as it supports a good percentage of the defined use cases. However, if the solution does not have pluggable interfaces to support extensions outside of the vendor's releases, then you will need to ensure the vendor is willing to add any desired enhancements in a timely manner. Otherwise, you may be limited in how you enable future use cases that were not part of the initial deployment.

4-step Buy, Build, and/or Leverage Process

  1. Identify stakeholders that will deliver or consume services from the platform. This should include each of the cloud infrastructure platform owners, compliance and security, SREs, service owners, finance, legal, architects, and others. These stakeholders will need to provide input and validate the details of the use cases that they will require to support their desired business outcomes.
  2. Define use cases across each of the key capabilities and validate them with the appropriate stakeholders. These use cases serve as a key mechanism to drive the buy, build, or open source evaluation process. The following spreadsheet provides some examples of how we define our use cases (a hypothetical example entry is also sketched after this list). They are written in agile user story form and are typically epic-level user stories (i.e. they include many smaller user stories, which are enumerated and used to implement the desired capability in agile sprints).
  3. Identify tools and gaps by mapping use cases to specific Cloud Management (Cloud Native and 3rd Party) tools to determine gaps. It is critically important to collect as much information as possible about each of the solutions to ensure they can actually deliver the functionality being claimed and also support a large percentage of the use cases you're evaluating against. Don't overlook the lifecycle management of the software itself (including installation, upgrading, etc.). Ask for live demos and proofs of concept in your actual environment before proceeding with any substantial investments.
  • There are no unicorns out there for Multi-Cloud Management, so don't get too enamored with trying to find one tool that can do it all. Well-researched data from industry analysts such as Gartner, as well as our own experiences, suggests that you will need a collection of tools to support your Multi-Cloud Management strategy. In cases where a tool being evaluated has gaps, you need to ensure it can be easily extended (via pluggable, technology-agnostic interfaces) to support additional capabilities that may or may not be on a planned roadmap. This allows you to invest in extending the platform to help close any identified gaps without being at the mercy of a specific vendor. It also gives you ultimate flexibility, as you can continue to leverage those extensions should you choose to move off of that platform to an alternate solution.
  4. Implement the defined use cases via a buy, build, and/or leverage open source option. In many cases, the tool you choose may target only a specific environment. For example, you may implement specific use cases that apply only to Public Cloud, and therefore the tool(s) you select would not be the same ones you use to solve similar use cases in your Private Cloud environment. This could be due to tool limitations for Private Cloud environments, or because there are already existing solutions that you've built or purchased for that environment.
  • Recall the eight core principles we defined in our original blog: federated identity model, micro-service based architecture, RESTful APIs, infrastructure as code, plugin-based architecture, common hosting environment, change control, and containerization. Adhering to a set of core principles is important regardless of the approach. This will provide consistency and flexibility, and enable clean integration of these services into a coherent platform of services managed via a self-service GUI.
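As a concrete illustration of step 2, here is a hypothetical sketch of what a single epic-level use case entry might look like. The field names and content are illustrative assumptions, not the actual schema from our spreadsheet.

```python
# Hypothetical example of an epic-level use case entry, similar in spirit to the
# spreadsheet format described above (field names are illustrative, not Box's actual schema).
use_case = {
    "id": "CMF-SEC-001",
    "capability": "Security",
    "epic": ("As a security engineer, I want continuous CIS benchmark checks "
             "across all public cloud accounts so that misconfigurations are "
             "detected and routed to our SIEM within minutes."),
    "stakeholders": ["Security", "Compliance", "SRE"],
    "child_stories": [
        "Enumerate all cloud accounts and regions automatically",
        "Run CIS benchmark checks on a fixed schedule",
        "Export findings as JSON to the SIEM pipeline",
    ],
    "environments": ["AWS", "GCP", "Azure"],
}
```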

Let's explore the security, cost and capacity management, and infrastructure management CMF capabilities and how they leveraged the concepts in this model. Each of these solutions was completed by a different team and at a different time over the past 3 years at Box. However, each of them basically followed the general 4-step process we have converged on to help drive these types of decisions.

Security use case (buy):
Our evaluation begins on the premise of solving for the Identify, Detect, and Monitor components of our overall Cybersecurity Framework. Although we will cover that topic in depth in a forthcoming blog on Compliance and Security, let's address some key aspects of this framework as it applies to evaluating Cloud Management tooling buy, build, and/or leverage decisions. As a security team, we must be able to identify, detect, and monitor potentially malicious activity across a multi-cloud environment when an event occurs. The event should contain who, what, where, and when, so that the incident response team can remediate and address it in a timely manner to reduce the impact of any threat vectors. Monitoring a multi-cloud environment can be tricky for the simple reason that each provider has its own set of APIs to choose from and a varied scope as to what it is able to check for.
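As a rough illustration of the who/what/where/when requirement above, here is a minimal sketch of a normalized multi-cloud security event. The field names and example values are assumptions for illustration, not a specific SIEM or vendor schema.

```python
# Minimal sketch of a normalized multi-cloud security event capturing the
# who / what / where / when needed by an incident response team.
# Field names and values are illustrative assumptions, not a specific SIEM schema.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class CloudSecurityEvent:
    who: str        # principal or service account that made the change
    what: str       # action taken, e.g. a firewall rule modification
    where: str      # provider / account / region / resource identifier
    when: str       # ISO-8601 timestamp of the event
    severity: str   # triage hint for the incident response team

event = CloudSecurityEvent(
    who="svc-deployer@example-project.iam",
    what="firewall rule opened 0.0.0.0/0 on tcp/22",
    where="gcp/example-project/us-east1/fw-allow-ssh",
    when=datetime.now(timezone.utc).isoformat(),
    severity="high",
)
print(json.dumps(asdict(event), indent=2))
```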

Historically, we've seen many large enterprises put resourcing and staffing into building their own tooling geared specifically to their security needs. While this can be a great way to solve security issues in a multi-cloud environment, it comes at a cost: resourcing and staffing people who are versed in building custom tooling and pipelines. There are a number of open source cloud security tools that tackle the majority of Public Cloud providers (AWS, Azure, GCP), each with its own set of pros and cons. Here at Box, some of the key factors we considered when choosing whether to build or buy our own multi-cloud security tooling were:

  • The ability to apply the CIS benchmark across our entire public cloud infrastructure ranging from all the multi-cloud environments we use to the hybrid connectivity between them
  • Tooling that would provide our team with detailed information in an easy-to-digest format, such as JSON, to ingest into our SIEM platform (a simple sketch of this kind of findings export follows this list)
  • Being able to map all the details we need from a change or misconfiguration in our public cloud environments to make actionable security decisions both proactively and reactively
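The sketch below illustrates the second factor: running a CIS-style configuration check across multiple providers and emitting findings as newline-delimited JSON that a SIEM could ingest. The provider fetcher and the control itself are placeholders, not any particular vendor's or open source tool's API.

```python
# Hedged sketch: run a CIS-style configuration check over resources pulled from
# several cloud providers and emit findings as JSON for SIEM ingestion.
# The provider fetchers and the check are placeholders, not a specific vendor's API.
import json

def fetch_storage_buckets(provider):
    """Placeholder: in practice this would call each provider's inventory API."""
    return [{"provider": provider, "name": "example-bucket", "public": provider == "aws"}]

def check_no_public_buckets(resource):
    """Example CIS-style control: object storage must not be publicly readable."""
    return {
        "control": "storage.public-access-disallowed",
        "resource": f'{resource["provider"]}/{resource["name"]}',
        "status": "FAIL" if resource["public"] else "PASS",
    }

findings = [
    check_no_public_buckets(bucket)
    for provider in ("aws", "gcp", "azure")
    for bucket in fetch_storage_buckets(provider)
]

# Newline-delimited JSON is a common, easily ingested SIEM input format.
print("\n".join(json.dumps(f) for f in findings))
```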

Security practitioners, architects, and engineers know that it is hard to find one tool that can scale across many teams such as compliance, operations, and engineering. The ROI of a tool greatly increases when multiple groups within an organization can utilize and benefit from it; within the realm of security, cloud security tools can hold a lot more value than appears at face value. With some analysis and brainstorming, the right tool and solution for your security needs can be identified with relative ease.

The two key tools our security team chose to buy to further enhance the components that go into our monitoring and detection pipelines focus on detection and monitoring, and on dependency vulnerability analysis. The first tool, at a high level, is a public cloud detection and monitoring tool that provides CIS benchmarking across your multi-cloud environment. It's particularly helpful when you have a large footprint in the cloud. Compliance auditing becomes simpler with this tool because reports and findings are easily exportable in a format that can be shared with both auditors and our internal compliance team. In addition, there are numerous integrations (such as Jira, Splunk, and Slack) which help further automate your workflows. The second tool is a dependency vulnerability analysis tool that can be placed into a build pipeline such as Jenkins. Once image builds are uploaded to your image registry, this tool scans them for any known vulnerabilities.
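To illustrate how the second tool might sit in a build pipeline, here is a hedged sketch of a vulnerability gate that fails a build when a scanned image contains findings at or above a severity threshold. The scanner call, image reference, and CVE data are placeholders; a real pipeline would invoke whichever scanner you have chosen.

```python
# Hedged sketch of a build-pipeline gate in the spirit of the second tool described
# above: scan a freshly pushed image for known vulnerabilities and fail the build
# above a severity threshold. The scanner call is a placeholder, not a real vendor CLI.
import sys

SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def scan_image(image_ref):
    """Placeholder: a real pipeline would invoke the chosen scanner's API or CLI here."""
    return [
        {"id": "CVE-2020-0001", "severity": "medium", "package": "libexample"},
        {"id": "CVE-2020-0002", "severity": "critical", "package": "openssl"},
    ]

def gate(image_ref, fail_at="high"):
    """Return a non-zero exit code if any finding meets or exceeds the threshold."""
    threshold = SEVERITY_ORDER[fail_at]
    blocking = [v for v in scan_image(image_ref)
                if SEVERITY_ORDER[v["severity"]] >= threshold]
    for vuln in blocking:
        print(f'{image_ref}: {vuln["id"]} ({vuln["severity"]}) in {vuln["package"]}')
    return 1 if blocking else 0

if __name__ == "__main__":
    sys.exit(gate("registry.example.com/team/service:build-123"))
```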

Following this blog, we will release a more comprehensive overview of how we tackle other areas within our security stack to address the Protect, Respond, and Recover components of our Cybersecurity Framework. Since proactive security is always better than reactive, we'll demonstrate some of the other evaluation points used in implementing controls that stop malicious activity before it even has a chance to occur.

Cost and Capacity Management use case (build):
For capacity management teams, one of the main requirements is to have all the required metrics in a structured format that can be accessed easily using query languages like SQL. This enables the teams to generate various types of capacity reports, generate organic growth demands, run ad-hoc analysis, or even plan for more complicated scenarios such as data center migrations. There are a variety of use cases that require capacity teams to look at the same data from different perspectives: for example, the number of compute cores consumed by a service within a specific availability zone and across an entire region.
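With the metrics in a structured store, roll-ups like the availability zone and region example above reduce to simple aggregation queries. The table and column names below are hypothetical assumptions, shown as SQL strings the way a capacity tool might embed them.

```python
# Hedged sketch: the kind of SQL a capacity team might run against a structured
# metrics store. Table and column names are hypothetical assumptions.
CORES_BY_SERVICE_AZ = """
SELECT service_name,
       region,
       availability_zone,
       SUM(cpu_cores) AS cores_consumed
FROM   capacity_metrics
WHERE  metric_date = :as_of_date
GROUP  BY service_name, region, availability_zone
ORDER  BY cores_consumed DESC;
"""

# Rolling the same data up to the region level is just a coarser GROUP BY.
CORES_BY_SERVICE_REGION = """
SELECT service_name,
       region,
       SUM(cpu_cores) AS cores_consumed
FROM   capacity_metrics
WHERE  metric_date = :as_of_date
GROUP  BY service_name, region;
"""
```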

As we gain confidence in the accuracy and consistency of the data and metrics that we are collecting, we will shift our emphasis to automating regular day-to-day activities. For example, automated alerts based on overall site capacity health, threshold-based right-sizing, and time-to-live calculations that are computed in near real time. Ultimately, we will start leveraging machine learning models and anomaly detection techniques to move from threshold-based logic to a more dynamic, context-based system. This will enable us to be more proactive rather than reactive in our capacity management processes.
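One of those day-to-day automations, the time-to-live calculation, can start out quite simple before any machine learning is involved. The sketch below estimates capacity runway from current usage and an observed growth rate; the thresholds and numbers are illustrative, not Box data.

```python
# Hedged sketch of a near-real-time "time to live" (capacity runway) calculation:
# given current usage and an observed growth rate, estimate when a resource pool
# will hit its alerting threshold. Numbers are illustrative, not Box data.

def days_until_exhaustion(used_cores, total_cores, daily_growth_cores,
                          alert_threshold=0.85):
    """Days until usage crosses the alerting threshold of the pool."""
    ceiling = total_cores * alert_threshold
    if used_cores >= ceiling:
        return 0
    if daily_growth_cores <= 0:
        return float("inf")
    return (ceiling - used_cores) / daily_growth_cores

# A pool of 10,000 cores with 7,800 used and ~25 cores/day of organic growth:
print(round(days_until_exhaustion(7_800, 10_000, 25)))  # -> 28
```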

While there are many tools that can help achieve these business outcomes, we chose to leverage data from an existing metrics collection and visualization tool that was already well entrenched in Box operational processes and build our own capacity analysis tool on top of it. That tool captures all the system-level and app-level metrics as time series, and we use this data to provide cost and capacity management insights. So, instead of potentially duplicating efforts, we decided to focus on how to effectively use the data we were already collecting to achieve our desired outcomes.

The tool that we built, internally referred to as Spectrum, supports multiple features that allow the user to explore various configurable options based on the ingested metrics for each service. Below are the high-level details of how the tool works:

  • Starting with the overall list of metrics that are ingested for each service, we define a list of default metrics that will be collected for every service by using templates. In addition, we can define custom app-level metrics and back-fill durations, and configure thresholds at the service level. Daily jobs run to identify new services and populate the default metrics for each service, using our metrics tool APIs, without any manual intervention.
  • The metrics are then put through a data transformation and aggregation process. This feeds into a scoring mechanism that is used to categorize services into different states: hot, cold, or good (a simplified sketch of this categorization step follows this list). The categorization is based on different characteristics of a service: constraining metrics, thresholds, DR capacity, and traffic patterns. Using the service score, we can then derive several types of analytics, including the expected rightsize for a specific service.
  • End users are provided with a UI that visualizes and exposes the controls to manipulate the various configurable options. It provides high-level information about a service, including the number of cores, number of instances, and recommended allocation or possible opportunities, with the ability to filter or drill down by various dimensions such as cloud platform, SKU type, region, and availability zone.
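The following is a simplified, hypothetical sketch of the categorization step described in the second bullet: bucket a service as hot, cold, or good based on its constraining metric. The thresholds and field names are illustrative assumptions, not Spectrum's actual scoring logic.

```python
# Hedged sketch of the kind of scoring step described above: take aggregated
# utilization for a service and bucket it as hot, cold, or good based on its
# constraining metric. Thresholds are illustrative, not Spectrum's actual logic.

def categorize_service(utilization_by_metric, hot_threshold=0.80, cold_threshold=0.30):
    """Categorize on the constraining metric (the most utilized dimension)."""
    constraining_metric, peak = max(utilization_by_metric.items(), key=lambda kv: kv[1])
    if peak >= hot_threshold:
        state = "hot"
    elif peak <= cold_threshold:
        state = "cold"
    else:
        state = "good"
    return {"state": state, "constraining_metric": constraining_metric, "peak": peak}

print(categorize_service({"cpu": 0.55, "memory": 0.86, "disk": 0.40}))
# -> {'state': 'hot', 'constraining_metric': 'memory', 'peak': 0.86}
```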

This tool is supported and maintained by our capacity team. A future blog post will go into the architecture for this tool as well as some more advanced capabilities that might be helpful to other capacity teams that may be trying to solve similar use cases.

Infrastructure management use case (buy and build):
Our Data Center operations team manages thousands of systems (servers, storage, and networking). One of the many areas that requires continuous focus is BIOS and firmware management. In order to ensure efficient operational procedures in these areas, Box needed to implement a system to not only upgrade the BIOS and firmware of our hosts, but also ensure that all hosts received the upgrades, and monitor the state of the fleet for version compliance. Any tool we bought or built would need to meet a few minimum criteria: API endpoints for automation, a user-friendly front end, the ability to be hosted on premises, and full control over what would be deployed and how.

With that in mind, we looked at the market for existing solutions while also blueprinting what an internally built system would look like. By splitting our approach, each effort helped the other: blueprinting how we would implement our own tool helped us refine our technical requirements, and looking at what was already out there helped us sharpen our use cases and what we wanted out of such a tool.

A purchased tool would not need to be built, just vetted, installed, and configured, which would enable us to hit our goals much more quickly. Buying a tool would also mean having the support of an entire dedicated staff and a separate development team to escalate issues that needed to be resolved. That said, there are also plenty of disadvantages to consider when buying a solution; in this case, they were the lack of control over the product's functionality and the fact that it only supported one hardware platform. Weighing the pros and cons of each, we opted to take a phased approach.

We designed a timeline where the end state of the project was a “north star, best of all worlds” solution. The project would be divided into two major phases. First, to realize immediate benefits without delaying other projects, we invested in a purchased tool that would solve our short-term problems, but could also be implemented in such a way that it would be easy to migrate off of. While this phase was being rolled out, we would take what we learned from the deployment and ultimately design and build an internal tool that meets all of our use cases. This approach enabled us to see results right away while helping us better design and build our long-term solution.

The solution we design will need to be hardware-platform agnostic and leverage industry standards for server management. In order to enable our vision of supporting all server hardware, from ODM (whitebox and open source) to OEM, we need to ensure that our tooling does not rely on any proprietary tools or software. In addition, we also need to implement a highly automatable solution with robust APIs that do not have to be redesigned as our provisioning stack evolves. With these requirements in mind, the Redfish protocol from the DMTF will be at the heart of any solution we move forward with.
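As a small illustration of why Redfish fits these requirements, the sketch below walks a host's Redfish Systems collection and checks BIOS versions against an expected value. The endpoint paths follow the DMTF Redfish standard, while the BMC host name, credentials, and expected version are hypothetical placeholders.

```python
# Hedged sketch of a fleet BIOS-version compliance check over the DMTF Redfish API.
# Endpoint paths follow the Redfish standard; host name, credentials, and the
# expected version are hypothetical placeholders.
import requests

EXPECTED_BIOS = "2.8.1"

def bios_versions(bmc_host, auth):
    """Yield (system id, BIOS version) for each system exposed by a BMC."""
    base = f"https://{bmc_host}/redfish/v1"
    # verify=False only because BMCs commonly ship self-signed certificates.
    systems = requests.get(f"{base}/Systems", auth=auth, verify=False, timeout=10).json()
    for member in systems.get("Members", []):
        system = requests.get(f"https://{bmc_host}{member['@odata.id']}",
                              auth=auth, verify=False, timeout=10).json()
        yield system.get("Id"), system.get("BiosVersion")

def check_host(bmc_host, auth):
    """Return (system id, version, compliant?) tuples for one BMC."""
    return [(system_id, version, version == EXPECTED_BIOS)
            for system_id, version in bios_versions(bmc_host, auth)]

# Example: report compliance for one host's BMC (placeholder address/credentials).
for system_id, version, ok in check_host("bmc01.example.internal", ("admin", "password")):
    print(f"{system_id}: BIOS {version} {'OK' if ok else 'NON-COMPLIANT'}")
```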

Box CMF use case learnings
The Box CMF capabilities described above represent just a fraction of the successful buy, build, and leverage open source decisions we've made. However, there have also been a number of use cases that failed to produce the desired business outcomes. It's important to note that these failures were all proofs of concept and never made it to production-level support. We are sharing these use cases to highlight the importance of having a thoughtful and thorough methodology behind this process, so that you can avoid some of the issues that we've experienced. Both of the cases discussed below center around automating critical parts of our data center server lifecycle management process.

The first example involved capabilities to support use cases for end-to-end provisioning of one or more servers. We originally evaluated and pursued an open source solution. The primary use cases centered around configuring server BIOS settings, deploying OS images, and the subsequent installation of service-specific software packages, based on the service that would ultimately be deployed on the server. Unfortunately, we were at the bleeding edge of some of the initial versions of the open source software, and although there were some good capabilities that appeared to match our key use cases, critical functionality was missing, which made the open source solution unusable for us. Fortunately, we caught these limitations before fully deploying to our production environment, so there was ultimately no impact other than the time and resources lost evaluating the solution.

We then pursued an enterprise license with the vendor to resolve some of the functionality issues, which also provided some professional services help with deploying the solution to our data centers. Unfortunately, this resulted in a lot of delays, as we needed to ensure the desired capabilities were actually available in the solution before we deployed it into our production environments. After a number of iterations with the vendor, we were able to get a number of those capabilities built, but there were other non-technical issues that resulted in us abandoning this solution.

The second example included the same capabilities, plus an additional use case enabling the ability to define server templates that could be used in a server-agnostic manner. These server templates would include BIOS settings, the OS, and even post-install server package details. We ultimately made a buy decision and pursued a solution with a vendor to deliver these capabilities. This was especially compelling because there were other well-known products that supported server profile/template based provisioning, but they were all proprietary and only applied to that vendor's server hardware. Since we required the ability to support multiple server types, we did not want to get locked in on a specific vendor's hardware. The initial discussions and solution details were great, and it appeared we had a solution in hand that would deliver significant business value to our server provisioning process. Unfortunately, the details we reviewed on paper were not what could actually be delivered in reality.

In this particular example, we made a strategic decision to get a full proof of concept up and running on a set of SKUs we supported in our data centers. After the solution details were confirmed, we attempted to get the vendor to demonstrate these capabilities, but it quickly became clear that there was no out-of-the-box solution that gave us enough confidence that we could extend it with minimal effort to support the entire spectrum of use cases we had in mind. In fact, there were no real enterprise use cases that confirmed the capabilities discussed in our solution details.

Key Learnings

Throughout each of the efforts we've described in this post, as well as others, a few common themes have emerged. One, make sure you have a clearly defined set of use cases that you can use to evaluate potential buy, build, or leverage open source options against. Defining meaningful use cases will require that you solicit input from all of the key stakeholders that will rely on the service(s) being implemented. Clear and concise use cases are critical, as they remove ambiguity and ensure an objective approach to determining how best to deliver the desired business outcomes.

Second, as you engage specific vendors as part of the evaluation process, make sure to request live demos and preferably a proof of concept implementation in your actual data center environment. Ensure you include demonstration of the full life cycle management (i.e. installation of the software, upgrading, extending via pluggable interfaces etc.) of the software under review. Vendors with mature and stable software will have no issues with these requests and should be able to demonstrate their capabilities within a relatively short period of time. There should be little or no cost required to support these demos or proof of concepts. Be very cautious of agreeing to any type of professional service support prior to validating that the desired use cases can actually be implemented with the software under review.

Finally, be sure to define a set of core principles to guide how the services will be implemented and managed. For example, we've set explicit requirements on leveraging micro-services, RESTful APIs, infrastructure as code methodologies, plugin-based architectures, containerization, and more, to ensure that we maintain consistency and flexibility in how we deploy, extend, and manage our services, regardless of whether we build, buy, or leverage open source solutions.

Conclusion
The 4-step process we've converged on was the result of learnings from multiple independent efforts across many teams. We hope the information in this blog can help start a conversation about how your team manages these types of decisions and share best practices that will improve future efforts. You will note that throughout this post we intentionally left out individual tool vendor names. Our goal here was not to promote or diminish a specific tool; our focus was on the overall process and how we approached it. If you're interested in additional details about specific tools we use here at Box, feel free to reach out and we would be happy to discuss them in more detail.

Our next blog post, on our Multi-Cloud Identity and Access Management model, is forthcoming, and we look forward to sharing our thoughts to help others that may also be on a Multi-Cloud journey. If you're interested in joining us, check out our open opportunities.

Co-Authors: Clay Alvord, Garth Booth, Suresh Gonugunta, and Ashish Patel

References:
A Guidance Framework for Selecting Cloud Management Platforms and Tools, Gartner
