Build it: Technical considerations
As we started building our CIO Hybrid Cloud platform, we were given a very simple directive: build a hybrid cloud platform on Red Hat OpenShift. We were also empowered to start small, embrace failing fast, and experiment. We extended the hybrid, integrated framework within our existing ecosystem and were finally free to implement features requested directly by our users. As we developed our platform, we took these concepts to heart, releasing our first production version in 1.5 months. In the time since, we've continued to release often, revisiting our technical decisions and features as we make our hybrid cloud vision a reality.
So, what were some of our key technical decisions? Below we'll cover some of the decisions we made around our infrastructure, how we install our custom components and configurations, and how we enable teams to deploy their applications. We'll also tell you about the upcoming unified hybrid cloud portal experience we're developing for our customers.
Infrastructure
While building on OpenShift Container Platform was a fait accompli, it is a design decision that we firmly supported. If you haven't yet heard of OpenShift, it is Red Hat's enterprise distribution of Kubernetes: a fully supported distribution that includes commercial support, integrated upgrades, and hundreds of fixes for defect, security, and performance issues in upstream Kubernetes with every release. One of the exciting features of OpenShift, and Kubernetes in general, is its ability to provide a common experience no matter where it is running or even what it is running on. OpenShift running in a private cloud provides the same experience as OpenShift running in a public cloud, and OpenShift on zLinux feels like OpenShift on x86.
Knowing that we were building the platform on OpenShift, we needed to decide what we were going to run it on. OpenShift supports running on Red Hat Enterprise Linux (RHEL) or RHEL CoreOS (RHCOS), an operating system designed specifically for running container workloads. Even though RHEL is a rock-solid platform, we chose to build on RHCOS for several reasons. First, RHCOS is built from the same enterprise-ready components as RHEL and is backed by the same quality, security, and control measures. In addition, RHCOS provides a fully immutable operating system that runs just enough components to bootstrap OpenShift, with no extra services installed. Finally, RHCOS builds on the management approach of OpenShift itself: the RHCOS configuration is stored within the cluster. This makes upgrades (and rollbacks!) a simple, integrated part of operating OpenShift, and lets us treat the cluster in a cloud-native way: reloading rather than patching.
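Because the node configuration lives in the cluster (as MachineConfig and MachineConfigPool objects), an upgrade rolling across the nodes can be observed through the same Kubernetes API as everything else. Here is a minimal sketch, not our actual tooling, that uses client-go's dynamic client to watch that rollout; it assumes a kubeconfig in the default location with permission to read machine configuration:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig from its default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// MachineConfigPools aggregate the RHCOS configuration for each node role
	// ("master", "worker", ...) and report rollout progress in their status.
	gvr := schema.GroupVersionResource{
		Group:    "machineconfiguration.openshift.io",
		Version:  "v1",
		Resource: "machineconfigpools",
	}
	pools, err := client.Resource(gvr).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pools.Items {
		updated, _, _ := unstructured.NestedInt64(p.Object, "status", "updatedMachineCount")
		total, _, _ := unstructured.NestedInt64(p.Object, "status", "machineCount")
		fmt.Printf("%s: %d/%d nodes on the desired configuration\n", p.GetName(), updated, total)
	}
}
```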
Once we determined that we would be running OpenShift on top of RHCOS, our next step was to determine how we were going to build out our clusters. From the beginning, we knew that we'd need to deploy multiple clusters across multiple public and private cloud regions and interact with a variety of other services, such as DNS and firewalls. As we already had experience provisioning systems at scale, we knew that automation was critical, and we chose to build on Terraform: an extensible platform that enables off-the-shelf integration with multiple infrastructure providers (including IBM Cloud), with provisioners such as Ansible, and with any number of APIs, including custom ones. With Terraform, we describe our environment in configuration files that we place under source control, and the system generates the activities needed to reach the desired configuration for a targeted provider. In short, we can define our environment in a readable file and have Terraform determine how to interact with our various targets, separating the what from the how. This allowed our team to easily build and extend our clusters in a consistent manner, avoiding the natural drift that is common with a more manual approach.
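To give a flavor of what this looks like in practice, here is a minimal, hypothetical sketch of driving Terraform's init/plan/apply cycle from Go with HashiCorp's terraform-exec library (our actual pipelines differ; the ./clusters/us-east directory and a terraform binary on the PATH are assumptions):

```go
package main

import (
	"context"
	"log"

	"github.com/hashicorp/terraform-exec/tfexec"
)

func main() {
	ctx := context.Background()

	// Working directory containing the cluster's *.tf definitions (hypothetical path).
	tf, err := tfexec.NewTerraform("./clusters/us-east", "terraform")
	if err != nil {
		log.Fatal(err)
	}

	// terraform init: download providers (e.g. the IBM Cloud provider) and modules.
	if err := tf.Init(ctx, tfexec.Upgrade(true)); err != nil {
		log.Fatal(err)
	}

	// terraform plan: compute the actions needed to reach the declared state.
	changes, err := tf.Plan(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if changes {
		// terraform apply: converge the real infrastructure on the configuration.
		if err := tf.Apply(ctx); err != nil {
			log.Fatal(err)
		}
	}
}
```

Because the configuration files fully describe the cluster, re-running the plan step against an untouched environment yields no changes, which is exactly the drift check described above.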
Custom components
At this point, we had determined how to build the runtime environment, but OpenShift is only part of our story. We still needed custom components and configurations in order to deliver our platform. Initially, this was a small set of things, deployed in an ad hoc way using a variety of approaches: Helm, Kustomize, or simple kubectl-based scripts. While this worked well at first, as the number of components and clusters grew, we found the ad hoc process to be cumbersome, complex, and error-prone. As a team, we were also looking to improve, which meant continually re-evaluating our decisions, and we agreed that we needed a new, safer approach.
We needed an approach that built on our infrastructure-as-code experience, where our deployment process and artifacts are checked into source control and deployments happen when changes are approved: a process that sounds a lot like GitOps. For this approach to work, we needed a safe way to store sensitive information, like passwords, in Git. To accomplish this, we chose to leverage the Sealed Secrets operator. This operator enabled us to commit sealed (read: encrypted) secrets into Git; when deployed, they are automatically decrypted into standard Kubernetes Secrets. With secret management solved, we still needed an automated way of deploying our components and configuration. After some research, we elected to leverage Argo CD. If you are unfamiliar with Argo CD, it is a GitOps continuous delivery tool for Kubernetes. It monitors our running applications and configurations and compares their state to the desired state defined in our Git repository. With Argo CD in place, we had a consistent, automated deployment triggered by a Git commit and the ability to observe that our applications were in sync. These two tools aligned our ability to manage our custom code with our ability to manage our infrastructure, resulting in a safer, more repeatable deployment process.
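As an illustration of the wiring (not our production configuration), the sketch below emits an Argo CD Application manifest that tells Argo CD to keep a cluster converged on a Git path; the repository URL, path, and namespaces are hypothetical placeholders:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/yaml"
)

func main() {
	// An Argo CD Application: "deploy whatever is in this Git path into this
	// cluster and namespace." Repo URL, path, and namespaces are placeholders.
	app := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "argoproj.io/v1alpha1",
		"kind":       "Application",
		"metadata": map[string]interface{}{
			"name":      "platform-components",
			"namespace": "argocd",
		},
		"spec": map[string]interface{}{
			"project": "default",
			"source": map[string]interface{}{
				"repoURL":        "https://git.example.com/platform/deploy.git",
				"path":           "components/overlays/prod",
				"targetRevision": "main",
			},
			"destination": map[string]interface{}{
				"server":    "https://kubernetes.default.svc",
				"namespace": "platform",
			},
			// Automated sync keeps the cluster converged on the repo;
			// selfHeal reverts manual drift, prune removes deleted resources.
			"syncPolicy": map[string]interface{}{
				"automated": map[string]interface{}{
					"prune":    true,
					"selfHeal": true,
				},
			},
		},
	}}

	out, err := yaml.Marshal(app.Object)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```

Committing a manifest like this, alongside the SealedSecrets it depends on, is the entire deployment step: Argo CD notices the commit and converges the cluster.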
Applications
As mentioned, our hybrid cloud platform is not "managed multi-tenant Kubernetes", but rather an opinionated platform. One of our initial opinions was to simplify the deployment process for an application, hide the complexity of the standard OpenShift resources, and enforce several business rules. Initially, this was done via custom code that created and managed these resources, an approach that worked but limited user interaction to our portal. As our users asked for an API/CLI along with more fine-grained options, something we agreed with and fully supported, we realized we needed to revisit our approach.
Rather than developing our own custom API, and potentially a custom CLI as well, we decided to embrace Operators and opened up access to the clusters. If you haven't run into Operators yet, they are built from a Custom Resource Definition (CRD), a custom extension to the Kubernetes API, and a controller; together, they provide a means of adding new capabilities to a cluster. Building on the Operator SDK, we introduced an application operator, which provides a Custom Resource Definition, a resource controller, and the application logic for creating, managing, and auditing applications on our platform. The application CRD can be seen as the "primary" custom resource of our platform. It empowers our users with a simple but versatile API definition that leverages the existing OpenShift/Kubernetes CLI, while enabling the platform to create, update, and validate the underlying resources. In addition, it provides a simple entry point for users who are new to containers, but does not limit those who already have knowledge of OpenShift.
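To make the pattern concrete, here is a minimal, hypothetical sketch of such a controller written against controller-runtime, which underpins the Operator SDK; the appv1 package and the business rules in the comments stand in for our real types and logic:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	appv1 "example.com/platform/api/v1" // hypothetical module hosting the Application CRD types
)

// ApplicationReconciler reconciles the platform's Application custom resources.
type ApplicationReconciler struct {
	client.Client
}

// Reconcile is called whenever an Application (or an object it owns) changes.
// It drives the cluster toward the state declared in the custom resource.
func (r *ApplicationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var app appv1.Application
	if err := r.Get(ctx, req.NamespacedName, &app); err != nil {
		// The resource was deleted; owned objects are garbage-collected
		// automatically via their owner references.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Here the controller would create or update the underlying OpenShift
	// resources (Deployment, Service, Route, ...) and enforce business rules,
	// e.g. validating the requested resource limits before applying them.

	return ctrl.Result{}, nil
}

// SetupWithManager registers the controller for Application events.
func (r *ApplicationReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appv1.Application{}).
		Complete(r)
}
```

With a controller like this in place, users manage applications with the familiar oc or kubectl commands, which is why no custom CLI was needed.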
On our journey of building our hybrid cloud platform, we've been empowered to make the technical decisions and develop the opinions that we felt were needed to deliver, embracing ideals like failing fast and experimentation. This opinionated approach has enabled us to avoid getting stuck in endless analysis and instead focus on delivering features and functionality quickly. At the same time, it is very important to remain open to revisiting decisions and opinions, as the ultimate success of the platform lies in the hands of our customers.
Ryan DeJana is Senior Technical Staff Member, CIO Hybrid Cloud Platforms at IBM based in Boulder, CO. The above article is personal and does not necessarily represent IBM’s positions, strategies or opinions.