The “How” of Cloud-Native: Technology and Infrastructure Perspective
Kyle Brown and Kim Clark
While the people, process, architecture and design issues we covered in the last two articles are all critical enablers for cloud native, cloud native solutions ultimately sit upon technology and infrastructure, which is what we’re going to cover in this article.
Cloud infrastructure is all about abstracting away the underlying hardware to enable solutions to be rapidly self-provisioned and scaled. It should enable administration of different language and product runtimes using the same operational skills. Furthermore, it should promote automation of operations and provide a framework for observability. Let’s take a closer look at the key characteristics of that infrastructure that a cloud native approach leverages.
Elastic, agnostic, secure platform
For cloud native to work, we have to ask a lot from the platform on which we deploy our components. The platform should help us to not worry about non-functional concerns by using common mechanisms across whatever we deploy, and likewise should “burn in” security. Therefore, our top requests from the platform should be:
- Elastic resource capacity
- Agnostic deployment and operations
- Secure by default
If developers are to increase their productivity, they need to be able to focus purely on writing code that creates business value. That means the platform should take care of concerns such as load balancing, high availability, scalability, resilience, and even some elements of disaster recovery. At deployment, we should be able to specify high-level non-functional requirements and let the platform do the rest. We will use Kubernetes container orchestration as a powerful (indeed ubiquitous) example of this kind of thinking, but cloud native is definitely not limited to the Kubernetes platform.
A Kubernetes cluster provides automated, elastic provisioning of resources such as CPU, memory, storage and networking based on the requirements of the component being deployed. The pool of resources can be spread across many physical machines, and over multiple availability zones in many separate regions. It takes on the responsibility of finding the resources you need, and deploying your components to them. You need only specify your requirements: what resources you need, how they should or should not be spread out, and how they should be scaled and upgraded. Arguably, we could also have said that about platforms based on virtual machines, but as we will see, containers bring something more to the party.
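To make "you need only specify your requirements" concrete, here is a minimal sketch of what such a specification might look like. It builds a Kubernetes Deployment manifest as a plain Python dictionary and prints it as JSON, which Kubernetes accepts alongside YAML. The component name, image reference, replica count and resource figures are illustrative assumptions, not recommendations.

```python
import json


def build_deployment(name: str, image: str, replicas: int = 3) -> dict:
    """Return a Kubernetes Deployment manifest as a plain dictionary."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # How the component should be scaled.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            # How it should be upgraded: replace pods gradually.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 1, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # How replicas should be spread: evenly across zones.
                    "topologySpreadConstraints": [
                        {
                            "maxSkew": 1,
                            "topologyKey": "topology.kubernetes.io/zone",
                            "whenUnsatisfiable": "ScheduleAnyway",
                            "labelSelector": {"matchLabels": labels},
                        }
                    ],
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # What resources each replica needs.
                            "resources": {
                                "requests": {"cpu": "250m", "memory": "256Mi"},
                                "limits": {"cpu": "500m", "memory": "512Mi"},
                            },
                        }
                    ],
                },
            },
        },
    }


# Hypothetical component name and image reference, for illustration only.
manifest = build_deployment("orders", "registry.example.com/orders:1.0.0")
print(json.dumps(manifest, indent=2))
```

Nothing in the manifest says which machine to use or how to start the process. It states requirements, and the platform works out placement.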
Assuming we adhere to the architectural principles from the previous section, delivering application components in containers enables Kubernetes to perform deployment and subsequent operations in a standardized way, regardless of the contents of any given container. If the components are largely stateless, disposable, fine-grained, and well-decoupled, this makes it easy for the platform to deploy, scale, monitor, and upgrade them in a common way, without knowledge of their internals. Standards like Kubernetes are part of a trend of gradually moving away from proprietary installation and topology configuration for each software product. Now they all work the same way, and we benefit from operational consistency, reduced learning curves, and broader applicability of add-on capabilities such as monitoring and performance management.
Finally, we want to have security burnt in to the platform so we can be confident it is a safe environment for our applications from day one. We should not need to re-engineer core aspects of security every time we design a new component; we should instead be able to inherit a model from the platform. Ideally, this should cover identity management, role-based access to administration, and the securing of external access and internal communications. We will note that this is an example where Kubernetes itself is only a partial solution; added elements such as a service mesh for internal communication and an ingress controller for inbound traffic are also required for a complete security solution.
We can only achieve operational agility if the components are as straightforward and lightweight as possible. We can boil this down into three main properties.
- Fast start up/shut down
- File-system based install and configuration
- File-system based code deployment
To manage availability and scaling, components must be able to be rapidly created and destroyed. That means the runtimes inside the containers must start up and shut down gracefully and optimally. They must also be able to cope with ungraceful shutdowns. This implies many possible optimizations, from removing dependencies and reducing the memory population performed during initialization through to enabling a “shift left” of compilation into the image build. At a minimum, runtimes should be able to start within a matter of seconds, but that bar is constantly being lowered (e.g. by Quarkus).
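As a rough illustration of graceful shutdown, the sketch below shows a worker loop that stops promptly when asked. Kubernetes sends a container SIGTERM when it wants it to stop, and follows up with SIGKILL if the process has not exited within the grace period. The function names and the placeholder unit of work are assumptions made for this example.

```python
import signal
import threading

shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: record the request and let the main loop exit cleanly."""
    shutdown_requested.set()


def serve(poll_interval: float = 0.1) -> int:
    """Do short units of work until shutdown is requested; return the count."""
    completed = 0
    while not shutdown_requested.is_set():
        # Placeholder for one short unit of work. Keeping units short
        # means the loop notices a shutdown request quickly.
        completed += 1
        shutdown_requested.wait(poll_interval)
    # A real component would release connections and flush buffers here.
    return completed


def main() -> None:
    """Entry point a container would run: install handlers, then serve."""
    signal.signal(signal.SIGTERM, request_shutdown)
    signal.signal(signal.SIGINT, request_shutdown)
    serve()
```

Because an ungraceful kill can still happen at any point, the clean-up step should be treated as an optimization and not as something correctness depends on.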
We also want builds to be as straightforward and timely as possible if we are to embrace continuous integration. Most modern runtimes have removed the need for separate installation software, instead simply allowing files to be laid down on a file system. Similarly, since an immutable image by definition shouldn’t be changed at runtime, these runtimes typically read their configuration from properties files rather than receiving it through custom runtime commands.
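File-based configuration can be sketched in a few lines. The example below assumes a simple key=value properties format with # comments, and a hypothetical file path supplied by whoever starts the component (on Kubernetes, such a file is commonly mounted into the container from a ConfigMap). Real properties formats have more rules than this parser handles.

```python
from pathlib import Path
from typing import Dict


def load_properties(path: str) -> Dict[str, str]:
    """Parse a simple key=value file, skipping blank lines and # comments."""
    props: Dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            props[key.strip()] = value.strip()
    return props
```

The image itself stays unchanged between environments; only the file laid down beside it differs.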
Equally, your actual application code can also be placed on the filesystem rather than being deployed at runtime. The combination of these features enables builds to be done rapidly through simple file copies and is well suited to the layered filesystem of container images.
What if we could deliver the entire blueprint for how to stand up our solution at runtime — including all aspects of infrastructure and topology — as part of the release? What if we could store that blueprint in a code repository, and trigger updates just like we do with our application code? What if we could laser-focus the role of operations staff into making the environment autonomous and self-healing? These questions lead to some key ways in which we should approach the way we work with infrastructure differently:
- Infrastructure as code
- Repository triggered operations (GitOps)
- Site reliability engineering
Image based deployment, as discussed earlier, has already brought us a long way toward ensuring greater consistency. However, that only delivers the code and its runtime. We also need to consider how the solution is deployed, scaled and maintained. Ideally, we want to be able to provide this all as “code” alongside our component’s source to ensure that it is built consistently across environments.
The term “infrastructure as code” initially focused on scripting low level infrastructure such as virtual machines, networking, storage and more. Scripting infrastructure isn’t new, but increasingly specific tools such as Chef, Puppet, Ansible, and Terraform have advanced the art of the possible. These tools begin with the assumption that there is available hardware, and they provision and then configure virtual machines upon it. What is interesting is how this picture changes when we move to a container platform.
In a perfect world, our application should be able to assume that there is, for example, a Kubernetes cluster already available. That cluster might itself have been built using Terraform, but that is irrelevant to our application; we just assume the cluster is available. So what infrastructure as code elements do we need to provide to fully specify our application now? Arguably, that would be the various Kubernetes deployment definition files packaged in Helm charts, or Kubernetes Operators and their associated Custom Resource Definition (CRD) files. The result is the same: a set of files that can be delivered with our immutable images and that completely describe how the application should be deployed, run, scaled and maintained.
So if our infrastructure is now code, then why not build and maintain our infrastructure the same way as we do our application? Each time we commit a significant change to the infrastructure we can trigger a “build” that deploys that change out to environments automatically. This increasingly popular approach has become known as GitOps. It’s worth noting that this strongly favors infrastructure tools that take a “declarative” rather than “imperative” approach. You effectively provide a properties file that describes a “to-be” target state, then the tooling works out how to get you there. Many of the tools mentioned above can work in this way, and the declarative approach is fundamental to how Kubernetes operates.
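The declarative idea can be illustrated with a toy reconciliation step: given a desired state and an observed state, work out the actions needed to converge. This is a deliberately simplified sketch with hypothetical component names, not how any particular tool is implemented.

```python
from typing import Dict, List


def plan_changes(desired: Dict[str, int], actual: Dict[str, int]) -> List[str]:
    """Compare desired replica counts with observed ones and list the actions."""
    actions: List[str] = []
    for name, want in sorted(desired.items()):
        have = actual.get(name, 0)
        if have < want:
            actions.append(f"scale up {name}: {have} -> {want}")
        elif have > want:
            actions.append(f"scale down {name}: {have} -> {want}")
    # Anything running that is no longer declared should be removed.
    for name in sorted(set(actual) - set(desired)):
        actions.append(f"remove {name}")
    return actions


desired = {"orders": 3, "payments": 2}   # the "to-be" state held in the repository
actual = {"orders": 1, "legacy": 1}      # what is currently running
print(plan_changes(desired, actual))
# ['scale up orders: 1 -> 3', 'scale up payments: 0 -> 2', 'remove legacy']
```

When the observed state already matches the declaration, the plan is empty, which is what makes it safe to re-run the same step on every commit.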
Real systems are complex, and constantly changing, so it would be unreasonable to think that they will never break. No matter how good a platform like Kubernetes is at automating dynamic scaling and availability, issues will still arise that will at least initially require human intervention to diagnose and resolve. However, in the world of automated deployments of fine-grained, elastically scaled components, it will become increasingly impractical to sustain repeated manual interventions, and these need to be automated wherever possible. To enable this, operations staff are being retrained as “site reliability engineers” (SREs) to write code that performs the necessary operations work. Indeed, some organizations are explicitly hiring or moving development staff into the operations team to ensure an engineering culture. Increasingly, ensuring the solution is self-healing means that when the systems scale up, we no longer have the problematic need for a corresponding increase in operations staff.
Observability and monitoring
As organizations move towards more granular containerized workloads, treating monitoring as an afterthought is untenable. This is yet another example where we need to “shift left” in order to be successful in cloud native. Addressing the problem depends upon being able to answer three questions about our components:
- Is it healthy? Does your app have easily accessible status?
- What’s going on inside it? Does your app use platform-neutral logging and tracing effectively?
- How is it interacting with other components? Do you take advantage of cross-component correlation?
Observability is not a new term, although it is seeing renewed use in IT and particularly around cloud native solutions. Its definition comes from the very old discipline of control theory. It is a measure of how well you can understand the internal state of a system based on what you can see from the outside. If you are to be able to responsively control something, you need to be able to accurately observe it.
Some definitions draw a distinction in which monitoring is for your known unknowns, such as component status, whereas observability is for your unknown unknowns: finding answers to questions you hadn’t thought to ask before. In a highly distributed system, you’re going to need both.
The platform must be able to easily and instantly assess the health of the deployed components in order to make rapid lifecycle decisions. Kubernetes for example requests that components implement simple probes that report whether a container has started, is ready for work, and is healthy.
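A minimal sketch of what such probe endpoints might look like is shown below, using only the Python standard library. The paths /healthz and /ready and the port are assumptions chosen for illustration. Kubernetes lets you point its startup, readiness and liveness probes at whatever paths you configure, and treats HTTP status codes from 200 to 399 as success.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once start-up work (warming caches, opening connections) is complete.
ready = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            status = 200  # the process is alive and able to answer
        elif self.path == "/ready":
            status = 200 if ready.is_set() else 503
        else:
            status = 404
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep frequent probe traffic out of the application logs


def start_probe_server(port: int = 8080, host: str = "0.0.0.0") -> HTTPServer:
    """Serve probe endpoints on a background thread and return the server."""
    server = HTTPServer((host, port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A component would call `start_probe_server()` early in start-up and `ready.set()` once it can accept work, so the platform only routes traffic to it from that point on.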
Additionally, the component should provide easily accessible logging and tracing output in a standard form based on its activity for both monitoring and diagnostics purposes. In containers typically we simply log to standard output. The platform can then generically collate and aggregate those logs and provide services to view and analyze them.
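As one possible shape for this, the sketch below configures Python's standard logging module to write one JSON object per line to standard output. The field names and the logger name "app" are assumptions; the point is only that a structured, consistent format lets the platform collate logs without knowing anything about the component.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Send structured logs to stdout, where the platform can collect them."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app")
    logger.handlers = [handler]
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


log = configure_logging()
log.info("order %s accepted", "42")
```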
With fine-grained components, there is an increased likelihood that an interaction will involve multiple components. To be able to understand these interactions and diagnose problems, we will need to be able to visualize cross-component requests. An increasingly popular modern framework for distributed tracing that is well-suited to the container world is OpenTracing.
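The core idea behind cross-component correlation can be shown without any tracing library: each component reuses the identifier it received and passes it on with every outbound call. The sketch below is a simplification and does not use the OpenTracing API; the header name X-Correlation-ID is a hypothetical choice, and real tracing frameworks also record spans and timings.

```python
import uuid
from typing import Dict, Optional

CORRELATION_HEADER = "X-Correlation-ID"  # hypothetical header name


def incoming_correlation_id(headers: Dict[str, str]) -> str:
    """Reuse the caller's correlation ID, or start a new one at the edge."""
    for key, value in headers.items():
        # HTTP header names are case-insensitive.
        if key.lower() == CORRELATION_HEADER.lower() and value:
            return value
    return uuid.uuid4().hex


def outgoing_headers(correlation_id: str,
                     extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build headers for a downstream call that carry the same ID forward."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = correlation_id
    return headers
```

If every component also includes the same identifier in its log entries, the aggregated logs for a single request can be pulled together across all the components it touched.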
Looking back on the Perspectives
Based on what we’ve seen, these are the key ingredients across all the previously mentioned perspectives that are required to make cloud native successful:
They are often inter-related, and typically mutually reinforcing. Do you need to do them all? That’s probably the wrong way to phrase the question, as few if any of the above are a simple binary “yes you’re doing it” or “no you’re not”. The question is more to what level of depth are you doing them. You certainly need to consider your status for each one, and assess whether you need to go further. In our next article, we’ll return to the question of “why” people are drawn to cloud native, and see if we can use that to help prioritize how, when and to what depth we embrace the above ingredients.