Kyle Brown and Kim Clark
In the previous article, where we discussed what cloud native actually means, we established that to achieve the desired benefits from a cloud native approach you need to look at it from multiple perspectives. It is about not only what technology you use and where your infrastructure is located, but also how you architect your solutions. But perhaps most importantly, it is about how you organize your people and what processes you follow. In this and the next two articles we are going to walk through what we have seen to be the most important ingredients of successful cloud native initiatives, taking a different perspective in each. A summary of the themes in this series is shown in the diagram below:
Let’s begin by looking at perhaps the most overlooked perspective — how cloud native affects the people involved, and the processes they are part of.
The people and process ingredients of cloud native
The people component outweighs any of the other parts in getting to cloud native success. In order to achieve the business value of cloud native, teams need to be able to rapidly coordinate between business and IT, have a “low touch” way of getting their changes through to production, and be passionately accountable for what they deliver. No amount of new technology or modern architecture will accomplish this on its own. Teams need to invest in moving to agile methods, adopt DevOps principles and software lifecycle automation, and adopt new roles (such as SREs), and organizations must give teams an appropriate level of autonomy. We show some of the most important people aspects of cloud native in the diagram below:
This list is by no means complete. We would also assert that there are other people-based aspects that improve team resiliency and cut across all the ingredients below, such as a move to a no-blame culture and encouraging a growth mindset. In the next sections we’ll dive into each of the ingredients in the above diagram in depth.
Cloud native infrastructure and microservices-based design enable the development of fine-grained components that can be rapidly changed and deployed. However, this would be pointless if we did not have development methods that can leverage and deliver on that promise. Agile methods enable empowered (decentralized) teams to achieve rapid change cycles that are more closely aligned with business needs. They are characterized by the following:
- Short, regular iteration cycles
- Intrinsic business collaboration
- Data driven feedback
Agile methods are usually contrasted with older, “waterfall”, methodologies. In a traditional waterfall method, all requirements are gathered up front, and then the implementation team works in near isolation until they deliver the final product for acceptance. Although this method enables the implementation team to work with minimal hindrance from change requests, in today’s rapidly changing business environment the final delivery is likely to be out of sync with the current business needs.
Agile methodologies combine iterative development cycles and regular engagement with the business with meaningful data from consumer usage to ensure that projects stay focused on the business goals. The aim is to constantly correct the course of the project as measured against real business needs.
Work is broken up into relatively small business relevant features that can then be prioritized more directly by the business for each release cycle. The real benefit to the business comes when they accept that there cannot be a precise plan for what will be delivered over the long term but that they can prioritize what is built next.
Agile itself is becoming an “old” term and has suffered over time, as many terms do, from nearly two decades of misuse. However, for the moment, it is perhaps still the most encompassing term we have for these approaches.
You cannot achieve the level of agility that you want unless you reduce the time that it takes to move new code into production. It does not matter how agile your methods are, or how lightweight you have designed your components, if the lifecycle processes are slow. Furthermore, if your feedback cycle is broken, you cannot react to changes in business needs in real time. Lifecycle automation is centered around three key pipelines. These are:
- Continuous Integration — Build/test pipeline automation
- Continuous Delivery/Deployment — Deploy, verify
- Continuous Adoption — Runtime currency (evergreening)
We show the interaction of these in the diagram below:
Continuous Integration (CI) means that changes are committed to the source code repository often (“continuously”) and that they are instantly and automatically built, quality checked, integrated with dependent code, and tested. CI provides developers with instant feedback on whether their changes are compatible with the current codebase. We have found that image-based deployment enables simpler and more consistent build pipelines. Furthermore, the creation of more modular, fine-grained, decoupled, and stateless components simplifies the automation of testing.
CD stands for either Continuous Delivery or Continuous Deployment (both are valid, although Jez Humble’s book popularized the term Continuous Delivery, which covers both). Continuous Delivery takes the output from CI and performs all the preparation that is necessary for it to be deployed into the target environment, but it does not deploy it, leaving this final step to be performed manually in controlled, approved conditions. When the automation is allowed to perform that final deployment step itself, that is Continuous Deployment, with advantages in agility balanced against potential risks.
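The difference between the two meanings of CD can be sketched in a few lines of Python. This is a minimal illustration rather than a real pipeline tool: the `run_pipeline` function, the `auto_deploy` and `approved` flags, and the stage names are all hypothetical and chosen only to mirror the description above.

```python
from typing import Callable, List, Tuple

# Hypothetical stage type: a name plus a function that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(ci_stages: List[Stage], auto_deploy: bool,
                 approved: bool = False) -> List[str]:
    """Run CI stages in order, then deploy or stop at the approval gate.

    auto_deploy=True models Continuous Deployment.
    auto_deploy=False models Continuous Delivery: the release is prepared,
    but the final deploy step happens only after manual approval.
    """
    log: List[str] = []
    for name, stage in ci_stages:
        if not stage():
            log.append(f"{name}: failed")
            return log  # fast feedback: stop at the first failing stage
        log.append(f"{name}: ok")

    log.append("release prepared")
    if auto_deploy or approved:
        log.append("deployed")
    else:
        log.append("awaiting manual approval")
    return log


# Illustrative stages that always succeed.
stages = [("build", lambda: True), ("unit tests", lambda: True)]
delivery_log = run_pipeline(stages, auto_deploy=False)
deployment_log = run_pipeline(stages, auto_deploy=True)
print(delivery_log[-1])
print(deployment_log[-1])
```

The only difference between the two runs is whether the last step waits for a person. Everything before that point is identical, which is why teams can usually start with Continuous Delivery and remove the gate later for environments where the risk is acceptable.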
Continuous Adoption (CA) is a less well known term for an increasingly common concept: keeping up to date with the underlying software runtimes and tools. This includes platforms such as Kubernetes, language runtimes, and more. Most vendors and open source communities have moved to quarterly or even monthly upgrades, and failing to keep up with current software results in stale applications that are harder to change and support. Security updates, as a minimum, are often mandated by internal governance. Vendors can only provide support for a limited number of back versions, so support windows are getting shorter all the time. Kubernetes, for example, is released every three months and only the most recent three releases are supported by the community. CI/CD, as noted above, means code changes trigger builds, and potentially deployment. Enterprises should automate similar CA pipelines that are triggered when vendors or communities release new upgrades. For more information about CA, see Continuous Adoption.
It’s worth noting that lifecycle automation is only as good as the efficiency of the processes that surround it. There’s no value in working to bring your CI/CD cycle time down to minutes if your approval cycle for a release still takes weeks, or you are tied to a dependency that has a lifecycle measured in months.
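One small building block of a CA pipeline is a check that tells you whether the version you are running has fallen out of the support window. The sketch below assumes a simple `major.minor.patch` version scheme and a window of the three newest minor releases, as in the Kubernetes example above. The function names and the release list are made up for illustration; a real pipeline would obtain the list of releases from the vendor or community.

```python
from typing import List, Tuple

Version = Tuple[int, int]  # (major, minor)


def parse_minor(version: str) -> Version:
    """Parse 'major.minor[.patch]' (optionally prefixed with 'v') into (major, minor)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def is_supported(running: str, released: List[str], window: int = 3) -> bool:
    """Return True if `running` is among the `window` newest minor releases."""
    minors = sorted({parse_minor(v) for v in released}, reverse=True)
    return parse_minor(running) in minors[:window]


# Made-up release list, for illustration only.
releases = ["1.20.0", "1.21.0", "1.22.0", "1.23.0"]
needs_upgrade = not is_supported("1.20.4", releases)
print(needs_upgrade)
```

A check like this could run on a schedule or whenever a new release is announced, and a failing result would trigger the upgrade pipeline in the same way that a code commit triggers CI.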
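A quick calculation makes the point concrete. The numbers below are purely illustrative (a ten minute pipeline and a two week approval cycle), not measurements from any real project.

```python
from datetime import timedelta

# Illustrative numbers only: a pipeline measured in minutes
# sitting behind an approval process measured in weeks.
steps = {
    "ci_cd_pipeline": timedelta(minutes=10),
    "release_approval": timedelta(weeks=2),
}

lead_time = sum(steps.values(), timedelta())  # total time from commit to production
bottleneck = max(steps, key=steps.get)        # the slowest step
pipeline_share = steps["ci_cd_pipeline"] / lead_time

print(f"lead time: {lead_time}, bottleneck: {bottleneck}")
print(f"pipeline share of lead time: {pipeline_share:.2%}")
```

With these assumed figures the automated pipeline accounts for well under one percent of the total lead time, so further tuning of the pipeline would make almost no difference until the approval process itself is shortened.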
DevOps and Site Reliability Engineering
As we can see from the figure above, lifecycle automation lays the groundwork for a more profound change to the way people work. As we simplify the mechanism between completion of code and its deployment into production, we reduce the distance between the developer and the operations role, perhaps even combining them.
This is known as DevOps, and has some key themes:
- Collaboration and combination across development and operations roles
- “Shift left” of operational concerns
- Rapid operational feedback and resolution
In traditional environments there is strong separation between development and operations roles. Developers are not allowed near the production environment, and operations staff have little exposure to the process of software development. This can mean that code is not written with the realities of production environments in mind. The separation is compounded when operations teams, in an effort to protect their environments, independently attempt to introduce quality gates that further impede the path to production, and cyclically the gap increases.
DevOps takes the approach that we should constantly strive to reduce and possibly remove the gap between development and operations so that they become aligned in their objective. This encourages developers to “shift left” many of the operational considerations. In practice this comes down to asking a series of questions and then acting on the answers:
- How similar can we make a development environment to that of production?
- Can we test for scalability, availability, and observability as part of the earliest tests?
- Can we put security in place from the beginning, and not just switch it on at the end for major environments?
Platform elements such as containers and Kubernetes can play an important role in this, as we will see from concepts such as image-based deployment and infrastructure as code that we will discuss later.
Clearly, the shortening of the path between development and production by using CI/CD is linked to DevOps, as is the iterative and business-focused nature of agile methods. It also means changing the type of work that people do. Software developers should play an active role in looking after production systems rather than just creating new functions. The operations staff should focus on ways to automate monotonous tasks so that they can move on to higher value activities, such as creating more autonomically self-healing environments. When these two are combined, this particular role is often referred to as a Site Reliability Engineer to highlight the fact that they too are software engineers. Key to succeeding with this is the need to accept that “failures are normal” in components and that we should therefore plan for how to manage failure rather than fruitlessly try to stop it from ever happening.
In a perfect world, software development and operations become one team, and each member of that team performs both development and operations roles interchangeably. The reality for most organizations involves some level of compromise on this, however, and roles still tend to become somewhat polarized toward one end of the spectrum or the other.
If we make the methods more agile, and the path to production more automated, we must not then stifle the teams’ ability to be productive and innovative. Each team is tackling a unique problem, and that problem will be better suited to particular languages and ways of working. We should give the teams as much autonomy as possible through:
- Decentralized ownership
- Technological freedom
If we’re going to rapidly iterate over more fine-grained components, we need to decentralize “one-size-fits-all” policies and allow more local decision making. As we will discuss later, a good cloud platform should naturally encourage standardization around build, deployment, and operations, so long as components are delivered in a consistent way (e.g. container images). To be productive, teams then need to have freedom over how they implement those components, choosing their own technologies such as languages and frameworks. Equally important is to ensure the teams can rapidly self-provision the tools and resources they need, which of course aligns well with the very nature of cloud infrastructure.
There is still a need for a level of consistency in approach and technology across the enterprise. Approaches like the Spotify model, for example, often address this need through “guilds”: groups of individuals drawn from the teams who focus on encouraging (rather than enforcing) common approaches and tools, based on real world experiences in their own teams.
Of course, a caveat is that decentralization of control can’t typically be ubiquitously applied, nor can it be applied all at once. It might make sense for only certain parts of an enterprise, or certain types of initiative in an enterprise. Ultimately, seek a balance between enabling elements of a company to innovate and explore in order to retain market leadership, and ensuring that you do not compromise integrity of the core competencies of the business with constant change and increasing divergence.
In the next part of this series we’ll look at what architecture and design choices we need to make in order to best leverage the cloud environment. In the meantime, if you want to learn more about any of these issues, visit the IBM Garage Method website, where we cover many of these topics in more depth in the context of an end-to-end method.