Orient your team for Cloud Operations

Keiran Holloway
4 min read · Feb 21, 2023


Cloud operating models are important. So important, I’d argue, that they often sit between you and your company’s ability to successfully adopt a public cloud strategy. They are also difficult to contextualize and quite nebulous to pin down. Fortunately, over the years I’ve had a lot of discussions and spent a fair amount of time thinking about this topic. This article is intended to bring a bit of clarity to it.

Features of a successful operating model:

  • Effectively democratize technology and bring it closer to the end users. Adoption of Cloud native technologies and processes suitable for rapidly evolving applications and technologies will make you more dynamic as an organization.
  • Provide the agility and the ability to run fast. With cloud you should be able to experiment and provide value to your end users quickly.
  • Operate efficiently, eliminating duplication whilst enabling economies of scale.
  • Provide clarity around the lines of demarcation between various functions within the cloud environment.
  • Provide paths of communication between these functions. Adding more functions and teams is great, but it is a tradeoff between specialization and efficiency.

Contextualizing an effective operating model

If we think about your average cloud stack of an enterprise, you can generally break this down into three layers:

  1. Application layer: Where your application and/or business logic lives. This is effectively the platform that your end-users interact with.
  2. Shared Services layer: This layer is made up of various components, depending on the technologies adopted by your company. Technologies which commonly sit here include container orchestration mechanisms (e.g., Kubernetes clusters), shared messaging buses (e.g., Kafka, SQS or EventBridge), Cloud Identity and Access Management patterns (e.g., SSO) and networking landing zones.
  3. Infrastructure layer: Whilst there should be a bias towards using shared service platforms where possible, there will be occasions where bespoke deployments are needed (for example, running an off-the-shelf product within an EC2 instance). Infrastructure which sits outside the shared services layer will commonly fall into the infrastructure layer.

Within each of these 3 layers there are generally two functions:

  1. Building and Deploying (‘Dev’): This function should be coming up with standard deployment patterns, for example using standard CI/CD pipelines, and getting applications deployed; and
  2. On-going operations (‘Ops’): This is the watering and feeding of infrastructure which has been deployed, to ensure that it is being operated in an effective, secure and efficient fashion. Incident response typically sits within the realm of operations.

A visual depiction of how these functions could be structured:

As above, you can think about your operating model by considering that there are 6 separate functions within your cloud environment (the yellow boxes): a build function and an operations function for each of the 3 layers. Depending on your organization, you could have these as 6 separate teams, or use a smaller number of teams and overlap functions. For example, the team building the applications could also be the people who are responsible for the watering and feeding of the application. This could include things like monitoring the application availability 24x7 as well as being responsible for code updates and releasing application changes. Alternatively, you could split these out into separate teams and functions.

Importantly, all of these functions must be adequately covered, otherwise you will effectively have gaps in your operating model.
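One way to make this check concrete is to treat the three layers and two functions as a small matrix and look for empty cells. The sketch below is a minimal illustration in Python, not a prescribed tool: the team names, the `ownership` mapping and the `find_gaps` helper are all hypothetical.

```python
from itertools import product

# The three layers and two functions described in this article.
LAYERS = ("application", "shared_services", "infrastructure")
FUNCTIONS = ("build_and_deploy", "operations")


def find_gaps(ownership):
    """Return every (layer, function) pair that has no owning team."""
    return [
        cell
        for cell in product(LAYERS, FUNCTIONS)
        if not ownership.get(cell)
    ]


# Hypothetical ownership map: which teams cover which function.
ownership = {
    ("application", "build_and_deploy"): ["payments-squad"],
    ("application", "operations"): ["payments-squad"],
    ("shared_services", "build_and_deploy"): ["platform-engineering"],
    ("shared_services", "operations"): ["platform-engineering"],
    ("infrastructure", "build_and_deploy"): ["platform-engineering"],
    # ("infrastructure", "operations") is deliberately left uncovered.
}

for layer, function in find_gaps(ownership):
    print(f"Gap: nobody owns '{function}' for the '{layer}' layer")
```

In this example one team covering several cells is fine; only a cell with no team at all is reported as a gap.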

The notion of “Platform Engineering” is emerging: a team that largely consolidates both the shared services and infrastructure layers. Platform engineering is focused on designing and building toolchains and workflows which enable self-service capabilities for software engineers who are building and deploying applications on cloud technologies. It covers both the build and run functions and is focused on optimisation for application owners.

Keys to building a successful cloud operating model:

  • Make sure that you have all of the functions above covered and that each team comprises the right skillsets. An anti-pattern which I have casually observed is having Infrastructure teams on call for the application. Whilst this does provide for pager response, it rarely yields outcomes much greater than restarting the application (which means the root cause is never particularly well understood).
  • Document which teams exist and who makes up these teams. It is not uncommon to see gaps within the operating model when walking through the actual teams and individuals who fulfill each function. Plugging any gaps is a quick way to improve your cloud operating approach.
  • Agree on where the lines of demarcation exist between these functions within your organization.
  • Understand and record how these teams interact. Frequently, we see different teams using different ITSM tooling. This means that interactions are chaotic and different language or terms are used by different teams.
  • Embrace cloud! This means using cloud native functions (such as serverless) and targeting the elimination of any undifferentiated heavy lifting.
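To illustrate the point about recording team interactions, the sketch below writes down which teams hand work to each other and flags any handoff where the two sides use different ITSM tooling. It is a rough illustration under stated assumptions: the team names, the placeholder tool names (`tool-a`, `tool-b`) and the `mismatched_handoffs` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    name: str
    itsm_tool: str  # placeholder label for whichever ticketing tool the team uses


def mismatched_handoffs(teams, handoffs):
    """Return the (source, destination) handoffs whose teams use different ITSM tools."""
    by_name = {team.name: team for team in teams}
    return [
        (src, dst)
        for src, dst in handoffs
        if by_name[src].itsm_tool != by_name[dst].itsm_tool
    ]


# Hypothetical teams and the handoffs recorded between them.
teams = [
    Team("app-team", "tool-a"),
    Team("platform-team", "tool-b"),
    Team("infra-team", "tool-a"),
]
handoffs = [
    ("app-team", "platform-team"),
    ("platform-team", "infra-team"),
    ("app-team", "infra-team"),
]

for src, dst in mismatched_handoffs(teams, handoffs):
    print(f"{src} -> {dst}: different ITSM tooling")
```

Even a list this small makes it visible where tickets cross a tooling boundary, which is where terminology and process tend to diverge.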

Before you go: if you’ve got this far, you’ve hopefully seen some value in this article. I write and publish this content free of charge. The cost for reading it, though? Please like the article and follow me. Think of this as a gentlemen’s agreement. The cost is nothing financial, just a token of appreciation for the time taken to put this together. Thank you.


Keiran Holloway

Technical Lead and Engineering Manager with over 20 years running complex public infrastructure. Strongly passionate about continuous learning and improvement.