Microstructures and other Velocity Drivers — Part 2

Paul Pogonoski
Nov 14, 2022

--

OK, the next part in this series starts to introduce you to my concept of Velocity Enablement.

The previous section can be found here: https://medium.com/@paulpogonoski/microstructures-and-other-velocity-drivers-part-1-c0361b766e20

As I say in my Introduction, please be patient and keep an open mind as all will be revealed in the following chapters.

Velocity Enablement and Gain

Velocity Enablement and Gain, or just Velocity Enablement, or VE, is a loose collection of practices called Velocity Drivers:

  • Microstructures
  • Structuring a DevOps team to be responsible for:
    - Shared Cloud Services
    - Velocity Enablement
    - DevOps Advisory Services
  • Having Dev teams take responsibility for their own Microstructures
  • Shared Production Support

All of these have the combined effect of

  1. having the entire Application Lifecycle of anything utilized in the Production Environment managed by the Development team that created it;
  2. and removing the funnel effect that occurs with any centralized “service” team, which is what most organizations have as their DevOps team(s).

When this happens, your software delivery organization is freed up to maximize development and delivery velocity. Hence the use of the term Velocity Enablement.

By now you are probably realizing that VE looks like a somewhat radical re-alignment of the current responsibilities in software development. It’s not that radical, as I aim to convince you. It certainly doesn’t require re-organizing teams, introducing new teams, or even taking on new methodologies. What it is, however, is a very practical means (it has been used successfully, to greater or lesser extents, at the most recent companies I’ve worked for) of achieving “Shifting Left”, as some vendors and consultants have labeled DevOps.

Let’s take the above points in reverse order.

Removing the Funnel to the centralized service teams

Centralized teams have become the de-facto model for creating and managing service teams, like Desktop support, Networks, DevOps, Production Support, and so on ad nauseam.

Given Conway’s Law[1] this is not surprising, and it was a sensible approach pre-cloud. Before the cloud, organizations had their own datacenters or outsourced the provisioning and maintenance of hardware. Hardware, whether bare-metal or Virtual Machine based, was physical, finite, and required specialist resources to manage. It required constant monitoring, maintenance, and resource planning, and it spawned ITIL and ITSM as means of better understanding the state of that monitoring, maintenance, and planning. Centralizing the teams responsible for this critical infrastructure was therefore seen not only as instinctive but as best practice.

Unfortunately, when organizations moved to public clouds, even if they refactored their applications to use Microservices, they retained the centralized structure for the teams managing the cloud services. Consequently, the software-based services of the cloud were still seen as infrastructure, with all the connotations that word brings (infrastructure is centralized, it’s bespoke, and it’s slow to design and implement because it requires specialized skills), even when the notion of DevOps became popular. DevOps teams were, effectively, renamed operations teams.

So why is this an issue for organizations that use Public Clouds? Let’s look at “wait time” as defined in The Phoenix Project:

The wait time for a given resource is the percentage that resource is busy, divided by the percentage that resource is idle. So, if a resource is fifty percent utilized, the wait time is 50/50, or 1 unit. If the resource is ninety percent utilized, the wait time is 90/10, or nine times longer. And if the resource is ninety-nine percent utilized, the wait time is 99 times longer.[2]

So, if the centralized team is “the resource”, a natural funnel forms: a queue builds up in front of the team, and access to it slows the more heavily the team is utilized.
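To make the shape of that curve concrete, here is a minimal sketch (Python, purely illustrative; the `wait_time` function and the sample utilization levels are my own, not from the book) that computes the busy/idle ratio at a few utilization levels:

```python
# Illustrative sketch of the wait-time relationship quoted above:
# wait time = percentage busy / percentage idle.

def wait_time(utilization: float) -> float:
    """Relative wait time for a resource at a given utilization (0 <= u < 1)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in the range [0, 1)")
    return utilization / (1.0 - utilization)

if __name__ == "__main__":
    for u in (0.50, 0.90, 0.95, 0.99):
        # 50% utilized -> 1x, 90% -> 9x, 95% -> 19x, 99% -> 99x
        print(f"{u:.0%} utilized -> wait time {wait_time(u):.0f}x")
```

The point is the non-linearity: a centralized team running “hot” at 90–99% utilization makes everyone queuing behind it wait an order of magnitude longer than a team with comfortable slack.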

Having a centralized DevOps team that Development teams rely on for the provision of cloud services is a serious and inevitable retardant to Development and Delivery Velocity.

Development teams responsible for the whole Lifecycle of anything they produce

OK, this may seem radical, and you will have to have a conversation with the teams about the perceived “extra work” of Production Support, but it is about time we discussed the “Elephant in the Room”. That is, why don’t development teams support the software they produce after hours?

There are three categories of failure for systems in Production:

  1. Infrastructure failure — networks, storage, database, or compute resources
  2. Software failure — catastrophic logic bug, situation not supported, fundamental design limitation
  3. Partial, or temporal, infrastructure failure causing software failure — temporary loss of network, DNS error, memory loss, load balancer swap

Two of those three categories involve software. Also, the first category should be eliminated if software is designed correctly for a Public Cloud. Yet IT departments still organize around the datacenter model, where there was a significant risk of infrastructure failure.

I’ve endeavored to find a study that analyzes system failures in commercial IT against the above categories, to break down and understand the current risks of failure in cloud computing. Alas, I have found nothing. Perhaps this is tied to the prevailing assumptions of the computing industry, based on its non-cloud history and experience?

Anecdotally, over the past year of supporting a modern distributed solution on a public cloud, there was a single Sev-1 issue due to a temporary cloud service failure, compared to 3 Sev-1 issues due to software failures. In every case of a software failure, the DevOps team were the first responders and took at least 30 minutes to an hour to determine that a development team representative was needed to address the problem.

Anecdotally again, in the 6 months before that 12-month period, supporting the same solution, there were no Sev-1 issues due to a cloud provider service failure, but there were 6 Sev-1 issues due to software failures.

I accept that what I have outlined is too small a sample to be anything more than a high-level pointer, but one difference between the first 6 months and the last 12 months is that the company I worked for decided to make the developer teams responsible for supporting their code in Production using a “Shared Support” model (more on this later in this book). The change was gradual and, to date, the teams still don’t support the code out of hours, but the impact was noticeable, and I discuss it further when describing the Shared Production Support Model.

[1] https://en.wikipedia.org/wiki/Conway%27s_law

[2] Gene Kim, Kevin Behr, and George Spafford. The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win. IT Revolution Press, 10 January 2013. ISBN 978-0-9882625-9-1.
