Chaos and Order in Software Development

Vladimir Prus
6 min read · Oct 1, 2023

In this post, I want to share one of the most important ideas I heard in the last few years — that most companies are too well-run.

Photo by Sergey Pesterev on Unsplash

Chaos, Order, and Productivity

The original formulation of the idea comes from Jim Keller, who explained it with a seemingly trivial chart of productivity vs. organizational order.

In chaos, nothing can be done. Priorities change overnight, untested code gets deployed to production, databases fail with no backup, and there are five different languages and three libraries for any single task.

When there is a sensible order, productivity is at its peak. People can focus on the actual goal at hand, have time to do it properly, release it, and have it just work.

But when the order is perfect, nothing can be done again. Writing a line of code requires three planning meetings, four approvals, and a separate team to deploy it. Anything remotely novel needs a couple of Ph.D. theses to even try.

Now, here’s the key observation — once the organization starts to move towards order, it can’t be stopped. The move continues until nothing can be done.

Team Variance

The same idea can be formulated in terms of team variance. Say, a team takes on a number of different projects and addresses them using whatever technology and approach seems reasonable. This can work in three fashions:

  • High variance — there are good projects, there are fantastic breakthroughs, and there are total failures.
  • Medium variance — we take reasonable steps to eliminate the most common failures, and end up with some good projects, some breakthroughs, and very rare failures
  • Excessively low variance — we’ve standardized everything. Every project is completed on time, on budget, and fails to impress anybody.

The axes in software development

Speaking about company organization is rather abstract. When we look at software development, there are a few axes along which we can independently move from chaos to order. The key point, again, is that either extreme is untenable, and will make it impossible to get things done.

The formal processes axis:

  • Chaos: a junior developer can refactor the entire payment code without prior discussion or code review
  • Order: fixing a one-line typo requires a JIRA ticket, a pull request, 3 mandatory reviewers, and a day of waiting

The automation axis:

  • Chaos: production is deployed by scripts from John’s home directory, using FTP to servers. John left the company last Christmas.
  • Order: the CI/CD process builds everything, including a C++ compiler, and takes several hours for any change.

The technology uniformity axis:

  • Chaos: a team of 6 randomly uses 4 different technology stacks; no single person is fluent with all of them
  • Order: a 5000-person company insists on using Java 11 for everything, across business units and acquisitions

The framework axis:

  • Chaos: every new service is a copy-paste of some previous one; each function has dozens of copies with random tweaks
  • Order: there’s an internal framework for everything. A new hire needs a month just to read the documentation. Each framework is inferior to its open-source counterpart.

On all of these axes, it’s easy to see why “perfect order” is untenable and would cause an immediate revolt. But often, productivity is killed quietly, because we go a little too far on every axis.

We can draw an analogy from reliability. If you want a 99% SLO for an API, and that API depends on several components, each of those components likely needs to provide something like a 99.9% SLO, because their failure rates add up. Likewise, if we wish to complete a change in 1 hour, and we have 10 steps, then each step has only 6 minutes to complete.
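The arithmetic behind this analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post says "several components", so the figure of 10 components is assumed here, as is the simplification that the components sit in series and fail independently. The function names are hypothetical.

```python
def serial_availability(component_slo: float, n_components: int) -> float:
    """Availability of a request that needs all n components to work.

    Assumes the components fail independently (a simplification).
    """
    return component_slo ** n_components


def required_component_slo(target_slo: float, n_components: int) -> float:
    """Per-component availability needed to reach target_slo overall."""
    return target_slo ** (1.0 / n_components)


def per_step_budget_minutes(total_minutes: float, n_steps: int) -> float:
    """Time each step may take if the total is split evenly."""
    return total_minutes / n_steps


if __name__ == "__main__":
    n = 10  # assumed number of components / steps

    # Ten components at 99.9% each give roughly 99% overall.
    print(f"overall with 99.9% parts: {serial_availability(0.999, n):.4f}")

    # Ten components at only 99% each fall well short of the 99% target.
    print(f"overall with 99% parts:   {serial_availability(0.99, n):.4f}")

    # What each component needs to deliver for a 99% overall target.
    print(f"needed per component:     {required_component_slo(0.99, n):.4f}")

    # A one-hour change split across ten steps.
    print(f"minutes per step:         {per_step_budget_minutes(60, n):.0f}")
```

The same multiplication that eats the reliability budget also eats the time budget: no single step looks expensive, but the total is what a developer actually experiences.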

That’s the key dynamic that slows down software development. On each axis or area, adding a bit of order seems like a good idea, but together, they turn a 5-minute fix into a one-hour process and a one-hour feature into a one-day ordeal.

How it happens

Surely nobody wants to slow down development by 10x, but there are strong factors at play.

S-curve and risk avoidance

Many established companies, teams, and engineers are on the right side of the S-curve. They have much more to lose due to minor missteps than to gain through the most brilliant designs. Naturally, it makes sense to push for more order in their areas.

Imagine that a junior developer proposes to use a flashy new database. It might be objectively 30% faster. However, for the infrastructure team, having a database going down is a major incident, and a 30% speedup on a new project is an accounting error.

Gatekeepers

There are people who are dead set on having everything structured and organized. In some cases, that’s a personal trait. In some cases, people are explicitly hired or promoted into a gatekeeper role. Or maybe the HR competency matrix for a senior manager says “keeps everything tidy”. Either way, career progression for these people becomes tied to moving towards order everywhere.

Then, if you put “established a new CI/CD process” or “standardized on Java 11” on your annual review form, it’s a clear win. Measuring that the new process reduced development speed by 10% is much harder.

Framework teams

The most successful software projects benefit millions of developers. While few of us will have a chance to write a new database or a programming language, we’re at least hoping to create tools useful for colleagues. It’s natural, and fine, that most companies have some sort of infrastructure teams and internal libraries.

For such teams, the key KPI is often the usage of their tools across the company. Sure enough, they start pushing other teams to adopt them. It quickly turns out that reusing anything across 5 teams might require 10x the effort. You meet unexpected usage requirements, need to write documentation, keep your SLOs, and provide timely support.

In some cases, you have people for 10x the effort, and end up with an excellent corporate framework. Maybe, this even grows into a public product or an open-source project. But often, you end up in death valley — where productivity is worse than before.

What can we do?

Maybe, the answer is “nothing”. If both the business and you are indeed on the right side of the S-curve and are doing well there, maybe riding out the time until retirement is the best course of action.

If, however, you have further ambitions, here are some of the options:

  • Just be aware that more order is not always better. Way too many senior people don’t even stop to think that extra processes and rules have costs.
  • Commit to new projects. Few things improve productivity better than a CEO/VP/GM demanding a release in a month. Isolate new prototype projects, both technologically and organizationally.
  • Estimate where you are on each axis. For example, compare the number of “why does it take so long” and “why are we down again” complaints. If possible, attach dollar values to both.
  • Embrace technical diversity. While pedantic gatekeepers are important, you should have a healthy mix of approaches and a few erratic agents of chaos. It might be good to adjust your hiring/review/promotion process to increase such diversity.
  • Be skeptical of grand frameworks. Whenever you see a proposal to create one, look for existing open-source solutions, visit GitHub, and estimate the complexity. If a 2-person team wants to essentially replicate a project with 2000 contributors, they had better have a new breakthrough in hand.

Hopefully, after reading this post, you’ll appreciate a bit of constructive chaos.
