Navigating the tricky path between autonomy and standardisation
Recently I was talking to the CEO and CTO of a company and they asked me whether they should standardise on a single tech stack or have diverse stacks. It was pretty clear their preference was to standardise and minimise duplication of effort. On the surface this seems like a good idea. I’ve also had very heated conversations with engineering teams who will relinquish their favourite language or stack only when you pry it from their cold dead hands.
This is a common disconnect between leadership and engineering teams. It comes up again and again disguised as an argument about autonomy, particularly as a company scales beyond a certain point. The reality is that neither extreme is healthy. It’s easy enough to demonstrate this.
Single stacks kill progress by stopping innovation
- If you have a single stack then you are locked into it forever, which in practice commits you to a monolith. When you want to change the stack you have to commit to changing everything at once. This limits your future flexibility.
- Engineers don’t like having their design choices taken away. Given the choice between a highly motivated team that owns the problem and one that hates the environment it’s building in, I would always tend towards the former.
Multistacks kill progress by generating tech debt and taking you down the path of Conway’s Law (shipping your organisation chart)
- It’s hard to run multiple stacks in production. To test, monitor and deploy them you push the complexity onto your DevOps and SRE teams, who have to build custom tooling to paper over the cracks.
- Multiple stacks struggle to talk to each other. In a diverse system, interoperability of data is king.
- It’s hard for people to work in other people’s codebases, so internal open source doesn’t happen, which in turn means duplication of effort.
- The maintenance cost scales with the number of tech stacks you support, and your organisation becomes fragile wherever it depends on skills which aren’t common within the company.
Both of these extremes are bad. After all, progress is the name of the game. Instead of thinking about the stacks themselves it’s useful to think about the joins between them. There are some things which need to be common within your company. To help understand this I’m going to talk about how Google does it. It’s easy to think of Google as a fairly homogeneous setup. The reality is that engineers have a lot of freedom, within certain constraints. If you want to go your own way then you are responsible for making sure your system joins up with everything else. So if you are willing to invest that effort you can go whichever way you want. This doesn’t always lead to great design choices. I remember watching some incredibly smart people trying to get the LAMP stack working on Google’s equivalent of Kubernetes. It sort of worked, but it was impossible to iron out the flakiness.
The key part of this balancing act, then, is understanding what constraints exist on freedom. This empowers engineers to make choices they believe in whilst also ensuring they can be held to account for playing nicely with the rest of the company. And it all comes down to interfaces and data structures. You want to minimise the number of each of these (ideally getting as close to one as possible) whilst accepting you can never account for all future problems. Here’s a (probably incomplete) list of the things you want consensus on.
- Interoperability of data. Unless you’re building a monolith, services need to talk to each other. You want common formats for passing data around, and you want to be able to update one service independently of another. You do not want to build a multitude of factories for serialising and deserialising objects. Trust me, you REALLY don’t.
- Discovery of services. Discovery is often overlooked, which leads to endpoints being hardcoded all over the place, and in turn to fragile production infrastructure and code. The endpoint needs to be abstracted so it can move around without affecting its clients.
- Making requests between services. What protocol do services use to talk to each other? What tools are you providing to observe the end-to-end flow of requests? At some point stuff will break, and debugging must not be a black art whose lore is scattered piecemeal across teams.
- Routing of requests between services. What happens when services move? I can tell you from experience that trying to code this yourself is a rabbit hole that sucks up time you could spend solving user and business problems.
- Exposing operational metrics. You need to run the thing in production. You want a view of how everything is running so you know when things are getting unhealthy. From a business perspective availability equates to trust.
- Logging of requests. You want to know who is doing what in your system. This allows you to build better systems: you want to experiment and explore your problem space. A multitude of mechanisms for logging prevents you doing this, and you are reduced to operating on gut instinct rather than being data-driven.
- Management of change of interfaces and data. When services change they can break other services. Your build system needs to have some way of identifying these breakages before they hit production so you can fix them cheaply.
- Common services around account management and authentication. In today’s age of enhanced privacy laws you cannot afford to get this wrong. Multiple account management systems are one of the worst smells you can have from an infrastructure perspective. Doing this well is hard. Doing it more than once is suicidal.
- Common deployment model. Deploying safely to production is hard. You can’t afford to build n versions of this. It also points you towards some commonality in where and how you deploy (e.g. cloud, packaging and containers).
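To make the interoperability point concrete, here is a minimal sketch of backwards-compatible deserialisation, assuming JSON as the company’s shared wire format. The field names (`order_id`, `quantity`, `currency`) and the versioning story are purely illustrative, not a recommendation of any particular schema.

```python
import json

# Defaults for fields added in later schema versions, so that messages
# written by older producers can still be read by newer consumers.
ORDER_DEFAULTS = {
    "order_id": None,   # present in every version
    "quantity": 1,      # added in v2; v1 producers omit it
    "currency": "GBP",  # added in v3; earlier producers omit it
}

def decode_order(payload: str) -> dict:
    """Decode an order, tolerating schema drift in both directions.

    Missing fields get defaults (old data, new consumer); unknown fields
    are kept as-is (new data, old consumer), so nothing is silently lost.
    """
    raw = json.loads(payload)
    return {**ORDER_DEFAULTS, **raw}

# An old v1 producer only knows about order_id...
old_message = json.dumps({"order_id": "A17"})
order = decode_order(old_message)
# ...but a newer consumer still gets sensible values for the newer fields.
assert order["quantity"] == 1 and order["currency"] == "GBP"
```

This is the property that lets you update one service independently of another: both sides agree on one format and one evolution rule, rather than building per-team serialisation factories.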
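The discovery point can be sketched just as simply: clients name a service, and a resolver owns the mapping to an endpoint. The in-memory dict registry below is an illustration only; in practice this would be backed by DNS, a service mesh, or a dedicated registry.

```python
# A sketch of abstracting endpoints behind a resolver, so callers name
# services rather than hardcoding hosts. Service names and addresses
# here are made up for illustration.

class Resolver:
    def __init__(self) -> None:
        self._registry: dict[str, str] = {}

    def register(self, service: str, endpoint: str) -> None:
        self._registry[service] = endpoint

    def resolve(self, service: str) -> str:
        try:
            return self._registry[service]
        except KeyError:
            raise LookupError(f"no known endpoint for {service!r}")

resolver = Resolver()
resolver.register("billing", "10.0.0.5:8080")

# Clients ask for "billing"; the endpoint can move without touching them.
resolver.register("billing", "10.0.0.9:8080")  # the service moved
assert resolver.resolve("billing") == "10.0.0.9:8080"
```

The value is in the indirection itself: when the endpoint moves, only the registry changes, and the fragile hardcoded-address problem described above never arises.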
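The change-management point, catching interface breakage at build time rather than in production, can also be sketched. Representing a schema as a dict of field name to type name is an assumption made purely for illustration; real systems would derive this from their IDL or schema files.

```python
# A sketch of a build-time compatibility check: a new version of a
# message schema must keep every field of the old version, with the
# same type, so existing consumers don't break.

def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List changes between schema versions that would break consumers."""
    problems = []
    for field, ftype in old.items():
        if field not in new:
            problems.append(f"field removed: {field}")
        elif new[field] != ftype:
            problems.append(f"type changed: {field} ({ftype} -> {new[field]})")
    return problems

old_schema = {"order_id": "string", "quantity": "int"}
new_schema = {"order_id": "string", "quantity": "int", "currency": "string"}

# Adding a field is fine; removing or retyping one is flagged.
assert breaking_changes(old_schema, new_schema) == []
assert breaking_changes(old_schema, {"order_id": "string"}) == ["field removed: quantity"]
```

Wired into CI, a check like this is what makes breakages cheap to fix, because they fail a build instead of failing in production.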
If you can get to consensus on how to do these things you can tell any engineering team that they have choices as long as they can join up with the rest of your infrastructure. It then becomes simple to determine whether a design decision is heading into crazy town. For example, if a team building a system decides to do its own logging and operational metrics, that team is accountable for the corresponding infrastructure: collecting and analysing the log information, monitoring its own services and alerting on outages. This is a non-trivial set of things to build. Likewise any team that decides to go its own way with account management and users is storing up future privacy hell for everyone. A team which wants to go its own way without owning the resultant requirements is probably lacking in experience.
It also tells you which infrastructure teams you need to be building out. The goal here is to build something where the easiest thing to do is the right thing. Don’t force people to adopt common infra (but do reward them when they do). The people in these infrastructure teams need a lot of soft skills, since they are going to be doing a lot of influencing without authority. If they can’t work with people and solve problems with what they’re building then you get a stand-off: either open rebellion, or the infrastructure teams tyrannise everyone else. Possibly an article for another day there.
The truth is there is no one answer to this question. Both extremes are bad and you have to find the middle ground which works for your organisation, knowing it may change in the future. Interfaces and loose coupling give you the flexibility to grow and adapt. They give you the tools to give engineering teams both autonomy and accountability for their design decisions. They tell you when you need to get teams talking to each other to reach consensus. They tell you which common services teams you need to build out. And, more importantly, they give you a way of viewing designs which validates whether you’re storing up trouble for later.