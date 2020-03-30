You’re a technical leader. You see that services are being rewritten twice a year, every team is increasingly becoming a silo, migrations are hard, service problems are common, tech-debt and bugs are climbing, understanding the system end-to-end is a substantial undertaking, and engineers are feeling unmotivated. You want to make life simpler for engineering. It seems like no matter what you do lately this keeps happening. Townhalls, 1-on-1s, release procedures, etc — nothing is really working. This post will describe actions technical leadership (a broad category of people) can take to make sure complexity is reduced.

Pointing is very necessary during coding

From the business perspective, you’re shipping a lot of code that generates problems or fails to solve key problems for your customers. CAC (customer acquisition cost) is being spent only to lower your NPS (an indicator of how much people like your product and how likely they are to recommend it to others).

From an observability perspective, you likely have PagerDuty alerts set up but no easily accessible knowledge that shows you exactly how the system is functioning. Diagrams like the one below could take you a week or longer to reason out.

About Our Team

The engineers I work with are very smart people. They’re mostly junior-to-mid with moderate knowledge of DDD (domain-driven design), systems patterns, design patterns, and strong Java/React/NodeJS development skills. They also have great attitudes and are wonderful human beings.

I work as a VPE. However, I am a very technical VPE (previous roles include Senior R&D, SRE Lead/Manager, Director of Eng yada yada) and I absolutely love architecture and systems design. Diagrams like the one below are things I spend a lot of time reasoning about.

Example: I wonder if the producers/consumers are keeping state. I assume, given the Spotify scale, producers can be consumed cross-region. My other thoughts are about failure scenarios, costs, hiring-for, team communications, and the SaaS offerings in these spaces. I could write an entire post on this diagram.

We have a pod-based org structure that allows product teams to operate quite autonomously. However, over the years, the architecture has grown to mirror our lack of proper bounded contexts, domain/subdomain identification, and decoupling strategies. With the high levels of autonomy, we sacrificed some cohesion. We created a straightforward distributed monolith. The communication needed to demystify microservice interactions between pods was substantial. Some of our pods also didn’t really have clear boundaries. The Core/Platform team, for example, had an ambiguous charter and became custodians for orphaned but commonly used services and tasks.

Things were up and running but with lots of intermittent availability issues. There were broken windows everywhere if you dug into NewRelic, repositories, and talked with the engineers.

We had issues. Long-standing issues that were widely known and understood within the pods and were being reported in exit interviews.

Validate the Problem Before Proposing Solutions

Given the issues we were seeing, we initially put our focus on the availability problems. We made a list of why we might be shipping unreliable code:

Engineers wait on permission/guidance to undertake reliability initiatives

Product focus over reliability focus, i.e., “We don’t have time to focus on reliability”

Conveyor belt mentality or keeping busy rather than thinking strategically as an engineer

“We will fix this during the next rewrite” thinking

Knowing the complexity of our systems, I felt we might be confusing the symptom for the cause. I decided to do some digging to find out “why” reliability was such a pain rather than just pep-talking engineers about shipping quality code again. Experience gives you hunches like this.

While digging into some of the bugs I noticed a consistent theme: the objective of the buggy pull-requests was simple and well-scoped but the implementation almost always required a lot of cross-team communications and keeping track of an immense amount of logic to account for all of the possible, often implicit, side-effects. The code was riddled with complexity. Digging deeper, I realized much of the code was never really designed in the first place to support 70% of what it was now responsible for, key abstractions were missing, and the service was not properly bounded. Using a chess metaphor: our opening game was in need of improvement.

When the product changes, coders often have to retrofit things. This is why we all refactor or rewrite services eventually. Given that this is a known and common occurrence, there are engineering strategies to deal with it. The simplest effective practice is having conversations with engineering and product/business about where the product is headed before making any moves.

In addition, you want to design on the assumption that things will change frequently. This is the central thesis of the book Building Evolutionary Architectures. I won’t go into rants about YAGNI, intentionally delayed design decisions, and fitness functions, but I will say a prerequisite to building adaptable systems is to understand the domain and your long-term business objectives within that domain. The quality of your designs is directly impacted by the clarity of vision and communication from business and product here.
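For anyone who hasn't run into fitness functions before, here is a minimal sketch of one, assuming you keep a simple inventory of which bounded context owns each service and which services are meant to be called from outside it. Everything in it (the `Service` record, the service names, the call list) is hypothetical and made up for illustration; it is not our actual tooling. The idea is just that a boundary rule becomes an automated check instead of tribal knowledge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    context: str          # bounded context (pod) that owns the service
    public: bool = False  # True if other contexts are allowed to call it


def boundary_violations(services, calls):
    """Return (caller, callee) pairs that cross a context boundary
    into a service that is not marked public."""
    by_name = {s.name: s for s in services}
    violations = []
    for caller, callee in calls:
        src, dst = by_name[caller], by_name[callee]
        if src.context != dst.context and not dst.public:
            violations.append((caller, callee))
    return violations


# Hypothetical inventory and observed calls, for illustration only.
SERVICES = [
    Service("checkout-api", "payments", public=True),
    Service("ledger", "payments"),
    Service("catalog-api", "catalog", public=True),
    Service("search-indexer", "catalog"),
]
CALLS = [
    ("catalog-api", "checkout-api"),  # allowed: callee is public
    ("checkout-api", "ledger"),       # allowed: same context
    ("search-indexer", "ledger"),     # violation: private, other context
]

if __name__ == "__main__":
    for caller, callee in boundary_violations(SERVICES, CALLS):
        print(f"boundary violation: {caller} -> {callee}")
```

A check like this could run in CI and fail the build when the list is non-empty, which keeps a boundary decision enforced as the system changes rather than relying on people remembering it.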

We were lacking focus on the collaborative, forward-looking planning stage. We were doing RFCs, employing some good microservice patterns, and validating the basics, but the lack of built-in adaptability was really limiting our potential. More importantly, for all of the services in the middle and end stages of the game, the alignment between the code and the problems we were trying to solve was so bad that we had reason to expect bugs to be shipped.

The Plan

We wanted to efficiently, and non-regressively, improve uptime and time to market. This would be our goal. We also wanted to solve the current availability issues we were dealing with. So we split out responsibilities. I’d take the longer-term simplification initiative and delegate the short-term availability issues. At this point, I found myself with fundamental problems to solve: