Simplifying Complex Architectures

Mar 30, 2020

Pointing is very necessary during coding

You’re a technical leader. You see that services are being rewritten twice a year, every team is increasingly becoming a silo, migrations are hard, service problems are common, tech-debt and bugs are climbing, understanding the system end-to-end is a substantial undertaking, and engineers are feeling unmotivated. You want to make life simpler for engineering. It seems like no matter what you do lately problems keep piling up. All-hands, engineering meetings, release procedure governance, etc — nothing is really working. The problem is complexity. This post will describe how technical leadership (a broad group of people) can make moves towards simplification.

From the business perspective, this means money is being spent to acquire customers who will likely lower your NPS (an indicator of how much people like your product and how likely they are to recommend it to others).

From the product perspective, engineering and product are slowing down. A significant product slowdown simply doesn’t work. Product is the lifeblood of a company. Your users need more and more value unlocked and you need to capture that value to keep delivering delightful experiences to your customers.

From an observability/ops perspective, you likely have PagerDuty alerts set up but no easily accessible knowledge that shows you exactly how the system is functioning and why. Diagrams like the one below could take you a week or longer to reason out manually.

Take my company for example. After many years, our architecture is feeling some of the effects of improper bounded contexts, incorrect team compositions, and a lack of decoupling. Our workflow logic is dispersed across several services. Demystifying these workflows is a substantial lift. These stress-inducing issues persist because we are sprinting towards better and better product-market fit and accelerating go-to-market. Welcome to engineering. The true challenge and fun of engineering are building against constraints.

About Our Team

The engineers I work with are very smart people who have a good working knowledge of systems patterns, design patterns, and strong development skills. They also have great attitudes and are all-around wonderful human beings.

We have a loosely “pod-based” org structure that allows product teams to operate very autonomously on a particular domain.

We all love the architecture here. My role is the VPE. However, I am a technical VPE (these exist). I absolutely love architecture and programming. Diagrams like the one below are things I spend a lot of time thinking about.

Example: I wonder if the producers/consumers are keeping state. I assume, given the Spotify scale, producers can be consumed cross-region. My other thoughts are about failure scenarios, costs, hiring-for, team communications, and the SaaS offerings in these spaces. I could write an entire post on this diagram.

Validate the Problem(s)

We made a list of reasons why we might be shipping inelegant and sometimes unreliable software that generated the complexity we were now struggling with.

Product focus over quality focus, i.e. “We don’t have time to focus on quality”
Engineers wait on permission/guidance to undertake reliability initiatives
Keeping busy rather than thinking strategically as an engineer
“We will fix this during the next rewrite” thinking

The most important thing here is that if you are not building quality at each step you are building in risk.

While digging into some of the bugs I noticed a consistent theme: the objective of the buggy pull-requests was simple but the implementation almost always required a lot of cross-service and cross-team communications, as well as keeping track of an immense amount of logic to account for all of the possible, often hidden, side-effects. Digging deeper, I realized much of the code was no longer suited for handling 70% of what it was now responsible for.

When the product changes, engineers often have to retrofit things; this is why we all refactor or rewrite services eventually. This entropy is natural. The simplest effective mitigation strategy is having the business, product, and engineering departments strongly aligned (feedback loops like the ones agile suggests are good at this). Seeing things coming and being able to alter course is invaluable for engineering organizations.

Volatility (big surprise) was the Problem

You want to design on the assumption things will change frequently. This is the central thesis of the book Building Evolutionary Architectures. I won’t go into rants about YAGNI, intentionally delayed design decisions, volatility-based decomposition, and fitness functions but I will say a prerequisite to building adaptable systems is to understand the domain and your long-term business objectives within that domain thoroughly. The quality of your designs is directly correlated to the clarity of vision and communication by business and product about the domain. Ensure engineers have their business hats on before designing anything.

Developing a Plan

Have a Sense of Urgency

The idea of a change that is necessary for the success of the organization can be very powerful. If you can create an environment where individuals are aware of an existing problem and can see a possible solution, it is likely that support for the change will rise. We all agreed on some issues that needed to be solved. We wanted to ship quality code to market more efficiently and without regressions. At this point, we found ourselves with several problems to solve:

We need to improve prioritization. You want a common understanding of why we are doing “X” over “Y”
We need to cap concurrent streams of technical work (entropy)
We need to pay down existing tech debt (a second-order function of entropy)
We need to simplify architecture across the board by building “legos” rather than esoteric components
We need to better isolate changes to improve quality and reduce overhead
We need real observability. Systems should be easy to reason about at a glance. This isn’t optional for high-performing teams
We need to create compounding leverage. A “platform” for all teams to leverage that allows people to build systems from battle-tested repositories implementing common architecture patterns (grab and test a few from Chris Richardson)

As you can see, many of our issues were rooted in architecture.

Architecture Belongs to All Engineers

It will be very hard to lead the whole change process on your own, and therefore it is important to build a coalition to help you direct others. First, let’s get some of our most collaborative and seasoned engineers to focus on solving the architecture problems. Avoid approval-by-committee, private club mentality, and cadres. Utilize chapter and team meetings. Architecture is done in every team. Though architecture guidance may come from a smaller group, it must ultimately be the domain team’s responsibility. The best chief architects view architecture as a series of partnerships. Always put decision-making in the hands of the people closest to the action to ensure people have autonomy.

The architecture coalition’s main objectives:

Establish a good flow of information between business strategy and teams
Create good designs by thinking through the business domains and their likely future. The output here is services cohesion; what changes together, stays together
Decouple services
Liberate data. Think of producing data to message brokers so that anyone can subscribe to that data
Reduce unnecessary cross-service requests by duplicating data (data is cheap)
Properly establish bounded contexts, domains, and subdomains
Consult with teams regarding best practices (abstractions, decouplings, datastores, etc)
Sell. Don’t force
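The "liberate data" and "duplicate data" objectives above can be sketched in a few lines. This is a minimal illustration, not our production design: `InMemoryBroker` is a hypothetical stand-in for a real message broker, and the service names, the `customer.updated` topic, and the event fields are all assumptions made up for the example. The owning service publishes every change, and a downstream service keeps its own copy of the fields it needs, so it never has to call the owner at request time.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class InMemoryBroker:
    """Toy stand-in for a real message broker (hypothetical, illustration only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # Deliver to every subscriber; the producer does not know who they are.
        for handler in self._subscribers[topic]:
            handler(event)


class CustomerService:
    """Owns customer data and publishes every change (hypothetical service)."""

    def __init__(self, broker: InMemoryBroker) -> None:
        self._broker = broker
        self._emails: Dict[str, str] = {}

    def update_email(self, customer_id: str, email: str) -> None:
        self._emails[customer_id] = email
        self._broker.publish("customer.updated", {"id": customer_id, "email": email})


class BillingService:
    """Keeps a local, duplicated copy of the customer fields it needs."""

    def __init__(self, broker: InMemoryBroker) -> None:
        self._customer_emails: Dict[str, str] = {}
        broker.subscribe("customer.updated", self._on_customer_updated)

    def _on_customer_updated(self, event: Event) -> None:
        self._customer_emails[event["id"]] = event["email"]

    def invoice_recipient(self, customer_id: str) -> str:
        # Answered from the local copy: no cross-service request is made.
        return self._customer_emails[customer_id]


if __name__ == "__main__":
    broker = InMemoryBroker()
    billing = BillingService(broker)
    customers = CustomerService(broker)
    customers.update_email("c-1", "ada@example.com")
    print(billing.invoice_recipient("c-1"))
```

The trade-off is that the duplicated copy is only as fresh as the last event delivered, which is usually acceptable when the alternative is a synchronous dependency between teams.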

We saw the need for a Platform team

The Platform team will design backend systems and libraries pertaining to basic functionality needed by most services. X-as-a-service capabilities. In technical terms, the Platform team will focus on services required by many teams and make certain workflows easy to implement. Success for the Platform team is the speed and ease at which teams can implement common workflows to solve larger domain problems. The Platform team builds and gives away well-maintained and production-ready legos.

The Platform’s customers are internal developers. At first glance, the architecture coalition closely resembles the Platform team. However, a platform team has a very specific charter/mandate which separates it from the architecture coalition and warrants a long-lived team of specific individuals.

Platform engineers will build services that must be maintained for years
Platform engineers must possess a deep understanding of distributed systems and data-intensive applications
Platform engineers need to be able to have conversations and think through what people might want, and they’ll need to be able to provide support when people have a question
Platform engineers must build systems that are extremely robust and index heavily on architectural characteristics like scalability, elasticity, resilience, and modularity

Now that we had identified some needed groups, we had to create good communication pathways. I elected to have the lead platform engineer report to me. I put architecture in the hands of our engineering leads, ultimately our engineering chapter, to ensure the biggest technical opportunities and risks for the business are top of mind for all engineers. I have weekly syncs with the leads about architecture to make sure business strategy, which is tech strategy expressed differently, is communicated well.

Consider Your Timing and Resources

A company can be in a stage, such as survival mode or market-capture mode, where it is more than willing to take on unhealthy levels of tech debt. If the company is very well funded or has exceptionally good leadership, you may get buy-in regardless of what stage it is in.

Keep in mind that leadership may be correct in pushing this kind of initiative off. Every time you say “yes” to one thing you are saying “no” to another.

Pitch Building with Legos as Compounding Leverage

The thing I often attempt to persuade CxOs and Founders with, when it comes to architecture simplification initiatives, is building with legos. Ideally, we start building everything with a focus on being reliable and extensible (like a lego) rather than building with the sole goal of satisfying product acceptance criteria. Once we start outputting these legos, I tell them, the product teams can start leveraging engineering more efficiently and we ship fewer bugs.

Pitch Sub-linear Scaling

Oftentimes we have more headcount than we need to build and support our products because we don’t understand our systems. As we grow and hire more engineers, we split up the work into specialized teams. However, 5 teams of engineers aren’t going to be 5x as productive. Things slow down due to more coordination, more overhead, more externalities, and in many cases, much more complexity. (Paraphrasing Charity Majors.)
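One rough way to put a number on the coordination cost when making this pitch is to count the pairs of teams that might need to talk to each other. This is an illustrative model of my own choosing, not something from the source being paraphrased, and it assumes the worst case where any team may need to coordinate with any other. The function name `coordination_paths` is made up for the example.

```python
def coordination_paths(teams: int) -> int:
    """Number of distinct team pairs that may need to coordinate: n * (n - 1) / 2."""
    return teams * (teams - 1) // 2


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        # Capacity grows linearly with n; potential coordination grows quadratically.
        print(f"{n} teams -> {coordination_paths(n)} potential coordination paths")
```

Under this assumption, going from 1 team to 5 adds 5x the capacity but takes you from 0 to 10 potential coordination paths, and 10 teams have 45. Good bounded contexts and decoupling are what keep the real number well below that ceiling.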

Pitch Revenue Protection

This is a somewhat under-used tool in my opinion. Rework costs money. A lot of it. IRR is directly impacted by rework. Point out the rework being done, the frequency, and how it can be mitigated. We talked with quite a few teams to make our case here very strong. All teams are affected by business pivots, unforeseen use-cases/requirements, and a lack of guidance on good design. Really dig into the “Why” of each rework here.

Above All, Be a Trusted Advisor

Assuming you’ve taken timing into account, as a technical leader or representative of tech, it’s your job to capitalize on technical opportunities and mitigate risk. Bring the tech into the discussion of business goals, product roadmaps, team goals, and timelines. Do this early and often. Beyond being a partner, become a trusted advisor. Someone with a track record of anticipating problems and generating solutions. Someone executives and investors know has the company's goals in mind.

Thanks to Charity Majors for her early feedback on this post.

Disclaimer: I do not advocate for any particular organizational structures here since the effectiveness of different structures is mostly dependent on the people that inhabit the various roles. Instead, I will focus on what needs to be done regardless of your org structure and titles — though you may certainly need to push for changes in the org.