Date: 2020-03-30
You’re a technical leader. You see that services are being rewritten twice a year, every team is increasingly becoming a silo, migrations are hard, service problems are common, tech-debt and bugs are climbing, understanding the system end-to-end is a substantial undertaking, and engineers are feeling unmotivated. You want to make life simpler for engineering, but no matter what you do lately, this keeps happening. Town halls, 1-on-1s, release procedures: nothing is really working. This post describes actions technical leadership (a broad category of people) can take to reduce complexity.
From the business perspective, you’re shipping a lot of code that generates problems or fails to solve key problems for your customers. CAC (customer acquisition cost) is being spent only to lower your NPS (an indicator of how much people like your product and how likely they are to recommend it to others).
From an observability perspective, you likely have PagerDuty alerts set up but no easily accessible knowledge that shows you exactly how the system is functioning. Diagrams like the one below could take you a week or longer to reason out.
About Our Team
The engineers I work with are very smart people. They’re mostly junior-to-mid with moderate knowledge of DDD (domain-driven design), systems patterns, design patterns, and strong Java/React/NodeJS development skills. They also have great attitudes and are wonderful human beings.
I work as a VPE. However, I am a very technical VPE (previous roles include Senior R&D, SRE Lead/Manager, Director of Eng yada yada) and I absolutely love architecture and systems design. Diagrams like the one below are things I spend a lot of time reasoning about.
We have a pod-based org structure that allows product teams to operate quite autonomously. However, over the years, the architecture has grown to mirror our lack of proper bounded contexts, domain/subdomain identification, and decoupling strategies. With the high levels of autonomy, we sacrificed some cohesion. We created a straightforward distributed monolith. The communication needed to demystify microservice interactions between pods was substantial. Some of our pods also didn’t really have clear boundaries. The Core/Platform team, for example, had an ambiguous charter and became the custodian of orphaned but commonly used services and tasks.
Things were up and running but with lots of intermittent availability issues. There were broken windows everywhere if you dug into NewRelic, repositories, and talked with the engineers.
We had issues. Long-standing issues that were widely known and understood within the pods and were being reported in exit interviews.
Validate the Problem Before Proposing Solutions
Given the issues we were seeing, we initially put our focus on the availability problems. We made a list of why we might be shipping unreliable code:
- Engineers wait on permission/guidance to undertake reliability initiatives
- Product focus over reliability focus, i.e., “We don’t have time to focus on reliability”
- Conveyor belt mentality or keeping busy rather than thinking strategically as an engineer
- “We will fix this during the next rewrite” thinking
Knowing the complexity of our systems I felt we may be confusing the symptom for the cause. I decided to do some digging to find out “why” reliability is such a pain rather than just pep-talking engineers about shipping quality code again. Experience gives you hunches like this.
While digging into some of the bugs I noticed a consistent theme: the objective of the buggy pull-requests was simple and well-scoped but the implementation almost always required a lot of cross-team communications and keeping track of an immense amount of logic to account for all of the possible, often implicit, side-effects. The code was riddled with complexity. Digging deeper, I realized much of the code was never really designed in the first place to support 70% of what it was now responsible for, key abstractions were missing, and the service was not properly bounded. Using a chess metaphor: our opening game was in need of improvement.
When the product changes, coders often have to retrofit things. This is why we all refactor or rewrite services eventually. Given that this is a known and common occurrence, there are engineering strategies to deal with it. The simplest effective practice is having conversations with engineering and product/business about where the product is headed before making any moves.
In addition, you want to design on the assumption things will change frequently. This is the central thesis of the book Building Evolutionary Architectures. I won’t go into rants about YAGNI, intentionally delayed design decisions, and fitness functions but I will say a prerequisite to building adaptable systems is to understand the Domain and your long-term business-objectives within that domain. The quality of your designs is directly impacted by the clarity of vision and communication by business and product here.
We were lacking focus on the collaborative, forward-looking planning stage. We were doing RFCs, employing some good microservice patterns, and validating the basics, but the lack of built-in adaptability was really limiting our potential. More importantly, for all of the services in the middle and end stages of the game, the alignment between the code and the problems we were trying to solve was so bad that we had reason to expect bugs to be shipped.
The Plan
We wanted to efficiently, and non-regressively, improve uptime and time to market. This would be our goal. We also wanted to solve the current availability issues we were dealing with. So we split out responsibilities. I’d take the longer-term simplification initiative and delegate the short-term availability issues. At this point, I found myself with fundamental problems to solve:
- We need to simplify architecture across the board by building “legos” rather than esoteric components that are hard to integrate
- We need to better isolate changes to improve quality and reduce overhead
- We need to improve technical prioritization. You want common thinking on why we are doing “X” over “Y”
- We need to cap concurrent streams of technical work (entropy)
- We need to pay down existing tech debt (a second-order function of entropy)
- We need real observability. Systems should be easy to reason about at a glance. This isn’t optional for high-performing teams
- We need to create compounding leverage. A “platform” for all pods to leverage that allows people to build systems from battle-tested repositories implementing common architecture patterns (grab and test a few from Chris Richardson)
First, let’s get some of our most collaborative and seasoned engineers to focus on solving the architecture problems and call it… the architecture team. No approval-by-committee, no private club mentality — let’s just spend all our time helping teams better the architecture.
The architecture team’s main objectives:
- Establish a good flow of information between business strategy and pods
- Create good designs by thinking through the business domains and their likely future
- Properly establish bounded contexts, domains, and subdomains
- Decouple services
- Reduce cross-service requests
- Consult with teams regarding best-practices (abstractions, decouplings, datastores, etc)
- Create a high level of cohesion in the architecture
- Position services correctly (both technically and within the organization)
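Objectives like “decouple services” and “reduce cross-service requests” are easier to track with a number attached. Below is a minimal sketch, not the tooling we used, of how one might measure calls that cross bounded-context boundaries. The service names, the `CONTEXT_OF` ownership map, and the `CALLS` list are all hypothetical; in practice the call list would come from whatever tracing or APM data you already collect.

```python
from collections import Counter

# Hypothetical ownership map: service name -> bounded context (pod).
CONTEXT_OF = {
    "loan-api": "Loans",
    "underwriting": "Loans",
    "ledger": "Banking",
    "balances": "Banking",
    "notifications": "Core",
}

# Hypothetical observed calls (caller, callee), e.g. exported from tracing data.
CALLS = [
    ("loan-api", "underwriting"),
    ("loan-api", "ledger"),
    ("underwriting", "balances"),
    ("ledger", "balances"),
    ("ledger", "notifications"),
    ("balances", "loan-api"),
]


def cross_context_calls(calls, context_of):
    """Count calls whose caller and callee live in different bounded contexts."""
    pairs = Counter()
    for caller, callee in calls:
        src, dst = context_of[caller], context_of[callee]
        if src != dst:
            pairs[(src, dst)] += 1
    return pairs


def cross_context_ratio(calls, context_of):
    """Fraction of all calls that cross a context boundary (0.0 if there are no calls)."""
    if not calls:
        return 0.0
    return sum(cross_context_calls(calls, context_of).values()) / len(calls)


def mutual_dependencies(calls, context_of):
    """Context pairs that call each other in both directions."""
    pairs = cross_context_calls(calls, context_of)
    return sorted({tuple(sorted(p)) for p in pairs if (p[1], p[0]) in pairs})
```

With the sample data, four of the six calls cross a boundary, and Banking and Loans depend on each other in both directions. Two-way dependencies like that are a reasonable first place to look when hunting for a distributed monolith.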
Second, we established a charter for the Core team:
The Core team will design backend systems pertaining to basic functionality needed by most services. The Core team will also validate and implement, sometimes code, the tools that wire these services together: contracts, service meshes, logging, distributed tracing, libraries, etc. In technical terms, the Core team will focus on *subdomains* required by all pod Domains and making these subdomains easy to implement. Success for the Core team is the speed and ease at which teams can implement common patterns and subdomains to solve larger domain problems. Legos.
Example Domains: Banking, Loans, Credit, etc
Example Subdomains: Payment, Underwriting, Authorization, Authentication, Logging, Auditing, Notifications, User Data, E-signing, Balances, etc
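To make the domain/subdomain split concrete, here is a small sketch of how one might find the subdomains that several domains share, which are the natural “legos” for the Core team. The domain and subdomain names come from the examples above, but the mapping of which domain needs which subdomain is assumed purely for illustration, as are the function names.

```python
# Hypothetical mapping of domains to the subdomains they need. The names come
# from the examples above; which domain needs which subdomain is assumed.
DOMAIN_NEEDS = {
    "Banking": {"Payment", "Authentication", "Authorization",
                "Balances", "Auditing", "Notifications"},
    "Loans": {"Underwriting", "E-signing", "Payment", "Authentication",
              "Authorization", "Notifications"},
    "Credit": {"Underwriting", "Authentication", "Authorization",
               "Auditing", "Notifications"},
}


def shared_subdomains(domain_needs, min_domains=2):
    """Map each subdomain needed by at least `min_domains` domains to those domains."""
    users = {}
    for domain, subdomains in domain_needs.items():
        for sub in subdomains:
            users.setdefault(sub, []).append(domain)
    return {
        sub: sorted(domains)
        for sub, domains in sorted(users.items())
        if len(domains) >= min_domains
    }


def core_candidates(domain_needs):
    """Subdomains required by every domain: the strongest Core team candidates."""
    return sorted(set.intersection(*domain_needs.values()))


if __name__ == "__main__":
    for sub, domains in shared_subdomains(DOMAIN_NEEDS).items():
        print(f"{sub}: {', '.join(domains)}")
    print("Needed by every domain:", core_candidates(DOMAIN_NEEDS))
```

In this made-up mapping, Authentication, Authorization, and Notifications are needed by every domain, so they would be built once by Core rather than three times by three pods. Subdomains used by a single domain, such as Balances here, would stay with the owning pod.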
The Core team’s customers are internal developers. The architecture team will work to reduce complexity in other teams and eventually help teams leverage what is developed by the Core team. Perhaps, eventually, the Core team would take over for the architecture group.
Next, we had to create good communication pathways. In my case, I elected for a Chief Architect and had the Core team report up to the Chief Architect. This makes sure tech and business strategies are in lockstep. This communication path would also allow architecture to leverage the Core team effectively. Build your legos and then employ them in your teams.
Lastly, we wanted to make sure that this way of designing systems propagated throughout the company. This meant Zoom calls and face-to-face meetings where we could candidly discuss architecture and architecture decisions. This would be as important as any code written. People would likely only buy in to this once they saw results from the architecture and Core teams. This will start at the CxO level and make its way all the way down into git commits. Let’s get the whole company behind it and put some resources on it.
Consider Your Timing
This is where I tell you (veterans already know) this may not be an easy pitch. You have to know where your company is in its business cycle. If they are still gunning for product-market fit, you can expect execs to have no bandwidth for this kind of initiative. There are many stages a company can be in, such as survival mode or market-capture mode, where they are more than willing to take on tech-debt. If they are very well funded or have exceptionally good leadership, you may get approval regardless of where the company is maturity-wise. Keep in mind that leadership may be correct in pushing this kind of initiative off. Every time you say “yes” to one thing you are saying “no” to another.
Getting to Yes
A boat can successfully navigate the ocean without life preservers, can’t it? Of course, it can. It isn’t necessarily a good idea for the passengers, but that doesn’t mean the propellers won’t spin.
The thing I often attempt to persuade CxOs with, when it comes to architecture initiatives, is building with legos. Ideally, we start building everything with a focus on being reliable and extensible (like a lego) rather than building with the sole goal of satisfying product acceptance criteria. Once we start outputting these legos, I tell them, the product teams can start leveraging engineering more efficiently and we ship fewer bugs.
I don’t feel this lego-talk motivates many executives because they are not unhappy with the status quo unless a system goes down or bugs pile up on customers. Perhaps more importantly, long-term engineering initiatives smack of idealism. A big contributor to resistance is a company culture that primarily, or exclusively, executes short-term plans. People, especially executives, are typically evaluated on short-term results. This pragmatism can quickly turn into myopia. As a person who has an inveterate tendency to meddle with the bigger picture I often run into this. When I was younger, I lacked the tact to know that there is a dangerous chasm here for engineers. Many of us never learn to clear this chasm by being extremely patient and staying positive.
In our case, shipping fewer bugs, sub-linear scaling, and revenue protection are the concepts I chose to try and find common ground with stakeholders outside of technology.
Sub-linear scaling
Oftentimes we have more headcount than we need to build and support our products because we don’t understand our systems. As we grow and hire more engineers, we split up the work into specialized teams. However, 5 teams of engineers aren’t going to be 5x as productive. Things slow down due to more coordination, more overhead, more externalities, and in many cases, much more complexity. (Paraphrasing Charity Majors)
One business outcome of an architecture simplification is the need for fewer people in engineering to ship product, all other things remaining equal. Sub-linear scaling is the result of any good architecture.
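One rough way to see why five teams are not five times as productive is to count how many pairs of teams might need to coordinate. The sketch below uses the familiar n(n-1)/2 pair count. It is a textbook simplification that assumes every team may need to talk to every other team, not a measurement from our org, and the function name is just for illustration.

```python
def coordination_channels(teams: int) -> int:
    """Number of distinct team pairs that may need to coordinate: n * (n - 1) / 2."""
    if teams < 0:
        raise ValueError("teams must be non-negative")
    return teams * (teams - 1) // 2


for n in (1, 2, 5, 10):
    print(f"{n:>2} teams -> {coordination_channels(n):>2} possible pairwise channels")
```

Going from 5 teams to 10 doubles the headcount but more than quadruples the possible channels, from 10 to 45. Well-bounded services keep most of those channels unused, which is where the sub-linear scaling comes from.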
Revenue Protection
This is a somewhat under-used tool in my opinion. Rework costs money. A lot of it. IRR (internal rate of return) is directly impacted by rework. Point out the rework being done, the frequency, and how it can be mitigated. We talked with quite a few teams to make our case here very strong. All teams are affected by business pivots, unforeseen use-cases/requirements, and lack of guidance on good design. Really dig into the “Why” of each rework here.
Don’t Give Up, It’s Important
Assuming you’ve taken timing into account, as a technical leader or representative of tech, it’s your job to really close a deal here, whatever it takes. If you can’t get leadership to come to the table, you will find yourself and your engineers working hard just to maintain something they are not proud of. This becomes not only an architecture problem but an attrition problem.
Don’t take rejection personally and don’t expect to get everything you want. Keep working towards this while empathizing with the business until it really makes sense to put your foot down. This is why being a Chief Architect, for example, is a difficult job. Complexity always comes down to people.
In Part II, I walk through what happened after we got buy-in. Big thanks to Charity Majors for her early feedback on this post.
Disclaimer: I do not advocate for any particular organizational structures here since the effectiveness of different structures is mostly dependent on the people that inhabit the various roles. Instead, I will focus on what needs to be done regardless of your org structure and titles — though you may certainly need to push for changes in the org.