Technical Excellence at Scale

Ivan Kozhirug
Tide Engineering Team
20 min read · Jan 12, 2022

What is Technical Excellence?

Coming up with a short definition for Technical Excellence is not straightforward. Different people and different organizations have varying interpretations and perceptions of this subject.

I almost have the feeling that Technical Excellence falls into the category of “I know it when I see it”, i.e. we can recognize Technical Excellence even without having a formal definition for it. There are some good articles online on the topic, including the ones from Peter Lindberg, Marshall Bowers and Ben Morris and, of course, this talk by Martin Fowler. It turns out that even NASA published a paper titled Technical Excellence: A Requirement for Good Engineering. NASA is probably one of the organizations most strongly associated with Technical Excellence. Numerous brilliant engineers work there, tackling space exploration, one of the biggest engineering challenges of all time. After all, NASA sent people to the Moon more than 50 years ago, when the state of technology was nowhere near what we have today.

Source: https://www.theatlantic.com/photo/2019/07/apollo-11-moon-landing-photos-50-years-ago/594448/

NASA has defined four guiding principles for Technical Excellence:

  • Clearly Documented Policies and Procedures
  • Effective Training and Development
  • Engineering Excellence
  • Continuous Communications

They also defined two levels of responsibility in the pursuit of Technical Excellence:

  1. Personal accountability — each individual must believe that they are responsible for the success of the mission
  2. Organizational responsibility — provide the proper training, tools and environment

Combining the wisdom of the quoted articles, we can say that Technical Excellence means continuously learning, striving for quality, being open to new ideas and being willing to try out new approaches. Personal commitment is important, but what matters most is organizational commitment: nurturing the right environment and culture that supports learning and knowledge sharing.

Why does Scale matter?

In a small organization with just a handful of engineers, a single individual committed to Technical Excellence can have a huge impact. In an environment where everyone communicates with everyone else on a daily basis, it is easy to enforce best practices and spread new ideas informally. Knowledge sharing happens during casual chats in the office, over lunch or in code review comments. Well, in today’s (post-)pandemic world all of these are rather the exception. As you’ve probably heard a million times by now, things are never going to be the same. People will continue to work remotely, and it will become the norm to have colleagues based in different countries or continents. Even in this new reality it is much easier to maintain a culture of Technical Excellence in a smaller organization, where usually everyone is in the same few Slack channels or together on the same Zoom calls.

In a small company you might be successful without having any documented guidelines, policies etc. and without the organization itself having a commitment to Technical Excellence. This might be perfectly fine if you are a few-months-old start-up where everyone is focused on delivering an MVP and no one knows if the company will exist in a year. But if that start-up survives a bit longer, you are entering new territory. Even if the size of the team stays mostly the same, to be successful in the long run Technical Excellence shouldn’t depend on one or two engineers. As time passes, people are likely to seek new challenges, and without organizational mechanisms to nurture Technical Excellence the engineering culture will weaken when key people leave.

In bigger companies the impact of an individual is limited. A single great engineer won’t be able to spread their influence to more than a few teams. At such a scale Technical Excellence is impossible without commitment at an organizational level, which of course means a strong commitment from leadership. Even if you are lucky enough to have good talent density (i.e. enough engineers who exhibit the attributes of Technical Excellence on a personal level, evenly distributed across your organization), this still won’t be enough. Different people, teams and departments can have different perceptions and interpretations of the subject. You need a company-wide agreement on what good looks like.

It’s quite obvious that there are a lot of differences between “small” and “big” organizations. But in today’s world a company can transition from “small” to “big” really quickly. It’s not uncommon for successful start-ups experiencing rapid growth to go from a small team of engineers to hundreds of them in just a few years. Having the right policies early on is crucial for maintaining Technical Excellence as the organization grows.

Unfortunately, it’s also not uncommon to see companies that had a great engineering culture when they were small let it rapidly deteriorate as they expand. This is usually the result of a number of factors: key people leaving, a lowered hiring bar, not investing enough in training new engineers, not maintaining and enforcing guidelines, not investing enough in documentation, etc.

Why Tribal Knowledge doesn’t scale well

Source: https://www.wikiwand.com/en/Regional_forms_of_shamanism

Have you worked in a company where the only way to get help was to ask a specific person or a group of people as their knowledge wasn’t documented anywhere?

“Tribal Knowledge is information that is known by certain individuals or groups of individuals within a company but is undocumented and is not common knowledge to everyone” [Quote]. This phenomenon has almost certainly existed in every company at some point, especially in organizations where creating and maintaining documentation is an afterthought.

As your organization grows, the people holding the Tribal Knowledge will have to handle more and more requests for assistance each day. After a certain point it might turn out that they spend almost all of their time answering questions, helping and unblocking others. This is not scalable and, more importantly, the Tribal Knowledge will be lost when these people leave.

The first steps for improvement are pretty straightforward:

  1. Identify the Tribal Knowledge holders.
  2. Ask them to pinpoint gaps in the documented knowledge.
  3. Fill them.
  4. (Then comes the hard part) Enforce the practice of documenting (solutions, processes, policies etc.) early on at the “idea” stage and do not leave it for a later point.

In most cases, “later” actually turns out to be “never”. Achieving and maintaining Technical Excellence at scale requires a general “shift left” on documenting the organization’s knowledge.

What is the “Bus Factor”?

Source: https://twitter.com/geminicompanies/status/857951380576849920

At one of my previous companies we had a backend team of just 4 people that had to take care of a system that included a legacy monolith and around 30 microservices. The same 4 people also had to maintain the infrastructure, CI/CD pipelines etc. Due to the size of the system, each of us had to specialize in a certain area. We had very limited resources, and our CTO would often rhetorically ask: “What would happen if any of you gets hit by a tram?”.

He used that example because there were trams passing by on the street near our office building. I hadn’t heard the term “Bus Factor” back then, but in our case it was pretty clear that we had several instances of single engineers accumulating lots of key knowledge.

The “Bus Factor” is defined as the minimum number of team members that have to suddenly disappear from a project before it gets derailed due to lack of knowledge.

In our case we had a “Bus Factor” of just 1. But this can easily be a huge issue in larger organizations as well. Relying on Tribal Knowledge will definitely produce low “Bus Factor” numbers.

You definitely don’t want a low “Bus Factor” when it comes to Technical Excellence. Relying on one person to enforce best practices and processes is not sustainable. You need other mechanisms to make sure that all engineers know “what good looks like” without depending on the limited capacity of a few people who won’t be at the company forever.

Why does your Organization need Architectural Principles?

Most of us have a set of core values we rely upon in our everyday lives. These fundamental beliefs guide our decisions and behaviour, especially when it comes to differentiating between right and wrong. A good piece of advice is to always fall back on your principles when facing an important decision. The real world, however, is complex, and it’s not always easy to align with all of them.

In software engineering organizations the vast majority of the important decisions are related to architecture. They are usually made early in the process, and changing course later is expensive, which is why getting them “right” is crucial. Relying on documented architectural principles allows you to compare different design proposals and choose the ones that better align with the company’s principles, without falling back on personal preferences. Big organizations usually produce and maintain large systems that consist of multiple subsystems owned by different teams or even departments. Architectural principles ensure that the designs of your subsystems are compatible across the organization and won’t diverge too much. Teams can produce designs autonomously by following the principles, which should minimize the cases where a design gets rejected by the wider organization.

A principle should be general enough to apply to a broad class of problems.

For example, some of Tide’s architectural principles are:

  • “Maximize cohesion, minimize coupling”
  • “Promote simplicity”
  • “Transparency”

Applying each principle has its consequences.

Let’s take “Maximize cohesion, minimize coupling”, a statement probably every engineer has heard early in their career or at university, as it applies even at a very low level, e.g. to the relations between functions/methods in a small program. For example, following “Maximize cohesion, minimize coupling” would mean that you (a small code sketch follows the list):

  • “should prefer single-team-owned services over shared monoliths”
  • “should prefer single purpose services over all-mighty monoliths”
  • “should prefer an event-driven model over a request-response one”

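To make the principle less abstract, here is a minimal, hypothetical Java sketch; the names (OrderPlaced, EventBus, the two subscribers) are invented for illustration and not taken from Tide’s systems. The publisher knows nothing about its consumers, which is exactly the kind of decoupling the principle asks for:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical domain event; the names here are purely illustrative.
record OrderPlaced(String orderId) {}

// A toy in-process event bus standing in for a real broker (Kafka, SNS, etc.).
class EventBus {
    private final List<Consumer<OrderPlaced>> subscribers = new ArrayList<>();

    void subscribe(Consumer<OrderPlaced> subscriber) {
        subscribers.add(subscriber);
    }

    void publish(OrderPlaced event) {
        subscribers.forEach(s -> s.accept(event));
    }
}

public class CouplingDemo {
    public static void main(String[] args) {
        EventBus bus = new EventBus();
        // Each subscriber has one cohesive responsibility and could live in
        // its own single-team-owned service.
        bus.subscribe(e -> System.out.println("email: confirmation for " + e.orderId()));
        bus.subscribe(e -> System.out.println("analytics: recorded " + e.orderId()));
        // The ordering flow publishes the event without referencing email or
        // analytics code at all: minimal coupling between the parts.
        bus.publish(new OrderPlaced("ord-42"));
    }
}
```

In a request-response design the ordering code would call the email and analytics services directly, coupling it to every consumer that gets added later; with the event-driven approach new consumers can be added without touching the publisher.
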
Architectural Principles help you define what good architecture looks like in your organization. Software engineering, much like the real world, is complex, so expect to have to choose between your principles; these are the so-called trade-offs. Some of your principles might overlap or even contradict each other. Don’t follow them blindly, but if you decide not to follow one of them in a given situation, make sure you have very solid reasons for doing so.

At the end of the day, the important thing is to strive to create designs that comply with as many of your company-recognized principles as possible.

Why does your Organization need Guidelines?

A Guideline is a collection of best practices for developing software.

Unlike architectural principles, a guideline shouldn’t be general or apply to a broad class of problems. A guideline describes best practices, rules, conventions, processes and ways of working applicable only to a specific subject. Examples of guidelines could be:

  • a language specific Guideline (Java Coding Conventions, Python Style Guide etc.)
  • a Guideline for designing REST APIs
  • a Guideline for asynchronous messaging
  • a Guideline for using Databases
  • a Guideline for Code Reviews

There are two options when it comes to making sure that all engineers in an organization know “what good looks like”.

The first one is to rely on experienced people to informally share their knowledge, mentor new joiners and enforce best practices in code reviews. But this is basically relying on Tribal Knowledge and, as discussed earlier, it will not work for bigger and growing companies.

The second option is to formalize the organization’s view on best practices in a set of guidelines for various subjects. Documented guidelines ensure that each engineer has access to that knowledge and can use it as a reference when in doubt, without depending on the availability of certain people.

And this is just a small part of all the benefits that maintaining guidelines would bring to your work.

What are the subjects that are worth creating a guideline for?

Observing the issues that arise frequently gives a good idea of which subjects require guidelines. Issues that engineers regularly get stuck on and need help with, best practices most people are not aware of, or things that are done in different ways by different people, leading to inconsistency and confusion, are all good subjects for guidelines. Basics that are obvious to every decent engineer should not be part of the guidelines. Be aware that people can only remember a certain amount of information, and new joiners will need to get familiar with all the guidelines in a short time frame. A large set of guidelines can be counterproductive and can delay onboarding, so carefully consider whether something really needs to become part of a guideline.

The rules that might become parts of a guideline can be divided into 3 main types:

  1. Rules that prevent disasters — not following them will result in severe negative consequences. Think about rules related to security and securing your APIs in particular. If those are violated your users might get hacked, their data leaked, they might even lose funds, depending on the industry you are in.
  2. Rules that enforce best practices — they usually improve the “-ilities” of your system, like maintainability, reliability etc.
  3. Rules for consistency — they often relate to conventions like format, naming etc. The bigger the scale, the more difficult it is to maintain consistency across your code base.

The benefits of Consistency at Scale are so big that they deserve a dedicated section.

The importance of consistency

Consistency rules might seem less important, as they often relate to naming and formatting conventions. In reality, enforcing them in bigger organizations creates large productivity benefits. Imagine hundreds of engineers split into dozens of teams maintaining hundreds of services. If rules for consistency are not applied to seemingly minor things like naming and formatting conventions, or if the choice of libraries, languages and tools that engineers can use is not restricted, the end result will be a confusing code base that is very hard to maintain. Consistency enables “internal open source”, i.e. engineers contributing to services that are owned by another team. It also makes it easier to transfer service ownership from one team to another.

Some engineers may perceive consistency requirements as a limitation of their freedom of choice. But at scale it’s worth considering the trade-off between consistency and unrestricted freedom, and consistency wins by a large margin. Yes, an individual contributor might be temporarily unhappy because they weren’t allowed to use their preferred approach, but in the end this benefits the organization as a whole, because other engineers will be able to easily maintain that code later on.

Consistency also prevents conflicts and saves time that would otherwise be wasted in unproductive discussions. We all know how engineers sometimes engage in passionate debates about their preferred conventions, frameworks etc. that can escalate into personal conflicts.

For example, whether you use snake_case or camelCase for naming query parameters in APIs doesn’t make much difference, and debating it won’t bring value either. Simply enforce one of them across the code base. When the choice is restricted, there is no option but to follow the guideline. Since it takes just 2 engineers to start an unproductive conversation, imagine how much time is saved in the long run in a company with 100+ developers. Of course, this doesn’t mean that engineers should avoid sharing their opinions. It means that people will have more time to focus on discussing the things that really matter and that bring value to the customers and the company.
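
As a purely illustrative sketch of “decide once, enforce everywhere”: in a Java service that serializes its payloads with Jackson (2.12 or newer), a single shared ObjectMapper configuration can settle the snake_case question for every endpoint at once. The payload type below is made up:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.PropertyNamingStrategies;

public class NamingConventionDemo {
    // Invented payload type, just for the demonstration.
    static class PaymentRequest {
        public String accountId = "acc-1";
        public long amountInPence = 1200;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // The convention is decided once, centrally; no individual endpoint
        // or team gets to re-litigate the choice.
        mapper.setPropertyNamingStrategy(PropertyNamingStrategies.SNAKE_CASE);
        System.out.println(mapper.writeValueAsString(new PaymentRequest()));
        // Prints: {"account_id":"acc-1","amount_in_pence":1200}
    }
}
```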

Automate your guidelines

Checking for things like formatting compliance or naming conventions during code reviews is boring for engineers. No one likes counting the number of spaces used for indentation just to see if they match the convention, or staring at the screen to catch redundant empty lines. People are not robots, and even if they do their best they still won’t catch all cases. These are exactly the kinds of things that machines do faster and better than humans. An automated checker can analyze thousands of lines of code in a matter of minutes (or even seconds), so the amount of time it spends per day doing its job won’t increase much even if the company grows from 20 to 200 engineers. Obviously the same doesn’t apply to humans performing manual checks, so guideline automation helps you scale.

Formatting and naming conventions are not the only things you should automate. In fact, you should be able to automate the bigger part of your guidelines. There are multiple code analyzers that allow creating custom rules, especially for the most popular programming languages. Even if there isn’t a tool already available, you can always create your own. Here consistency brings a big benefit again: it’s easier to use or create automated tools when your code base is consistent across services and teams. The more inconsistencies you have, the more corner cases your automated tools will need to handle.
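
As one hedged example of what such custom rules can look like on the JVM, libraries like ArchUnit let you express guideline rules as plain tests that fail the build when violated; the package names below are made up:

```java
import com.tngtech.archunit.core.domain.JavaClasses;
import com.tngtech.archunit.core.importer.ClassFileImporter;
import com.tngtech.archunit.lang.ArchRule;

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.classes;
import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

public class GuidelineRulesTest {

    // Import the compiled classes of a (hypothetical) base package.
    private static final JavaClasses CODE =
            new ClassFileImporter().importPackages("com.example.payments");

    public static void main(String[] args) {
        // Consistency rule: REST controllers follow the naming convention.
        ArchRule controllerNaming = classes()
                .that().resideInAPackage("..controller..")
                .should().haveSimpleNameEndingWith("Controller");

        // Best-practice rule: only the persistence layer may touch JDBC code.
        ArchRule persistenceIsolation = noClasses()
                .that().resideOutsideOfPackage("..persistence..")
                .should().dependOnClassesThat().resideInAPackage("..persistence.jdbc..");

        // check() throws with a detailed report when a rule is violated,
        // which is what makes these rules enforceable by machines.
        controllerNaming.check(CODE);
        persistenceIsolation.check(CODE);
    }
}
```

In practice rules like these usually run as unit tests, which is how the CI/CD pipelines described below can enforce them on every merge request.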

Automated guideline checkers should run as part of the CI/CD pipelines. This will make sure that all rules are enforced and no one can bypass them. Usually the two things that cannot be automated are:

  • checking if the code is easy to read and understandable by humans
  • checking if the implementation does exactly what it’s supposed to do

These are the most important subjects worthy of your engineers’ time during code reviews.

Managing guidelines

Anyone should be able to propose and create a new guideline; whether it gets adopted depends on the engineering community. The bigger your organization, the harder it is to get everyone to agree with everything in a guideline. That’s why achieving 100% consensus is not a viable option; relying on a simple majority is more realistic. As each guideline is focused on a particular subject, it is reasonable to designate a number of engineers who are recognized as experts on the specific topic as owners of the guideline. That group of people should act as gatekeepers of the guideline and should have the authority to veto any changes.

As with everything in life, your guidelines are not set in stone. You might need to change the existing ones or entirely deprecate and replace them. Reasons for this might include new versions of languages and libraries being released, or collecting more information about how the existing guidelines affect your engineers and the issues they face complying with them. Try to be Data-Driven with such decisions. For example, if there is enough data suggesting that a rule in an existing guideline is affecting productivity in a negative way, then it has to be changed. Ideally, proposing and creating new guidelines should be Data-Driven as well.

Managing documentation at Scale

We now know that documenting knowledge is important. The larger your organization, the more documentation (hopefully) it will produce. Bigger companies have usually existed for a number of years to reach that size, so they will need to manage older and older documentation as time passes. Probably the 3 most important properties of good documentation are that it is:

  • Organized
  • Easily searchable
  • Kept up to date

Fortunately, today we have a lot of tools for managing documentation that provide good enough search capabilities. On the other hand, keeping the documentation organized and up to date often relies on humans. And people forget, get sloppy, or documentation just becomes an afterthought for them. The logical solution is to automate as much as possible. Tools like Swagger can help you automate the documentation of REST APIs. AsyncAPI can be used to automatically document your Event-Driven APIs. There are other tools out there for different use cases. Just find them and put them into action. If there isn’t one for your particular use case, create it. Of course, there are things that you can’t automate, so engineers still need to follow some best practices when creating documentation.
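
As a hypothetical sketch of that kind of automation, assuming a Java/Spring Boot service with the springdoc-openapi dependency on the classpath, the REST API reference can be generated straight from annotated code, so it is rebuilt on every deployment instead of being maintained by hand:

```java
import io.swagger.v3.oas.annotations.Operation;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

// Invented controller for illustration. With springdoc-openapi present, an
// OpenAPI document describing this endpoint is generated automatically
// (served at /v3/api-docs by default), so the reference documentation
// cannot silently drift away from the code.
@RestController
public class AccountController {

    @Operation(summary = "Fetch a single account by its identifier")
    @GetMapping("/accounts/{accountId}")
    public String getAccount(@PathVariable String accountId) {
        return "account:" + accountId; // placeholder response body
    }
}
```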

Let’s have a look at some of those practices.

How to create documentation that’s valuable for future readers

Most people focus on documenting the what and the how e.g. “What features does a service provide?” and “How are they implemented?”. That’s fine to start with, but there are two issues when answering only these two questions:

  1. The answers to these questions are most probably already available in the code, i.e. any engineer can just read the code and understand what a service does and how it is implemented. Because documentation is often out of date, some engineers prefer jumping straight to the code to avoid wasting time on stale and confusing docs.
  2. They don’t explain the “why”, e.g. “Why were things designed in a certain way?”, “Why was pattern X used instead of Y?” etc. The answers might have been obvious at the time of implementation, but years later this might not be the case. Even worse, the original creators most probably won’t be at the company anymore, and the knowledge will be lost forever. As a consequence, new engineers are often left guessing the reasons behind certain decisions.

I’ve often seen brave new engineers joining a company and immediately starting to criticize how things were designed or implemented. They would often say things like: “I don’t see any reason why this is here. Let’s remove it.”, “This doesn’t make any sense. Let’s change it.” etc. More often than not they are not aware of the context and constraints that existed at the point in time when those decisions were made. In most cases it turns out that it was simply the best or even the only available option for the engineers who worked on the problem in the past.

So, the criticism is totally unfair.

This is a perfect example of Chesterton’s Fence. The principle’s most concise version states: “Do not remove a fence until you know why it was put up in the first place”. Translated to the software engineering world: not knowing why something was designed in a certain way is definitely not a good enough reason to change or remove it. Moreover, not understanding the why is actually a good enough reason to forbid changes.

The “why” is usually missing from the documentation. Of course, we can’t document the reason behind every minor decision; there will always be cases where someone gets confused while reading code that is a few years old. What’s important is to properly document the “why” behind the decisions with the biggest consequences. Usually those are the decisions related to your architecture.

What is LADR?

LADR stands for Lightweight Architecture Decision Records.

This approach to documenting architectural decisions was originally proposed in a blog post by Michael Nygard. The linked posts contain enough detail about LADR, so we’ll give only a really brief overview here.

LADR uses sequentially numbered files with a structured format to keep track of significant architectural decisions. Each file records a single decision. The files should be small and, in general, shouldn’t change after a decision has been accepted. The sequence of files can be viewed as a sort of commit log, i.e. by replaying all accepted decisions from the beginning you should arrive at the current state of your architecture.

Each file contains the following sections (a short example follows the list):

  • Title;
  • Date;
  • Context — describes what the current problem, state or situation is and mentions the constraints and forces at play at this moment in time. This section contains the reasons for your decision; it answers the “why”;
  • Decision — describes your decision;
  • Status — Proposed, Accepted, Deprecated, Superseded;
  • Consequences — describes how your decision would change the Context.
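
For illustration, an entirely made-up record in this format might look like this:

```
# 14. Use a single message broker for asynchronous communication

Date: 2022-01-12

## Status

Accepted

## Context

Teams are adopting asynchronous messaging independently and are evaluating
different brokers. Operating several brokers would fragment tooling, client
libraries and on-call knowledge across the organization.

## Decision

All new event-driven communication between services will use one centrally
operated message broker.

## Consequences

Operational expertise and guidelines can be shared across teams. Teams give
up some freedom of choice; exceptions require a superseding record.
```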

Solid data proves the benefits of good quality documentation

Source: https://services.google.com/fh/files/misc/report_2021_accelerate_state_of_devops.pdf

Recently, the DevOps Research and Assessment (DORA) team at Google Cloud published the Accelerate State of DevOps 2021 report. The report “represents seven years of research and data from more than 32,000 professionals worldwide”. For the first time it “measured the quality of internal documentation and practices that contribute to this quality”. The findings are not surprising: “teams with high quality documentation are better able to implement technical practices and perform better as a whole”. Most of us probably “had a feeling” that this would be the case, but “having a feeling” and having solid data that proves something are quite different things, right?

So let’s have a look at some numbers. Teams with quality documentation are:

  • 2.4 times more likely to see better software delivery and operational (SDO) performance
  • 3.8 times more likely to implement security practices
  • 2.4 times more likely to meet or exceed their reliability targets
  • 3.5 times more likely to implement Site Reliability Engineering (SRE) practices
  • 2.5 times more likely to fully leverage the cloud

“Teams with good documentation deliver software faster and more reliably than those with poor documentation.[…] From security to testing, documentation is a key way to share specialized knowledge and guidance both between these specialized sub-teams and with the wider team.”

The report outlines some best practices for improving documentation quality. Let’s have a look at some of them.

  • “Define owners of the different parts of the documentation. This is a general principle” — not having a dedicated owner for something means it most probably won’t get done. The other possible outcome is multiple people pulling it in different directions.
  • “Define Guidelines for creating and maintaining documentation.” — Writing quality documentation is not much different from writing quality code. To promote best practices and enforce consistency in your organization’s documentation, you’ll need some Guidelines for it. At Tide we even have a Guideline for creating Guidelines, called “Guidelines about the Guidelines”.
  • “Include documentation as part of the software development process” — it shouldn’t be an afterthought. “Like testing, documentation creation and maintenance is an integral part of a high-performing software development process.” Automate as much as possible in terms of documentation creation and maintenance.

We can see that the latest Accelerate State of DevOps report aligns with everything we discussed earlier.

Why does your Organization need Communities of Practice?

We already know that Technical Excellence requires openness to new ideas, willingness to try out new approaches and continuous learning. But in today’s world engineers are constantly under pressure to deliver the “next” feature, and the “next” unrealistic deadline is just around the corner. A lot of engineers go into a mode where they spend all of their time (and even overwork in some cases) just doing their “normal” work, using the same technologies and tools they’ve been using for years. Ben Northrop’s excellent blog post titled Always do Extra introduces the concepts More, Extra and Nothing for how to use the little spare time you have at work. More means just doing more of your “normal” work, while Extra means working on things that would allow you to learn and could benefit both you and the company.

Doing Nothing or More doesn’t align well with Technical Excellence. Yes, some engineers will work towards Technical Excellence in their personal time, but let’s be honest — people have families, friends and other passions different from technology — they have a life.

So how do you enable the engineers in your organization to continuously learn and improve during working hours given the constantly increasing demands from the business to deliver new features to your users?

One of the options is having the so-called Communities of Practice (CoPs), which some companies also call Interest Groups.

These are groups of engineers who share a common curiosity about a particular topic or technology. Communities of Practice give people the opportunity to spend some of their working hours on things they are passionate about, things that don’t feel like “work”. Being able to work on interesting things outside of regular duties will make most people happier, and that’s not the only benefit the organization gets. These communities nurture a group of experts on the particular topic or technology. Over time, people attending the various Interest Groups become the natural “go-to” experts when other engineers have problems related to their area of specialization. The members of the Communities should be encouraged to share their knowledge and achievements with the wider organization by creating Guidelines, presentations etc. Some of the groups may even create libraries or tools that are adopted across the organization.

It’s important to align the creation and growth of the various groups with the organization’s goals. It doesn’t make sense to have a CoP focused on a topic that is not relevant for the organization or is not likely to become relevant in the foreseeable future. As the organization grows it might turn out that a certain topic requires more attention and a group of engineers dedicating just a portion of their time can’t solve all the problems. If that’s the case then the time has come to create a dedicated full-time team that would tackle those problems.

Finally, let’s mention something about events like Innovation Days, Hackathons etc. These complement the Interest Groups in a natural way as they provide engineers with days that can be entirely dedicated to their CoPs. Of course, such events are usually not limited to CoP work and allow all kinds of exploration and experimentation which are important prerequisites for Technical Excellence.
