When I meet new engineering teams, I’m always excited to delve into their design document repository. Analyzing the team’s design process is an excellent proxy for understanding how the team works. I hear these responses all too often:
- We don’t have a standard process for design reviews
- We don’t follow a standard design template
- Different teams and engineers use different tools like Google Docs, Confluence, and [insert your favorite tool here]
- We only do design docs for important features
- You can have a deep dive with [insert the senior-most engineer]. He or she can answer any questions you may have
- We are anti-process. We follow agile
- We are a startup. Velocity is critical for us
The last few years of my career have been about building and leading cloud engineering organizations of all shapes and sizes. I have been fortunate enough to lead the integration of my previous startup CirroSecure into Palo Alto Networks, as well as the integration and scaling efforts for a couple of significant acquisitions. Here are a few learnings I picked up along the way.
I come from a school of “measure ten times and cut once” (Ok, maybe measure twice at least!). A design review is one of the most crucial tools for ensuring good quality, scalability, and security. I am often surprised at how many teams skip this vital step for the perceived benefits of speed and agility. Designing things right becomes even more critical once you have product-market fit and have more than 15–20 customers. Rewrites become more expensive over time, especially when dealing with large amounts of data.
Over the last few years, cloud providers have made it simple to ship an almost working product in record time. However, ensuring optimal levels of operationality, quality, scalability, and security is still a complex issue. Balancing these aspects, especially without sacrificing velocity, can be incredibly challenging.
Techniques like Canary releases and feature flags are great tools to help achieve that balance. However, in my experience, implementing design processes will always be one of the most crucial steps to safeguarding the future. Prevention is cheaper and better than a cure!
The obvious next question is, how do you implement successful design processes without sacrificing velocity? Here are some ideas to consider when developing a good design review process.
Standardizing the design template: This is your top priority. Make sure you are thorough when you create the design template. Capturing thoughts around security, scalability, and operationality is key. Here are some guiding considerations.
Data Security and Privacy
- Are you storing customer data for this service? How will it be stored and for how long?
- Do you need an architectural security review?
Operationality and Supportability
- Is there a runbook?
- Is there a monitoring dashboard for this service? Where? What are the key indicators of impending failure?
Quality and Efficiency
- Is there a technical debt you are accruing with this service/module?
- What are the scale considerations? How do you plan to test for scale?
- How resilient is the service/module?
- How much will this service cost?
- What is the scaling factor for the cost (flat, linear, step, or exponential)?
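To make the last question concrete, here is a minimal Python sketch of what each cost scaling factor can look like as a function of customer count. All function names and coefficients are hypothetical placeholders, not figures from this article.

```python
import math

# Illustrative cost-scaling models for a service as a function of the
# number of customers n. Coefficients are hypothetical placeholders.

def flat_cost(n: int, base: float = 500.0) -> float:
    """Fixed cost regardless of customer count (e.g. one shared cluster)."""
    return base

def linear_cost(n: int, per_customer: float = 20.0) -> float:
    """Cost grows proportionally with customers (e.g. per-tenant storage)."""
    return per_customer * n

def step_cost(n: int, per_shard: float = 300.0,
              customers_per_shard: int = 100) -> float:
    """A new shard/cluster is provisioned every `customers_per_shard` customers."""
    return per_shard * math.ceil(n / customers_per_shard)

def exponential_cost(n: int, base: float = 10.0, growth: float = 1.05) -> float:
    """Pathological scaling (e.g. cross-tenant fan-out); a red flag in review."""
    return base * growth ** n

# At 250 customers the four models diverge sharply:
for model in (flat_cost, linear_cost, step_cost, exponential_cost):
    print(model.__name__, round(model(250), 2))
```

Writing the expected curve down during the review makes it easy to check the cloud bill against it later.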
How do you decide who attends the design meetings? This question is a tricky one. The whole team or only a select few? Just the architects or senior engineers? Ideally, one would try to get all engineers to attend — open meetings can encourage cross-training. More participants also mean more pairs of eyes looking for potential issues, which can be very beneficial. However, this becomes very inefficient as the team scales.
One could consider an 80–20 split: 80% of the attendees should be the folks working on the feature (Dev, DevOps, QA, and relevant architects), and 20% should be made up of new hires, junior engineers, and developers from other teams on a rotational basis. It is essential to get a cross-functional review of the feature from Dev, QA, and DevOps perspectives.
Another question that often gets thrown around is the role of an architect. Should they be an advisory body or a layer of enforcement? Who is responsible for making the final decision on architecture? The right answer depends on many factors — for example, the skill level of the team’s engineers and architects. However, one principle I encourage is delegating architectural decisions to a broader range of team members — not just architects and senior engineers. Limiting your set of decision-makers creates a knowledge and skill gap in the team, which can lead to a single point of failure. Senior engineers and architects should play the role of guides. They should continuously mentor their team with a “teach a person to fish” attitude. I would not recommend forming a dedicated architecture team until the organization scales beyond 100 engineers.
Design reviews for every feature? It would be ideal to design review every single feature and bug fix — unfortunately, this approach is not sustainable at scale. One suggestion is to let the lead engineer decide. However, this method is not reliable and can lead to misses. A better approach is to compute a Complexity and Impact Score using weighted averages based on several factors such as:
- The estimated development time for the feature
- Whether it involves new code or a refactor of an existing module
- The expected number of customers using the module
- The amount of data stored by the module
- The type of data accessed by the module
- The tenure of the engineers working on the module
- The complexity of the code
- Whether it requires net new work
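The factors above can be folded into a single number. Below is a minimal Python sketch of one way to do it, assuming each factor is rated 0–5 by the reviewing engineer; the factor names, weights, and the 2.5 review threshold are illustrative assumptions, not a prescribed rubric.

```python
# Hypothetical Complexity and Impact Score: a weighted average of 0-5
# ratings. Weights and factor names are illustrative placeholders.
FACTOR_WEIGHTS = {
    "estimated_dev_weeks": 0.20,   # longer features carry more risk
    "new_code_or_refactor": 0.15,  # refactors of live modules rate higher
    "customers_affected": 0.20,
    "data_volume": 0.15,
    "data_sensitivity": 0.15,      # e.g. PII rates a 5
    "engineer_tenure": 0.05,       # newer engineers => higher rating
    "code_complexity": 0.10,
}

def complexity_impact_score(ratings: dict) -> float:
    """Weighted average of 0-5 ratings; returns a 0-5 score."""
    total = sum(FACTOR_WEIGHTS[name] * ratings[name] for name in FACTOR_WEIGHTS)
    return round(total / sum(FACTOR_WEIGHTS.values()), 2)

def needs_design_review(ratings: dict, threshold: float = 2.5) -> bool:
    """Features scoring at or above the threshold get a full design review."""
    return complexity_impact_score(ratings) >= threshold

ratings = {
    "estimated_dev_weeks": 4, "new_code_or_refactor": 3,
    "customers_affected": 5, "data_volume": 4, "data_sensitivity": 5,
    "engineer_tenure": 2, "code_complexity": 3,
}
print(complexity_impact_score(ratings))  # 4.0
print(needs_design_review(ratings))      # True
```

The point is less the exact arithmetic than removing the decision from any single engineer’s judgment.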
Egoless Design Reviews — Just like blameless postmortems, egoless design reviews play a massive role in ensuring excellent results. These reviews should be led by the principal engineer responsible for the module. He or she should be trained to be open and actively solicit feedback from everyone — regardless of their title. Creating an environment where opinions are valued can go a long way in improving results. Engineers come in all flavors, and it is vital to get feedback from even the most reserved engineers.
Two of my favorite Ray Dalio principles apply here:
- The biggest threat to decision making is harmful emotions
- Decision making is a two-step process: learning and then deciding
Leaders should strive to create a transparent and safe environment where opinions are valued, and directness is encouraged.
Supportability and Observability
I wish I had a dime for every on-call incident that had an out-of-date runbook. Supportability and observability are both essential factors to think about during the design process. How will an SRE team detect degradation in my service? Is the runbook updated? Runbooks must be jointly reviewed by DevOps/SREs and developers during the design review process or before the release.
Where do we store these design documents? Where you store them matters less than how the documents are organized and how consistently they are created and updated. Tools like Confluence are littered with the graves of unused spaces and half-baked design documents. Well-organized design docs can also cut new-hire ramp-up time.
How do we measure the effectiveness of the design and the design review process? If you can’t measure it, it doesn’t count. One quantitative approach is to measure success by the number of rewrites, rollbacks, bugs, and incidents in the module.
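As a sketch of what such a measurement could look like, here is a hypothetical Python scorecard that tallies those four event types per module from issue-tracker exports. The event data and module names are made up for illustration.

```python
from collections import Counter

def design_scorecard(events):
    """Tally post-release events per (module, event_type) pair.

    events: iterable of (module, event_type) tuples pulled from your
    issue tracker; event_type is one of "rewrite", "rollback",
    "bug", or "incident".
    """
    counts = Counter()
    for module, event_type in events:
        counts[(module, event_type)] += 1
    return counts

# Hypothetical sample data for two modules:
events = [
    ("billing", "bug"), ("billing", "bug"),
    ("billing", "rollback"), ("ingest", "incident"),
]
card = design_scorecard(events)
print(card[("billing", "bug")])      # 2
print(card[("ingest", "incident")])  # 1
```

Trending these counts for, say, the first 90 days after release gives a concrete signal on whether the review process is catching problems before they ship.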
A good design process can transform engineering teams, yet very few follow best practices. It can enable teams to tackle scale, security, and performance problems proactively, leading to happy customers and engineers. The last five years were all about cloud adoption. I strongly believe that the next five will be about attaining operational maturity and productivity.
I would love to hear from you on how you have solved some of these challenges and what best practices you find useful.
You can reach me at @nishantdoshi or https://www.linkedin.com/in/nishantdoshi/