My engineering standards
Software can never be perfect; it can only ever be “good enough”. Beyond a certain size and rate of change, it’s always going to contain bugs and experience outages.
So how do you know if your software is good enough?
How do you know if you’re optimising too much toward building and launching new features, and thus accruing technical debt that will eventually catch up with you, slow you down and break any semblance of good user experience?
Or how do you know if you’re being too conservative, over-engineering things, and not pushing hard enough to maintain a strong release cadence of new, value-adding, customer-facing features? When you work like this in a competitive market, you risk losing product-market fit and/or becoming an inferior product.
These are super hard questions, with no right answers.
My opinion and approach is to codify your beliefs around what constitutes software that is “good enough” into a small set of engineering principles and build a culture, organisation and set of processes that reinforce them.
Rich’s engineering standards
Adhering to the following standards, tenets or principles protects us from most of the easily avoidable mistakes, such as building something that’s not valuable, of poor quality, operationally weak or difficult to maintain/evolve.
From an engineering perspective our work is likely to be valuable and sufficiently high quality when it has the following qualities:
- It was conceived, prioritised and executed using our standard product development and engineering processes.
- Key engineering decisions were made based on thorough investigation, experimentation and analysis rather than weak understanding and/or guesswork.
- Work was ordered to eliminate the biggest risks first, rather than to make linear progress.
- Where possible it was built from standard technologies.
- There is sufficient instrumentation that the system is considered observable, especially in times of outage.
- It was shipped in many small releases rather than one bigger release.
- Where assessed as necessary, individual changes were de-risked using feature flags, especially when high-throughput, complex or unfamiliar code paths were touched.
- Each change was tested and validated after it went to production to assess if it worked and to see if it broke something else.
- We added a small number of high-signal paging alarms to ensure we knew when customer-facing functionality broke.
- We are confident that it won’t run out of capacity due to predictable growth.
- There is sufficient, discoverable documentation such that someone else can maintain and evolve things from here.
- The systems are recoverable by a shared out-of-hours on-call team.
- We were deliberate about communicating, and about getting peer review and feedback from the relevant people/teams, all along the way.
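
To make the feature flag standard a little more concrete, here is a minimal sketch of a percentage rollout in Python. It is an illustration under assumptions, not a description of any particular flagging tool: `bucket_for`, `is_enabled`, `checkout_total` and the `new-checkout` flag name are all hypothetical, and a real setup would read the rollout percentage from configuration rather than pass it in as an argument.

```python
import hashlib


def bucket_for(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99.

    Hashing keeps the assignment deterministic, so a given user sees the
    same behaviour on every request while the rollout percentage is unchanged.
    """
    key = f"{flag_name}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if the flag is on for this user at the given rollout."""
    if rollout_percent <= 0:
        return False  # kill switch: nobody takes the new path
    if rollout_percent >= 100:
        return True  # fully rolled out
    return bucket_for(flag_name, user_id) < rollout_percent


def checkout_total(user_id: str, prices: list[float], rollout_percent: int) -> float:
    """Hypothetical call site: the risky new code path sits behind the flag."""
    if is_enabled("new-checkout", user_id, rollout_percent):
        # New, unfamiliar code path (assumed here to round the total).
        return round(sum(prices), 2)
    # Existing, well-understood code path.
    return sum(prices)
```

Because a user's bucket never changes, raising the percentage only ever adds users to the new path, and setting it to 0 moves everyone back to the old path without a deploy. That is what lets each small release be validated in production before it is widened.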