Engineering Ideas #4
Maxime Beauchemin summarizes hard-won lessons about building repeatable and reasonable data processing systems:
Functional programming brings clarity. When functions are “pure” — meaning they do not have side-effects — they can be written, tested, reasoned-about and debugged in isolation, without the need to understand external context or history of events surrounding its execution.
As ETL pipelines grow in complexity, and as data teams grow in numbers, using methodologies [of functional programming] that provide clarity isn’t a luxury, it’s a necessity.
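To make the idea of a "pure" pipeline step concrete, here is a minimal sketch. The `Order` record and the `daily_revenue` function are hypothetical names invented for illustration; they do not come from Beauchemin's post.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Order:
    # Hypothetical record type, used only for this illustration.
    order_id: int
    day: date
    amount_cents: int


def daily_revenue(orders: list[Order], day: date) -> int:
    """Pure: the result depends only on the arguments; nothing outside is read or modified."""
    return sum(o.amount_cents for o in orders if o.day == day)


orders = [
    Order(1, date(2020, 1, 1), 1200),
    Order(2, date(2020, 1, 1), 800),
    Order(3, date(2020, 1, 2), 500),
]

# The same inputs always give the same output, so the step can be
# tested in isolation and re-run without surprises.
first_run = daily_revenue(orders, date(2020, 1, 1))
second_run = daily_revenue(orders, date(2020, 1, 1))
print(first_run, second_run)  # prints: 2000 2000
```

An impure version would, for example, read "today" from the system clock or append to a table in place, and then its result would depend on when and how often it ran.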
Adrian Cockcroft adds a third component to the classic risk exposure formula in the context of engineering:
In addition to the common financial calculation of risk as the product of probability and severity, engineering risk includes detectability. Failing silently represents a much bigger risk than the same failure that is clearly and promptly reported as an incident. Hence, one way to reduce risk is to make systems more observable.
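As a rough illustration of the three-factor risk described above, the sketch below multiplies probability, severity and detectability ratings. The 1 to 10 scales (where a higher detectability rating means the failure is harder to notice) and the sample ratings are assumptions made for this example, not figures from Cockcroft's post.

```python
def risk_score(probability: int, severity: int, detectability: int) -> int:
    """Product of three 1-10 ratings; a higher detectability rating means harder to detect."""
    for rating in (probability, severity, detectability):
        if not 1 <= rating <= 10:
            raise ValueError("ratings must be between 1 and 10")
    return probability * severity * detectability


# Same failure, same likelihood and impact; only detectability differs.
loud_failure = risk_score(probability=3, severity=8, detectability=2)    # reported promptly
silent_failure = risk_score(probability=3, severity=8, detectability=9)  # fails silently

print(loud_failure, silent_failure)  # prints: 48 216
```

Under these assumed ratings, improving observability alone cuts the score by more than a factor of four without touching the failure's probability or severity.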
Response > prevention:
The first technique is the most generally useful. Concentrate on rapid detection and response. […] Figure out how much delay is built into your observability system. […] Try to measure your mean time to respond (MTTR) for incidents.
This reaffirms an idea from Lessons from Giant-Scale Services by Eric Brewer:
Two related metrics are mean-time-between-failure (MTBF) and mean-time-to-repair (MTTR). We can think of uptime as:
uptime = (MTBF - MTTR)/MTBF
Following this equation, we can improve uptime either by reducing the frequency of failures or reducing the time to fix them. Although the former is more pleasing aesthetically, the latter is much easier to accomplish with evolving systems.
For example, to see if a component has an MTBF of one week requires well more than a week of testing under heavy realistic load. If the component fails, you have to start over, possibly repeating the process many times. Conversely, measuring the MTTR takes minutes or less and achieving a 10-percent improvement takes orders of magnitude less total time because of the very fast debugging cycle. In addition, new features tend to reduce MTBF but have relatively little impact on MTTR, which makes it more stable. Thus, giant-scale systems should focus on improving MTTR and simply apply best effort to MTBF.
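A small worked example of the uptime formula, assuming an MTBF of one week (168 hours) and an MTTR of one hour; both numbers are made up for illustration. It shows that halving MTTR buys exactly the same uptime as doubling MTBF.

```python
def uptime(mtbf_hours: float, mttr_hours: float) -> float:
    """Brewer's formula: uptime = (MTBF - MTTR) / MTBF."""
    if mtbf_hours <= 0 or not 0 <= mttr_hours <= mtbf_hours:
        raise ValueError("expected MTBF > 0 and 0 <= MTTR <= MTBF")
    return (mtbf_hours - mttr_hours) / mtbf_hours


baseline = uptime(mtbf_hours=168, mttr_hours=1.0)
fewer_failures = uptime(mtbf_hours=336, mttr_hours=1.0)  # failures half as frequent
faster_repair = uptime(mtbf_hours=168, mttr_hours=0.5)   # repairs twice as fast

print(f"{baseline:.4%}")        # prints: 99.4048%
print(f"{fewer_failures:.4%}")  # prints: 99.7024%
print(f"{faster_repair:.4%}")   # prints: 99.7024%
```

The two improvements are equivalent on paper, which is why Brewer's argument rests on cost: MTTR is far cheaper to measure and improve.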
Cockcroft suggests a simple yet powerful way to improve operational observability:
Systems that have an identifiable capacity trend, for example are filling up disk space at a predictable rate, have a “time to live” (TTL) that can be calculated. Sorting by TTL identifies the systems that need attention first and can help focus work during a rapid response to a problem.
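The TTL idea can be sketched in a few lines, assuming linear growth at the currently observed rate. The volume names and numbers below are hypothetical.

```python
def hours_to_live(capacity_gb: float, used_gb: float, growth_gb_per_hour: float) -> float:
    """Hours until the resource is full, assuming linear growth at the current rate."""
    if growth_gb_per_hour <= 0:
        return float("inf")  # not growing, so no predictable exhaustion
    return max(capacity_gb - used_gb, 0.0) / growth_gb_per_hour


# Hypothetical fleet: (name, capacity in GB, used GB, growth in GB per hour)
volumes = [
    ("log-archive", 2000, 500, 10.0),
    ("cache", 200, 150, 0.0),
    ("db-primary", 1000, 900, 5.0),
]

# Smallest TTL first: these systems need attention soonest.
by_urgency = sorted(volumes, key=lambda v: hours_to_live(v[1], v[2], v[3]))

for name, capacity, used, growth in by_urgency:
    print(name, hours_to_live(capacity, used, growth))
# prints:
# db-primary 20.0
# log-archive 150.0
# cache inf
```

Note that `db-primary` sorts first even though `log-archive` grows twice as fast, because what matters is the remaining headroom divided by the rate.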
In the largest part of the post, Cockcroft works through an example of failure mode and effects analysis for an imaginary web service. It’s eye-opening to recognize how many ways there are to fail even for a service with zero dependencies. And the number of failure modes jumps sharply with each extra dependency the service has, such as a database.
I had a chance to practice with the failure mode and effects analysis framework. It’s very powerful indeed. Try it!
Continuing the theme of culture from the previous week, here is an observation from Stephen Orban:
During my career, I’ve had the opportunity to build new teams and products, optimize existing teams and products, lead all-out business transformations, and evangelize how technology can be applied to do so. I’ve had a few modest successes and learned from countless mistakes. I’ve come to believe that most of these successes or failures can largely be attributed to how well the culture supported (or worked against) a desired outcome, or how well the culture could be shaped to support that outcome.
A 10x developer strategy is far inferior to a high-multiple culture:
“It would be great to have all 10x folks and just fly forever. But the truth is we have a range of people available to us, and many of them are capable of doing solid good work as long as they are given the right environment and guidance in which to do it.” — Daniel Seltzer
Edmond Lau reminds us of a simple truth that some engineers (myself included) often lose sight of:
So how do you grow your impact as a software engineer without becoming a manager? You identify and solve problems that are core to the business, or you enable those around you to more effectively solve those core business problems. It’s in this alignment of your technical efforts with business value that your career grows.
Here are some examples of how you might amplify your impact without going into management:
— You build tools and abstractions that multiply the output of the engineering teams around you.
— You develop sufficient expertise to consult on software or experiment designs from other engineering teams, and your feedback is valuable enough that it shaves days or weeks worth of work or it turns key projects from failures into successes.
— You become an expert on a deep, technical field that is material to a growing company. For example, you become a machine learning expert and then work on news feed ranking at Facebook, ads ranking at Google, or search ranking at Airbnb. The projects you ship directly translate into growth and revenue for the company.
— You identify a critical business opportunity, perhaps by working with the sales and business teams, and you become part of the founding team within the company to build out a product to address that need.
— You build out onboarding and mentoring programs to teach and train other engineers, and you make them significantly more valuable members of the team.
— You play a key role in building out a solid hiring process, and you help recruit and close engineering hires.
— You make significant contributions to building the engineering brand for your company.
A profound essay by Julian Shapiro. I encourage you to read it in full.
Using a mental model to break out of your flow and question what you’re doing is like ripping the carpet out from under your goals. It’s painful to confront the possibility that you’ve spent years working on the wrong thing.
Whether it’s coding, working a 9–5, saving money, building muscle — anything — you’ve been riding the momentum of steady progression toward objectives. In the process, you’ve buried your head into your hustle and gone with the flow.
Losing agency over your life to extended flow is flow paralysis. It is the archenemy of critical thinking. It is the clearest sign you’ve abandoned mental models, because the ongoing use of mental models is supposed to continually adjust your life trajectory — whereas flow keeps you doing the same thing over and over again.
Julian shares a prioritization framework for tasks that could be applied to engineering work:
ICE: make short-term decisions using this model. When facing many options needing prioritization, score each on a scale of 1–10 using three variables.
— The positive Impact it would have if it succeeds.
— The Confidence you have that it will succeed if you try it.
— How Easy it would be to try it.
For each option, average its three numbers to get its ICE score. Then order all your options by their ICE scores. Options at the top of your list will have the highest expected value and should be given priority.
This is how you plan your life on the timescale of weeks or months.
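Applied to an engineering backlog, the ICE scoring described above might look like the sketch below. The task names and their scores are invented for illustration.

```python
def ice_score(impact: int, confidence: int, ease: int) -> float:
    """Average of the three 1-10 ratings, as described in the ICE model."""
    for rating in (impact, confidence, ease):
        if not 1 <= rating <= 10:
            raise ValueError("ratings must be between 1 and 10")
    return (impact + confidence + ease) / 3


# Hypothetical options: name -> (impact, confidence, ease)
options = {
    "rewrite the scheduler": (9, 4, 2),
    "add alerting on queue depth": (7, 8, 9),
    "cache the hot query": (6, 7, 6),
}

# Highest ICE score first.
ranked = sorted(options, key=lambda name: ice_score(*options[name]), reverse=True)

for name in ranked:
    print(f"{ice_score(*options[name]):.2f}  {name}")
# prints:
# 8.00  add alerting on queue depth
# 6.33  cache the hot query
# 5.00  rewrite the scheduler
```

The highest-impact option ends up last here because low confidence and low ease drag its average down, which is the trade-off the model is meant to surface.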