Trying to parse my career trajectory from reading my resume is like relying on a pirate’s riddle to find treasure. Instead of “walk four paces from the crooked tree, west of skull rock,” though, it’s “spend eight months dropping frozen chicken tenders into a deep fryer, then move eight hundred miles to write automated QA tests for software.”
But there’s one constant in all the varied jobs I’ve had: they are defined by systems that are largely outside of our control, but that we’re ultimately responsible for.
The person who prepares your food has about as much control over the process that delivered the ingredients as the on-call developer has over whatever tangentially related system on the Internet broke and got them paged at 3 a.m. It’s those people, though, who are ultimately responsible for the immediate crisis, and who bear the burden of fixing it.
Even Simple Systems Are Complex
What these systems all have in common is that they’re far more complex than they seem from the outside. Like an iceberg, where a small tip breaking through the waves can hide a massive chunk of floating ice below the waterline (at least until climate change gets really going and we’re all living in Gas Town waiting for our turn in Thunderdome Plus), the systems that we’re responsible for in our professional lives are often more massive and cumbersome than they appear at first glance. Why is this? Some of it is easily explained, some less so, but the usual answer is quite simply ‘inertia’. This organizational and technical inertia doesn’t only affect large technical companies, however. It touches nearly everyone working with software today.
One form of this hidden depth is, of course, other people’s computers, better known as ‘the cloud’. Cloud services have revolutionized the software industry, giving us access to seemingly infinitely scalable hardware resources, durable managed services, and a whole host of convenient ways to accidentally delete an entire cloud stack because someone fat-fingered a terraform apply. I’ve seen that last one happen: a new team member at a prior job accidentally deleted every resource in our AWS account due to a combination of poor internal documentation, flawed pairing practices, and overly permissive IAM roles.
But even if you’ve built an internal system that’s resilient to human error, how confident are you that every external service you rely on has been just as careful? You may have a small application with only a handful of services (or even only one service), but every external API you rely on, whether from a cloud provider or some other SaaS product, is a potential point of complexity, failure, and hair-rending frustration when it goes down and takes your application with it.
Your Dependencies, Their Dependencies, and (Lest We Forget) What’s Dependent on Them
You probably didn’t write your own HTTP stack, or networking stack, or even string comparison library. While it’s easy (and cheap) to go after left-pad and similar stories (such as Docker migrating their Go packages to a new GitHub organization, before Go modules existed), the biggest threat to the performance and security of your application may simply be bugs in your third-party dependencies.
Whether through innocent logic errors or malicious intent, every module you import is a potential landmine and a source of complexity that you have little control over.
As open source becomes more integral to the art of software development, the potential impact becomes even more widespread. You may vet all of your dependencies, after all, but are the authors of your dependencies taking that same level of care with their dependencies? And this isn’t only a concern for direct code dependencies. Your CI system, your package and container repositories, and your deployment pipeline are all uncontrolled sources of entropy and complexity that you need to contend with.
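To get a feel for how quickly “their dependencies” pile up, here is a minimal Python sketch that walks a dependency graph and counts everything an application pulls in, not just what it imports directly. The package names and the graph itself are invented for illustration; a real project would read this from its lockfile or package manager.

```python
from collections import deque

# Hypothetical dependency graph: package -> packages it imports directly.
# None of these names refer to real projects.
DEPENDENCIES = {
    "my-app": ["http-client", "string-utils"],
    "http-client": ["tls-lib", "url-parser"],
    "string-utils": ["pad-left"],
    "tls-lib": ["crypto-core"],
    "url-parser": ["string-utils"],
    "pad-left": [],
    "crypto-core": [],
}


def transitive_dependencies(graph, root):
    """Return every package reachable from root, excluding root itself."""
    seen = set()
    queue = deque(graph.get(root, []))
    while queue:
        pkg = queue.popleft()
        # Skip packages already visited, and the root if a cycle leads back to it.
        if pkg in seen or pkg == root:
            continue
        seen.add(pkg)
        queue.extend(graph.get(pkg, []))
    return seen


direct = set(DEPENDENCIES["my-app"])
everything = transitive_dependencies(DEPENDENCIES, "my-app")
print(f"direct: {len(direct)}, transitive: {len(everything)}")
```

Even in this toy graph, two direct imports turn into six packages that somebody else maintains.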
While we often think about our software systems strictly in terms of technical depth and complexity, it’s important to remember the organizational and human systems that underpin them.
I think most people can relate to the feeling of helplessness that comes from fighting endless battles against organizational dynamics that don’t seem to make a lot of sense or have misaligned priorities. Maybe your organization rewards new features, while maintenance work is seen as “less important”. Perhaps you’re trapped in a ‘sales-driven development’ pattern where your work shifts from project to project, relentlessly adding new checkboxes to a system without a lot of concern for the overall scalability or maintainability of the application.
Long-term vendor contracts can tie us to particular pieces of technology, forcing hacks and workarounds into our development process. There are a million other pieces of organizational detritus that float around our teams, regardless of the size or complexity of the actual software we work on. This hidden depth is possibly the most pernicious, as it’s difficult to understand how you can even begin to tackle it, but it’s still a source of complexity as you build and maintain software.
Burning (Out) Man: We’ve All Been There, Some of Us Are Just More Vocal About It
Unfortunately, we don’t have the luxury of sitting back and saying “eh, we’ll fix it tomorrow” when it comes to addressing these issues. You may have a brilliant team of developers, designers, PMs, and more, but you can’t afford the human costs associated with unreliable software. Posts about burnout litter technical communities across the web, and recent studies indicate three in five employees feel burnt out by their jobs at least once a month.
The stress induced by trying to debug and analyze failures, especially those in services or external systems that aren’t under your direct control, can contribute to burnout. Teams that are suffering from burnout find themselves in situations that can rapidly escalate out of control, and the consequences can be dire. As failures pile on and multiply, more and more time is spent firefighting rather than dealing with the root cause of failures, which leads to more failure and more late-night pages.
The Truth About Deep Systems
We’ve been talking a lot about deep systems recently here at LightStep, and I’d encourage you to read some of that material and think about it in the context I’ve presented here.
In short, deep systems are architectures where there are at least four layers of stacked, independently operated services, including cloud or SaaS dependencies. Perhaps a better way to think about deep systems is not so much an explicit definition, but rather what they “sound like”:
- “I’m too busy to update the runbook.”
- “Where’s Chris? I’m dealing with a P0, and they’re the only one who knows how to debug this.”
- “We have too many dashboards!”
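The four-layer definition can also be checked mechanically. The sketch below, in Python, measures the longest chain of calls starting from an entry point and compares it to that threshold. The service names and call graph are hypothetical, and the function assumes the graph has no cycles.

```python
from functools import lru_cache

# Hypothetical call graph: service -> services it calls directly.
# "managed-db", "payments-saas", and "card-network" stand in for cloud or
# SaaS dependencies that someone else operates.
CALLS = {
    "web-frontend": ["api"],
    "api": ["managed-db", "payments-saas"],
    "payments-saas": ["card-network"],
    "managed-db": [],
    "card-network": [],
}


def stack_depth(calls, entry):
    """Count the layers on the longest call chain from entry (assumes no cycles)."""

    @lru_cache(maxsize=None)
    def depth(service):
        downstream = calls.get(service, [])
        # A service with nothing downstream is a single layer.
        return 1 + max((depth(s) for s in downstream), default=0)

    return depth(entry)


layers = stack_depth(CALLS, "web-frontend")
print(layers, "layers:", "deep" if layers >= 4 else "not deep")
```

The team here owns only two services, but the chain through the payments provider already reaches four layers.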
I’ve spoken with a lot of developers who think that their system isn’t ‘deep’ because they’ve only got a handful of services, or because they’re a small team. I’d argue that this isn’t the case at all. As demonstrated above, there are an awful lot of ways your system can have hidden depth that contributes to stress, burnout, and unreliable software. The solution isn’t to despair; it’s to embrace observability as a core practice of how you build and run software.
This way, when something breaks, you’ll have the context you need to understand what happened, who is responsible, and how to best resolve the issue — even if the regression or error is deep in the stack or the result of a third-party dependency.