A singularity is a point in time or in configuration space at which a system becomes unpredictable or ill-behaved in some way. For example, in mathematics, the function f(x) = 1/x is singular at x = 0 because it “explodes”; in general relativity, Stephen Hawking showed that the universe must have begun in a Big Bang singularity of infinite density, and Nobel laureate Roger Penrose proved that black holes necessarily contain a singularity; maybe most famously, the technological singularity is a “hypothetical point in time at which technological growth becomes uncontrollable and irreversible, resulting in unforeseeable changes to human civilization” (Wikipedia).
In this blog post I argue that software code bases are subject to what I call the code base knowledge singularity: in a nutshell, the point in time when organizational knowledge about the code has been lost, when nobody remembers the original design trade-offs or the nuances of the implementation. Drawing on my experience working with hundreds of teams across thousands of code repositories at Palantir, I describe the circumstances that lead to such singularities and present the techniques and tools we employ to avoid them.
What the heck is this?
I think every programmer has been confronted with the code base knowledge singularity at some point. It is the moment when you look at a code base, wondering, what the heck is this function doing? I have observed two common response patterns in such situations:
- The pessimist: I won’t ever touch that code, this is way too scary to change
- The optimist: That code looks way too convoluted, I’m sure I can refactor and simplify this
If I were an engineer in an organization or on a team that you are in charge of, would you prefer that I generally exhibit the pessimistic or the optimistic behavior? Correct: neither. The pessimistic variant is bad because it indicates that the code in question will never change; it is legacy by definition, a dead code base. The optimistic approach is bad in subtler ways: it is quite likely that there were good reasons for the code being complicated, and it is even more likely that I don’t understand those reasons and will mess things up and introduce bugs rather than simplify anything. This scenario is known as Chesterton’s fence; quoting from a Farnam Street article:
“When we seek to intervene in any system created by someone, it’s not enough to view their decisions and choices simply as the consequences of first-order thinking because we can inadvertently create serious problems. Before changing anything, we should wonder whether they were using second-order thinking. Their reasons for making certain choices might be more complex than they seem at first. It’s best to assume they knew things we don’t or had experience we can’t fathom, so we don’t go for quick fixes and end up making things worse.”
The code base knowledge singularity
Before changing such code, I should ask a knowledgeable colleague or contributor to explain its nuances. Unfortunately, more often than not, such a person does not exist, either because they have left the project or company or because they themselves have forgotten the details after a few years. This is the point at which the organization hits the code base knowledge singularity: it has lost the institutional memory of why a particular piece of software is written the way it is. The only remaining hope is source code archeology: digging through Git commit logs, GitHub pull request comments, emails, Quip documents, Slack channels, and so on, desperate to learn the why and the how behind the mysterious what the heck is this function doing piece of code.
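When the trail leads back into the repository itself, a handful of plain Git commands do most of the digging. Here is a minimal sketch of such an archeology session; the throwaway repository, the file name, and the commit message below are invented purely for illustration:

```shell
set -e
# Build a disposable repo so the archeology commands have something to dig in.
# Every name and message here is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name archeologist
git config user.email dig@example.com
printf 'def parse(s):\n    return s.strip()\n' > parser.py
git add parser.py
git commit -q -m "add parse(): strip input to work around upstream bug"

# Which commits added or removed the string 'parse', and what did their
# messages say? -S is the "pickaxe": it searches the content of each diff.
git log -S 'parse' --format='%h %s'

# Subject line of the most recent commit that touched the file -- often the
# only surviving record of the "why":
git log -1 --format='%s' -- parser.py

# Line-by-line attribution, ignoring whitespace-only changes:
git blame -w parser.py
```

In a real code base you would of course skip the setup and start with the pickaxe search and `git blame`, then follow the commit hashes back to pull request discussions.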
Of course the implications are not as dire as those of the technological singularity (“hypothetical point in time at which technological growth becomes uncontrollable and irreversible, resulting in unforeseeable changes to human civilization”). However, when approaching the code base knowledge singularity, developers become much more likely to introduce performance or behavior regressions, to waste time exploring futile previously-trodden paths, or to violate implicit or hidden cross-component assumptions.
So how did we get into this situation in the first place? Why did we end up with two poor choices, the pessimist treating code as legacy or the optimist refactoring and potentially introducing bugs? Many different factors determine how quickly code bases approach the singularity, but I want to call out two particularly important ones. First, communication style. In organizations with predominantly synchronous communication (telltale signs: “let’s schedule a VTC to figure this out”, “can we discuss this in person next week when I’m in London anyway?”), decisions are primarily made in verbal, face-to-face exchanges. Unless such organizations have developed a strong culture of note taking and dissemination, decisions, and more importantly the rationale behind them, are ephemeral and quickly lost.
In contrast, organizations with asynchronous communication styles (see Basecamp or GitLab, for instance) typically use collaborative, text-based communication tools (e.g., Quip or wikis) that automatically produce written artifacts of discussion and decision-making. But watch out: don’t get tricked into relying on synchronous communication tools like Slack to capture decisions for asynchronous consumption. Even with collaborative documentation tools like Quip or Google Docs, knowledge organization (structuring, tagging, search, archival, access control, and so on) remains a major challenge.
Second, the software architecture (as a mirror of the organizational structure, per Conway’s law) directly impacts the half-life of architectural decisions. For example, microservice architectures, in both code and organization, often exhibit strong intra-service social bonds but also strong per-service information siloing. Even if technical decisions within a given service are documented, organizations may lack an obvious forum for discussing and capturing cross-service, system-wide architectural decisions. Organization leaders who strive to promote organization-wide knowledge sharing will find themselves in a constant uphill battle against entropy and laziness.
Programming vs software engineering
The Farnam Street quote above discusses Chesterton’s fence in terms of “first-order thinking” and “second-order thinking”. In mathematics or physics, a first-order approximation is usually a linear approximation, e.g., an order-1 Taylor expansion of a function. Linear approximations have poor forecasting ability: for instance, they cannot express the difference between “my account balance grows” (first order) and “the rate at which my account balance grows itself grows” (second order). Analogously, first-order thinking concerns the direct consequences of an action, while second-order thinking considers the consequences of those consequences. Second-order thinkers continually wonder, And then what? They have better forecasting ability and produce more future-proof solutions.
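In Taylor-expansion terms (notation mine, purely illustrative): for an account balance b(t) = b_0 e^{rt} compounding at rate r, the first-order expansion freezes today’s growth rate, while the second-order term captures that the growth itself grows:

```latex
% Taylor expansion of b(t) = b_0 e^{rt} around t = 0:
b(t) \approx b_0 + b_0 r t                               % first order: growth at today's rate
b(t) \approx b_0 + b_0 r t + \tfrac{1}{2} b_0 r^2 t^2    % second order: the growth itself grows
```

The first-order forecast systematically undershoots compounding growth; that gap is exactly what second-order thinking is meant to catch.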
Programming (or “coding”) is first-order thinking; software engineering is second-order thinking. To understand the distinction, consider Google’s insight that “software engineering is programming over time”: programming solves the problem at hand, while software engineering solves the problem in such a way that the solution will stand the test of time and can be adapted and reused by other people and systems in the future. The person who wrote our what the heck is this function doing? piece of code was potentially a brilliant programmer, but quite likely not a stellar software engineer.
In contrast to programming, software engineering requires much more than writing down the correct algorithm: choosing a future-proof programming language, writing tests that document edge-case behavior and prevent regressions, writing API documentation, giving variables and methods intuitive and consistent names, documenting nuances of the design or implementation, and sharing knowledge and insights with others through documentation, tech talks, and education programs. These behaviors help avoid what the heck is this function doing? moments and postpone, or even avert, the knowledge singularity.
Coding tools for second-order thinkers
The code base knowledge singularity is too real a danger to be fended off with appeals alone (“just code more carefully!”, “try harder!”). Much more important are concrete tools and routines that anchor second-order thinking. Here are a few simple examples from my work at Palantir:
- Every code or configuration change at Palantir requires code review. A first-order consequence is catching bugs; an arguably more important second-order consequence is spreading knowledge about the code base.
- We use Quip for Request for Comments (RFC) documents in which engineers propose technical problems and solutions. Their peers can join the discussion, comment, critique, ask questions, etc. Wherever possible given legal and compliance constraints, team Quip folders have open-by-default access policies.
- Many teams produce Architecture Decision Records (ADRs) to document important, long-term architectural choices: which database technology or storage layout this service uses, why WebSockets are preferred in this service over REST-style RPC, how access control is designed for user-facing resources, and so on. It is practical to store ADRs alongside the code in the source code management system; this helps with both discoverability and archeology. Side note: a big challenge for many teams is the transition from the discussion phase (RFC) to the documentation phase (ADR). Too often, after a fruitful discussion on the RFC document, teams struggle to consolidate the discussion, update the written artifacts, and persist the insights from the RFC as an ADR.
- We have regular internal Tech Talks in which engineers give whiteboard-based presentations of the technical underbelly of their software components. Tech Talks are recorded and serve as a great source of information for new hires or new team members.
- Our teams hold post-mortem retrospectives, akin to the Five Whys technique, to understand high-impact bugs or system outages and to document the lessons for the future.
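To make the ADR idea concrete, here is a minimal sketch following the widely used Nygard format (status, context, decision, consequences). The file path, service, and decision are invented for illustration, not taken from any real system:

```
docs/adr/0007-websockets-for-notifications.md

# 7. Use WebSockets instead of REST polling for notifications

## Status
Accepted

## Context
Clients need near-real-time updates; polling the REST endpoint
generates significant load and still delivers updates late.

## Decision
Push notifications over a persistent WebSocket channel, keeping the
REST endpoint as a fallback for clients behind restrictive proxies.

## Consequences
Load balancers must support long-lived connections, and reconnect
and backoff logic moves into every client library.
```

Because the record lives next to the code, a future archeologist who hits a what the heck is this? moment can find the why in the same repository as the what.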
Despite all these good ideas and tools, I too experience what the heck is this function doing? moments on a regular basis, often enough with my own code. The code base knowledge singularity is real, and it is a major threat to the ability of software organizations to deliver new features safely and swiftly, or to evolve existing software to meet emergent infrastructure requirements in a time- and cost-effective manner. In contrast to singularities in mathematics, however, this is a singularity we can stand up against. Engineering leaders must recognize this threat and fight it hard, by championing engineering quality initiatives and incentivizing second-order thinking in software engineering.