The code base knowledge singularity

Robert Fink
Feb 11 · 7 min read

A singularity is a point in time or configuration space at which a system becomes unpredictable or ill-behaved in some way. For example, in mathematics, the function f(x) = 1/x is singular at x=0 because it “explodes”; in general relativity, Stephen Hawking showed that the Big Bang has infinite density (i.e., space is point-like) and Nobel Prize winner Roger Penrose argued that black holes necessarily have a singularity; maybe most famously, the technological singularity is a hypothetical point in time at which technological growth becomes uncontrollable and irreversible, resulting in unforeseeable changes to human civilization (Wikipedia).

In this blog post I argue that software code bases are subject to what I call the code base knowledge singularity. In a nutshell, this is the point in time when organizational knowledge about the code has been lost, when nobody remembers the original design trade-offs or the nuances of the implementation. Given my experience of working with hundreds of teams across thousands of code repositories at Palantir, I will describe the circumstances that lead to singularities and present the techniques and tools that we employ to avoid them.

O.P.C. (Copied from Abstruse Goose under CC BY-NC 3.0 US). Other people’s code need not be terrible: it can be a pleasure to read if it was authored with consideration and empathy for the reader.

What the heck is this?

If I were part of an engineering organization or team that you are in charge of, would you prefer that I generally exhibit the pessimistic behavior (leaving the puzzling code alone) or the optimistic behavior (rewriting it to simplify it)? Correct, neither. The pessimistic variant is bad because it indicates that the code in question will never change: it is by definition legacy, a dead code base. The optimistic approach is bad in more subtle ways: it’s quite likely that there were good reasons for the code being complicated, and it’s even more likely that I don’t understand these reasons and will mess things up, introducing bugs rather than simplifying. This scenario is known as Chesterton’s fence; quoting from a Farnam Street article:

“When we seek to intervene in any system created by someone, it’s not enough to view their decisions and choices simply as the consequences of first-order thinking because we can inadvertently create serious problems. Before changing anything, we should wonder whether they were using second-order thinking. Their reasons for making certain choices might be more complex than they seem at first. It’s best to assume they knew things we don’t or had experience we can’t fathom, so we don’t go for quick fixes and end up making things worse.”

The code base knowledge singularity

Of course, the implications are not as dire as those of the technological singularity (the hypothetical point in time at which technological growth becomes uncontrollable and irreversible, resulting in unforeseeable changes to human civilization). However, when approaching the code base knowledge singularity, developers become much more likely to introduce performance or behavior regressions, to waste time exploring futile, previously trodden paths, or to violate implicit or hidden cross-component assumptions.

So how did we get into this situation in the first place? Why did we end up with two poor choices: the pessimist treating code as legacy, or the optimist refactoring and potentially introducing bugs? Many different factors determine how quickly code bases approach the singularity, but I want to call out two particularly important ones. First, communication style. In organizations with predominantly synchronous communication (telltale signs: “let’s schedule a VTC to figure this out”, “can we discuss this in person next week when I’m in London anyway?”), decisions are primarily made in verbal, face-to-face communication. Unless such organizations have developed a strong culture of note-taking and dissemination, decisions and, more importantly, the rationale behind them are ephemeral and quickly lost.

In contrast, organizations with asynchronous communication styles (check out Basecamp or GitLab, for instance) typically use collaborative, text-based communication tools (e.g., Quip or wikis) that automatically produce written artifacts of discussion and decision-making. But watch out: don’t get tricked into relying on synchronous communication tools like Slack to capture decisions for asynchronous consumption! Even with collaborative documentation tools like Quip or Google Docs, knowledge organization (e.g., structuring, tagging, search, archival, and access control) remains a major challenge.

Second, the software architecture (as a mirror of the organizational structure, per Conway’s law) directly impacts the half-life of architectural decisions. For example, microservice architectures, in both code and organization, often exhibit strong intra-service social bonds, but also strong per-service information siloing. Even if technical decisions within a given service are documented, organizations may lack an obvious forum for discussing and capturing cross-service, system-wide architectural decisions. Organization leaders who strive to promote organization-wide knowledge sharing will find themselves in a constant uphill battle against entropy and laziness.

Programming vs software engineering

Programming (or “coding”) is first-order thinking and software engineering is second-order thinking. To understand the distinction, consider Google’s insight that “software engineering is programming over time”: programming is solving the problem at hand, whereas software engineering is solving the problem in such a way that the solution will stand the test of time and can be adapted and reused by other people and systems in the future. The person who wrote our “what the heck is this function doing?” piece of code was potentially a brilliant programmer, but quite likely not a stellar software engineer.

In contrast to programming, software engineering requires much more than writing down the correct algorithm: choosing a future-proof programming language, writing tests to document edge case behavior and prevent regressions, writing API documentation, giving variables and methods intuitive and consistent names, documenting nuances of the design or implementation, or even sharing knowledge and insights with others through documentation, tech talks, and education programs. These behaviors can help avoid “what the heck is this function doing?” moments; they can help postpone or even avoid the knowledge singularity.
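To make these habits concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the function `retry_delays`, the constant `MAX_DELAY_SECONDS`, and the rationale recorded in the docstring are assumptions, not code from any real system. It shows a descriptive name, a comment that records why a nuance exists, and tests that document edge-case behavior.

```python
import unittest

# Hypothetical value, invented for this illustration.
MAX_DELAY_SECONDS = 60.0


def retry_delays(base_seconds: float, attempts: int) -> list[float]:
    """Return the wait before each retry attempt, doubling every time.

    Why the cap exists: in this made-up scenario callers give up after
    roughly a minute, so waiting longer than MAX_DELAY_SECONDS is wasted.
    Writing the reason down spares the next reader from guessing whether
    the cap is safe to remove.
    """
    if base_seconds <= 0:
        raise ValueError("base_seconds must be positive")
    return [min(base_seconds * 2 ** i, MAX_DELAY_SECONDS) for i in range(attempts)]


class RetryDelaysTest(unittest.TestCase):
    """Tests double as documentation of the edge cases."""

    def test_delays_double_until_capped(self):
        self.assertEqual(retry_delays(10.0, 4), [10.0, 20.0, 40.0, 60.0])

    def test_zero_attempts_yields_no_delays(self):
        self.assertEqual(retry_delays(1.0, 0), [])

    def test_non_positive_base_is_rejected(self):
        with self.assertRaises(ValueError):
            retry_delays(0, 3)
```

Saved as a module, the tests can be run with `python -m unittest`. The point is not the algorithm but the surrounding context: a reader who later wonders about the cap finds the reason next to the code and a test that fails if the behavior changes.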

Coding tools for second-order thinkers

  • Every code or configuration change at Palantir requires code review. A first-order consequence is catching bugs; an (arguably more important) second-order consequence is sharing knowledge about the code base.
  • We use Quip for Request for Comments (RFC) documents in which engineers propose technical problems and solutions. Their peers can join the discussion, comment, critique, ask questions, etc. Wherever possible given legal and compliance constraints, team Quip folders have open-by-default access policies.
  • Many teams produce Architecture Decision Records (ADRs) to document important, long-term architectural choices: which database technology or storage layout does this service use, why are WebSockets preferred in this service over REST-style RPC, how is access control designed for user-facing resources, etc. It’s practical to store ADRs alongside the code in the source code management system, since this helps with both discoverability and archeology. Side note: A big challenge for many teams is the transition from the discussion phase (RFC) to the documentation phase (ADR). Too often, after a fruitful discussion on the RFC document, teams struggle to consolidate the discussion, update the written artifacts, and persist the insights from the RFC as an ADR.
  • We have regular internal Tech Talks in which engineers give whiteboard-based presentations of the technical underbelly of their software components. Tech Talks are recorded and serve as a great source of information for new hires or new team members.
  • Our teams hold post-mortem retrospectives akin to five-whys in order to understand high-impact bugs or system outages and document the learnings for the future.
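As an illustration of storing decision records alongside the code, the following Python sketch models a record and renders it as a text file. It assumes a commonly used section layout (status, context, decision, consequences); the post does not say which template Palantir teams use. The class `DecisionRecord`, the file naming scheme, and the record's contents are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRecord:
    """One architecture decision, kept in the repository next to the code."""

    number: int
    title: str
    status: str
    context: str
    decision: str
    consequences: str

    def filename(self) -> str:
        # Numbered, slugified names keep records sorted and easy to find.
        slug = "-".join(self.title.lower().split())
        return f"{self.number:04d}-{slug}.md"

    def render(self) -> str:
        sections = [
            ("Status", self.status),
            ("Context", self.context),
            ("Decision", self.decision),
            ("Consequences", self.consequences),
        ]
        lines = [f"# {self.number}. {self.title}", ""]
        for heading, body in sections:
            lines += [f"## {heading}", "", body, ""]
        return "\n".join(lines)


# Hypothetical record; the rationale below is invented for illustration.
record = DecisionRecord(
    number=1,
    title="Prefer WebSockets over REST",
    status="Accepted",
    context="Clients need server-initiated updates.",
    decision="Use WebSockets for this service.",
    consequences="Needs connection lifecycle handling.",
)
print(record.filename())
print(record.render())
```

Because the rendered file lives in version control, its history shows when a decision was made and when it was superseded, which is the kind of archeology the list above refers to.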

Conclusion

Author

Palantir Blog
