Not Learning the Lingo: How a Bad Performance Review Gave Birth to a New Start Up

Part One of Two

I have had the following feedback with respect to Jesse:
It seems like Jesse is just cruising, and doesn’t take the job seriously. Jesse makes the same mistakes over and over in branches. Initial reviews of Jesse’s work is painful because it appears that he hasn’t thought the problem through.

I’m a software engineer working on Juju for Canonical (the company behind Ubuntu). Above is the first 360 performance review I received, in March 2015. I had no idea my colleagues felt this way about me and my work. That night I did not sleep. I seriously considered packing in software engineering — something I have devoted 15 years of my life to. I gave myself 48 hours to decide. In those 48 hours I realised something.

This is not a “woe is me” story, nor is it an excuse or a grudge story. It is a story about taking a devastating personal review, driving to the heart of the problem and returning with an insight and solution applicable to the whole software industry.


Scalable Engineering

… as an industry we mostly don’t know how to do it and consequently massively under-invest in making our engineering orgs actually effective.

Peter Seibel, Twitter’s Engineering Effectiveness Lead — http://www.gigamonkeys.com/flowers/

Peter makes the point that, as engineering teams, we are good at scaling up everything except ourselves. Juju is no exception. This article takes Juju as a case study to illustrate how the traditional model of engineering is fundamentally broken and unscalable.

Juju orchestrates the deployment of services in the cloud. The Juju engineers are some of the best in the industry and Juju itself is leagues ahead of any other cloud orchestration software on the market. But while the product targets scalable deployments of complex services to the cloud, the project itself is not scalable. The Juju core team is facing the same impediments to effective, scalable engineering that challenge our whole industry.


The Artefact Map

The engineer’s goal is to have both high velocity and quality: to be quick and correct. But the two are in constant tension. Engineers try to ensure the quality of their work by following the specification and referring to the style guide, developer documents, in-line comments, existing implementations and the mailing list. From these we build up a map of the product: contextual, tacit “know-how” enabling the right decision to be made in the right context.

But this map of the product is built from artefacts. Project documents are artefacts. Source code files are artefacts. No sooner are they written, than they are out of sync and out of date. Maintaining these artefacts slows velocity, but neglecting the artefacts decreases quality.


Map vs Territory

1) The map never quite matches the territory. 2) If the map somehow did include all relevant features of the territory, it would be just as hard to hold it in your head.

William Reade, Juju’s Lead Architect — https://github.com/juju/juju/wiki/Managing-complexity

What the product should be — the map — is, almost by definition, out of touch with what the product is — the territory of its code. The majority of a lead engineer’s time is spent managing this tension.

Let’s bring this insight into context. On the 29th of May William wrote the following email to the juju-dev list, with the subject “Writing workers”:

I’ve noticed that there’s a lot of confusion over how to write a useful worker. Here follow some guidelines … If you’re passing a *state.State into your worker, you are almost certainly doing it wrong. The layers go worker->apiserver->state …

The guidelines were so comprehensive that this email became a wiki entry. A follow-up email was sent on the 8th of Sept, with the subject “workers using *state.State”:

People keep writing them, in defiance of sane layering and explicit instructions, for the most embarrassingly trivial tasks (statushistorypruner? dblogpruner? txnpruner? *all* of those can and should pass through a simple api facade, not just dance off to play with the direct-db-access fairies.)
There is no justification for *any* of those things to see a *state.State, and I’m going to start treating new workers that violate layering this way as deliberate sabotage attempts.

9th Sept:

`doc/architectural-overview.txt` — more than a year old — does state that all workers, whatever agent they’re in, should use the API server;

and finally:

I am genuinely surprised/saddened that our institutional knowledge didn’t catch and correct any of these before I threw a tantrum in person.
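The layering rule William keeps restating — worker->apiserver->state — can be sketched in a few lines of Go. This is a hypothetical illustration of the principle, not actual Juju code; the type and method names (Facade, Pruner, PruneStatusHistory) are invented for the example:

```go
package main

import "fmt"

// Facade is the narrow API surface a worker is allowed to depend on.
// It hides the database entirely: the worker never sees *state.State.
type Facade interface {
	PruneStatusHistory(maxEntries int) error
}

// Pruner is a worker written against the facade, not the state layer.
type Pruner struct {
	api Facade
}

// Run performs the worker's task. The call travels
// worker -> apiserver -> state; the worker only ever talks to the facade.
func (p *Pruner) Run() error {
	return p.api.PruneStatusHistory(100)
}

// fakeFacade stands in for the apiserver side in this sketch,
// recording what the worker asked for.
type fakeFacade struct{ pruned int }

func (f *fakeFacade) PruneStatusHistory(max int) error {
	f.pruned = max
	return nil
}

func main() {
	f := &fakeFacade{}
	w := &Pruner{api: f}
	if err := w.Run(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("pruned up to", f.pruned, "entries via the facade")
}
```

Because the worker depends only on a small interface, it cannot reach around the apiserver to the database, and it can be tested with a fake facade, as above.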

The engineer’s map is not matching the architect’s. The docs and source code artefacts are not enough to keep them aligned. We need a process to align them. Traditionally, that process has been the code review.


(Unscalable) Code Reviews

Reviews both treat the symptom (catch bugs) and aim to address the underlying cause (improving the engineer’s map of the product). The engineer’s map of the product, built from the “somewhat correct” code and docs, is here updated and corrected by other engineers. Reviews are an investment in the tacit knowledge of the engineer.

But this knowledge, in the engineer’s head, is also an artefact. It is already going out of date and out of sync with all other artefacts. In addition, there is only so much time in a day and there is tension between disseminating, applying and updating that knowledge. This is not scalable.

Over a two-month period (18th Aug to 15th Oct), Juju engineers waited a total of 8683 hours to receive a review:

http://lingo.reviews/d3/juju_review_wait_times.html

This bottleneck is not unique to Juju. Kubernetes, Docker and even Golang fail to process open pull requests (PRs) in a timely manner.

On average, a Juju engineer will be waiting on reviews for a total of 17 hours 19 minutes per PR. That’s an average wait of 7 hours and 14 minutes per identified issue.

There are two common counter-arguments at this point:

  1. It is dishonest to use the average, as the median is much lower.

It’s true the median is lower. The full graph shows a long tail of many trivial PRs that wait only a few minutes for a review. These, though, are not what undermines a project. All PRs are co-dependent, directly or implicitly, by sharing the same code-base. A significant chunk of PRs wait many hours (sometimes hundreds) for a review. Undermining the quality and velocity of one PR undermines the quality and velocity of all PRs.
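The gap between the two statistics is easy to see on a synthetic long-tailed sample. The numbers below are illustrative only, not the actual Juju wait-time data:

```go
package main

import (
	"fmt"
	"sort"
)

// mean returns the arithmetic average of xs.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// median returns the middle value of xs (average of the two
// middle values when the count is even).
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...) // sort a copy, not the input
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

func main() {
	// Illustrative review wait times in hours: many quick
	// reviews plus a long tail of very slow ones.
	waits := []float64{0.1, 0.2, 0.3, 0.5, 1, 2, 4, 40, 120, 300}
	fmt.Printf("median: %.1fh  mean: %.1fh\n", median(waits), mean(waits))
}
```

On this sample the median is 1.5 hours while the mean is 46.8 hours: the handful of slow reviews dominates the average, which is exactly the long tail the graph shows.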

  2. It is dishonest to call it wait time, as the engineer can work on other PRs while waiting for a review.

I argue that the engineer is working on other PRs because they are waiting. Nothing good comes from having more PRs open than you need, for longer than you need. Every unmerged branch falls further out of sync with the master branch; the longer it stays unmerged, the more conflicts there will be to resolve when it finally lands.

The question is, how do we:

  • safeguard the quality of the code
  • share knowledge and
  • protect time to apply that knowledge

in a manner that can be scaled to hundreds of contributors?


The Insight

  1. Process over artefact

The backbone of a project should not be the artefacts, but the process that generates them.

  2. Heuristic over tacit

Tacit knowledge is one of these artefacts. We need heuristic processes by which the engineer can gain “know-how” when and where it is needed.

Heuristic simply means “enabling a person to discover or learn something for themselves” (https://www.google.co.nz/webhp?q=heuristic). The following depicts two flows of knowledge between a core team and external contributors.

The left depicts an investment in building up the tacit knowledge of the core engineers. This is the traditional model, where a new engineer is not expected to be productive for three to six months while they “on-ramp”: grokking the artefacts, learning the lingo and building a mental map of the product and team best practices.

The group self-selects and invests more in those who on-ramp quicker, display the desired tacit knowledge and need less correction. You end up with an ever-tightening group of people “in the know”, speaking the lingo of the project, and an ever-widening gap between them and all other contributors.

The right depicts another way: an investment in heuristic tooling. That is, tools which allow the contributor, external or core, to learn what they need, when and where they need it. Particular artefacts, be they code, documentation or even product-specific knowledge, are seen as transient and short-lived. As the saying goes, the only constant is change. If a project is to have a backbone, it is to be found in that constant: a well-defined, scalable process to keep the map and territory aligned during constant flux. With that, I give you my core insight:

If we invest in heuristics over tacit knowledge, we enable contributions from people external to that knowledge. If we invest in process over artefact we enable both quality and velocity to scale.

Part two introduces a heuristic tool which embodies these principles — a tool which has gained the attention of several enterprises and investors in the past month.


Disclaimer

All opinions here are my own. All quotes and data have been taken from the public domain.