Response to Keunwoo Lee’s review of Accelerate

Jul 8, 2022

Jez Humble and Dr Nicole Forsgren

A software engineer, Keunwoo Lee, has published a critical review of Accelerate: https://keunwoo.com/notes/accelerate-devops/. Lee has concerns about the methods we use and about our results. We think it’s helpful to address these concerns because we are taking a new approach to studying software engineering, and we want people to ask questions, to improve the quality of the criticism they make, and ultimately to apply more widely the behavioral approach we’ve pioneered¹.

Let’s start with the areas where we agree:

I came away thinking that the authors’ high level recommendations are mostly fine… However, I urge you to consider for yourself, from first principles, whether improvement along any given axis is the most valuable use of your limited resources, compared to other things you could be doing. For example, think hard about whether investing in improving deploy frequency is where you should spend your time now, and at what point you would see diminishing returns.

That’s correct. One of the influences on our approach is the theory of constraints, and what that tells you is that effort spent on something that is not a constraint is a waste. If increasing the speed and stability of your software delivery capability is not the constraint on your ability to achieve your organization’s mission, you shouldn’t focus on it. The book is written for people who do have that constraint.

We will say, though, that first, it often is the constraint, because it’s the primary feedback loop by which you can discover whether your product hypotheses are correct. This is, as Lee notes, the main source of failure in software product development. Second, it’s an area where many organizations have plenty of room for improvement: in 2019, only 20% of respondents were deploying multiple times per day, and over half of respondents were deploying less frequently than once a week.

Now let’s address Lee’s four criticisms:

  • The things being measured are inadequate: The development performance metrics are incomplete and sometimes circular.
  • The instruments used to measure them are inadequate: The survey-based methodology is vulnerable to halo effects and other threats to validity.
  • The statistical analysis of the measurements is suspect: The presentation of statistical correlations is sloppy and misleading.
  • The published results are impossible to evaluate or replicate: The authors have released neither the full survey questions nor the data gathered from them; without these, nobody can precisely understand the measurements or the analysis thereof.

The things being measured are inadequate: The development performance metrics are incomplete and sometimes circular.

Lee has two criticisms here. Let’s address the inadequacy question first. We base our theory on the idea that there are two domains in the product development process: the design domain and the delivery domain (these are not phases in the SDLC; both are in play simultaneously, depending on what activity you are performing at any given time). See Table 1.2 on p. 15 of Accelerate.

Lee accuses us of taking a drunkards-and-lampposts approach by focusing on the product delivery domain, but the primary reason we focus on delivery rather than design and development is that delivery should be predictable and low-variability, not that it’s easier to measure (the fact that product design and development is harder to measure derives from the fact that it is highly variable and fuzzy in its starting point). In addition, good research takes an approach where a problem, system, or domain is decomposed and then tested; our research does just that by focusing on one part of the software process (in this case, delivery). This is a feature, not a bug. Others focus on other aspects, and their contributions are also important.

Further, contrary to Lee’s claim, we do in fact discuss product development in Chapter 8, where we validate lean approaches to product development. In particular, we find that fast software delivery performance is one of the drivers of being able to take a lean approach to product development: for example, if you can only deploy once a month, that will be a significant impediment to your ability to use A/B testing comprehensively in your product development process.

Lee also points out that our measure excludes “both pre-commit implementation and testing, and post-deploy observation and analysis.” However, we do find that the following practices contribute to higher software delivery performance:

  • Merging code into trunk/main at least daily and having branches with short lifetimes (less than a day);
  • Comprehensive monitoring and observability tooling which (amongst other things) gives you fast feedback on the overall health of systems and key business and systems metrics.

Thus the factors Lee identifies will only make a significant difference to lead time if you are doing it wrong: lots of people are, and of course this is controversial². But again, going back to first principles: going from starting to write code to checking in, and from releasing to getting feedback from production, should both be fast, low-variability processes, and therefore they belong in the delivery domain. Our research shows that when they are fast and low variability, that drives better performance.

Lee also accuses us of stating findings that are tautological or circular, giving the example of “’deployment automation’ is highly correlated with fast deployments.” We don’t have any findings that mention fast deployments, but we do find that deployment automation is one of the drivers of overall delivery lead time, deployment frequency, and time-to-restore.

It’s important to be careful when throwing around phrases like “tautological” and “circular argument,” especially when you’re accusing people of sloppy rhetoric. Circular reasoning means that the conclusion is logically contained within the premise. But it’s easy to see empirically that “fast deployments” is not logically contained within the concept of “deployment automation”! At the first startup Jez worked at, back in 2000, we deployed manually by FTPing the files we wanted to change from our workstations into production. It took a few seconds. On Jez’s team at Google, the deployment is fully automated, but it takes significantly longer because of the complexity of the deployment process. There are multiple variables that drive even deployment lead time (which, again, we don’t measure), let alone overall delivery lead time³.

You can definitely make the argument that it’s obvious that deployment automation would drive lead time, although we would still disagree. But a large amount of regular science (or “normal science” in Kuhnian terms) is proving or disproving things that people think are obvious, and indeed some of our research shows that things people think are obvious don’t hold, such as our finding that managing work in process on its own has no statistically significant correlation with delivery lead times (you have to combine it with visual management and monitoring).

Finally, to address the comment that the measures are incomplete: we cover survey development in depth in Part II of the book. We’ll briefly address it here by saying that when developing the research instrument, we adapted a widely-used approach from psychometrics wherein complex or nuanced ideas (latent variables) are measured with many data points (manifest variables: in this case, survey questions). To determine how best to capture and convey an idea, we looked at how things are currently done across industry, spoke with peers, and then wrote questions. We then applied a statistical technique called exploratory factor analysis, which takes each item and determines which other items it is best matched with; combined with other statistical tests, this lets us identify latent constructs. Using this approach, we were able to determine how to most accurately define complex ideas (like continuous integration or automated testing) in ways that were predictive of performance improvement. Every practice or outcome we mention has therefore been verified empirically as not being so covariant with another that the two are trivially correlated (collinear).
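
For readers who haven’t met the technique, here is a minimal, self-contained sketch of the general idea behind exploratory factor analysis, using synthetic Likert-style data rather than our survey (the constructs, items, and library choice are purely illustrative):

```python
# Toy illustration of exploratory factor analysis (EFA): simulate responses
# to six survey items, three written to tap one hypothetical latent construct
# and three to tap another, then check that items load on the expected factors.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(42)
n = 500

latent_a = rng.normal(size=n)  # e.g. a "continuous integration" trait
latent_b = rng.normal(size=n)  # e.g. a "monitoring" trait

def item(latent, noise=0.6):
    """Simulate a 1-5 Likert item driven mostly by one latent trait."""
    return np.clip(np.round(3 + latent + rng.normal(scale=noise, size=n)), 1, 5)

# Items 0-2 are driven by trait A, items 3-5 by trait B.
X = np.column_stack([item(latent_a) for _ in range(3)] +
                    [item(latent_b) for _ in range(3)])

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)

# Loadings: rows are items, columns are factors. Items belonging to the same
# construct should load heavily on the same factor; an item that loads on
# neither (or on both) is a candidate for rewording or removal.
print(np.round(fa.components_.T, 2))
```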

Lee ends his section with the claim, “the Accelerate view of high performance amounts to ‘the team can push a button and deploy changes to prod quickly’. If you care about any software quality outcome other than that, then Accelerate does not even claim to measure it.” That’s an extraordinary mischaracterisation — it doesn’t even adequately capture one of the three items in our performance construct, since lead time includes total time from check-in to release, including integration and testing (assuming you run your tests post-commit, which you definitely should, even if you run some of them pre-commit too!).

The instruments used to measure them are inadequate: The survey-based methodology is vulnerable to halo effects and other threats to validity.

Now we’re on to Lee’s criticism of our methods. Lee states that he is not an expert in statistics, and the techniques we use are not commonly covered in basic statistics courses, so: A for effort on his part. In our opinion, though, he has been too ready to move from “I don’t understand what the authors are doing here” to “the authors are wrong”. Furthermore, many of his criticisms are already addressed in Part II of the book. He makes two criticisms:

  • Divergent understanding of survey questions;
  • Halo effects and confounding.

Lee summarizes his first criticism thus: “When respondents answer a question, the signal is always mediated by the respondents’ interpretation of the questions, and there is no guarantee that the readers agree on that interpretation, either with each other or with the investigators.”

It’s important to distinguish between claims against our particular application of psychometric methods and claims against psychometric methods in general. This one falls into the second category. It is a great criticism in general, and one that critical readers should watch for in any survey. (I’m sure that all readers have been sent a very poorly-worded survey!) However, we took care to minimize misunderstanding of survey questions (his specific criticism). That is, we avoided terms of art in our survey questions and instead used manifest variables (individual items) to build any nuanced latent variables (concepts). For example, no question in the survey asked respondents if they did continuous integration: it’s too difficult to compare answers, because so many teams and organizations have different understandings of what continuous integration is! Instead, we asked respondents if their code check-ins triggered a build, triggered a series of automated tests, and if the feedback from their tests was available in a few minutes⁴.

Furthermore, divergent understanding of data is not limited to psychometrics; it’s a problem of measurement in general. It happens all the time that you’re aggregating systems data from multiple sources and the data is named the same thing but not measuring the same thing, or named something different but measuring the same thing (and thus excluded from aggregations and calculations), because a human who implemented the metric for a particular component made a judgment that differed from another human’s or from the specification. These cases happen regularly with system data, and for this reason survey data can actually be much cleaner. There’s no escape from human judgment and subjectivity more generally, however much many software engineers would prefer things were otherwise.
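
To make the manifest-versus-latent distinction concrete, here is a purely hypothetical sketch (the item names, Likert coding, and simple averaging are invented for illustration; in an analysis like ours, item weights are estimated as part of the model rather than assumed equal):

```python
# Hypothetical sketch: instead of one "do you do continuous integration?"
# question, several concrete manifest items are combined into one latent
# construct score. Item names and coding are invented for this example.
import pandas as pd

# Likert responses (1 = strongly disagree ... 5 = strongly agree)
responses = pd.DataFrame({
    "checkin_triggers_build":   [5, 2, 4],
    "checkin_triggers_tests":   [5, 1, 4],
    "test_feedback_in_minutes": [4, 2, 3],
})

# Simplest possible aggregation: an unweighted mean of the items.
# (A latent-variable model estimates item weights/loadings instead of
# assuming they are equal, but the shape of the idea is the same:
# many concrete items feed one latent construct.)
responses["continuous_integration_score"] = responses.mean(axis=1)
print(responses)
```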

Since this is a foundational problem, you might think that it has already been discussed in psychometrics, a field first formally described in 1879 and which forms the basis of the modern fields of behavioral economics, social and cognitive psychology, and information systems (and through which the phenomenon of cognitive bias is studied). You would be right. There are multiple methods we use to make sure our measures are sound, including:

  • Using measures that have already been validated by other studies, such as the Westrum questions, where available;
  • Assembling a panel of subject matter experts and performing a card-sorting exercise, where each question is written on a card and experts are asked to sort them into logical clusters and identify items that don’t fit or are unclear in their meaning, and modifying the questions where problems were found;
  • Using multiple items (questions) per construct to make sure we weren’t relying on the interpretation of a single question to measure the construct as a whole;
  • Validating with exploratory factor analysis to ensure that the items that form a construct are in fact sufficiently collinear that they can be said to be measuring the same thing.

Together, these methods give us a high level of confidence that our measurement is valid. You might quite reasonably have further questions about these methods. Advanced students spend years of their life in graduate school learning about them! Just bear in mind that these are foundational questions about psychometrics (and really measurement more generally), not about our research in particular.

Next, Lee mentions that he has read a book by Phil Rosenzweig called The Halo Effect, which discusses logical flaws people make in research on business performance. Fortunately you don’t have to read the whole book to understand what’s going on here: you can read the article he wrote for McKinsey that summarizes it. Lee cites the Wikipedia article, which defines the halo effect thus: “the tendency for positive impressions of a person, company, brand or product in one area to positively influence one’s opinion or feelings in other areas”, and claims that survey respondents might be subject to this bias, such that if they feel the organization is doing well, they might answer other questions more positively than is really true. For example, if they think they are doing well with lead time and release frequency, that might make them answer that they are checking into trunk more frequently than they actually are. (Lee then cites as an example of the halo effect a completely different phenomenon, a kind of mood-based availability bias that might lead someone to selectively recall good or bad events when answering a question. He also includes a long discussion about Microsoft’s hiring practices later on, which is an example of the halo effect, and which is addressed by the discussion below.)

However the halo effect as described by Rosenzweig is more complex than Lee’s gloss suggests. In his McKinsey article, Rosenzweig says “The fact is that many everyday concepts in business — including leadership, corporate culture, core competencies, and customer orientation — are ambiguous and difficult to define. We often infer perceptions of them from something else, which appears to be more concrete and tangible: namely, financial performance. As a result, many of the things that we commonly believe are contributions to company performance are in fact attributions. In other words, outcomes can be mistaken for inputs.”

This is the exact reason we ask specific, strongly-worded questions that do not contain terms of art that you can like or dislike, rather than ambiguous ones that require significant interpretation (as mentioned above, we don’t, for example, ask if you are “doing continuous integration”⁵). We do this so that you don’t need to make inferences that might be susceptible to the halo effect. Lee writes, “I struggle to imagine how a question could measure, for example, how organizations ‘support a generative culture’ objectively.” But of course we wouldn’t ask such a question, as Lee presumably knows, because in the previous section he quotes the actual statements we use for our Westrum questions, which have been validated in previous peer-reviewed research.

Finally, if such an effect were to occur (which, due to careful survey design, we have worked to exclude), it would bias people’s answers by more or less the same amount. However, that’s not what we observe in the data. There are plenty of responses from organizations that have high software delivery performance but do badly at multiple given practices, and this is reflected in the fact that every year we have rejected hypotheses (as we discuss in the reports).

As Lee correctly writes when he dismisses the possibility that we have considered this, “any divergence between ‘high performing’ and ‘low performing’ businesses will have multiple plausible causes. And it will be extraordinarily hard to disentangle the effect sizes of all these causes.” The advanced statistical method we use to do precisely this (as well as to calculate R-squared values and a number of other statistics we use to evaluate our model) is called Partial Least Squares Structural Equation Modeling (PLS-SEM), and you can find the free software Jez wrote that implements the algorithm in Python at https://github.com/GoogleCloudPlatform/plspm-python.
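
To give a feel for what “disentangling effect sizes” means, here is a deliberately simplified sketch. It is not PLS-SEM (which also estimates latent construct scores from their survey items, iteratively); it only shows how a multivariate model can attribute separate effect sizes to correlated inputs, along with an R-squared, in a way that pairwise correlations cannot. All names and numbers below are made up:

```python
# Simplified illustration of the idea behind a structural model: estimate the
# separate effect of several (hypothetical) construct scores on an outcome at
# once, rather than one pairwise correlation at a time.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 400

# Hypothetical construct scores for n survey respondents.
trunk_based_dev = rng.normal(size=n)
deploy_automation = 0.5 * trunk_based_dev + rng.normal(size=n)  # correlated causes
monitoring = rng.normal(size=n)

# Simulated outcome with known "true" effects of 0.6, 0.3, and 0.2.
delivery_performance = (0.6 * deploy_automation + 0.3 * trunk_based_dev
                        + 0.2 * monitoring + rng.normal(size=n))

X = np.column_stack([deploy_automation, trunk_based_dev, monitoring])
model = LinearRegression().fit(X, delivery_performance)

# The coefficients recover the separate effect sizes even though the
# predictors are correlated with each other; R^2 summarizes how much of the
# variance in the outcome the model explains.
print("path coefficients:", np.round(model.coef_, 2))
print("R^2:", round(model.score(X, delivery_performance), 2))
```

The real model links many such relationships together and works on latent construct scores rather than directly observed columns; see the repository above for the actual implementation.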

The curious case of “inferential predictive analysis”

Lee spends a lot of time being confused by our use of the term “inferential predictive” in the book, which we chose because we wanted to emphasize that the methods we’re using allow you to make predictions. What we call “inferential predictive” is just what Leek (whom we reference) calls “inferential.” We’re sorry this was confusing, but we made this decision because it is a combination of what Leek calls “inferential” and what other fields simply call “predictive” (e.g., Management Information Systems and some conferences and journals in Computer Science). We felt that the distinction between inferential and predictive was an important one (and indeed the distinction is important in some fields), and chose the term we used in the book since the book is designed to be read by non-specialists in the methods we used. We also wanted to honor the fields that simply refer to this kind of analysis as “predictive” and use the terms “predict,” “drives,” or “impacts.” This is what happens when you do research across disciplines and need to decide which discipline’s terminology to use. We include the discussion in Part II of the book, but for those interested in the stats, we wanted to provide that additional information, showing that these statistical relationships go beyond “just finding relationships in the data”: they are also suggested a priori by theories in the existing literature.

Lee also engages in some armchair sciencing around the distinction between correlation and causation: “Anyone with a passing familiarity with statistics will have the obvious objection that it is trivial to find spurious correlations in a high-dimensional data set. The authors explicitly acknowledge this danger, and claim to have avoided it by formulating hypotheses before designing the surveys to test them. This is not as strong as performing a blinded experiment with a pre-registered hypothesis but it’s something. The authors contend that their methodology reduces the scope for mining arbitrary correlations out of the data, and then fitting constructs to them.”

This again is where inferential prediction comes into play: the “prediction” part of the phrase states that a predictive relationship was found in the statistical analysis; the “inferential” part means that even before any statistics were run, theories (from the academic literature on management, lean, and industrial and organizational psychology) suggested that those relationships should be there and that we should look for them. It’s the opposite of “fishing for data,” and it still stops short of claiming causation.

It is also worth noting that the concept of a blinded experiment doesn’t make sense for the methods we are using. As we note in Part II, a double-blind experiment would require that we split the participants (teams and orgs in this case) into a set number of groups, assign each group a condition (say, low/med/high testing, CI, etc.), and then make them continue down that path for a set period of time while we collect data. In the real world, no organization is going to continue executing on a potentially bad strategy and lose money, market share, and customer satisfaction. So we conducted a field experiment: we gathered data in the field, where work is being done and where there are real, actual consequences, and controlled for confounding variables as best we could⁶. While we didn’t publicly preregister our hypotheses, we did state them in written form before analyzing the data, and we performed no analysis without first stating a hypothesis; this is consistent with current research best practices.

Let us emphasize that what we are doing is nothing special: it is the standard model for how you do a scientific study of this type in the field of Management Information Systems. You start with what you want to find out. You present the theory you are using. You discuss how the theory applies to your research. You develop hypotheses. You test your hypotheses by gathering data and doing the analysis. You absolutely don’t gather data and then hunt around for correlations. That is what some folks call “fishing for data,” and it is indeed dangerous, because so much data is highly correlated that you’re bound to find something.

This is what we did every year: come up with theory-based hypotheses (including the models which show how we expect the constructs to relate to each other) and then test them against the data. That’s why our results are inferential rather than just correlations, and why we can use words like “drives”, “predicts”, and “impacts”, but not “causes”, since we’re not performing a randomized, controlled experiment (which would be impossible in software engineering, as it would be in collider-based particle physics experiments). Yes, as Lee notes, we have been careful in our word choice, and we took care to explain it at a level we thought appropriate for a non-statistically-technical audience.

The impossibility of evaluation or replication

It may be worth starting with a note that [exact] scientific replication is done by replicating existing research findings with new data, in similar contexts, with similar controls, often by other researchers or at least other projects⁷. Replication is not simply re-running the same (that is, our) existing data and models. This is why we have included a methodology section in the back of all of our State of DevOps Reports (even if only briefly) and in more detail in Accelerate: to show readers and researchers the analyses and paths we have used, both for evaluation and as initial steps toward replication. In fact, we have also heard from a handful of other researchers who are working on replicating our research model (or portions of it, as is appropriate in incremental research), and we look forward to seeing their work⁸.

To the question of sharing our data: unfortunately, it cannot be shared. It’s important here to point out that it’s common, and even best practice in some cases, to keep research data sets private. Our user consent forms⁹, which were often reviewed by legal/privacy counsel and human subjects protection boards, stated that the data would not be shared. In addition, it’s very hard to get people to take surveys, and one of the ways you reassure people you’re doing proper research with sensitive data is by promising you won’t share their data. Keeping the data private also ensures that people can be honest and forthcoming; if this seems ridiculous to you, then you may never have had to worry about being identified at work. By ensuring the confidentiality and privacy of the dataset, we assured the folks answering the survey that they could answer honestly.

We haven’t shared all the questions, but we’ve shared a lot of them in our freely-available and detailed discussion of all of the constructs described in the book, as well as the constructs we researched after the book was published: https://cloud.google.com/architecture/devops/capabilities. In addition to the papers we’ve already published (Appendix A of this paper includes a number of the survey questions), we have journal papers currently in peer review which provide additional details on the methodology and questions.

Conclusion

We’re happy that Accelerate reached an audience wide enough that someone a) learned some statistics and b) took the time to write an ~11,000-word critique. It is unfortunate that much of this critique rests on misunderstandings of the book and of the behavioral science and related statistical methods that underlie it, and we obviously would prefer that Lee had not so frequently jumped from “I don’t understand how the authors reached this conclusion” to “the authors must have made some fundamental error.”

Another lens through which to view this work is the impact it has had. Many organizations and teams have benefited from applying it. While Accelerate didn’t invent the capabilities we found to promote software delivery performance (that was never the goal), it did help to provide structure, clarity, and sometimes justification for teams and organizations that needed an additional nudge.

However, we do sincerely hope that more people apply behavioral methods to research how to do software engineering better, because we think there is a lot more to discover with these methods.

Thanks to Niall Murphy and Laura Nolan for feedback on a draft of this note.

Jez Humble holds a BA in Physics and Philosophy from Oxford University and an MMus in Ethnomusicology from the University of London. He is the co-author of four bestselling books on software, including Jolt Award winner Continuous Delivery and Shingo Publication Prize winner Accelerate. Jez works as a site reliability engineer for Google Cloud and is a lecturer at the University of California, Berkeley.

Dr. Nicole Forsgren is a Partner at Microsoft Research, where she leads Developer Velocity Lab. She is author of the Shingo Publication Award-winning book Accelerate: The Science of Lean Software and DevOps and is best known as lead investigator on the largest DevOps studies to date. She has been a successful entrepreneur (with an exit to Google), professor, performance engineer, and sysadmin. Her work has been published in several peer-reviewed journals.

Endnotes

  1. Note that we haven’t answered all of the questions and criticisms, just the major ones. The more minor ones will, we think, be answered when the papers we currently have in peer review come out.
  2. The idea that developers should break their work up into small chunks that are checked in regularly to trunk/main rather than using long-lived feature branches is still the most controversial idea we discuss, over two decades after Kent Beck’s Extreme Programming book came out.
  3. These dependent variables also have differing effect sizes which we calculate, as discussed later.
  4. While it is true that understanding each of those things may be slightly different, there is less opportunity for confusion than from asking if teams practice continuous integration.
  5. An earlier version of the questions we actually ask can be found in Appendix A of this paper.
  6. In some fields (like marketing, websites, etc), A/B tests are possible because very small changes and experiments can be done very quickly. We haven’t figured out how to do this for large, complex software delivery pipelines.
  7. We could back this definition up with several citations, but here’s one that includes a brief discussion of exact replication and conceptual replication. We’ll also note that our research projects were replication studies: while we expanded our investigations iteratively, we also repeated some aspects each year, therefore re-testing and replicating our findings. Some years, we reviewed and refined or improved our measures. (For example, our 2019 treatment of clear/heavyweight change process is a refinement and update of the construct from 2014. While the overall idea held, the construct split, offering more nuance.)
  8. We do note there is not as strong an incentive to conduct and publish replication studies in many fields such as MIS, discussed here. A relatively recent article in CACM also points out publication bias, which disincentivizes replication research. Although we agree that replication studies are important, this may explain the lack of replication studies of our work from others and is not unique to our case.
  9. These varied across the several years the research was conducted, but most of them did stipulate the data would be kept private.
