In-Depth: Is SAFe® Really That Bad?
A sobering investigation of scientific evidence on scaling frameworks, particularly SAFe®.
How do you feel about the Scaled Agile Framework (SAFe®)? I don’t know about you, but I’m skeptical of SAFe® and I’m very skeptical of scaling in general.
I’m not alone in this. My sense is that the professional community doesn’t like SAFe®. It is easy to find scathing opinion pieces and memes about SAFe® on social media. While there is absolutely nothing wrong with sharing personal experiences, those experiences are often treated as generalizations that apply to all implementations. This leaves me unsatisfied. In those cases, authors rarely offer evidence to support their dislike for SAFe®. In most cases, they don’t even qualify their level of experience with SAFe®.
In 2021, our good friend Stefan Wolpers from Age Of Product conducted a survey among Agile practitioners to calculate a Net Promoter Score (NPS) for SAFe®. With a sample of 505 participants, he arrived at a remarkably low NPS of -56. But how reliable was that? It is very likely that people who were already critical of SAFe® participated more eagerly, leading to a self-selection bias. Also, the measure didn’t ask participants to rank SAFe® against alternatives. So while -56 is a low result, we don’t know how other approaches would’ve fared. Finally, scientists have strongly criticized the reliability of the NPS as a measure (Fisher & Kordupleski, 2019). So I suspect that these results are more reflective of the public opinion of SAFe® among Agile practitioners than of how effective it is in reality.
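For readers unfamiliar with the metric: an NPS is derived from a single 0-10 "how likely are you to recommend this?" question. A minimal sketch of the calculation, using a hypothetical, made-up distribution of ratings (not Wolpers’ actual data):

```python
def net_promoter_score(ratings):
    """Compute the Net Promoter Score from 0-10 recommendation ratings.

    Promoters rate 9-10, detractors rate 0-6; passives (7-8) count only
    toward the total. NPS = %promoters - %detractors, on a -100..100 scale.
    """
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return round(100 * (promoters - detractors) / len(ratings))

# Hypothetical sample of 505 ratings, skewed toward detractors:
sample = [2] * 300 + [8] * 125 + [10] * 80
print(net_promoter_score(sample))  # -44: detractors far outnumber promoters
```

A score of -56 thus means that detractors outnumbered promoters by 56 percentage points, which is indeed strikingly low.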
As I’ve written before, I think we should base our beliefs about Agile more on evidence, and this is one area where I think we’ve been dropping the ball for a while now. Fortunately, we can answer this question to a large extent with data. So I decided to fire up SPSS and load the most recent dataset from our Scrum Team Survey. I also read all the relevant scientific studies I could find via Google Scholar. Let’s see what the evidence has to say!
This post is part of our “in-depth” series. Each post discusses scientific research that is relevant to our work with Scrum and Agile teams. We hope to contribute to more evidence-based conversations in our community and a stronger reliance on robust research over personal opinions. As you may imagine, it takes a lot of time to compile such evidence-based posts. If you think our content is valuable, and you think we should write more of it, you can support us on Patreon.
Preface: About Our Data
In addition to the other scientific studies we cover in this post, we also run analyses on data from our own Scrum Team Survey. We created this tool for teams to diagnose and improve their process in an evidence-based way. You can use the free version for individual teams. For this post, we used a sample of data from 1,963 Scrum teams and 5,273 team members. Our data consists of responses to questionnaires that are periodically taken by teams, their stakeholders, and supporters. This data was taken at various moments, so some teams may have just started with their framework of choice whereas others will have years of experience with it already.
For team members, the questionnaire consists of 80+ Likert questions that measure the quality of processes that happen in teams (e.g. shared learning, collaboration with stakeholders, value focus). For stakeholders, the questionnaire consists of 12 Likert questions that assess their satisfaction with the outcomes of teams on four dimensions (responsiveness & collaboration, value, quality, and satisfaction). We used Exploratory Factor Analysis (EFA) and Structural Equation Modelling (SEM) to select and develop reliable psychometric scales out of the questions. Our scales for team members are published in a peer-reviewed paper (in press), and we will publish the stakeholder scales in the near future as well. An overview of how we develop reliable questionnaires is described in detail here.
A strength of this dataset is that it represents real teams from across the globe that use the tool to identify real improvements. We performed a scientific study with Prof. Daniel Russo from Aalborg University to identify five core factors that explain a large part of what makes Scrum teams effective.
The argument against SAFe®
I won’t describe SAFe® in this post. But you can go to this site to learn all about it. The criticisms of SAFe® in the professional community can be summarized as: “The Scaled Agile Framework only makes things seem Agile, without actually improving the Agility of the adopting organization”. So despite its promises of business agility, it just doesn’t work.
At the same time, VersionOne reported it as the most popular scaling framework in 2021 (37%). If it doesn’t work, why do so many organizations flock to it? This is indeed surprising, and probably what irks so many in our professional community. Many reasons for this discrepancy have been offered. Some authors blame the cynical motives of management (“SAFe® allows them to keep everything the same, but just label it with Agile terms”), or their gullibility towards an easy solution (“SAFe® is an off-the-shelf solution”). Others blame the aggressive marketing of SAFe®, or its commercial model to sell as many certification classes as possible.
While all of these reasons probably hold some truth, I doubt it's the whole story. Because if the core argument were true — that SAFe® doesn’t work at all — I wouldn’t expect this level of (continued) adoption of SAFe®. Fortunately for us, the core argument against SAFe® is a hypothesis we can test with data.
“Fortunately for us, the core argument against SAFe® is a hypothesis we can test with data.”
Finding #1: Teams that use SAFe® score similarly on key indicators of team effectiveness
A first useful test would be to see if teams that use SAFe® perform differently from teams that use other approaches. If we assume that SAFe® is less effective, we expect to see evidence of this in the processes that comprise the five core factors of effective Scrum teams as compared to other frameworks. For example, we would expect lower team autonomy, lower responsiveness, lower concern for stakeholders, and fewer opportunities for continuous improvement. We ask teams to specify which framework they use. The most prevalent ones are SAFe®, LeSS, Scrum of Scrums (a practice recommended in earlier iterations of the Scrum Guide), homegrown, and something else. We also added teams that don’t use scaling.
So we performed a simple statistical test called ANOVA to see if such differences exist. The results of this are shown below:
For each core factor, except team autonomy, the results are indeed significantly different (p < .001). But this is hardly surprising: the sample is so large (N = 1,963 teams, 5,273 team members) that even minor differences become significant. We do indeed see that the absolute differences are small. For each factor, the scores between approaches differ on average by no more than 0.2 (on a scale from 1.0 to 7.0). We can also tell from the standard deviations (the bars) that teams vary widely in their results, but this range is quite uniform across the scaling approaches. This means that for each scaling approach, there are teams doing much better than the average and much worse. One pattern that jumps out is that teams that only use a “Scrum of Scrums” to scale score a bit better on all factors.
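To illustrate the point about sample size: a one-way ANOVA on simulated data shows how even a trivial difference in means can come out as “statistically significant” when the groups are large. This sketch uses made-up scores, not our dataset, and assumes scipy is available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated factor scores (1-7 scale) for three scaling approaches whose
# true means differ by only 0.2 -- roughly the differences we observed.
safe = np.clip(rng.normal(5.0, 1.2, 600), 1, 7)
less = np.clip(rng.normal(5.2, 1.2, 600), 1, 7)
sos = np.clip(rng.normal(5.2, 1.2, 600), 1, 7)

# With ~600 teams per group, a 0.2-point difference will often yield
# p < .05, even though it is practically negligible on a 7-point scale.
f_stat, p_value = stats.f_oneway(safe, less, sos)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

This is why we report absolute differences and effect sizes alongside p-values.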
We also performed a secondary analysis to get a sense of the effect size of the choice of scaling approach. We performed linear regression with team effectiveness as the outcome and the scaling approach as the predictor. Although both the model and the predictor were significant, the standardized beta of the scaling approach on team effectiveness (with everything else kept equal) was .057 on a scale from -1 to 1. The choice of scaling framework alone explained about 6% of the variance in the effectiveness of Scrum teams. Such effects are generally classified as “very small” or “small”.
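As a sketch of what such a “very small” effect looks like in regression terms, here is a simulation (again on made-up data, not our own) that dummy-codes the scaling approach as a categorical predictor and reads off the share of variance it explains (R²):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000

# Simulated team effectiveness (1-7) for teams using three approaches.
# The choice of approach shifts the mean only slightly; most of the
# variation comes from everything else that differs between teams.
approach = rng.integers(0, 3, n)  # 0=SAFe, 1=LeSS, 2=Scrum of Scrums
y = 5.0 + np.array([0.0, 0.1, 0.2])[approach] + rng.normal(0, 1.0, n)

# Dummy-code the categorical predictor and fit ordinary least squares.
X = np.column_stack([np.ones(n), approach == 1, approach == 2]).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r_squared = 1 - ((y - X @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(f"R^2 = {r_squared:.3f}")  # only a small fraction of the variance
```

An R² this small means the framework label tells you very little about how effective any individual team will be.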
So, the quality of the core factors that make Scrum teams effective is not meaningfully different between the scaling approaches. Teams that use LeSS can apparently be just as responsive and focused on stakeholder needs as teams that use SAFe®. Their autonomy can be just as high as teams that use other approaches to scaling, or no scaling at all. The wide range of scores (the bars) strongly suggests that other factors are substantially more important than which scaling framework is chosen.
“The quality of the core factors that make Scrum teams effective is not meaningfully different between the Scaling approaches.”
A caveat is that these results only reflect the quality of the processes as perceived by the teams themselves. It is possible that teams that operate in a highly scaled environment feel that they are very responsive, but that actual releases happen so infrequently due to slow release trains that it doesn’t matter. That wouldn’t be Agile. Nor does it satisfy the purpose of the Scrum framework, which is to deliver more value to stakeholders sooner. Ultimately, the stakeholders are in the best position to evaluate the differences. Fortunately, we also have access to data that allows us to investigate this.
Finding #2: Stakeholders seem to be similarly satisfied with the outcomes of SAFe®
The Scrum Team Survey also allows teams to invite their stakeholders to evaluate the outcomes on four dimensions: quality, responsiveness, release frequency, and team value. We always encourage teams to invite their actual stakeholders (e.g. users and customers), and not just internal stakeholders who are only professionally involved (e.g. marketing, management).
We ask stakeholders a set of 12 questions to assess their satisfaction with the outcomes, loading on four distinct dimensions. This offers a much more fine-grained analysis than just asking “How happy are you with the results?” and takes out sources of bias.
We were able to analyze the results from 857 unique stakeholders who evaluated the outcomes of 241 Scrum teams. 49% of these represented users, 28% represented customers and 22% represented internal stakeholders. We performed another ANOVA to test if stakeholders evaluate the outcomes differently for different frameworks (SAFe®, LeSS, Scrum of Scrums, Other, and None). The results of this are shown below:
The differences were not statistically significant between the scaling approaches. Note that we only had 3 teams in our sample that used LeSS. Just as with the core factors, we can also tell that any difference that exists is small at best. The absolute difference between the scaling approaches is on average 0.2, which is very modest on a scale from 1 to 7 and considering the standard deviations. We can see that SAFe seems to score a bit lower than the other approaches for each factor, but this is not significant. The range of scores is also quite broad for all scaling approaches for all factors, as we can tell from the standard deviations. It is also interesting to note that, again, “Scrum of Scrums” seems to result in slightly better results for each factor, though not significantly so.
We also performed a secondary analysis to get a sense of the effect size of the choice of scaling approach. We performed linear regression with the four dimensions of stakeholder satisfaction as the outcome and the scaling approach as the predictor. None of the results were significant, which means that the scaling approach does not statistically predict stakeholder satisfaction.
Taken together, this means that stakeholder satisfaction does not differ significantly between the scaling approaches used. Stakeholders can be just as satisfied with the outcomes that are produced through SAFe® as they are through a custom approach or no scaling at all. The wide range of scores (the bars) strongly suggests that other factors are substantially more important than which scaling framework is chosen.
“Stakeholder satisfaction does not differ significantly between the scaling approaches used.”
Of course, there are limitations to these conclusions. First of all, the number of teams that used SAFe and LeSS was rather low in our sample (about 100 stakeholders representing 28 teams). It is possible that a larger sample may lead to stronger and/or more significant results. However, it is unlikely that these differences will be more substantial. Another limitation is that there may be a selection bias in which stakeholders are invited. We can’t rule out that teams invite stakeholders only when they feel they are doing well enough. But this would require that we assume that this self-selection bias is stronger for SAFe® than for other methods, which is questionable.
Finding #3: Scientific investigations of scaling approaches see benefits and challenges to SAFe®
The findings we discussed up to this point are based on data from the Scrum Team Survey. Although this already offers a useful empirical perspective on how the various scaling approaches compare, we shouldn’t rely on any single source. So I went to Google Scholar and searched for review articles that contained the keywords “scaling agile comparison”. I found and read 9 papers by academic authors. Below, I summarize the insights that are most relevant to this post.
Almeida & Espinheira (2021) compared six Large-Scale Agile frameworks on 15 dimensions, including customer involvement, learning ability, and time-to-market. Their comparison included SAFe®, LeSS, Nexus, Scrum at Scale, DAD, and the Spotify model. They concluded that “the findings did not reveal a dominant framework that is better for all dimensions”. Instead, the authors argue that organizations do well to seek a framework that fits the mindset already present, and not to impose one that is very different (and based on the data, this may not be a bad strategy). So SAFe® or LeSS may work well in one organization, but not at all in the other.
A larger review was performed for the same scaling approaches by Edison, Wang & Conboy (2021). They collected 191 scientific studies across 134 organizations that investigated one or more of these approaches between 2003 and 2019. The aim of this study was not to compare methodologies, but rather to identify patterns in their adoption. However, the findings echo those of Almeida & Espinheira (2021) in that what works is highly dependent on context. A framework that works well in one setting may not work at all in another. More importantly, the authors identify 31 challenges, grouped into 9 areas that are typical when scaling Agile, among them inter-team coordination, customer collaboration, architecture, organizational structure, method adoption, change management, team design, and project management. Based on the 191 studies they reviewed, they conclude that none of these challenges are unique to specific large-scale methodologies. In fact, opting for a homegrown approach may lead to slightly more challenges according to these authors.
Riedel (2021) investigated how teams interact in different scaling frameworks (Nexus, LeSS, and SAFe®) for her master thesis. Through a literature review and interviews with 9 experts, she concluded that inter-team coordination is a crucial success factor when scaling Agile. This is also recognized by other academic authors (e.g. Dingsøyr & Moe, 2013). She found that of the three frameworks, Nexus provides a better foundation for high-quality, organic inter-team coordination, whereas SAFe® tends to discourage it through procedures and impersonal interactions.
Putta, Paasivaara & Lassenius (2018) investigated the challenges and benefits of SAFe®, specifically. They reviewed 6 academic studies and 47 unique case studies to identify patterns in benefits and challenges. A number of benefits were reported across studies and cases: 1) focus on continuous improvement, 2) increased alignment, 3) enhanced collaboration, 4) improved management of dependencies, 5) more empowered teams, and 6) improved employee satisfaction. However, the authors also identified a number of challenges across studies and cases, particularly 1) change resistance, 2) mindset, 3) low autonomy, and 4) difficulties with PI planning. A limitation of this study for the purpose of this post is that it only investigated SAFe®. However, this study provides a more nuanced view of the benefits and challenges of SAFe®.
Discussion: A different view of frameworks
So based on the data and research I had access to, there don’t seem to be substantial differences between key indicators of teams and the satisfaction of their stakeholders. The scientific studies we explored by other authors also support this pattern, although inter-team collaboration may be a particular challenge. In any case, the preference for simplicity that many in the Agile community have, and the resulting dislike for the very complex SAFe®, are not reflected in the results.
“The preference for simplicity that many in the Agile community have, and the resulting dislike for the very complex SAFe®, are not reflected in the results.”
Now, why is that? How is it possible that different frameworks — ranging from highly prescriptive to very minimalistic — don’t seem to create meaningful differences in what happens in teams and what is delivered to stakeholders? This seems to fly in the face of the assumption that simplicity trumps complexity. My interpretation of this is that frameworks don’t matter nearly as much as we assume they do and that other factors are far more important in shaping success or failure.
Because what is a framework, really? It's just a set of prescribed roles, events, artifacts, and some principles. It's basically a blueprint that is divorced from the reality of a real organization, real teams, and real people. It's a platonic ideal. But anyone who has applied frameworks knows that they are often loosely practiced and tend to differ from one organization to another. What a scaled effort with SAFe® (or Scrum!) looks like in one organization may be entirely different from another, except maybe for references to framework terminology. So I believe there is a loose correlation between what frameworks prescribe and what actually happens on the ground.
How parts of a framework are practiced also varies loosely from organization to organization. What PI planning looks like in one organization, as well as what it produces, is probably quite different from the next. I would argue this is a good thing. Because frameworks are only idealized blueprints, they always need to be adapted to the uniqueness of an organization, the needs of employees, and internal politics. A consequence of this looseness is that there is a lot of space for other factors to determine how successful a framework will truly be. In fact, our data suggest that these factors are more important than the one we did include: the framework of choice.
I feel like there is a deeper truth buried beneath that insight. Potentially, it may show us that structure (events, roles, artifacts, etc) doesn’t matter all that much to how work is done in teams. The quality of their internal processes is much more important. To what extent do teams actually engage with their stakeholders? To what extent do they actually work to release as often as they can? How do they reflect upon, and improve their process? How do they leverage their skills and autonomy? And how are they supported by management? All of these processes can happen — and at high quality — regardless of the scaling approach used.
This is a freeing thought. And one that I’m sure will be recognized by many veterans in our industry who have seen many teams and frameworks in practice. It allows us to focus our energy more on ensuring the quality of these processes rather than the adherence to some idealized framework. Or expending a lot of energy fighting particular frameworks we don’t like.
“This is a freeing thought. […] It allows us to focus our energy more on ensuring the quality of these processes rather than the adherence to some idealized framework. Or expending a lot of energy fighting particular frameworks we don’t like.”
Potential limitations and alternative explanations
However, there are other ways to explain the lack of differences. The measures could be incomplete because they represent only one “moment in time”. Perhaps stakeholders in SAFe® are happy because they have no sense of how much better things would be with other approaches. To test this hypothesis, we would need data from teams and stakeholders who have experienced one framework and then another. However, such longitudinal data is hard to collect in volumes that are large enough to allow statistical generalization. Few teams experience such switches, and when they occur, they are usually accompanied by larger organizational changes that need to be controlled if we want to isolate the switch in the framework as the primary cause of changes. It is also difficult to control for beneficial learnings that were already incurred from the first framework.
A counterpoint to this alternative explanation is that it still assumes that frameworks make a big difference, but that we simply failed to detect it. While certainly possible, we also did not see any substantial differences in autonomy, continuous improvement, responsiveness, and stakeholder concern across teams that use different approaches. These core factors explain much of how effective Scrum/Agile teams are in practice. If they don’t vary meaningfully, there is no strong theoretical reason to assume the outcomes will. Also, our measure for stakeholder satisfaction was much broader than just “are you happy?”. We did not observe a meaningful difference in how satisfied stakeholders are with the value delivered, the responsiveness of teams, the quality of what is delivered, and the level of collaboration with teams. So taken together, I feel that the explanation I propose here — that the choice of framework has little influence overall — is more parsimonious. Nonetheless, my explanation can be falsified if empirical evidence from a substantial sample shows that teams and stakeholders are more productive and satisfied when they switch from SAFe® to another framework.
Another reviewer noted that we should also compare organizational outcomes (e.g. financial results, market share). I agree that this would create a more complete picture, and would welcome such research. At the same time, I feel it is only fair to then also ask for such a “burden of proof” from all Agile methodologies. Finally, the data from the Scrum Team Survey is mostly focused on the team level.
Another limitation is that we did not control for the degree to which organizations adhere to their scaling framework. It is possible that the results would’ve been different for SAFe® (in either direction) if we had analyzed only those organizations that adhere to the framework fully. In a very cynical interpretation, one could argue that SAFe® didn’t perform as badly because many implementations (fortunately?) don’t go all the way and thus don’t hit the perceived bottom. However, it is very hard to properly measure adherence. How should we define “full adherence” vis-à-vis “low adherence”? Would the creators of the various frameworks agree with such definitions? For example, while critics of SAFe® argue against it because of its rigidity and prescriptiveness, proponents argue that only the principles of SAFe® are required while everything else is optional. So while it may be possible to measure and control for “adherence to the framework”, it won’t resolve actual debates: both sides can resort to a “No True Scotsman” argument, claim that the definition of “adherence” is either too rigid or too loose, and simply discard the evidence. Without a crystal-clear definition that all sides agree on, there seems to be little value in controlling for adherence.
Regardless of these limitations, I feel it is a safer bet to trust the data we have than the data we hope to find at some point in the future. It is possible that closer and more thorough examinations will show that SAFe® is much less effective than other scaling frameworks. But the data we have at this point doesn’t suggest it. So in the meantime, I think we do well to adjust our beliefs accordingly.
Implications for practitioners
So what does all this mean in practice?
- Focus less on the type of framework that is chosen to scale work across teams, and more on the health and quality of the processes that happen in teams and are key indicators of their effectiveness (e.g. collaboration with customers, high release frequency, autonomy, and continuous improvement).
- Periodically reflect with all involved teams to identify how inter-team collaboration can be improved to further improve the aforementioned processes.
- Regardless of what approach is used to scale work, management support and the quality of inter-team collaboration are always important. Measure it closely and involve everyone to improve it.
- You can use the Scrum Team Survey to diagnose your Scrum or Agile team for free. We also give you tons of evidence-based feedback. The DIY Workshop: Diagnose Your Scrum Teams With The Scrum Team Survey is a great starting point.
- Our Do-It-Yourself Workshops are a great way to start improving without the need for external facilitators. The DIY Workshop: Identify What To Start, Stop, And Improve In Your Scrum Team With Ecocycle Planning or DIY Workshop: Identify The Metrics That Are Most Relevant To Your Success are great starting points. Find many more here.
- We offer a number of physical kits that are designed to make teams more effective. We have the Scrum Team Starter Kit, the Unleash Scrum In Your Organization Kit, and the Zombie Scrum First Aid Kit. Each comes with creative exercises that we developed in our work with Scrum and Agile teams.
The empirical evidence we have to date does not seem to support the negative view that many in the Agile community seem to have of SAFe®. Would I personally recommend it? No. But the research I’ve done for this post made me recognize that the data and research I have access to paint a more nuanced picture than my beliefs. This is good. When my experience conflicts with objective data, it gives me pause for reflection.
“The empirical evidence we have to date does not support the negative view that many in the Agile community have of SAFe®.”
I believe that Agility is about making things simpler. The complexity I perceive in SAFe® makes me skeptical of its ability to do so. I worry that it is too restrictive and reduces personal freedom because of this. It also doesn’t align with our mission at The Liberators to unleash people, teams, and organizations, nor my ethical considerations.
However, the notion that simplicity leads to better outcomes is a belief. The data I researched for this post suggests that it may not be that black and white. Stakeholders seem to be similarly satisfied with the outcomes that are produced through SAFe® as they are through other (simpler) scaling approaches. Teams don’t score meaningfully differently on the five key indicators that give rise to effective Scrum teams. The scientific studies that compared scaling approaches also didn’t find any approach to be better or worse than others. Instead, all this research suggests that what works is highly contextual. SAFe® may work really well in one organization, and LeSS or a homegrown solution may work better in another.
An intriguing hypothesis I developed from the research is that methodologies and frameworks — structure — may not matter nearly as much to the actual outcomes as we assume. Perhaps what actually matters is the quality of the processes that happen on the ground, and such processes are mostly orthogonal to the chosen framework.
One hope I have for our community is that we spend less time on memes, flame wars, and vitriol towards SAFe® and frameworks in general. There is nothing wrong with making fun of something we don’t like. But we should be very careful with strong generalized statements. That you don’t like SAFe®, or that SAFe® didn’t work in your experiences, does not mean it is inevitably bad and evil. Instead, I hope we can spend more time exploring actual evidence and sharing experiences about what parts of frameworks are useful and under which conditions. A mature professional community is one that celebrates nuance and the accumulation of reliable knowledge. Let's do better!
This post took over 40 hours to research and write. If you think our content is valuable, and you think we should write more of it, you can support us on Patreon. Find more evidence-based posts here.
We thank all the authors of the referenced papers and studies for their work. In particular, we want to thank Annika Riedel for offering us access to her master thesis, in which she compares scaling frameworks. I also thank Sjoerd Nijland and other reviewers for their constructive feedback.
Almeida, F., & Espinheira, E. (2021). Large-scale agile frameworks: A comparative review. Journal of Applied Sciences, Management and Engineering Technology, 2(1), 16–29.
Edison, H., Wang, X., & Conboy, K. (2021). Comparing methods for large-scale agile software development: A systematic literature review. IEEE Transactions on Software Engineering.
Dingsøyr, T., & Moe, N. B. (2013). Research challenges in large-scale agile software development. ACM SIGSOFT Software Engineering Notes, 38(5), 38–39.
Fisher, N. I., & Kordupleski, R. E. (2019). Good and bad market research: A critical review of Net Promoter Score. Applied Stochastic Models in Business and Industry, 35(1), 138–151.
Putta, A., Paasivaara, M., & Lassenius, C. (2018, November). Benefits and challenges of adopting the scaled agile framework (SAFe): preliminary results from a multivocal literature review. In International Conference on Product-Focused Software Process Improvement (pp. 334–351). Springer, Cham.
Riedel, A. (2021). Success Factors and Challenges in the Coordination of Inter-Team Dependencies within Scaling Agile Frameworks — a Qualitative Study of Nexus, LeSS and SAFe (Master thesis). Hochschule für angewandtes Management, Bayern.
Verwijs, C., & Russo, D. (2021). A theory of scrum team effectiveness. arXiv preprint arXiv:2105.12439.