Understanding Antifragility and Black Swans, and How They Impact Your R&D Organization

Published in AppsFlyer Engineering · 8 min read · Oct 24, 2022

Building a highly functioning software development team is not an easy task. It requires patience and persistence with hiring the right people, maintaining quality of delivery, and promoting your desired culture.

As an Engineering Manager, you must ask yourself: “How can I make sure my team will continuously grow and improve? How can we become 10 times better a year from now, given the volatile ecosystem we operate in?”

These are questions I’ve been asking myself since building my first team nearly three years ago. I knew I wanted to create a framework that would promise this kind of continuous growth and an ever-evolving mindset.

After a production incident we experienced two years ago (a major cascading failure that took us over two hours to mitigate), my manager at the time introduced me to the concept of “antifragility”: the idea that we can focus on growing from such incidents.

It took me some time to process, and I went on to read Nassim Taleb’s book Antifragile. And then it hit me: the key to a truly successful team lies in its ability to grow from shocks.

In this series of posts I’ll discuss what it means to build an antifragile software team and what it takes to do so. Building such teams will result in higher confidence, faster delivery, more resilient systems, and a continuous learning mindset.

I’ll share lessons learned from my experience, and discuss how to avoid potential pitfalls so that you’ll be able to leverage your next crisis for team growth.

I’ll start by discussing the basic concepts related to antifragility, and explain how they relate to any R&D ecosystem.

Understanding Black Swans

Our world is full of unexpected events, many of which have a huge impact on our daily lives. Such events are referred to as Black Swans (BSWs).

The magnitude of a BSW can be measured by multiplying its probability by its cost (P*C). This is essentially the expected cost (or gain) of such an event.

BSWs can happen anywhere and at any time, and predicting them can be very difficult. That’s why it’s often better to focus on reducing their cost and probability.

Let’s analyze a simple BSW example. Say your two-year-old bike gets stolen. On the surface, the probability of such an event seems quite easy to calculate.

Probability

The number of bikes stolen per year divided by the total number of bikes.

That being said, we should consider the following:

This calculation does not take into account different factors and variables that impact the actual probability, such as bicycle brand, age, color, where they are stored, whether they are locked, which type of lock is used, and so on.
Even if we do take all of these into account, the fact is that people are often not good at understanding probabilities. Say, for example, that we considered all of these factors, ran a super-sophisticated calculation, and reached a number (for example, a 1.9% chance that your bike will get stolen). How is this knowledge helpful to you? How would you act upon it? Would you act differently if the number were 3.1%?

Cost

Analyzing the cost is more straightforward, but still isn’t crystal clear.

Say that a new bike like yours costs $500.

What is the cost of a two-year-old bike?

Let’s assume that its market value is $300. In that case, what would it cost you if it were stolen? $300 or $500?

Well, in this scenario you could buy a new bike for $500, look for a used bike for $300, start commuting using public transportation, and so on.

Conclusion

As you can see, calculating the exact cost of such a BSW is not so simple, and it’s extremely difficult for us to predict the magnitude of such an event. Neither the probability nor the cost is easily calculated, and even once calculated, they are hard to fully grasp and to act upon.

In this case, what can you do?

You can first focus on reducing the magnitude, instead of predicting it. You can reduce the probability by never leaving your bike unlocked, using a stronger lock, or locking it only in populated areas. Or you could reduce your cost by purchasing theft insurance.
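To make the two levers concrete, here is a minimal Python sketch of the magnitude calculation (P*C) for the bike example. The 1.9% probability and the $500 and $300 replacement costs are the illustrative figures used above. The assumption that a stronger lock halves the theft probability is hypothetical and chosen only to show the mechanics.

```python
def magnitude(probability: float, cost: float) -> float:
    """Expected cost of an event: probability multiplied by cost (P*C)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * cost


# Illustrative figures from the bike example.
THEFT_PROBABILITY = 0.019  # 1.9% chance of theft per year
NEW_BIKE_COST = 500.0      # replacing the stolen bike with a new one
USED_BIKE_COST = 300.0     # replacing it with a used one

# Hypothetical assumption: a stronger lock halves the theft probability.
STRONGER_LOCK_FACTOR = 0.5

baseline = magnitude(THEFT_PROBABILITY, NEW_BIKE_COST)
lower_probability = magnitude(THEFT_PROBABILITY * STRONGER_LOCK_FACTOR, NEW_BIKE_COST)
lower_cost = magnitude(THEFT_PROBABILITY, USED_BIKE_COST)

print(f"baseline:          ${baseline:.2f}")
print(f"lower probability: ${lower_probability:.2f}")
print(f"lower cost:        ${lower_cost:.2f}")
```

Either lever lowers the magnitude without requiring a more precise prediction of the probability itself.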

It was a cascading failure we experienced two years ago that sparked my interest in antifragility concepts within R&D organizations

How This All Ties in to R&D

Engineering organizations are basically a lake filled with BSWs.

A few examples include:

Production incidents of all kinds.
Employees leaving your team.
Employees joining your team.
Urgent demands raised by clients.
Cross-team conflicts.
Internal team conflicts.

If you’ve been a part of an engineering organization for more than a few months, these examples should be very familiar to you.

After analyzing the bike theft example, let’s try to analyze a BSW event of a key employee leaving your team, in order to better understand its magnitude.

We are looking for (P*C), meaning the probability of this employee leaving multiplied by the cost of their attrition.

Probability

The probability of an employee’s attrition can be estimated by looking at industry standards: how many months does an employee usually stay at a company, given the company’s size, vertical, employee salaries, and so on?

Again, say there is a 14% chance that your key employee will leave your team this year. How would you act upon this number?

Cost

Calculating the cost of attrition in this case is tricky, but possible.

You want to measure how much time (i.e., how many salaries) it will take your team to fully recover from this attrition. This includes, for instance, training a new developer to replace your former employee, recovering undocumented knowledge (when possible), covering for their areas of expertise, and so on.

That said, there are additional costs that are not certain but should be taken into account, such as the inability to handle urgent requests due to a lack of knowledge/skills, handling production incidents, and so on. All of these impact the actual cost of attrition.

Conclusion

As we can see here, it’s often quite difficult to quantify and predict the actual magnitude of employee attrition — so our best chance is to focus on reducing it.

The employee’s general sense of happiness, the impact they’re making, and the growth they’re experiencing (speed and direction) are among the factors that impact the probability of that employee’s attrition. So if we focus on improving these metrics, we can reduce the probability of attrition as well.

Additionally, if we focus on reducing undocumented knowledge and improving skills and expertise distribution across the team, we can reduce the cost of attrition.
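As a rough illustration, the Python sketch below applies the same P*C calculation to attrition, with cost expressed in monthly salaries. The 14% probability is the example figure used above. The `AttritionCost` class, its field names, and every cost value are hypothetical and stand in for whatever estimates fit your own team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttritionCost:
    """Recovery cost in monthly salaries (all values are hypothetical)."""
    training_replacement: float
    recovering_undocumented_knowledge: float
    covering_expertise_gaps: float

    def total(self) -> float:
        return (
            self.training_replacement
            + self.recovering_undocumented_knowledge
            + self.covering_expertise_gaps
        )


def attrition_magnitude(probability: float, cost: AttritionCost) -> float:
    """Expected cost of attrition (P*C), in monthly salaries."""
    return probability * cost.total()


P_LEAVE = 0.14  # the 14% example figure

# Hypothetical estimate for the team as it is today.
today = AttritionCost(
    training_replacement=3.0,
    recovering_undocumented_knowledge=2.0,
    covering_expertise_gaps=1.0,
)

# Hypothetical estimate after documenting knowledge and spreading expertise.
after_knowledge_sharing = AttritionCost(
    training_replacement=3.0,
    recovering_undocumented_knowledge=0.5,
    covering_expertise_gaps=0.5,
)

print(f"today:                   {attrition_magnitude(P_LEAVE, today):.2f} salaries")
print(f"after knowledge sharing: {attrition_magnitude(P_LEAVE, after_knowledge_sharing):.2f} salaries")
```

The uncertain costs mentioned earlier, such as urgent requests the team cannot handle, are deliberately left out; adding them would raise both totals.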

This is a very basic analysis of a possible BSW in an R&D organization. We need to keep in mind that such events are extremely common, and that we won’t always be able to reduce their magnitude. We also can’t prepare for every BSW that comes our way.

This is why we must always form a culture where BSWs are something to gain from.

“Engineering organizations are basically a lake filled with Black Swans.” (credit: Dall-E)

The Triad, and Why Antifragility Is So Interesting

Now that we understand the concept of BSWs, let’s talk about how different entities respond to such events.

The triad consists of three types of entities: fragile, robust, and antifragile. They differ in their long-term response to disorder, shocks, randomness, stressors, and errors (i.e., BSWs).

Fragile entities are harmed by BSWs.
Robust entities are indifferent to them.
Antifragile entities gain value from them. In other words, unexpected events result in positive long-term consequences for antifragile entities.
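The triad can be expressed as a tiny Python sketch. It assumes, purely for illustration, that an entity’s long-term “value” can be captured as a single number measured before a shock and again after recovery; the `Response` enum, the `classify` function, and the `tolerance` parameter are hypothetical names, not part of any established framework.

```python
from enum import Enum


class Response(Enum):
    FRAGILE = "fragile"
    ROBUST = "robust"
    ANTIFRAGILE = "antifragile"


def classify(value_before: float, value_after: float, tolerance: float = 0.0) -> Response:
    """Classify an entity by its long-term change in value after a shock.

    Changes within +/- tolerance count as "indifferent" (robust).
    """
    change = value_after - value_before
    if change < -tolerance:
        return Response.FRAGILE      # harmed by the shock
    if change > tolerance:
        return Response.ANTIFRAGILE  # gained from the shock
    return Response.ROBUST           # indifferent to the shock


print(classify(100.0, 80.0).value)
print(classify(100.0, 100.0).value)
print(classify(100.0, 120.0).value)
```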

The most basic example of an antifragile system is the human muscle system. Putting stress on our muscles (e.g., by lifting weights) initially results in muscle micro-tears. After a recovery phase, the muscles grow larger and become capable of lifting heavier weights.

Our goal should be to develop teams with an antifragile culture and mentality: teams that keep growing and improving as a result of shocks and errors (i.e., BSWs).

Antifragile entities benefit from shocks in the long term (credit: Guy Grinapell)

What Can Be Gained from Black Swans

As software professionals of all ranks and titles, your teams often face such events. But did it ever occur to you that it’s possible to gain from them? Did you put in the time and effort to achieve that? The beauty of an antifragile culture is that it doesn’t fear BSWs; it embraces them.

In fact, after practicing antifragile culture and concepts, I can say it out loud: I’m a huge fan of BSWs!

My reaction to a BSW within my team is: “Great, how can we learn from it?” And so should yours be!

Let’s go back to the example of a key employee leaving your team. Is there really a way to gain from it?

The answer is yes.

While leading my previous team, one of my senior engineers approached me about his aspiration to take on specific challenges in a different team within the organization. He basically requested my approval for his move.

It shocked me at first, but it was very clear to me that if he really wished to leave the team, he would. I could either support him and end it nicely and respectfully, or oppose him and end it unpleasantly.

I knew this would have a major impact on the team, but I also understood three important things:

It would send a message of trust to the rest of the team. They would know that the organization will support them when they want to make their next move. Surprisingly, this kind of message increases retention, as employees feel less pressure to rush into opportunities heading their way.
It would create more space for other developers to grow into, so their impact and growth could increase. As mentioned, this can also increase employee retention in turn.
It would most likely improve the knowledge distribution within the team, making the team more resilient to future attrition. This results in faster recovery and a reduced cost of attrition.

A similar analysis can be done for any given BSW. The key is to create a culture that promotes growth and learning from any shock and disorder, regardless of whether you anticipated it or not.

In Conclusion

BSWs are nearly unpredictable events with a significant impact on the individual or the organization. Any R&D organization experiences BSWs and should focus on reducing their probability and cost.

Since BSWs can basically be anything and can happen anywhere, it’s not enough to focus on how you respond to each event. You should instead focus on building a culture that gains from such events, regardless of their specific nature.

In the rest of this blog post series, I will discuss how to actually build antifragile software teams and organizations, and share my experience in doing so.