Crisis leadership comprises the actions a leader can take before, during, and after a crisis to reduce the duration and impact of these extremely difficult situations. While many companies and the people running them have a crisis plan in place, they may never have tested it, or the plan may be inadequate. Tough times like these really put one’s character to the test. This article reflects on how Matthew Prince, the CEO of Cloudflare, dealt with a recent outage at Cloudflare that had a significant impact on the internet.
“Faced with crisis, the man of character falls back on himself. He imposes his own stamp of action, takes responsibility for it, makes it his own.” — Charles de Gaulle
As outlined in Cloudflare’s blog post, the team responded to backbone congestion in Atlanta by adding configuration intended to let traffic hit other regions. Instead, a misconfiguration diverted traffic to Atlanta. (The backbone is a series of private lines between Cloudflare’s data centres that provides faster and more reliable paths between them.) The increase in traffic to Atlanta resulted in CPU overload (red region in the picture below), which meant the affected data centres weren’t getting any traffic (white regions in the picture below), hence the outage in multiple regions around the globe.
The incident affected 12 of Cloudflare’s data centres, impacting sites like Tumblr, Discord, League of Legends, and Shopify. Interestingly, this is neither the first time an outage at Cloudflare has had such an impact, nor the first time Prince has responded to the situation like a pro.
Let’s take a look at some key leadership takeaways from this incident. The lessons can be summarized in the acronym CART. Curious? Allow me to elaborate.
Within about an hour, Prince tweeted updates about the Cloudflare outage, how service was being restored, and what really caused it. Importantly, he assured users that the outage was not caused by an attack on the system. A couple of hours later, he tweeted another summary of what had happened, along with a link to the blog post that outlined the entire incident in greater detail.
What did he do right? Prince knows the importance of concise and timely communication. He was proactive in communicating that the team was aware of the outage and its impact, and was taking the necessary steps to investigate and restore services. He provided information upfront rather than keeping people waiting and fuelling unnecessary speculation. When he didn’t have all the details in hand, he made it clear that a blog post outlining the complete incident would follow soon, setting the right expectations amongst Cloudflare’s stakeholders.
Mistakes are also an opportunity. An opportunity to rethink creative ways to solve problems. An opportunity to reprioritize improvements and changes that were on the backlog. And most importantly, an opportunity to see what we can do better tomorrow.
Prince did just that. He announced that his team had applied safeguards to ensure that an incident like this doesn’t occur again. The linked blog post also describes a further change, scheduled for deployment on Monday, July 20, that adds another layer of protection against changes to their backbone.
A quote from Cloudflare’s blog post on this outage, dated July 17, 2020:
We are making the following changes:
Introduce a maximum-prefix limit on our backbone BGP sessions — this would have shut down the backbone in Atlanta, but our network is built to function properly without a backbone. This change will be deployed on Monday, July 20.
Change the BGP local-preference for local server routes. This change will prevent a single location from attracting other locations’ traffic in a similar manner. This change has been deployed following the incident.
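To make the two changes quoted above more concrete, here is a minimal toy sketch in Python. It is not Cloudflare’s configuration or code: the `Route` class, the `best_route` and `accept_session` helpers, the `"local"` and `"backbone"` labels, and all numeric values are assumptions made up for illustration. It only models two general BGP ideas: the route with the highest local-preference is preferred (real BGP has further tie-breakers that are ignored here), and a maximum-prefix limit stops a session from accepting more prefixes than expected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str        # destination network, e.g. "203.0.113.0/24"
    learned_from: str  # hypothetical label: "local" or "backbone"
    local_pref: int    # in BGP, a higher local-preference is preferred


def best_route(candidates):
    """Pick the candidate with the highest local-preference.

    Simplified: real BGP best-path selection applies more tie-breakers.
    """
    return max(candidates, key=lambda route: route.local_pref)


def accept_session(received_prefixes, max_prefixes):
    """Model a maximum-prefix limit.

    Returns False when the peer sends more prefixes than the limit,
    which in this toy model means the session would be shut down.
    """
    return len(received_prefixes) <= max_prefixes


# Illustrative values only (assumptions, not taken from the incident).
local = Route("203.0.113.0/24", "local", local_pref=200)
leaked = Route("203.0.113.0/24", "backbone", local_pref=100)

# With local server routes preferred, a leaked backbone route does not win.
chosen = best_route([local, leaked])

# A session that suddenly receives three prefixes against a limit of two
# is rejected in this model.
session_ok = accept_session(["a", "b", "c"], max_prefixes=2)
```

In this sketch, `chosen` is the local route and `session_ok` is `False`. The point is only that either safeguard on its own would keep one location from attracting traffic it should not receive.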
Responding to the tweet with the blog post, a user wrote “I’d hate to be the one who typed that. Yikes!” Given the impact this outage had, I can imagine it wouldn’t feel good to know that you were the one who made the typo. However, mistakes happen. That’s just how it is. No one sets out to make a mistake; all you can do is put your best foot forward.
“We make mistakes all the time — but we make different mistakes all the time, which I think is a sign of a healthy organization.” — Matthew Prince, DCD, July 02, 2019.
Addressing this comment, Prince shifted the conversation, highlighting that the root cause was the absence of checks to prevent this from happening. The problem, as Prince succinctly put it, was one of leadership, not engineering.
“Leadership is a potent combination of strategy and character. But if you must be without one, be without strategy.” — General H. Norman Schwarzkopf
Although accepting responsibility when things go south is leadership 101, it’s easier on paper than in practice, especially in tough times like these. But by doing so, Prince won the trust of employees and other stakeholders.
In the updates, Prince didn’t shy away from admitting in writing that the outage was caused by a mistaken configuration the team was applying to a router during a routine update. While this may sound silly, it is important to remember that many companies still hesitate to disclose the technical details and cause of an outage for fear of financial implications or sheer embarrassment.
For Prince personally, it would also have been far easier to wait for the official blog post explaining the issue rather than put himself out there in a digital world where nothing is forgotten. Instead, he chose to do the opposite, earning a wave of positive reactions online from people who admired how well he and the team at Cloudflare dealt with the outage. Higher-ambition leaders realise the power of transparency for their organisation, and Prince proved this in practice.
Therefore, it is commendable that:
- He didn’t forget to articulate the root cause of the incident amidst the panic. I can only imagine what it must have been like at Cloudflare that day!
- He laid out the facts plainly: he avoided business jargon, didn’t sugarcoat the impact of the outage, and didn’t try to hide what really caused the incident.
- He followed it up with an action plan, assuring users that safeguards were being added to prevent this from happening again.
Well, there you go! Practical crisis leadership lessons from Cloudflare’s CEO and co-founder Matthew Prince. Should you find yourself in a difficult situation in the future, simply remember CART — Communication, Action, Responsibility, and Transparency.
That’s it. Thanks for reading!