When failing is success

Published in Mercadona Tech · Jul 14, 2020

As engineers, we are often told that our software must never fail. That sounds reasonable, but it is not always true. In this post I will explain why we, at Mercadona Tech, believe that failing is something that must happen.

Imagine a big forest that has never had a fire. Then the day comes: a small fire will likely get out of control, growing bigger and bigger and taking a lot of time and effort to extinguish. The forest itself doesn’t have the mechanisms to mitigate that fire. Most importantly, we, the people who have to control it, won’t understand the problem because we have never faced it. It will take time and further fires to understand the nature of the problem, and in the end we will realize that it comes down to the fuel the forest itself holds (straw, branches, etc.), which could be removed by simply cleaning it up every year.

This analogy applies easily to us. Our system could be resilient for months, but once an incident happened we would not be prepared. It would be something new for us, most likely very big, taking a lot of time to get the system back online and having a high impact on our end users, who would be unable to do their groceries. Not only that, a problem could have gone unnoticed simply because a component worked for a long time and didn’t require any change. We wouldn’t understand the impact and the nature of the problem since we had never faced it. If, instead, incidents happen more often, each one is more likely to have a smaller impact and we are more likely to be prepared, simply because we have learnt from previous ones.

So, are you suggesting that my system has to keep failing forever?

Of course we want to have as few incidents as possible, but they will come sooner or later. The only way to avoid them would be to never change a single line of code, and even then we could have downtime; for instance, our databases or servers could stop working under a spike in traffic.

The goal here is to have as little impact as possible on our end users, learn from the incidents and understand their nature. Having incidents is not that important as long as users don’t notice them, and even if they do, the key is to recover as fast as possible and affect as small a percentage of users as possible. Of course we have red lines that we shouldn’t cross, such as having the same incident twice. Just remember that the objective in this case is a low mean time to restore.
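To make "mean time to restore" concrete, here is a minimal sketch of how it could be computed from a list of incidents. The function name, the record layout and the three incidents are hypothetical and only illustrate the arithmetic; they are not our real tooling or data.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average time between detecting an incident and restoring service.

    `incidents` is a list of (detected_at, restored_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute an average")
    durations = [restored - detected for detected, restored in incidents]
    # sum() needs a timedelta start value to add timedeltas together.
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents, invented purely for illustration.
incidents = [
    (datetime(2020, 1, 1, 10, 0), datetime(2020, 1, 1, 10, 12)),   # 12 min
    (datetime(2020, 1, 2, 15, 30), datetime(2020, 1, 2, 15, 33)),  # 3 min
    (datetime(2020, 1, 3, 9, 0), datetime(2020, 1, 3, 9, 45)),     # 45 min
]

mttr = mean_time_to_restore(incidents)
print(mttr)  # 0:20:00
```

In this made-up sample the one long incident dominates the average, which is why shortening the worst recoveries matters more than shaving seconds off the quick ones.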

How do we achieve this?

There are many tools and techniques that, although not infallible, help us prevent incidents, such as pairing, code reviews, tests, etc. Here we focus on the strategic scope rather than the tactical one: building a strong team culture and vision. Our vision is to become a reference development team while delivering a reliable solution. If we don’t accept the following points we won’t be able to achieve that vision.

First of all, experiment as much as possible. Encourage your colleagues to get out of their comfort zone and not to be scared of making mistakes; clipping their wings won’t let them fulfil their potential. You won’t learn if you keep doing only what you already know works best. Experiment, fail and you will learn a lot from it. Having failures allows you to understand and control both sides.

https://management30.com/practice/celebration-grids/

Second, accept that failing is ok and that blaming is off the table. We are humans, we fail and there is nothing we can do about it. It makes no sense to point fingers at someone when the whole team is responsible, because we are all involved. In the end, if a user cannot finish their order or it doesn’t arrive on time, they will blame Mercadona, not the specific person who may have made a mistake. We are seen as a team, so the success or failure of our platform depends on the team. Focus on resolving the incident together rather than pointing at someone or leaving them alone; the goal is to resolve it as fast as possible. It is much more important to focus on what happened and how to avoid it in the future. Always remember that even the most experienced developer will make mistakes. This atmosphere allows your colleagues to be more transparent, to help each other and to build a unified group, since next time it might be you.

Third, make sure that the culture stays within the company. Atlético Madrid is a perfect example of this. They might not have the top stars, but when they play together their performance increases drastically, simply because they work as a single unit. That culture doesn’t go away when someone leaves for another team; it stays within the team and new players learn from it. It is the same in our case: we will have junior people joining and very senior developers leaving, but what matters is the team as a whole, not individual players. We want a group of people aligned and working together rather than superstars trying to do everything by themselves.

Finally, focus on iteration. Try to have as many feedback loops as possible. The goal is to deploy as often as possible with small changes. This gives you information sooner and lets you understand your system deeply and learn from it, so you know its overall status better, and if something fails you can revert it fast enough to avoid any impact on your users.

Conclusions

Failing and accepting it is part of the process of becoming a better engineer, and it can be applied to many other areas, as the previous examples show. Remember that everyone fails sooner or later, but as Alfred Pennyworth said: “Why do we fall sir? So that we can learn to pick ourselves up”. It is in that exact moment that you want a team backing you up, and that is where a strong team culture comes in handy. We truly believe in this methodology and it is working for us: so far in this COVID-19 year we have had 108 incidents while maintaining 99.997% uptime and deploying around 80 releases per week to production, with thousands of users consuming our services. This is not a silver bullet. It takes time to build such a culture and we still have some more serious incidents, but as long as we have quick feedback we will iterate and improve.
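To give a feeling for what 99.997% uptime means in practice, here is a rough back-of-the-envelope sketch. It assumes that "so far this year" runs from 1 January 2020 to the publication date of 14 July 2020 and that uptime is measured against wall-clock time; both are assumptions made for illustration, not a description of how the figure was actually measured.

```python
from datetime import date


def downtime_budget_minutes(uptime_percent: float, days: int) -> float:
    """Minutes of downtime implied by an uptime percentage over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


# Assumption: the measurement window is 1 Jan 2020 to 14 Jul 2020.
days_elapsed = (date(2020, 7, 14) - date(2020, 1, 1)).days  # 195 days
budget = downtime_budget_minutes(99.997, days_elapsed)

print(f"{days_elapsed} days -> about {budget:.1f} minutes of downtime")
```

Under those assumptions the figure corresponds to roughly 8.4 minutes of total downtime in more than six months, even with 108 incidents in the same period, which suggests that most incidents were either short or never took the service down at all.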

If you like what you just read and want to be part of it, don’t hesitate to check out our open positions. We would like to keep growing our team with people like you!

Written by Antonio Escudero Huedo