In our systems we trust

Bence Vér
Mercadona Tech
Oct 26, 2021

Here at Mercadona Tech we believe in and promote a blameless, failure-tolerant culture throughout the whole company. But in an environment where failure is part of the journey, how can we maintain the trust of our users and stakeholders in our systems and products? And what about trust between us internally as team members, feeling confident in the quality of each other’s work? I would like to give an example through a story.

Health of code

In the SRE team we are currently working on the foundations of a dynamic hybrid-cloud infrastructure. Part of this effort is setting up an on-premise Kubernetes cluster in our warehouses to host our logistics applications closer to the end users, and to potentially avoid blocking the order preparation process if the warehouse’s connection to the cloud-hosted environment gets cut off. If you are interested, a more detailed explanation is available from the pen of Pedro Díaz:

This means we need to host the databases locally, too. In our cloud environment we chose a managed PostgreSQL solution, so spinning up a self-hosted database cluster was kind of virgin territory for us. Furthermore, we decided to host the DB cluster in Kubernetes. After taking a look at the market, Stolon seemed the best fit as a clustering solution for PostgreSQL: it uses k8s-native objects to manage the DB cluster, and its feature set is sufficient for our needs, since we do not have any fancy requirements.
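To make the “k8s-native objects” part a bit more tangible, here is a minimal sketch of how you can peek at the cluster components through the Kubernetes API. The namespace and label selector below are assumptions for illustration only; adjust them to whatever your own Stolon manifests actually use.

```python
# Minimal sketch: list the Stolon keeper pods (the ones running PostgreSQL)
# via the Kubernetes API. Namespace and label selector are assumptions.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="database",                # assumed namespace
    label_selector="app=stolon-keeper",  # assumed label from our own manifests
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```

Nothing fancy, but it illustrates the point: the database cluster’s moving parts are ordinary Kubernetes objects you can inspect with the tools you already use every day.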

In the SRE team I have the privilege of working with six other very talented engineers, all of us with different experiences and backgrounds, but database management is a field where all of us still have more to learn.

We set up our first on-premise cluster around a year ago as our staging environment, and shortly after, the first production cluster was running with applications in the Valencia warehouse. Since then, we haven’t had any serious incidents involving the on-premise database. We even tested the resiliency of the whole setup by pulling out random cables (quite literally), and we experienced no data loss and minimal downtime. The most serious incident occurred when the autovacuum process decided to take a longer break, causing all kinds of issues.
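If you ever suspect autovacuum is slacking off, PostgreSQL will happily tell you. Here is a small diagnostic sketch using the standard pg_stat_user_tables statistics view; the connection string is a placeholder, not our real setup.

```python
# Ask PostgreSQL when autovacuum last visited each table and how many
# dead tuples have piled up since. The DSN below is a placeholder.
import psycopg2

conn = psycopg2.connect("host=localhost dbname=logistics user=postgres")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT relname, last_autovacuum, n_dead_tup
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10;
        """
    )
    for table, last_autovacuum, dead_tuples in cur.fetchall():
        print(f"{table}: last autovacuum={last_autovacuum}, dead tuples={dead_tuples}")
conn.close()
```

A table with a huge pile of dead tuples and an old (or NULL) last_autovacuum timestamp is usually the first clue that the vacuum “took a longer break”.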

However, in the last couple of months we have been suffering from occasional OOMKill events affecting the master PostgreSQL pod. In these cases, because the pod restarts, the database closes all connections for around a minute before the master gets back into action. We are still investigating this behaviour. The article below talks a bit more about the issue:
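While the investigation is ongoing, clients can at least ride out the restart window instead of failing hard on the first broken connection. The following is only a sketch under assumptions: the DSN, attempt count and delay are illustrative values sized to the roughly one-minute outage we observe, not our production configuration.

```python
# Hedged client-side sketch: retry the connection for a bit longer than the
# observed ~1 minute failover window instead of failing on the first error.
import time
import psycopg2

def connect_with_retry(dsn, attempts=15, delay_seconds=5):
    """Try to connect, waiting out a master restart of roughly a minute."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return psycopg2.connect(dsn, connect_timeout=5)
        except psycopg2.OperationalError as error:
            last_error = error
            print(f"attempt {attempt} failed: {error}; retrying in {delay_seconds}s")
            time.sleep(delay_seconds)
    raise last_error

conn = connect_with_retry("host=stolon-proxy dbname=logistics user=app")  # placeholder DSN
```

It does not fix the OOMKill itself, of course, but it keeps a one-minute blip from cascading into application-level failures.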

Health of relation

In our internal and even external communication we tend to use the terms Stolon and PostgreSQL interchangeably, myself included. They are, however, two separate components: one is responsible for the PostgreSQL DB clustering, and the other is, well, the database application itself. This imprecise use of terminology seemed like an absolute non-issue until recently.

Between the incident(s) mentioned above and the fact that we are still in the learning phase of DB management, a very slight frustration started to build up in our team, and we could even hear concerned voices from other parts of the office.

The resolution, however, should be quite simple: we need to be more precise about the issues we have. Using terminology and names interchangeably can have undesired, and in some cases serious, side effects. It may even impact incident management, e.g. checking the logs of component X while the real issue is present in component Y. As time is key when resolving an incident, this is a luxury you cannot afford.

Another example of this phenomenon is when, due to the built-up frustration, your team decides to look for an alternative to substitute component X, in the hope of getting rid of the issue and the presumed source of frustration. But maybe the issue is rooted in component Y, and you burn effort and resources implementing a change which in the end provides little to no value.

Let’s talk about trust.

Every IT component is subject to constant change; we can take this as a given. With each change, there is a possibility of introducing errors and breaking things. With each new error or incident, there is an associated level of frustration and discontent in the user, and quite likely in the maintainer as well.

Trust of others

Watching the news and seeing how Facebook’s reputation suffers blow after blow almost every week made me realize that one of the most valuable, and often most overlooked, assets you can have is the trust of your users.

You may build the most awesome systems in the world, but if you have lost the trust of your users, people will start looking for ways to circumvent or replace your solutions, oftentimes creating security holes or causing even more issues in the process. It may even have a serious impact on the whole business.

The follow-up message from Facebook’s engineers is a great example of how such an incident should be communicated to third parties. It clearly explains the root cause, and the technical details are quite easy to understand even for the not-so-tech-savvy.

But even with the above communication, people started to spread all kinds of conspiracy theories about the presumed reasons and alleged dark intentions.

Trust in your own work

At the end of all of this, if others are sceptical about the product you provide, how can you maintain a positive attitude while working on it? If you don’t trust your own work, you may start to over-engineer, adding extra complexity in the process, which only makes the product more error-prone. This potentially burns your error budget faster, and without remediation, your relationship with your users suffers further. Besides, now you have to struggle with prioritizing between bug fixes and new features.

As we can see, it’s pretty easy to get into a never-ending loop of losing trust and potentially degrading product quality.

We suffered this kind of loss of trust in our own work, and we did not even need a major issue to cause it. As mentioned above, we are moving workloads from the cloud to on-premise Kubernetes clusters. During the early stages we had incredible momentum, until we started to question things. Things that were not working great, but were well within the acceptable margin. This sudden wave of self-doubt was enough to break the initial momentum and slowed the project down significantly, until we realized the problem and acted on it.

A word of advice

Now we know how a simple naming mismatch can lead to a wide variety of issues, even serious impacts on the business. The earlier you spot this kind of communication issue in your team or at your company, the better.

All of this may sound pretty bad, but fear not, I have some advice that might help.

  • Be precise in your everyday communication, both internally between team members and externally towards third parties. Just as you don’t want to be sloppy with your code, be direct with your words, too, and try to minimize the guesswork. If you leave things open to interpretation, you are inviting trouble.
  • Be open about the issues. If you are honest about the daily struggles you and your team have, you can expect the same from others, and may even find some help or kind words from places where you otherwise would not expect them.
  • Use the data you have. Data is power. It is a really powerful weapon in arguments, since it is hard to argue against data-backed statements. Also, if you have any kind of doubt, check whether it is supported by data. Do you feel that the database is not performing as expected? Get those metrics talking; that’s why you collect them (see the sketch after this list).
  • When you are dealing with an issue that affects others, spend some time and effort explaining to them why it happened and how you resolved it, even if it’s just a quick triage. Your users will be more tolerant of future issues, and you may even pique their interest in helping you out with a better solution.
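As a concrete example of “getting the metrics talking”, here is a minimal sketch that queries Prometheus over its HTTP API instead of arguing from gut feeling. The Prometheus address and the metric name are assumptions; substitute whatever your exporter actually exposes.

```python
# Query Prometheus for a PostgreSQL throughput metric and print it per database.
# URL and metric name are assumptions for illustration.
import requests

PROMETHEUS_URL = "http://prometheus.monitoring:9090"   # assumed address
QUERY = "rate(pg_stat_database_xact_commit[5m])"        # assumed postgres_exporter metric

response = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
response.raise_for_status()

for result in response.json()["data"]["result"]:
    labels = result["metric"]
    timestamp, value = result["value"]
    print(f"{labels.get('datname', 'unknown')}: {value} commits/s")
```

A couple of numbers like these settle a “the database feels slow” discussion much faster than opinions do.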

Final thoughts

As engineers, we often forget that communication with our peers is just as important as any other part of a project, if not more so. Hopefully you hold your code to high standards; why not apply the same approach to your communication? When you make a typo in an API request, you expect it to fail, and similarly you should expect a human interaction to fail when the expression is imprecise. Be specific. Be direct. Be honest. Start to build trust with your words.

If you have similar stories or experiences, please share in the comments!

And if you see an interesting challenge in the problems we are facing, take a look at our currently open positions on LinkedIn.
