Oops, something went wrong! Ok, now what?
By Ricardo Lopes, Principal Engineer.
This post was originally published on our F-Tech Blog.
How many times have we all visited a website, tried to do something and got the famous error “Oops, something went wrong!”?
This is a terrible experience for any user, and it can lead them to abandon that website and try a different one. The user will most likely go to a competitor that provides a better user experience, continue using the competitor’s website, and completely forget about the first one.
A company may have the best professionals in the world, the best tools and resources in the market and the most exceptional software engineering processes ever created, and still lose potential customers to its competitors. No matter how good you are, one thing is for sure: your system will eventually fail.
To reduce the likelihood of breaking the production environment, we need to change our mindset when designing a solution for a specific problem. We need to identify the possible points of failure of an application and stop assuming that everything will work. We need to start asking, “What if this fails?” This change of mindset is not easy because we usually expect the system to be up and running correctly, and because we plan for success and not for failure.
Let’s discuss an example. Imagine that we have a caller that needs to get the prizes for a specific user. The following diagram helps demonstrate the dependencies.
When the caller needs to get the prizes for a user, it calls the Prize Service, passing the user’s identifier. The Prize Service then calls the User Information Service to get more information about the user. It also queries the database to retrieve the available prizes, calculates the prizes for that user, and returns them to the caller. Even in this simple workflow, we can see that the Prize Service has two hard dependencies: it cannot work if either of them is unavailable.
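To make the two dependencies concrete, here is a minimal Python sketch of the workflow. Every name in it (`get_prizes`, `healthy_user_service`, `healthy_database`, the `tier` rule) is hypothetical and chosen only for illustration; it is not the real Prize Service. The point is that if either call raises, the whole request fails.

```python
class DependencyError(Exception):
    """Raised when a dependency cannot be reached."""


def get_prizes(user_id, get_user_info, load_available_prizes):
    """Naive Prize Service call: both dependencies must succeed."""
    user = get_user_info(user_id)       # call to the User Information Service
    prizes = load_available_prizes()    # call to the database
    # Hypothetical rule: a user gets every prize at or below their tier.
    return [p["name"] for p in prizes if p["min_tier"] <= user["tier"]]


# Stand-ins for the real dependencies (assumptions, not real APIs).
def healthy_user_service(user_id):
    return {"id": user_id, "tier": 2}


def broken_user_service(user_id):
    raise DependencyError("User Information Service unreachable")


def healthy_database():
    return [{"name": "mug", "min_tier": 1}, {"name": "voucher", "min_tier": 3}]
```

Calling `get_prizes("u1", broken_user_service, healthy_database)` raises `DependencyError` even though the database is perfectly healthy, which is exactly the situation the questions below are about.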
After analysing the workflow and its dependencies, we can ask the following questions:
- What happens if we can’t contact the User Information Service?
- What happens if we can’t contact the database?
These are the most direct points of failure of our service, so what can we do to try to minimise this problem?
User Information Service dependency
First we must identify why we need to call that service. If we need to know some information about the user before calculating the prizes, we can change our approach and have that information locally available within the Prize Service.
By doing so, we remove the direct dependency on the User Information Service and replace it with a dependency on a local database. The Prize Service can then keep working even when the User Information Service is down.
“Ok, that is great, but if the local database is down, we have the same problem. Hence, we just swapped one dependency for another one.”
You are right, but now we can also have a fallback plan. For instance, we can call the User Information Service if we cannot read the user information from our local database. This adds a new layer of resilience to our service: both the local database and the external service would now have to be down for the Prize Service to be left without this user information.
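The fallback described above could be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: `read_local` and `call_remote` are hypothetical callables standing in for the local database read and the User Information Service client, and `DependencyError` is a made-up exception type.

```python
import logging

logger = logging.getLogger("prize_service")


class DependencyError(Exception):
    """Raised when a dependency cannot be reached."""


def get_user_info_with_fallback(user_id, read_local, call_remote):
    """Try the local copy first; fall back to the remote service on failure."""
    try:
        return read_local(user_id)
    except DependencyError as exc:
        # Log the failure so the broken primary source does not go unnoticed.
        logger.warning("local user data unavailable, falling back: %s", exc)
    # Only reached when the local read failed. If this call fails too,
    # the error propagates, because both sources are down.
    return call_remote(user_id)


# Stand-ins for the two sources (hypothetical).
def local_db_down(user_id):
    raise DependencyError("local database down")


def remote_service(user_id):
    return {"id": user_id, "tier": 2, "source": "remote"}
```

Logging the fallback matters: without it, the service would quietly run on its second line of defence and nobody would know the first one needs fixing.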
Database dependency
This dependency can be tricky. One possibility is to use a cache as a fallback in case of database failure. Although this introduces a risk of data inconsistency, because the cached prizes may be out of date, the service can still work while the database is down. This buys us time to address the problem and allows the system to keep operating under conditions that would otherwise mean certain failure.
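One way to picture the cache fallback is a small wrapper that remembers the last successful database read and serves it while the database is unavailable. The sketch below is only an assumption of how this might look: `PrizeCatalogue`, `load_from_db`, and the 300-second staleness limit are all hypothetical, and a real system would more likely use a shared cache than process memory.

```python
import time


class DependencyError(Exception):
    """Raised when a dependency cannot be reached."""


class PrizeCatalogue:
    """Reads prizes from the database and keeps the last good result."""

    def __init__(self, load_from_db, max_stale_seconds=300.0, clock=time.monotonic):
        self._load_from_db = load_from_db
        self._max_stale = max_stale_seconds  # assumed limit, tune per business rules
        self._clock = clock
        self._cached = None
        self._cached_at = None

    def available_prizes(self):
        try:
            prizes = self._load_from_db()
        except DependencyError:
            return self._stale_or_raise()
        # Successful read: refresh the cache and return fresh data.
        self._cached = prizes
        self._cached_at = self._clock()
        return prizes

    def _stale_or_raise(self):
        if self._cached is None:
            raise DependencyError("database down and nothing cached")
        age = self._clock() - self._cached_at
        if age > self._max_stale:
            raise DependencyError("database down and cache too old")
        return self._cached  # possibly out of date, but still usable
```

The staleness limit is what keeps the inconsistency risk bounded: the service tolerates a short outage, but stops serving cached prizes once they are older than the business is willing to accept.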
Make it resilient
Always remember that our dependencies can and will eventually fail, and we should protect our system to handle these failures. The approaches described in the example above are just a couple of the available options for making a more resilient system.
While we should always have a fault-tolerant mindset and work towards the goal of resilience, we must also align our resilient designs with business and system requirements.
For example, in some cases we may want to fail instead of introducing the risk of data inconsistency. This requires us to clearly understand how various services should behave when something unexpected happens.
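A simple way to make that decision explicit is to give each read a failure policy instead of hard-coding the fallback. The following is a hedged sketch only; `OnFailure`, `read_with_policy`, and `DependencyError` are hypothetical names, not part of any existing service.

```python
from enum import Enum


class DependencyError(Exception):
    """Raised when a dependency cannot be reached."""


class OnFailure(Enum):
    FAIL = "fail"                # surface the error; never serve stale data
    SERVE_STALE = "serve_stale"  # prefer availability over consistency


def read_with_policy(load_fresh, stale_value, policy):
    """Return fresh data, or apply the configured policy when the read fails."""
    try:
        return load_fresh()
    except DependencyError:
        if policy is OnFailure.SERVE_STALE and stale_value is not None:
            return stale_value
        raise  # FAIL policy, or nothing stale to serve
```

With this shape, the choice between failing and risking inconsistency becomes a reviewed, per-use-case setting that the business can sign off on, rather than an accident of how the fallback code was written.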
In short, always design your system by taking into account that its dependencies can and will fail. Ensure that the system can handle unexpected failures so that you can have time to react and correct the problem without impacting your customers.