Blueprint for reliable software solutions

Agustin Lucchetti
Published in Patagonian Notes
Jun 23, 2021

With online platforms now being a key part of our daily lives, downtime can be very costly, both for end users and for the organizations behind those platforms.

The COVID-19 pandemic has pushed companies, institutions and governments to adopt new software solutions to replace face-to-face interactions and processes at a speed we have never seen before. In the grand scheme of things this is great progress towards technology adoption, but we now face the reality that many of these systems are not reliable enough, creating plenty of issues and hassles for users and organizations alike.

At Patagonian, we have been building software solutions for over a decade, and that experience has taught us a very important lesson:

Software reliability is a conscious team effort that requires planning and the right set of skills.

But I’m sure you didn’t click this article just for catchy phrases, so here is our blueprint for building reliable software solutions:

Prepare for failure, and you will -almost- never be surprised

(bear with me, I still have a few catchy phrases left).

Assume that any component in your system can fail at any moment (especially at the worst possible moment) and build accordingly. Avoid single points of failure by adding geographically aware redundancy to every component and layer of your system, and when possible use cloud services that are already highly available and fault tolerant to save time and money.
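To make the "assume anything can fail" idea concrete, here is a minimal Python sketch of a client that retries a call and then fails over to a redundant replica. The names call_with_failover, AllReplicasFailed, primary and secondary are made up for this illustration, and the replicas are plain functions standing in for real services in different regions.

```python
import random
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every replica has been tried and none succeeded."""


def call_with_failover(
    replicas: Sequence[Callable[[], T]],
    attempts_per_replica: int = 2,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each replica in order, retrying with jittered exponential backoff."""
    if not replicas:
        raise ValueError("at least one replica is required")
    errors: List[Exception] = []
    for replica in replicas:
        for attempt in range(attempts_per_replica):
            try:
                return replica()
            except Exception as exc:  # a real client would catch narrower errors
                errors.append(exc)
                # Wait a random time up to base_delay * 2^attempt before retrying.
                sleep(random.uniform(0, base_delay * (2 ** attempt)))
    raise AllReplicasFailed(
        f"{len(errors)} attempts failed; last error: {errors[-1]!r}"
    )


# Hypothetical replicas: the primary region is down, the secondary is healthy.
def primary() -> str:
    raise ConnectionError("primary region is down")


def secondary() -> str:
    return "served from secondary region"


print(call_with_failover([primary, secondary]))  # prints "served from secondary region"
```

In practice a load balancer or a managed cloud service usually does this for you, which is exactly why leaning on them saves time; the sketch only shows the behavior you are paying for.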

Container-based infrastructures built on technologies like Kubernetes or AWS ECS are great for this: the built-in auto-scaling and self-healing features can enable your team to build highly available solutions at a very fast pace. Just keep in mind that no single tool or service is a silver bullet, and in most cases you will need an experienced DevOps Engineer on the team to manage them. Redundancy can also quickly become a never-ending money sink, so align the team's and stakeholders' expectations with the budget to make the most of the resources available.

Get the developers involved as early as possible.

Adding redundancy and scalability to every component of your system is not a drop-in solution that can simply be covered by adding a Site Reliability Engineer (SRE) to the team a few weeks before going live. Every piece of the system needs to be designed and coded with that objective in mind. Architects, Developers and SREs/DevOps need to work together to make sure, for example, that the applications are stateless and capable of leveraging distributed cache solutions.
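As a rough illustration of what "stateless" means in code, the Python sketch below keeps all per-user state in a shared store instead of inside the application instance. SessionStore, InMemoryStore and CartService are hypothetical names; InMemoryStore is only a local stand-in for a real distributed cache so the example can run on its own.

```python
from typing import Dict, Optional, Protocol


class SessionStore(Protocol):
    """Hypothetical interface for an external, shared cache."""

    def get(self, key: str) -> Optional[int]: ...

    def set(self, key: str, value: int) -> None: ...


class InMemoryStore:
    """Local stand-in for a distributed cache, used only to run the sketch."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str) -> Optional[int]:
        return self._data.get(key)

    def set(self, key: str, value: int) -> None:
        self._data[key] = value


class CartService:
    """One application instance. It keeps no per-user state of its own."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self._store = store

    def add_item(self, user_id: str) -> int:
        # Read-modify-write is not atomic here; a real cache would typically
        # offer an atomic increment for this.
        count = (self._store.get(user_id) or 0) + 1
        self._store.set(user_id, count)
        return count


shared = InMemoryStore()
instance_a = CartService("a", shared)
instance_b = CartService("b", shared)

instance_a.add_item("user-1")
instance_a.add_item("user-1")
# Pretend instance_a crashes here and the next request lands on instance_b.
print(instance_b.add_item("user-1"))  # prints 3
```

Because neither instance holds the cart itself, any of them can be killed, restarted or scaled out without users noticing, and that is a property the developers have to build in from the start.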


Proper testing and certification process.

We have all heard the phrases “but it worked in my local environment” and “that wasn’t happening during QA”. Testing every new piece of code that you deploy to production is a fundamental part of building reliable solutions. This is not a new concept at all: software testing has been part of our industry for decades, but the emergence of “the cloud” and other new technologies has added new dimensions to the testing process. Replicas of the production environment, full size or downscaled, are a great way of testing new releases to see how they will behave in production. Combined with automated load/stress testing, they let you catch most problems before launching new versions of your systems.
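To give an idea of what an automated load test measures, here is a small Python sketch that calls a target concurrently and reports the error rate and the 95th percentile latency. run_load_test, LoadTestResult and fake_endpoint are hypothetical names, and fake_endpoint just sleeps for a few milliseconds in place of a real HTTP call to a replica environment. Dedicated load testing tools do far more than this.

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LoadTestResult:
    requests: int
    errors: int
    p95_seconds: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests


def run_load_test(
    target: Callable[[], None], requests: int, concurrency: int
) -> LoadTestResult:
    """Call `target` `requests` times using `concurrency` worker threads."""
    if requests < 2:
        raise ValueError("need at least 2 requests to compute a percentile")

    def timed_call(_: int) -> Tuple[float, bool]:
        start = time.perf_counter()
        try:
            target()
            ok = True
        except Exception:
            ok = False
        return time.perf_counter() - start, ok

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        samples = list(pool.map(timed_call, range(requests)))

    latencies = [latency for latency, _ in samples]
    errors = sum(1 for _, ok in samples if not ok)
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(latencies, n=20, method="inclusive")[-1]
    return LoadTestResult(requests=requests, errors=errors, p95_seconds=p95)


def fake_endpoint() -> None:
    """Hypothetical stand-in for a request to a staging replica."""
    time.sleep(random.uniform(0.001, 0.005))


result = run_load_test(fake_endpoint, requests=200, concurrency=20)
print(f"errors={result.error_rate:.1%} p95={result.p95_seconds * 1000:.1f} ms")
```

A pipeline could fail the release when the error rate or the p95 latency crosses a limit agreed with the team, so regressions show up before the release reaches production.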

Your testing process should not only cover the software itself, but also the deployment process to make sure that when it’s time to go live, pressing the green button won’t bring any unpleasant surprises. Leveraging Infrastructure as Code (IaC) tools like Terraform and automated deployment pipelines will greatly help the team with the process of building, maintaining and deploying multiple replicated environments.

Monitor everything.

Having a robust monitoring and logging solution in place is fundamental to detecting potential problems before it’s too late, and to giving the team the right tools to react quickly to unforeseen situations. Open source solutions like Prometheus and Grafana, or managed cloud services like Amazon CloudWatch, are great tools with a lot of flexibility, and they provide plenty of out-of-the-box features that the team can leverage without much initial effort.
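The tools above handle this for you, but the core idea behind many alerts is simple enough to sketch in a few lines of Python: keep track of recent requests and raise a flag when the error rate over a time window crosses a threshold. ErrorRateMonitor and its parameters are hypothetical and are not the API of Prometheus, Grafana or CloudWatch.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateMonitor:
    """Tracks request outcomes over a sliding time window."""

    def __init__(
        self, window_seconds: float, threshold: float, min_requests: int = 10
    ) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_error)

    def record(self, timestamp: float, is_error: bool) -> None:
        self._events.append((timestamp, is_error))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def error_rate(self, now: float) -> float:
        self._evict(now)
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def should_alert(self, now: float) -> bool:
        rate = self.error_rate(now)
        # Require a minimum sample size so one failed request does not page anyone.
        return len(self._events) >= self.min_requests and rate > self.threshold


monitor = ErrorRateMonitor(window_seconds=60, threshold=0.05)
for second in range(20):
    # One request per second; every fourth one fails.
    monitor.record(timestamp=float(second), is_error=(second % 4 == 0))
print(monitor.should_alert(now=20.0))  # prints True
```

The minimum request count is a design choice worth discussing with the team: alerts that fire on noise get ignored, and ignored alerts are as bad as no monitoring at all.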

There is a lot more that can be said about building highly available and reliable software solutions, but in our experience these are the most important factors in making sure that your team starts off on the right foot. We will continue to explore this subject in more depth in upcoming articles. Stay tuned!
