Availability basic principles

Andrii
Jan 27, 2022


Now I have my tech blog here: https://andreyka26.com/

Introduction

This article covers basic tactics and approaches for implementing availability in your system, with a few live examples. Everything presented here is based on my personal experience as a software engineer and on the books and articles I have read.

If you would rather watch the video than read the article, follow this link: https://www.youtube.com/watch?v=SQE7TwG6XnI&t=488s&ab_channel=andreyka26_programmer

You can find the list of literature used at the end of this article.

What is Availability?

Before considering different patterns and techniques for availability, we have to define the term “availability”.

According to “Designing Data-Intensive Applications”:

According to “Software Architecture in Practice”:

In simple words, availability is about making your system return correct responses even if something goes wrong, or even if a service as a whole goes down (the perfect case).

Or, to paint a slightly more realistic picture, availability is about guaranteeing to your clients that an outage lasts no longer than a specified period.
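To make such a guarantee concrete, it helps to translate an availability percentage into a downtime budget. The sketch below is only an illustration: the function name and the example targets (99%, 99.9%, 99.99%) are assumptions chosen for the example, not figures from this article.

```python
def allowed_downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Return the maximum downtime allowed within a period for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_seconds * (1 - availability_percent / 100)


SECONDS_PER_DAY = 24 * 60 * 60
YEAR = 365 * SECONDS_PER_DAY  # a non-leap year, to keep the arithmetic simple

for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_seconds(target, YEAR) / 3600
    print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

For instance, a 99.9% target over a 365-day year leaves roughly 8.76 hours of allowed outage in total.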

Stateful and Stateless

Now we should differentiate two types of applications: stateless and stateful.

A stateful application preserves some data (usually information about previous actions) as state inside itself. This implicitly means that the same action performed at two different times may lead to two different outcomes.

Examples: client applications, storages (database, cache, blob storage).

On the other hand, a stateless application acts the same for two identical requests (it produces the same output) and doesn’t preserve any state from previously performed actions.

Examples: RESTful APIs and any kind of servers that don’t preserve previous requests.
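The difference can be shown with a tiny sketch. `VisitCounter` and `greet` are hypothetical names invented for this illustration; they are not part of the sample repository.

```python
class VisitCounter:
    """Stateful: remembers how many times it has been called."""

    def __init__(self) -> None:
        self.visits = 0

    def handle(self, user: str) -> str:
        self.visits += 1  # the state changes on every call
        return f"Hello {user}, you are visitor number {self.visits}"


def greet(user: str) -> str:
    """Stateless: the output depends only on the input."""
    return f"Hello {user}"


counter = VisitCounter()
first = counter.handle("Ann")
second = counter.handle("Ann")
print(first == second)                 # same request, different outcomes
print(greet("Ann") == greet("Ann"))    # same request, same outcome
```

Two copies of `greet` running on different machines are interchangeable, while two copies of `VisitCounter` would quickly disagree with each other.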

Why do we care?
Simple availability techniques are much easier to implement for stateless applications than for stateful ones.

Availability Patterns and Techniques

The most commonly used technique for availability is redundancy. The reason is simple: if we have only one instance of the application and it fails, there is nothing we can do. There is no magic.

So the most popular pattern is to run more instances, so that a healthy one can replace the dead one.

The main goal of this article is to consider fault recovery techniques and patterns, but I still have to cover fault detection and fault prevention as well.

The most popular patterns for fault detection are different ways of checking on the highly available service:

  1. Ping/echo: a third-party application periodically checks the highly available application.
  2. Monitoring: watching memory, thread pool, network, etc.
  3. Heartbeat: the highly available application periodically reports to a third-party application.
  4. Logical clocks: detecting that some events were lost.
  5. Sanity check: checking the validity of the output of the highly available application.
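As an illustration of the heartbeat pattern from the list above, here is a minimal sketch of the monitoring side. The class name, the node names and the 5-second timeout are assumptions made for the example; a real monitor would also need networking and alerting.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that went silent."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def beat(self, node_id: str) -> None:
        """Called whenever a node reports that it is alive."""
        self.last_seen[node_id] = self.clock()

    def dead_nodes(self) -> List[str]:
        """Nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout_seconds)


# A fake clock keeps the example deterministic.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=5.0, clock=lambda: fake_now[0])
monitor.beat("node-1")
monitor.beat("node-2")
fake_now[0] = 4.0
monitor.beat("node-1")        # node-1 keeps reporting, node-2 goes silent
fake_now[0] = 7.0
print(monitor.dead_nodes())   # ['node-2']
```

Ping/echo is the same idea with the direction reversed: the monitor asks, and the application answers.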

To prevent faults, we try to anticipate anything that can go wrong and add behavior to handle such cases: assertions, tests (unit, integration, e2e), clean code and good engineering practices, exception handling, retries, etc.
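Retries are the easiest of these to show in code. The following is a minimal sketch of a retry helper with exponential backoff; the function name, the default number of attempts and the delays are assumptions for illustration only.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 3, base_delay: float = 0.1,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


calls = {"count": 0}


def flaky() -> str:
    """Fails twice with a temporary error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(retry(flaky, attempts=5, base_delay=0.01))
```

Retrying only makes sense for operations that are safe to repeat; for anything else the caller needs some form of idempotency first.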

To be honest, from my point of view, setting up any kind of redundancy combines fault prevention and fault recovery in one place.

Implementing redundancy for stateless

For high availability it is generally better to use the cloud if you have a sufficient budget, but I will show the examples on my local machine.

Let’s create a stateless API for this sample. You can create that API in any language, but since I’m a .NET software engineer, I’ll create it using .NET Core.

Note, you can find this repository by accessing this URL: https://github.com/andreyka26-git/videos/tree/main/availability

If the URL is not valid — contact me on Instagram or any social network, https://www.instagram.com/andreyka26_programmer/

Create a sample API and paste in this controller code

Then patch appsettings.json so that NodeId can be read from the config and overridden in Docker.
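The actual controller and settings are .NET code and live in the repository linked above. As a language-neutral illustration of the idea only, here is a Python sketch of a stateless handler that reports which node answered, with the node ID taken from an environment variable when one is set. The `DEFAULT_SETTINGS` dictionary, the function names and the `"local"` default are assumptions, not the repository's code.

```python
import json
import os
from typing import Mapping

# Stands in for a value kept in a config file such as appsettings.json.
DEFAULT_SETTINGS = {"NodeId": "local"}


def resolve_node_id(environ: Mapping[str, str] = os.environ) -> str:
    """Prefer the NodeId environment variable, fall back to the config default."""
    return environ.get("NodeId", DEFAULT_SETTINGS["NodeId"])


def handle_request(environ: Mapping[str, str] = os.environ) -> str:
    """Stateless handler: the response only says which node answered."""
    return json.dumps({"nodeId": resolve_node_id(environ)})


print(handle_request({}))                        # falls back to the config value
print(handle_request({"NodeId": "first_app"}))   # overridden, as a container would do
```

Because every container gets its own NodeId, the response makes it visible which instance served a given request.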

Then create a Dockerfile and put it in the root of the project (where the .sln file is located)

And docker-compose after that in the same location:

As you can see, we will use Nginx as the load balancer.

Then add a few Nginx configs in the same directory:

//app.conf

//nginx.conf

Our architecture looks like this:

Since our function has nothing to do with state, it is effectively pure, which means that two instances of the application behave identically. It also means that we can easily create and remove instances.

Run docker-compose up -d

Let me show responses for this setup.

Sometimes we call first_app

But sometimes we call second_app, depending on the internal rules of Nginx. Both instances are alive, so there is no problem with that.
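Those internal rules are often as simple as taking turns. The sketch below shows plain round-robin selection between the two instances; it is a simplification that ignores weights and health, and the `round_robin` helper is a hypothetical name used only here.

```python
from itertools import cycle
from typing import Iterator, List


def round_robin(backends: List[str]) -> Iterator[str]:
    """Yield the backends in turn, forever."""
    return cycle(backends)


picker = round_robin(["first_app", "second_app"])
print([next(picker) for _ in range(4)])
```

Since the instances are stateless, it does not matter which of them the picker returns for a particular request.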

Now let me stop one of the instances:

No matter how many times we send the request, only the live stateless_api2 instance responds.

The only problem is that Nginx has to experience a timeout before deciding that the node is dead, but that is an implementation detail of the load balancer.
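One way to picture this behavior is a balancer that marks a backend as down after a failed request and skips it for a while. The sketch below is a simplified model, not how Nginx is implemented; the class name, the 10-second `fail_timeout` default and the name `stateless_api1` for the stopped instance are assumptions.

```python
import time
from typing import Callable, Dict, List


class FailoverBalancer:
    """Round-robin balancer that skips a backend for a while after it fails."""

    def __init__(self, backends: List[str], fail_timeout: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.backends = backends
        self.fail_timeout = fail_timeout
        self.clock = clock
        self.down_until: Dict[str, float] = {}
        self.next_index = 0

    def _is_available(self, backend: str) -> bool:
        return self.clock() >= self.down_until.get(backend, float("-inf"))

    def send(self, call_backend: Callable[[str], str]) -> str:
        """Try each available backend once, starting from the round-robin position."""
        count = len(self.backends)
        for offset in range(count):
            backend = self.backends[(self.next_index + offset) % count]
            if not self._is_available(backend):
                continue
            try:
                response = call_backend(backend)
            except ConnectionError:
                # Passive check: a failed request marks the backend as down for a while.
                self.down_until[backend] = self.clock() + self.fail_timeout
                continue
            self.next_index = (self.next_index + offset + 1) % count
            return response
        raise ConnectionError("no backend available")


alive = {"stateless_api1": False, "stateless_api2": True}  # stateless_api1 was stopped


def call_backend(name: str) -> str:
    if not alive[name]:
        raise ConnectionError(f"{name} is unreachable")
    return f"response from {name}"


balancer = FailoverBalancer(["stateless_api1", "stateless_api2"])
print([balancer.send(call_backend) for _ in range(3)])
```

Only the first request pays the price of discovering the dead node; later requests go straight to the live instance until the timeout expires and the dead one is tried again.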

This is what “Designing Data-Intensive Applications” tells us:

Some of you may say: okay, we have several instances of the stateless application, but Nginx itself is still a single point of failure. Does this architecture protect us from an Nginx failure?
My answer is no. This is a typical problem with load balancers: a single load balancer is itself a single point of failure.

As a possible workaround, we can set up multiple load balancers with the same configuration and give the IPs of those load balancers to the clients. If a client experiences a timeout or a failure when sending requests to one load balancer, it simply switches to another one that is alive.
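On the client side, this workaround could look roughly like the sketch below. The helper name and the `lb1.example.test` / `lb2.example.test` addresses are made up for the example, and the request function is a stand-in for a real HTTP call.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(endpoints: Sequence[str], request: Callable[[str], str]) -> str:
    """Try each load balancer address in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except (ConnectionError, TimeoutError) as error:
            last_error = error  # remember the failure and move on to the next address
    raise ConnectionError("all load balancers failed") from last_error


def fake_request(endpoint: str) -> str:
    """Pretends that the first load balancer is down."""
    if endpoint == "lb1.example.test":
        raise TimeoutError("lb1 timed out")
    return f"handled via {endpoint}"


print(call_with_failover(["lb1.example.test", "lb2.example.test"], fake_request))
```

The cost of this approach is that every client has to know the full list of load balancers and carry the switching logic itself.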

Implementing redundancy for stateful

For this sample, I will use PostgreSQL replication.
You can read more about replication in the great book “Designing Data-Intensive Applications”. There are plenty of replication types: asynchronous, synchronous, leader-leader, leader-follower, leaderless, etc.

Sorry, I will not dig into the details of creating Postgres replication from scratch, and I’ll use the already configured images with docker-compose: https://github.com/bitnami/bitnami-docker-postgresql

Paste the content into docker-compose.yml

As far as I understand from the documentation, this setup uses asynchronous leader-follower replication.
It means that updates are propagated asynchronously from the leader to the followers. It also means that we can write only to the leader; followers are open only for reads.
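A toy model can make these two properties visible. The sketch below is an in-memory simulation, not how PostgreSQL implements replication; the `Leader` and `Follower` classes and the explicit `replicate()` call are assumptions made to keep the example deterministic.

```python
from typing import Dict, List, Tuple


class Follower:
    """Read-only replica that applies changes shipped by the leader."""

    def __init__(self) -> None:
        self.rows: Dict[int, str] = {}

    def apply(self, change: Tuple[int, str]) -> None:
        key, value = change
        self.rows[key] = value

    def read(self, key: int) -> str:
        return self.rows[key]

    def write(self, key: int, value: str) -> None:
        raise PermissionError("followers are read-only")


class Leader:
    """Accepts writes and ships them to the followers later (asynchronously)."""

    def __init__(self, followers: List[Follower]) -> None:
        self.rows: Dict[int, str] = {}
        self.followers = followers
        self.pending: List[Tuple[int, str]] = []  # changes not yet shipped

    def write(self, key: int, value: str) -> None:
        self.rows[key] = value
        self.pending.append((key, value))  # acknowledged before followers have it

    def replicate(self) -> None:
        """Ship all pending changes; a real database does this in the background."""
        for change in self.pending:
            for follower in self.followers:
                follower.apply(change)
        self.pending.clear()


follower = Follower()
leader = Leader([follower])
leader.write(1, "first")
print(1 in follower.rows)   # False: the follower lags behind the leader
leader.replicate()
print(follower.read(1))     # first
```

The gap between `write` and `replicate` is the replication lag: anything written during that gap exists only on the leader.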

You may ask, what is the purpose then? Leader-follower replication is the safest and the easiest kind to implement. But if you need full availability, you can use leaderless or leader-leader replication to accept writes on all nodes; they are harder to set up, though, and bring more distributed-systems problems.

Then run docker compose and check that the system is running.

In this example there is no Nginx in front of the nodes. The closest thing to a load balancer is the leader node itself: it accepts the writes and is responsible for replicating them to the follower.

Now let’s connect to both the follower and the leader from pgAdmin (a client for PostgreSQL) and create a simple table with one column.

I called the table “entities”.

We can see that the table is empty.

Now let me insert a few rows.

And then query them on both nodes:

Now let me stop the leader (master) node and see what happens.

As you can see, we cannot query data from the leader, but we are still able to query the previously replicated data from the follower.

This is the purpose of replication. We are not only able to keep querying the data, we also preserve it: if the disk on the leader node breaks, the data is not lost, because we can use the copy on the follower.

Literature

I used:

a) Two great books: “Designing Data-Intensive Applications” and “Software Architecture in Practice”

b) Ready-made docker-compose with docker images: https://github.com/bitnami/bitnami-docker-postgresql

My repository with samples: https://github.com/andreyka26-git/videos/tree/main/availability

Thank you for your attention. If you have any questions — please contact me on any of the social media.
