The Service MOT takes its inspiration from the UK's MOT, a yearly test which checks that motor vehicles are roadworthy. Rather than checking a car, we check our services. By doing MOTs, you can be confident in your service and know it is ready to be worked on at any time.
The team I am part of looks after a large number of applications. Often we focus on adding features to one application at a time, which means the other applications receive little attention and become stale. …
My team has recently started implementing a Content Security Policy (CSP) on our website. As we built out the configuration, we realised we were testing things manually and our feedback loop was not as short as we would have liked. We decided to create some tests so we wouldn't have to retest all the different pages after every change.
This story walks through some of the key parts of how we tested our CSP header using an example application which can be found here.
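As a sketch of the kind of assertion-based test this enables, the helper below parses a CSP header value into its directives so individual sources can be checked. The policy string and directive names here are illustrative, not our actual configuration:

```javascript
// Parse a Content-Security-Policy header value into a map of
// directive name -> list of sources.
function parseCsp(header) {
  return Object.fromEntries(
    header
      .split(';')
      .map(directive => directive.trim())
      .filter(Boolean)
      .map(directive => {
        const [name, ...values] = directive.split(/\s+/);
        return [name, values];
      })
  );
}

// Example policy (illustrative only).
const policy = parseCsp(
  "default-src 'self'; script-src 'self' https://cdn.example.com"
);

// Assert the directives we expect are present.
console.assert(policy['default-src'].includes("'self'"));
console.assert(policy['script-src'].includes('https://cdn.example.com'));
```

In a real test suite, the header would come from an HTTP response against each page rather than a hard-coded string.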
It is inevitable that something will fail in a distributed system, and we should plan as if it is a normal occurrence. One solution to this problem is to run multiple instances of a service. That way, if one fails, the others can take over.
In this article, we will explore some of the different ways we can achieve this on Kubernetes (K8s).
Redundancy has a cost to it, and we should consider this when deciding how much resiliency we need. …
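On K8s, the simplest form of redundancy is running several replicas of a Deployment; if one pod fails, the others keep serving while it is replaced. A minimal sketch, with names and numbers purely illustrative:

```yaml
# Minimal Deployment sketch: three replicas, so the service
# survives the loss of up to two pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: my-service:1.0.0
```

Each extra replica costs compute, which is exactly the trade-off above: more replicas buy more resiliency at a higher price.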
As easy as it is to change configuration in a user interface, maintaining it over a long time becomes a hassle. This is especially true if you are looking after configuration for multiple environments and applications. In this article, we look at a new Terraform provider for AppDynamics that gets around this issue using configuration as code.
The ‘as code’ practice has been thrown around a lot recently. You may recognise it from infrastructure as code, which is often used to build applications on AWS. There are various tools used for this, such as Terraform, CDK, CloudFormation and Puppet, to name a few. …
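To make the idea concrete, configuration as code in Terraform means describing a setting as a versioned resource rather than clicking it into a UI. The resource type and attributes below are hypothetical, for illustration only, and are not the real schema of the AppDynamics provider:

```hcl
# Hypothetical sketch of configuration as code: the resource name and
# attributes are illustrative, not the provider's actual schema.
resource "appdynamics_health_rule" "high_cpu" {
  name      = "high-cpu"
  threshold = 90
}
```

The benefit is that a change like raising the threshold becomes a reviewable diff that can be applied identically to every environment.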
We all strive to build resilient and self-healing applications, but occasionally we make a mistake and have to restart one. Hopefully, we will have time to fix the underlying issue, but until then, we may need manual intervention. In this article, we look at what happens when we delete a Kubernetes (K8s) pod while it is serving live traffic. We can then apply this knowledge to our operations so we don't affect our customers' experience.
First, let's understand what actually happens when a pod is deleted.
Kubernetes sends two signals to the process in a container when a pod is deleted: first SIGTERM, followed by SIGKILL. SIGTERM is kind of like asking the process nicely to shut down, whereas SIGKILL stops the process immediately. We can listen for SIGTERM and tidy up any resources we are using, such as database and other connections. Applications should not instantly shut down when they receive SIGTERM. Rather, they should stop accepting new requests and wait for existing requests to finish. If there are any background tasks running, the process should also wait for them to finish before exiting. …
I have heard many people say we should scale applications horizontally rather than vertically, but is this actually the best way to scale? In this article, we will explore how Node.js scales with CPU and see if there is anything else we need to take into account if we do so.
To test Node.js, a demo application was created with endpoints that could be used to simulate a load. The application is Dockerised and can be found on the Docker Hub. The source code can be found on GitHub.
The application was deployed on AWS ECS with different CPU limits, and a load balancer was put in front to make it publicly accessible. The code used to deploy this infrastructure can be found on GitHub. If you would like to spin it up yourself, check out the repository and run yarn build to build the CloudFormation stack, then run yarn cdk deploy. The different instances are deployed at <loadbalancer DNS>/<CPU>, where CPU is one of the deployed CPU limits, e.g. 2048. Once you have finished, you can delete everything with yarn cdk destroy. …
In this article, we will explore a case when one of our services scaled to its maximum and how we changed our alerting to stop this from becoming an issue in the future.
The service we are using as an example in this article is deployed on Kubernetes (K8s) with autoscaling enabled. We scale based on requests per second and K8s is configured to keep the requests per second (RPS) at 50. There is a slight delay before the service is scaled, as RPS is averaged over one minute. For more information on K8s scaling, check out its documentation.
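A HorizontalPodAutoscaler targeting an average RPS per pod can be sketched as below. The custom metric name, replica bounds and target value are assumptions for illustration, not our actual configuration (custom metrics also require a metrics adapter to be installed):

```yaml
# Illustrative HPA keeping average requests per second per pod at 50.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: website
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: website
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "50"
```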
To maintain high availability, we run two K8s clusters. The graphs below show these clusters as region-1 and region-2. This creates extra complexity where autoscaling is concerned, as the clusters are completely separate and don't share metrics. Our website runs active-active and is load-balanced across the two regions. …
Now that shops are limiting the number of people inside, shopping takes a lot longer than it used to. Most times I visit a store, I have to queue outside for an extended period of time, probably longer than I spend inside. This article explores why that is, through the eyes of a software engineer.
Putting a limit on the number of people in a store is like having a thread pool for a server. Most new servers are moving to asynchronous processing rather than using thread pools, e.g. Nginx and Node.js. …
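The analogy can be sketched in a few lines: a store with fixed capacity behaves like a bounded pool, and shoppers beyond the limit wait in a queue outside. The class and capacity below are illustrative:

```javascript
// A store with limited capacity: like a fixed-size pool with a wait queue.
class Store {
  constructor(capacity) {
    this.capacity = capacity;
    this.inside = 0;
    this.queue = [];
  }

  // Resolves immediately if there is room, otherwise when someone leaves.
  enter() {
    return new Promise(resolve => {
      if (this.inside < this.capacity) {
        this.inside++;
        resolve();
      } else {
        this.queue.push(resolve); // wait outside
      }
    });
  }

  leave() {
    const next = this.queue.shift();
    if (next) {
      next(); // one out, one in: count inside is unchanged
    } else {
      this.inside--;
    }
  }
}
```

The queue outside grows whenever arrivals outpace departures, which is exactly why time waiting can exceed time spent inside.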
This story is by no means trying to make light of the Chernobyl disaster. It tries to show that observability and resiliency are problems which have been around for a long time, and that we can learn from other fields to make our systems better. I wrote this after watching the HBO TV series; if you haven't watched it, add it to your watch list, as it is a really interesting and eye-opening show.
Below are some key points I connected with while watching the series, in relation to some of the problems we have seen with our observability. …
This article discusses a theoretical extended service outage caused by mixing circuit breakers with autoscaling. We investigate the cause of the outage and some potential mitigations.
If you are thinking about using these together in your services, make sure you understand what you are getting yourself into.
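To make the interaction concrete, here is a minimal circuit-breaker sketch (thresholds and timings are illustrative). While the breaker is open, calls fail fast, so downstream traffic drops; an autoscaler driven by request rate then sees less load and may scale the downstream service in, leaving it undersized when the breaker closes again:

```javascript
// Minimal circuit breaker: opens after repeated failures, fails fast
// while open, and allows a trial request after a reset period.
class CircuitBreaker {
  constructor(fn, { failureThreshold = 3, resetMs = 10000 } = {}) {
    this.fn = fn;
    this.failureThreshold = failureThreshold;
    this.resetMs = resetMs;
    this.failures = 0;
    this.openedAt = null;
  }

  async call(...args) {
    if (this.openedAt !== null) {
      if (Date.now() - this.openedAt < this.resetMs) {
        throw new Error('circuit open'); // fail fast, no downstream call
      }
      this.openedAt = null; // half-open: let one trial request through
    }
    try {
      const result = await this.fn(...args);
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = Date.now(); // trip the breaker
      }
      throw err;
    }
  }
}
```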
One of the benefits of deploying to the cloud is making the most of on-demand pricing. This allows us to provide more resources when we need them and cut costs when we don’t. Most traffic profiles will change with the time of day, the seasons, and even specific days of the year. …