Photo by Nina Mercado on Unsplash

The Service MOT takes its inspiration from the UK's MOT, a yearly test that checks motor vehicles are roadworthy. Rather than checking a car, we check our services. By doing MOTs, you can be confident in your service and know it is ready to be worked on at any time.

Birth of the Service MOT

The team I am part of looks after a large number of applications. Often we focus on adding features to one application at a time. As a result, the other applications receive little attention and become stale. …


Improve the security of your Express app today

Photo by Irvan Smith on Unsplash

My team has recently started implementing CSP on our website. As we started building out the configuration, we realised that we were testing things manually and our feedback loop was not as short as we would have liked. We decided to create some tests so we wouldn’t have to retest all the different pages after changing things.

This story walks through some of the key parts of how we tested our CSP header using an example application which can be found here.


Content Security Policy Headers

Content Security Policy (CSP) headers allow pages to specify where external resources can be loaded from. The main goal of this header is to mitigate XSS attacks. The header is made up of a number of “directives” which give you granular control over the various types of resources that pages may load, such as images, CSS, and JavaScript. …
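To make the directive structure concrete, here is a minimal sketch of how a policy might be assembled into a header value. The helper name buildCspHeader and the host images.example.com are assumptions for illustration and do not come from the example application.

```typescript
// Map of directive name to its allowed sources.
type CspDirectives = Record<string, string[]>;

// Hypothetical helper: joins each directive with its sources,
// then joins the directives with "; " as the header format expects.
function buildCspHeader(directives: CspDirectives): string {
  return Object.entries(directives)
    .map(([name, sources]) => [name, ...sources].join(" "))
    .join("; ");
}

// Illustrative policy; the image host is a placeholder.
const cspHeader = buildCspHeader({
  "default-src": ["'self'"],
  "img-src": ["'self'", "https://images.example.com"],
  "script-src": ["'self'"],
});

console.log(cspHeader);
// prints: default-src 'self'; img-src 'self' https://images.example.com; script-src 'self'
```

The resulting string would be sent as the value of the Content-Security-Policy response header.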


Strategies against failure in distributed systems

Photo by Mitchell Griest on Unsplash.

It is inevitable that something will fail in a distributed system, and we should plan as if it is a normal occurrence. One solution to this problem is to run multiple instances of a service. That way, if one fails, the others can take over.

In this article, we will explore some of the different ways we can achieve this on Kubernetes (K8s).


Redundancy has a cost to it, and we should consider this when deciding how much resiliency we need. …


As easy as it is to change configuration in user interfaces, maintaining it over a long time becomes a hassle. This is especially true if you are looking after configuration for multiple environments and applications. In this article, we look at a new Terraform provider for AppDynamics that gets around this issue using configuration as code.


Why Configuration as Code

The ‘as code’ practice has been thrown around a lot recently. You may recognise it in infrastructure as code, which is often used to build applications on AWS. There are various tools used for this, such as Terraform, CDK, CloudFormation, and Puppet, to name a few. …


TL;DR: Yes, if everything is set up correctly. Keep reading to find out if you have.

Image credit: Author

We all strive to build resilient and self-healing applications, but occasionally we make a mistake and have to restart one. Hopefully, we will have the time to fix this, but until then, we may need manual intervention. In this article, we look at what happens when we delete a Kubernetes (K8s) pod while it is serving live traffic. We can then apply this knowledge to our operations so we don't affect our customers' experience.

Pod Lifecycle

First, let's understand what actually happens when a pod is deleted.

Kubernetes sends two signals to the process in a container when its pod is deleted. The initial one is SIGTERM, followed by SIGKILL. SIGTERM is like asking the process nicely to shut down, whereas SIGKILL stops the process immediately. We can listen for SIGTERM and tidy up any resources we are using, such as databases and other connections. Applications should not instantly shut down when they receive SIGTERM. Rather, they should stop accepting new requests and wait for existing requests to finish. If there are any background tasks running, the process should also wait for them to finish before exiting. …
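The "stop accepting new requests, then wait for existing ones" behaviour can be sketched as a small tracker. This is an illustrative sketch only: the class RequestDrain and its method names are hypothetical and are not taken from the article's code.

```typescript
// Hypothetical helper that tracks in-flight work so a process can drain before exiting.
class RequestDrain {
  private inFlight = 0;
  private draining = false;
  private onIdle: (() => void) | null = null;

  // Call when a request arrives. Returns false once draining has begun,
  // so the caller can reject the request instead of starting new work.
  tryStart(): boolean {
    if (this.draining) return false;
    this.inFlight += 1;
    return true;
  }

  // Call when a request (or background task) completes.
  finish(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
    this.notifyIfIdle();
  }

  // Call from the SIGTERM handler: refuse new work and run onIdle
  // once everything already in flight has finished.
  drain(onIdle: () => void): void {
    this.draining = true;
    this.onIdle = onIdle;
    this.notifyIfIdle();
  }

  private notifyIfIdle(): void {
    if (this.draining && this.inFlight === 0 && this.onIdle) {
      const callback = this.onIdle;
      this.onIdle = null; // ensure the callback runs only once
      callback();
    }
  }
}
```

In a Node.js service this might be wired up with process.on("SIGTERM", () => drain.drain(() => process.exit(0))), with tryStart and finish called at the start and end of each request handler.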


Learn how Node scales with CPU

Photo by Edward Howell on Unsplash.

I have heard many people say we should scale applications horizontally rather than vertically, but is this actually the best way to scale? In this article, we will explore how Node.js scales with CPU and see if there is anything else we need to take into account if we do so.

Test Infrastructure

To test Node.js, a demo application was created with endpoints that could be used to simulate a load. The application is Dockerised and can be found on the Docker Hub. The source code can be found on GitHub.

The application was deployed on AWS ECS with different CPU limits, and a load balancer was put in front to make it publicly accessible. The code used to deploy this infrastructure can be found on GitHub. If you would like to spin it up yourself, check out the repository and run yarn build to build the CloudFormation stack. Then run yarn cdk deploy. The different instances are deployed at <loadbalancer DNS>/<CPU>, where CPU is one of 256, 512, 1024, or 2048. Once you have finished, you can delete everything with yarn cdk destroy. …


Live issues are a great opportunity to learn and improve. Here’s what happened to us

Photo by Fleur on Unsplash.

In this article, we will explore a case when one of our services scaled to its maximum and how we changed our alerting to stop this from becoming an issue in the future.

Our Infrastructure

The service we are using as an example in this article is deployed on Kubernetes (K8s) with autoscaling enabled. We scale based on requests per second and K8s is configured to keep the requests per second (RPS) at 50. There is a slight delay before the service is scaled, as RPS is averaged over one minute. For more information on K8s scaling, check out its documentation.
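As a rough illustration of how a target of 50 RPS drives the replica count, the sketch below applies the scaling formula from the Kubernetes Horizontal Pod Autoscaler documentation: desired = ceil(current replicas × current metric / target metric). It assumes the 50 RPS target is per pod, ignores the autoscaler's tolerance and stabilisation behaviour, and uses made-up replica limits.

```typescript
interface ReplicaLimits {
  min: number;
  max: number;
}

// Simplified version of the HPA calculation, clamped to the configured limits.
function desiredReplicas(
  currentReplicas: number,
  averageRpsPerPod: number,
  targetRpsPerPod: number,
  limits: ReplicaLimits,
): number {
  if (targetRpsPerPod <= 0) {
    throw new Error("targetRpsPerPod must be positive");
  }
  const unclamped = Math.ceil((currentReplicas * averageRpsPerPod) / targetRpsPerPod);
  return Math.min(limits.max, Math.max(limits.min, unclamped));
}

// Illustrative numbers only: 4 pods averaging 75 RPS each against a target of 50.
console.log(desiredReplicas(4, 75, 50, { min: 2, max: 10 })); // prints 6
// With a lower ceiling, the service sits at its maximum even though more is needed.
console.log(desiredReplicas(4, 75, 50, { min: 2, max: 5 })); // prints 5
```

The second call shows the situation this article is concerned with: once the calculation is clamped at the maximum, extra traffic no longer results in extra pods.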

To maintain high availability, we run two K8s clusters. The graphs below show these clusters as region-1 and region-2. This creates extra complexity when autoscaling is concerned, as the clusters are completely separate and don’t share metrics. Our website runs active-active and is load-balanced across the two regions. …


Photo by Hal Gatewood on Unsplash

Now that shops are limiting the number of people inside, shopping takes a lot longer than it used to. Most times I visit a store, I have to queue outside for an extended period of time, probably longer than I spend inside. This article explores why this is the case through the eyes of a software engineer.

The Problem

Putting a limit on the number of people in a store is like having a thread pool for a server. Most new servers are moving to asynchronous processing rather than using thread pools, e.g. Nginx and Node.js. …
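The analogy can be sketched as a tiny simulation: a store with a fixed capacity behaves like a fixed-size pool, and anyone arriving when it is full waits outside. The function and field names here are hypothetical, and the model assumes customers are listed in arrival order and served first come, first served.

```typescript
// A customer arrives at a given minute and spends `duration` minutes inside.
interface Customer {
  arrival: number;
  duration: number;
}

// Returns how long each customer queues outside a store that admits
// at most `capacity` people at once.
function queueWaits(customers: Customer[], capacity: number): number[] {
  if (capacity < 1) {
    throw new Error("capacity must be at least 1");
  }
  // Each slot records the minute at which it next becomes free.
  const slotFreeAt: number[] = Array.from({ length: capacity }, () => 0);

  return customers.map((customer) => {
    // Find the slot that frees up soonest.
    let earliest = 0;
    for (let i = 1; i < slotFreeAt.length; i++) {
      if (slotFreeAt[i] < slotFreeAt[earliest]) earliest = i;
    }
    // The customer enters on arrival, or when the slot frees up if that is later.
    const enteredAt = Math.max(customer.arrival, slotFreeAt[earliest]);
    slotFreeAt[earliest] = enteredAt + customer.duration;
    return enteredAt - customer.arrival;
  });
}

// Four customers arrive together and each shops for 10 minutes.
const shoppers: Customer[] = [
  { arrival: 0, duration: 10 },
  { arrival: 0, duration: 10 },
  { arrival: 0, duration: 10 },
  { arrival: 0, duration: 10 },
];

console.log(queueWaits(shoppers, 2)); // prints [ 0, 0, 10, 10 ]
console.log(queueWaits(shoppers, 4)); // prints [ 0, 0, 0, 0 ]
```

With a capacity of two, the last two customers spend as long outside as they do inside, which mirrors requests waiting for a free worker in a saturated thread pool.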


Applied to Software

Photo by Vladyslav Cherkasenko on Unsplash

This story is by no means trying to make light of the Chernobyl incident. It tries to show that observability and resiliency are problems that have been around for a long time, and that we can learn from each other to make our systems better. I wrote this after watching the HBO TV series. If you haven't watched it, add it to your watch list, as it is a really interesting and eye-opening show.

Below are some key points from the series that I connected with some of the problems we have seen with our own observability. …


A recipe for disaster

Photo by Magnus Engø on Unsplash.

This article discusses a theoretical extended service outage caused by mixing circuit breakers with auto-scaling. We investigate the cause of the outage and some potential mitigations.

If you are thinking about using these together in your services, make sure you understand what you are getting yourself into.

Auto-Scaling

One of the benefits of deploying to the cloud is making the most of on-demand pricing. This allows us to provide more resources when we need them and cut costs when we don’t. Most traffic profiles will change with the time of day, the seasons, and even specific days of the year. …

About

Harry Martland

Senior Software Engineer and Observability Guild Lead at Booking.com — Transport, writing mainly about observability and micro front ends.
