Canary Monitoring, part 1

Marek Śmigielski
6 min read · Jan 17, 2022


As an SRE, if I had to pick just one kind of monitoring for an IT system, it would be canary monitoring. Why? Let me explain in this article and in the next part.

What is canary monitoring?

It is a system, tool, or program that checks the availability of another system by accessing it from the outside, through HTTP calls, API calls, or any other protocol that the system exposes. Two things are important. First, the system should be accessed, or probed, in the same way real customers access it. Second, the probe should not require access to any internal part of the system, such as a database or an internal API.
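To make the definition concrete, here is a minimal sketch of such a probe using only the Python standard library. The `ProbeResult` and `probe` names are mine, and the URL in the usage note is a placeholder. Treat it as an illustration of the idea, not a finished tool.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    ok: bool                # True only for a 2xx answer
    status: Optional[int]   # HTTP status, or None if no response arrived
    seconds: float          # how long the attempt took
    error: str = ""


def probe(url: str, timeout: float = 10.0) -> ProbeResult:
    """Fetch `url` from the outside, like a customer would, and report the outcome."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            response.read()  # make sure the body can actually be downloaded
        return ProbeResult(200 <= status < 300, status, time.monotonic() - started)
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return ProbeResult(False, exc.code, time.monotonic() - started, str(exc))
    except OSError as exc:
        # DNS failure, refused connection, timeout, TLS problem, and so on.
        return ProbeResult(False, None, time.monotonic() - started, str(exc))
```

Calling `probe("https://shop.example.com/")` from a machine outside your infrastructure answers the one question the canary cares about: can a customer reach the system right now?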

Before we go into more detail, let’s clarify where the name comes from.

The analogy is to the “canary in the coal mine”. It is said that miners in the old days took a caged canary down into the coal mine with them. It was used as an “early warning system” for dangerous gases. Contrary to popular belief, the miners usually didn’t wait for the canary to die; instead, they listened for the canary to stop chirping.

The lesson is that in this kind of monitoring the exact danger, or in the case of an IT system the “root cause” of the issue, is not what matters. What matters is knowing that the system is in a safe state and working fine, though of course only within the scope of the use case under test.

To give you a better understanding of what canary monitoring is, let’s start with an example.

e-Commerce website

Suppose that you administer an e-commerce website. For canary monitoring, neither the architecture the system is built on nor the programming language it uses matters. What matters is identifying the most important use case that your customers carry out through it.

I hope it is quite clear that in this case it is browsing through the catalog, adding some products to the cart, and checking out. Most probably, the fact that your main webpage loads is also very important.

To access your site in a way similar to how your customers do, you need tools that can make HTTP requests. I discuss this in the next section. For now, you can imagine a tool that just runs “curl” commands one after another.

With this as a guideline, I would recommend a scenario similar to this one:

STEP 1. Get the main page

In this step, the main webpage is accessed and checked for some text that should always be there, e.g. “login” or “cart”. It is not necessary to test every HTML element, but at least a few should be checked.
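A sketch of this step in Python could look like the following. The function names are hypothetical, and the markers are the “login” and “cart” examples from the text. I match them case-insensitively, which is my own choice, not a requirement.

```python
import urllib.request


def missing_markers(html: str, markers: list[str]) -> list[str]:
    """Return the markers that do NOT appear in the page (case-insensitive)."""
    lowered = html.lower()
    return [marker for marker in markers if marker.lower() not in lowered]


def check_main_page(url: str, markers: list[str], timeout: float = 10.0) -> list[str]:
    """Download the main page and return the list of missing markers.

    An empty list means the page loaded and contained every expected text.
    Network errors and 4xx/5xx answers are raised as exceptions by urlopen.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return missing_markers(html, markers)
```

Keeping the text check in its own small function means it can be tested without any network access, for example `missing_markers('<a href="/login">Login</a>', ["login", "cart"])` returns `["cart"]`.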

STEP 2. Head to some category

In the next step, your product catalog should be tested. Pick a category that is core to your business and fairly stable. Avoid new categories and the ones that the marketing department experiments with a lot. It is important to iterate over the products and check that they are rendered properly.
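How you iterate over the products depends entirely on your markup. The sketch below assumes, purely for illustration, that every product tile carries a `product` CSS class plus `data-name` and `data-price` attributes; your site will almost certainly look different. It uses the standard-library `html.parser` module.

```python
from html.parser import HTMLParser


class ProductTileParser(HTMLParser):
    """Collect the attributes of every element carrying the `product` class."""

    def __init__(self) -> None:
        super().__init__()
        self.tiles: list[dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        attributes = {name: value or "" for name, value in attrs}
        if "product" in attributes.get("class", "").split():
            self.tiles.append(attributes)


def category_problems(html: str, min_products: int = 3) -> list[str]:
    """Return human-readable problems found on a category page (empty = fine)."""
    parser = ProductTileParser()
    parser.feed(html)
    problems = []
    if len(parser.tiles) < min_products:
        problems.append(
            f"expected at least {min_products} products, found {len(parser.tiles)}"
        )
    for index, tile in enumerate(parser.tiles):
        if not tile.get("data-name", "").strip():
            problems.append(f"product #{index} has no name")
        try:
            if float(tile.get("data-price", "")) <= 0:
                problems.append(f"product #{index} has a non-positive price")
        except ValueError:
            problems.append(f"product #{index} has an unreadable price")
    return problems
```

Returning a list of problems instead of a single yes/no makes the alert far more useful to whoever gets paged.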

STEP 3. Pick some product

Again, pick the product carefully. You don’t want to be surprised by marketing decisions. In this step, it is wise to check that at least the main product image loads and that things like price and availability are set correctly. Obviously, the “add to cart” button is critical here as well.

STEP 4. Add the product to the cart

A few considerations here. First, before adding the product to the cart, check that the cart is empty. Second, check that the product was truly added. Third, you will usually have to deal with some fancy JavaScript that your site uses for this. It is quite important to get this right and to test for the JavaScript problems you might have, so don’t skip it.
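The first two checks can be sketched as below. `Shop` is a hypothetical thin client over your public, customer-facing interface, and `CanaryFailure` is a name I made up. The sketch deliberately leaves out the JavaScript part, which in practice needs a real browser driven by an automation tool.

```python
from typing import Protocol


class Shop(Protocol):
    """Hypothetical client that talks to the shop the way a customer does."""

    def cart_items(self) -> list[str]: ...

    def add_to_cart(self, product_id: str) -> None: ...


class CanaryFailure(Exception):
    """Raised when a step observes something a customer would not accept."""


def add_product_step(shop: Shop, product_id: str) -> None:
    # 1. The previous run should have cleaned up after itself.
    leftovers = shop.cart_items()
    if leftovers:
        raise CanaryFailure(f"cart not empty before the test: {leftovers}")

    # 2. Perform the action exactly once.
    shop.add_to_cart(product_id)

    # 3. Verify the result instead of trusting the response.
    items = shop.cart_items()
    if items != [product_id]:
        raise CanaryFailure(f"expected only {product_id!r} in the cart, got {items}")
```

Because the step depends only on the small `Shop` interface, the same code can run against an HTTP client, a browser-driven client, or a fake used in unit tests.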

STEP 5. Do a checkout flow

It is not enough to check that you can add the product to the cart. You also need to test that you can place an order. You can do it in any form you like, and testing one flow is usually enough. This is where you should be particularly careful not to disturb your ordering system too much. Usually, choosing something like “pickup in person” without online payment is the best choice. It is also good to leave a clear indication that this is a test order, just in case.

STEP 6. Cancel an order

The last step is needed so that your system is not flooded with pending monitoring orders. Try to leave the system in the state it was in before, so that you can repeat this flow over and over again.
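Put together, the steps above form a sequence that should stop at the first failure but still clean up after itself. Here is one possible sketch of such a runner. The `run_scenario` name and the log format are my own, and the step functions are whatever you implement for your shop.

```python
from typing import Callable

# A step is a human-readable name plus a function that raises on failure.
Step = tuple[str, Callable[[], None]]


def run_scenario(steps: list[Step], cleanup: Callable[[], None]) -> tuple[bool, list[str]]:
    """Run steps in order, stop at the first failure, and always run cleanup."""
    log: list[str] = []
    ok = True
    try:
        for name, action in steps:
            try:
                action()
                log.append(f"OK: {name}")
            except Exception as exc:
                log.append(f"FAIL: {name}: {exc}")
                ok = False
                break  # later steps depend on this one, so skip them
    finally:
        # Cancel the test order (if any) even when a step failed halfway.
        try:
            cleanup()
            log.append("OK: cleanup")
        except Exception as exc:
            log.append(f"FAIL: cleanup: {exc}")
            ok = False
    return ok, log
```

A failed cleanup is reported as a failure too, because leftover test orders will break the next run and annoy whoever processes real orders.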

Once the scenario is defined, we also need to decide how often to run it. For such a complex scenario that actively changes the system, I would not advise going too crazy. Once every half an hour is most probably enough. You will get enough testing throughout the day anyway, and since you are reading about canary monitoring, you probably have some other monitoring tools in place as well.
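A cron job or any scheduler you already run is a perfectly good way to get the half-hour cadence. If you prefer to keep it inside the program, a small loop like this sketch would do. The `run_every` name and its parameters are hypothetical; `sleep` and `clock` are injectable only to make the loop testable.

```python
import time
from typing import Callable, Optional


def run_every(
    interval_seconds: float,
    check: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
    max_runs: Optional[int] = None,
) -> None:
    """Call `check` once per interval, measured from the start of each run."""
    runs = 0
    while max_runs is None or runs < max_runs:
        started = clock()
        try:
            check()
        except Exception as exc:
            # A crash in one run must not stop the monitoring loop itself.
            print(f"canary run crashed: {exc}")
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Subtract the time the run took so the schedule does not drift.
        sleep(max(0.0, interval_seconds - (clock() - started)))
```

With the half-hour interval from the text, the call would be `run_every(30 * 60, my_scenario)`, where `my_scenario` stands for your own scenario function.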

How to build it

First, many services specialize in canary monitoring, especially of regular websites. They are usually the way to go if you have simple use cases to cover. I will not recommend any particular service or do a feature comparison, as I believe this is a task anyone can best do with Google.

For more complex use cases, you might need to implement a custom solution. There are a few considerations when doing so.

The first is reliability. The monitoring system should be at least as reliable as your system, or as the reliability you want your system to have, and preferably more so. A good choice is to base monitoring on a different technology stack, so that the same problems do not impact both at the same time. In 2022, my recommendation is to use Python and just do some coding. Python is popular and easy enough that anyone who works with computers should be able to handle it. At the same time, there are plenty of libraries and add-ons, so almost everything should take just a few lines of code.

The second is being external. Your monitoring system should be external to your service. If you are using a cloud provider, just use a different region. If you have a hosted solution, put your monitoring in the cloud. The point is that you want to know when your system is down from the user’s perspective and to be able to detect network failures as well. This requirement includes using the external API only. (Just to be clear: you cannot delete a record in the database directly, as that violates this rule.)

The third is software architecture. As your system evolves, your monitoring solution has to evolve with it, so don’t take shortcuts, and treat all technical debt seriously here. Hire senior developers or consult with them while you build your solution. Think about layers, good interfaces, and abstractions. Watch out for the spaghetti code antipattern in particular.

The fourth is automation. Keep your monitoring in Git and treat it like any other software you build, with at least continuous integration in place. Ideally, you should have pull requests, code review, some static code analysis, and continuous delivery. Think about the maintenance of your monitoring and about future changes.

What is coming in the next part:

  • a deeper dive into canary monitoring
  • considerations and what you need to watch out for
  • how to use the same tools beyond monitoring
  • and of course, some more examples

If you have read this far, please let me know in the comments whether it was useful and understandable. Also, check out the second part of the canary monitoring series.


Marek Śmigielski

I have a Master’s degree in Statistics and Econometrics, although for 15 years now I have been working in IT as a developer, system architect, and product owner.