Ensuring good service health by automating thorough integration testing and alerting

Palantir automates service health checks and communication with the developers

There are few things that make you look more unprofessional than a client informing you about a failure of your service before you are aware of it. It is your responsibility as a service provider to be the first to know when something breaks and to inform clients that you are aware of the failure and working on a fix.

When your client informs you about your API being down.

The company that I am currently developing, Applaudience, is a prime example of a fragile service where things are expected to break regularly – we aggregate cinema data from thousands of remote sources. While it is unusual for any single API to make breaking changes, spread that probability across a thousand-plus integrations and breaking changes become a daily occurrence. As such, I have designed the entire monitoring infrastructure with two key objectives:

  1. Automate anomaly detection
  2. Automate communication with the responsible developers

Automating anomaly detection ensures that we, as the service provider, are the first to know when things break, and automating communication with the responsible developers (e.g. GitHub issue creation) reduces the feedback-loop time.

Compare this with traditional monitoring solutions. Existing monitoring software primarily focuses on enabling visual inspection of service-health metrics and relies on system maintainers to detect anomalies. This approach is time-consuming and allows for human error. Even when monitoring systems let you define alerts based on pre-defined thresholds, a point-in-time metric is not sufficient to determine service health. The only way to establish service health is to write thorough integration tests (scripts) and automate their execution, just like we do in software development.

For this purpose, I have developed Palantir.

Palantir

Palantir is used for communication and as a means of seeing events in other parts of the system.

Palantir continuously performs user-defined tests and only reports failing tests, i.e. if everything is working as expected, the system remains silent. This allows service developers/maintainers to focus on defining tests that provide early warnings about errors that are about to occur, and to take preventative action when alerts fire.

Palantir decouples the monitoring, alerting and reporting mechanisms. This decoupling enables distributed monitoring and a role-based, tag-based alerting architecture.

While it might sound complicated, it is just two programs:

  1. Monitoring program
  2. Alerting program

Both programs accept user-defined scripts for monitoring and alerting purposes.

Palantir test

A Palantir test is an object defining methods used to query data and assert an expectation:

type TestContextType = Object;
type QueryResultType = *;
type TestConfigurationType = Object;

/**
 * @property configuration Test-specific configuration passed to `beforeTest` and `afterTest` as the first parameter.
 * @property description Test description.
 * @property interval Returns an interval (in milliseconds) at which the test should be executed.
 * @property tags An array of tags used for organisation of tests.
 * @property query Method used to query the data. If method execution results in an error, the test fails.
 * @property assert Method used to evaluate the response of `query`. If method returns `false`, the test fails.
 */
type TestType = {|
  +configuration?: TestConfigurationType,
  +description: string,
  +interval: (consecutiveFailureCount: number) => number,
  +tags: $ReadOnlyArray<string>,
  +query: (context: TestContextType) => Promise<QueryResultType>,
  +assert?: (queryResult: QueryResultType) => boolean
|};

In practice, a test used to check whether an HTTP resource is available could look like this (`interval` comes from the human-interval package, also used later in this post):

import axios from 'axios';
import interval from 'human-interval';

{
  description: 'https://applaudience.com/ responds with 200',
  interval: () => {
    return interval('30 seconds');
  },
  query: async () => {
    await axios('https://applaudience.com/', {
      timeout: interval('10 seconds')
    });
  },
  tags: [
    'go2cinema'
  ]
}
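The interval function receives the number of consecutive failures, which can be used to re-check a failing test more aggressively. A minimal sketch (the 30-second and 5-second values are illustrative, not Palantir defaults):

```javascript
// Re-check a failing test more frequently so that recovery is detected sooner.
// Returns an interval in milliseconds, as the `interval` contract requires.
const backoffInterval = (consecutiveFailureCount) => {
  return consecutiveFailureCount > 0 ?
    5 * 1000 :
    30 * 1000;
};
```

A test would then use `interval: backoffInterval` in place of a constant interval.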

The assert method is optional. If the query method evaluates without an error and the assert method is not defined, then the test is considered to be passing.
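When assert is defined, a test can validate the response body rather than just availability. A sketch, assuming a hypothetical `/health` endpoint that returns a JSON body with a `status` property (the global `fetch` of Node.js 18+ is used here in place of axios):

```javascript
// A test whose assert method inspects the query result.
// The endpoint and its `status` property are illustrative assumptions.
const healthCheckTest = {
  description: 'https://applaudience.com/health reports a healthy status',
  interval: () => {
    return 60 * 1000;
  },
  query: async () => {
    const response = await fetch('https://applaudience.com/health');

    return response.json();
  },
  assert: (queryResult) => {
    return queryResult.status === 'healthy';
  },
  tags: [
    'go2cinema'
  ]
};
```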

Monitor program

The Palantir monitor program continuously performs user-defined tests.

$ palantir monitor ./tests/**/*

Every test file must export an array of Palantir tests.
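A minimal test file could therefore look like this (the endpoint is a placeholder; the interval value is expressed in plain milliseconds for brevity):

```javascript
// tests/http.js
// The monitor program loads every matched file and schedules each exported
// test independently.
const tests = [
  {
    description: 'https://applaudience.com/ responds without an error',
    interval: () => {
      return 30 * 1000;
    },
    query: async () => {
      await fetch('https://applaudience.com/');
    },
    tags: [
      'go2cinema'
    ]
  }
];

export default tests;
```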

Alert program

The Palantir alert program subscribes to the Palantir HTTP API and alerts other systems using user-defined scripts.

$ palantir alert --configuration ./alert-configuration.js --palantir-api-url http://127.0.0.1:8080/

The alert configuration script allows you to set up event handlers that observe when tests fail and recover. In practice, this can be used to configure a system that notifies other systems about the failing tests, e.g.

/**
 * @file Using https://www.twilio.com/ to send a text message when tests fail and when tests recover.
 */
import Twilio from 'twilio';

const twilio = new Twilio('ACCOUNT SID', 'AUTH TOKEN');

const sendMessage = (message) => {
  twilio.messages.create({
    body: message,
    to: '+12345678901',
    from: '+12345678901'
  });
};

export default {
  onNewFailingTest: (test) => {
    sendMessage('FAILURE ' + test.description + ' failed');
  },
  onRecoveredTest: (test) => {
    sendMessage('RECOVERY ' + test.description + ' recovered');
  }
};

The above example will send a message every time a failure or recovery occurs. In practice, it is desirable for the alerting system to include a mechanism that filters out temporary failures. To address this requirement, Palantir implements an alert controller.

Alert controller

The Palantir alert controller abstracts the logic used to filter out temporary failures.

We can rewrite the earlier example to notify about a failing test only if it remains in a failing state for at least 5 minutes, e.g.

import interval from 'human-interval';
import Twilio from 'twilio';
import {
  createAlertController
} from 'palantir';

const twilio = new Twilio('ACCOUNT SID', 'AUTH TOKEN');

const sendMessage = (message) => {
  twilio.messages.create({
    body: message,
    to: '+12345678901',
    from: '+12345678901'
  });
};

const controller = createAlertController({
  delayFailure: (test) => {
    return interval('5 minutes');
  },
  delayRecovery: () => {
    return interval('1 minute');
  },
  onFailure: (test) => {
    sendMessage('FAILURE ' + test.description + ' failed');
  },
  onRecovery: (test) => {
    sendMessage('RECOVERY ' + test.description + ' recovered');
  }
});

export default {
  onNewFailingTest: (test) => {
    controller.registerTestFailure(test);
  },
  onRecoveredTest: (test) => {
    controller.registerTestRecovery(test);
  }
};

Because every part of the Palantir configuration is a script, you are able to set test-specific failure configuration, e.g. we might want no delay for database-related tests:

delayFailure: (test) => {
  if (test.tags.includes('database')) {
    return 0;
  }

  return interval('5 minutes');
}

Takeaway

Palantir tests allow you to write thorough integration tests, e.g. testing customer journeys using Puppeteer, querying a database to construct conditional tests, or asserting service health using third-party services such as https://www.webpagetest.org/.
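As an illustration of a customer-journey test, here is a sketch using Puppeteer. The page-title check is a stand-in for a real journey assertion (e.g. completing a checkout); the endpoint and tags are placeholders:

```javascript
import puppeteer from 'puppeteer';

// Drive a headless browser through the page and return an observation
// for `assert` to evaluate.
const journeyTest = {
  description: 'https://applaudience.com/ landing page renders',
  interval: () => {
    return 5 * 60 * 1000;
  },
  query: async () => {
    const browser = await puppeteer.launch();

    try {
      const page = await browser.newPage();

      await page.goto('https://applaudience.com/');

      return page.title();
    } finally {
      await browser.close();
    }
  },
  assert: (queryResult) => {
    return queryResult.length > 0;
  },
  tags: [
    'go2cinema',
    'puppeteer'
  ]
};

export default [
  journeyTest
];
```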

In the context of the alert program, “notification” of the other systems can be as simple as sending a text message or as thorough as creating an issue in an issue tracking system. In the case of Applaudience, whenever possible, every test is assigned tags that establish the test's relationship with a code repository and even the relevant tags within that repository – this information is used to create an issue in the repository, which in turn notifies the responsible developer.
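A sketch of such an alert configuration using the @octokit/rest client. The tag-to-repository mapping below is a hypothetical illustration of how tags can be resolved to a repository, not the exact mechanism we use:

```javascript
import { Octokit } from '@octokit/rest';

const octokit = new Octokit({
  auth: 'PERSONAL ACCESS TOKEN'
});

// Hypothetical mapping from a Palantir tag to the repository it belongs to.
const repositoryByTag = {
  go2cinema: {
    owner: 'applaudience',
    repo: 'go2cinema'
  }
};

export default {
  onNewFailingTest: async (test) => {
    const tag = test.tags.find((candidateTag) => {
      return Boolean(repositoryByTag[candidateTag]);
    });

    if (!tag) {
      return;
    }

    // Create an issue in the mapped repository; the test's tags double as
    // issue labels so that the responsible developer is notified.
    await octokit.issues.create({
      ...repositoryByTag[tag],
      title: 'FAILURE ' + test.description,
      labels: test.tags
    });
  },
  onRecoveredTest: () => {}
};
```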

To sum up: with Palantir, developers are always the first to become aware of service failures, drastically reducing the time it takes to restore service health. The next step in our roadmap is to enable communication with our clients when we know that an issue will impact their service (keep an eye on the Palantir report program).

Enjoy Palantir! We are sharing Palantir to enable other developers to focus on building new tools without being distracted by dashboards in the background showing pretty, non-actionable system metrics. And if you would like to work in a company that strongly promotes open-source culture, gives you access to exclusive movie screenings on a regular basis, and challenges you to solve large-scale data problems, then drop me an email at gajus@applaudience.com – we are hiring!