Ensuring good service health by automating thorough integration testing and alerting

Palantir automates service health checks and communication with the developers

Gajus Kuizinas
Jun 22, 2018 · 6 min read

Few things make you look more unprofessional than a client informing you about a failure of your service before you were aware of it and transparent about it. It is your responsibility as a service provider to be the first to know when something breaks and to inform clients that you are aware of the failure and working on a fix.

When your client informs you about your API being down.

The company that I am currently building is a prime example of a fragile service where things are expected to break regularly: at Applaudience we aggregate cinema data from thousands of remote sources. While it is unusual for any single API to make breaking changes, when you spread this probability across a thousand-plus integrations, breaking changes become a daily occurrence. As such, I have designed the entire monitoring infrastructure with two key objectives:

  1. Automate anomaly detection
  2. Automate communication with the responsible developers

Automating anomaly detection ensures that we, as the service provider, are the first to know when things break, and automating communication with the responsible developers (e.g. GitHub issue creation) reduces the feedback-loop time.

Compare this with traditional monitoring solutions. Existing monitoring software primarily focuses on enabling visual inspection of service-health metrics and relies on system maintainers to detect anomalies. This approach is time-consuming and prone to human error. Even when monitoring systems allow alerts to be defined based on predefined thresholds, a point-in-time metric is not sufficient to determine service health. The only way to establish service health is to write thorough integration tests (scripts) and automate their execution, just as we do in software development.

For this purpose, I have developed Palantir.


Palantir is used for communication and as a means of seeing events in other parts of the system.

Palantir continuously performs user-defined tests and reports only failing tests, i.e. if everything is working as expected, the system remains silent. This allows service developers and maintainers to focus on defining tests that provide early warnings about errors that are about to occur, and to take preventative action when alerts fire.

Palantir decouples the monitoring, alerting and reporting mechanisms. This separation allows for distributed monitoring and a role-based, tag-based alerting architecture.

While it might sound complicated, it is just two programs:

  1. Monitoring program
  2. Alerting program

Both programs accept user-defined scripts for monitoring and alerting purposes.

Palantir test

A Palantir test is an object defining methods used to query data and assert an expectation:
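A minimal TypeScript sketch of that shape follows. It is an illustration rather than Palantir's actual definition: the type name `ServiceTest` and the field names `description`, `tags`, `query` and `assert` are assumptions made for this example.

```typescript
// Hypothetical shape of a test object; all names here are assumptions.
type ServiceTest<T> = {
  description: string;              // human-readable name shown in alerts
  tags: string[];                   // free-form labels, e.g. for routing alerts
  query: () => Promise<T> | T;      // fetches the data to inspect
  assert?: (result: T) => boolean;  // optional expectation about that data
};

// Example: the query returns a number and the assertion puts a bound on it.
const queueDepthTest: ServiceTest<number> = {
  description: 'job queue depth stays below 100',
  tags: ['queue'],
  query: () => 42, // stand-in for a real lookup
  assert: (depth) => depth < 100,
};
```

Keeping the query separate from the assertion means the same data-fetching step could be reused with different expectations.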

In practice, an example of a test used to check whether an HTTP resource is available could look like this:
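The sketch below is one possible way to write such a check, not the article's original snippet. The factory `createAvailabilityTest`, the `HttpGet` type and the field names are hypothetical, and the HTTP client is injected so that the example runs without network access.

```typescript
// Minimal response shape needed by the check; any HTTP client can be adapted to it.
type HttpResponse = { status: number };
type HttpGet = (url: string) => Promise<HttpResponse>;

// Builds a hypothetical availability test for one URL.
const createAvailabilityTest = (url: string, httpGet: HttpGet) => ({
  description: `${url} responds with HTTP 200`,
  tags: ['http'],
  query: (): Promise<HttpResponse> => httpGet(url),
  assert: (response: HttpResponse): boolean => response.status === 200,
});

// Stub client: pretends that only URLs containing "/health" are available.
const stubGet: HttpGet = async (url) => ({
  status: url.indexOf('/health') !== -1 ? 200 : 503,
});

const healthTest = createAvailabilityTest('https://example.com/health', stubGet);

healthTest.query().then((response) => {
  const state = healthTest.assert(response) ? 'passing' : 'failing';
  console.log(`${healthTest.description}: ${state}`);
});
```

In a Node.js 18+ runtime, the built-in `fetch` could be passed instead of the stub, e.g. `createAvailabilityTest(url, (u) => fetch(u))`.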

The assertion method is optional. If the query method evaluates without an error and no assertion method is defined, the test is considered to be passing.
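That rule can be sketched as a small evaluation function. This is an assumed reading of the behaviour described above, not Palantir's source: `evaluateTest` and `CheckTest` are hypothetical names, and treating a thrown error as a failure is an assumption.

```typescript
type CheckTest<T> = {
  description: string;
  query: () => Promise<T> | T;
  assert?: (result: T) => boolean;
};

// Returns true when the test passes under the rule described above.
const evaluateTest = async <T>(test: CheckTest<T>): Promise<boolean> => {
  try {
    const result = await test.query();
    // No assertion defined: a query that completes without error is enough.
    if (test.assert === undefined) {
      return true;
    }
    return test.assert(result) === true;
  } catch {
    // A thrown error or rejected promise in either method counts as a failure.
    return false;
  }
};
```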

Monitor program

The Palantir monitor program continuously performs user-defined tests.

Every test file must export an array of Palantir tests.
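As a rough illustration, the array below is the kind of value a test file might export, and `runCycle` shows a single monitoring pass that returns only the failing tests. All names are hypothetical, the queries are stand-ins that perform no I/O, and a real monitor would repeat such a pass on an interval.

```typescript
type MonitoredTest = {
  description: string;
  tags: string[];
  query: () => Promise<unknown> | unknown;
  assert?: (result: unknown) => boolean;
};

// What a test file's exported array could contain (stand-in queries only).
const tests: MonitoredTest[] = [
  { description: 'cache is reachable', tags: ['cache'], query: () => 'PONG' },
  {
    description: 'listings table is not empty',
    tags: ['database'],
    query: () => 0, // pretend the row count came back as zero
    assert: (rowCount) => typeof rowCount === 'number' && rowCount > 0,
  },
];

// Runs every test once and returns the descriptions of failing tests only.
const runCycle = async (all: MonitoredTest[]): Promise<string[]> => {
  const failing: string[] = [];
  for (const test of all) {
    try {
      const result = await test.query();
      if (test.assert !== undefined && !test.assert(result)) {
        failing.push(test.description);
      }
    } catch {
      // A query or assertion that throws is treated as a failing test.
      failing.push(test.description);
    }
  }
  return failing;
};
```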

Alert program

The Palantir alert program subscribes to the Palantir HTTP API and alerts other systems using user-defined scripts.

The alert configuration script lets you set up event handlers that observe when tests fail and recover. In practice, this can be used to configure a system that notifies other systems about failing tests, e.g.
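A sketch of such handlers, under stated assumptions: the handler names `onNewFailingTest` and `onRecoveredTest`, the `dispatchChanges` helper and the in-memory `outbox` are all hypothetical, and the outbox stands in for a real chat or SMS integration.

```typescript
type AlertHandlers = {
  onNewFailingTest: (description: string) => void;
  onRecoveredTest: (description: string) => void;
};

// Compares two consecutive snapshots of failing tests and fires the matching handlers.
const dispatchChanges = (
  previous: string[],
  current: string[],
  handlers: AlertHandlers,
): void => {
  for (const description of current) {
    if (previous.indexOf(description) === -1) {
      handlers.onNewFailingTest(description);
    }
  }
  for (const description of previous) {
    if (current.indexOf(description) === -1) {
      handlers.onRecoveredTest(description);
    }
  }
};

// Stand-in for a messaging integration: messages are collected in memory.
const outbox: string[] = [];
const handlers: AlertHandlers = {
  onNewFailingTest: (description) => { outbox.push(`FAILING: ${description}`); },
  onRecoveredTest: (description) => { outbox.push(`RECOVERED: ${description}`); },
};

dispatchChanges([], ['checkout API responds'], handlers);  // test starts failing
dispatchChanges(['checkout API responds'], [], handlers);  // test recovers
```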

The above example will send a message for every failure and recovery, every time a failure or recovery occurs. In practice, it is desirable for the alerting system to include a mechanism that filters out temporary failures. To address this requirement, Palantir implements an alert controller.

Alert controller

The Palantir alert controller abstracts the logic used to filter out temporary failures.

We can rewrite the earlier example to send a notification about a failing test only if it remains in a failing state for at least 5 minutes, e.g.
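The following is a sketch of the idea rather than Palantir's own alert controller. `createAlertController`, `delayFailure`, `registerFailure`, `registerRecovery` and `flush` are hypothetical names, and the current time is passed in explicitly so the example runs deterministically without timers.

```typescript
type ControllerOptions = {
  delayFailure: number; // milliseconds a test must stay failing before anyone is notified
  onFailure: (description: string) => void;
  onRecovery: (description: string) => void;
};

const createAlertController = (options: ControllerOptions) => {
  const failingSince = new Map<string, number>(); // description -> time of first failure
  const notified = new Set<string>();             // failures already reported

  return {
    registerFailure: (description: string, now: number): void => {
      if (!failingSince.has(description)) {
        failingSince.set(description, now);
      }
    },
    registerRecovery: (description: string): void => {
      failingSince.delete(description);
      // Only announce a recovery if the failure itself was announced.
      if (notified.delete(description)) {
        options.onRecovery(description);
      }
    },
    // Called periodically; reports failures that have outlasted the delay.
    flush: (now: number): void => {
      failingSince.forEach((since, description) => {
        if (!notified.has(description) && now - since >= options.delayFailure) {
          notified.add(description);
          options.onFailure(description);
        }
      });
    },
  };
};

const FIVE_MINUTES = 5 * 60 * 1000;
const sent: string[] = [];
const controller = createAlertController({
  delayFailure: FIVE_MINUTES,
  onFailure: (description) => { sent.push(`FAILING: ${description}`); },
  onRecovery: (description) => { sent.push(`RECOVERED: ${description}`); },
});

controller.registerFailure('showtimes feed parses', 0);
controller.flush(60 * 1000);                          // 1 minute in: still silent
controller.flush(FIVE_MINUTES);                       // 5 minutes in: failure is reported
controller.registerRecovery('showtimes feed parses'); // recovery is reported
```

A test that fails and recovers within the delay produces no messages at all, which is the filtering behaviour the controller is meant to provide.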

Because every part of the Palantir configuration is a script, you are able to set test-specific failure configuration, e.g. we might want to have no delay for database-related tests.
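One way to express such a rule is a function that derives the delay from a test's tags. This is a sketch only: `failureDelayFor` and the `database` tag are assumptions chosen for the example.

```typescript
type TaggedTest = { description: string; tags: string[] };

const FIVE_MINUTES_MS = 5 * 60 * 1000;

// Picks how long a failure must persist before alerting, based on the test's tags.
const failureDelayFor = (test: TaggedTest): number =>
  test.tags.indexOf('database') !== -1 ? 0 : FIVE_MINUTES_MS;
```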


Palantir tests make it possible to write thorough integration tests, e.g. testing customer journeys using Puppeteer, querying a database to construct conditional tests, as well as asserting service health using third-party services, such as https://www.webpagetest.org/.

In the context of the alert program, “notification” of the other systems can be as simple as sending a text message or as thorough as creating an issue in an issue-tracking system. In the case of Applaudience, when possible, every test is assigned tags that establish the test's relationship with the code repository and even the relevant tags within that repository. This information is used to create an issue in the repository, which in turn notifies the responsible developer.
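To make the tag-to-issue idea concrete, here is a sketch that turns a failing test into an issue draft. The tag convention (`repository:<owner>/<name>` and `label:<name>`), the `buildIssueDraft` function and the draft shape are invented for this illustration and do not describe Applaudience's actual setup; submitting the draft to an issue tracker is left out.

```typescript
type FailingTest = { description: string; tags: string[] };
type IssueDraft = { repository: string; title: string; labels: string[] };

// Assumed tag convention: "repository:<owner>/<name>" and "label:<name>".
const buildIssueDraft = (test: FailingTest): IssueDraft | null => {
  const repositoryPrefix = 'repository:';
  const labelPrefix = 'label:';
  const repositoryTag = test.tags.filter((tag) => tag.indexOf(repositoryPrefix) === 0)[0];
  if (repositoryTag === undefined) {
    return null; // no repository tag, so there is nowhere to file the issue
  }
  return {
    repository: repositoryTag.slice(repositoryPrefix.length),
    title: `Failing test: ${test.description}`,
    labels: test.tags
      .filter((tag) => tag.indexOf(labelPrefix) === 0)
      .map((tag) => tag.slice(labelPrefix.length)),
  };
};
```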

To sum up, with Palantir, developers are always the first to become aware of service failures, which drastically reduces the time it takes to restore service health. The next step on our roadmap is to enable communication with our clients when we know that an issue will impact their service (keep an eye on the Palantir report program).

Enjoy Palantir! We are sharing Palantir to enable other developers to focus on building new tools without being distracted by dashboards in the background showing pretty, non-actionable system metrics. And if you would like to work at a company that strongly promotes open-source culture, gives you access to exclusive movie screenings on a regular basis, and challenges you to solve large-scale data problems, then drop me an email at gajus@applaudience.com. We are hiring!
