PepperReport monitoring stack
How to monitor thousands of endpoints every minute.
A one-line cURL command can monitor a single HTTP endpoint just as effectively as PepperReport does.
But once you want to check thousands of endpoints, every minute, from many locations around the world, with a history queryable in a fraction of a second and the ability to detect even tiny Apdex trends, many issues appear.
Building this with a viable SaaS vision adds even more constraints:
the service has to launch quickly and run on a small monthly infrastructure budget, and that directly shapes how it is designed. For example, there is no money to rent servers all around the globe. Let's see how PepperReport works.
The project is split in two parts: the administration-app and the data-app.
The administration-app provides the UI and the organization's configuration. It is powered by PHP 7.2 and Symfony 4, mostly because HTML rendering is done with Twig and development is fast. With a Redis instance managing (and sharing) server sessions, scaling it later won't be an issue. Frontend resources are built with Twitter Bootstrap, with Gulp handling CSS and JS concatenation and minification.
The repository for user, domain and organization data is a PostgreSQL database. The application is designed around a command bus (Symfony's Messenger component) because it fits perfectly with the DDD approach I'm familiar with. For the first release, the API layer and the Twig UI layer are the same application. This could be improved, but later.
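The command-bus idea mentioned above can be sketched in a few lines. The real application uses Symfony's Messenger component in PHP; the snippet below only illustrates the pattern in Go, to stay consistent with the other examples, and the `RegisterDomain` command and `Bus` type are hypothetical names, not PepperReport's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// RegisterDomain is a hypothetical command: a plain message describing
// an intent, as with Symfony Messenger messages.
type RegisterDomain struct {
	OrganizationID string
	Hostname       string
}

// Bus routes each command name to exactly one handler, the core idea
// of a command bus.
type Bus struct {
	handlers map[string]func(interface{}) error
}

func NewBus() *Bus {
	return &Bus{handlers: map[string]func(interface{}) error{}}
}

func (b *Bus) Register(name string, h func(interface{}) error) {
	b.handlers[name] = h
}

func (b *Bus) Dispatch(name string, cmd interface{}) error {
	h, ok := b.handlers[name]
	if !ok {
		return errors.New("no handler registered for " + name)
	}
	return h(cmd)
}

func main() {
	bus := NewBus()
	bus.Register("RegisterDomain", func(c interface{}) error {
		cmd := c.(RegisterDomain)
		fmt.Printf("registering %s for organization %s\n", cmd.Hostname, cmd.OrganizationID)
		return nil
	})
	_ = bus.Dispatch("RegisterDomain", RegisterDomain{OrganizationID: "org-1", Hostname: "example.com"})
}
```

The benefit for a DDD-style application is that the UI and API layers only dispatch commands; all domain logic lives in the handlers.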
The data-app's job is to collect and serve monitoring data. This is the main congestion point: the profitability of the whole service depends on how many endpoints can be monitored each minute. This is why the monitoring probe is written in Go: to take advantage of the concurrency model of goroutines, which is very efficient when the work is dominated by network latency and involves little computation. The probe can handle hundreds of checks in a few seconds. Another goal for the probe was to make it stateless, so it can run as a FaaS (AWS Lambda at the moment). It is triggered by AWS CloudWatch every minute and loads the endpoints to monitor through a REST API (a gRPC version is under development).
For every monitoring check, there are only two possible outcomes:
If there's a failure (no response or a bad response), the probe directly triggers the “micro-failure-service” via Amazon API Gateway. The failure is processed immediately and notifications are sent.
If the response is correct, the speed metrics are sent to another micro-service, the “micro-receiver-service”. To reduce latency and terminate the monitoring Lambda as quickly as possible, it simply saves the data in a Redis list and returns an HTTP 201 response. The data can then be processed asynchronously: previous response speeds are compared and metric trends detected, depending on the alert configuration.
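The receiver's "store fast, process later" behavior can be sketched like this. It is an assumption-laden illustration: the in-memory channel stands in for the Redis list the real service pushes to, and `enqueueMetric` and `receiverHandler` are hypothetical names.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// metricQueue stands in for the Redis list the real micro-receiver-service
// pushes to (e.g. with an LPUSH); a channel keeps this sketch self-contained.
var metricQueue = make(chan string, 1024)

// enqueueMetric appends one raw metric payload to the queue and returns
// the HTTP status the handler should send: 201 Created on success, so the
// monitoring Lambda can terminate as soon as the data is stored.
func enqueueMetric(body string) int {
	select {
	case metricQueue <- body:
		return http.StatusCreated
	default:
		// Queue full: back-pressure instead of blocking the probe.
		return http.StatusServiceUnavailable
	}
}

// receiverHandler reads the payload and answers immediately; trend
// detection happens later, in a separate consumer of the queue.
func receiverHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		w.WriteHeader(http.StatusBadRequest)
		return
	}
	w.WriteHeader(enqueueMetric(string(body)))
}

func main() {
	http.HandleFunc("/metrics", receiverHandler)
	status := enqueueMetric("endpoint=example.com ttfb_ms=120")
	fmt.Println("status:", status, "queued items:", len(metricQueue))
}
```

The key design choice is that nothing expensive happens on the request path: the 201 is sent as soon as the payload is queued, keeping the Lambda's billed duration minimal.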
It’s about time
Saving metrics for thousands of websites, every minute, for many months represents a large amount of data. I quickly decided to use the best tool for the job, a time series database, and read a lot about them.
TSDBs are designed specifically for this task, which gives them significant advantages. One of them (also found in event-store databases) is their internal append-only storage engine: editing an existing value is rarely needed, so the engine is optimized for fast insertion of new points instead.
I chose the most famous TSDB: InfluxDB. It's very powerful, and easy to set up and use. It doesn't require learning special new skills, because it can be queried with a SQL-like language through an HTTP API. There are great client integrations for almost any language, including Go and PHP.
In the future
Lots of things can still be improved or optimized: using gRPC between more micro-services, tuning the queues, and adding more caching are all possibilities. But I will wait for more feedback from our users before doing so. So feel free to join PepperReport and use it as your monitoring tool.
🌶Quickly identify which of your web services are underperforming in terms of speed and availability with our detailed performance reports. 🔥