METRICS — MONITORING GUIDELINES

Kevin GEORGES
Aug 22, 2018


At Metrics, we collect, process and analyze most of the OVH monitoring data. This is about 300M metrics pushing data points at a steady rate of 1.2M dp/sec.

This data falls into two groups: host monitoring and application monitoring. Host monitoring is based on hardware performance (CPU, memory, network, disk, …), while application monitoring is based on the service and its scalability (REQ, processing, …).

Scollector, Snap, Netdata, Graphite, Collectd, …

We tried most of the collection tools, but we always came to the same conclusion: metrics were bleeding. Each tool focuses on scraping every reachable piece of data. This is great if you are a graph addict, but it can be counterproductive if you have to monitor thousands of hosts operationally.

At OVH, we use laser-cut collections of metrics. Each host has a specific template (web server, database, automation, …) that exports a fitted set of metrics, used for health diagnostics and for monitoring application performance.

Beamium & Noderig — The Perfect Fit

Our requirements:
Scalable: We have to monitor thousands of nodes.
Laser-cut: We only want metrics that are relevant.
Reliable: We want metrics even in the worst conditions.
Simple: We want multiple pluggable components instead of intricate ones.
Efficient: We believe in impact-free metrics collection.

The first shot was Beamium

Beamium handles two aspects: application data scraping and metrics forwarding.

Application data is collected through the well-known and widely used Prometheus format. We chose Prometheus because its community is growing rapidly and many instrumentation libraries are available for it. Getting started is really easy: simply tell Beamium where your metrics are and it will start collecting them right away.
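To make the format concrete, here is a minimal sketch of what an application could expose for scraping. It renders samples in the Prometheus text exposition format using only the standard library. The function names and the example metric are hypothetical, and in practice one of the Prometheus instrumentation libraries would normally do this for you.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline, as the text format requires."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_prometheus(metrics):
    """Render metrics as Prometheus text exposition lines.

    `metrics` maps a metric name to (type, help text, samples), where
    samples is a list of (labels dict, value) pairs. This layout is an
    assumption made for the sketch, not a Beamium or Prometheus API.
    """
    lines = []
    for name, (metric_type, help_text, samples) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        for labels, value in samples:
            if labels:
                label_str = ",".join(
                    f'{key}="{escape_label_value(val)}"'
                    for key, val in sorted(labels.items())
                )
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical application metric, for illustration only.
example = {
    "http_requests_total": (
        "counter",
        "Total HTTP requests served.",
        [({"method": "get", "code": "200"}, 1027)],
    )
}
print(render_prometheus(example), end="")
```

Serving this text over HTTP at a known URL is all a scraper needs in order to pick the metrics up.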

Metrics are first written to disk and then forwarded to a remote backend. This behavior allows recovery from network or backend failures, so graphs are guaranteed to have no holes.
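The disk-first behavior can be pictured as a small store-and-forward spool. The sketch below is an illustration of the idea only, not Beamium's actual implementation: the `DiskSpool` class, the file naming and the `send` callback are all assumptions made for the example.

```python
import os
import time
import uuid


class DiskSpool:
    """Write metric batches to disk first, forward them later."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def write(self, payload):
        """Persist one batch; the rename makes the file appear atomically."""
        name = f"{time.time_ns()}-{uuid.uuid4().hex}.metrics"
        final_path = os.path.join(self.directory, name)
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            handle.write(payload)
        os.replace(tmp_path, final_path)
        return final_path

    def forward(self, send):
        """Send spooled batches; a file is deleted only after a successful send."""
        sent = 0
        for name in sorted(os.listdir(self.directory)):
            if not name.endswith(".metrics"):
                continue  # skip batches that are still being written
            path = os.path.join(self.directory, name)
            with open(path, encoding="utf-8") as handle:
                payload = handle.read()
            try:
                send(payload)  # hypothetical callback that pushes to the backend
            except OSError:
                break  # backend unreachable: keep the files and retry later
            os.remove(path)
            sent += 1
        return sent
```

Because a batch is removed only once `send` returns without error, a network outage delays data points instead of losing them.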

Next came Noderig

Noderig collects OS metrics (CPU, memory, disk and network) using simple level semantics: each collector has a level that controls how much detail it exports. This allows you to collect the right amount of metrics for any kind of host.
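One way to picture level semantics is a table that gives every metric a minimum level, combined with a per-host template that sets one level per collector. The sketch below is illustrative only: the metric names, level numbers and templates are hypothetical and do not reproduce Noderig's real configuration.

```python
# Minimum level at which each metric is exported (hypothetical values).
METRIC_LEVELS = {
    "cpu": {"cpu.usage": 1, "cpu.user": 2, "cpu.system": 2, "cpu.per_core": 3},
    "memory": {"mem.used_percent": 1, "mem.free": 2, "mem.cached": 2},
    "disk": {"disk.used_percent": 1, "disk.iops": 2},
}

# Per-host templates: one level per collector, 0 disables the collector.
TEMPLATES = {
    "web_server": {"cpu": 1, "memory": 1, "disk": 0},
    "database": {"cpu": 2, "memory": 2, "disk": 2},
}


def select_metrics(template):
    """Return the metrics enabled by a template, in table order."""
    selected = []
    for collector, level in template.items():
        for metric, min_level in METRIC_LEVELS[collector].items():
            if level >= min_level:
                selected.append(metric)
    return selected


print(select_metrics(TEMPLATES["web_server"]))
# prints ['cpu.usage', 'mem.used_percent']
```

Raising a single number in the template is enough to get more detail from a collector, which keeps host templates short and easy to review.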

Noderig also supports external collectors, following the Scollector definition. An external collector is a simple executable responsible for gathering data, which Noderig then collects like any other metric.
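As a rough illustration, an external collector can be as small as the script below. It assumes the one-data-point-per-line layout of Scollector's simple format (metric name, Unix timestamp, value, then tags); check the Noderig documentation for the exact output it expects. The metric name and tag are hypothetical.

```python
#!/usr/bin/env python3
import time


def format_point(metric, value, tags, timestamp=None):
    """Format one data point as 'metric timestamp value tag=value ...'.

    The line layout is an assumption based on Scollector's simple format.
    """
    ts = int(time.time()) if timestamp is None else int(timestamp)
    tag_str = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"{metric} {ts} {value} {tag_str}".rstrip()


def main():
    # Hypothetical application metric; a real collector would measure it.
    print(format_point("myapp.queue.size", 42, {"queue": "emails"}))


if __name__ == "__main__":
    main()
```

The collector only writes to standard output, so it can be written in any language and tested by running it by hand.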

Collected metrics are available through a simple REST endpoint, allowing you to see your metrics in real time and to integrate them easily with Beamium.

Is it working?

Beamium and Noderig are used extensively at OVH and support the monitoring of very large infrastructures. At this time, we collect and store more than six million metrics, pushing measurements at a steady rate of 1.2 million per second. So don’t worry, it works ;)

Stay in touch

For any questions, feel free to join our community.
Follow us on Twitter: @OvhMetrics
