THRON tech blog

How we implemented RED and USE metrics for monitoring

Putting Prometheus, Grafana, RED and USE metrics all together to improve monitoring

What is Monitoring?

Amon/PRTG and why we chose to replace them

  • the monitoring was limited to EC2 servers;
  • we couldn’t use it for application monitoring;
  • it ran on Windows machines, which increased operational and cloud infrastructure costs;
  • it didn’t work well with autoscaling instances;
  • it only sent email alerts and was laborious to integrate with Slack.

Different approaches to monitoring

  • White-box monitoring: based on metrics exposed by the internals of the system. Here we have the complete map of the system: we know every detail about its processes and components, and nothing is hidden or closed to us. It includes logs and interfaces such as the JVM Profiling Interface or HTTP handlers that emit internal statistics. Its success depends on the ability to inspect the innards of the system with the correct instrumentation. White-box monitoring allows the detection of imminent problems, failures masked by retries and, of course, plain failures :-)
  • Black-box monitoring: based on testing externally visible behaviour as a user would see it. We don’t know how the system works internally, but we can report quantitatively when metrics exceed their thresholds or change significantly (a variation of 10%, for better or for worse, is a good reference) under different conditions (time, geographical location, connection method, different computers and/or operating systems, etc.). This methodology shows how users perceive your service, but it usually won’t help you prevent issues from arising.
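The 10% variation reference mentioned for black-box monitoring can be sketched as a small check. This is only an illustration: the function names and the sample response times are hypothetical, and the default threshold simply mirrors the 10% figure quoted above.

```python
def relative_change(baseline: float, current: float) -> float:
    """Return the signed relative change of `current` with respect to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline


def changed_significantly(baseline: float, current: float, threshold: float = 0.10) -> bool:
    """True when the value moved by more than `threshold`, for better or for worse."""
    return abs(relative_change(baseline, current)) > threshold


# Hypothetical response times (in seconds) measured by an external probe.
baseline_latency = 0.200
print(changed_significantly(baseline_latency, 0.210))  # about 5% slower
print(changed_significantly(baseline_latency, 0.260))  # about 30% slower
print(changed_significantly(baseline_latency, 0.150))  # about 25% faster
```

Using the absolute value means that an unexpected improvement is flagged as well as a degradation, which matches the "for better or for worse" wording.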

Our monitoring goals

  • Analyzing long-term trends: how big is my database and how fast is it growing? How quickly is my daily-active user count growing?
  • Alerting: something is broken, and somebody needs to fix it right now! Or, something might break soon, so somebody should check soon.
  • Building dashboards: dashboards should answer basic questions about your service, and normally include some form of the HTTP metrics.
  • Conducting ad hoc retrospective analysis (i.e., debugging): our latency just shot up; what else happened around the same time?

Prometheus or “How to fire up your monitoring”

  • a data model based on time series data identified by metric name and key/value pairs
  • a flexible and powerful query language for aggregating data; results can be aggregated in real time and either shown directly or consumed via the HTTP API so that external systems can display the data
  • no reliance on distributed storage; single server nodes are autonomous
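To make the data model more concrete, here is a minimal sketch in plain Python (it does not use Prometheus itself). The metric name, the label names and the values are hypothetical. It shows how a series is identified by a metric name plus key/value pairs, and how values can be aggregated by one label, which is roughly what a PromQL query such as `sum by (service) (http_requests_total)` would express.

```python
from collections import defaultdict

# Each sample is (metric name, labels, value). All names are hypothetical.
samples = [
    ("http_requests_total", {"service": "api", "status": "200"}, 120.0),
    ("http_requests_total", {"service": "api", "status": "500"}, 3.0),
    ("http_requests_total", {"service": "web", "status": "200"}, 80.0),
]


def series_id(name, labels):
    """Render a series identity in the name{key="value",...} notation."""
    rendered = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{name}{{{rendered}}}"


def sum_by(samples, name, label):
    """Sum the values of one metric, grouped by a single label."""
    totals = defaultdict(float)
    for sample_name, labels, value in samples:
        if sample_name == name:
            totals[labels.get(label, "")] += value
    return dict(totals)


print(series_id(samples[0][0], samples[0][1]))
print(sum_by(samples, "http_requests_total", "service"))
```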

Grafana - The colourful way of reading your data

The Four Golden Signals

  • Latency: the time it takes to service a request;
  • Traffic: a measure of how much demand is being placed on the system;
  • Errors: the rate of requests that fail;
  • Saturation: how “full” our service is, basically how close we are to exhausting system resources;

From the Four Golden signals to the RED way of creating Metrics

  • Rate: the number of requests our service is serving per second;
  • Errors: the number of failed requests per second;
  • Duration: the amount of time it takes to process a request;
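As an illustration, the three RED values can be computed from a list of observed requests. This is a sketch under stated assumptions: the `Request` record, the `red_metrics` function and the choice of a nearest-rank 95th percentile to summarise duration are all hypothetical, not the implementation described in this post.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_seconds: float
    failed: bool


def red_metrics(requests, window_seconds):
    """Compute Rate, Errors and Duration over one observation window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    count = len(requests)
    failures = sum(1 for request in requests if request.failed)
    durations = sorted(request.duration_seconds for request in requests)
    if durations:
        # Nearest-rank 95th percentile, computed with integer arithmetic.
        rank = max(1, (95 * count + 99) // 100)
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    return {
        "rate": count / window_seconds,       # requests per second
        "errors": failures / window_seconds,  # failed requests per second
        "duration_p95": p95,                  # seconds
    }


# Hypothetical traffic: 20 requests in a 10-second window, 2 of them failed.
observed = [Request(0.1, False)] * 18 + [Request(0.9, True)] * 2
print(red_metrics(observed, window_seconds=10))
```

A percentile is used here instead of an average because a few slow requests can hide behind a good mean; other summaries (histograms, several percentiles) would work as well.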

Infrastructure and the USE method

  • Utilization: the proportion of the resource that is used, so 100% utilization means no more work can be accepted;
  • Saturation: the degree to which the resource has extra work which it can’t service, often queued;
  • Errors: the count of error events;
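The three USE values can be sketched for a single resource in the same way. The function name, its parameters and the sample figures are hypothetical; saturation is approximated here by the average length of the work queue, which is one common reading of "extra work, often queued".

```python
def use_metrics(busy_seconds, window_seconds, queue_lengths, error_count):
    """Summarise one resource with Utilization, Saturation and Errors."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # Fraction of the window during which the resource was in use, capped at 100%.
    utilization = min(busy_seconds / window_seconds, 1.0)
    # Average number of work items waiting because the resource could not serve them.
    saturation = sum(queue_lengths) / len(queue_lengths) if queue_lengths else 0.0
    return {
        "utilization": utilization,
        "saturation": saturation,
        "errors": error_count,
    }


# Hypothetical disk: busy for 45 of 60 seconds, four queue samples, one I/O error.
print(use_metrics(busy_seconds=45, window_seconds=60, queue_lengths=[0, 2, 4, 2], error_count=1))
```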

It could be a problem, but not the problem.

Problems encountered during development

Future development: are we happy with our new monitoring?
