How Grafana and the open source community came together
By Nayana Shetty & Kamran Muniree
A couple of weeks ago we attended GrafanaCon, a two-day event centred around Grafana and the surrounding ecosystem. We've been using Grafana for metrics visualisation and monitoring at the FT for several years now, so it was interesting to find out how other organisations use it and the best practices they recommend.
Highlights
Lots of things made this conference amazing for us.
Here are some of the talks and discussions that we liked the most.
“The RED Method: How To Instrument Your Services” by Tom Wilkie, Grafana Labs:
For system-level monitoring, the Utilisation, Saturation and Errors (USE) method is a methodology for analysing the performance of any system in terms of utilisation, saturation and error rates. It helps engineers build a checklist covering all the resources an application uses, i.e. CPU, memory, disk capacity, disk I/O, interconnects and network. The concept was introduced by Brendan Gregg of Joyent.
However, with the shift towards microservices and self-healing infrastructure, the USE method doesn't map well onto services. Hence the RED method: monitor the request Rate, Errors and Duration for each service. Applying this method to every microservice gives a consistent view of how the entire architecture is behaving, which makes it much easier to scale operations and support (Ops).
These methods are not alternatives for monitoring the same things: the RED method is about how happy our users are, while the USE method is about how happy our machines are. If we could run our machines at 100% utilisation whilst keeping our users happy, that would make for a very cost-effective service.
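To make the RED method concrete, here is a minimal sketch of instrumenting a request handler with rate, error and duration metrics using the Python prometheus_client library; the metric names, labels and handler wrapper are our own illustration, not code from the talk.

```python
import time

from prometheus_client import Counter, Histogram, start_http_server

# The three RED signals, labelled per endpoint (illustrative names).
REQUESTS = Counter(
    "http_requests_total", "Total requests (the R in RED)", ["endpoint"]
)
ERRORS = Counter(
    "http_request_errors_total", "Failed requests (the E in RED)", ["endpoint"]
)
DURATION = Histogram(
    "http_request_duration_seconds", "Request latency (the D in RED)", ["endpoint"]
)


def handle_request(endpoint: str) -> None:
    """Wrap real request handling with RED instrumentation."""
    REQUESTS.labels(endpoint).inc()
    start = time.time()
    try:
        pass  # real request handling would go here
    except Exception:
        ERRORS.labels(endpoint).inc()
        raise
    finally:
        DURATION.labels(endpoint).observe(time.time() - start)


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
```

Prometheus can then scrape the /metrics endpoint, and a Grafana panel can chart rate(http_requests_total[1m]) and latency percentiles in the same way for every service.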
“Inherited Technical Debt — A Tale of Overcoming Enterprise Inertia” by Jordan Hamel, Amgen:
In most enterprise companies there are far too many monitoring tools. They tend to work in silos, lack automation, are inflexible to change and offer few API integration points, and having too many of them breeds "tool fatigue": why do we need to fund yet another tool? At Amgen, they approached this problem by first writing tool-agnostic requirements for monitoring, then using those requirements to categorise the existing tools by capability into Collectors, Aggregators, Visualisers and Alerters (CAVA):
- Collectors: Telegraf agent, Splunk agent, CloudWatch, AppDynamics, StatsD
- Aggregators: Kafka message queue, Elasticsearch (log DB), AppDynamics (MySQL)
- Visualisers: Grafana, Splunk
- Alerters: Grafana, Kapacitor, Attention, OpenNMS (REST interface & SNMP)
This has led to Amgen working on the CAVA API, which lets users request a monitoring setup for their application with a few POST parameters; the API picks the right tools from the CAVA categories and returns a dashboard to the user. It is also integrated with their service registry (ServiceNow, in their case) to help build service-specific dashboards. The API is planned to be open sourced later this year.
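The CAVA API has not been released yet, so the request below is purely hypothetical: a sketch of the kind of call the talk described, with an invented endpoint, invented field names and an invented response shape.

```python
import requests

# Everything here is invented for illustration and will not match
# the eventual open source release.
response = requests.post(
    "https://cava.example.com/api/monitoring",  # hypothetical endpoint
    json={
        "service": "payments-api",             # looked up in ServiceNow
        "collectors": ["telegraf", "statsd"],  # desired CAVA components
        "visualiser": "grafana",
        "alerter": "kapacitor",
    },
)
response.raise_for_status()
print(response.json()["dashboard_url"])  # hypothetical response field
```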
What’s new in Grafana v5.0?
Grafana 5 was launched during GrafanaCon, and here are some highlights of the features that were released.
Dashboards can now be organised into folders, which is really useful for teams with many dashboards. It also means that permissions can be assigned at the folder or dashboard level.
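As a sketch of what this enables, the snippet below creates a folder and grants one team view-only access using Grafana v5's HTTP API; the host, API key and team id are placeholders.

```python
import requests

GRAFANA = "https://grafana.example.com"          # placeholder host
HEADERS = {"Authorization": "Bearer <api-key>"}  # placeholder API key

# Create a folder to group the team's dashboards.
folder = requests.post(
    f"{GRAFANA}/api/folders",
    headers=HEADERS,
    json={"title": "Team Ops dashboards"},
).json()

# Grant one team view-only access (1 = View, 2 = Edit, 4 = Admin).
requests.post(
    f"{GRAFANA}/api/folders/{folder['uid']}/permissions",
    headers=HEADERS,
    json={"items": [{"teamId": 42, "permission": 1}]},
)
```

Note that posting to the permissions endpoint replaces the folder's existing permission list rather than appending to it.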
The new dashboard layout engine also makes it much easier to move and resize panels, as other panels now move out of the way in a very intuitive way. Panels are sized independently, so rows are no longer necessary to create layouts. An example of the new layout can be found below.
Annotations have also been improved: a user can mark points on a dashboard and hover over them to see the event description and tags. The text field can also include links to other systems with further detail.
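As a sketch, an annotation can also be posted over HTTP, for example from a deploy script, using Grafana's annotations API; the host, API key, ids and build link below are placeholders.

```python
import time

import requests

requests.post(
    "https://grafana.example.com/api/annotations",  # placeholder host
    headers={"Authorization": "Bearer <api-key>"},  # placeholder API key
    json={
        "dashboardId": 42,                # placeholder dashboard id
        "panelId": 7,                     # placeholder panel id
        "time": int(time.time() * 1000),  # epoch milliseconds
        "tags": ["deploy"],
        "text": "Released build 1.2.3 (see https://ci.example.com/build/123)",
    },
)
```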
Hallway discussions with Grafana Labs devs:
In our discussions with Grafana Labs, we spoke about how we use Grafana at the FT, where we have dashboards covering servers, Kubernetes, AWS account compliance (using AWS Config rules) and the patch status of all our infrastructure.
We discussed how it might be possible to monitor our networking infrastructure with Grafana, so that we can see things such as internet latency.
Another thing we mentioned is that to view a dashboard a user must be logged in; currently the only way to showcase dashboards on TV screens is by creating snapshots, which are outdated almost immediately. The devs told us this is something they plan to work on, so that dashboards can be shared with specific permissions that don't require a login but still only allow read-only access.
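For reference, the snapshot workaround we use today looks roughly like the sketch below: fetch the dashboard JSON and publish a point-in-time snapshot via Grafana's HTTP API. The host, API key and dashboard uid are placeholders.

```python
import requests

GRAFANA = "https://grafana.example.com"          # placeholder host
HEADERS = {"Authorization": "Bearer <api-key>"}  # placeholder API key

# Fetch the dashboard definition we want to show on the TV screen.
dashboard = requests.get(
    f"{GRAFANA}/api/dashboards/uid/<dashboard-uid>", headers=HEADERS
).json()["dashboard"]

# Publish a point-in-time snapshot; the returned URL needs no login.
snapshot = requests.post(
    f"{GRAFANA}/api/snapshots",
    headers=HEADERS,
    json={"dashboard": dashboard, "expires": 86400},  # expire after a day
).json()

print(snapshot["url"])  # frozen at creation time, hence the limitation
```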
Interesting facts
Some of our favorite anecdotes from the conference:
- We saw Erwin de Keijzer from Snow BV work out how long his washing machine takes for one wash, and the general power usage of his household appliances, using Grafana.
- We heard from four different time series database providers, Graphite, InfluxDB, Timescale and Prometheus, which cater to similar needs but have different query languages: Graphite's functional language, InfluxQL, SQL and PromQL respectively.
- We saw how "Energy Weather" use Grafana for weather, power, market and energy utilisation forecasts. The graph below shows an example of wind and solar power forecasts.
- We also heard about how graphs tell stories, and we could relate this to the time when we had issues with our Graphite servers due to Meltdown and Spectre.
- And finally, our Grafana servers updated automatically to Grafana 5 during its launch, with no issues. Impressive!
Conclusion
For us, the discussions that stood out were about the benefits of open sourcing projects and code that are not core to a company's business, and about the use of Prometheus for monitoring microservices.
The conference was a great experience overall and allowed us to deepen our knowledge of collecting metrics and monitoring. Lastly, it allowed us to explore the possibilities and benefits of open sourcing some of our own projects and code, so that we can contribute back to the world of open source.