GSoC ’17: Work Product

Saumay Agrawal
Aug 27, 2017


During the coding period, I created a plugin to consolidate the time sequences of performance counters reported by the various OSDs of a Ceph cluster into a single place, and to perform statistical calculations (average, standard deviation, etc.) on them. These stats are then sent to the Ceph Dashboard, where I created interactive graphs to visualize the performance of the cluster.

Planning and Prototyping

I started out by identifying the performance metrics, as guided by my mentor Kefu Chai, that would give the dashboard user an overview of the overall performance of the Ceph cluster, for example by highlighting any laggy OSDs. After researching the Ceph documentation, I found out (here and here) that read latency and write latency are the most important performance metrics to visualize on the dashboard. I made some graph mock-ups for these, showing the distribution of latency values for three OSDs along with their average values. The mock-ups can be viewed here. I then shared them on the mailing list for suggestions from users as well as developers of Ceph. I also asked whether there is a fixed latency value that could serve as a benchmark for measuring OSD performance, above which performance would be considered bad.

The most recurring suggestion I received from users and developers was that the minimum and maximum values for a given time instant are also very insightful. Also, a specific latency value cannot be set as a benchmark for performance indication and comparison, because the values vary from hardware to hardware. It is therefore better to show the distribution of latency values of all OSDs against the first few standard deviations around the average. I took these suggestions and made some prototypes of the graphs (they can be viewed here) using the ChartJS library, which also helped me understand how ChartJS works. These prototypes used sample data for each plot; they still needed to be connected to the Ceph mgr to visualize values reported by a live cluster.

Meanwhile, I kept exploring the codebase of the dashboard and the various plugins and libraries it uses. This gave me a good understanding of the codebase by the time I had to implement these prototypes and mock-ups.

Making the Time Sequence Plugin

For this, I made some additions to module.py in the Ceph dashboard directory. This is the Python module that sets up the whole dashboard at localhost:41000 using CherryPy, and uses Jinja2 templating to render data into HTML templates. It has an existing method, get_counter(), which takes a daemon type, daemon name, and performance counter path as its arguments, and returns the corresponding time sequence (20 time instants at 5-second intervals). I used this method in a new method, get_counter_allosd(self, path), which takes a performance counter path as its argument and consolidates the time sequences of that counter for all the OSDs of the cluster into a single JSON structure. I then added another method, get_counter_stats(self, path), which processes the data given by get_counter_allosd() to find the average, standard deviation, minimum, and maximum for each time instant.
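
To make this concrete, here is a minimal sketch of what these two methods might look like. The exact shape of get_counter()'s return value and the get_osd_ids() helper are assumptions made for illustration, not the actual dashboard code:

```python
import math

# Sketch of the two methods described above, written as methods on the
# dashboard module class (class body omitted). Assumptions: get_counter()
# returns {"<daemon>.<path>": [[timestamp, value], ...]}, and a helper
# get_osd_ids() (hypothetical) lists the ids of all OSDs in the cluster.

def get_counter_allosd(self, path):
    """Collect the time sequence of one perf counter for every OSD."""
    data = {}
    for osd_id in self.get_osd_ids():  # hypothetical helper
        series = self.get_counter("osd", str(osd_id), path)
        data[str(osd_id)] = series["osd." + path]
    return data

def get_counter_stats(self, path):
    """Compute avg, stddev, min and max across all OSDs per time instant."""
    all_osd = self.get_counter_allosd(path)
    stats = {"avg": [], "stddev": [], "min": [], "max": []}
    first = next(iter(all_osd.values()))
    for i in range(len(first)):  # walk the ~20 time instants
        ts = first[i][0]
        values = [series[i][1] for series in all_osd.values()]
        avg = sum(values) / len(values)
        var = sum((v - avg) ** 2 for v in values) / len(values)
        stats["avg"].append([ts, avg])
        stats["stddev"].append([ts, math.sqrt(var)])
        stats["min"].append([ts, min(values)])
        stats["max"].append([ts, max(values)])
    return stats
```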

Introducing Graphs on the Dashboard

I added two HTML templates: one to demonstrate the general machinery of the graphs, and another for the read latency and write latency graphs, which shows how the graph templates can be used to add graphs for any performance counter to the dashboard. The read latency and write latency graphs show the average, minimum, and maximum distribution for both of these performance counters, along with the prepare and process latencies for the same. I added corresponding methods, perf_graph_templates(self, path) and osd_perf_graph(self, path), to module.py to expose these HTML templates on the dashboard and render the data into them.
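
In spirit, each of these methods is a CherryPy-exposed handler that renders a Jinja2 template with the consolidated stats. A heavily abridged sketch, in which the template file name and the loader setup are assumptions:

```python
import json
import cherrypy
from jinja2 import Environment, FileSystemLoader

class Module(object):
    """Abridged sketch of the dashboard module's graph endpoints."""

    def __init__(self):
        # The template directory name is assumed for this sketch.
        self.env = Environment(loader=FileSystemLoader("templates"))

    @cherrypy.expose
    def osd_perf_graph(self, path):
        # get_counter_stats() is the method sketched in the previous section.
        stats = self.get_counter_stats(path)
        template = self.env.get_template("osd_perf_graph.html")  # name assumed
        return template.render(counter_path=path,
                               stats_json=json.dumps(stats))
```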

The key features of the graphs are listed below (a sketch of the corresponding ChartJS dataset styling follows the list):

  • The performance counter values for all OSDs are represented by thin white lines. Since there can be several OSDs, I kept the line width for these plots as thin as possible. This avoids cluttering the graphs, makes the more relevant info (average, etc.) more visible, and distinctly shows the OSDs for which value spikes occur.
  • The minimum and maximum performance counter values are both represented by thick green lines. These give upper and lower bounds to the OSD values, aiding their visualization.
  • The average performance counter values are represented by a thick orange line.
  • The standard deviations on both sides of the average value are represented by thick black lines. These give a rough indication of the range of acceptable OSD values.
  • On hovering the mouse pointer over a point plotted on the graph, the exact value for that dataset is shown in a tooltip along with its label. For example, the OSD responsible for a spike in a white line can be identified by hovering over it.
  • If two or more datasets intersect at the same point, i.e. they have the same value for a time instant, the tooltip lists all the values along with their labels.
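
As an illustration of these styling choices, here is a hypothetical helper that assembles the ChartJS "datasets" list server-side before it is serialized into the template. Where this styling actually lives (module.py vs. the template's JavaScript), and all the names below, are assumptions of the sketch:

```python
def build_datasets(stats, osd_series):
    """Build ChartJS dataset configs mirroring the styling described above."""
    datasets = []
    # Thin white lines: one per OSD, kept as thin as possible.
    for osd_id, points in osd_series.items():
        datasets.append({"label": "osd." + osd_id, "data": points,
                         "borderColor": "white", "borderWidth": 1,
                         "fill": False})
    # Thick green lines: per-instant minimum and maximum bounds.
    for key in ("min", "max"):
        datasets.append({"label": key, "data": stats[key],
                         "borderColor": "green", "borderWidth": 3,
                         "fill": False})
    # Thick orange line: per-instant average.
    datasets.append({"label": "avg", "data": stats["avg"],
                     "borderColor": "orange", "borderWidth": 3,
                     "fill": False})
    # Thick black lines: one standard deviation on either side of the average.
    for sign in (+1, -1):
        band = [[t, a + sign * s]
                for (t, a), (_, s) in zip(stats["avg"], stats["stddev"])]
        datasets.append({"label": "avg %s stddev" % ("+" if sign > 0 else "-"),
                         "data": band, "borderColor": "black",
                         "borderWidth": 3, "fill": False})
    return datasets
```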

How to access these graphs

  1. Clone the ceph repository on your system.
  2. Build ceph and run the test cluster.
  3. Point your browser to “localhost:41000” to see the main page of the Ceph dashboard.
  4. Use “localhost:41000/perf_graph_templates/{perf_counter_path}” to view the various graphs for the respective performance counter. For example, use “localhost:41000/perf_graph_templates/osd.op_latency” to view the graphs for the op_latency performance counter.
  5. You can find more such performance counters using “http://localhost:41000/get_perf_schema”, which shows an exhaustive list of performance counters for all the daemon types of Ceph (see the sketch after this list for querying it programmatically).
  6. For the latency graphs, navigate to Cluster > OSDs using the sidebar and click on the link “View OSD Performance Graphs”.
  7. Use the benchmarking commands given in this wiki article to see the graphs in action.
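
For example, the schema endpoint from step 5 can be queried programmatically. A small sketch, assuming the endpoint returns JSON, that the schema maps daemon names to their counters, and that the requests package is installed:

```python
import requests

# Fetch the perf counter schema from the running test cluster's dashboard.
schema = requests.get("http://localhost:41000/get_perf_schema").json()

# Print the counter paths available for OSD daemons
# (the layout of the schema is an assumption of this sketch).
for daemon, counters in schema.items():
    if daemon.startswith("osd"):
        for path in sorted(counters):
            print(daemon, path)
```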

Snapshots of Working Graphs

Final Words

It was a huge challenge for me to manage GSoC alongside my college. My college follows a continuous assessment programme, because of which I had assessments and project reviews quite frequently. Things got very hectic at times, and I found myself working all through the night from time to time.

To sum up my work: under the guidance of my mentors Kefu Chai and John Spray, I created a general machinery that can be used to visualize the statistical distribution of any OSD performance counter in the future, if needed. I would like to thank my mentors for their support and guidance. I hope I have made a worthwhile contribution to Ceph.
