TIGeR on guard for performance — Part 3

Introduction

Lisa Botty
ITA Labs
9 min read · Oct 31, 2018


In the previous (second) part of the article, we took a closer look at the architecture of the TIGeR components and showed how to install them and set them up in their basic configuration.

In this (third) part, we will describe how to set up Telegraf to read our product’s specific Windows performance counters (PerfCounters), apply load to the tested server (we will feed the TIGeR), show how to display the values from these counters in Grafana, and look at some of Grafana’s fine-tuning options for visualization, variables, and notifications (start “TIGeR training”).

Part 3. Taming the TIGeR

Feeding the TIGeR (applying load)

First of all, a reminder: InfluxDB and the Grafana server should already be up and running (for example, on a separate server where they are always running).

For our test product, we use a load client, which we created specifically for this purpose. You will probably have your own methods to apply load (dedicated tools), but in any case, we will record metrics from the Windows counters:

· The counters that our product creates during the installation process (they cover all the important places in the code)

· System counters (CPU, Memory, Disk, etc.), which are optional, depending on objectives.

At this point, we have to choose the counters to take metrics from. Our configuration already contains the basic system counters (let us leave them there), but we will add our own. To do so, insert the following lines at the very end of the Telegraf configuration file:

[[inputs.win_perf_counters.object]]
  # Your product name
  ObjectName = "Test Engine — Client Manager"
  Instances = ["default"]
  Counters = [
    "Counter name 1",
    "Counter name 2",
    "Counter name …"
  ]
  Measurement = "Test_CM"
  FailOnMissing = true

Replace the following values with your own:

· Test Engine — Client Manager, Counter name …: the category and counter names; you can check them when you add counters in the Windows System Monitor (Performance Monitor) snap-in:

· default: the name of the selected object (counter) instance; an example is shown in the screenshot above

· Test_CM: the name of the measurement (run); in effect, it is the table where the data will be recorded (we reviewed it using the InfluxDB chart in the second part of the article)

· FailOnMissing = true: a debugging parameter (it can be deleted or set to false afterward). It aborts the Telegraf startup if any of the listed counters cannot be found. Otherwise, the agent will start but will not retrieve readings from such “bad” counters. Consequently, nothing will be displayed in Grafana, and it will be unclear whether there is no load, the counter does not work, or there is an issue with Telegraf.
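If you maintain several counter sets, a fragment like the one above can be generated instead of typed by hand. The following is a minimal Python sketch and not part of the original setup: the function name render_counter_object and the sample product and counter names are hypothetical, and the sketch assumes that the names contain no double quotes (it does no escaping).

```python
def render_counter_object(object_name, instances, counters, measurement,
                          fail_on_missing=True):
    """Return a [[inputs.win_perf_counters.object]] fragment as text.

    Assumption: none of the names contain double quotes (no escaping is done).
    """
    def quoted_list(items):
        # Render ["a", "b"] in the inline-array form used by the Telegraf config.
        return ", ".join('"{}"'.format(item) for item in items)

    lines = [
        "[[inputs.win_perf_counters.object]]",
        '  ObjectName = "{}"'.format(object_name),
        "  Instances = [{}]".format(quoted_list(instances)),
        "  Counters = [{}]".format(quoted_list(counters)),
        '  Measurement = "{}"'.format(measurement),
        "  FailOnMissing = {}".format("true" if fail_on_missing else "false"),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical product and counter names, for illustration only.
    print(render_counter_object(
        object_name="My Product",
        instances=["default"],
        counters=["Requests/sec", "Active sessions"],
        measurement="Test_CM",
    ))
```

The printed text can be appended to the end of the Telegraf configuration file in the same way as the hand-written fragment.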

To keep things simple, we are showing an example with one load scenario per autotest run. In real testing, one run may include several scenarios (for example, “Rough working morning”, “Calm day off”, etc.). In such cases, we have to split the data stream by scenario so that we can later distinguish the scenarios of a single run in Grafana (for us, this is a very important point).

For example, we can do it by:

· adding specific tags to the [global_tags] section of the Telegraf agent configuration and using them to group data on the charts:

[global_tags]
  scenario = "morning"

· changing the Measurement (table) name in the same agent configuration and then filtering data by it when plotting the chart:

Measurement = "Test_CM_Morning"
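To show how the tag approach pays off later, here is a minimal Python sketch that builds an InfluxQL query selecting only the points of one scenario. The function name scenario_query, the counter name, and the default time window and bucket size are assumptions made for illustration; only the scenario tag and the Test_CM measurement come from the configuration above.

```python
def scenario_query(measurement, field, scenario, window="1h", bucket="10s"):
    """Build an InfluxQL query averaging one field for a single scenario tag.

    Identifiers are double-quoted and the tag value is single-quoted,
    as InfluxQL requires. Values are inserted as-is (no escaping).
    """
    return (
        'SELECT mean("{field}") FROM "{measurement}" '
        "WHERE \"scenario\" = '{scenario}' AND time > now() - {window} "
        "GROUP BY time({bucket})"
    ).format(field=field, measurement=measurement, scenario=scenario,
             window=window, bucket=bucket)


if __name__ == "__main__":
    # "Counter name 1" is the placeholder counter from the config fragment.
    print(scenario_query("Test_CM", "Counter name 1", "morning"))
```

With the second approach (a separate measurement per scenario), the WHERE clause on the tag is not needed; the measurement name itself, such as Test_CM_Morning, selects the scenario.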

We will touch on this topic in more detail in the next parts of the article when we start automating data collection, taking the product build numbers into account.

Now it is time to apply the load (we run the load client mentioned at the beginning of this part) and then start the Telegraf agent. If you do it the other way round, the agent will return zero values for a while, and that may “spoil” the appearance of the final chart. On the other hand, you may have a scenario where it is important to “capture” the moment when the load starts. So, both start orders may prove useful.

In general, the startup sequence of the TIGeR components looks like this:

While the load is being applied and data is being collected, it is time to fine-tune our test dashboard in Grafana:

· add a new Graph-type panel and open its settings

· select the Data Source (in our case it is Perf_DB, which we created earlier)

· in the query builder below, specify FROM default Test_CM and, in SELECT, the field (counter name)

· other counters can be added in the same builder in the same way (the Add Query button)
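If a panel stays empty, it can help to check that the Test_CM measurement is receiving data at all before looking for the cause in Grafana. The sketch below assumes InfluxDB 1.x, whose HTTP API exposes a /query endpoint that takes db and q parameters; the address http://localhost:8086 and the helper names are assumptions, so adjust them to your installation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, database, influxql):
    """Build a URL for the InfluxDB 1.x HTTP /query endpoint."""
    params = urlencode({"db": database, "q": influxql})
    return "{}/query?{}".format(base_url.rstrip("/"), params)


def fetch_points(base_url, database, influxql, timeout=5):
    """Run the query and return the decoded JSON response (needs a live server)."""
    with urlopen(build_query_url(base_url, database, influxql),
                 timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_points() when InfluxDB is reachable.
    print(build_query_url("http://localhost:8086", "Perf_DB",
                          'SELECT * FROM "Test_CM" LIMIT 5'))
```

If the response to fetch_points contains no series, the problem is on the Telegraf side (for example, a counter that was not found) rather than in the Grafana panel settings.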

We return to the dashboard after that:

At a minimum, this is already enough to see the big picture (see the screenshots above: an example of two load test runs with two charts for two counters).

However, we usually have to adjust something for our specific objectives as well (we set them out, using our own example, in the first part of the article), so now we will start making improvements.

Training the TIGeR (visualization, variables, and alerts)

Let us start with alerts. An alert is a warning generated by a specific trigger and sent via a channel pre-configured in Grafana. There are more than enough channels to choose from:

We will not go through all of them now (mail, instant messengers, webhook, etc.); they are all simple to set up. In our test sample, notifications are sent by email. Note that the SMTP parameters are set in the [smtp] section of the Grafana configuration file, defaults.ini (Grafana’s documentation recommends placing your overrides in custom.ini rather than editing defaults.ini).

Let us look at one example: send an email if the server’s processor utilization exceeds 50% over a certain period of time while the product is being tested.

Select the “CPU Usage” panel on our test dashboard, then select the Alert tab and create a rule:

The arrows indicate which default values were changed:

· the threshold value was set to 50 (%, as on the chart’s vertical scale)

· in the query box (A, 5m, now), we changed 5m to 1m

Here, A is the identifier of the query from the builder (it is displayed in the builder, on the “Metrics” tab, next to the queries), while 5m (1m) and now define the period to evaluate: from 5 minutes (1 minute) ago to the current time.

As a result, with these settings, we will receive an alert at the specified email address if, over each such time interval (from one minute ago to the current time), the average value of the results of query “A” (this panel has a single query, for the CPU) exceeds 50%.

Put simply: the load was applied, one minute passed, and if the average CPU utilization during that minute was higher than 50%, an email will arrive. The same check is repeated every minute, and each time the alert either fires or does not. Certainly, this is just an example. With such settings, alert emails may “bury” us (if the CPU is highly utilized, as often happens during tests). Therefore, you have to choose settings based on experience and on your notification requirements.
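The logic of this rule can be illustrated with a short Python sketch. It is a simplified model of the check described above, not Grafana’s implementation: the function names and sample data are hypothetical, and an empty window is simply treated as “no alert” here.

```python
def average_in_window(samples, now, window_seconds):
    """Average the values whose timestamp lies in [now - window_seconds, now].

    samples is a list of (timestamp_in_seconds, value) pairs.
    Returns None when the window contains no samples.
    """
    values = [value for timestamp, value in samples
              if now - window_seconds <= timestamp <= now]
    if not values:
        return None
    return sum(values) / len(values)


def alert_fires(samples, now, window_seconds=60, threshold=50.0):
    """Model of the rule: the average over the last minute is above 50."""
    average = average_in_window(samples, now, window_seconds)
    # Simplification: an empty window does not raise an alert in this model.
    return average is not None and average > threshold


if __name__ == "__main__":
    # Hypothetical CPU readings: (seconds since start, utilization in %).
    cpu = [(0, 90.0), (70, 40.0), (100, 70.0), (120, 55.0)]
    print(alert_fires(cpu, now=75))   # window holds only the 40.0 reading
    print(alert_fires(cpu, now=120))  # window holds 40.0, 70.0 and 55.0
```

The sketch also shows why a short window is noisy: a single high reading is enough to push a one-minute average over the threshold, whereas a longer window smooths such spikes out.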

All the other alert parameters are easy to understand, so we will not describe them.

In practice, it looks like this: we load the processor of the workstation (where the Telegraf agent is installed) so that it periodically exceeds the 50% utilization level, and soon we receive an alert email titled “[Alerting] CPU Usage alert” with an attached chart snapshot of the time interval in which the alert was triggered:

A vertical red dashed line marks the moment the alert was triggered. A horizontal line shows the alert threshold value (50% in our example).

A crucial point: no variables (like /^$server$/ or $server) should be used in the metric query builder:

Otherwise, the alerts will not work, because the alert query subsystem does not support variables in metrics:

The official bug tracker contains many complaints about this limitation (for example, here), but the developers have not yet lifted it (the limitation is officially commented on here).

This limitation makes it difficult to configure alerts in many scenarios. You will either have to give up alerts on such a chart or avoid using variables in it. For example, suppose we want to conveniently filter data and charts on the dashboard by build number:

We will have to use a variable, for example, “Build”:

Now, if we want to add an alert to this chart, Grafana will return a message “Template variables are not supported in alert queries” (see the example in the screenshot above, Fig 11).
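Before adding an alert to a panel, it can be useful to check whether its query refers to any dashboard variables. The following Python sketch does this for a given list of variable names, covering the $name, ${name} and [[name]] reference forms that Grafana accepts. The function name and the sample query are hypothetical; built-in macros such as $timeFilter are ignored simply because they are not in the list of names passed in.

```python
import re


def find_template_variables(query, variable_names):
    """Return the dashboard variable names that are referenced in the query."""
    found = []
    for name in variable_names:
        escaped = re.escape(name)
        alternatives = [
            r"\$" + escaped + r"\b",               # $Build
            r"\$\{" + escaped + r"[}:]",           # ${Build} or ${Build:format}
            r"\[\[" + escaped + r"(?::\w+)?\]\]",  # [[Build]] or [[Build:format]]
        ]
        if re.search("|".join(alternatives), query):
            found.append(name)
    return found


if __name__ == "__main__":
    # Hypothetical panel query that filters by the "Build" variable.
    query = ('SELECT mean("Counter name 1") FROM "Test_CM" '
             'WHERE "build" =~ /^$Build$/ AND $timeFilter '
             'GROUP BY time($__interval)')
    print(find_template_variables(query, ["Build", "server"]))
```

A panel for which the returned list is not empty will need a separate query with fixed values (or a dedicated panel without variables) if it is to carry an alert.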

Here are some facts about variables in Grafana. They are meant to customize the display of panels and other dashboard visualization units according to predefined filters, for example, data source, station, disk, network adapter name, product build number, and more. Their use is not limited to the dashboard alone. You can find more details about variables in the official documentation.

Variable controls are in the dashboard settings:

Here we can add variables of different types:

In the example shown, we get the server name, drive letter, adapter name, and product build number from the tags of the query results. We can then use these variables in the metric query builder and when filtering the dashboard data (see the screenshots above with the “Build” variable as an example).

At the end of this part of the article, we would like to give an overview of the panels, tables, and other data display units that can be added to a Grafana dashboard:

In fact, there are many more of them than the screenshot shows, because this Grafana feature can be extended with plug-ins; the complete list is here.

Each panel can be flexibly configured for different data display logic. We will get back to this feature when we show how to calculate the number of supported “virtual clients” for our test product in the following parts of the article about automation and load autotests.

As homework, try to “play around” with the built-in panels: add and customize them for your objectives and preferences. You can also find many samples of different types of dashboards and panels, and “play around” with their settings, on the Grafana demo portal.

Summary

In this third part of the article, we described how to configure Telegraf to read the specific counters that track our product, applied load to the test server using a load client, showed how the values from these counters are displayed in Grafana, and looked at some aspects of configuring Grafana’s visualization, variables, and notifications.

Next, in the fourth part, we will start automating the operation of the TIGeR components by connecting the test automation server (“teach TIGeR to act by itself”), and we will configure Telegraf and Grafana to display the performance dynamics of our test product both per test run scenario and per build (“observe TIGeR success”).

Links

TIGeR on guard for performance — Part 1

TIGeR on guard for performance — Part 2

Author: Eugene Klimov, System Engineer, ITA Labs
