Transform your system metrics into KPIs with the Rating Operator!

Sihem Cherrared
alter-way-innovation
4 min read · Oct 27, 2021

Rating Operator is a Kubernetes-based application that provides system administrators, DevOps engineers and business-oriented users, such as BI developers, with indicators or KPIs (Key Performance Indicators) derived from metrics.

The need for KPIs is manifold, from saving storage space by aggregating metrics over time and across environments, to providing higher-level or domain-oriented information about ongoing operations. This information can cover cost, energy use or performance analysis, in real time or as a forecast.

The classical workflow for a DevOps engineer is to build their own KPIs by writing PromQL (Prometheus), LogQL (Loki) or InfluxQL (InfluxDB) queries and then visualizing them in Grafana. The Rating Operator API instead helps define transformation rules that target metrics and specify the mathematical operations to perform on them, so as to output a domain-oriented KPI. Moreover, Rating Operator enables users who are not familiar with PromQL to define their own KPIs using existing transformation rule templates, or to create new ones. These templates can be defined by developers and applied by clients of any business domain.

To define KPIs, the Rating Operator API uses three rating rule objects: templates, values and instances. A rating rule template is a PromQL rule defined by the developer. The template contains a list of variables; setting them to specific values (the rating rule values) creates the desired KPI, i.e. the rating rule instance.

Suppose you want to calculate the cost of your Kubernetes nodes if you deployed your infrastructure on AWS using the aws_a1_large flavor.

To do so, you need a PromQL rule that takes the CPU and memory consumption of the Kubernetes nodes and applies the price of the aws_a1_large flavor. To make this PromQL rule reusable for other flavors, you can create a rating rule template using the Rating Operator API:

- query-name : kubernetes-nodes-cost
- query-group : cloud-cost
- query-template : ((ceil(sum(instance:node_memory_utilisation:ratio) * max(node_memory_MemTotal_bytes)/(1024*1024*1024)))/sum({nb_ram}) > (ceil(sum(instance:node_cpu:ratio) * max(instance:node_num_cpu:sum))) / sum({nb_cpu}) or (ceil(sum(instance:node_cpu:ratio) * max(instance:node_num_cpu:sum)))/sum({nb_cpu})) * sum({price})
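
One way to read this query: in PromQL, `a > b or b` keeps `a` when it is larger than `b` and falls back to `b` otherwise, so the expression behaves like a maximum over the memory-based and CPU-based flavor counts, multiplied by the price. The Python sketch below reproduces that arithmetic on plain lists to make the computation concrete. The function name, its arguments and the sample node figures are hypothetical and only stand in for the Prometheus series; this is not how the Rating Operator evaluates the rule.

```python
import math

GIB = 1024 * 1024 * 1024


def flavor_cost(mem_ratios, mem_total_bytes, cpu_ratios, cpu_counts,
                nb_ram, nb_cpu, price):
    """Evaluate the kubernetes-nodes-cost arithmetic on per-node samples.

    Each list holds one value per node (illustrative stand-ins for the
    Prometheus series used in the template).
    """
    # ceil(sum(memory ratio) * max(MemTotal bytes) / GiB)
    used_ram_gib = math.ceil(sum(mem_ratios) * max(mem_total_bytes) / GIB)
    # ceil(sum(cpu ratio) * max(cpu count))
    used_cpus = math.ceil(sum(cpu_ratios) * max(cpu_counts))

    ram_units = used_ram_gib / nb_ram   # flavors needed to cover the memory
    cpu_units = used_cpus / nb_cpu      # flavors needed to cover the CPUs

    # "ram_units > cpu_units or cpu_units" selects the larger of the two.
    return max(ram_units, cpu_units) * price


# Hypothetical cluster: three nodes with 8 GiB and 4 CPUs each.
cost = flavor_cost(
    mem_ratios=[0.5, 0.25, 0.5],
    mem_total_bytes=[8 * GIB] * 3,
    cpu_ratios=[0.5, 0.25, 0.5],
    cpu_counts=[4, 4, 4],
    nb_ram=2, nb_cpu=4, price=0.051,
)
print(f"{cost:.3f}")  # 0.255
```

Here memory dominates: 10 GiB in use divided by `nb_ram = 2` gives 5 units, against 1.25 units for CPU, so the result is 5 times the price.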

The kubernetes-nodes-cost template can then be applied to any AWS flavor. For the aws_a1_large flavor, we specify its size and price as a rating rule values object. The aws_a1_large values are defined as metrics, and their modification history is stored in the database for further use. The Rating Operator API then creates the rating rule instance from the values you specified for the aws_a1_large flavor:

Name: aws-a1-large-flavor
Namespace: rating
API Version: rating.alterway.fr/v1
Kind: RatingRule
Spec:
  Metric: ((ceil(sum(instance:node_memory_utilisation:ratio) * max(node_memory_MemTotal_bytes)/(1024*1024*1024)))/sum(2) > (ceil(sum(instance:node_cpu:ratio) * max(instance:node_num_cpu:sum))) / sum(4) or (ceil(sum(instance:node_cpu:ratio) * max(instance:node_num_cpu:sum)))/sum(4)) * sum(0.051)
  Timeframe: 3600

You can then visualize your results in Grafana or apply other flavors with the same template.
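
The step from template to instance amounts to substituting the values for the template variables. The sketch below illustrates this with Python's `str.format`, which happens to use the same `{name}` placeholder syntax. The helper `render_rule` and the dictionary layout are assumptions made for illustration; the Rating Operator's actual substitution code may work differently.

```python
# Template from the example above; {nb_ram}, {nb_cpu} and {price} are its variables.
TEMPLATE = (
    "((ceil(sum(instance:node_memory_utilisation:ratio)"
    " * max(node_memory_MemTotal_bytes)/(1024*1024*1024)))/sum({nb_ram})"
    " > (ceil(sum(instance:node_cpu:ratio) * max(instance:node_num_cpu:sum)))"
    " / sum({nb_cpu})"
    " or (ceil(sum(instance:node_cpu:ratio) * max(instance:node_num_cpu:sum)))"
    "/sum({nb_cpu})) * sum({price})"
)

# Values as they appear in the aws-a1-large-flavor instance above.
AWS_A1_LARGE = {"nb_ram": 2, "nb_cpu": 4, "price": 0.051}


def render_rule(template: str, values: dict) -> str:
    """Fill the template variables (hypothetical helper).

    Raises KeyError if a variable used by the template has no value.
    """
    return template.format(**values)


metric = render_rule(TEMPLATE, AWS_A1_LARGE)
print(metric)
```

Rendering the same template with another flavor's values yields a second instance without touching the PromQL itself.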

Grouping templates and metrics:

In Rating Operator, templates and metrics are organized into groups to make them easier to reuse. The following figure shows the rating rule template hierarchy for the example above.

Rating Operator templates and metrics grouping

Rating Operator architecture:

The Rating Operator architecture is divided into three main components that interact with Prometheus, Grafana and Kubernetes APIs:

  • Rating Operator API: exposes a list of endpoints to create and modify your KPIs.
  • Rating Operator Manager: watches and updates custom resources, helping the other components fulfil their roles. It is involved in most features of the rating-operator and handles all event-based callbacks.
  • Rating Operator Engine: uses custom resources to configure its workers and trigger rating workloads with a variable timeframe. Each worker holds its own configuration, which can be updated at runtime, and queries, rates and stores data in PostgreSQL through the Rating Operator API.

By default, all the data generated by the rating mechanisms is accessible to the system administrator. If the context requires data segregation, Rating Operator can authenticate your users (tenants) and assign them namespaces. A logged-in tenant can only access data coming from its own namespaces. A Grafana account with the same credentials is created for every tenant registered in the application. Rating Operator provides three ways to authenticate users:

  1. Local authentication using the PostgreSQL database.
  2. Keycloak: an open source identity and access management solution.
  3. Lightweight Directory Access Protocol (LDAP): compatible with LDAP open source solutions such as OpenLDAP.

Rating Operator architecture
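
The namespace-based segregation described above can be pictured as a filter applied to rated data before it is returned to a tenant. The following is a minimal sketch under stated assumptions: `RatedFrame`, `visible_frames` and the sample data are hypothetical names invented for this illustration and are not part of the Rating Operator API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatedFrame:
    """One rated data point (hypothetical shape)."""
    namespace: str
    metric: str
    value: float


def visible_frames(frames, tenant_namespaces, is_admin=False):
    """Return the frames a user may see.

    The administrator sees everything; a tenant only sees frames coming
    from the namespaces assigned to it.
    """
    if is_admin:
        return list(frames)
    allowed = set(tenant_namespaces)
    return [frame for frame in frames if frame.namespace in allowed]


frames = [
    RatedFrame("team-a", "kubernetes-nodes-cost", 0.255),
    RatedFrame("team-b", "kubernetes-nodes-cost", 0.102),
]

tenant_view = visible_frames(frames, tenant_namespaces=["team-a"])
admin_view = visible_frames(frames, tenant_namespaces=[], is_admin=True)
print(len(tenant_view), len(admin_view))  # 1 2
```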

Rating Operator use cases and upcoming features:

Rating Operator ships with a set of ready-to-use dashboards and templates. For instance, the cloud cost comparison dashboard lets you compare the cost of your Kubernetes infrastructure's resource consumption across different IaaS cloud providers. One of the most recently added features is automatic rule detection, which uses machine learning to help you find templates that match your needs.

Follow us on GitHub, we need your feedback: https://github.com/alterway/rating-operator

Rating Operator was made at Alter Way with ❤ by its R&D team and contributors (Jonathan Rivalan, Sihem Cherrared, Phuc Vu, Valentin Daviot).
