Leveraging Prometheus to get MySQL database insights

Lisandro Cardoso
Ordergroove Engineering
11 min read · Jul 28, 2021

Even the simplest implementations often have complex interactions: multiple services talking to each other, each of them also talking to one or many backends, databases, or external services. It's a fact that you can't survive (and thrive) in the current state of software development without monitoring, metrics, and eventually alerts. Insight is the keyword here, and more often than not you will find that the tool you initially deemed sufficient, combined with the practices you already have, is not cutting it, leaving you in the dark when you need it the most.

Let's be honest about it: nobody wants to invest a considerable amount of time (and budget) in this, nor does it seem reasonable to have more monitoring services than actual application services. But that's why we have SRE. It's not about implementing the latest trends in monitoring solutions, but about thinking through which ones, and why, and making the result as lightweight as possible. And lightweight is not just a matter of the measurable impact it has on your clusters, but also of how easy it is to maintain, how resilient it is, and how accurate the information it provides is.

In this particular case, the setup was straightforward: k8s- and VM-based application services talking to various backends — in our case, MariaDB (a MySQL fork). From time to time (it happens…) we would get database spikes, and with the tooling we had in place (New Relic) the information was rather partial: it raised a red flag but didn't let us identify the who and the where. We essentially needed more light on the matter. That's where Prometheus came in to fill the insights gap.

Prometheus is an open-source systems monitoring toolkit, and of its many virtues, these are the ones we cared about:

  • Multi-dimensional data model. Think of it as a cube rather than a flat matrix. Normally, when talking about metrics, you have the value itself (e.g., memory usage) over a span of time — that's two dimensions. Prometheus' data model allows faceting over additional keys, such as host name, environment, or user name, which makes it easy to layer and slice related series of the same essential metric for querying and display (see the sketch after this list).
  • The query language. Prometheus provides PromQL, which allows querying metric points with built-in aggregation functions.
  • Flexible backend. We wanted the option of either a hosted backend solution or hosting it ourselves, and to get that aspect out of the way if possible.
  • Simple configuration and discovery of polling targets. Self-explanatory: we wanted to automate as much as possible the installation and configuration of the targets we scrape data from.
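
To make the first two points concrete, here's a hedged PromQL sketch — mysql_global_status_queries is a standard mysqld_exporter counter, and environment and role are labels we attach later in this post:

# Queries per second, averaged over 5 minutes and summed per environment and
# role — the same underlying metric sliced along the extra label dimensions.
sum by (environment, role) (rate(mysql_global_status_queries[5m]))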

We did our research and decided to deploy a Prometheus server from the Helm 3 charts, combined with prometheus-mysql-exporter pods, one for each database we want to monitor. Prometheus gets its data mostly from exporters — intermediate services that gather data from a particular service and present it in a format Prometheus can read and store. (There are other ways Prometheus can ingest data, but they are not covered here.) All in all, we had read and heard great things about it, and it pretty much checked all the boxes. We decided to rely on a hosted backend solution so we wouldn't have to worry about storage, backups, and data cleanups, and so we could leverage its Grafana dashboards — Grafana is a must for solving our problem.

Here's a quick diagram of how things turned out:

Architecture overview

The implementation was pretty straightforward; we're deploying Prometheus into its own namespace: prometheus.

$ kubectl create namespace prometheus

Our values consist of two settings files: values.yaml with the global configuration, and secrets.yaml, sops-encrypted, which holds the access info for the Hosted Graphite backend. We also need a couple of NetworkPolicy k8s resources to allow Prometheus to go outside of the cluster and reach the exporter pods (these may not be needed, depending on your network policies):

# np-prometheus-internet-egress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-internet-egress
  namespace: prometheus
spec:
  podSelector:
    matchLabels:
      app: prometheus
  egress:
    - ports:
        - protocol: TCP
          port: 80
        - protocol: TCP
          port: 443
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 9104
  policyTypes:
    - Egress

# np-prometheus-mysql-exporter-egress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-mysql-exporter-egress
  namespace: prometheus
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: prometheus-mysql-exporter
      ports:
        - protocol: TCP
          port: 9104

$ kubectl apply -f np-prometheus-internet-egress.yaml
$ kubectl apply -f np-prometheus-mysql-exporter-egress.yaml

Installing Prometheus with Helm is as simple as it gets: just edit the values with your preferences and launch it:

helm3 secrets install prometheus prometheus-community/prometheus --namespace prometheus -f secrets.yaml -f values.yaml --version 13.8.0
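
Since secrets.yaml is sops-encrypted and only decrypted on the fly by the helm secrets plugin, the one extra step is encrypting the file before it lands in the repo — a minimal sketch, assuming a .sops.yaml with your key configuration already exists:

# Encrypt the secrets file in place with sops before committing it.
$ sops --encrypt --in-place secrets.yaml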

To verify Prometheus is up, we can port-forward its port to a local port and hit it with a browser:

$ kubectl --namespace prometheus port-forward $POD_NAME 9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090

And at localhost:9090 in our browser we will get the Prometheus dashboard, where we can query metrics as well as check the status of our polling targets.

Prometheus UI
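
If you prefer the command line, Prometheus also exposes simple health and readiness endpoints we can hit through the same port-forward — a quick, hedged check:

$ curl -s http://localhost:9090/-/healthy
$ curl -s http://localhost:9090/-/ready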

Now to the exporters; the procedure is similar: a values file and a secrets file. mysqld_exporter supports many options for what to collect metrics from, as well as discovery settings and metric labeling. It also requires a MySQL user with the proper grants (SELECT, PROCESS, REPLICATION CLIENT) and the userstat setting enabled — it's all in the exporter documentation. Here's our working config for a database master:

image:
  repository: "prom/mysqld-exporter"
  tag: "v0.12.1"
  pullPolicy: "Always"

serviceMonitor:
  # enabled should be set to true to enable prometheus-operator discovery of this service
  enabled: false
  interval: 60s

collectors:
  global_status: true
  global_variables: true
  info_schema.userstats: true
  info_schema.clientstats: true
  info_schema.tablestats: true
  binlog_size: true
  info_schema.processlist: true
  info_schema.query_response_time: true

annotations:
  prometheus.io/scrape: "True"
  prometheus.io/path: "/metrics"
  prometheus.io/port: "9104"

podLabels:
  environment: production
  target: my-db-master

The interesting parts are:

  • collectors, where we explicitly enable what we want to scrape
  • annotations, the way the Prometheus server will discover this exporter and add it to its scraping targets
  • podLabels, which we will use for labeling metrics
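
As mentioned earlier, the exporter needs a MySQL user with the right grants and, for the userstats/clientstats collectors, the userstat variable enabled. Here's a minimal sketch with a placeholder host and password — the canonical statements are in the exporter documentation, and userstat is MariaDB/Percona-specific:

$ mysql -h my-db-master -u root -p <<'SQL'
-- Monitoring user for the exporter (placeholder password).
CREATE USER 'exporter'@'%' IDENTIFIED BY '<password>' WITH MAX_USER_CONNECTIONS 3;
GRANT SELECT, PROCESS, REPLICATION CLIENT ON *.* TO 'exporter'@'%';
-- Needed for the info_schema.userstats / clientstats / tablestats collectors.
SET GLOBAL userstat=ON;
SQL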

The secrets file is simple enough:

mysql:
  user: exporter
  host: my-db-master
  pass: <password>
  port: <port>

And our helm3 command line would be something like:

helm3 secrets install mysql-exporter-my-db-master prometheus-community/prometheus-mysql-exporter --namespace prometheus -f my-db-master/values.yaml -f my-db-master/secrets.yaml

The exporter also exposes a port, so we can port-forward a local port to it:

export POD_NAME=$(kubectl get pods --namespace prometheus -l "target=my-db-master" -o jsonpath="{.items[0].metadata.name}")
kubectl port-forward $POD_NAME 8080:9104
Forwarding from 127.0.0.1:8080 -> 9104
Forwarding from [::1]:8080 -> 9104

Now if we point our browser to localhost:8080/metrics, we should see the raw dump of metrics the exporter is already scraping from the target server.

Raw metrics dump from the exporter

The first thing that stands out here is that these are a lot of metrics — they will probably exceed the quota on my <insert hosted Prometheus backend service here>. No worries: we can get the Prometheus server to parse these nicely and drop whatever metrics we don't actually need. Remember that, for the time being, Prometheus is ingesting all of them and forwarding them to the hosted backend. We don't want that situation to continue for much longer…

First things first — let's see if our Prometheus server actually sees the exporter, by the miracle of annotations. To do so, we need to go to the Targets submenu in the Prometheus UI (remember, at localhost:9090/).

On the Targets page, we should see all the exporters we have configured — if some of them are new or were just added, they may appear as DOWN. No worries: they will be UP in no time, provided they're working properly.

Targets page in Prometheus UI

There it is. We can probably query it too, then (hint: the Graph menu option)…

Querying a metric in Prometheus UI

Nice — we're getting data. Notice the labels on each metric: they make up the faceting and are key to getting a better view of our data. These labels were already added by our custom config, so you may be seeing something different here. Now that we know the setup is working, let's dig into our current (already tuned) Prometheus server values.yaml file and how we implemented regex-based relabeling to drop and relabel metrics. We'll go through it block by block.

server:
  global:
    scrape_interval: 1m
  retention: 30d
  persistentVolume:
    enabled: true
    size: 30Gi

Main server config: the scraping interval, retention, and a local persistent volume so we can do local querying if needed without relying on the remote backend. Remember, we're sending data to Hosted Graphite anyway.

nodeExporter:
  enabled: false
kubeStateMetrics:
  enabled: false
alertmanager:
  enabled: false
pushgateway:
  enabled: false

Disabling things we don't need. nodeExporter would push current node metrics. kubeStateMetrics is for k8s system metrics. Alertmanager is for alerting, clearly — we're not using it. And pushgateway serves as a proxy for publishing any custom metrics you may want.

Now for the main Prometheus configuration; let's also split it into blocks so it's easier to explain:

serverFiles:
  prometheus.yml:
    scrape_configs:
      - job_name: 'og-mysqld-exporter'
        scrape_interval: 60s
        kubernetes_sd_configs:
          - role: service
            namespaces:
              names: [ 'prometheus' ]

Here we're defining the contents of the prometheus.yml file. One interesting part of the Prometheus k8s deployment is that the server pod runs two containers: the Prometheus server itself and a config-monitoring sidecar. If any config file (mapped as a ConfigMap) changes, the sidecar triggers a config reload on the Prometheus server. So here we're defining a global job_name (og-mysqld-exporter), setting its scrape_interval, and not much else.

relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_instance]
    action: keep
    regex: '.*mysql-exporter-(staging|production).*'
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_instance]
    regex: '.*-staging'
    target_label: environment
    replacement: staging
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_instance]
    regex: '.*-production'
    target_label: environment
    replacement: production

Now for the relabel configs. Prometheus supports many transformations on metrics and labels. The documentation can be a bit confusing, to be honest, but sometimes the best way to learn is through examples, trial, and error. Hopefully this helps.

Our first rule's action is keep, but it actually works the other way around: it drops any target that does not match the regex on the specified source_labels.

The second and third rules create a new label, environment, and set it to the replacement string based on the regex match. Note that we could also capture the environment — given that server names are consistent — and use a backreference in the replacement field, as shown below. Backreferences work as one would expect: $1.
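
A hedged sketch of what that would look like — one rule instead of two, capturing the environment from the instance name and reusing it as $1:

  # Same effect as the two rules above, assuming instance names always contain
  # mysql-exporter-staging or mysql-exporter-production:
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_instance]
    regex: '.*mysql-exporter-(staging|production).*'
    target_label: environment
    replacement: '$1'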

  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_instance]
    regex: '.*-my-db-master.*'
    target_label: role
    replacement: "master"
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_instance]
    regex: '.*-my-db-slave-1.*'
    target_label: role
    replacement: "slave"
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_instance]
    regex: '.*-my-db-slave-2.*'
    target_label: role
    replacement: "slave"
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_instance]
    regex: '.*-my-db-slave-([0-9]+).*'
    target_label: instance
    replacement: "slave-$1"

And here are a few examples of backrefs in action: the first three rules relabel by role (master, slave), and the fourth sets the instance label (useful for Graphite/Grafana down the road) to a more readable name — with a single regex, plus a backreference, that matches any slave.

These rules have helped us narrow down our targets and relabel them so they look nice when querying — or in Grafana. But we also need to drop the metrics we're not going to use or don't care about; otherwise we'll be storing tons of them.

metric_relabel_configs:
  - source_labels: [ __name__ ]
    regex: 'mysql_info_schema_((user|client)_statistics_(bytes_(received|sent)|total_(ssl_connections|connections)|rows_(fetched|read|updated)|busy_seconds|cpu_time)|table_statistics_rows_(read|changed)|table_(rows|size)).*'
    action: keep

Again, the keep action instructs Prometheus to drop any metric whose name does not match this beautiful tree-like regex.
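
Before tightening a regex like this, it can help to sanity-check what it will retain by asking the local Prometheus what it currently has — a hedged sketch:

# Series count per metric name in the family we care about; any name missing
# from this list once the relabel rules kick in has been dropped.
count by (__name__) ({__name__=~"mysql_info_schema_.*"})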

Last but not least, the secrets.yaml file holds the credentials for our hosted Prometheus backend service:

serverFiles:
  prometheus.yml:
    remote_read:
      - url: https://prod.promlts.metricfire.com/read
        bearer_token: xxxx
    remote_write:
      - url: https://prod.promlts.metricfire.com/write
        bearer_token: xxxx

The documentation covers far more than these examples show, but most likely, with some ad-hoc config, you'll be up and running in no time. Just saving the config and triggering a helm secrets upgrade will make the config-monitoring sidecar trigger a config reload on the Prometheus server.
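
For reference, the upgrade mirrors the install command from earlier (same values and secrets files), and tailing the server pod's logs is an easy way to confirm the reload happened — a hedged sketch, assuming the chart's default deployment name for a release called prometheus:

$ helm3 secrets upgrade prometheus prometheus-community/prometheus --namespace prometheus -f secrets.yaml -f values.yaml --version 13.8.0
# Container names differ between chart versions, so we just tail them all.
$ kubectl --namespace prometheus logs -f deploy/prometheus-server --all-containers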

Now let's leverage Grafana to get some meaningful — and, why not, nice — graphs. As stated before, we are using Hosted Graphite to store all the metric data, and the service also provides a Grafana UI that can query it. On top of that, we'll import some nice dashboards from Percona; here is a repo with a lot of useful dashboards in JSON format, not only for MySQL, many of which you can use right out of the box. We just take the JSON of the dashboard we want and paste it into the Manage -> Import menu. Once that's done, we'll have something like this:

Grafana dashboard for MySQL User statistics

Note that it may need some fine-tuning in the settings and the panel queries; for the $host variable, we're doing the following:

Setting $host variable from our data

Yes, that’s the instance label we set in Prometheus server relabeling config.
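
For readers without the screenshot: one common way to define such a variable in Grafana's templating settings is a label_values() query over that label — a hedged sketch, not necessarily the exact query in the dashboard:

label_values(instance)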

If you edit any panel to see the actual PromQL, you'll notice that it depends on two functions: rate and irate. They work slightly differently, and you may have to do a bit of reading to decide between one and the other depending on what you're trying to catch. To begin with, it's important to note that most of these MySQL metrics are counters — they move constantly up — and what we're trying to measure is how fast they move over a time interval. irate can be less accurate in the sense that it only uses the last two points of the time series, so depending on the zoom and the resolution, you may or may not see spikes. On the other hand, rate averages over all the points in the interval, producing a much smoother representation. The default query statement for these dashboards relies on rate, only resorting to irate if the first one can't be calculated.
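
The general shape of those panel queries looks like this — a hedged paraphrase rather than a verbatim copy of the dashboards, where $host and $interval are Grafana template variables:

# Prefer rate() over the dashboard interval; fall back to irate() when rate()
# yields no result for that series.
rate(mysql_global_status_bytes_received{instance="$host"}[$interval]) or
irate(mysql_global_status_bytes_received{instance="$host"}[5m])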

The actual PromQL query used to get the graph

In our implementation, we opted for the following dashboards: MySQL User Statistics, MySQL Client Statistics, and MySQL Table Statistics. There are other dashboards, but you may already have monitoring in place for those areas, e.g. CPU and overall system status.

MySQL User statistics dashboard
MySQL Client statistics dashboard
MySQL Table Statistics dashboard

Now, with everything in place, it's a lot easier to pinpoint any irregular activity in the databases. Through labels we have the where (affected database), the who (user name), the source (client IP), and the how fast (the calculated rate), all by putting together information from a couple of dashboards. From here on, you can fine-tune your config or your scraping intervals, and put new dashboards together combining the panels you care about the most — or even go one step further and run your own ad-hoc PromQL queries to get exactly the metrics you need. And why stop at MySQL? Prometheus can do the same for your other services.
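
For example, a hedged ad-hoc query sketch — the exact metric name depends on your exporter version, so check the raw /metrics dump first:

# Top 5 MySQL users by rows read per second in production, using the
# environment label we attached during relabeling.
topk(5, sum by (user, instance) (
  rate(mysql_info_schema_user_statistics_rows_read_total{environment="production"}[5m])
))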

References

  • Prometheus data model: https://prometheus.io/docs/concepts/data_model/
  • Helm charts: https://github.com/prometheus-community/helm-charts
  • Percona Grafana dashboards: https://github.com/percona/grafana-dashboards/tree/PMM-2.0/dashboards
  • Some reading on rate vs irate: https://valyala.medium.com/why-irate-from-prometheus-doesnt-capture-spikes-45f9896d7832
