Automating Managed Prometheus and Grafana with Terraform for scalable observability on Azure Kubernetes Service and Istio

Published in Microsoft Azure · 4 min read · Apr 14, 2023

In my role at Microsoft I help customers running Istio on AKS. I maintain a repository with the necessary Terraform code to deploy AKS and install Istio.

In the past I used the Prometheus Community Kubernetes Helm Charts to demonstrate how Istio improves the observability of your workload. However, running Prometheus at scale is a challenge. For a small team, being able to use a managed Prometheus and a managed Grafana installation adds a lot of value, because the engineers can focus on the product rather than on the observability platform.

[Figure: Istio Workload Grafana Dashboard, version 1.17.2]

Inspired by Heyko Oelrichs’s article, I published a variant of my work that uses Terraform to automate the following components:

Azure Kubernetes Service
I used the Azure Verified Terraform module for AKS. You will need at least module version 6.8.0, because it contains my PR341, which exposes monitor_metrics so that you can specify a Prometheus add-on profile for the Kubernetes cluster. If you forked an older version of the module, make sure you cherry-pick this change.
Azure Monitor Managed service for Prometheus
As of April 2023, this product is still in preview, so I had to use the AzAPI Terraform provider. If you are interested in how Microsoft makes this product scalable, read this Medium article, which offers a very interesting deep dive.
Azure Managed Grafana
This is a GA product that is also fully supported in the azurerm Terraform provider through the azurerm_dashboard_grafana resource. The hardest part was figuring out the correct role assignments.
Istio
In this context Istio helps with observability: even if your workload is not instrumented, you can scrape the Prometheus endpoint of the Istio sidecar to obtain some networking metrics. The Istio project also publishes Grafana dashboards that make it very easy to consume the metrics emitted by the sidecars.

Automation challenges

The Terraform code is organized in 3 distinct projects, in the folders aks-tf, istio-tf and grafana-dashboards-tf. This means you have to perform 3 terraform apply operations, as explained in the Terraform documentation of the Kubernetes provider. The reason is that you can’t configure the Terraform Grafana provider until the Grafana instance is deployed. In the same way, you cannot configure the Helm and Kubernetes providers until the AKS cluster is deployed. If you instead configure the providers by interpolating attributes of resources created in the same project, intermittent and unpredictable errors can occur, because of the order in which Terraform evaluates the provider blocks and the resources.
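To make the split concrete, here is a minimal sketch of how a second project such as istio-tf could configure the Kubernetes provider from a cluster that already exists. It is an illustration under assumptions, not the code from the repository: the variable names cluster_name and resource_group_name are hypothetical, and it assumes the cluster exposes client certificate credentials in kube_config (a cluster with local accounts disabled would need a different authentication method).

```hcl
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
    kubernetes = {
      source = "hashicorp/kubernetes"
    }
  }
}

# Depending on the azurerm version, the subscription may also have to be
# supplied, for example through the ARM_SUBSCRIPTION_ID environment variable.
provider "azurerm" {
  features {}
}

# Hypothetical inputs: they identify the cluster created by the first project.
variable "cluster_name" {
  type = string
}

variable "resource_group_name" {
  type = string
}

# The cluster was created by an earlier `terraform apply`, so it is read
# through a data source and is fully known when the provider is configured.
data "azurerm_kubernetes_cluster" "aks" {
  name                = var.cluster_name
  resource_group_name = var.resource_group_name
}

provider "kubernetes" {
  host                   = data.azurerm_kubernetes_cluster.aks.kube_config[0].host
  client_certificate     = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].client_certificate)
  client_key             = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].client_key)
  cluster_ca_certificate = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].cluster_ca_certificate)
}
```

Because the data source only reads a cluster that is already deployed, the provider configuration does not depend on anything created in the same apply. The Helm provider can be fed the same four values.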

The challenges in writing this Terraform code were the following:

Understanding that I needed only the ama-metrics-prometheus-config configMap to enable scraping based on the pods’ Prometheus annotations. At the time of writing there are 3 different configMaps that can be configured to change the default settings of the metrics add-on.
I made a very opinionated choice: I assign the Grafana Admin role to the same identity that I use to run Terraform. This way I can log in to Grafana with admin access after the Terraform run is finished. This part of the code needs to be refactored if you plan to run it in a CI/CD pipeline, where the Terraform identity will be different from the identity used to log in to the Grafana dashboard.
When using Azure Managed Grafana you usually give the Monitoring Reader role to the Managed Grafana principal ID. However, when using it in combination with Azure Monitor Managed service for Prometheus you need an additional role, Monitoring Data Reader. I found these 2 roles, with similar but different names, a bit confusing.
To install the Istio dashboards in Grafana I use the Terraform Grafana Provider. To authenticate with Grafana, the provider needs a Grafana API token; it does not support using Azure credentials directly. Because I did not want to deal with storing this secret token, I generate an ephemeral token with a 4-minute expiration time using az-cli, and I pass it to Terraform as an environment variable. If you have a better idea on how to improve this, please write in the comments or propose a PR :)
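The role assignments described in the second and third points above can be sketched roughly as follows. This is a hedged illustration rather than the code from the repository: the three input variables are hypothetical, and the scopes (the subscription for Monitoring Reader, the Azure Monitor workspace for Monitoring Data Reader) are assumptions that you may want to narrow or change.

```hcl
# Hypothetical inputs: in a real project these would come from the Grafana
# and Azure Monitor workspace resources created in the same configuration.
variable "grafana_id" {
  type        = string
  description = "Resource ID of the Azure Managed Grafana instance"
}

variable "grafana_principal_id" {
  type        = string
  description = "Principal ID of the Managed Grafana identity"
}

variable "monitor_workspace_id" {
  type        = string
  description = "Resource ID of the Azure Monitor workspace used by managed Prometheus"
}

# Identity and subscription that Terraform is currently running with.
data "azurerm_client_config" "current" {}
data "azurerm_subscription" "current" {}

# The usual role for Managed Grafana. The subscription scope is an assumption.
resource "azurerm_role_assignment" "grafana_monitoring_reader" {
  scope                = data.azurerm_subscription.current.id
  role_definition_name = "Monitoring Reader"
  principal_id         = var.grafana_principal_id
}

# The additional role needed to query the managed Prometheus data.
resource "azurerm_role_assignment" "grafana_monitoring_data_reader" {
  scope                = var.monitor_workspace_id
  role_definition_name = "Monitoring Data Reader"
  principal_id         = var.grafana_principal_id
}

# The opinionated choice: whoever runs Terraform becomes Grafana Admin.
resource "azurerm_role_assignment" "grafana_admin" {
  scope                = var.grafana_id
  role_definition_name = "Grafana Admin"
  principal_id         = data.azurerm_client_config.current.object_id
}
```

In a CI/CD pipeline the last assignment is the one to refactor: principal_id would point to the user or group that logs in to Grafana instead of the identity running Terraform.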

About mTLS encryption and Observability

Istio makes it easy to enforce mTLS for encryption in transit between your workloads. When using Strict mTLS, Prometheus needs to be configured to scrape using Istio certificates. This is documented on the Istio website, and it applies when you run Prometheus in the same cluster. When using Azure Monitor Managed service for Prometheus, the Istio control plane, gateway, and Envoy sidecar metrics will be scraped over plaintext. To keep the scraping working, you can write a specific PeerAuthentication with a portLevelMtls field that disables mTLS on the scraping port. This example allows the sidecar of the echoserver application to be scraped in plain text:

---
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: allow-scraping-echoserver-sidecar
  namespace: default
spec:
  selector:
    matchLabels:
      run: echoserver
  mtls:
    mode: STRICT
  portLevelMtls:
    15020:
      mode: DISABLE

Conclusion

I have shared my experience using Azure Managed Grafana and Azure Monitor Managed service for Prometheus with Istio to improve observability. The Terraform code that I have shared automates the deployment, and it is distributed under the MIT License, allowing customers to fork and modify the code according to their specific needs. I strongly recommend testing these managed observability offerings, especially when working with small platform teams. The amount of work required to keep these tools updated and secure is not negligible. Unless significant customization is needed, the managed services offer a good deal.

Written by Saverio Proto