Azure Monitor for Containers — Optimizing data collection settings for cost

Vishwanath
Published in Microsoft Azure
Mar 2, 2020

Azure Monitor for Containers collects a lot of data to monitor Kubernetes clusters effectively. Many customers use most or all of this data, but not everyone needs everything that is collected out of the box for every cluster.

This article explains how to analyze your monitoring data and fine-tune data collection so that data ingestion cost stays optimal for you or your organization, while you still monitor your Kubernetes clusters effectively by collecting only the data you want and use.

The official documentation pages host formal and detailed documentation for all features of Azure Monitor for Containers. This story points to those pages where needed for a better or more advanced understanding.

For all the examples below, we will use a fairly large AKS cluster (200+ nodes and around 2,000 pods) that is monitored by Azure Monitor for Containers.

A few kinds of collected data can cost a lot, and each of them can be customized based on your usage:

1) Container logs (std-out and std-err logs from every monitored container in every k8s namespace in the cluster)
2) Container environment variables (from every monitored container in the cluster)
3) Completed k8s jobs/pods in the cluster (that do not require monitoring)
4) Prometheus metrics that the agent can scrape through its Prometheus extension/integration feature

We will analyze the data for each of the cases above, derive action items, and apply all of the action items at the end of the article.

As a first step, let's query the Log Analytics workspace (to which the monitoring data from the cluster is sent) to understand which data type/table in Log Analytics holds the most ingested data. Below is a Log Analytics query, and its result, showing the data volume ingested by Azure Monitor for Containers (out of the box, without any customizations) to monitor this large AKS cluster in its entirety for the last hour.

All queries are filtered to the last hour so we can estimate the hourly data rate, which you can then extrapolate to a day, a month, and so on. You can also run these queries over longer durations or specific time windows if you suspect more is happening in your cluster at particular times, such as peak hours. Also note that these queries do not filter by cluster ID, as they assume only one cluster sends data to this Log Analytics workspace. If multiple clusters send data to the same workspace, you can filter by _ResourceId, which contains the cluster's resource ID in every record of every table.

union withsource = tt *
| where TimeGenerated > ago(1h)
| where _IsBillable == true
| summarize BillableDataMBytes = sum(_BilledSize)/ (1000. * 1000.) by tt
| render piechart
Figure 1: Billed data volume for the last hour by Table/Type

Figure 1 above shows that this cluster is ingesting around 53 GB per hour, of which the ContainerLog table contributes 51.2 GB (97.6%). At this rate that is roughly 1.2 TB per day. Let's start by analyzing the ContainerLog type.
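Before drilling in: if you do have multiple clusters sending data to the same workspace, the same query can be scoped to a single cluster by filtering on _ResourceId, as mentioned above. A minimal sketch, where the cluster name in the filter is a placeholder you would replace with your own cluster's resource ID (or a unique suffix of it):

union withsource = tt *
| where TimeGenerated > ago(1h)
| where _IsBillable == true
| where _ResourceId endswith "/managedclusters/my-aks-cluster" // placeholder: your cluster's resource ID suffix
| summarize BillableDataMBytes = sum(_BilledSize) / (1000. * 1000.) by tt
| render piechart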

1. Container Logs

Let's analyze the container log data to see which logs are ingested from which namespaces.

First, let's check whether std-out or std-err logs are contributing most to this high log volume:

ContainerLog
| where TimeGenerated > ago(1h)
| where _IsBillable == true
| summarize BillableDataMBytes = sum(_BilledSize)/ (1000. * 1000.) by LogEntrySource
| render piechart
Figure 2: Billed ContainerLog volume by LogEntrySource (std-out/std-err)

As you can see in Figure 2 above, std-out logs account for ~62% of the volume and the rest are std-err logs. Now let's see which k8s namespace(s) are generating this volume.

let startTime = ago(1h);
let containerLogs = ContainerLog
| where TimeGenerated > startTime
| where _IsBillable == true
| summarize BillableDataMBytes = sum(_BilledSize)/ (1000. * 1000.) by LogEntrySource, ContainerID;
let kpi = KubePodInventory
| where TimeGenerated > startTime
| distinct ContainerID, Namespace;
containerLogs
| join kpi on $left.ContainerID == $right.ContainerID
| extend sourceNamespace = strcat(LogEntrySource, "/", Namespace)
| summarize MB=sum(BillableDataMBytes) by sourceNamespace
| render piechart
Figure 3: Billable ContainerLog volume by logsource & k8s namespace

As you can see in Figure 3 above, 62% of the logs are std-out logs from containers in the ‘dev-test’ namespace, and around 31% are std-err logs from the same namespace.

In this specific cluster we never intended to collect container logs from the ‘dev-test’ namespace: it runs developer/test workloads whose logs are not useful to us. We also do not want to collect std-out logs from any namespace in the cluster, since std-out logs can be very chatty.

So, to reduce the ContainerLog data volume, we need to do the following:

Action item 1: Disable std-out log collection across the cluster (i.e., across all namespaces in the cluster)
Action item 2: Disable std-err log collection from the ‘dev-test’ namespace (we still want std-err logs from the other namespaces, such as prod and default, as we use them for troubleshooting and alerting)
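Before applying these changes, you can optionally drill down one level further to confirm which workloads in ‘dev-test’ generate most of the volume. A minimal sketch that reuses the same tables as the queries above (the namespace filter and the top-10 cut are choices you can adjust):

let startTime = ago(1h);
let kpi = KubePodInventory
| where TimeGenerated > startTime
| where Namespace == "dev-test" // adjust to the namespace you are investigating
| distinct ContainerID, ContainerName;
ContainerLog
| where TimeGenerated > startTime
| where _IsBillable == true
| summarize BillableDataMBytes = sum(_BilledSize) / (1000. * 1000.) by ContainerID
| join kpi on $left.ContainerID == $right.ContainerID
| summarize MB = sum(BillableDataMBytes) by ContainerName
| top 10 by MB desc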

Now that we have analyzed container logs, let's move on to the next item: container environment variables.

2. Container Environment Variables

Azure Monitor for Containers periodically collects environment variables from every container it monitors and stores them in the ContainerInventory table. The following query analyzes the size of all environment variables collected in the past hour.

ContainerInventory
| where TimeGenerated > ago(1h)
| summarize envvarsMB = sum(string_size(EnvironmentVar)) / (1000. * 1000.)
Figure 4: Billable volume for container environment variables

As you can see in Figure 4 above, ~48 MB of environment variables per hour are collected and ingested from all containers across the cluster. In our case we don't use environment variables, so there is no need to collect them at all. Your case might vary: you can disable environment variable collection per container, per namespace, or across the entire cluster. Please check the documentation for finer details.
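If you only want to trim this data rather than turn it off entirely, it helps to know which namespaces contribute most. A minimal sketch that breaks the environment variable volume down by namespace, joining on ContainerID in the same way as the earlier queries:

let startTime = ago(1h);
let envVars = ContainerInventory
| where TimeGenerated > startTime
| summarize envvarsMB = sum(string_size(EnvironmentVar)) / (1000. * 1000.) by ContainerID;
let kpi = KubePodInventory
| where TimeGenerated > startTime
| distinct ContainerID, Namespace;
envVars
| join kpi on $left.ContainerID == $right.ContainerID
| summarize envvarsMB = sum(envvarsMB) by Namespace
| render piechart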

Action item 3: Disable environment variable collection across the cluster (applies to all containers in all k8s namespaces)

Let's move on to the next item: completed jobs.

3. Completed Jobs

When you deploy jobs to your cluster, it is good practice to define a cleanup policy. Without one, completed pods remain in the k8s system and have to be cleaned up manually from time to time. These completed pods are also inventoried periodically by Azure Monitor for Containers, which adds to cost.

let startTime = ago(1h);
let kpi = KubePodInventory
| where TimeGenerated > startTime
| where _IsBillable == true
| where PodStatus in ("Succeeded", "Failed")
| where ControllerKind == "Job";
let containerInventory = ContainerInventory
| where TimeGenerated > startTime
| where _IsBillable == true
| summarize BillableDataMBytes = sum(_BilledSize)/ (1000. * 1000.) by ContainerID;
let containerInventoryMB = containerInventory
| join kpi on $left.ContainerID == $right.ContainerID
| summarize MB=sum(BillableDataMBytes);
let kpiMB = kpi
| summarize MB = sum(_BilledSize)/ (1000. * 1000.);
union
(containerInventoryMB),(kpiMB)
| summarize doneJobsInventoryMB=sum(MB)
Figure 5: Billable inventory data from completed jobs

As you can see in Figure 5 above, completed jobs (that need cleanup) contribute ~14 MB of ingested inventory data per hour.
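To see which completed job pods are still lingering in the cluster (and would benefit from a cleanup policy), a query along these lines can help. A minimal sketch against KubePodInventory; the top-20 cut is arbitrary:

KubePodInventory
| where TimeGenerated > ago(1h)
| where ControllerKind == "Job"
| where PodStatus in ("Succeeded", "Failed")
| summarize arg_max(TimeGenerated, PodStatus) by Name, Namespace, ControllerName // latest record per pod
| top 20 by TimeGenerated desc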

Action item 4: Automatically clean up completed jobs (by specifying a cleanup policy in the job definition)

4. Prometheus Scraping/Integration

If you are using the Prometheus integration feature of Azure Monitor for Containers, consider the following to limit the amount of metrics you collect from your cluster:

a) Ensure the scrape frequency is optimal. The default is 60s; you can go finer, say 15s, but make sure the metrics you scrape are actually published at that frequency. Otherwise many duplicate samples are scraped and sent to the Log Analytics workspace at frequent intervals, adding data ingestion and storage cost while providing little extra value.

b) Azure Monitor for Containers supports inclusion and exclusion lists by metric name. For example, if you scrape kubedns metrics in your cluster, hundreds of metrics may be scraped by default even though you probably use only a handful. Make sure you specify the list of metrics to scrape (or exclude all but the few you need) to save on data ingestion volume. It is very easy to enable scraping and never look at most of those metrics, and that costs you in Log Analytics.

c) When scraping through pod annotations, make sure you filter by namespace so that you don't end up scraping pod metrics from insignificant namespaces you don't use (e.g., the dev-test namespace in this article).
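To gauge how much your scraped Prometheus metrics currently contribute, and which metric names dominate, a query along these lines can help. A minimal sketch, assuming the scraped metrics land in the InsightsMetrics table (the top-10 cut is arbitrary):

InsightsMetrics
| where TimeGenerated > ago(1h)
| where _IsBillable == true
| where Namespace contains "prometheus" // scraped Prometheus metrics
| summarize IngestedMB = sum(_BilledSize) / (1000. * 1000.) by Name
| top 10 by IngestedMB desc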

We plan to write a separate article on Prometheus integration with Azure Monitor for Containers and will cover the points above in depth there.

Now that we have our action items, let's see how to make these configuration changes in the Azure Monitor for Containers agent configmap and apply them to the cluster in a few minutes.

Action item 1: Disable std-out log collection across the cluster (i.e., across all namespaces in the cluster)

[log_collection_settings]
  [log_collection_settings.stdout]
    enabled = false

Action item 2: Disable std-err log collection from the ‘dev-test’ namespace (we still want to collect std-err logs from the other namespaces, such as prod and default)

Note: kube-system log collection is disabled by default. We keep that as-is and add the dev-test namespace to the exclusion list for std-err log collection.

  [log_collection_settings.stderr]
    enabled = true
    exclude_namespaces = ["kube-system", "dev-test"]

Action item 3: Disable environment variable collection across the cluster (applies to all containers in all k8s namespaces)

  [log_collection_settings.env_var]
    enabled = false

Action item 4: Clean up completed jobs (by specifying a cleanup policy in the job definition)

Ensure your jobs define an automatic cleanup policy using ttlSecondsAfterFinished in the job spec, for example:

apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-ttl
spec:
  ttlSecondsAfterFinished: 100  # the Job and its pods are deleted 100s after the Job finishes
  template:
    spec:
      restartPolicy: Never
      containers:
      - { name: pi, image: perl, command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"] }

Now that all three agent configuration settings are updated in the agent configmap, let's save it and apply it to the cluster. We will then query the ingested data volume after an hour to see what difference these settings make.

kubectl apply -f <config_map_yaml_file>

After an hour, let's run the same query we started with to see the ingested data volume by table/type for the past hour. The result below shows the hourly ingestion volume after the configmap was applied.

union withsource = tt *
| where TimeGenerated > ago(1h)
| where _IsBillable == true
| summarize BillableDataMBytes = sum(_BilledSize)/ (1000. * 1000.) by tt
| render piechart
Figure 6: Billed data volume for the last hour by Table/Type after applying configmap settings

As you can see in Figure 6 above, the hourly ContainerLog volume went down from 51.2 GB to 4.2 GB (92% less than our first measurement), and the total ingested volume went down from 53 GB per hour to 5.1 GB per hour (91% less after applying the collection configuration settings). We are still collecting logs, but only the ones we want and use, along with the other data that is collected by default.

That's all for now. Hope you found this story useful and actionable! Thanks for reading all the way through 😃

You can contact us through email (askcoin@microsoft.com) with any questions or suggestions you might have.

Helpful Links:

* Azure Monitor for Containers
* Azure Monitor for Containers — FAQ
* Azure Monitor for Containers — Agent data collection settings
* Azure Monitor for Containers — Agent configmap
* Azure Monitor for Containers — Prometheus Integration
* Azure Log Analytics — Managing usage & costs
* Azure Log Analytics — Query language reference
* Azure Monitor
* Azure Kubernetes Service (AKS)
