Comprehensive look at Azure Databricks Monitoring & Logging

Objective

Inderjit Rana
Microsoft Azure


Recently I delved deeper into Azure Databricks logging and monitoring to provide guidance to a team heading into production with their project, and learned a ton from a variety of sources. The purpose of this blog post is to share my learnings so that you have a comprehensive resource to understand the different logging and monitoring capabilities available, how to enable them and, most importantly, what purpose each serves. I felt all of this information exists but is somewhat scattered, making it harder for folks to get a good grasp on what the possibilities are and what, at a bare minimum, somebody should be doing with costs in mind. It is a common requirement to capture logs and metrics in a centralized store like Azure Monitor so that the information is available for analysis past the lifespan of your Azure Databricks clusters, which can be ephemeral in nature, so that aspect will be covered as well. There are other options like Datadog, Splunk, etc. which are used by customers, but I will stick to Azure-native technologies.

Pre-requisite: a basic understanding of Azure Monitor, which groups a wide set of capabilities to meet end-to-end monitoring and logging needs. Please read the overview if you are not familiar with it — https://docs.microsoft.com/en-us/azure/azure-monitor/overview

Credits: I would like to express my gratitude to my fellow co-worker Tomas Kovarik, who shared quite a bit of information that was very helpful in developing this blog post.

Ganglia Metrics

I classify this as a native monitoring capability available within Azure Databricks without any additional setup. It is a good mechanism to get a live picture of your cluster metrics (CPU, memory, etc.) while your code/application is running, but longer-term retention/availability varies based on whether it is a Jobs or an All-Purpose (interactive) cluster type. If the cluster is no longer running, the cluster metrics are only available as snapshot images, which makes historical analysis quite hard. You can read more on the public docs page here — https://docs.microsoft.com/en-us/azure/databricks/clusters/clusters-manage#--monitor-performance

You can browse to the Ganglia UI from the Metrics tab on the cluster details page for an All-Purpose cluster, and from the Job Runs page for Job clusters, within the Azure Databricks workspace.

Link to Ganglia Metrics for All-Purpose Cluster
Link to Ganglia Metrics for Job Cluster

Spark UI

This is a Spark-native mechanism to dig deeper into optimizing your Spark code, available out of the box without any additional setup; Spark developers are the main audience for the Spark UI. In the case of Jobs Compute, the Spark UI is available after the job has completed, but for All-Purpose clusters this information is cleared out once the cluster is restarted. The Spark UI can be accessed from the cluster details page for an All-Purpose cluster, and the Job Run History page has the link to the Spark UI for Job clusters. You can read more on the public docs page here — https://docs.microsoft.com/en-us/azure/databricks/clusters/clusters-manage#--view-cluster-information-in-the-apache-spark-ui

Azure Databricks Diagnostic Settings

If you are familiar with the Azure ecosystem, you know most Azure services have an option to enable diagnostic logging, where logs for the service can be shipped to a Storage Account, Azure Event Hubs or Azure Monitor (Log Analytics). It is pretty quick and easy to enable diagnostic logs, but it is important to note that this feature is only available with the Azure Databricks Premium plan. You can think of this logging as auditing: it gives you information like who created or deleted a cluster, created a notebook, attached a notebook to a cluster, when a scheduled job started or ended, and a lot more. I usually recommend sending the logs to Azure Monitor because it provides the ability to query these logs easily; the different log types land in separate tables in Azure Monitor, and the names of these tables start with "Databricks" as shown in the screenshot below. It is always a good idea to monitor how much log data is generated, but this is not expected to be an overly large volume.

Diagnostic Log Tables in Log Analytics (Azure Monitor Logs)

Read more on the public docs — https://docs.microsoft.com/en-us/azure/databricks/administration-guide/account-settings/azure-diagnostic-logs
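To give an idea of how these tables can be queried programmatically (the same Kusto also works directly in the Log Analytics query editor), here is a minimal sketch using the azure-monitor-query Python SDK. The workspace ID is a placeholder, and the projected column names are typical for the Databricks diagnostic tables but should be verified against your own workspace schema.

```python
# Minimal sketch: query the Databricks diagnostic tables in Log Analytics.
# Assumes `pip install azure-identity azure-monitor-query` and that the signed-in
# identity has read access to the Log Analytics workspace.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder

# Cluster audit events from the last day; column names are typical for the
# Databricks* tables but verify them in your workspace before relying on them.
QUERY = """
DatabricksClusters
| where TimeGenerated > ago(1d)
| project TimeGenerated, ActionName, Identity, RequestParams
| order by TimeGenerated desc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=1))

for table in response.tables:
    for row in table.rows:
        print(list(row))
```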

Cluster Logs — Cluster Event Logs

As described in the public docs, the cluster event log displays important cluster lifecycle events that are triggered manually by user actions or automatically by Azure Databricks. There is some minor overlap with Diagnostic Settings (events like cluster started or restarted are available in the diagnostic logs as well), but these logs give useful event information such as cluster termination, cluster resizing and, my favorite, whether init script execution succeeded or failed. Unfortunately there is no out-of-the-box method to export these logs from the Azure Databricks workspace to a destination like Azure Monitor. The volume of these logs is not expected to be high, so if you want to export them to Azure Monitor for deeper analysis, auditing or troubleshooting, the following article shares sample scripts where an Azure Function invokes the Azure Databricks REST API to pull the event logs and then posts them to the Azure Monitor Logs API — https://cloudarchitected.com/2019/10/exporting-databricks-cluster-events-to-log-analytics/

Cluster Event Logs

Read more on the public docs — https://docs.microsoft.com/en-us/azure/databricks/clusters/clusters-manage#event-log
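For reference, below is a hedged sketch of the pull side of that approach: calling the Databricks Clusters Events API (/api/2.0/clusters/events) and paging through the results. The host, token and cluster ID are placeholders, and the step of posting the events to the Azure Monitor Logs (Data Collector) API is left to the linked article.

```python
# Hedged sketch: pull cluster event logs via the Databricks REST API so they can
# be forwarded to Azure Monitor (forwarding itself is not shown here).
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                 # placeholder
CLUSTER_ID = "<cluster-id>"                                       # placeholder

headers = {"Authorization": f"Bearer {TOKEN}"}
payload = {"cluster_id": CLUSTER_ID, "limit": 50}
events = []

while True:
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/clusters/events", headers=headers, json=payload
    )
    resp.raise_for_status()
    body = resp.json()
    events.extend(body.get("events", []))
    # The response includes a "next_page" object containing the parameters for
    # the next request when more events are available.
    if "next_page" not in body:
        break
    payload = body["next_page"]

for event in events:
    print(event["timestamp"], event["type"], event.get("details", {}))
```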

Cluster Logs — Spark Driver and Worker Logs, Init Script Logs

These are Spark logs from driver and worker nodes:

  • Driver node logs, which include stdout, stderr and Log4j logs, are available from the cluster details page > Driver Logs tab
  • Worker node logs, which include stdout and stderr, are available from the cluster details page > Spark UI tab > Executors tab
  • Init script logs — these are logs for any scripts configured to run initialization activities on clusters and are useful for troubleshooting cluster initialization errors, but I was not able to track down these logs anywhere unless Cluster Log Delivery (mentioned below) is configured, although they should be available under dbfs:/databricks/init_scripts as per the documentation

There is no easy out-of-the-box method to export these to Azure Monitor, nor is it common to do so; the best thing to do is to configure these logs to be delivered to DBFS using the feature referred to as Cluster Log Delivery (see the sketch below).

Cluster Logs on DBFS

Read more here on the public docs:
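As an illustration, here is a minimal sketch of enabling Cluster Log Delivery when creating a cluster through the Databricks Clusters REST API; the same cluster_log_conf block can also be set in the cluster UI (Advanced Options > Logging) or in the cluster JSON. The host, token, runtime version, node type and DBFS path are placeholder values.

```python
# Minimal sketch: create a cluster with Cluster Log Delivery pointed at DBFS.
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                 # placeholder

cluster_spec = {
    "cluster_name": "logs-demo",
    "spark_version": "10.4.x-scala2.12",   # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",     # placeholder node type
    "num_workers": 2,
    # Driver/worker stdout, stderr and log4j output, plus init script logs, are
    # periodically delivered under this DBFS path (one folder per cluster ID).
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"cluster_id": "..."}
```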

Log Analytics (OMS) Agent for Resource Utilization

This is an easy-to-install agent that collects VM infrastructure-level metrics; it is nothing specific to Azure Databricks, but I will call it out as a basic thing one should definitely do to have visibility into the utilization of the Databricks VMs. You can install the Log Analytics agent on Azure Databricks clusters using init scripts; the following links have the instructions, and a hedged sketch follows below: https://github.com/Azure/AzureDatabricksBestPractices/blob/master/toc.md#Installation-for-being-able-to-capture-VM-metrics-in-Log-Analytics and https://github.com/Microsoft/OMS-Agent-for-Linux/blob/master/docs/OMS-Agent-for-Linux.md
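For illustration only, the sketch below shows one way to stage such an init script from a notebook, following the quick-install pattern in the OMS agent documentation linked above; the DBFS path is just an example location, and the workspace ID/key placeholders are your Log Analytics credentials. Follow the linked guides for the authoritative steps.

```python
# Hedged sketch (run in a Databricks notebook, where dbutils is available):
# write an init script that onboards the Log Analytics (OMS) agent on each node,
# then reference it in the cluster configuration (Advanced Options > Init Scripts).
WORKSPACE_ID = "<log-analytics-workspace-id>"    # placeholder
WORKSPACE_KEY = "<log-analytics-primary-key>"    # placeholder

init_script = f"""#!/bin/bash
# Download and run the documented OMS agent onboarding script.
wget https://raw.githubusercontent.com/Microsoft/OMS-Agent-for-Linux/master/installer/scripts/onboard_agent.sh
sh onboard_agent.sh -w {WORKSPACE_ID} -s {WORKSPACE_KEY}
"""

# Example location only; any DBFS path referenced by the cluster config works.
dbutils.fs.put("dbfs:/databricks/init-scripts/install-oms-agent.sh", init_script, True)
```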

One interesting thing I learned while trying out the Log Analytics agent installation: if you are part of a larger organization that uses Defender for Cloud (formerly Azure Security Center), you might not need to do anything, because installation of the Log Analytics agent might already be enforced by Defender. You can check by browsing to the Defender for Cloud Environment Settings page in the Azure Portal — Defender for Cloud > Environment Settings > find and click your Azure subscription, then select Auto Provisioning. The screenshot below shows the section where you can check whether Log Analytics agent installation is enforced.

Defender for Cloud — Auto Provisioning of Log Analytics Agent

Another observation along the same lines: if installation is done using an init script instead of auto provisioning enforced through Defender for Cloud, the Databricks VMs do not show as connected in the Log Analytics workspace, yet data from the Databricks VMs still shows up in the workspace, so it is good to be aware of this false alarm. This probably has something to do with Azure Databricks provisioning VMs in a managed resource group (which, as a customer, you do not control even though it is in your Azure subscription).

Log Analytics showing Databricks VMs as not connected, although metric collection still works

Sample Kusto Queries for Azure Monitor to analyze common metrics like CPU, Memory, etc. — https://docs.microsoft.com/en-us/azure/azure-monitor/agents/data-sources-performance-counters#log-queries-with-performance-records
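As a concrete example, a query along the lines below (shown as a Python string for consistency with the rest of this post; you can paste the Kusto directly into the Log Analytics query editor) charts average CPU per Databricks VM. The counter names are the usual ones reported by the agent but may need adjusting for your configuration.

```python
# Sample Kusto query for average CPU per Databricks VM from the Perf table
# populated by the Log Analytics agent.
CPU_BY_NODE_QUERY = """
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AvgCpuPct = avg(CounterValue) by bin(TimeGenerated, 5m), Computer
| order by TimeGenerated asc
"""
```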

Spark Monitoring Library

Update Jan 2023 — this library supports Azure Databricks 10.x (Spark 3.2.x) and earlier. It is not supported on Databricks Runtime 11 onwards at this point in time; check the associated GitHub repository link below for the latest information.

This is a very comprehensive library for capturing deeper metrics and logs around Spark application execution (Spark application concepts like jobs, stages, tasks, etc.) into Azure Monitor. Although there might be a slight overlap between the metrics captured by the Log Analytics agent and the Spark Monitoring Library, I consider Spark developers to be the primary audience for this logging method. In my opinion, the metrics captured by the Log Analytics agent can be used to identify opportunities where resource utilization for Databricks is low, and the next level of work should involve either using the Spark UI or capturing finer details of Spark jobs with the Spark Monitoring Library for a short period of time during optimization efforts.

The Spark Monitoring Library can also be used to capture custom application logs (logs from your application code), but if it is used only for custom application logs and nothing else, the OpenCensus Python SDK for logging to Application Insights (see the next section) might be a better fit for that specific need.

Setting up the Spark Monitoring Library is a little more work, and one needs to be careful to set up the appropriate configuration; otherwise it has the potential to capture a lot of data and increase costs. The Spark Monitoring Library is available in the GitHub repository below, with great documentation on how to build the JARs, how to configure it appropriately to filter what gets captured, as well as useful Log Analytics queries. https://github.com/mspnp/spark-monitoring

The Azure Architecture Center has detailed documentation on this as well — https://docs.microsoft.com/en-us/azure/architecture/databricks-monitoring/

Once the library is configured, you will see information flowing into the following three tables in the Log Analytics workspace — SparkMetric_CL, SparkLoggingEvent_CL and SparkListenerEvent_CL.
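As a quick sanity check once data starts flowing, a broad query like the one below (again a Kusto string you can paste into Log Analytics) confirms that all three custom tables are receiving rows; column names inside these _CL tables vary by library version, so start broad before drilling in.

```python
# Sanity-check query: row counts per Spark Monitoring Library table over the
# last hour, grouped by the built-in Type column (the table name).
SPARK_TABLES_CHECK = """
union SparkMetric_CL, SparkLoggingEvent_CL, SparkListenerEvent_CL
| where TimeGenerated > ago(1h)
| summarize Rows = count() by Type
"""
```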

OpenCensus Python Library for Application Insights

Application Insights is part of the Azure Monitor platform and is more commonly used with web applications, but I have found that it has a spot in the Azure Databricks world as well. If your need is simply to capture your own custom application log messages, like tracing statements, exception details, etc., the OpenCensus Python library provides an easy-to-set-up, lightweight method to log this information to Application Insights. The library can be used for other things like metrics as well, but I would use it only for application-level log messages from your own code. The same goal of capturing custom log messages from applications can be accomplished with the Spark Monitoring Library, but that is likely more effort.

A few useful pieces of information I learned when trying this out (a short example follows the list):

  • Use a workspace-based Application Insights resource associated with a Log Analytics workspace (not classic Application Insights); the benefit is that a single Log Analytics workspace then has tables for all the other captured logs as well as these Application Insights logs, so the information can be joined if needed.
  • AppTraces and AppExceptions are the tables in the Log Analytics workspace where these messages are sent (note that these tables were previously called Traces and Exceptions).
  • When using this within Databricks Job clusters, make sure to put a short delay (around 20 seconds) at the end of the notebook so that logs get flushed to Application Insights; the issue is documented here — https://github.com/census-instrumentation/opencensus-python/issues/875#issuecomment-607572786
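Putting those pieces together, here is a minimal sketch of logging to Application Insights with the OpenCensus Azure exporter from a notebook; the connection string is a placeholder, and the trailing sleep reflects the flush issue mentioned in the last bullet.

```python
# Minimal sketch: send custom application logs to Application Insights using the
# OpenCensus Azure exporter (`pip install opencensus-ext-azure`).
import logging
import time

from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger("my_databricks_job")
logger.setLevel(logging.INFO)
logger.addHandler(
    AzureLogHandler(connection_string="<application-insights-connection-string>")
)

# Shows up in AppTraces; custom_dimensions become queryable properties.
logger.info("Job started", extra={"custom_dimensions": {"job_name": "daily_load"}})

try:
    raise ValueError("example failure")
except ValueError:
    # logger.exception includes the stack trace and lands in AppExceptions.
    logger.exception("Something went wrong")

# On short-lived Job clusters, give the exporter time to flush before the
# cluster shuts down (see the GitHub issue linked above).
time.sleep(20)
```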

Relevant Documentation Link — opencensus-python/contrib/opencensus-ext-azure at master · census-instrumentation/opencensus-python (github.com)

Disclaimer

My usual disclaimer: the information here is accurate to the best of my knowledge. If I find inaccuracies, or have items to add as I learn more in this ever-changing world of technology, I will try my best to come back and update this post, but no guarantees, so please make a note of the publish date.
