Real-time Monitoring of Apache Spark Streaming Jobs with Power BI

Patrick Pichler
Creative Data
Apr 26, 2022 · 4 min read

How to push Apache Spark’s streaming metrics into Power BI in real-time

Real-time Dashboard Power BI (Image by Author)

Introduction

Real-time data streaming and integration is increasingly demanded by organizations that want to improve their business and better serve their customers. Compared with traditional batch pipelines, real-time data integration pipelines usually require more up-front simulation and testing. Unified processing engines such as Apache Spark's Structured Streaming API already make it very easy to get started with integrating and processing data in real-time. However, to ensure that those pipelines remain reliable over extended periods of time and under high workloads, a proper monitoring and alerting system is essential. For this reason, a number of integrations for monitoring Apache Spark streaming applications already exist, in addition to the monitoring functionality of the Spark UI itself. Yet, some situations call for something more flexible, for example, monitoring multiple streaming jobs at once or making monitoring more accessible to the business. In this article, we will take a look at how to push data into Power BI in real-time to enable further alerting and dashboarding use cases.

StreamingQueryListener

If you have already worked with Structured Streaming jobs, you have certainly come across the built-in monitoring feature once you kick off a streaming job in Spark/Databricks. The “Raw Data” tab gives you access to the raw metrics in JSON format.

Built-in Spark Streaming Metrics (Image by Author)

Further down, we will capture exactly these streaming metrics in a continuous fashion by using Apache Spark’s StreamingQueryListener class.

Developing custom StreamingQueryListener class

First, we need to extend this class and create our own subclass to meet our requirements. By doing this, we tell it how to parse the streaming metrics and can therefore decide at this point what the data structure should look like.

It is also important to know that the original structure and data of an event might vary between streaming jobs depending on their source type and progress. In this example, we will take only fields that are consistent across all jobs. You can take the following Scala code and paste it into a new notebook cell; a Python/PySpark binding for this class does not exist at the time of this writing.

Custom StreamingQueryListener (Code by Author)
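The embedded gist is not reproduced here, but a minimal sketch of such a listener might look like the following. The class name, the selected progress fields, and the HTTP plumbing are assumptions, and the push URL is a placeholder taken from the Power BI dataset created in the next section:

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Sketch of a custom listener; class name and field selection are assumptions.
class PowerBIStreamingQueryListener(pushUrl: String) extends StreamingQueryListener {

  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: ${event.name} (${event.id})")

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // Guard against NaN rates (e.g. on the very first micro-batch),
    // which would produce invalid JSON.
    def safe(d: Double): Double = if (d.isNaN) 0.0 else d
    // Only fields that are consistent across all streaming jobs.
    val payload =
      s"""{"rows":[{
         |  "queryName": "${p.name}",
         |  "timestamp": "${p.timestamp}",
         |  "batchId": ${p.batchId},
         |  "numInputRows": ${p.numInputRows},
         |  "inputRowsPerSecond": ${safe(p.inputRowsPerSecond)},
         |  "processedRowsPerSecond": ${safe(p.processedRowsPerSecond)}
         |}]}""".stripMargin
    push(payload)
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Query terminated: ${event.id}")

  // POST the rows to the Power BI push dataset's "Push URL".
  private def push(json: String): Unit = {
    val conn = new URL(pushUrl).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    try out.write(json.getBytes(StandardCharsets.UTF_8)) finally out.close()
    conn.getResponseCode // 200 means the rows were accepted
    conn.disconnect()
  }
}

// Placeholder — replace with the "Push URL" of your Power BI push dataset.
val pushUrl = "<your-push-url>"

// Register the listener on the current Spark session.
spark.streams.addListener(new PowerBIStreamingQueryListener(pushUrl))
```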

Attaching custom StreamingQueryListener class

The last line of code adds and registers the new listener to the current Spark context. This allows us to simply attach this notebook to any other notebook that starts a streaming job, which is then automatically monitored. Callbacks are received whenever a query is started or stopped, or whenever progress is made. To be able to easily distinguish between the streaming jobs and assign their metrics, you should also provide a query name via .queryName(queryName).
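In the notebook that starts the stream, this might look as follows. The source, sink, paths, and query name are hypothetical; only the .queryName() call matters for the monitoring setup:

```scala
// Hypothetical streaming query using Spark's built-in "rate" test source;
// the name set via queryName() is what identifies the metrics in Power BI.
val query = spark.readStream
  .format("rate")                                        // built-in test source
  .load()
  .writeStream
  .queryName("rate-demo")                                // name used to assign metrics
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/rate-demo")
  .outputMode("append")
  .start()
```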

Attaching StreamingQueryListener to Notebook (Image by Author)

Power BI Push Dataset

Now that we have agreed upon the data structure we want to push, we can create our Power BI push dataset. At a later stage, this step could also be done automatically as part of the StreamingQueryListener’s onQueryStarted event via the Power BI REST APIs. However, in our example and to keep things short, we will create it by hand upfront with the schema listed below and with historical data analysis turned on. This way, a database is created behind the scenes that continues to store the pushed data as it comes in, which allows us to create reports on top of the real-time data. These reports and their visuals behave just like any other report visuals, which means you can use all of Power BI’s report-building features to create visuals or alerts. Take note of the “Push URL” provided after creating the dataset; we need to paste it into our custom class from above.

Power BI Push Dataset (Image by Author)
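For reference, a push dataset matching the fields captured above could also be created programmatically via POST https://api.powerbi.com/v1.0/myorg/datasets with a request body along these lines. The dataset and table names are assumptions:

```json
{
  "name": "SparkStreamingMetrics",
  "defaultMode": "Push",
  "tables": [
    {
      "name": "QueryProgress",
      "columns": [
        { "name": "queryName", "dataType": "String" },
        { "name": "timestamp", "dataType": "DateTime" },
        { "name": "batchId", "dataType": "Int64" },
        { "name": "numInputRows", "dataType": "Int64" },
        { "name": "inputRowsPerSecond", "dataType": "Double" },
        { "name": "processedRowsPerSecond", "dataType": "Double" }
      ]
    }
  ]
}
```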

Tip: You can reset/empty a push dataset by simply disabling and re-enabling the historic data analysis feature.
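Alternatively, the rows of a push dataset's table can be cleared through the Power BI REST API (the Datasets DeleteRows operation). A hedged sketch, where the dataset ID, table name, and the Azure AD bearer token are placeholders you must supply:

```scala
import java.net.{HttpURLConnection, URL}

// Clears all rows of one table in a push dataset via the Power BI REST API.
// datasetId, tableName and token are placeholders (token = AAD bearer token).
def clearPushTable(datasetId: String, tableName: String, token: String): Int = {
  val url = new URL(
    s"https://api.powerbi.com/v1.0/myorg/datasets/$datasetId/tables/$tableName/rows")
  val conn = url.openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("DELETE")
  conn.setRequestProperty("Authorization", s"Bearer $token")
  val code = conn.getResponseCode // 200 on success
  conn.disconnect()
  code
}
```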

Starting streaming and monitoring queries

Once everything is set up, we are finally ready to start our streaming jobs. If no data arrives on the Power BI side, check the Databricks driver logs or manually debug your custom streaming listener class.

The example dashboard from above is a normal report based on the Power BI push dataset, with automatic page refresh enabled at a 1-second interval.

Conclusion

Throughout this article, we explored how to monitor Databricks Structured Streaming jobs in Power BI, which once again demonstrates how well these two technologies complement each other. A proper monitoring strategy is a major part of building reliable data streaming applications. It not only allows you to respond immediately to errors and get notified about anomalies, but also reveals bottlenecks and room for performance optimization.
