Enhancing Network Observability for Azure Databricks Traffic with NSG Flow Logs and Dashboards

Moazzam Saeed
Databricks Platform SME
7 min read · Sep 11, 2024

Introduction

In the rapidly evolving landscape of cloud computing, ensuring robust network security is paramount. This guide introduces Azure Network Security Group (NSG) flow logs, shows how to leverage them to monitor Azure Databricks network traffic, and demonstrates how to build dynamic dashboards in Databricks for enhanced network observability. Whether you’re a network administrator, a security professional, or a data scientist, this blog will equip you with the knowledge to enhance your network’s security and operational efficiency.

What are NSG Flow Logs?

Azure Network Security Groups (NSGs) are essential components that act as firewalls for network interfaces and subnets within an Azure Virtual Network (VNet). They allow or deny network traffic to resources based on security rules. NSG flow logs are records that provide insight into the traffic that NSGs allow or deny. These logs are crucial for:

  • Security Analysis: Understanding and mitigating threats.
  • Compliance Auditing: Ensuring network operations meet security policies and compliance requirements.
  • Network Troubleshooting: Identifying and resolving network issues.

Flow logs store data such as:

  • Source IP
  • Destination IP
  • Port numbers
  • Protocols used
  • Traffic direction (inbound or outbound)
  • The NSG rule that was applied to the traffic
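
To make these fields concrete: in version 2 of the flow log format, each entry is a compact comma-separated “flow tuple”. Here is a sketch of one, with illustrative values:

# An illustrative version 2 NSG flow tuple (a single comma-separated string;
# the values below are made up for demonstration):
flow_tuple = "1568106281,10.0.0.4,13.67.143.118,44931,443,T,O,A,B,,,,"
# Fields, in order: unix timestamp, source IP, destination IP, source port,
# destination port, protocol (T = TCP, U = UDP), direction (I = inbound,
# O = outbound), decision (A = allowed, D = denied), flow state (B = begin,
# C = continuing, E = end), then packet and byte counts in each direction
# (empty while the flow state is B).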

Why Use Azure Databricks for NSG Flow Logs?

Azure Databricks offers a robust platform for handling large datasets such as NSG flow logs. Key advantages include:

  1. Scalability: Handles large volumes of data seamlessly.
  2. Real-time Analytics: Provides capabilities for streaming and real-time data processing.
  3. Collaborative Features: Supports collaborative work across teams with interactive notebooks.
  4. Integrated Dashboards: Allows for the creation of dynamic, interactive visualizations directly from processed data.
  5. Azure Integration: Works seamlessly with Azure Blob Storage, Azure Data Lake, and other Azure services.

Step-by-Step Guide to Using Azure Databricks and NSG Flow Logs

Step 1: Enable NSG Flow Logs for Azure Databricks

1. Access Network Watcher:

  • Open the Azure Portal (https://portal.azure.com/).
  • In the search bar at the top, type “Network Watcher” and select it from the results.

2. Configure NSG Flow Logs:

  • In the Network Watcher blade, find and click on ‘NSG Flow Logs’ under the ‘Logs’ section.
  • Select the subscription for which you want to enable flow logs.
  • You will see a list of NSGs; choose the one associated with your Databricks workspace subnets for monitoring.

3. Set Up Flow Log Settings:

  • Status: Turn the status to ‘On’ to start collecting flow logs.
  • Storage Account: Select an Azure Storage Account where the logs will be stored. Create a new storage account if necessary by following the prompts.
  • Retention Policy: Set the number of days you want the logs to be retained in the storage account. If you set it to 0, logs will be retained indefinitely.
  • Traffic Analytics: Optionally, enable Traffic Analytics for enhanced processing and analysis of flow logs.

Here is a helpful link to the official Azure documentation for setting up NSG flow logs: Configure NSG flow logs.
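
Once enabled, Azure writes the flow logs to the storage account as hourly JSON blobs. The path below is a sketch of the documented layout; the subscription ID, resource group, NSG name, and MAC address segments are placeholders:

# Flow logs land in a container named
# 'insights-logs-networksecuritygroupflowevent', with one PT1H.json blob
# per NSG, MAC address, and hour:
log_path = (
    "insights-logs-networksecuritygroupflowevent/"
    "resourceId=/SUBSCRIPTIONS/<subscription-id>/RESOURCEGROUPS/<resource-group>/"
    "PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/<nsg-name>/"
    "y=2024/m=09/d=11/h=00/m=00/macAddress=<mac-address>/PT1H.json"
)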

Step 2: Set Up Azure Databricks

1. Create a Databricks Workspace:

  • Go to the Azure Portal.
  • Click on “Create a resource”, search for “Azure Databricks”, and select it.
  • Fill out the form by providing the necessary details such as workspace name, subscription, resource group, and location. Choose the pricing tier that fits your needs.
  • Click “Review + Create” and then “Create” to deploy the workspace.

For detailed instructions on creating a Databricks workspace, refer to the Azure documentation: Create an Azure Databricks workspace.

2. Launch the Workspace:

  • Once the deployment is complete, go to the resource.
  • Click on “Launch Workspace” to open the Azure Databricks portal in a new tab.

3. Create a New Notebook:

  • In the Azure Databricks workspace, click on ‘Workspace’ in the sidebar.
  • Select “Create” and then “Notebook” from the dropdown menu.
  • Name your notebook, select the default language as ‘Python’, and choose a cluster. If no cluster exists, you will need to create a new one by selecting “Create Cluster” and following the on-screen instructions.

Step 3: Process NSG Flow Logs

Use the following Python script in your Databricks notebook to read and process NSG flow logs:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

def initialize_spark_session():
    """Initialize and return a Spark session."""
    return SparkSession.builder \
        .appName("NSGFlowLogsParser") \
        .getOrCreate()

def read_nsg_flow_logs(spark, file_path):
    """Read and process NSG flow logs from JSON files."""
    df = spark.read.json(file_path)
    # Flatten the nested structure level by level. Chained selects are
    # needed because an exploded column cannot be referenced in the same
    # `select` that creates it.
    exploded_df = (
        df.select(explode("records").alias("record"))
          .select("record", explode("record.properties.flows").alias("flow"))
          .select("record", "flow", explode("flow.flows").alias("flow_entry"))
          .select("record", "flow", "flow_entry",
                  explode("flow_entry.flowTuples").alias("flow_tuple"))
    )
    # Select the metadata fields and split each flow tuple into its parts
    final_df = exploded_df.select(
        col("record.time").alias("time"),
        col("record.systemId").alias("system_id"),
        col("record.macAddress").alias("mac_address"),
        col("record.category").alias("category"),
        col("record.resourceId").alias("resource_id"),
        col("record.operationName").alias("operation_name"),
        col("record.properties.Version").alias("version"),
        col("flow.rule").alias("rule"),
        col("flow_entry.mac").alias("mac"),
        split(col("flow_tuple"), ",").alias("flow_data")
    ).select(
        "time", "system_id", "mac_address", "category", "resource_id",
        "operation_name", "version", "rule", "mac",
        col("flow_data").getItem(0).alias("timestamp"),
        col("flow_data").getItem(1).alias("src_ip"),
        col("flow_data").getItem(2).alias("dest_ip"),
        col("flow_data").getItem(3).cast("int").alias("src_port"),
        col("flow_data").getItem(4).cast("int").alias("dest_port"),
        col("flow_data").getItem(5).alias("protocol"),
        col("flow_data").getItem(6).alias("traffic_flow"),
        col("flow_data").getItem(7).alias("traffic_accepted"),
        col("flow_data").getItem(8).alias("flow_state"),
        col("flow_data").getItem(9).cast("int").alias("packets"),
        col("flow_data").getItem(10).cast("int").alias("bytes"),
        col("flow_data").getItem(11).cast("int").alias("more_packets"),
        col("flow_data").getItem(12).cast("int").alias("more_bytes")
    )
    return final_df

def save_to_delta_table(df, catalog_schema_table):
    """Save the DataFrame to a Delta table."""
    df.write.format("delta").mode("overwrite").saveAsTable(catalog_schema_table)

def main(file_path, catalog_schema_table):
    spark = initialize_spark_session()
    df = read_nsg_flow_logs(spark, file_path)
    save_to_delta_table(df, catalog_schema_table)
    print(f"Data from {file_path} has been parsed and stored in Delta table {catalog_schema_table}")

# Example usage
file_path = 'abfss://path/to/your/nsg/logs/*.json'
catalog_schema_table = 'your_catalog.schema.your_table'
main(file_path, catalog_schema_table)
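
The batch job above reprocesses every file it matches on each run. If you want the Delta table to stay current as new log blobs arrive, one option is Databricks Auto Loader. The following is a minimal sketch, assuming a flatten_nsg_flow_logs helper (the explode/split logic from read_nsg_flow_logs factored out so it applies to a streaming DataFrame) and placeholder paths you would replace:

# Auto Loader discovers new JSON blobs incrementally; the schema location
# lets it infer and track the log schema across runs.
raw_stream = (
    spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "abfss://path/to/your/schema")
        .load(file_path)
)

# flatten_nsg_flow_logs is a hypothetical helper holding the same
# explode/split logic as read_nsg_flow_logs above.
flattened_stream = flatten_nsg_flow_logs(raw_stream)

(
    flattened_stream.writeStream
        .option("checkpointLocation", "abfss://path/to/your/checkpoint")
        .trigger(availableNow=True)  # process new files, then stop
        .toTable("your_catalog.schema.your_table")
)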

Step 4: Creating Dashboards and Visualizations in Azure Databricks

Visualizing data effectively is crucial for interpreting complex information easily and making informed decisions swiftly. Azure Databricks provides robust tools for creating interactive dashboards that help visualize NSG flow logs. This step-by-step guide will walk you through the process of creating visualizations and dashboards in Azure Databricks, focusing on network traffic data obtained from NSG flow logs.

Creating Visualizations

Once you have processed your NSG flow logs and stored the data in a Delta table, the next step is to analyze this data visually. Azure Databricks allows you to create visualizations directly within the notebooks:

Execute SQL Queries: Use the %sql magic command in Databricks notebooks to run SQL queries directly within a cell. This is useful for interacting with your Delta tables. For example:

%sql
SELECT
  rule,
  COUNT(*) AS Total_Rule_Triggers,
  SUM(CASE WHEN traffic_flow = 'I' THEN 1 ELSE 0 END) AS Inbound_Flows,
  SUM(CASE WHEN traffic_flow = 'O' THEN 1 ELSE 0 END) AS Outbound_Flows,
  MAX(CASE WHEN traffic_flow = 'I' THEN 'INBOUND' ELSE 'OUTBOUND' END) AS Traffic_Flow_Type,
  CONCAT_WS(', ', COLLECT_SET(CAST(dest_port AS STRING))) AS Destination_Ports,
  CONCAT_WS(', ', COLLECT_SET(dest_ip)) AS Destination_IPs,
  CONCAT_WS(', ', COLLECT_SET(src_ip)) AS Source_IPs
FROM <catalog.schema.table> -- insert your table name
GROUP BY rule
ORDER BY Total_Rule_Triggers DESC;
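
The same pattern extends to other questions you might ask of the data. For example, here is a hypothetical follow-up query (again, substitute your own table name) that surfaces the busiest source/destination pairs by bytes transferred:

%sql
SELECT
  src_ip,
  dest_ip,
  SUM(bytes + more_bytes) AS Total_Bytes,
  SUM(packets + more_packets) AS Total_Packets
FROM <catalog.schema.table> -- insert your table name
WHERE flow_state IN ('C', 'E') -- byte counts are only populated once a flow continues or ends
GROUP BY src_ip, dest_ip
ORDER BY Total_Bytes DESC
LIMIT 20;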

1. Visualize Query Results:

  • After running the SQL query, a result table appears below the cell.
  • Just above the table on the right side, there is a bar showing different visualization types (e.g., bar, line, pie, scatter, map).
  • Select a visualization type. For instance, choose a bar chart to visualize Total_Rule_Triggers or a pie chart for Traffic_Flow_Type.

2. Customize Your Visualization:

  • Configure the Plot Options: Customize aspects like the x-axis and y-axis, values, aggregation methods, and colors.
  • Add Titles and Labels: Make sure to add descriptive titles and labels so the visualization is understandable without additional context.

Building Dashboards

After creating individual visualizations, you can combine them into a comprehensive dashboard:

1. Create a New Dashboard:

  • At the top of the notebook, find the “View” option and switch it to “Dashboard”. This mode allows you to manage and organize multiple visualizations on one screen.
  • Click “+ New Dashboard” to create a new one and give it a name.

2. Add Visualizations to the Dashboard:

  • You can add visualizations to the dashboard by clicking the “+” button at the corner of each visualization plot in the notebook.
  • Organize them by dragging and resizing to create a visually appealing and logical layout.

3. Interactivity and Filters:

  • Add Filters: To make the dashboard interactive, you can add global filters that affect multiple visualizations. For example, adding a rule filter allows viewers to select a rule and see all visualizations update to show data relevant to that rule.
  • Dynamic Updates: Ensure that your visualizations are set to update dynamically based on the filters applied. This is critical for real-time monitoring and analysis.

4. Sharing and Collaboration:

  • Share the Dashboard: Once your dashboard is ready, you can share it with other users within your Databricks workspace. This facilitates easy collaboration.
  • Set Permissions: Control who can view or edit the dashboard to maintain data security and integrity.

An example of a dashboard:

You can also build the dashboard with filters on the triggered rules, letting viewers narrow the results to just the information they need.

Conclusion

By following these steps, you have created a dynamic, interactive dashboard in Azure Databricks that visualizes NSG flow logs. This dashboard not only enhances your network observability but also empowers network administrators and security teams to make data-driven decisions efficiently.

Next Steps

  • Refine and Expand: Continuously refine the dashboard based on user feedback and add more metrics as necessary.
  • Automate Updates: Set up automation to refresh the data periodically or in near real time so the dashboard always displays the latest information; Databricks Jobs are a natural fit here (see the sketch below).
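
As one example of automating the refresh, the processing notebook from Step 3 can be scheduled with the Databricks SDK for Python. This is a minimal sketch, assuming the databricks-sdk package is installed and authenticated, and that the notebook path, cluster ID, and cron expression are placeholders you would replace:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Schedule the NSG flow log parsing notebook to run hourly.
job = w.jobs.create(
    name="nsg-flow-logs-refresh",
    tasks=[
        jobs.Task(
            task_key="parse_nsg_logs",
            existing_cluster_id="<your-cluster-id>",  # placeholder
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/path/to/your/notebook"  # placeholder
            ),
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 * * * ?",  # top of every hour
        timezone_id="UTC",
    ),
)
print(f"Created job {job.job_id}")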

Creating effective visualizations and dashboards is a powerful way to understand complex datasets and is essential for managing modern network environments effectively.
