Azure Data Lake Usage Monitoring

Gary Strange
Mar 20, 2023


If the pick-n-mix was left unmonitored, would you gobble down a sneaky cola bottle?

In this article, I’ll attempt to surface some of the psychology involved in Enterprise Data Governance challenges.

Yummy!

Pick n Mix

Imagine you’re standing next to the pick-n-mix section at the side of a large cinema foyer. The place is deserted because it's mid-movie, there are no CCTV cameras, and a solitary cola bottle is staring right up at you. Would you gobble it up? It’s so tasty, no one would know, and it’s only pennies in value.

Maybe you would, maybe you wouldn’t. The point is that humans are risk assessment machines, constantly reevaluating risk in order to survive. When there is no threat of being found out, no foreseeable consequence, and an easy reward, humans will act.

Data

If left to their own devices, would these risk assessment machines decide it was ‘ok’ to do something they probably shouldn’t be doing with data to achieve an easy reward? It might be a conscious decision of malicious intent, carelessness, or even a lack of understanding. If they understood that they could be held to account, would that change the decisions they make? Would they be more careful? Would they seek advice before proceeding with an action they don’t fully understand?

This matters even more now that we have Data Lakes and information-driven organisations. Every team and every individual within the organisation is empowered to do great things with data, and we want the enterprise to experiment with a rich reservoir of data at its fingertips. A pessimistic approach is to serve data on a need-to-know basis: prove you understand and need to use this data, prove it is for a good cause, and then you can have access. In other words, you are made to account for your actions before you take them. A more optimistic proposal is to encourage expedited data discovery and frictionless data access, whilst retaining governance by making accountability an active, continuous, observable process.

Let’s park those thoughts for a moment and get into some technology.

Who, What, When, Where, Why

The five W’s help define factual information. If we’re able to record these five related attributes at the time data is being interacted with, then we can establish data consumption and usage facts.

I will show you how I was able to use the logging data built into Azure Data Lake Storage Gen2 to gain greater observability over data access.
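
To make the five W’s concrete before we get into the detail, here is a rough sketch of how I think about mapping each W onto the diagnostic log fields that appear later in this article. The mapping is my own interpretation; notably, ‘why’ is not something the logs can capture.

# A rough mapping of the five W's onto Azure Data Lake Storage diagnostic log fields.
# The field names match the log schema defined later in this article; 'why' is not
# recorded in the logs and has to come from the surrounding governance process.
five_ws = {
    "who":   ["userprincipalname", "userobjectid", "applicationid"],
    "what":  ["operationtype", "requestedobjectkey"],
    "when":  ["requeststarttime"],
    "where": ["requesteripaddress", "requesturl"],
    "why":   None,  # intent is not captured by the platform logs
}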

Here is a screenshot of one of the Power BI visualisations generated from the Data Lake diagnostic log I was able to construct.

Data Lake log information visualised in Power BI

Let me explain the elements of the visual. In the left panel, I have all the containers within the Data Lake listed. For each container, there is a stacked bar in which each coloured segment represents the number of blob read requests made to a sub-folder of the container. Each coloured segment carries an integer value in white text, representing the number of active user principals in the period.

So for the ‘salesorderservice’ container ( red box ), there was a total of 265 million read requests. Within the ‘salesorderservice’ container I have seven coloured segments representing sub-folders (datasets) within the container. The selected solid green segment, which represents one of the datasets, has the highest number of active user principals (31).

On the far right ( yellow box ) I have a legend of all the sub-folders within the ’salesorderservice’ container and the graph in the centre of the screen shows the related read requests trend over time.

At the bottom of the visual ( blue box ), there is a table of user principals and the number of read requests made against the sub-folders (datasets) in the ‘salesorderservice’ blob container. In the smaller blue box, you can see the number of read requests I made against the order booked dataset ( 156750 requests ).

Processing Data Lake Logs

A ton of useful information is automatically captured by the Azure Data Lake Storage service and stored in the special system container $logs.

Take a look at this link to understand just how much is getting captured and stored away for diagnostics and intelligence: https://learn.microsoft.com/en-us/rest/api/storageservices/storage-analytics-logged-operations-and-status-messages#logged-operations

It's possible to use a fairly simple architecture to process the logs and develop incredibly insightful dashboards.

Basic Azure Data Lake Log Processing Architecture

At the heart of this solution, I have a fairly simple Databricks notebook written in a combination of PySpark and SQL. I use PySpark for the imperative and functional code operations, and I use the might of SQL to declare how columns of data will be crunched.

The output of the data crunching is stored in a blob storage account, and Power BI connects to that account to acquire data for the visuals. Blob storage is a good choice here for a few reasons: it’s cheap, it's reliable ( check the SLAs ), and we only need to refresh the data periodically throughout the day. An alternative would be to query hive/Unity Catalog objects in Databricks directly from Power BI, but that would mean the data crunching only happens at Power BI refresh time. I want to pre-crunch the data on a schedule.
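
To illustrate the ‘pre-crunch on a schedule’ point, here is a hedged sketch of what a scheduled Databricks job definition might look like when posted to the Jobs API ( /api/2.1/jobs/create ). The notebook path, cluster reference and cron expression are hypothetical placeholders, not the actual job I run.

# Hypothetical Databricks Jobs API 2.1 payload: run the log-crunching notebook every four hours.
# Notebook path, cluster id and schedule are illustrative placeholders.
job_definition = {
    "name": "adls-log-processing",
    "tasks": [
        {
            "task_key": "crunch_logs",
            "notebook_task": {"notebook_path": "/Shared/adls-log-processing"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 */4 * * ?",  # every four hours
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
}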

Whilst the Azure Data Lake logs give us a lot of useful information, they don’t provide human-friendly security principal identifiers. So we need Microsoft Graph to tell us the user-friendly identity of the principals appearing in the logs.

Databricks Notebook

Microsoft has documented the Azure Data Lake logs CSV schema, which I converted to a struct type that can be used to parse the log data.

from pyspark.sql.types import StructType, StructField, LongType, StringType, ArrayType, TimestampType

diagnosticLogRecordv2 = StructType([
StructField('versionnumber',StringType(),True),
StructField('requeststarttime',TimestampType(),True),
StructField('operationtype',StringType(),True),
StructField('requeststatus',StringType(),True),
StructField('httpstatuscode',StringType(),True),
StructField('endtoendlatencyinms',StringType(),True),
StructField('serverlatencyinms',StringType(),True),
StructField('authenticationtype',StringType(),True),
StructField('requesteraccountname',StringType(),True),
StructField('owneraccountname',StringType(),True),
StructField('servicetype',StringType(),True),
StructField('requesturl',StringType(),True),
StructField('requestedobjectkey',StringType(),True),
StructField('requestidheader',StringType(),True),
StructField('operationcount',StringType(),True),
StructField('requesteripaddress',StringType(),True),
StructField('requestversionheader',StringType(),True),
StructField('requestheadersize',StringType(),True),
StructField('requestpacketsize',StringType(),True),
StructField('responseheadersize',StringType(),True),
StructField('responsepacketsize',StringType(),True),
StructField('requestcontentlength',StringType(),True),
StructField('requestmd5',StringType(),True),
StructField('servermd5',StringType(),True),
StructField('etagidentifier',StringType(),True),
StructField('lastmodifiedtime',StringType(),True),
StructField('conditionsused',StringType(),True),
StructField('useragentheader',StringType(),True),
StructField('referrerheader',StringType(),True),
StructField('clientrequestid',StringType(),True),
StructField('userobjectid',StringType(),True),
StructField('tenantid',StringType(),True),
StructField('applicationid',StringType(),True),
StructField('audience',StringType(),True),
StructField('issuer',StringType(),True),
StructField('userprincipalname',StringType(),True),
StructField('reservedfield',StringType(),True),
StructField('authorizationdetail',StringType(),True)
])

For simplicity of parsing, I have changed a few of the field types to string. Log data typically sits in the semi-structured category, so the decision to loosely type using string was tactical. Looking back, the log data conforms to a well-structured, seldom-changing schema, so perhaps this tactic wasn’t necessary.
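
If you did want stricter types, the latency and count fields could be declared numerically instead. A minimal sketch, with the choice of LongType being my assumption about what those columns will always parse as:

from pyspark.sql.types import StructField, LongType

# A stricter variant of a few of the fields above, if typed columns are preferred over strings.
strict_fields = [
    StructField('endtoendlatencyinms', LongType(), True),
    StructField('serverlatencyinms', LongType(), True),
    StructField('operationcount', LongType(), True),
]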

The next job is to configure access to the $logs container. An Azure Service Principal is used to authenticate.

def setBlobOAuthConfig(storageaccountname: str, applicationid: str, directoryid: str, secret: str):
    spark.conf.set(f"fs.azure.account.auth.type.{storageaccountname}", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{storageaccountname}", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{storageaccountname}", f"{applicationid}")
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storageaccountname}", f"{secret}")
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storageaccountname}", f"https://login.microsoftonline.com/{directoryid}/oauth2/token")

The service principal used needs to be able to read from the $logs container. This is done by assigning the Storage Blob Data Reader role.

Apply the Service Principal RBAC role

The prepared schema is then used to read and parse the semi-colon-separated log data. A temp view is created ready to crunch the data in SQL.
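
For reference, logpath in the snippet below points at the $logs system container. A minimal sketch of what it might look like, assuming the classic Storage Analytics layout of $logs/blob/YYYY/MM/DD/hhmm/ and access over the ABFS endpoint:

# Hypothetical logpath: blob service logs for March 2023 in the $logs container.
# The month value and wildcard depth are assumptions for illustration.
storageaccount = "gsstorage"
logpath = f"abfss://$logs@{storageaccount}.dfs.core.windows.net/blob/2023/03/*/*/*.log"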

logdata = spark.read.option("sep", ";").schema(diagnosticLogRecordv2).csv(logpath)
logdata.createOrReplaceTempView("adlslogs")
display(logdata)

We don’t want to load millions of records into Power BI and aggregate them there. We can employ a performant distributed processing engine like Spark to crunch the big data into summarized data.

create or replace temporary view ContainerReadCount
as
select
  requestStartDate,
  requestYYMM,
  concat(dataset, ' (', storage_container, '-', lakezone, ')') as container,
  userprincipal,
  count(1) as ReadFileCount
from
(
  select
    *,
    to_date(requeststarttime) as requestStartDate,
    left(getArgument('job_start_date'), 7) as requestYYMM,
    split(requestedobjectkey, '/')[2] as storage_container,
    split(requestedobjectkey, '/')[3] as lakezone,
    split(requestedobjectkey, '/')[4] as dataset,
    coalesce(userprincipalname, serviceprincipalname, userobjectid) as userprincipal
  from adlslogs
) x
where operationtype = 'ReadFile'
and split(requestedobjectkey, '/')[2] NOT LIKE "%logs%"
group by requestStartDate, requestYYMM, container, userprincipal
order by requestStartDate

The ‘requestedobjectkey’ exposes the path in the lake being accessed. This is then chopped up to create dataset labels for the dashboard.

I want to partition my output data by runtime month, so partition values are created from the runtime date ‘job_start_date’ (requestYYMM).

The Microsoft documentation suggests that a user-friendly service principal name is available in the logs. In my experience, this is always null. However, when a user principal made the request, the ‘userprincipalname’ is populated with a user-friendly identifier. For service principal reads, the userobjectid ( the Azure GUID ) is substituted in. This GUID is crucial to discovering the friendly name once the data is in Power BI: it can be fed as a parameter into a ‘get’ call to the Microsoft Graph REST API.

Hence the following code…

...
coalesce(userprincipalname, serviceprincipalname, userobjectid) userprincipal
...

The log data is then summarized by request start date, processing month, container (logical label) and user principal, counting the read requests made.

Then it’s just a matter of writing the data generated by the view to partitions in the blob store.

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
df = sqlContext.table("ContainerReadCount")

output_folder = "abfss://adlslogprocessing@gsstorage.dfs.core.windows.net/ReadFileCountsV2/"
df.repartition(1).write.partitionBy("requestYYMM").mode("overwrite").format("parquet").save(output_folder)
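
To sanity-check the output, the partitioned parquet can be read straight back; the month value below is just an illustrative example.

# Quick check of the summarised output; the partition filter value is illustrative.
check = spark.read.parquet(output_folder)
display(check.where("requestYYMM = '2023-03'").orderBy("requestStartDate"))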

Microsoft Graph

If you haven’t already done so, head over to the Microsoft Graph Explorer, log in ( assuming you already have an Azure account ) and try it out.

There are a ton of useful API end-points that could be utilised in so many operational scenarios.

For this dashboard, I used the following end-point…

https://graph.microsoft.com/v1.0/servicePrincipals/{servicePrincipal-id}

Power BI has a web connector that can be used to call REST APIs. With this connector, you can look up the friendly name of the service principal for each service principal id ( servicePrincipal-id ) appearing in the Data Lake diagnostic logs.

Here is the custom Power BI function I used to perform the lookup…

= (params) =>
let
    GetJsonQuery = Web.Contents("https://graph.microsoft.com/v1.0/servicePrincipals/" & params, [
        Headers = [Authorization = bearertoken]
    ]),
    FormatAsJsonQuery = Json.Document(GetJsonQuery),
    data = try FormatAsJsonQuery[appDisplayName] otherwise null,
    res = data
in
    res

‘bearertoken’ is another query I use to obtain my bearer token to authorise the request to Microsoft Graph.
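
If you want to sanity-check the lookup outside Power BI, here is a minimal Python sketch of the same two calls: a client-credentials token request followed by the servicePrincipals lookup. The tenant, application and secret values are placeholders, and the app registration would need a Graph permission that allows reading service principals ( e.g. Application.Read.All ).

import requests

# Placeholder credentials -- substitute your own tenant, app registration and secret.
tenant_id = "<directory-id>"
client_id = "<application-id>"
client_secret = "<secret>"

# Client-credentials flow against Azure AD for a Microsoft Graph token.
token_response = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://graph.microsoft.com/.default",
    },
)
access_token = token_response.json()["access_token"]

# Look up the friendly display name for a service principal object id taken from the logs.
sp_object_id = "<userobjectid-from-the-logs>"
lookup = requests.get(
    f"https://graph.microsoft.com/v1.0/servicePrincipals/{sp_object_id}",
    headers={"Authorization": f"Bearer {access_token}"},
)
print(lookup.json().get("appDisplayName"))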

Add a lookup column to get friendly principal names from Microsoft Graph

The function is invoked as a custom column on the dataset prepared by the Databricks notebook.

Conclusion

I deliberately didn’t want this to be a ten-page ‘build a Power BI dashboard’ walkthrough. My intention was to cover the key points regarding visualising this rich source of information: the data lake diagnostic log.

Back to the narrative of the post, would humans decide it would be ‘ok’ to gobble up a vulnerable cola bottle or do something with data they probably shouldn’t be doing? If they knew that their actions were recorded and easily observable by their community maybe they would be more inclined to pause and consider the ‘right thing’. After all, I think doing the right thing really sits at the heart of a good data governance culture.

Finally, I wanted to show another way the same data can be re-used: to show individual user trends. This gives us great insight into who the most active people in the organisation are with the Enterprise data. Perhaps it’s a conversation starter around understanding a workload, potentially leading to performance improvements like local caching for repeated reads. The more we understand how users interact with data, the more empowering the technical solutions we can serve them.

User consumption trends


Gary Strange

Gary is a Big Data Architect at ASOS, a leading online fashion destination for 20-somethings. He advises 11 teams across three domains.