Computing total storage size of a folder in Azure Data Lake with Pyspark

Alexandre Bergere
Published in datalex
3 min read · Sep 3, 2020

The following article explains how to recursively compute the storage size and the number of files and folders in ADLS Gen 1 (or an Azure Storage Account) from Databricks.

Photo by ArtisanalPhoto on Unsplash

Configuration:

Using UDF:

In order to optimize our computation, we are going to vectorise our functions using pandas UDFs.

“A pandas user-defined function (UDF) — also known as vectorized UDF — is a user-defined function that uses Apache Arrow to transfer data and pandas to work with the data. pandas UDFs allow vectorized operations that can increase performance up to 100x compared to row-at-a-time Python UDFs.” — Databricks documentation

For more information, check the Databricks documentation or this presentation, which covers the new features introduced with Spark 3.0.
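As a quick illustration (a generic sketch, not code from this article), a pandas UDF receives and returns whole pandas Series instead of single rows; the column name and conversion below are hypothetical:

import pandas as pd
from pyspark.sql.functions import pandas_udf, col

#A pandas UDF works on a pandas Series at once, transferred via Arrow
@pandas_udf("long")
def bytes_to_mb(size: pd.Series) -> pd.Series:
    return size // (1024 * 1024)

#Hypothetical usage on a DataFrame with a 'size' column in bytes
#df.withColumn('size_mb', bytes_to_mb(col('size')))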

Using the Azure API:

In Azure Databricks it is more usual to mount your storage (Gen 1, Gen 2 or Storage Account). However, the mount system in Databricks relies on dbutils, which cannot be used inside a UDF. If you try the function with dbutils:

from pyspark.sql.functions import udf, col

def recursiveDirSize(path):
    total = 0
    dir_files = dbutils.fs.ls(path)
    for file in dir_files:
        if file.isDir():
            total += recursiveDirSize(file.path)
        else:
            total += file.size
    return total

#UDF
udfRecursiveDirSize = udf(recursiveDirSize)

display(df.withColumn('size', udfRecursiveDirSize(col('path'))))

You will get the following error:

“databricks could not serialize object: Exception: You cannot use dbutils within a spark job”

To work around this limitation, we use the Azure API instead.

Calculate Total Storage size through PySpark:

Connect to Azure Data Lake

For the purpose of this article, we are using Azure Data Lake Gen 1 and the azure.datalake.store SDK.
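If the SDK is not already available on your cluster, it can be installed from PyPI; the package that provides the azure.datalake.store module is azure-datalake-store:

#Notebook-scoped install in Databricks (or add it as a cluster library)
%pip install azure-datalake-store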

The same functions can be used for Azure Data Lake Gen 2 or a Storage Account; the only difference is the initialisation of the Azure client:

from azure.datalake.store import core, lib

#Connect to Azure
adls_credentials = lib.auth(tenant_id=directory_id, client_secret=application_key, client_id=application_id)

#Create the connection
adls_client = core.AzureDLFileSystem(adls_credentials, store_name=adls_name)
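For comparison, a possible initialisation for ADLS Gen 2 could look like the following sketch, using the azure-identity and azure-storage-file-datalake packages; account_name and the container name are placeholders, not values from this article:

from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

#Sketch: authenticate with a service principal (same tenant/app variables as above)
credential = ClientSecretCredential(tenant_id=directory_id, client_id=application_id, client_secret=application_key)

#Create the ADLS Gen 2 client for a placeholder account and container
service_client = DataLakeServiceClient(account_url=f"https://{account_name}.dfs.core.windows.net", credential=credential)
file_system_client = service_client.get_file_system_client("my-container")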

Define functions:

Calculate recursive total size for a specific path:

#Load libraries
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
from pyspark.sql.functions import concat, col, lit, udf, pandas_udf

#Total size for a path
def recursiveDirSize(path):
    total = 0
    dir_files = adls_client.listdir(path=path, detail=True)
    for file in dir_files:
        if file['type'] == 'DIRECTORY':
            total += recursiveDirSize(file['name'])
        else:
            total += file['length']
    return total

#UDF
udfRecursiveDirSize = udf(recursiveDirSize)
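Before plugging it into Spark, the function can be sanity-checked on a single path (the path below is only an example):

#Quick check on a hypothetical path, outside of any Spark job
print(recursiveDirSize('/raw/sales'))  #total size in bytes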

Calculate number of Files and number of Folders for a specific path:

#Number of files for a path
def recursiveNbFile(path):
    total = 0
    dir_files = adls_client.listdir(path=path, detail=True)
    for file in dir_files:
        if file['type'] == 'DIRECTORY':
            total += recursiveNbFile(file['name'])
        else:
            total += 1
    return total

#UDF
udfrecursiveNbFile = udf(recursiveNbFile)

#Number of folders for a path
def recursiveNbFolder(path):
    total = 0
    dir_files = adls_client.listdir(path=path, detail=True)
    for file in dir_files:
        if file['type'] == 'DIRECTORY':
            total += 1
            total += recursiveNbFolder(file['name'])
    return total

#UDF
udfrecursiveNbFolder = udf(recursiveNbFolder)
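Note that each UDF lists the same directories again. If that becomes a bottleneck, the three metrics could be computed in a single recursive walk, as in this sketch (not part of the original article):

#Sketch: compute size, file count and folder count in one pass
def recursiveStats(path):
    size, nb_files, nb_folders = 0, 0, 0
    for file in adls_client.listdir(path=path, detail=True):
        if file['type'] == 'DIRECTORY':
            nb_folders += 1
            sub_size, sub_files, sub_folders = recursiveStats(file['name'])
            size += sub_size
            nb_files += sub_files
            nb_folders += sub_folders
        else:
            size += file['length']
            nb_files += 1
    return size, nb_files, nb_folders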

Use the UDFs in your DataFrame:

After loading a DataFrame containing the different paths, we enrich it with size, nbFiles and nbFolder for each path:

(
    df
    .withColumn('size', udfRecursiveDirSize(col('path')))
    .withColumn('nbFiles', udfrecursiveNbFile(col('path')))
    .withColumn('nbFolder', udfrecursiveNbFolder(col('path')))
)
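For reference, the df used above only needs a path column; a minimal one could be built beforehand like this (the paths are hypothetical):

#Hypothetical input DataFrame with a single 'path' column
df = spark.createDataFrame([('/raw',), ('/curated',), ('/sandbox',)], ['path'])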
