Computing total storage size of a folder in Azure Data Lake with Pyspark

Alexandre Bergere
Published in datalex
3 min read · Sep 3, 2020

The following article explains how to recursively compute the storage size and the number of files and folders in ADLS Gen 1 (or an Azure Storage Account) from Databricks.

Photo by ArtisanalPhoto on Unsplash

Configuration:

Using UDF:

In order to optimize our computation, we are going to vectorise our functions using pandas UDFs.

“A pandas user-defined function (UDF) — also known as vectorized UDF — is a user-defined function that uses Apache Arrow to transfer data and pandas to work with the data. pandas UDFs allow vectorized operations that can increase performance up to 100x compared to row-at-a-time Python UDFs.” — Databricks documentation

For more information, check the Databricks documentation or this presentation, which covers the new features introduced with Spark 3.0.
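As a quick illustration (a generic sketch, not code from this article), a pandas UDF receives and returns whole pandas Series instead of single rows; the column name and conversion below are hypothetical:

import pandas as pd
from pyspark.sql.functions import pandas_udf, col

#A pandas UDF works on a pandas Series at once, transferred via Arrow
@pandas_udf("long")
def bytes_to_mb(size: pd.Series) -> pd.Series:
    return size // (1024 * 1024)

#Hypothetical usage on a DataFrame with a 'size' column in bytes
#df.withColumn('size_mb', bytes_to_mb(col('size')))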

Using the Azure API:

In Azure Databricks it is more usual to mount your storage (Gen 1, Gen 2 or Storage Account). However, the mount system in Databricks relies on dbutils, which cannot be used inside a UDF. If you try the function with dbutils:

from pyspark.sql.functions import udf, col

def recursiveDirSize(path):
    total = 0
    dir_files = dbutils.fs.ls(path)
    for file in dir_files:
        if file.isDir():
            total += recursiveDirSize(file.path)
        else:
            total += file.size
    return total

#UDF
udfRecursiveDirSize = udf(recursiveDirSize)

display(df.withColumn('size', udfRecursiveDirSize(col('path'))))

You will get the following error:

“databricks could not serialize object: Exception: You cannot use dbutils within a spark job”

To work around this limitation, we use the Azure API instead.

Calculate Total Storage size through PySpark:

Connect to Azure Data Lake

For the purpose of this article, we are using Azure Data Lake Gen 1 and the azure.datalake.store SDK.
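If the SDK is not already available on your cluster, it can be installed from PyPI; the package that provides the azure.datalake.store module is azure-datalake-store:

#Notebook-scoped install in Databricks (or add it as a cluster library)
%pip install azure-datalake-store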

The same functions can be used for Azure Data Lake Gen 2 or a Storage Account; the only difference is the initialisation of the Azure client:

from azure.datalake.store import core, lib

#Connect to Azure
adls_credentials = lib.auth(tenant_id=directory_id, client_secret=application_key, client_id=application_id)

#Create the connection
adls_client = core.AzureDLFileSystem(adls_credentials, store_name=adls_name)
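For comparison, a possible initialisation for ADLS Gen 2 could look like the following sketch, using the azure-identity and azure-storage-file-datalake packages; account_name and the container name are placeholders, not values from this article:

from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

#Sketch: authenticate with a service principal (same tenant/app variables as above)
credential = ClientSecretCredential(tenant_id=directory_id, client_id=application_id, client_secret=application_key)

#Create the ADLS Gen 2 client for a placeholder account and container
service_client = DataLakeServiceClient(account_url=f"https://{account_name}.dfs.core.windows.net", credential=credential)
file_system_client = service_client.get_file_system_client("my-container")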

Define functions:

Calculate recursive total size for a specific path:

#Load libraries
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
from pyspark.sql.functions import concat, col, lit, udf, pandas_udf

#Total size for a path
def recursiveDirSize(path):
    total = 0
    dir_files = adls_client.listdir(path=path, detail=True)
    for file in dir_files:
        if file['type'] == 'DIRECTORY':
            total += recursiveDirSize(file['name'])
        else:
            total += file['length']
    return total

#UDF
udfRecursiveDirSize = udf(recursiveDirSize)
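Before plugging it into Spark, the function can be sanity-checked on a single path (the path below is only an example):

#Quick check on a hypothetical path, outside of any Spark job
print(recursiveDirSize('/raw/sales'))  #total size in bytes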

Calculate number of Files and number of Folders for a specific path:

#Number of files for a path
def recursiveNbFile(path):
    total = 0
    dir_files = adls_client.listdir(path=path, detail=True)
    for file in dir_files:
        if file['type'] == 'DIRECTORY':
            total += recursiveNbFile(file['name'])
        else:
            total += 1
    return total

#UDF
udfrecursiveNbFile = udf(recursiveNbFile)

#Number of folders for a path
def recursiveNbFolder(path):
    total = 0
    dir_files = adls_client.listdir(path=path, detail=True)
    for file in dir_files:
        if file['type'] == 'DIRECTORY':
            total += 1
            total += recursiveNbFolder(file['name'])
    return total

#UDF
udfrecursiveNbFolder = udf(recursiveNbFolder)
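Note that each UDF lists the same directories again. If that becomes a bottleneck, the three metrics could be computed in a single recursive walk, as in this sketch (not part of the original article):

#Sketch: compute size, file count and folder count in one pass
def recursiveStats(path):
    size, nb_files, nb_folders = 0, 0, 0
    for file in adls_client.listdir(path=path, detail=True):
        if file['type'] == 'DIRECTORY':
            nb_folders += 1
            sub_size, sub_files, sub_folders = recursiveStats(file['name'])
            size += sub_size
            nb_files += sub_files
            nb_folders += sub_folders
        else:
            size += file['length']
            nb_files += 1
    return size, nb_files, nb_folders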

Use the UDFs in your DataFrame:

After loading a DataFrame containing the different paths, we enrich it with size, nbFiles and nbFolder for each path:

(
    df
    .withColumn('size', udfRecursiveDirSize(col('path')))
    .withColumn('nbFiles', udfrecursiveNbFile(col('path')))
    .withColumn('nbFolder', udfrecursiveNbFolder(col('path')))
)
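For reference, the df used above only needs a path column; a minimal one could be built beforehand like this (the paths are hypothetical):

#Hypothetical input DataFrame with a single 'path' column
df = spark.createDataFrame([('/raw',), ('/curated',), ('/sandbox',)], ['path'])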
