Python for Azure: Customised Azure Blob Container Inventory

Pavleen Singh Bali · Published in Python for Azure · 5 min read · May 13, 2023

Introduction: Azure Blob Container is the basic organisational unit of Azure Blob Storage, a cloud-based storage service provided by Microsoft Azure that allows users to store and manage unstructured data such as text, images, videos, and audio files. Blob Container is part of the Azure Storage offering and is designed to handle massive amounts of data at scale.

At its core, Blob Container provides a highly scalable and durable storage platform for unstructured data, making it an ideal solution for applications that need to store large amounts of data without having to worry about managing the underlying infrastructure. A single blob can reach multiple terabytes in size, and a container can hold millions of blobs, making Blob Container an excellent choice for data-intensive workloads.

Azure Blob Container: A Powerful Tool for Big Data Platforms

Blob Container provides users with the ability to access and manage their data through a variety of interfaces, including REST APIs, .NET client libraries, and Azure PowerShell. Blob Container also integrates with other Azure services such as Azure Data Factory, Azure Stream Analytics, and Azure Functions, making it easy to build scalable and flexible data pipelines.

Azure Blob Storage for unstructured data [Source]

Blob Container offers several key features that make it a powerful tool for storing unstructured data. These include:

  1. Durability: Blob Container is designed to provide high durability for stored data, with multiple copies of each blob stored in different locations to ensure data availability in case of a failure.
  2. Security: Blob Container provides built-in security features such as encryption at rest and in transit, role-based access control, and shared access signatures, which help keep data secure.
  3. Scalability: Blob Container is designed to scale automatically to meet the storage needs of applications, with the ability to store and manage petabytes of data.
  4. Cost-effectiveness: Blob Container is a cost-effective solution for storing unstructured data, with pay-as-you-go pricing and the ability to scale storage capacity up or down as needed.

In conclusion, Azure Blob Container is a powerful and flexible cloud-based storage solution that provides a cost-effective and scalable way to store and manage unstructured data. With its durability, security, scalability, and cost-effectiveness, Blob Container is an ideal choice for building data-intensive applications in the cloud.

Customised Azure Blob Container Inventory for Data Exploration and Statistics

Currently, there is no built-in tooling that reports the exact size of a blob container, short of manual intervention in the ‘Azure Storage Explorer’ tool. The size of each individual blob file can be obtained either via REST API calls or directly in the portal, but, as mentioned, not the total size of the whole blob container.

Therefore, this article presents a customised solution that computes the total size of each blob container, including a listing of all the blob files with their individual names and sizes, i.e. an inventory for data exploration, visualisation and statistics. In addition, the ‘total_size’ and ‘total_blob_files’ of the blob container are written to its ‘metadata’ so that end users can quickly visualise or query the total size of the blob container and the total number of blob files in it.

Moreover, no extra permissions are needed to view this metadata, such as an RBAC Contributor role at the scope of the parent storage account or ACLs (access control lists) at the scope of the blob container.
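For example, the metadata can be fetched with a single Azure CLI call (a sketch; the account and container names are placeholders):

az storage container metadata show --account-name <storage_account_name> --name <container_name> --auth-mode login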

Hands-On Implementation via Azure Portal & Python SDK for Azure

Prerequisites

  • Python 3.6 or later is required to use the Azure SDK packages below.
  • You must have an Azure subscription to run the Python code below.

Setup

pip install -r requirements.txt
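The requirements file is not reproduced here; at a minimum it would need the two Azure SDK packages used by the script in this demo (a sketch):

# requirements.txt (minimal sketch, assuming only the packages used below)
azure-identity
azure-storage-blob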

Workflow

1. In this workflow demo, I first created a Resource group and then a dummy Storage account inside it; both can equally be created from the command line, as sketched below.
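A possible Azure CLI equivalent (a sketch; the names, region and SKU are placeholders):

az group create --name <rg_name> --location <region>
az storage account create --name <storage_account_name> --resource-group <rg_name> --location <region> --sku Standard_LRS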

Note: Remember to whitelist your IP in the “Networking” settings of the storage account. Also, in the “Access Control (IAM)” settings, add a proper “role assignment” for yourself, especially the ‘Storage Blob Data Owner’ or ‘Storage Blob Data Contributor’ role, for successful execution of this demo workflow; the assignment can also be scripted, as sketched below.
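A CLI sketch of that role assignment (the assignee and scope are placeholders):

az role assignment create --assignee <your_user_object_id> --role "Storage Blob Data Contributor" --scope "/subscriptions/<sub_id>/resourceGroups/<rg_name>/providers/Microsoft.Storage/storageAccounts/<storage_account_name>"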

2. The script below, “blob_container_inventory.py”, demonstrates the usage of the Python SDK for Azure to implement the workflow described above, i.e. creating a customised Azure Blob Container inventory for data exploration and statistics.
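The embedded gist is not reproduced here, but a minimal sketch of such a script, assuming only the azure-identity and azure-storage-blob packages and a placeholder account URL, could look like this:

# blob_container_inventory.py (minimal sketch; the account URL is a placeholder)
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

STORAGE_ACCOUNT_URL = "https://<storage_account_name>.blob.core.windows.net"


def _get_credential():
    # DefaultAzureCredential reuses the 'az login' session from the terminal.
    return DefaultAzureCredential()


def build_container_inventory(account_url: str) -> None:
    service_client = BlobServiceClient(account_url=account_url, credential=_get_credential())

    # Walk every container in the storage account.
    for container in service_client.list_containers():
        container_client = service_client.get_container_client(container.name)
        total_size = 0
        total_blob_files = 0

        print(f"Container: {container.name}")
        for blob in container_client.list_blobs():
            # blob.size is the blob's content length in bytes.
            print(f"  {blob.name}: {blob.size} bytes")
            total_size += blob.size
            total_blob_files += 1

        print(f"  total_blob_files={total_blob_files}, total_size={total_size} bytes")

        # Persist the aggregates as container metadata so end users can
        # query them later without re-scanning all the blobs.
        container_client.set_container_metadata(
            metadata={
                "total_size": str(total_size),
                "total_blob_files": str(total_blob_files),
            }
        )


if __name__ == "__main__":
    build_container_inventory(STORAGE_ACCOUNT_URL)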

3. Before running the script, perform the following steps in the terminal of the IDE:

  • Log in to your Azure account
az login --tenant <tenant_id>
  • Select the correct subscription
az account set --subscription <sub_id/sub_name>

[Info]: Now the “_get_credential” method, which uses the “DefaultAzureCredential” class, can perform the authentication properly.

  • After selecting the correct ‘Python Interpreter’ and the correct run ‘Configuration’ for your project (e.g. the “Working Directory”), run the script blob_container_inventory.py
  • Following is the Python run console with the workflow logs; please observe the highlighted text below. Each blob container is listed with its total number of files and their respective sizes, and finally the total size of the container is computed and displayed.
Python console with workflow logs

4. After the script has executed successfully, we can observe on the Azure portal side that the dummy storage account has the two mentioned blob containers, with the blob files listed exactly as in the Python console log above.

Snippet of the first blob-container ‘container-blob-ver’
Snippet of the second blob-container ‘container-imt’

5. Also, after a successful run of the script, the ‘total_size’ and ‘total_blob_files’ parameters are added as ‘metadata’ of the blob container, which in turn gives easy visualisation and query capabilities to Big Data Platform teams (see the read-back sketch after the snippets below).

Snippet showcases ‘total_size’ and ‘total_blob_files’ as ‘metadata’ of the blob container ‘container-blob-ver’
Snippet showcases ‘total_size’ and ‘total_blob_files’ as ‘metadata’ of the blob container ‘container-imt’
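Reading those aggregates back later takes only a few lines of Python (a sketch; the account URL and container name are placeholders):

# read_inventory_metadata.py (sketch; URL and container name are placeholders)
from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient

container_client = ContainerClient(
    account_url="https://<storage_account_name>.blob.core.windows.net",
    container_name="container-blob-ver",
    credential=DefaultAzureCredential(),
)

# Container metadata arrives as a plain dict of strings.
metadata = container_client.get_container_properties().metadata
print(metadata.get("total_size"), metadata.get("total_blob_files"))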

Key Observations from the Workflow:

  • The code/script can be added to a pipeline scheduled to run weekly or monthly, creating an inventory that can be used for data exploration, statistics or even troubleshooting purposes.
  • The code can also run inside an Azure Function as part of an event-driven architecture, such that whenever a blob file is created or deleted, the function runs to update the blob container inventory (see the sketch after this list).
  • If the blob container is very large and holds millions of blob files, then the workflow runtime can become massive.
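A hypothetical sketch of that event-driven variant, assuming the Python v2 programming model for Azure Functions and that blob_container_inventory.py from above is deployed alongside the function:

# function_app.py (hypothetical sketch; note that blob triggers fire on blob
# creation/overwrite, so reacting to deletions would need e.g. an Event Grid
# trigger instead)
import azure.functions as func

from blob_container_inventory import STORAGE_ACCOUNT_URL, build_container_inventory

app = func.FunctionApp()


@app.blob_trigger(arg_name="blob", path="container-imt/{name}",
                  connection="AzureWebJobsStorage")
def update_inventory(blob: func.InputStream):
    # Recompute the inventory whenever a blob in the watched container changes.
    build_container_inventory(STORAGE_ACCOUNT_URL)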
