Python for Azure: Enable Blob Versioning on Azure Data Lake Storage [ADLS]

Pavleen Singh Bali
Published in Python for Azure · 5 min read · Nov 29, 2022

Introduction: Blob storage versioning automatically keeps previous iterations of an object. With versioning turned on, you can access prior versions of a blob after it has been edited or removed. Blob versioning is part of a comprehensive data protection strategy for blob data.

Blob Versioning: Configuration/Setting at the scope of storage-account level

One can enable Blob storage versioning to automatically maintain previous versions of an object. When blob versioning is enabled, you can access earlier versions of a blob to recover your data if it is modified or deleted.

Blob Versioning working mechanism [Source]
  • When this feature is enabled, Azure Storage automatically creates a new version with a unique version ID each time the blob is written (the value of the version ID is the timestamp at which the blob was updated)
  • A version captures the state of a blob at a given point in time. Each version is identified by its unique version ID
  • A version ID can identify the current version or a previous version. A blob can have only one current version at a time
  • If a write operation creates a new blob, the resulting blob is the current version of the blob
  • If a write operation modifies an existing blob, the current version becomes a previous version and the updated blob becomes the new current version
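Because a version ID is a timestamp, it can be parsed and compared like any other date. The sketch below assumes version IDs are UTC timestamps of the form `2022-11-29T10:15:30.1234567Z` (seven fractional digits, trailing `Z`). The helper names and the sample IDs are made up for illustration.

```python
from datetime import datetime, timezone


def parse_version_id(version_id: str) -> datetime:
    """Convert a version ID such as '2022-11-29T10:15:30.1234567Z' to a datetime.

    Assumes the ID is a UTC timestamp ending in 'Z'. Python datetimes keep
    microseconds only, so fractional digits beyond the sixth are dropped.
    """
    stamp = version_id.rstrip("Z")
    whole, _, fraction = stamp.partition(".")
    micros = (fraction + "000000")[:6]  # pad or truncate to exactly six digits
    parsed = datetime.strptime(f"{whole}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def newest_version(version_ids):
    """Return the most recent ID from an iterable of version IDs."""
    return max(version_ids, key=parse_version_id)


# Hypothetical IDs, as three successive writes to one blob might produce them.
sample_ids = [
    "2022-11-29T10:15:30.1234567Z",
    "2022-11-29T10:16:02.0000000Z",
    "2022-11-28T23:59:59.9999999Z",
]
latest = newest_version(sample_ids)
```

With versioning enabled, the newest ID in such a list would normally belong to the current version.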

Points to Remember:

  • Blob versions are immutable. You cannot modify the content or metadata of an existing blob version
  • Microsoft recommends maintaining fewer than 1000 versions per blob; otherwise, latency for blob listing operations can increase
  • Blob versioning cannot help you to recover from the accidental deletion of a storage account or container
  • You can perform read or delete operations on a specific version of a blob by providing its version ID; otherwise, the operation acts on the current version
  • The version ID remains the same for the lifetime of the version
  • When blob versioning is turned on, each write operation to a blob creates a new version
  • A blob that was created prior to versioning being enabled for the storage account does not have a version ID
  • Disabling blob versioning does not delete existing blobs, versions, or snapshots (blobs created or modified after that point do not get a version ID)
  • If versioning and soft delete are both enabled for a storage account, then when you delete a blob, the current version of the blob becomes a previous version
  • Blob versioning is available for standard general-purpose v2, premium block blob, and legacy Blob storage accounts
  • Storage accounts with a hierarchical namespace enabled for use with Azure Data Lake Storage Gen2 are not currently supported
  • Enabling blob versioning can result in additional data storage charges to your account
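The point about reading or deleting a specific version can be sketched as follows. This is a minimal illustration, assuming `blob_client` is an `azure.storage.blob.BlobClient` from a recent azure-storage-blob release, where `download_blob` and `delete_blob` accept a `version_id` keyword. The function names are hypothetical.

```python
def read_blob(blob_client, version_id=None):
    """Download a blob's content as bytes.

    With version_id=None the call acts on the current version;
    otherwise it targets the version with that ID.
    """
    if version_id is None:
        return blob_client.download_blob().readall()
    return blob_client.download_blob(version_id=version_id).readall()


def delete_version(blob_client, version_id):
    """Delete one specific version and leave the other versions untouched."""
    blob_client.delete_blob(version_id=version_id)
```

Since versions are immutable, there is deliberately no "update a version" counterpart here: a write always goes to the base blob and produces a new version.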

Hands-On Implementation via Azure Portal & Python SDK for Azure

Prerequisites

Setup

pip install -r requirements.txt

Workflow

1. For this demo, I first created a Resource group named “RG_Demo_ADLS_Data_Protection” and then a Storage account named “demo00blobversioning”.

Note: Remember to whitelist your IP in the “Networking” config settings of the storage account. Also, in the “Access Control (IAM)” config settings, add proper “role assignment” to yourself, especially ‘Storage Blob Data Owner/Contributor’ role for successful execution of this demo workflow.

Storage account created where we will enable blob versioning feature.

2. The script “blob_versioning.py” uses the Python SDK for Azure to implement this workflow, i.e., enabling blob versioning on the storage account.
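The original script is not reproduced here, so the following is only a sketch of how the enabling step might look, not the author's code. It assumes the azure-identity and azure-mgmt-storage packages are installed and uses a read-modify-write pattern (`get_service_properties`, then `set_service_properties`) so that other blob service settings are sent back unchanged. The function names `connect` and `enable_blob_versioning` are hypothetical.

```python
def connect(subscription_id):
    """Create a storage management client (needs a prior `az login`)."""
    # Imported lazily so the rest of the module loads without the Azure SDKs.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient

    return StorageManagementClient(DefaultAzureCredential(), subscription_id)


def enable_blob_versioning(storage_mgmt_client, resource_group, account_name):
    """Turn on blob versioning at the storage-account level.

    Reads the current blob service properties, flips the versioning flag,
    and writes the properties back.
    """
    blob_services = storage_mgmt_client.blob_services
    props = blob_services.get_service_properties(resource_group, account_name)
    props.is_versioning_enabled = True
    return blob_services.set_service_properties(resource_group, account_name, props)


# Example wiring with the resource names used in this article
# (the subscription ID is a placeholder):
#   client = connect("<sub_id>")
#   result = enable_blob_versioning(
#       client, "RG_Demo_ADLS_Data_Protection", "demo00blobversioning")
#   print("Versioning enabled:", result.is_versioning_enabled)
```

Passing the client in as a parameter keeps the enabling logic separate from authentication, which makes it easy to exercise with a stub client.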

3. Before running the script, in the terminal of the IDE do the following steps:

  • Log in to your Azure account
az login --tenant <tenant_id>
  • Select the correct subscription
az account set --subscription <sub_id/sub_name>

[Info]: Now the “_get_credential” method, which uses the “DefaultAzureCredential” class from the azure-identity package, can authenticate properly.

  • After selecting the correct ‘Python Interpreter’ and the correct ‘Configuration’ for your project (“Working Directory” etc.), run the script “blob_versioning.py”.
  • The Python run console with the workflow logs follows; please observe the highlighted text.
Python console with work-flow logs

4. After the script has executed successfully, the Azure portal shows that the “Versioning” property is enabled at the scope of the storage account, and that a container named “container-blob-versioning” has been created with a blob named “blob-versioning” inside it.

Blob Versioning property is enabled at the scope of storage-account
The container and the blob within it get created as per the Python workflow
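The repeated writes that produce the versions can be sketched as below. This is an illustration under assumptions, not the original script: `blob_client` is taken to be an `azure.storage.blob.BlobClient` for the blob “blob-versioning” in “container-blob-versioning”, and the dictionary returned by `upload_blob` is assumed to carry a `version_id` key once versioning is enabled (hence the defensive `.get`). The function name is hypothetical.

```python
def write_revisions(blob_client, payloads):
    """Overwrite the same blob once per payload and collect the version IDs."""
    version_ids = []
    for data in payloads:
        # overwrite=True replaces the content; without it, uploading to an
        # existing blob raises an error.
        response = blob_client.upload_blob(data, overwrite=True)
        # Assumption: the response dict exposes the new version's ID.
        version_ids.append(response.get("version_id"))
    return version_ids


# Example (requires a real BlobClient):
#   ids = write_revisions(blob_client, [b"first", b"second", b"third"])
```

Each call is a separate write operation, so with versioning enabled three payloads should yield three distinct version IDs.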

Here in the image below different versions of the blob “blob-versioning” can be seen, thus validating the current workflow.

Different version IDs are visible for each write operation on the blob
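The same list of versions can also be retrieved in code rather than through the portal. The sketch below assumes `container_client` is an `azure.storage.blob.ContainerClient` and relies on `list_blobs(..., include=["versions"])`, whose items expose `name`, `version_id`, and `is_current_version`. The function name is hypothetical.

```python
def list_blob_versions(container_client, blob_name):
    """Return (version_id, is_current) pairs for one blob, oldest first."""
    versions = [
        (item.version_id, bool(item.is_current_version))
        for item in container_client.list_blobs(
            name_starts_with=blob_name, include=["versions"]
        )
        if item.name == blob_name  # the prefix filter can match other blobs too
    ]
    # Version IDs are fixed-width timestamps, so string order is time order.
    # Blobs written before versioning was enabled have no ID and sort first.
    return sorted(versions, key=lambda pair: pair[0] or "")


# Example (requires a real ContainerClient):
#   for version_id, is_current in list_blob_versions(
#           container_client, "blob-versioning"):
#       print(version_id, "(current)" if is_current else "")
```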

Note:

  • This feature is currently in preview and its stable version will soon be in General-Availability (GA)
  • Also, blob versioning is not supported on ADLS Gen2 storage accounts that have Hierarchical Namespace (HNS) enabled
  • However, for the point mentioned above there is a workaround with ‘blob snapshot’ that I will present in the next article

Key Observation from the Workflow:

  • With each ‘write operation’ on the same blob, a new version of the blob is created
  • This makes it possible to revert the blob to an older version, or to retrieve a specific version, when needed
  • To overwrite a base blob with one of its versions, call the start_copy_from_url method on the blob client and pass it the URL of the versioned blob. That URL is the same as the URL of the base blob, with the version ID appended as the versionid query string parameter
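The restore step described in the last point can be sketched like this. It is a minimal illustration, assuming `blob_client` is an `azure.storage.blob.BlobClient` whose `url` property carries no query string of its own (for example, no SAS token). Authorization of the source URL is not covered here, and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def versioned_blob_url(blob_url, version_id):
    """Append the version ID to a base blob URL as the `versionid` parameter.

    Assumes blob_url has no existing query string.
    """
    return f"{blob_url}?{urlencode({'versionid': version_id})}"


def restore_version(blob_client, version_id):
    """Overwrite the base blob with the content of one of its versions."""
    source_url = versioned_blob_url(blob_client.url, version_id)
    # The copy is a write operation on the base blob, so with versioning
    # enabled the restored content becomes the new current version.
    return blob_client.start_copy_from_url(source_url)


# Example with a made-up account name and version ID:
url = versioned_blob_url(
    "https://demo00blobversioning.blob.core.windows.net/container-blob-versioning/blob-versioning",
    "2022-11-29T10:15:30.1234567Z",
)
```

`urlencode` percent-encodes the colons in the timestamp, which keeps the URL valid without changing the version ID the service receives.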
