Azure Storage Account: The Nuances

Solon Das
Towards Data Engineering
8 min read · Jun 6, 2024

Azure Storage Account is a cloud storage solution provided by Microsoft Azure. It offers a scalable and durable platform to store various types of data, including blobs, files, queues, tables, and disks. Here are the main components and their benefits:

  1. Blob Storage: Stores unstructured data such as text, binary data, images, and videos. It’s ideal for serving documents or media files directly to browsers.
  2. File Storage: Provides fully managed file shares in the cloud, accessible via the Server Message Block (SMB) protocol. It’s useful for legacy applications and data migration.
  3. Queue Storage: Enables reliable messaging between application components, ideal for decoupling different parts of a large-scale system.
  4. Table Storage: Offers NoSQL key-value storage for structured data, suitable for scenarios needing fast access to large datasets.
  5. Disk Storage: Provides persistent, high-performance block storage for Azure Virtual Machines (VMs).
(Image: Storage Account Services)
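For a feel of how these services are used, here is a minimal sketch of a Blob Storage upload with the azure-storage-blob Python SDK; the connection string, container, and blob names are placeholders:

from azure.storage.blob import BlobServiceClient

# Connect with the storage account's connection string (placeholder)
service = BlobServiceClient.from_connection_string("<connection-string>")

# Upload a local file as a block blob (container and path are hypothetical)
blob = service.get_blob_client(container="retaildb", blob="raw/sales.csv")
with open("sales.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)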

How Azure Storage Account Helps:

  • Scalability: Automatically scales to handle massive amounts of data.
  • Durability: Data is replicated to ensure high availability and protection against data loss.
  • Security: Provides features like encryption, role-based access control, and network security to safeguard data.
  • Cost-Effectiveness: Various pricing tiers and redundancy options help optimize costs based on specific needs.
  • Integration: Easily integrates with other Azure services and on-premises systems, facilitating hybrid cloud scenarios.

Different Blob Access Tiers:

(Image: Blob Access Tiers — Hot, Cool, Cold, and Archive)

Storage Account Redundancy Options:

Azure Storage Accounts offer several redundancy options to ensure data availability and durability. These options provide different levels of replication to protect against hardware failures, network issues, and even regional disasters. Here are the primary redundancy options:

1. Locally Redundant Storage (LRS)

  • Description: LRS replicates data three times within a single physical location in the primary region.
  • Use Case: Suitable for scenarios where data can be recovered from other sources and where cost is a primary concern.
  • Pros: Cost-effective.
  • Cons: Data is not protected against regional outages.

2. Zone-Redundant Storage (ZRS)

  • Description: ZRS replicates data synchronously across three Azure availability zones in the primary region.
  • Use Case: Ideal for scenarios requiring higher availability within a single region.
  • Pros: Protects against data center failures within the region.
  • Cons: Does not protect against regional disasters.

3. Geo-Redundant Storage (GRS)

  • Description: GRS replicates data to a secondary region, hundreds of miles away from the primary region. Data is replicated asynchronously.
  • Use Case: Suitable for critical data requiring disaster recovery capabilities.
  • Pros: Provides regional disaster protection.
  • Cons: Higher cost and potential for data loss if a regional disaster occurs before the asynchronous replication completes.

4. Read-Access Geo-Redundant Storage (RA-GRS)

  • Description: RA-GRS offers the same replication as GRS, with the additional capability of read access to the secondary region.
  • Use Case: Useful for applications that require high availability and need read access to the data even if the primary region is down.
  • Pros: Provides read access to data in the secondary region, enhancing disaster recovery.
  • Cons: Higher cost compared to GRS.

5. Geo-Zone-Redundant Storage (GZRS)

  • Description: GZRS combines the features of ZRS and GRS. It synchronously replicates data across three zones in the primary region and asynchronously to a secondary region.
  • Use Case: Best for applications needing high availability and disaster recovery within and across regions.
  • Pros: Offers high availability and protection against regional and zonal failures.
  • Cons: Higher cost due to advanced replication.

6. Read-Access Geo-Zone-Redundant Storage (RA-GZRS)

  • Description: RA-GZRS provides the same replication as GZRS with additional read access to the secondary region.
  • Use Case: Ideal for applications that need read access to data even if the primary region is down, along with high availability and disaster recovery.
  • Pros: Provides the highest level of availability and read access during regional outages.
  • Cons: Most expensive redundancy option.
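The redundancy option is selected through the storage account SKU at creation time. Below is a minimal sketch using the azure-mgmt-storage Python SDK; the subscription ID, resource group, account name, and region are placeholders:

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Authenticate and target a subscription (placeholder ID)
client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The SKU name selects the redundancy: Standard_LRS, Standard_ZRS,
# Standard_GRS, Standard_RAGRS, Standard_GZRS, or Standard_RAGZRS
poller = client.storage_accounts.begin_create(
    "<resource-group>",
    "<account-name>",
    {
        "location": "eastus",
        "kind": "StorageV2",
        "sku": {"name": "Standard_GZRS"},
    },
)
account = poller.result()  # blocks until provisioning completes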

What is ADLS Gen2 and how is it different from a Blob container?

Creating an Azure Data Lake Storage Gen2 account is similar to creating a regular Blob storage account; however, the hierarchical namespace option has to be enabled under the Advanced tab during storage account creation.

The hierarchical namespace, complemented by the Data Lake Storage Gen2 endpoint, enables file and directory semantics, accelerates big data analytics workloads, and supports access control lists (ACLs).

(Image: Enabling the hierarchical namespace)
(Image: Difference between Blob Storage and ADLS Gen2)
(Image: Differentiating the two through their icons)
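One way to see the difference in practice: with the hierarchical namespace enabled, the azure-storage-file-datalake Python SDK exposes true directory operations and POSIX-style ACLs, which a flat blob container does not. A minimal sketch, with hypothetical file system and directory names:

from azure.storage.filedatalake import DataLakeServiceClient

# Note the dfs endpoint, not the blob endpoint
service = DataLakeServiceClient(
    account_url="https://<storage-account-name>.dfs.core.windows.net",
    credential="<account-key>",
)

fs = service.get_file_system_client("retaildb")

# A true directory object, not just a name prefix on blobs
directory = fs.create_directory("raw/sales/2024")

# POSIX-style ACLs are only available with the hierarchical namespace
directory.set_access_control(acl="user::rwx,group::r-x,other::---")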

Key Points on Storage Accounts:

  1. By default, the access tier is Hot.
  2. The Archive tier is applied at the blob level, since only existing blobs can be archived.
  3. Moving a blob from the Cool tier back to Hot within 30 days of it being placed in Cool incurs an early deletion charge.
  4. Moving a blob from the Archive tier to Hot or Cool within 180 days of it being archived incurs an early deletion charge.
  5. Rehydration is the process of moving a blob from the Archive tier back to the Hot or Cool tier; it can take several hours (see the sketch after this list).
  6. Lifecycle management of a storage account: tier transitions and deletions can be automated through lifecycle management policies.
  7. Azure Storage pricing depends on the volume of data stored per month and on the quantity and types of operations performed, plus any data transfer costs.
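A hedged sketch of points 2 to 5 using the azure-storage-blob Python SDK; the connection string, container, and blob names are placeholders:

from azure.storage.blob import BlobClient

# Connect to a specific blob (container and blob names are hypothetical)
blob = BlobClient.from_connection_string(
    "<connection-string>", container_name="retaildb", blob_name="sales/2023.parquet"
)

# Archive an existing blob; tiering happens at the blob level
blob.set_standard_blob_tier("Archive")

# Rehydrate back to Hot; this runs asynchronously and can take several hours
blob.set_standard_blob_tier("Hot", rehydrate_priority="Standard")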

Three ways of accessing a Storage Account:

  1. Access Key / Account Key: Access permissions are granted at the storage account level; there is no way to restrict access to a specific container or folder.
  2. SAS Key (Shared Access Signature): Access permissions can be scoped to the container level, giving better security control than an access key (an example of generating one follows below).
  3. Service Principal: Offers fine-grained control, as permissions can be set down to the folder level, and therefore provides the best security.
(Image: Access levels for the different keys)
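As an illustration of option 2, a container-scoped SAS token can be generated with the azure-storage-blob Python SDK; the account, container, and key values are placeholders:

from datetime import datetime, timedelta, timezone
from azure.storage.blob import generate_container_sas, ContainerSasPermissions

# The account key signs the token, but the token itself only grants
# the listed permissions on this one container
sas_token = generate_container_sas(
    account_name="<storage-account-name>",
    container_name="retaildb",
    account_key="<account-key>",
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=4),
)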

Azure Secret Scope:

Secret scopes in Azure Databricks (or any cloud-based data platform) are important for managing sensitive information such as database credentials, API keys, and other confidential data. Here are several reasons why secret scopes are essential:

1. Security

  • Data Protection: Secret scopes help protect sensitive information by securely storing and managing secrets, preventing exposure in notebooks or code.
  • Access Control: Only authorized users and services can access secrets, reducing the risk of unauthorized access.

2. Centralized Management

  • Ease of Use: Secrets can be managed centrally, making it easier to update and rotate credentials without changing the code.
  • Consistency: Ensures that all users and applications use the same set of credentials, reducing configuration errors.

3. Compliance

  • Auditability: Many secret management systems provide audit logs to track who accessed or modified secrets, helping meet compliance requirements.
  • Regulation Adherence: Helps adhere to industry standards and regulations (like GDPR, HIPAA) that mandate secure handling of sensitive data.

4. Operational Efficiency

  • Simplified Configuration: Reduces the complexity of managing credentials in different environments (development, staging, production) by abstracting secrets from the code.
  • Automation: Enables automated workflows by allowing applications to securely access secrets without manual intervention.

5. Minimized Risk of Credential Exposure

  • Environment Isolation: Secrets are isolated from the environment where the code is executed, reducing the risk of accidental exposure through logs or error messages.
  • Version Control: Secrets are not stored in version control systems, protecting them from being exposed in repositories.

6. Integration with Other Services

  • Seamless Integration: Many cloud platforms integrate secret management with their services, making it easier to securely use secrets across different tools and applications.
  • Dynamic Secret Generation: Some secret management tools can generate and provide secrets dynamically, improving security by reducing the lifetime of credentials.

How Secret Scopes Work in Azure Databricks

In Azure Databricks, secret scopes are used to manage secrets in a secure way:

  • Creation: Administrators create secret scopes to store secrets.
  • Access: Notebooks and jobs can access these secrets using the Databricks Utilities (dbutils.secrets.get()), ensuring that secrets are never hard-coded in scripts.
  • Management: Secrets within a scope can be managed through the Databricks UI, CLI, or API, allowing for easy updates and rotations.
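In a notebook this looks like the following minimal sketch; the scope and key names are placeholders, and Databricks redacts secret values in notebook output:

# Read a secret without hard-coding it in the notebook
password = dbutils.secrets.get(scope="<scope-name>", key="<key-name>")

# List the keys available in a scope (the values themselves are never shown)
for secret in dbutils.secrets.list("<scope-name>"):
    print(secret.key)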

There are two types of secret scope in Azure Databricks:

Azure Key Vault Backed Secret Scope:

The keys are stored in Azure Key Vault. This is the preferred option, as other Azure services can also use the same keys.

Steps for creating an Azure Key Vault and linking it to a Databricks secret scope:

  1. Create an Azure Key Vault resource
  2. Create a secret key and store it in the Key Vault
  3. Launch the Databricks workspace and append /#secrets/createScope to the end of the URL to open the page for creating a secret scope.
  4. Fill in a scope name and link it to the Azure Key Vault by providing the Vault URL and Resource ID (both are listed under the Properties section of the Azure Key Vault).
(Image: Linking an Azure secret scope in the Databricks workspace)

Databricks Backed Secret Scope

The keys are stored in an encrypted database managed by Databricks. A Databricks-backed secret scope can be created only through the CLI or the API, not through the UI.

Steps for creating a Databricks-backed secret scope:

  1. Execute the following command in your terminal to create a secret scope:

databricks secrets create-scope --scope <scope-name> --initial-manage-principal users

2. Add a value to the secret scope:

databricks secrets put --scope <scope-name> --key <key-name>

3. Create a mount point to access the storage account using the Databricks-backed secret scope:

dbutils.fs.mount(
    source = 'wasbs://retaildb@<storage-account-name>.blob.core.windows.net',
    mount_point = '/mnt/retaildb',
    extra_configs = {
        'fs.azure.account.key.<storage-account-name>.blob.core.windows.net':
            dbutils.secrets.get('<scope-name>', '<key-name>')
    }
)

Note: Make sure to replace the variables in angle brackets with the respective values.
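Once mounted, the container can be read like any other path; the file name below is hypothetical:

# Read a CSV file through the mount point
df = spark.read.csv('/mnt/retaildb/sales.csv', header=True)
display(df)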

How to set a SAS Token in a Databricks environment?

A SAS token has to be generated first; it is then used to connect to the storage account and access the required files. The token can be added to the Spark configuration using spark.conf.set to establish the connection.

spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "SAS")

spark.conf.set("fs.azure.sas.token.provider.type.<storage-account-name>.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")

spark.conf.set("fs.azure.sas.fixed.token.<storage-account-name>.dfs.core.windows.net", "<SAS-Token>")

Note: Make sure to replace the variables in angle brackets with the respective values. For better security, the SAS token can be stored as a secret in Azure Key Vault and retrieved through a secret scope, as sketched below.
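Putting the two together, a minimal sketch that pulls the SAS token from a secret scope instead of pasting it inline; the scope and key names are placeholders:

# Fetch the SAS token from the secret scope at runtime
sas_token = dbutils.secrets.get(scope="<scope-name>", key="<sas-key-name>")

spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account-name>.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account-name>.dfs.core.windows.net", sas_token)

# Hypothetical check: list the container to confirm connectivity
display(dbutils.fs.ls("abfss://retaildb@<storage-account-name>.dfs.core.windows.net/"))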

