Best Practices for Access Control in Azure Data Lake Storage Gen2

Antony Neu
Mercedes-Benz Tech Innovation
Jul 9, 2021

Enterprise data governance and security regulations require that access to modern data lakes be restricted. While data governance aims to ensure compliance with laws and regulations such as GDPR, the principle of least privilege and the separation of duties have become cyber security best practices for protecting company data.

Fortunately, Microsoft Azure provides multiple authorization mechanisms in Azure Data Lake Storage (ADLS) Gen2:

  • Access keys
  • Shared access signatures (SAS tokens)
  • Role-based access control (RBAC)
  • Access control lists (ACLs)

This article focuses on the last two mechanisms, both of which grant access to Azure Active Directory (AAD) security principals, i.e. users, groups, managed identities and service principals. Typical use cases are Azure Data Factory access, Databricks filesystem mounts, or individual access for team members. By choosing AAD-based mechanisms over token-based or key-based mechanisms, the implementation benefits from the global security features and policies enabled on the AAD tenant. For example, manual rotation of tokens is no longer necessary.

By sharing my experience on cloud-based projects, this article aims to help you avoid common pitfalls when setting up your permissions on ADLS Gen2. First, let’s review the differences between RBAC roles and access control lists to know when to prefer one over the other.

Role-based access control (RBAC)

RBAC roles can be used with most Azure components, and Azure storage accounts are no exception. Security principals are assigned one or more roles, which define their permissions on the target resources. RBAC roles are especially useful for implementing inheritance, as they can be assigned at different levels, e.g. individual resources, resource groups, or subscriptions. A security principal with a certain RBAC role assigned on a resource group automatically inherits this role on all storage accounts in that resource group. For ADLS Gen2, the lowest level at which RBAC roles can be assigned is the storage account container. Naturally, this limits how fine-grained the permissions can be.

RBAC roles assigned at a higher level of the hierarchy are automatically inherited by the lower levels.
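The inheritance rule can be illustrated with a small model. This is only a sketch, not the Azure API: it assumes that a role assignment applies to a resource whenever the assignment scope is a prefix of the resource ID. The `RoleAssignment` class, the `effective_roles` helper, the principal names, and the resource IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleAssignment:
    principal: str  # hypothetical principal name
    role: str
    scope: str      # resource ID at which the role was assigned


def effective_roles(assignments, principal, resource_id):
    """Return the roles that apply to resource_id, directly or via a parent scope."""
    target = resource_id.rstrip("/").lower()
    roles = set()
    for assignment in assignments:
        scope = assignment.scope.rstrip("/").lower()
        # A role applies at its own scope and at every scope below it.
        in_scope = target == scope or target.startswith(scope + "/")
        if assignment.principal == principal and in_scope:
            roles.add(assignment.role)
    return roles


# Hypothetical resource IDs: resource group > storage account > container.
RG = "/subscriptions/0000/resourceGroups/rg-data"
ACCOUNT = RG + "/providers/Microsoft.Storage/storageAccounts/lake01"
CONTAINER = ACCOUNT + "/blobServices/default/containers/raw"

assignments = [
    RoleAssignment("adf-identity", "Storage Blob Data Reader", RG),
    RoleAssignment("analyst-group", "Storage Blob Data Reader", CONTAINER),
]

# Assigned on the resource group, inherited by the container.
print(effective_roles(assignments, "adf-identity", CONTAINER))
# Assigned on the container only, so nothing applies at the account level.
print(effective_roles(assignments, "analyst-group", ACCOUNT))
```

In this model, a role flows downwards only: the assignment on the container does not grant anything on the storage account above it.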

A common pitfall is using the wrong RBAC role. The commonly used role Reader will only grant access to the resource on the Azure control plane, which includes the listing of containers in the storage account. For actual data access, it is necessary to assign data-specific roles, e.g. Storage Blob Data Reader. For writing data, the corresponding roles Storage Blob Data Contributor or higher roles are required.
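The difference between control-plane and data-plane roles can be sketched as a lookup table. The table below is deliberately simplified and is an assumption for illustration: the real role definitions contain longer lists of actions, and the short action names (`read`, `write`, `delete`) as well as the `roles_allow` helper are hypothetical.

```python
# Simplified view of which data operations each role permits.
# "Reader" only covers the control plane, so it grants no data access at all.
ROLE_DATA_ACTIONS = {
    "Reader": set(),
    "Storage Blob Data Reader": {"read"},
    "Storage Blob Data Contributor": {"read", "write", "delete"},
    "Storage Blob Data Owner": {"read", "write", "delete"},
}


def roles_allow(roles, action):
    """Return True if any of the given roles permits the data action."""
    return any(action in ROLE_DATA_ACTIONS.get(role, set()) for role in roles)


print(roles_allow(["Reader"], "read"))                    # False
print(roles_allow(["Storage Blob Data Reader"], "read"))  # True
print(roles_allow(["Storage Blob Data Reader"], "write")) # False
```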

Access control lists

If finer-grained access control is required, access control lists (ACLs) can be used. ACLs provide a POSIX-style set of permissions on files and folders. There are two types of ACLs:

  • Access ACLs are set on existing files and folders and are not inherited by default.
  • Default ACLs are only set on folders as a template for future child elements and do not affect existing files. Once default ACLs have been set on a folder, they are automatically applied to newly created child files and folders.
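The interplay of the two ACL types can be sketched with a toy in-memory file tree. This is a simplified model, not the ADLS implementation: it ignores the umask and mask entries, and the `Node` class, the `create_child` helper, and the principal name `analysts` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    is_folder: bool
    access_acl: dict = field(default_factory=dict)   # principal -> "rwx"-style string
    default_acl: dict = field(default_factory=dict)  # only meaningful on folders
    children: dict = field(default_factory=dict)


def create_child(parent, name, is_folder):
    """Create a child whose access ACL is copied from the parent's default ACL."""
    child = Node(is_folder=is_folder, access_acl=dict(parent.default_acl))
    if is_folder:
        # Folders also pass the template on to their own future children.
        child.default_acl = dict(parent.default_acl)
    parent.children[name] = child
    return child


root = Node(is_folder=True)
old_file = create_child(root, "old.csv", is_folder=False)  # created before the default ACL

root.default_acl["analysts"] = "r-x"                       # set the template afterwards
new_file = create_child(root, "new.csv", is_folder=False)
new_dir = create_child(root, "2021", is_folder=True)
nested = create_child(new_dir, "07.csv", is_folder=False)

print(old_file.access_acl)  # {} : existing files are not affected
print(new_file.access_acl)  # {'analysts': 'r-x'}
print(nested.access_acl)    # {'analysts': 'r-x'}
```

The file created before the default ACL was set stays untouched, which is why existing data needs a separate, recursive update.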

The main disadvantage of ACLs is the cost of changing permissions on a potentially large number of files. As ADLS is designed for big data scenarios, a large number of files is usually involved. Microsoft added the ability to set ACLs recursively on folders in 2020, but it remains a costly operation.

A common mistake is to set an ACL entry for the object ID of the app registration instead of the object ID of the service principal. When ACLs appear to have no effect, the cause is typically a missing execute permission on one of the parent folders: execute is required on every folder in the hierarchy, starting at the root folder and ending at the direct parent. Setting the read permission on the file alone is not sufficient. It is also important to note that, if a security principal is granted access by an RBAC role assignment, that access cannot be restricted by ACLs, as RBAC roles take precedence over ACLs. For more information on how permissions are evaluated, see the Azure documentation on the access control model in ADLS Gen2.
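The evaluation order described above can be sketched as follows. This is a simplified model under stated assumptions, not the actual ADLS algorithm: it considers a single principal without group membership, owner, or mask entries, and the `acl_allows_read` and `can_read` helpers, the paths, and the principal name are hypothetical.

```python
from pathlib import PurePosixPath


def acl_allows_read(acls, principal, file_path):
    """acls maps a path to {principal: 'rwx'-style string}."""
    path = PurePosixPath(file_path)
    # Execute is needed on every parent folder, from the direct parent up to the root.
    for folder in path.parents:
        if "x" not in acls.get(str(folder), {}).get(principal, "---"):
            return False
    # Only then does the read permission on the file itself matter.
    return "r" in acls.get(str(path), {}).get(principal, "---")


def can_read(rbac_grants_read, acls, principal, file_path):
    """RBAC is checked first; ACLs are only consulted if RBAC does not grant access."""
    if rbac_grants_read:
        return True
    return acl_allows_read(acls, principal, file_path)


acls = {
    "/": {"analysts": "--x"},
    "/raw": {"analysts": "--x"},
    "/raw/sales": {"analysts": "---"},           # execute is missing here
    "/raw/sales/2021.csv": {"analysts": "r--"},
}

print(can_read(False, acls, "analysts", "/raw/sales/2021.csv"))  # False
print(can_read(True, acls, "analysts", "/raw/sales/2021.csv"))   # True
```

The second call shows why ACLs cannot restrict an RBAC grant: once the role allows the read, the ACLs are never evaluated.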

Recommendations for access control

  • Do not enable public read access on the storage account or containers. This could not only leak data to the public, but would also bypass any ACLs that have been set.
  • If finer-grained access control is required, plan your folder structures before ingesting data and set the default ACLs on folders, which will automatically set the permissions during ingestion.
  • Grant access to user groups instead of setting the ACLs for individual users. Access is then controlled by adding the user or service principal to the respective AD group instead of having to apply the updated ACLs recursively on a potentially large number of files.
  • Use DevOps pipelines to set up and update your permissions. Manual setting by administrators should be reduced to a minimum to avoid errors and enforce security.
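As a sketch of how a pipeline step might prepare group-based permissions, the helper below builds a short-form ACL string that covers both the access ACL and the default ACL of a folder. The `group_acl_spec` function and the object ID are hypothetical. The `group:<object-id>:<permissions>` layout with a `default:` prefix follows the POSIX-style short form that Azure tooling generally accepts, but the exact format expected by your tool should be verified before use.

```python
import re


def group_acl_spec(group_object_id, permissions, include_default=True):
    """Build a short-form ACL string for an AD group, e.g. 'group:<id>:r-x'."""
    if not re.fullmatch(r"[r-][w-][x-]", permissions):
        raise ValueError(f"invalid permission string: {permissions!r}")
    entries = [f"group:{group_object_id}:{permissions}"]
    if include_default:
        # The default entry acts as the template for future children of the folder.
        entries.append(f"default:group:{group_object_id}:{permissions}")
    return ",".join(entries)


# Hypothetical object ID of an AD group.
print(group_acl_spec("11111111-2222-3333-4444-555555555555", "r-x"))
```

Because the ACL refers to the group and not to individual users, granting or revoking access later only requires a change in group membership, not another recursive ACL update.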

Other access control mechanisms

As mentioned in the introduction, this article focuses on Azure-AD-based access control mechanisms. For applications that cannot use Azure AD, shared access signatures (SAS tokens) or access keys can be used. Access keys should only be used if SAS tokens are not an option, as they grant full access to the storage account. Neither method is bound to a security principal, so tokens and keys can easily be passed on to other users. They should therefore be stored securely and rotated regularly.

In addition to the authorization mechanisms presented in this article, access can also be restricted at the networking level by using firewall rules. This adds another layer of security to the implementation.
