Building your Data Lake on Azure Data Lake Storage gen2

Nicholas Hurt
Published in Microsoft Azure
Mar 1, 2020

Introduction

Planning a data lake may seem like a daunting task at first - deciding how best to structure the lake, which file formats to choose, whether to have multiple lakes or just one, how to secure and govern the lake. Not all of these need to be answered on day one and some may be determined through trial and error. There is no definitive guide to building a data lake and each scenario is unique in terms of ingestion, processing, consumption and governance.

In a previous blog I covered the importance of the data lake and Azure Data Lake Storage (ADLS) gen2, but this blog aims to provide guidance to those who are about to embark on their data lake journey, covering the fundamental concepts and considerations of building a data lake on ADLS gen2.

Data Lake Planning

Structure, governance and security are key aspects which require an appropriate amount of planning relative to the potential size and complexity of your data lake. Consider what data is going to be stored in the lake, how it will get there, its transformations, who will be accessing it, and the typical access patterns. This will influence the structure of the lake and how it will be organised. Then consider who will need access to which data, and how to group these consumers and producers of data. Planning how to implement and govern access control across the lake will be well worth the investment in the long run.

If your data lake is likely to start out with a few data assets and only automated processes (such as ETL offloading) then this planning phase may be a relatively simple task. Should your lake contain hundreds of data assets and involve both automated and manual interaction, then planning will certainly take longer and require more collaboration from the various data owners.

Most people by now are probably all too familiar with the dreaded “data swamp” analogy. According to Blue Granite, “hard work, governance, and organization” are the keys to avoiding this situation. Of course, it may be impossible to plan for every eventuality in the beginning, but laying down solid foundations will increase the chance of continued data lake success and business value in the long run.

A robust data catalog system also becomes ever-more critical as the size (number of data assets) and complexity (number of users or departments) of the data lake increases. The catalog will ensure that data can be found, tagged and classified for those processing, consuming and governing the lake. This important concept will be covered in further detail in another blog.

Data Lake Structure — Zones

This has to be the most frequently debated topic in the data lake community, and the simple answer is that there is no single blueprint for every data lake — each organisation will have its own unique set of requirements. A simple approach may be to start with a few generic zones (or layers) and then build out organically as more sophisticated use-cases arise. The zones outlined below are often called different things, but conceptually they have the same purpose — to distinguish the different states or characteristics of the data as it flows through the lake, usually in terms of both business value and the consumers accessing that data.

Raw zone

Using the water-based analogy, think of this layer as a reservoir which stores data in its natural, originating state — unfiltered and unpurified. You may choose to store it in its original format (such as JSON or CSV) but there may be scenarios where it makes more sense to store it as a column in a compressed format such as Avro, Parquet or Databricks Delta Lake. This data is always immutable; it should be locked down and permissioned as read-only to any consumers (automated or human). The zone may be organised using a folder per source system, with each ingestion process having write access to only its associated folder.

As this layer usually stores the largest amount of data, consider using lifecycle management to reduce long-term storage costs. At the time of writing, ADLS gen2 supports moving data to the cool access tier either programmatically or through a lifecycle management policy. The policy defines a set of rules which run once a day and can be assigned at the account, filesystem or folder level. The feature itself is free, although the tiering operations will incur a cost.
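As a rough illustration, a policy that moves blobs under the raw zone’s General folder to the cool tier 90 days after their last modification might look something like the sketch below. It is expressed here as a Python dict mirroring the policy JSON; the filesystem/prefix name ("raw/General") and the 90-day threshold are example values only, not a recommendation.

# A minimal sketch of a lifecycle management policy, expressed as a Python dict
# mirroring the policy JSON. Saved as JSON, it can be applied via the portal,
# Azure CLI or an ARM template. Names, prefixes and thresholds are examples.
lifecycle_policy = {
    "rules": [
        {
            "enabled": True,
            "name": "cool-raw-general-after-90-days",
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    # Prefix matching scopes the rule to a filesystem and/or folder
                    "prefixMatch": ["raw/General"]
                },
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 90}
                    }
                }
            }
        }
    ]
}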

Cleansed zone

The next layer can be thought of as a filtration zone which removes impurities but may also involve enrichment.

Typical activities found in this layer are schema and data type definition, removal of unnecessary columns, and application of cleansing rules, whether validation, standardisation or harmonisation. Enrichment processes may also combine data sets to further improve the value of insights.

The organisation of this zone is usually driven more by business needs than by source system — typically this could be a folder per department or project. Some may also consider this a staging zone, which is normally permissioned only for the automated jobs which run against it. Should data analysts or scientists need access to the data in this form, they could be granted read-only access.

Curated zone

This is the consumption layer, which is optimised for analytics rather than data ingestion or data processing. It may store data in denormalised data marts or star schemas as mentioned in this blog. The dimensional modelling is preferably done using tools like Spark or Data Factory rather than inside the database engine. Should you wish to make the lake the single source of truth then this becomes a key point. If the dimensional modelling is done outside of the lake, i.e. in the data warehouse, then you may wish to publish the model back to the lake for consistency. Either way, a word of caution: don’t expect this layer to be a replacement for a data warehouse. Typically the performance is not adequate for responsive dashboards or end-user/consumer interactive analytics. It is best suited for internal analysts or data scientists who want to run large-scale ad-hoc queries, analysis or advanced analytics, and who do not have strict, time-sensitive reporting needs. As storage costs are generally lower in the lake compared to the data warehouse, it may be more cost effective to keep granular, low-level data in the lake and store only aggregated data in the warehouse. These aggregations can be generated by Spark or Data Factory and persisted to the lake prior to loading the data warehouse.
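As a rough sketch of that last point, a PySpark job (for example in a Databricks notebook, where a spark session already exists) might aggregate granular cleansed data and persist the result to the curated zone before it is loaded into the warehouse. The storage account, filesystems, paths and column names below are purely illustrative.

from pyspark.sql import functions as F

# Illustrative paths; substitute your own storage account, filesystems and folders.
cleansed_path = "abfss://cleansed@mydatalake.dfs.core.windows.net/sales/transactions"
curated_path = "abfss://curated@mydatalake.dfs.core.windows.net/sales/daily_revenue"

# Read the granular, cleansed transactions and aggregate them per day and store.
daily_revenue = (
    spark.read.parquet(cleansed_path)
        .groupBy("TransactionDate", "StoreId")
        .agg(F.sum("Amount").alias("Revenue"),
             F.count("*").alias("TransactionCount"))
)

# Persist the aggregate to the curated zone; the data warehouse load reads from here.
daily_revenue.write.mode("overwrite").parquet(curated_path)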

Data assets in this zone are typically highly governed and well documented. Permission is usually assigned by department or function and organised by consumer group or by data mart.

Laboratory zone

This is the layer where exploration and experimentation occurs. Here, data scientists, engineers and analysts are free to prototype and innovate, mashing up their own data sets with production data sets. This is similar to the notion of self-service analytics (BI) which is useful during the initial assessment of value. This zone is not a replacement for a development or test data lake, which is still required for more rigorous development activities following a typical software development lifecycle.

Each lake user, team or project will have their own laboratory area by way of a folder, where they can prototype new insights or analytics, before they are agreed to be formalised and productionised through automated jobs. Permissions in this zone are typically read and write per user, team or project.

In order to visualise the end-to-end flow of data, the personas involved, and the tools and concepts in one diagram, the following may be of help…

Concepts, tools, & personas in the Data Lake

The sensitive zone was not mentioned previously because it may not be applicable to every organisation, hence it is greyed out, but it is worth noting that this may be a separate zone (or folder) with restricted access.

The reason why data scientists are greyed out in the raw zone is that not all data scientists will want to work with raw data, as it requires a substantial amount of data preparation before it is ready to be used in machine learning models. Equally, analysts do not usually require access to the cleansed layer, but each situation is unique and it may occur.

Folder structure/Hierarchy

An appropriate folder hierarchy will be as simple as possible but no simpler. Folder structures should have:

  • a human-readable, understandable, consistent, self-documenting naming convention
  • sufficiently granular permissions, but not at a depth that will generate additional overhead and administration
  • partitioning strategies which can optimise access patterns and appropriate file sizes. Particularly in the curated zones, plan the structure based on optimal retrieval, but be cautious of choosing a partition key with high cardinality, which leads to over-partitioning and, in turn, suboptimal file sizes
  • files of the same schema and the same format/type within each folder

Whilst many use time-based partitioning, there are a number of options which may provide more efficient access paths. Some other options you may wish to consider are subject area, department/business unit, downstream app/purpose, retention policy, freshness or sensitivity.

The raw zone may be organised by source system, then entity. Here is an example folder structure, optimal for folder security:

\Raw\DataSource\Entity\YYYY\MM\DD\File.extension
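To make the convention concrete, here is a minimal sketch of a daily ingestion step using the azure-storage-file-datalake Python SDK. The account, filesystem, source and entity names are hypothetical, and the credential placeholder would be replaced with a key, SAS token or AAD credential in practice.

from datetime import date
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical account and filesystem; replace the credential placeholder appropriately.
service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="<account-key-or-AAD-credential>")
filesystem = service.get_file_system_client("raw")

# Build the DataSource/Entity/YYYY/MM/DD folder for today's load and upload the extract.
today = date.today()
folder = filesystem.get_directory_client(f"SalesDB/Customers/{today:%Y/%m/%d}")
folder.create_directory()

with open("customers.csv", "rb") as data:
    folder.get_file_client("customers.csv").upload_data(data, overwrite=True)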

Typically each source system will be granted write permissions at the DataSource folder level with default ACLs (see section on ACLs below) specified. This will ensure permissions are inherited as new daily folders and files are created. In contrast, the following structure can become tedious for folder security as write permissions will need to be granted for every new daily folder:

\Raw\YYYY\MM\DD\DataSource\Entity\File.extension

Sensitive sub-zones in the raw layer can be separated by top-level folder. This will allow one to define a separate lifecycle management policy using rules based on prefix matching. For example:

\Raw\General\DataSource\Entity\YYYY\MM\DD\File.extension
\Raw\Sensitive\DataSource\Entity\YYYY\MM\DD\File.extension

Be sure to keep an open mind during this planning phase. Folders or zones do not always need to reside in the same physical data lake — they could also manifest themselves as separate filesystems or different storage accounts, even in different subscriptions. In particular, if a single zone is likely to have huge throughput requirements which may exceed a request rate of 20,000 requests per second, then multiple physical lakes (storage accounts) in different subscriptions would be a sensible idea. See the following section, “How many data lakes, storage accounts & filesystems do I need?”, for more details.

How many data lakes, storage accounts & filesystems do I need?

A common design consideration is whether to have single or multiple data lakes, storage accounts and filesystems. The data lake itself may be considered a single logical entity, yet it might comprise multiple storage accounts in different subscriptions in different regions, with either centralised or decentralised management and governance. Whatever the physical implementation, the benefit of using a single storage technology is the ability to standardise across the organisation, with numerous ways in which to access the data. Whilst ADLS gen2 is a fully managed PaaS service, and having multiple storage accounts or filesystems does not incur any monetary cost until you start to store and access data, there is an administrative and operational overhead associated with each resource in Azure to ensure that provisioning, security and governance (including backups and DR) are maintained appropriately. The question of whether to create one or multiple accounts has no definitive answer; it requires thought and planning based on your unique scenario. Some of the most important considerations might be:

  • Planning large-scale enterprise workloads may require significant throughput and resources. Considering the various subscription and service quotas may influence your decision to split the lake physically across multiple subscriptions and/or storage accounts. See the addendum for more information.
  • Regional vs global lakes. Globally distributed consumers or processes on the lake may be sensitive to latency caused by geographic distances and therefore require the data to reside locally. Regulatory constraints or data sovereignty may often prevent data from leaving a particular region. These are just a few reasons why one physical lake may not suit a global operation.
  • Global enterprises may have multiple regional lakes but need to obtain a global view of their operations. A centralised lake might collect and store regionally aggregated data in order to run enterprise-wide analytics and forecasts.
  • Billing and organisational reasons. Certain departments or subsidiaries may require their own data lake due to billing or decentralised management reasons.
  • Environment isolation and predictability. Even though ADLS gen2 offers excellent throughput, there are still limits to consider. For example, one may wish to isolate the activities running in the laboratory zone from potential impact on the curated zone, which normally holds data with greater business value used in critical decision making.
  • Features and functionality at the storage account level. If you want to make use of options such as lifecycle management or firewall rules, consider whether these need to be applied at the zone or data lake level.

Whilst there may be many good reasons to have multiple storage accounts, one should be careful not to create additional silos, thereby hindering data accessibility and exploration. Take care to avoid duplicate data projects due to lack of visibility or knowledge-sharing across the organisation. This is all the more reason to ensure that a centralised data catalogue and project tracking tool are in place. Fortunately, data processing tools and technologies like ADF and Databricks (Spark) can easily interact with data across multiple lakes, so long as permissions have been granted appropriately. For information on the different ways to secure ADLS from Databricks users and processes, please see the following guide.

HNS, RBAC & ACLs

It should be reiterated that ADLS gen2 is not a separate service (as was gen1) but rather a normal v2 storage account with Hierarchical Namespace (HNS) enabled. A standard v2 storage account cannot be migrated to ADLS gen2 afterwards — HNS must be enabled at the time of account creation. Without HNS, the only mechanism to control access is role-based access control (RBAC) at container level, which, for some, does not provide sufficiently granular access control. With HNS, RBAC is typically used for storage account admins whereas access control lists (ACLs) specify who can access the data, but not the storage account level settings. RBAC permissions are evaluated at a higher priority than ACLs, so if a user’s role assignment already grants the required access, the ACLs will not be evaluated. If this all sounds a little confusing, I would highly recommend you understand both the RBAC and ACL models for ADLS covered in the documentation. Another great place to start is Blue Granite’s blog.

Managing Access

As mentioned above, access to the data is implemented using ACLs, through a combination of execute, read and write permissions at the appropriate folder and file level. Execute is only used in the context of folders, and can be thought of as search or list permission for that folder.

The easiest way to get started is with Azure Storage Explorer: navigate to the folder and select “Manage Access”. In production scenarios, however, it is always recommended to manage permissions via a script which is version controlled. See here for some examples.

It is important to understand that in order to access (read or write) a folder or file at a certain depth, execute permissions must be assigned to every parent folder all the way back up to the root level as described in the documentation. In other words, a user (in the case of AAD passthrough) or service principal (SP) would need execute permissions to each folder in the hierarchy of folders that lead to the file.
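As a minimal sketch (using the azure-storage-file-datalake Python SDK, with hypothetical folder names and a placeholder AAD group object ID), granting a group read access to a folder a few levels deep might involve execute entries on each parent plus read/execute and default entries on the target folder. Note that set_access_control replaces the folder’s entire ACL, so the base entries are included:

from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical names; ACL entries should reference an AAD group, not individuals.
service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="<account-key-or-AAD-credential>")
filesystem = service.get_file_system_client("raw")
group_oid = "<aad-group-object-id>"

# Execute (--x) on every parent folder so the group can traverse down to the data.
for parent in ["SalesDB", "SalesDB/Customers"]:
    filesystem.get_directory_client(parent).set_access_control(
        acl=f"user::rwx,group::r-x,other::---,group:{group_oid}:--x")

# Read + execute on the target folder, plus a default entry so that new child
# folders and files created afterwards inherit the same permission.
filesystem.get_directory_client("SalesDB/Customers/2020").set_access_control(
    acl=f"user::rwx,group::r-x,other::---,"
        f"group:{group_oid}:r-x,default:group:{group_oid}:r-x")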

Resist assigning ACLs to individuals or service principals

When using ADLS, permissions can be managed at the directory and file level through ACLs, but as per best practice these should be assigned to groups rather than individual users or service principals. There are two main reasons for this: i) changing ACLs can take time to propagate if there are thousands of files, and ii) there is a limit of 32 ACL entries per file or folder. This is a general Unix-based limit, and if you exceed it you will receive an internal server error rather than an obvious error message. Note that each ACL already starts with four standard entries (the owning user, the owning group, the mask, and other), so this leaves only 28 remaining entries accessible to you, which should be more than enough if you use groups…

“ACLs with a high number of ACL entries tend to become more difficult to manage. More than a handful of ACL entries are usually an indication of bad application design. In most such cases, it makes more sense to make better use of groups instead of bloating ACLs.”

Equally important is the way in which permission inheritance works:

“…permissions for an item are stored on the item itself. In other words, permissions for an item cannot be inherited from the parent items if the permissions are set after the child item has already been created. Permissions are only inherited if default permissions have been set on the parent items before the child items have been created.”

In other words, default permissions are applied to new child folders and files so if one needs to apply a set of new permissions recursively to existing files, this will need to be scripted. See here for an example in PowerShell.
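Newer versions of the azure-storage-file-datalake Python SDK also expose a recursive update which merges entries into the ACLs of existing children, and can be used in place of a hand-rolled script. A minimal sketch, reusing the hypothetical names from the earlier snippet:

from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical names, as in the earlier sketch.
filesystem = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="<account-key-or-AAD-credential>").get_file_system_client("raw")
group_oid = "<aad-group-object-id>"

# Merge (rather than replace) a group entry into the ACLs of every existing folder
# and file under SalesDB/Customers, including a default entry for future children.
filesystem.get_directory_client("SalesDB/Customers").update_access_control_recursive(
    acl=f"group:{group_oid}:r-x,default:group:{group_oid}:r-x")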

The recommendation is clear — planning and assigning ACLs to groups beforehand can save time and pain in the long run. Users and service principals can then be efficiently added and removed from groups in the future as permissions need to evolve. If for some reason you decide to throw caution to the wind and add service principals directly to the ACL, then please be sure to use the object ID (OID) of the service principal and not the OID of the registered application, as described in the FAQ. You may wish to consider writing various reports to monitor and manage ACL assignments and cross-reference these with Storage Analytics logs.

File Formats & File Size

As data lakes have evolved over time, Parquet has emerged as the most popular storage format for data in the lake. Depending on the scenario or zone, it may not be the only format chosen — indeed one of the advantages of the lake is the ability to store data in multiple formats, although it is best (though not essential) to stick to a particular format in each zone, primarily for consistency for the consumers of that zone.

Choosing the most appropriate format will often be a trade-off between storage cost, performance and the tools used to process and consume data in the lake. The type of workload may also influence the decision, such as real-time/streaming, append-only or DML-heavy workloads.

As mentioned previously, lots of small files (KBs in size) generally lead to suboptimal performance and potentially higher costs due to increased read/list operations.

Azure Data Lake Storage Gen2 is optimised to perform better on larger files. Analytics jobs will run faster and at a lower cost.

Costs are reduced due to the shorter compute (Spark or Data Factory) times but also due to optimal read operations. For example, files greater than 4 MB in size incur a lower price for every 4 MB block of data read beyond the first 4 MB. Reading a single file that is 16 MB is therefore cheaper than reading four files of 4 MB each. Read more about Data Lake gen2 storage costs here, and in particular, see the FAQ section at the bottom of the page.

When processing data with Spark, the typical guidance is around 64 MB to 1 GB per file. It is well known in the Spark community that thousands of small files (KBs in size) are a performance nightmare. In the raw zone this can be a challenge, particularly for streaming data, which will typically arrive as smaller files/messages at high velocity. Files will need to be regularly compacted/consolidated, or for those using the Databricks Delta Lake format, OPTIMIZE or even AUTO OPTIMIZE can help. If the stream is routed through Event Hub, the Capture feature can be used to persist the data in Avro files based on time or size triggers. Another technique may be to store the raw data as a column in a compressed format such as Parquet or Avro.
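As a sketch of the compaction approach (paths, formats and the target file count are illustrative; a spark session is assumed, as in a Databricks notebook), a scheduled job might read a day’s worth of small raw JSON messages and rewrite them as a handful of larger Parquet files:

# Illustrative paths; assumes an existing spark session (e.g. a Databricks notebook).
raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/Telemetry/Events/2020/03/01"
compacted_path = "abfss://compacted@mydatalake.dfs.core.windows.net/Telemetry/Events/2020/03/01"

# Read the many small JSON messages landed during the day...
events = spark.read.json(raw_path)

# ...and rewrite them as a small number of larger, compressed Parquet files.
# Eight files is an arbitrary example; aim for roughly 64 MB to 1 GB per file.
events.coalesce(8).write.mode("overwrite").parquet(compacted_path)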

In non-raw zones, read-optimised, columnar formats such as Parquet and Databricks Delta Lake are a good choice. Particularly in the curated zone, analytical performance becomes essential and the advantages of predicate pushdown/file skipping and column pruning can save both time and cost. In the absence of RDBMS-like indexes in lake technologies, big data optimisations are obtained by knowing “where not to look”. As mentioned above, however, be cautious of over-partitioning and do not choose a partition key with high cardinality. Comparisons of the various formats can be found in the blogs here and here.
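To illustrate the partitioning point, a curated-zone write might partition by low-cardinality date columns rather than by a high-cardinality key such as a customer ID. The paths and column names below are illustrative, and a spark session is assumed:

from pyspark.sql import functions as F

# Illustrative paths and columns; assumes an existing spark session.
cleansed_path = "abfss://cleansed@mydatalake.dfs.core.windows.net/sales/transactions"
curated_path = "abfss://curated@mydatalake.dfs.core.windows.net/sales/transactions"

transactions = (
    spark.read.parquet(cleansed_path)
        .withColumn("Year", F.year("TransactionDate"))
        .withColumn("Month", F.month("TransactionDate"))
)

# Partitioning by Year/Month means queries filtering on a period only read the
# relevant folders (Year=2020/Month=3/...); partitioning by a high-cardinality
# key such as CustomerId would instead create a sprawl of tiny files.
transactions.write.partitionBy("Year", "Month").mode("overwrite").parquet(curated_path)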

In summary, with larger data volumes and greater data velocity, file formats are going to play a crucial role in ingestion and analytical performance. In the raw zone, where there is a greater likelihood of an accumulation of smaller files, particularly in IoT-scale scenarios, compression is going to be another important consideration. Leaving files in a raw format such as JSON or CSV may incur a performance or cost overhead. Here are some options to consider when faced with these challenges in the raw layer:

  • Consider writing files in batches and use formats with a good compression ratio such as Parquet, or use a write-optimised format like Avro.
  • Introduce an intermediate data lake zone/layer between raw and cleansed which periodically takes uncompressed and/or small files from raw, and compacts them into larger, compressed files in this new layer. If raw data ever needs to be extracted or analysed, these processes can run more efficiently against this intermediate layer rather than the raw layer.
  • Use lifecycle management to archive raw data to reduce long term storage costs without having to delete data.

Conclusion

There is no one-size-fits-all approach to designing and building a data lake. Some may grow their data lake incrementally, starting quickly by taking advantage of more cost-effective storage and data processing techniques, such as ETL offloading. Others may wish to take the time to consider their own needs in terms of current and future ingestion and consumption patterns, the personas involved, and their security and governance requirements. To avoid unmanageable chaos as the data lake footprint expands, the latter will need to happen at some point, but it should not stall progress indefinitely through “analysis paralysis”. The data lake can facilitate a more data-centric, data-driven culture through the democratisation of data, but to achieve long-term success this should be an organisation-wide commitment, not just an IT-driven project.

I wish you all the best with your data lake journey and would love to hear your feedback and thoughts in the comments section below. Whilst I have taken every care to ensure the information provided is factual and accurate at the time of writing, my experience and research is finite, and ever-evolving technologies and cloud services will change over time.

Addendum — ADLS gen2 considerations

Whilst quotas and limits will be an important consideration, some of these are not fixed and the Azure Storage product team will always try to accommodate your requirements for scale and throughput where possible. At the time of writing, here are the published quotas and items to consider:

  • Maximum storage account capacity: 5 PiB for all regions. This is a default limit which can normally be raised through a support ticket.
  • Maximum request rate: 20,000 requests per second per storage account.
  • Maximum ingress rate: 25 Gbps.
  • Storage accounts per subscription: 250.
  • Maximum access and default ACL entries per file or folder: 32 each. This is a hard limit, hence ACLs should be assigned to groups instead of individual users.
  • See other limits here. Note some default (max) limits or quotas may be increased via support request.
  • Azure services which support ADLS gen2.
  • Blob storage features which are supported.
  • Other important considerations.

Please note that limits, quotas and features are constantly evolving, therefore it is advisable to keep checking the documentation for updates.

Additional Reading

The Enterprise Big Data Lake by Alex Gorelik

https://www.amazon.co.uk/Enterprise-Big-Data-Lake/dp/1491931558
