Analytics and the importance of the data lake on Azure

Nicholas Hurt
Microsoft Azure
Apr 9, 2019

When embarking on an analytics project, a lake is unlikely to be the first thing that comes to mind. The “lake” in this context is a cloud-based repository containing both raw and curated data, used in the process of building an analytics platform. Generally, people initially think about BI tools, data integration and data warehouses. They may refer to recent publications or announcements, such as the Magic Quadrant and Data Management Solutions for Analytics (DMSA) by Gartner.

Source: https://info.microsoft.com/ww-landing-gartner-mq-bi-analytics-2019.html

Or the recent cloud data warehouse performance (TPC-DS) benchmarking published by GigaOm.

Source: GigaOm

They may have also come across the blog post by the corporate VP of Azure, announcing leading price-performance across the rest of the analytics stack on Azure. Whilst these are all extremely relevant, what some may have missed was the subtle yet important announcement that Azure’s next generation data lake storage (ADLS Gen2) has now reached general availability.

The data services mentioned in this post make up the modern data warehouse (DW) architecture on Azure, depicted below, but the rationale for choosing such an architecture to support a wider data strategy is not always well understood, particularly the role of the data lake.

Modern Data Warehouse on Azure — End to End Analytics

As companies endeavour to become more data centric and data driven, the need for a sound data lake strategy becomes increasingly important. Recognised by Gartner as part of the extended logical data warehouse design, its primary purpose is to drive innovation and facilitate exploration.

Source: Gartner — The Data and Analytics Infrastructure Model

Breaking down data silos, providing friction-free access to high-fidelity data, data exploration using analytical sandboxes, and predictive analytics — these are just some of the ways to drive this innovation and unlock insights across vast amounts of data from multiple sources. Data analysts and data scientists are typically the personas involved, and they require a platform and tools which enable them to do their best work. Data engineers can also benefit from a platform on which to offload high-intensity workloads from the data warehouse, whose premium yet finite and precious resources are best utilised for serving analytics.

The data lake plays a crucial role in supporting these activities.

Often though, the data lake is an after-thought. By the time people realise why they need one, they already have vast quantities of raw, transient and curated data living in a large data warehouse or database. As data volumes grow and velocity increases, so do the challenges, almost exponentially. In summary, some of these are:

Increasing costs and effort

  • Storing all historical and raw data becomes costly, ultimately data is archived or deleted to limit costs.
  • Data warehouses are expensive for good reason: they utilise premium storage and proprietary technology to provide high performance analytics, yet they have finite resources.
  • Large data warehouses have high TCO including backups, administration, maintenance and expertise required.
  • Data requires upfront understanding and modelling (schema-on-write).

Barriers to access

  • Access to the data (for exploration and discovery) must go through the DW’s dedicated compute layer, even for simple data retrieval or loading activities — storage and compute are tightly coupled.
  • Does not facilitate simple, low-cost, low-impact sharing of data sets.
  • Limits ease of exploration — queries may impact production systems or require a DBA to provide access.

Limited insights

  • Low fidelity: data is usually curated, aggregated, partial or incomplete (to keep costs down).
  • DW technologies may not easily or efficiently support large-scale analysis of data in raw, semi-structured formats such as JSON.

Impact and scale

  • Transformation (as part of ETL) workloads and data sharing cannot scale without significantly increasing cost.
  • Mixed workloads (other than analytics) compete for resources.
  • DW has limited concurrency for a given cost.
  • DW resources should be conserved for consistent low-latency, high performance BI style analytics rather than high intensity ELT workloads.

Essentially, the data warehouse has over time taken on many of the characteristics of a data lake; conversely, we find that data lakes in general can provide the following benefits:

Lowest cost of ownership

  • Cloud storage is massively scalable and low cost — compared to premium storage used for data warehousing
  • Economical — separation of storage and compute
  • Store everything, discard nothing, understand later without significant cost or effort upfront (schema-on-read; a short sketch follows this list)
  • High levels of availability and redundancy
  • Share data more simply and without impacting mission critical systems
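To illustrate schema-on-read, here is a minimal PySpark sketch. It assumes raw JSON files have simply been landed in the lake as-is; the storage account, filesystem and path are placeholders rather than a definitive layout.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Schema-on-read: the raw JSON was landed in the lake untouched, with no
# upfront modelling. Spark infers the structure only when the data is read.
raw_events = spark.read.json(
    "abfss://raw@mydatalake.dfs.core.windows.net/events/2019/04/"
)

raw_events.printSchema()   # inspect the inferred schema
raw_events.show(5)         # peek at a few records before deciding how to model them
```

Nothing about the data had to be understood or modelled before it was stored; the cost of interpretation is deferred until the data is actually needed.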

Data democratisation and deeper insights

  • Provides friction-free access to data, promotes self service
  • Facilitates building up and tearing down of analytical sandbox and prototype environments quickly
  • Stores high-fidelity data — combining various data sources with full history can yield deeper insights. Data scientists often want as much data as possible, in its original format.
  • Increased access (concurrency) can be scaled by adding compute as required
  • Data lakes are now commonly used as the source of truth

If the data lake can provide so many benefits, why isn’t it as popular as the data warehouse?

Historically the reason may have been the skills, time and upfront investment required. Hadoop, more specifically HDFS, was a common choice, as it was one of the few technologies which could handle the volume and variety of data. Hadoop data lakes living on-premises or in data centres were typically reserved for the large tech companies which could make such an investment in hardware and which possessed the skills. Similarly, having data scientists who could build machine learning models was a privilege of the few organisations with the breadth and volume of data, and the expertise, to do so.

Arguably, Spark is currently the best way to process data and run exploratory analytics on data in the lake. Unfortunately, not every organisation feels it has the expertise or capability to run Spark-based workloads. This perceived skills gap is still, to this day, one of the main reasons why so many avoid a data lake strategy altogether. However, we are reaching an exciting point in the maturity of these big data technologies and integrated cloud-based data services, which can help bridge the gap and accelerate the journey toward a real data lake. Let’s begin to explore some of these services available on Azure, and how they play a vital role in the data lake ecosystem.

Azure Data Lake Storage

Azure is the only cloud vendor to offer a data lake storage service that is purpose built for big data analytics. Gen2 provides the best of both storage paradigms: object storage and a hierarchical filesystem.

ADLS Gen2 aims to offer fast, secure, scalable cloud storage for analytics.

One of the most notable features is the hierarchical namespace which allows data to be organised like a filesystem with a hierarchy of directories. This has three advantages over typical object storage which other cloud vendors provide for the purposes of building a data lake:

  1. A hierarchy of directories can efficiently be arranged to represent the various zones, business units, projects and provenance of the data in the lake.
  2. Big data frameworks such as Spark and Hive were built with an implicit assumption that the underlying storage service is a hierarchical filesystem e.g. HDFS. When a directory is renamed marking the end of a job, traditional cloud-based object stores turn this into an O(n) complex operation viz. n copies and n deletes which dramatically impacts performance. In ADLS this rename is an instant, single atomic metadata operation.
  3. Azure Active Directory security groups can be assigned to ACLs, which can then be applied to folders and files to provide layers of security and governance to the data residing in the lake (a rough sketch follows this list).
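As a rough illustration of these three points, here is a sketch using the azure-storage-file-datalake Python package. The account, filesystem, paths and group object id are all placeholders, not a definitive implementation.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account URL and credential; a real deployment would use
# Azure AD authentication or a key retrieved from secure configuration.
service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="<storage-account-key>",
)
lake = service.get_file_system_client("lake")

# 1. Directories model the zones, business units and provenance of the data
raw_dir = lake.create_directory("raw/sales/2019/04")

# 2. A directory rename is a single atomic metadata operation (no copy-and-delete)
staging = lake.get_directory_client("staging/job-output")
staging.rename_directory("lake/curated/sales/2019/04")

# 3. POSIX-style ACLs can reference an Azure AD security group by object id
raw_dir.set_access_control(
    acl="user::rwx,group::r-x,other::---,group:<aad-group-object-id>:r-x"
)
```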

How and where the data is stored is just the beginning. This step alone is not going to unlock value — simply google “data is not the new oil”. To orchestrate, move and process data in the lake, organisations will need a productive set of tools and technologies to complement their level of expertise.

Azure Data Factory and Databricks

With the advent of managed big data services in the cloud (such as HDInsight and Databricks) which can process data residing in low-cost, scalable cloud-based storage such as ADLS, the data lake has certainly become more attainable. Combine this with Azure Data Factory (ADF), a graphical (GUI-based), code-free, fully managed data integration service in the cloud, and you have a powerful combination.

One can use ADF to orchestrate and move data into and out of the data lake. Additionally, with Mapping Data Flows in ADF (in preview at the time of writing), one can visually design, build and manage data transformation processes. Ultimately these execute as Spark jobs running on Databricks clusters, so you get the power and scale of Spark without having to learn Spark or have a deep understanding of its distributed infrastructure!

ADF Mapping Data Flows

Whilst tools such as Mapping Data Flows may accelerate your data lake implementation, any engineer or ETL developer who understands SQL should attempt the high-level structured APIs in Spark and Spark SQL. Using a combination of the dataframe APIs (which are simple and don’t require any deep Scala or Python knowledge) and SQL, one can express processing logic to build up data transformation pipelines. For those interested in PySpark there is an excellent introduction here.
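As a flavour of how approachable this can be, here is a minimal sketch combining the dataframe API with Spark SQL; the lake paths, table and column names are made up purely for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative lake paths and columns; adapt to your own lake layout
orders = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/orders/")
customers = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/customers/")

# Dataframe API: filter, join and derive columns without deep Scala or Python knowledge
enriched = (
    orders
    .filter(F.col("status") == "COMPLETE")
    .join(customers, "customer_id")
    .withColumn("order_month", F.date_trunc("month", F.col("order_date")))
)

# Spark SQL: express the aggregation in familiar SQL
enriched.createOrReplaceTempView("enriched_orders")
monthly_revenue = spark.sql("""
    SELECT order_month, country, SUM(amount) AS revenue
    FROM enriched_orders
    GROUP BY order_month, country
""")

monthly_revenue.write.mode("overwrite").parquet(
    "abfss://curated@mydatalake.dfs.core.windows.net/monthly_revenue/"
)
```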

Having a data warehousing background myself, I recently blogged about my experience of developing ETL routines using Spark through the lens of a traditional data warehouse developer. In particular, I focused on the use of Databricks Delta, which can simplify data processing pipelines and bring data reliability and performance to data lakes. Delta’s ability to compact small files into larger ones optimises the throughput and cost effectiveness of ADLS.
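An indicative sketch of that pattern is below; the paths are placeholders, and the OPTIMIZE compaction command was, at the time of writing, a Databricks Delta feature rather than open-source Spark.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Convert a curated dataset to Delta format (illustrative path)
orders_enriched = spark.read.parquet(
    "abfss://curated@mydatalake.dfs.core.windows.net/orders_enriched/"
)
(orders_enriched.write
    .format("delta")
    .mode("overwrite")
    .save("abfss://curated@mydatalake.dfs.core.windows.net/delta/orders_enriched"))

# Compact many small files into larger ones to improve ADLS throughput and cost
spark.sql(
    "OPTIMIZE delta.`abfss://curated@mydatalake.dfs.core.windows.net/delta/orders_enriched`"
)
```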

I have been asked before whether BI tools can and should connect directly to Spark tables. The answer is that they can, but Spark and Databricks should not be thought of as the serving layer for typical interactive BI analytics. That approach would become too costly and lack consistent response times as the user base increases. This is the purpose of the data warehouse.

Azure SQL Data Warehouse

Once data has been transformed (e.g. into a star schema) and enriched (sometimes aggregated), it is ready to be moved into the warehouse for high-performance, interactive dashboard analytics. Azure SQL Data Warehouse provides unrivalled price-performance, often outperforming the competition, and offers advanced security.
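As a sketch of that final hop, the SQL DW connector available in Azure Databricks can stage data through the lake and load it into the warehouse. Every name, URL and credential below is a placeholder, and the exact options may vary by connector version.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the curated, presentation-ready dataset from the lake (illustrative path)
monthly_revenue = spark.read.format("delta").load(
    "abfss://curated@mydatalake.dfs.core.windows.net/delta/monthly_revenue"
)

# Load it into Azure SQL DW via the connector, staging through the lake (tempDir)
(monthly_revenue.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydw")
    .option("user", "<sql-user>")
    .option("password", "<sql-password>")
    .option("tempDir", "abfss://staging@mydatalake.dfs.core.windows.net/sqldw")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "presentation.MonthlyRevenue")
    .mode("overwrite")
    .save())
```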

Source: https://gigaom.com/report/data-warehouse-cloud-benchmark/

There are a number of interesting new features in preview such as data discovery and classification and workload importance. We can expect many more to come, particularly to further support the broader modern data warehouse pattern.

An Integrated Ecosystem

When starting a data lake project, the tools and services are no good in isolation. The deep, engineering-led integration of these components will become essential to minimise complexity and overhead, so that more time can be spent gaining insights from your data. Here are some other announcements and resources which reinforce Azure’s commitment to providing a robust ecosystem of tightly integrated services.

Azure Databricks and Azure SQL DW with both batch and streaming support

ADF connector for ADLS Gen2

Load data into SQL DW using ADF

Azure Databricks also provides an excellent platform to run advanced analytics across the data lake to support your data science activities.

The Azure Machine Learning service SDK can be integrated into the Azure Databricks environment to seamlessly extend it for experimentation, model deployment and management.
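For example, a rough sketch of experiment tracking with the Azure ML SDK (azureml-core) from a Databricks notebook might look as follows; the workspace details and metric value are purely illustrative.

```python
from azureml.core import Workspace, Experiment

# Placeholder workspace details; in practice these would come from secrets or config
ws = Workspace.get(
    name="my-aml-workspace",
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
)

experiment = Experiment(workspace=ws, name="churn-prediction")
run = experiment.start_logging()

# ... train a model here on data read from the lake with Spark ...

run.log("auc", 0.87)   # illustrative metric
run.complete()
```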

For announcements of further integration across these services stay tuned to the updates page.

Conclusion

Using first-party services from Azure indeed provides an excellent platform on which to start building one’s data lake. There are other topics, beyond the scope and time of this blog, which should also be considered: concepts such as data discovery and self service, quality, cataloguing and classification, lineage tracking and governance. These can increase the chance of success and maximise the benefits of the lake, but they are unique to every scenario.

Success of the data lake does not solely depend on the technology, however; often an internal transformation is required — a cultural shift to embrace this data-centric focus. Finding and understanding data in the lake becomes an essential enabler, and in particular any tribal knowledge needs to be converted into readily accessible information. This can be done through designated officials such as data stewards or owners, as well as through crowdsourcing and automation, but there needs to be a change in mindset throughout the organisation to embrace a culture of sharing and joint responsibility.
