Building a Solid Foundation for your Cloud Analytics Platform

Congratulations, you’ve finally migrated your data to the cloud! Now what?

Anne Nasato
Slalom Build
22 min read · Aug 8, 2022


When I entered the big data consulting space a few years ago, the vast majority of market demand revolved around helping customers migrate their data from legacy, on-premise systems into the cloud. While there are still several companies making this move, there’s been a noticeable increase in demand for cloud analytics platforms. This demand is largely driven by those customers who have already moved their data into the cloud, are becoming aware of the capabilities they’ve unlocked in doing so, and are looking ahead while asking themselves, “what else ya got?”

Photo by EJ Yao on Unsplash

Well, Here’s What We’ve Got

This article contains key insights learned through experience in helping various organizations establish the foundations of their cloud analytics platforms through modern data architecture principles.

  • The “Modern Cloud Analytics Platform in 3 Parts” section starts off high-level and serves as a good overview of the rest of the content covered in the article.
  • The “Tutorial” lives in a supplementary article (it can be read here) and provides a detailed, hands-on lab to help readers quickly spin up their own proof of concept for a cloud analytics platform.
  • Lastly, “Before You Go” covers components not discussed above, but which should definitely be considered when developing a cloud analytics platform.

Consider this a one-stop guide to building the foundations of your data platform with Azure tools, technologies, and modern data architecture best practices.

It’s no secret that a significant portion of companies’ data is under-utilized. This is partially due to the legacy systems many organizations are used to, which were built to harness data for specific use cases, with little scope for imagination or flexibility beyond those predefined processes. Modern data architectures go against this single-purpose-application mindset and enable greater flexibility with increased scalability and horsepower. With the cloud, it really comes down to what you’re willing to pay for; the budget’s the limit.

Let’s come back to that note on increased flexibility. In this context, modern data architectures can enable a significantly greater number of big data use cases with fewer services deployed. This has a few second-order effects.

For one, database administrators (DBAs) transition from working with their particular flavour of database to a more generic, functionally-focused service. As an example, someone who is traditionally an Oracle DBA may be tasked with administering a general relational database service, which may contain various types of SQL databases. I’ve also helped Oracle (and other traditional) DBAs ramp up on managing NoSQL databases. Part of this ramp-up involves understanding how the cloud performs with features such as regional replicas, strong-versus-weak read-after-write consistency, high availability, and disaster recovery. It also means that operations teams don’t need to worry about monitoring and maintaining so many disparate services. From an administrative and logistics perspective, the cloud also cuts down on time spent processing new hardware and software requests (although scale-in/scale-out request processes should still be in place).

Basically, the cloud breaks down many of the silos which existed in legacy architectures. While there is a learning curve for new cloud users, much of their existing knowledge and experience will translate over well. In fact, it will likely make them more responsible operators within the cloud. Additionally, modern data architectures need not be complex; with a few foundational (yet powerful) components, organizations can create a cloud analytics platform to serve their current and future big data use cases.

Modern Cloud Analytics Platform in 3 Parts

Many legacy data architectures are really just massive data warehouses, giving analytics users access to modeled data (and not much, if anything, else). This setup contributes both to overspending on storage and to the underutilization of data. The rigidity of automatically applying transformation logic and business rules to data severely reduces its utility, eliminating any additional potential held by the raw data. Sometimes, raw data is held in a “staging zone” or equivalent area; this data is often highly restricted and frequently purged.

Traditional data architecture courtesy of https://www.datavirtualizationblog.com/enterprise-data-warehouse-no-longer-suffice-data-driven-world/traditional-data-warehouse-architecture/

Modern data architectures provide more options, at lower cost. Sitting with the concept of a “staging zone” for a moment: imagine storing years’ worth of historical data in this area, and being able to query as well as analyze this data, all at extremely low cost. This is possible via the modern data lake.

Data lakes are far from the most exciting part of modern data architectures, but they are a key (if not the key) component. A data lake serves as the storage layer for a modern data architecture, and in cloud data ecosystems these are highly secure, limitlessly scalable, and economically attractive. The intended use of a data lake is, as previously stated, storage. However, some organizations may choose to perform some modeling within their data lake as well. For mature organizations with strict cloud governance practices, this may be feasible; for organizations new to the cloud, it is advised to keep the transformations within the data lake to a minimum, and save the modeling for the data warehouse.

The data warehouse typically receives more attention than the data lake, and there are a number of reasons for this. Firstly, the data warehouse is often far more expensive than the data lake. As a general rule in the cloud, anything involving compute incurs higher costs. For this reason, we do not advise that you rely on your data warehouse for storage. Let your data lake handle that. Secondly, the (enterprise) data model is housed in the data warehouse; this is what enables business and operational processes to run. The third reason (because we like groups of three) the data warehouse typically receives more attention is because it needs the attention; a data lake can be spun up and largely left alone. Data warehouses need frequent monitoring and tuning, and involve more “moving parts.”

Speaking of moving parts, modern data architectures are built on top of extract-transform-load/extract-load-transform (ETL/ELT) processes to get data to where it needs to be, and in the format in which it needs to be there. These processes are often referred to as pipelines, and these pipelines can handle the E, and the T, and the L as efficiently as you’re willing to pay for. Orchestration of these pipelines can sometimes be left to the pipeline services themselves as well, such as running these processes on a set schedule.

Sample foundational modern data architecture.

There you have it! The cloud analytics platform requires three main services to function; that’s all. Of course, as new use cases arrive (such as streaming data) additional services may need to be added, but they can all be built on top of the foundation outlined here.

This article focuses on the cloud analytics platform built in Azure.

Part 1: Storage Layer

As mentioned above, it can be hard to get excited about the storage layer, or the data lake, in your modern data architecture. Let me try to hype it up anyway, because this is what really enables the magic to happen. In fact, I often joke with our clients that if we just delivered them a well-structured data lake, we’d have done a good job of getting their cloud analytics platform going.

Image courtesy of https://lakefs.io/data-lakes/

Joking aside, the data lake is a major enabler for a variety of cloud services and big data use cases. Many data warehousing solutions actually require a data lake to connect to upon spin-up, including Databricks on Azure and AWS. In the Azure ecosystem, Synapse, AzureML, and HDInsight all require connectivity to an Azure Storage Account upon initialization. Plus, data lakes can securely store virtually any type of data (structured, unstructured, semi-structured) in virtually any format (video, image, Parquet, flat file, etc.) at limitless scale and low cost.

Data Lake Services

Azure Data Lake Storage (ADLS) Gen2 is the recommended storage layer when working in Azure. It is built on top of blob storage, with additional functionality such as enabling a hierarchical namespace with which to structure your data lake.

Data lakes make use of object-level storage, meaning that even though these services appear to have a file-and-folder structure in their interfaces, each “item” within the data lake is actually its own unique entity with its own unique identifier. The “folder structure” is really just a set of common prefixes among disparate objects that gives the appearance of structured groupings. Object-level storage keeps all items, or data, in one massive repository, but as individual objects, so that their physical locations can be distributed across different devices.
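As a small illustration (a hedged sketch using the azure-storage-blob Python SDK, with placeholder account and container names), listing objects by a shared prefix is how a “folder” is actually resolved:

```python
# Sketch only: a "folder" such as raw/sales/ is really a shared prefix on object names.
# The storage account and container names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient

container = ContainerClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    container_name="datalake",
    credential=DefaultAzureCredential(),
)

# Each result is an individual object with its own full name (its unique identifier);
# the "sales/" grouping exists only because these names share a common prefix.
for blob in container.list_blobs(name_starts_with="raw/sales/"):
    print(blob.name)
```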

Data Lake Zones

Data Lakes typically have logical partitions referred to as “zones.” Each zone serves a different function or purpose for the data within it. Different organizations will adopt different zones, depending on the type of data being stored in their data lake as well as the different downstream processes this data feeds into. In fact, it can be hard (if not impossible) to find a single source of truth in terms of which zones to include, and how to label them.

However, there are some commonly used zones worth noting. This list is not exhaustive, and it may be common to see organizations use any combination of these labels and logical divisions within their data lake. The purpose here is to provide an idea of common data lake structures and naming conventions, so that you can determine how to best organize your own data lake.
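To make that concrete, here is one hypothetical layout (not a prescribed standard; the account, source, and dataset names are placeholders), using a container per zone in ADLS Gen2. The Sandbox and Archive Zones referenced here are described further below.

```python
# Hypothetical zone layout, expressed as ADLS Gen2 (abfss) paths.
# One container per zone; prefixes under each zone encode source, entity, and load date.
ZONE_PATHS = {
    "raw":      "abfss://raw@<account>.dfs.core.windows.net/<source>/<entity>/<yyyy>/<mm>/<dd>/",
    "standard": "abfss://standard@<account>.dfs.core.windows.net/<source>/<entity>/",
    "curated":  "abfss://curated@<account>.dfs.core.windows.net/<domain>/<dataset>/",
    "sandbox":  "abfss://sandbox@<account>.dfs.core.windows.net/<user-or-team>/<project>/",
    "archive":  "abfss://archive@<account>.dfs.core.windows.net/<source>/<entity>/",
}
```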

Raw Zone (or Landing Zone)

This section consists of data ingested from source systems “as-is” with no transformations applied. While there are no transformations such as modeling or even anonymization occurring before data lands in this area, it may be necessary to select a storage format for data. This is the case when pulling from a relational database.

When selecting a file format, it is recommended to go with a format that compresses well and is also easy to query, such as Parquet.

It is also worth noting that while “Raw Zone” and “Landing Zone” can be used interchangeably, sometimes a “Landing Zone” refers to a more transient zone in which data lands but does not persist once it has been successfully sent to downstream consumers. This means that some data lakes actually contain both a Landing Zone and a Raw Zone, in this order, where the Landing Zone is ephemeral and the data is persisted in the Raw Zone.
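As a hedged sketch of a Raw Zone load (PySpark in a Synapse Notebook, where `spark` is the ambient session; the server, credential, and path values are placeholders), a relational extract might be landed as Parquet like this:

```python
# Sketch only: pull a table from a relational source over JDBC and land it as-is,
# as Parquet, in the Raw Zone, partitioned by load date. All names are placeholders.
from datetime import date

load_date = date.today().isoformat()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
    .option("dbtable", "dbo.orders")
    .option("user", "<user>")
    .option("password", "<password>")  # in practice, use a Key Vault-backed secret
    .load()
)

(
    orders.write.mode("overwrite")
    .parquet(f"abfss://raw@<account>.dfs.core.windows.net/salesdb/orders/{load_date}/")
)
```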

Standard Zone (or Cleansed Zone or Enriched Zone)

Some organizations may want to capture their source data as-is (or as close to this as possible), and then send it downstream to undergo some preliminary set of transformations. This next “hop” is typically the Standard or Cleansed Zone. Common operations here include converting the storage format to something easily consumed by downstream consumers (such as Apache Avro for streaming use cases), applying a naming pattern to files, or appending or updating columns/rows in datasets.

Curated Zone

This is the zone in which data is “teed up” for downstream consumers beyond the data lake. Final transformations, and perhaps even some lightweight modeling, are applied before data lands in this zone. Note that even if data in this zone is modeled, it is not a replacement for a data warehouse, which is the superior offering for business and operational consumers requiring a unified, aggregated enterprise view of the data. Data in this zone could be sent downstream to a data warehouse (in order to then serve these use cases), or utilized by advanced analytics and data science users for their work.

Data could also be stored in this zone after being modeled in the data warehouse. This is an important point: data lakes are often thought of as coming before the data warehouse, as per a linear analytics platform pattern, but they can also store data after the modeling layer.

Other Zones Worth Noting

In addition to the aforementioned zones, there are a few other zone types worth considering.

Sandbox Zone

This zone is intended for analytics and data science users who leverage the data in the cloud analytics platform to store their own artifacts. Historically, these users may have taken up storage and compute in the enterprise data warehouse for their work. However, anything not strictly used for business/operational processes should not be stored in the data warehouse (unless it is being tested for adoption for enterprise purposes). Using a data warehouse for storage is expensive and could lead to the disorganization of this very important piece of a data ecosystem.

Therefore, it is recommended to provide users with their own namespace within the data lake where they can read from and write to their work artifacts, while leveraging the enterprise data from the data lake and/or the data warehouse.

Archive Zone

While the low cost and high scalability (in addition to storage tiers) of data lakes means that historical data can be kept in nearly any section of the data lake, some organizations may choose to keep data over a certain age (or unused data) in its own zone.

Additionally, in similar fashion to the Curated Zone described above, organizations may also wish to store outdated modeled data from the warehouse in this zone.

New Data Lake Feature: Delta Lake Format

If you’re still not excited about data lakes at this point, I’m a little disappointed. Sure, I’ll admit that it can be a challenge, especially because, since the official coining of the phrase “data lake” in 2010, many of these storage layers have actually turned into something closer to a “data swamp.” In order to take full advantage of the highly-scalable storage offered by this technology, a data lake needs to be well-organized and maintained. Even then, gaining insights from data within a data lake may prove difficult.

For many datasets, having disparate object formats (e.g., Avro, Parquet, CSV) as well as erroneous operations such as partial writes can result in frustrating challenges when attempting to access and make meaning of these objects. One format which helps overcome common obstacles to making the data in a data lake useful is Delta Lake. Better yet, not only does Delta Lake improve upon established data lake architectures; it also enables the development of a newer architecture, the data lakehouse.

Delta Lake is an open-source storage layer that sits on top of a data lake. Delta Lake stores data as Parquet files and is typically processed by a Spark engine. This means that Delta Lake has the storage and query performance efficiency of Parquet format, with all the speed and scale of Spark. Talk about a big data power couple.

On top of a superior storage format and compute capability, Delta Lake’s other offerings not only help prevent a data lake from becoming a data swamp, but also make the data itself easier to write to and read from, therefore enabling more straightforward analysis to occur. The real magic of Delta Lake lies in the transaction log, which is what enables ACID compliance, time travel, and audit history (among other key features).
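As a hedged sketch of those features (PySpark with the delta-spark package, as available on a Synapse Spark pool or in Databricks; `df` and the paths below are placeholders):

```python
# Sketch only: write Delta, then use the transaction log for time travel and audit history.
# Assumes the delta-spark package is available and `df` is an existing DataFrame.
from delta.tables import DeltaTable

path = "abfss://standard@<account>.dfs.core.windows.net/salesdb/orders_delta/"

# Every write is recorded as a new version in the table's transaction log (_delta_log/).
df.write.format("delta").mode("append").save(path)

# Time travel: read the table exactly as it existed at an earlier version.
orders_v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Audit history: one row per commit, with the operation and its parameters.
DeltaTable.forPath(spark, path).history().select(
    "version", "timestamp", "operation", "operationParameters"
).show(truncate=False)
```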

By all means, it is feasible to write directly to Delta Lake format in the Raw Zone. However, if the Raw Zone contains sensitive information that should not be widely accessible, and therefore not query-able, it may not be necessary to deploy Delta Lake format in this zone. It may make more sense in the Standard and/or Curated Zone (or at whichever point data is de-sensitized), so that users can query data in these zones as if it were stored in a relational database, without risk of exposing sensitive data.

In Azure, it is possible to convert to Delta format via Azure Data Factory, Synapse Pipelines, or Azure Databricks. It is possible to create Spark tables and perform operations on this data in Synapse Notebooks. One key decision to make when creating Spark tables on top of Delta data is whether to make them managed or unmanaged (AKA: external) tables.

Comparison of managed vs. unmanaged/external Spark tables

One way to look at this decision: if you are concerned about users dropping Spark tables, go with unmanaged/external tables, as only metadata will be lost while the data itself persists. Alternatively, if this is not a concern and you don’t want to be responsible for specifying the location of the data, go with managed tables.
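As a hedged example of both options (Spark SQL issued from a Synapse Notebook; the database, table, and path names are hypothetical):

```python
# Unmanaged (external) table: Spark tracks only metadata, and the LOCATION clause
# points at Delta files that already live in the data lake. Dropping the table
# removes the metadata but leaves the underlying files intact.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.orders_external
    USING DELTA
    LOCATION 'abfss://curated@<account>.dfs.core.windows.net/sales/orders/'
""")

# Managed table: Spark owns both the metadata and the data's storage location.
# Dropping the table deletes the underlying files as well.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.orders_managed
    USING DELTA
    AS SELECT * FROM lakehouse.orders_external
""")
```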

Final Notes on the Storage Layer

It is advisable to focus resiliency efforts on your storage layer. While a loss of any area of your cloud analytics platform would be highly detrimental, the loss of your data lake could be catastrophic. Load up on disaster recovery and redundancy here, especially if you have to choose. It is possible to restore the rest of your cloud analytics platform from the data lake, but it is highly improbable that this can be said for any other component of your system.

Part 2: Modeling Layer

This is the part of your cloud analytics platform that will consume the most time and attention, and it’s the part that most people get excited about. As the name suggests, the modeling layer (AKA: the data warehouse) is where your modeled data lives. Exactly how that data is modeled will vary from organization to organization, but it will likely include facts and dimensions, in the form of tables and views utilized by recurring business processes in production.

Some of the ways in which a data warehouse in the cloud can improve upon legacy systems of the same name. Image courtesy of https://www.kdnuggets.com/2021/01/cloud-data-warehouse-future-data-storage.html

While this piece of the modern data architecture likely appears the most familiar to those who have been working with legacy on-premise systems, be warned: this ain’t your on-prem data warehouse. The cloud brings all sorts of tuning and configuration options to control the great power that comes with it. With great power comes great responsibility (and potentially great invoices paid to your cloud provider, if not managed properly).

Data Warehousing Services

Microsoft is making a big push towards Azure Synapse Analytics. Synapse appears to be the intended one-stop shop for all things cloud analytics, so it’s only natural that Synapse Dedicated SQL Pool is the warehousing solution offered by Azure. In fact, Synapse Dedicated SQL Pool was formerly Azure SQL DW. Synapse also has a Serverless SQL Pool offering, but this is better suited to ad-hoc analytics on your enterprise model data than to serving as your (as the name suggests) dedicated warehouse engine.

While Synapse seems like the clear choice for your data warehouse service, it’s not the only option. For smaller data volumes and workloads, it may be worth looking to some of Azure’s other data services. This is especially advisable if your organization’s data systems don’t require all of the power of Synapse, and if your organization is new to the cloud. Looking to services such as Azure SQL DB to serve as your warehouse can be a very wise decision, especially as Azure SQL DB provides the option to scale up within the service itself, among its different tiers.

Some organizations may be faced with the decision between Azure SQL DB and Azure SQL Managed Instance. While Managed Instance may be advisable for migrating existing SQL Server workloads to the cloud due to its nearly one-to-one match with non-Azure SQL Server, for organizations looking to develop a new, modern cloud warehouse, Azure SQL DB is advised. Managed Instance, as the name suggests, comes with additional features that your organization actually may not need. This additional functionality can also lead to higher costs, as well as an increased risk of incurring runaway costs over Azure SQL DB.

As mentioned above, for organizations with smaller workloads, it may make sense to start off with a more traditional relational database service for their modeling layer, and then scale up to a “true” data warehouse when necessary. Once you’re already in the cloud, performing a migration within your existing ecosystem should not be terribly difficult, and can likely even be achieved via the ETL services you’re already using.

Modern Cloud Data Warehouse Designs

Gone are the days of using your data warehouse for storage; we know better now, and we’re more disciplined than that. While the data lake contains all of our data, with no reason required for its presence beyond existing in a massive repository that scales limitlessly at low cost, the data in our data warehouse is there for a predefined (typically, business) reason.

As such, the data in a data warehouse has been processed and is ready to be queried by operations and business processes, or for ad-hoc analysis. Because of the highly-structured nature of the data in a data warehouse, it may also be more accessible for less-technical data users, as compared to the data lake. With that being said, data here can (and will) still be leveraged for advanced analytics.

From a structural perspective, moving entities such as facts and dimensions to the cloud may not change them from how they were in a legacy on-premise environment. After all, the data model still needs to suit the needs of your business, and any downstream consumers of your cloud analytics platform should not notice a change in their work processes (unless they notice an improvement!). One major change though is the extinction of singleton data marts and cubes “hanging off” of your data warehouse, as well as the elimination of a staging layer where raw data was once stored. Coupling your modern data warehouse with a data lake removes the need for these components.

Key Considerations for Cloud Data Warehouses

With all the potential of cloud data warehouses comes a need for thorough up-front design and mostly-correct configuration. Failing to do so could result in sub-optimal performance and/or unnecessarily high costs. While it’s fully reasonable to expect that not everything will be perfect from the get-go, and also that your organization’s needs will evolve, it is important to establish a solid foundation upon which to accommodate these adjustments.

Important factors for designing and developing a cloud data warehouse.

Again, not everything will be perfectly configured from the start. However, taking the time to establish a solid foundation which can be iteratively improved upon will set your organization up for future success as your cloud data systems continue to develop and grow.
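To give one concrete (and hedged) example of such an up-front design decision: in a Synapse Dedicated SQL Pool, every table is spread across 60 distributions, and choosing how each table is distributed has a major impact on performance and cost. The sketch below uses pyodbc with placeholder connection details and a hypothetical fact table:

```python
# Sketch only: choose a distribution strategy up front for a large fact table in a
# Synapse Dedicated SQL Pool. Connection details, table, and columns are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace>.sql.azuresynapse.net;"
    "DATABASE=<dedicated_pool>;"
    "UID=<user>;PWD=<password>"
)

conn.execute("""
    CREATE TABLE dbo.fact_sales
    (
        sale_id     BIGINT        NOT NULL,
        customer_id INT           NOT NULL,
        sale_date   DATE          NOT NULL,
        amount      DECIMAL(18,2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(customer_id),  -- co-locate rows that join/aggregate on customer_id
        CLUSTERED COLUMNSTORE INDEX        -- the default, well suited to large fact tables
    );
""")
conn.commit()
```

Getting decisions like this roughly right at the start (hash versus round-robin versus replicated tables, for example) is far cheaper than redistributing billions of rows later.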

Part 3: ETL Layer

We need some way to actually get our data into our cloud analytics platform, and to exactly where it ought to be, in the format we want. We do this via our ETL (or ELT) layer. This could also be known as our data movement/transformation layer. While there are tools which exist outside of major cloud providers which can achieve this, we’ll focus on those offered by Azure for this article.

Synapse Pipelines or ADF?

As mentioned above, Microsoft is making a big push for Synapse as the one-stop shop for all things cloud data analytics. Therefore, it’s only fitting that there exists a data movement/transformation service under this umbrella, called “Synapse Pipelines.” Users familiar with Azure ETL may be asking, “what about Azure Data Factory (ADF)?” As it turns out, Azure gives you options.

With the exception of a few minor details, Synapse Pipelines and ADF are virtually identical services. Even the minute differences which currently exist are being reconciled as you read this. Some readers may believe this indicates the imminent extinction of ADF, which is also what I initially thought; however, Microsoft has indicated that it will continue to support both services for the foreseeable future.

All that to say, it may make sense to jump on the Synapse Pipelines bandwagon now, especially if you are creating fresh new ETL processes in the cloud. The case is further made for Synapse Pipelines if you’re using other Synapse Services, such as Dedicated SQL Pool or Notebooks, as it means fewer services in which you must manage your resources. Also, Synapse Notebooks and data sources integrate seamlessly with Synapse Pipelines.

One reason to shy away from Synapse Pipelines is the newness of the service. People working with Synapse Pipelines will often find guidance and support via well-established ADF documentation. Additionally, at the time of writing there is a lot more Terraform support for ADF than for Synapse. This may make ADF your DevOps engineer’s preferred service, and understandably so. However, the “gap” between these two services and their offerings is quickly closing, and Microsoft can strongly support users of both services.

Configuration-Driven Pipelines

An efficient way of building out ETL/ELT pipelines is leveraging a configuration-based approach. This means minimizing (or ideally, eliminating) hard-coded values in your pipelines and replacing them with parameters. Parameter values are stored in control tables, in either a dedicated database or even within your data warehouse, and are provided as needed to your pipelines.

Sample configuration-driven data architecture.

In order to enable dynamic pipelines, you must first understand your source and target systems, as well as the shape of your data. This enables you to provide the appropriate connection information, as well as retrieval and landing values, to the control tables that feed into the pipeline. Knowing these things will also empower users to leverage a single set of resources across multiple data sources, if those sources follow the same pattern. For example, an organization may have a variety of relational databases, but if these all leverage the same connection info and store the same type of data (i.e., relational data), then they are likely all compatible with the same pipeline pattern.
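Stripped of any particular ETL service, the idea looks something like the sketch below (in ADF or Synapse Pipelines the same pattern is typically expressed with a Lookup activity feeding a ForEach over parameterized datasets; everything named here is hypothetical):

```python
# Conceptual sketch of a configuration-driven pipeline: connection and landing details
# come from a control table rather than being hard-coded. All names are hypothetical.
CONTROL_TABLE = [
    {"source_system": "salesdb", "schema": "dbo", "table": "orders",
     "watermark_column": "modified_at", "target_prefix": "raw/salesdb/orders/"},
    {"source_system": "crm", "schema": "dbo", "table": "contacts",
     "watermark_column": "updated_at", "target_prefix": "raw/crm/contacts/"},
]

def run_ingestion(copy_fn, get_watermark):
    """Run one parameterized copy per control-table row.

    copy_fn performs the actual data movement; get_watermark returns the last
    successfully loaded watermark value for a given source and table.
    """
    for cfg in CONTROL_TABLE:
        watermark = get_watermark(cfg["source_system"], cfg["table"])
        query = (
            f"SELECT * FROM {cfg['schema']}.{cfg['table']} "
            f"WHERE {cfg['watermark_column']} > '{watermark}'"
        )
        copy_fn(source=cfg["source_system"], query=query, target=cfg["target_prefix"])
```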

One thing worth noting is that even though this is described as a “single pipeline” for multiple data sources, it does not mean that this pipeline must run in a sequential fashion, looping through each data source and running one at a time. It is possible to configure parallelization and concurrency settings for most ETL services out-of-the-box, meaning you can have a high degree of parallelism for a single pipeline and its resources.

While this method definitely requires more work up front, it is far superior to creating a pipeline per data source, and it pays off quickly. Firstly, when onboarding new data sources, it is not necessary to build out a brand new pipeline for each source. So long as your source follows the same high-level pattern as other sources, as described above, all that is required is inputting the parameter values for that source into your control table. This can even be automated by scripting out SQL DML statements. Secondly, operating and maintaining a smaller set of resources is far simpler than maintaining N pipelines for N data sources.
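As a small, hedged example of that onboarding automation (the control table name and its columns are hypothetical), adding a new source can be reduced to generating a single INSERT statement:

```python
# Generate the control-table INSERT for a newly onboarded source.
# The etl.pipeline_control table and its columns are hypothetical; in production,
# prefer parameterized statements over string formatting.
def onboarding_insert(source_system: str, schema: str, table: str,
                      watermark_column: str, target_prefix: str) -> str:
    return (
        "INSERT INTO etl.pipeline_control "
        "(source_system, source_schema, source_table, watermark_column, target_prefix) "
        f"VALUES ('{source_system}', '{schema}', '{table}', "
        f"'{watermark_column}', '{target_prefix}');"
    )

print(onboarding_insert("hr", "dbo", "employees", "modified_at", "raw/hr/employees/"))
```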

Another perk of leveraging this approach is that it does not need to be fully built out from the outset. In fact, developers may start off by using hard-coded values in their pipeline, and then evolve the pipeline to be configuration-driven. Over time, organizations can also build out sophisticated logging and error-handling capabilities as part of their pipelines while following this approach.

Tutorial

As mentioned above, I have written an additional article containing a hands-on lab to help users build out their own foundational cloud analytics platform which you can access at this link. The supplementary article repeats some of the content covered here, but only if relevant to the tutorial.

Photo by sang xiaolei on Unsplash

Before You Go

There are a few more items worth briefly mentioning before ending this discussion on setting up a future-proof cloud analytics platform.

Governance

As mentioned above, the cloud breaks down silos. This makes the need for governance even greater.

Start with defining your different types of users and grouping them in terms of how they’ll interact with your cloud analytics platform. This will help you determine what they actually need to access, and therefore create permission sets accordingly. Role-based access control (RBAC) is a beautiful thing; make sure you leverage it in conjunction with the principle of least privilege. This means that each user has access to exactly what they need, and nothing more.

From here, it can be good to define formal processes for the onboarding and offboarding of users to the platform. Further workflows can be defined around development processes for topics such as environment promotion.

Basic user onboarding process example.
Basic user offboarding process example.

Governance is an ongoing process, and will continue to evolve as your organization’s use of the cloud grows and changes. Try to make these processes symbiotic so that they inform and improve each other with time.

Data Consumers

Almost all organizations leverage reporting and/or data visualization tools. No one wants to lose this capability with a move to the cloud, and the good thing is, you don’t have to! However, there are some additional considerations worth noting.

While it is free to ingest data into the cloud, there can be fees associated with egressing your data from the cloud. It may be OK to incur these fees every now and then on an as-needed basis, but it is not ideal to spend money just because your reporting tool is located beyond the scope of your cloud environment. There are a few options available to help with this.

As Power BI is a Microsoft product, it’s actually possible to connect Power BI to your Azure cloud data without incurring data egress charges. This is possible by making sure all your resources are in the same region. For more information on this, check out the link below.

Tableau also provides documentation on integrating its services with a variety of other technology platforms:

It goes without saying (but bears mentioning) that you should always lock down your resources and follow security best practices, especially when integrating additional technologies into your cloud environment.

Analytics Sandboxes

Chances are, your organization’s data scientists and advanced analytics users will want to directly benefit from the cloud as well. A good way to enable this is via sandboxes.

Sandboxes are intended to provide a safe environment in which developers can experiment with data without impacting production workloads and processes. Governance is a crucial part of enabling a sandbox environment as only approved users should be able to access this environment, and within that environment there should be minimal (if any) potential to impact real data.

Ideally, users can read (desensitized) production data into their work, but not write back to production resources. Instead, the Data Lake should be leveraged by sandbox users in order to persist the artifacts of their work. Therefore, sandboxes require two main infrastructure components in order to be useful: compute and storage. In Azure, this can be achieved via the Synapse Workspace (specifically Notebooks with Spark and Serverless SQL pools) and ADLS Gen2.
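As a hedged sketch of that pattern (PySpark in a Synapse Notebook; the paths and names are placeholders, and it is the RBAC/ACL setup on the storage account that actually enforces the write boundary):

```python
# Sandbox pattern: read (desensitized) curated data, write work artifacts only to
# the user's own sandbox prefix. All paths and names are placeholders.
curated_path = "abfss://curated@<account>.dfs.core.windows.net/sales/orders/"
sandbox_path = "abfss://sandbox@<account>.dfs.core.windows.net/<user>/churn-experiment/"

orders = spark.read.format("delta").load(curated_path)

# Example experiment artifact: per-customer order counts as a feature table.
features = (
    orders.groupBy("customer_id")
    .count()
    .withColumnRenamed("count", "order_count")
)

# Persist outputs under the sandbox namespace, never back to production zones.
features.write.mode("overwrite").parquet(sandbox_path + "features/")
```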

Basic analytics sandbox service & permissions setup.

Additional governance practices are recommended in order to promote work developed in the sandbox to higher-level environments, until it is operational and productionized. This relies on the use of practices such as source control, CI/CD pipelines, and appropriate quality assurance measures.

Up & Up & Up

With a strong core built out, your organization is not only well-positioned to capitalize on all of the potential of your existing data, but is also future-proofed for new data sources and data types.


Today’s well-built cloud analytics platform is tomorrow’s streaming data platform, ML Ops platform, and beyond.

If you have any feedback or questions related to this article, please leave them in the comments below!
