Why a Delta Lakehouse?

Robert Dale Thompson
Published in tmobile-dsna · Apr 16, 2022

Beyond the constraints of a data warehouse.

By: Robert Thompson; Geoff Freeman

THE PROBLEM WE FACED

T-Mobile’s mission to build the nation’s best 5G network drastically increased the number of monthly network projects planned and directly impacted the enterprise procurement and supply chain organizations. To take on aggressive targets and meet critical goals, the enterprise needed efficient back-office business processes and, as a result, the need for data skyrocketed.

Data at T-Mobile, as at many enterprises, is spread across disparate, unintegrated systems. In our case, integrating the status of network plans, material procurement, supply chain operations and field services was incredibly complex. In early 2018, the enterprise relied on weekly reporting in Excel and PowerPoint. These reports were built on dedicated SQL Servers, each of which obtained data from different systems. The reports were then pulled into the enterprise data warehouse and combined. This process regularly took 12 hours or more to complete.

To expedite this process, we needed a centralized location to build reports. And so, our adventures with Azure began.

FIRST ITERATION (TMUS AZURE DATA WAREHOUSE)

The first iteration of this centralization effort, an Azure Data Warehouse (ADW) named TMUS, was created using many shortcuts. It was not anticipated that users would want access to this system, or that it would eventually become an integration point. The goal was simple: cut out the analyst with a spreadsheet and build all the reports in the same place. What did we do? An ADW instance was created and populated with Azure Data Factory (ADF). Files were staged in Azure Data Lake Storage (ADLS). There was an idea that a data lake would eventually replace the ADW, but at first, it was just a glorified proof of concept. TMUS quickly showed its worth, however, as reports were generated much more quickly. This system was now able to generate reports on average 9 hours before the rest of the business got their data. At this point, it was just generating operational reports. They were long-running operational reports to be sure, but that was the only role of the system.

Reporting Expansion

At some point it was decided to give some business units (BUs) direct access to TMUS so that they could also get these operational reports early. The access wasn’t tightly controlled, so the BUs were able to start writing their own queries to refine or improve the reports in ways that were directly applicable to them. This was the first time that much of this data had been centralized, and business users derived value from being able to generate their own reports from it. These reports quickly came to be considered mission critical for these BUs. They also identified new data sources that would be complementary to the existing datasets. Getting these data sources into the enterprise data warehouse would have taken prohibitively long, so the BUs asked for them to be loaded into TMUS, where they could then write their own transformations.

Problems Identified with the First Iteration

TMUS experienced rapid, organic growth and was considered wildly successful. However, with any successful project, you are certain to find flaws — and flaws we did find:

No Plans for System Integration

The only way to deal with all this data together was to load it to this centralized store. This led people to clamor for more access, which led to resource contention issues. There were also no clear systems integration controls; users were given access without clear SLAs or regulation for use. Some teams wrote queries that ran for hours. Other teams set up their own ETL processes to extract massive amounts of data. In other words, it was the wild west.

Workload Contention

Operational and analytical workloads have very different profiles. As analytical workload and user-generated reports grew and consumed more resources, operational report refreshes became unreliable. An analytical report that ran for hours would consume resources needed to update the operational reports that were on an hourly refresh cadence.

Guardrails Needed

Unfortunately, the reports that these users wrote were not being tightly controlled. User-generated reports and queries started to impact the performance of one another along with the core operational reports that the system was initially built to power. By the time these performance impacts became unbearable, the system was already powering everything from inventory allocation systems to executive dashboards.

The desire to innovate and run at the speed of business overtook the desire for governance and structure. As a result, TMUS was a victim of its own success.

What Went Right with the First Iteration

There are plenty of success stories from the initial implementation. The streamlined reporting process gave us the ability to hit T-Mobile’s aggressive build pace, but that story is just one of many.

Harvest to Core

Often, business users came up with queries to solve the issue of how data from disparate systems related to one another. These queries could then be harvested and curated by TMUS, propagating that logic across the business. In many cases, TMUS was the only place in the enterprise where the data from these systems could be integrated.

Security

Early on, we segregated data into separate schemas based on source system or workload vertical. This gave us the ability to apply a security model that was relatively easy to administer and audit.

Data accessibility

TMUS became a place for analysts to wrangle data that doesn’t intersect naturally, opening doors for identifying flaws in workstreams and creating business process efficiencies.

SECOND ITERATION — DATA LAKEHOUSE

One of the high-profile use cases of TMUS was the twice-weekly allocation of material to complete network build projects. Allocation informed how billions of dollars in inventory would be fulfilled from T-Mobile’s warehouses. It was operationally critical but failing about 40% of the time due to workload contention. The decision was made to extract the logic powering the process from TMUS and put that logic into a Databricks notebook. The process went from failing 40% of the time to running flawlessly. This became the foundation of the second iteration of our architecture.

Data Lakehouse Overview

A Data Lakehouse is simply data warehousing principles applied over data in a data lake. Data is exposed through delta tables, which are built on an open-source protocol for interacting with data in the lake in an ACID-compliant way. This comes with several benefits directly applicable to our problems.
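The ACID guarantee comes from the table's transaction log: state is the result of replaying an ordered list of atomic commits. The toy sketch below illustrates that idea in plain Python; it is not the actual Delta Lake implementation, which persists JSON commit files under a `_delta_log/` directory.

```python
import json

# Toy model of a Delta-style transaction log. Each commit atomically
# records which data files were added or removed; readers replay the
# immutable log to get a consistent snapshot, so writers never block them.
class ToyDeltaLog:
    def __init__(self):
        self.commits = []  # append-only list of JSON commit records

    def commit(self, add=(), remove=()):
        self.commits.append(json.dumps({"add": list(add), "remove": list(remove)}))

    def snapshot(self):
        # Replay every commit in order to compute the live set of files.
        live = set()
        for raw in self.commits:
            c = json.loads(raw)
            live |= set(c["add"])
            live -= set(c["remove"])
        return live

log = ToyDeltaLog()
log.commit(add=["part-000.parquet"])
# A later rewrite (e.g. compaction) swaps files in a single atomic commit.
log.commit(add=["part-001.parquet"], remove=["part-000.parquet"])
print(sorted(log.snapshot()))  # ['part-001.parquet']
```

Because each commit is all-or-nothing, a reader sees either the table before a write or after it, never a half-applied change.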

Workload Contention Solved Through Compute/Storage Separation

Delta tables do not experience blocking and can be shared between compute processes without contention. An additional benefit of separating the storage and the compute is that the compute used by a user is isolated to that user and can be charged back to that user’s organization.

Data Isolation Solved Through Direct Data Lake or Spark Access

No longer does data need to be loaded to a centralized system — if a user has access to data in the data lake, then they can use the compute endpoints to connect to that data and analyze it. This simplifies or eliminates the need for ETLs whose only purpose is to move data without transforming it.

Clear SLAs and Integration Points Provided Through Serverless Architecture

Exposing data through the data lake makes for a clear integration point — if you want to send us data, just drop it in the data lake. If you want to pick up data, just use any client capable of reading data from the data lake. The Serverless SQL and Spark endpoints give a similar experience. This also gives clear SLAs. Serverless SQL is not always faster than querying a SQL server, but the same query will consistently take the same amount of time and won’t be affected by other users.
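To make the "just read it from the lake" integration concrete, the snippet below builds the kind of Serverless SQL query a consumer would run against a delta folder. The storage account, container, and folder names are made up for illustration; `OPENROWSET` with `FORMAT = 'DELTA'` is the Synapse serverless mechanism for querying delta tables in place.

```python
# Hypothetical helper that constructs a Synapse Serverless SQL query
# over a delta folder in the lake. The lake path here is illustrative,
# not one of our actual storage accounts.
def serverless_delta_query(lake_path: str, top: int = 10) -> str:
    return (
        f"SELECT TOP {top} *\n"
        f"FROM OPENROWSET(\n"
        f"    BULK '{lake_path}',\n"
        f"    FORMAT = 'DELTA'\n"
        f") AS result;"
    )

query = serverless_delta_query(
    "https://examplelake.dfs.core.windows.net/curated/network_projects/"
)
print(query)
```

Any client that can speak T-SQL to the serverless endpoint can run a query like this, which is what makes the endpoint such a clean integration point.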

Architecture

Our lakehouse architecture is organized into three layers: Ingest, Delta Lakehouse, and Serve.

Ingest

At the ingest level, not much had to change. Extraction processes already staged data in the data lake, so it was easy for us to start building tables off these locations. The main addition to our ingest process was modifying how extracted data was logged. We created a new auditing database to control the data warehousing aspects of our lakehouse, so the ADF pipelines all needed to log to that location.
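A sketch of the kind of auditing record those pipelines write on each extract is shown below. The table and column names are illustrative, not our actual control schema, and `sqlite3` stands in for the auditing database that the ADF pipelines log to.

```python
import sqlite3
from datetime import datetime, timezone

# Stand-in for the auditing database; in our platform this would be an
# Azure SQL database that every ADF extraction pipeline logs to.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS pipeline_run_log (
           pipeline_name TEXT,
           source_system TEXT,
           lake_path     TEXT,
           row_count     INTEGER,
           status        TEXT,
           logged_at_utc TEXT
       )"""
)

def log_extract(pipeline, source, path, rows, status="Succeeded"):
    """Record one extraction run so downstream loads know what landed where."""
    conn.execute(
        "INSERT INTO pipeline_run_log VALUES (?, ?, ?, ?, ?, ?)",
        (pipeline, source, path, rows, status,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

log_extract("pl_extract_projects", "ERP", "raw/erp/projects/2022-04-16/", 125_000)
count = conn.execute("SELECT COUNT(*) FROM pipeline_run_log").fetchone()[0]
print(count)  # 1
```

With every landing logged centrally, the downstream Bronze/Silver loads can be driven entirely off this audit trail rather than off hardcoded schedules.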

Delta Lakehouse

This is the heart of where changes were needed: a series of delta tables that expose the same data in the lakehouse as that which users could get through TMUS. This is done by ingesting data through a series of tables we have labeled Bronze, Silver, and Curated. The Bronze tables essentially act like a Slowly Changing Dimension Type 2 (SCD2) table: if anything changes, a new row is created. The Silver tables store just the latest state of the data, mirroring the table in the source system. The Curated tables are where any advanced ETL and data shaping for warehousing takes place.
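The Bronze/Silver relationship above can be sketched in plain Python (the real tables are delta tables updated by Spark notebooks): Bronze keeps every version of a row in SCD2 style, and Silver is just the current version per business key.

```python
# Minimal SCD2-style sketch of the Bronze/Silver pattern. Field and key
# names are illustrative, not our actual schemas.

def apply_to_bronze(bronze, incoming, key="id"):
    """Append a new version whenever an incoming row differs from the
    current one; mark the superseded version as no longer current."""
    current = {r[key]: r for r in bronze if r["is_current"]}
    for row in incoming:
        cur = current.get(row[key])
        if cur is not None and {k: cur[k] for k in row} == row:
            continue  # unchanged, nothing to record
        if cur is not None:
            cur["is_current"] = False
        bronze.append({**row, "is_current": True})
    return bronze

def silver_view(bronze):
    """Silver mirrors the source system: only the current version of each row."""
    return [{k: v for k, v in r.items() if k != "is_current"}
            for r in bronze if r["is_current"]]

bronze = []
apply_to_bronze(bronze, [{"id": 1, "status": "Planned"}])
apply_to_bronze(bronze, [{"id": 1, "status": "Built"}])
print(len(bronze), silver_view(bronze))
# Bronze holds both versions; Silver shows only the latest
```

In the real implementation the same effect is achieved with a delta `MERGE`, but the history-in-Bronze, latest-in-Silver split is exactly this.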

The Bronze and Silver tables are maintained by a generic series of notebooks and are entirely metadata-driven. At this point, most data extraction is also generic and metadata-driven. This makes onboarding new datasets a trivial matter. The Curated datasets are maintained by custom notebooks that are orchestrated by ADF.
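A sketch of what "metadata-driven" means in practice: onboarding a dataset is just adding a row of metadata, and one generic routine builds the load steps for every table. The field names and paths here are illustrative, not our actual control schema.

```python
# Hypothetical control metadata: one row per dataset to onboard.
TABLE_METADATA = [
    {"source": "erp", "table": "projects",  "key": "project_id", "load": "incremental"},
    {"source": "wms", "table": "inventory", "key": "sku",        "load": "full"},
]

def build_load_plan(metadata):
    """Expand metadata rows into the generic steps a notebook would run
    for each table: where Bronze and Silver live, and how to merge."""
    plan = []
    for m in metadata:
        plan.append({
            "bronze_path": f"bronze/{m['source']}/{m['table']}",
            "silver_path": f"silver/{m['source']}/{m['table']}",
            "merge_key": m["key"],
            "full_reload": m["load"] == "full",
        })
    return plan

plan = build_load_plan(TABLE_METADATA)
print(len(plan))  # 2
```

Because the loading code never changes, adding a dataset is a metadata change rather than a development effort, which is why onboarding became trivial.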

Serve

All the benefits of Lakehouse Architecture would be useless if it didn’t expose data in a way that makes it easy to migrate the existing ecosystem of reports, dashboards, and extracts. Azure Synapse Analytics makes this process easy, however. Any consumers that previously connected to TMUS can now connect to a Serverless SQL endpoint and access the Lakehouse version of the data without further translation. This means that existing SSRS, PowerBI, Excel, or custom ETLs work just by switching the connection string.

Transitioning to Lakehouse

The original design included staging data in the data lake as parquet files before loading them to our ADW. Already having the data in the data lake made it easy for us to write a standardized notebook to load data to our Bronze and Silver tables. Very quickly, the base system data that needed integrating was available in both our serverless and ADW endpoints. Much of the logic that drives reporting had been embedded in our ADW system, however, as views or stored procedures. All this embedded logic needs to go through a process of prioritization and then migration, in which the view and stored procedure logic is translated into Databricks notebooks. This means, at least for the foreseeable future, that some ETL logic is still locked in our ADW warehouse. Some users will still need to connect to that ADW instance to get their Curated data. As more and more teams have realized the benefits of the lakehouse architecture, however, they have started to transition their own reporting logic into the Spark notebooks or dataflows provided by Azure Synapse.

Synapse Workspace

When ADW was rebranded to Synapse Analytics, our use of ADW gave us an inside track to a new offering from Microsoft. This product tied our lakehouse architecture to our current offering in a seamless manner.

Spark

Synapse now includes managed Apache Spark pools. This gives us the ability to write notebooks that access delta tables from within the workspace. Running ML workloads becomes trivial with this new inclusion in the platform. Synapse Spark also supports Azure Active Directory for an easy extension of our security model.

Serverless SQL

Serverless SQL endpoints now built into Synapse offer the ability to easily access delta tables with a familiar SQL experience. This feature is key to migrating current PowerBI reports to the Lakehouse. With Serverless SQL, current reports migrate almost effortlessly.

Dedicated SQL Pool

This is what the original implementation, TMUS, has become. Not much changed in this offering itself, but moving transformation workloads into the Lakehouse frees up its resources.

DataFlow

DataFlows are a relatively new offering: a low-code GUI on top of Spark code. Offering DataFlows to business partners makes it easy for them to curate data for their reporting needs.

Pipeline Orchestration

Synapse Pipelines give business partners the ability to schedule updates to their curation of data. They can then set their refresh cadence to meet their needs.

What is Working

How do you build guardrails into a system that is meant to encourage exploring undefined solutions?

Security Model

The security model in the initial iteration revolved around SQL Server schemas. A schema functioned as the intersection of a business domain and its permission requirements, and an Azure Active Directory group was created and associated with each schema. This is easy for everyone to conceptualize, and easy to administer and audit. A key reason that we are using Azure Synapse to expose data to users is that it uses credential passthrough at all levels to regulate access. This means that extending our security model meant just putting data into similarly named containers in our Azure Storage account. Granting the same Azure Active Directory groups access to those containers lines up the permissions between Serverless SQL, Spark, SQL, and Data Lake.

Environments

Dev and test environments were created so that development no longer has to happen in production, and the production environment was locked down to enforce this. This increased reliability considerably.

DevOps and CI/CD

To reduce the burden on the operations team and increase testing automation, CI/CD and DevOps practices were implemented. Automating release pipelines in each environment and section of the platform helped achieve a reliable deployment cadence, and those pipelines also run build and integration tests. The integration tests now span multiple sections of the platform, reducing outages.

SQL Deployment Framework

A homegrown framework that executes rerunnable SQL queries/DDL to deploy changes in our SQL databases. It uses AAD to authenticate so it doesn’t break our security model. Past personal history with DACPAC influenced the decision to use a homegrown solution instead of an out of the box implementation.
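The core idea behind "rerunnable" is that every script is idempotent, so a deployment is simply executing the whole ordered list on each release. The sketch below illustrates this with `sqlite3` standing in for the target database; the real framework authenticates with AAD against our Azure SQL and Synapse databases, and the script contents here are invented for illustration.

```python
import sqlite3

# Every deployment script must be safe to execute repeatedly
# ("IF NOT EXISTS" / "INSERT OR REPLACE" style), so a release is just
# running the full ordered list, whether or not parts already exist.
SCRIPTS = [
    "CREATE TABLE IF NOT EXISTS report_config (name TEXT PRIMARY KEY, cadence TEXT)",
    "INSERT OR REPLACE INTO report_config VALUES ('allocation', 'twice-weekly')",
]

def deploy(conn, scripts):
    for sql in scripts:
        conn.execute(sql)  # idempotent, so reruns are harmless
    conn.commit()

conn = sqlite3.connect(":memory:")
deploy(conn, SCRIPTS)
deploy(conn, SCRIPTS)  # rerunning the same release is a no-op
rows = conn.execute("SELECT * FROM report_config").fetchall()
print(rows)  # [('allocation', 'twice-weekly')]
```

Idempotent scripts sidestep the state-diffing that tools like DACPAC perform, which is what made the homegrown approach attractive here.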

Azure Data Factory

We believe our implementation of CI/CD on ADF should be considered best practice. All pull requests are deployed to the dev environment so engineers can test and validate changes in a reliable fashion to eliminate production snafus. Additionally, all merges to our main branch are deployed to our test environment to validate that the merge was successful and repeatable. Then it is pushed to production with the same code base that was deployed to test.

Databricks

Databricks has a feature called Repos, where git repositories are cloned to the workspaces. This feature easily integrates with Azure DevOps pipelines and allows us to extend our CI/CD practices across all Databricks workspaces.

Pull Request for Check-Ins

This is a standard peer review process. No self-reviews and a couple of checklists for the reviewer to follow. Any engineers who check in or approve a check-in that breaks the build bring in donuts.

Hurdles Discovered and Addressed

As with any new product, we found issues that needed to be addressed.

Warm Spark Pool

Synapse Spark jobs each get a dedicated cluster, but these clusters are not spun up from a warm pool. Every cluster needs to first instantiate virtual machines to run on. This causes every job to require a warm-up time of 3–5 minutes just for allocating resources.

Unsupported Delta Table Stats in Serverless SQL

Because Serverless SQL does not support statistics on delta tables, query plans for joins over delta tables have a skewed memory footprint. This in turn can push memory usage beyond the standard memory allocation for Serverless SQL deployments.

CI/CD Synapse Workspace Limitations

The Synapse workspaces do not have the toolset to follow the tried and tested pattern that our team uses with ADF. There is no automated process supplied by Microsoft to deploy Serverless SQL objects. We work around this by using our SQL deployment framework to deploy serverless views, but it is surprising that Microsoft does not supply a method for deploying objects to Serverless SQL databases.

Report Evolution

The Serverless SQL endpoint is promoted as a predictable performance solution for reports. While not necessarily the fastest, it is consistent, as there is no contention or queuing when waiting on compute resources. This has required an educational outreach to our business partners on how to efficiently use this technology.

CONCLUSION

From BI to ML to AI, data is the currency of the future. This architecture allows a relatively simple infrastructure that will expose data at the speed of business. It allows reading and writing of data without blocking and scales out linearly. Business partners can easily adopt advanced analytics and derive new insights. These new insights promote innovation across disparate workstreams and solidify the decentralized approach to analytics taken by T-Mobile.
