Querying One Trillion Rows of Data with Power BI and Azure Databricks

Kyle Hale
13 min read · Mar 7, 2022


TL;DR

I mean ... c’mon. That’s pretty awesome, you have to admit.

One trillion rows. That’s it. That’s the GIF.

Background

Sitting at our recent Company Kickoff, I watched a great presentation from Databricks co-founder Reynold Xin on the many technical improvements made to the platform in the previous year. One demo that stood out to me as a recovering BI architect was a video of our much-improved performance when acting as a “live connection” in Tableau.

I wasn’t alone; a colleague piped up in the session’s Slack channel to ask how many rows were in the dataset, and what guidance we should be giving to our customers. I wrote that at my last role we had modeled a dataset with 7.5 billion rows and performance was surprisingly decent.

And then I had a thought: how would Databricks SQL handle a trillion rows? Most data platforms aren’t even capable of handling a trillion row dataset. Even the ones that can will shy away from it and tell you to aggregate it down to something manageable.

But Databricks is all about scale.

One trillion rows.

Sure.

Let’s do it.

Scenario

We’re going to follow the core strategy of large-scale visual analysis: summarize, filter, drill down.

  1. Create useful summarized views of your data that let people separate the signal from the noise.
  2. Allow people to filter as needed so only the signal remains. Ideally this is something like a 10x or higher reduction in data that actually needs to be analyzed.
  3. Provide drill down capabilities from the filter to the next level of detail.
  4. Repeat this process as needed to get to the analysis you desire.

The scenario I decided to build my demo around was a common one for manufacturing, healthcare, and energy: time series analysis of field devices.

I imagined we had put 100 devices in the field capturing environmental readings. Our dashboard would then summarize these readings by device, let the user filter by device and time window, and drill down to the individual readings for that device at a single point in time.

All BI work is pretty much exactly like this.

You can easily imagine this scenario supporting a root cause analysis of devices that are underperforming or malfunctioning, a search for changing patterns over time that require more attention, or a comparison of devices across the field.

Steps

There are a number of steps you need to take to store, process, model, and visualize a dataset of this size while still delivering great end-user performance:

  • Partitioning and ordering the data on disk for fast and efficient retrieval
  • Creating aggregations to enhance the user experience during the summarize phase
  • Modeling the data in the best way to match the type of analysis being performed
  • Visualizing the data to take advantage of the work you did in the previous steps

And then of course there’s step 0, which is to plan all of this out ahead of time based on the requirements of our scenario so the implementation is on rails. (Trust me, optimizing trillion row datasets definitely falls under the category of “measure twice, cut once!”)

Planning Ahead (Logical Plan)

For our scenario, our data model will consist of 4 tables:

  • Our time series fact table — this is the trillion row dataset we want to filter and then drill into
  • A summarized aggregation table — this table will be the first table users interact with, in order to find interesting patterns within the fact table for drilling down into; it will be directly linked to our fact table in Power BI using aggregations
  • Two dimension tables, one for our devices and one for our time attributes — in a production scenario, these would provide additional attributes and hierarchies our users can filter on without having to store that data for every row in our (already very large!) fact table. (A rough sketch of their DDL follows below.)
Our data model.
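
We won’t show full DDL for the two dimension tables in this walkthrough, but a minimal sketch might look like the following. Only DeviceId, day, and hourminute also appear in the fact table defined later; every other column name here is an illustrative assumption, not something from the demo itself.

-- Hypothetical dimension tables; all columns other than DeviceId, day, and
-- hourminute are illustrative attributes, not taken from the demo.
CREATE TABLE timeseries.dim_device (
  DeviceId BIGINT,
  DeviceName STRING,
  Manufacturer STRING,
  InstallDate DATE
) USING DELTA;

CREATE TABLE timeseries.dim_time (
  day DATE,
  hourminute STRING,
  HourOfDay INT,
  IsBusinessHours BOOLEAN
) USING DELTA;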

Delta Table Management

For Delta table management, we’re going to focus on three techniques: data partitioning, z-ordering, and creating aggregation tables. These will help ensure our data is optimally stored in our Delta Lake for querying and retrieval, and that our Power BI performance is optimized to take advantage of supporting both an in-memory cache and directly querying Big Data sets like ours in a single model.

1. How should we partition our data tables?

In the Databricks lakehouse architecture, partitioning provides two major advantages for querying large datasets. First, for narrow queries it lets the engine quickly ignore, or prune, partitions that are not relevant to your query. Second, for broad queries it lets you easily parallelize the work and take full advantage of the Photon engine.

The main criterion for choosing how to partition your table is that the data in each partition should make sense as a standalone table. A bad example might be if you had purchase history data and decided to partition it based on your customers’ first names. There may come a day when all you really want to know is what all the Kyles bought at your store, but it’s pretty unlikely.

A better option would be to partition the data by date — in fact, time is by far the most popular partitioning attribute. In Delta tables, you can partition by multiple columns, and like most things, you can apply the Goldilocks rule — too few partitions mean you can’t prune or parallelize your queries efficiently, but too many partitions can lead to lots of small, sparse data files, so try to get your partition sizes “just right.”

In our case, it makes a lot of sense to partition our fact table by device ID — since each device operates independently of the others and we want to isolate and then examine poor-performing devices with our analysis — and by day, since most of our analysis is intra-day.

CREATE TABLE `spark_catalog`.`timeseries`.`sensors_part` (
  `TimeStamp` TIMESTAMP,
  `SensorType` STRING,
  `SensorId` BIGINT,
  `DeviceId` BIGINT,
  `Value` DOUBLE,
  `day` DATE,
  `hourminute` STRING
) USING DELTA
PARTITIONED BY (DeviceId, day)

2. How should we order our data within our partitions?

When you have a data attribute whose values aren’t all identical (unlike device ID) but where values close to each other tend to get queried together (like timestamps), you can tell Delta to use z-ordering to colocate rows with similar values for those attributes. This colocality lets the Delta engine retrieve data more efficiently for queries that filter, group, or sort on those attributes.

When the engine knows the data is sorted, it can quickly find where to start scanning, and stop as soon as the data it’s scanning isn’t part of the requested data.

In our case, we have our hourminute column which will be the focal point for a lot of our time intelligence; z-ordering our data by this column will definitely help optimize our data tables for efficient querying.
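
Concretely, z-ordering is applied with the OPTIMIZE command. A minimal sketch, assuming the sensors_part table defined above, looks like this (run it after the initial load, and periodically as new data arrives):

-- Compact small files and colocate rows with similar hourminute values.
-- Note: you can't z-order on a partition column, which is one reason
-- hourminute (rather than day or DeviceId) is the z-order key here.
OPTIMIZE timeseries.sensors_part
ZORDER BY (hourminute);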

3. What aggregations should we create?

Aggregation tables in Power BI are awesome. They allow you to import a hidden table of aggregated data into your model and connect it to your (trillion row!) dataset. When a user performs a query against your fact table that the aggregation table can answer — no points for guessing — Power BI uses the aggregation table to answer it instead of your DirectQuery table, using its low-latency in-memory cache to deliver instant results.

The best part is that the experience is completely seamless to your end user. They get snappy performance at the aggregated level, and the low-level detailed data they need to complete their analysis at the fact table level.

Going back to our mantra of “summarize, filter, drill down”: to get the most out of aggregations, we need to design our aggregate table ahead of time so it answers the “summarize” step well enough to give users what they need to “filter” the data before drilling down to the fact level.

In our case, we want to quickly identify poorly performing devices during particular time windows, so an aggregate table of performance metrics by device ID and time window (our hourminute column) will be really helpful here. For your own table, follow the Goldilocks rule (again!) — you want just the right amount of dimensions and metrics without essentially making a duplicate of your fact table.

CREATE VIEW timeseries.agg_value_count AS
SELECT
  DeviceId,
  hourminute,
  count(1) AS rowcount,
  sum(value) AS totalvalue
FROM
  timeseries.sensors_part
GROUP BY
  DeviceId,
  hourminute
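
One note: this aggregate is defined as a view because it gets imported into Power BI, so it is only evaluated at refresh time. If re-aggregating the full trillion-row fact table on every refresh proves too slow, a reasonable alternative (not part of the original demo) is to materialize the aggregate as its own Delta table, for example:

-- Hypothetical alternative: persist the aggregate as a Delta table so the
-- Power BI import refresh reads precomputed results instead of re-scanning
-- the trillion-row fact table. The table name here is illustrative.
CREATE OR REPLACE TABLE timeseries.agg_value_count_tbl
USING DELTA
AS
SELECT
  DeviceId,
  hourminute,
  count(1) AS rowcount,
  sum(value) AS totalvalue
FROM timeseries.sensors_part
GROUP BY DeviceId, hourminute;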

Overall Guidance

  • You must partition your data to take advantage of data skipping and parallelism. Random data storage is your enemy.
  • Z-Ordering data where colocality matters can help a lot. For time series data, it’s highly recommended to z-order based on timestamp to make it easy to “build up” aggregates like month-to-date and year-over-year KPIs.
  • You must include aggregations; they’re too convenient! Just make sure you design them to meet your ultimate goal of “summarize, filter, drill down.”
  • Try to keep high-cardinality columns in your fact table even if they’re actually dimension attributes. At this scale, the “join penalty” outweighs the “ETL penalty”, especially if we don’t expect those dimension attributes to exhibit frequent changes.
  • Avoid views for your DirectQuery tables. At this scale, avoiding recomputation is a major win.
  • Goldilocks is a great story; use it to teach your kids about tradeoffs!
When customers join the BI sprint review.

Power BI

Now that we’ve created our tables, let’s see what best practices we can leverage to make our Power BI model operate at its highest efficiency.

What storage mode do we need for each table?

Today Power BI offers four different storage modes for tables:

  • Import mode — all data is loaded into Power BI’s in-memory cache
  • DirectQuery mode — all data remains in the source system and only the metadata is stored in Power BI
  • Dual mode — the data is loaded into Power BI’s in-memory cache and the source system metadata is stored; Power BI can use either mode to most efficiently produce DAX query results.
  • Hybrid tables — a new feature that lets you mix Import mode and DirectQuery mode data in a single table by using partitions (see how useful they are?), presenting them to the end user as one seamless table.

Each of them has its pros and cons, and more importantly its key use cases.

For our large fact table, we have two options:

  • DirectQuery
    Pros: easier to maintain
    Cons: Assumes a uniform distribution of querying. If certain segments of data get disproportionate amounts of analysis, there’s no way to take that into account.
  • Hybrid table
    Pros: Can import the “hottest” data segments for fast performance.
    Cons: Preview feature, requires additional maintenance, only one partition can be DirectQuery (so make it count!).

In our demo example, we’re just running in DirectQuery mode because it’s easier to set up (and to fully test out the capabilities of Databricks SQL) — but if you had a trillion rows of raw data in real life and wanted to incrementally grow it, a hybrid table makes a lot of sense.

For our smaller dimension tables, we should use Dual mode so their values are available in-memory for slicers and visuals, while still letting Databricks SQL access those values directly when they’re needed in a DirectQuery join.

Overall guidance

  • Pretend there is a big wall between your Import mode and DirectQuery mode data and you want to avoid throwing things over it if possible.
  • … so if a visual is based on a DAX query that can be run entirely on imported data or an aggregation, bring it in.
  • If a visual needs some DirectQuery data, make sure all the data is easily available in Databricks SQL (and modeled per the guidance above.)

Modeling + DAX

Power BI’s DAX language really expands the semantic layer of your data model. It’s powerful and has a lot of hidden depth (and complexity!) … but for DirectQuery datasets — and especially ones with one trillion rows — the guiding philosophy for DAX should be: less is more.

At a minimum we want to create metrics which are supported by our aggregation table and include those in our Summary view.

And we can include some time intelligence or slightly more advanced DAX (i.e. with CALCULATE or FILTER) … if we keep the actual dataset being processed in DirectQuery relatively small.

We definitely want to leverage Power BI’s Performance Analyzer as well as Databricks SQL’s built-in Query History tool here. We can place a DAX measure in a card or table visual, refresh the visual, and see how long it takes for the DAX query to run and return results. We can also see what query is being generated in DBSQL and whether there are opportunities for modifying our model, adding additional aggregations, rewriting our DAX, performing the calculation upstream in Delta Lake, or (worst case) abandoning the calculation altogether.
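
For illustration only, here is roughly the shape of query you’d expect to see in Query History when a visual misses the aggregation table and drills into a single device and time window; the literal values, the hourminute format, and the exact SQL Power BI generates will of course differ:

-- Illustrative sketch of a pushed-down DirectQuery request; the filter values
-- (and the hourminute string format) are placeholders, not taken from the demo.
SELECT
  `TimeStamp`,
  sum(`Value`) AS SumValue
FROM timeseries.sensors_part
WHERE DeviceId = 42
  AND hourminute = '09:30'
GROUP BY `TimeStamp`;

Because DeviceId is a partition column and hourminute is our z-order key, a query shaped like this touches only a small slice of the trillion rows.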

Overall guidance:

  • Create DAX around your aggregation table.
  • Use the Performance Analyzer and Query History tools and measure, measure, measure.
  • Proactively set your users’ expectations for performance of a trillion row dataset — if a DAX calculation is very valuable but takes 30–45 seconds to run, this may be an appropriate tradeoff with your users.

Report

Again following our mantra (say it with me) “summarize, filter, drill down”, we want to build our report out with two pages: a Summary page and a Drill Down page.

The summary page will let the user quickly select a time window, identify poorly performing devices, select a device, and then see the device telemetry for that time window.

The user can then drill down from that telemetry into a specific timestamp and see the operating characteristics of the device at that time and potentially identify a root cause for the poor performance. (In real life, this drill down page might also introduce work order history, manufacturer info, and other relevant data about the device to better investigate the issue.)

Our Summary page will largely be based on our aggregation table.

The aggregate table makes the user’s summary view and filter experience very snappy. We can easily identify which devices we want to look further into.

Note that each visual is still using our fact table fields, but when the user selects values on the report page, these DAX queries will resolve using our aggregation table for great performance.

Our drill through / tooltip capabilities are all run directly on Databricks SQL.

We can link our Drill down page in two different ways: as a standalone report page enabled with the Drillthrough functionality in Power BI, or as a tooltip report that the user can see by hovering over a value.

For this demo we’ll create a tooltip report although the mechanics for creating the two are basically the same: identify which fields are passed through as filters from the Summary page to the Drill Down page, and then link the two pages in the Power BI UI.

In our case, we’re linking through the DeviceId and Timestamp and exploring a histogram of the readings taken at that particular timestamp. In your scenario this could be looking for outliers or comparisons, providing richer analysis of a much smaller set of data points … in short, this is where we move to the explanatory phase of our analysis.
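
Behind the scenes, the tooltip’s row-count-by-value histogram ultimately asks Databricks SQL for the distribution of readings at the selected device and timestamp. Conceptually it looks something like the sketch below; the literal values are placeholders and the real generated SQL will be more verbose:

-- Illustrative shape of the tooltip's histogram query; the DeviceId and
-- TimeStamp literals stand in for the values passed through from the Summary page.
SELECT
  `Value`,
  count(1) AS RowCount
FROM timeseries.sensors_part
WHERE DeviceId = 42
  AND `TimeStamp` = TIMESTAMP '2022-01-01 09:30:15'
GROUP BY `Value`;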

Overall guidance

  • Just like our tables in our model, separate your summary and detail visuals into their own pages.
  • Avoid cluttering with too many visuals. Use good design principles, and specifically favor precise visuals with more complex calculations but fewer data points over broad visuals with many data points that require the analyst to do more interpretation. Where feasible, let Photon do the work!
  • Turn off unnecessary interactions between visuals, especially those that use DirectQuery.

Results

Beyond the nice GIF at the top of the article, here’s some quantitative data on the performance:

We have 2.4 terabytes of data, organized into 200 partitions (100 devices * 2 days), with 27,953 files averaging 87 MB each.

Running on a Medium cluster, when we select an hourminute and DeviceId combination from the summarized view, the average time to produce the line chart visual on Databricks SQL across 25 runs is 2.12 seconds, with a worst case performance of 3.47 seconds, and a best case scenario of 1.47 seconds.

The row count by value tooltip is a bit more intensive, and the data returned by the query is more variable; across 25 runs the average time is 3.66 seconds, with a worst case of 6.01 seconds and a best case of 2.75 seconds.

Even in the worst case, being able to go from choosing a deviceId and time slice to looking at detailed analysis in less than 10 seconds, on a trillion rows of data, is pretty impressive!

A typical run of the line chart and tooltip report.

Setting aside the technical performance, even more impressive is the fact that this cluster is running at just over $11/hour, which in itself is a miracle of modern cloud engineering and a vindication of the lakehouse architecture.

Remember, kids: not just performance … price-performance.

Conclusion

I have to admit, when I started out making this demo, I had no idea how it was going to perform — but seeing those first few queries run through in a few seconds truly was electric! The power of the lakehouse is definitely something you have to see to believe.

Databricks provides awesome processing power to ingest and transform data at trillion row scales and beyond, and with Databricks SQL a world-class engine to query that same data.

And Power BI provides a great platform to surface that data, offering powerful concepts like aggregations and composite models to let you make the most of your Azure Databricks investment.

I hope this inspires you to build awesome dashboards and reports for your business problems and share your results with the rest of us!

Questions, comments, ideas for more blogs? Leave them in the chat.

