Stories by Akshayprabhu on Medium

Enabling Snowflake–Databricks Interoperability

Akshayprabhu — Fri, 30 Jan 2026 17:08:39 GMT

Snowflake–Databricks Interoperability: From Parallel Platforms to a Shared Data Foundation

Modern enterprises rarely rely on a single data platform. Over time, different teams adopt tools that best serve their needs — data engineers lean toward scalable processing engines, analysts toward governed warehouses, and data scientists toward flexible ML environments. This reality has led many organizations to run Databricks and Snowflake side by side.

Historically, however, these platforms have operated in silos, connected through brittle ETL pipelines and redundant data copies. That model is increasingly unsustainable — both technically and financially.

This blog explores why Snowflake–Databricks interoperability matters, why enterprises are actively pursuing it, and how recent advances in open table formats and catalogs are making it practical today.

Why Snowflake and Databricks End Up Together

Despite frequent comparisons, Snowflake and Databricks solve different problems.

Databricks is optimized for data creation. It excels at large-scale ingestion, complex transformations, streaming workloads, and machine learning. Spark’s flexibility makes it ideal for evolving datasets and computationally intensive processing.

Snowflake is optimized for data consumption. It offers strong governance, predictable performance, and high concurrency for SQL analytics and BI. For enterprise reporting and regulated analytics, Snowflake remains a natural choice.

Most organizations don’t choose one over the other — they choose both. The real question becomes how to connect them without turning data movement into the dominant architectural concern.

Redefining Interoperability

Interoperability is often misunderstood as better connectors or faster data transfers. This article makes a stronger point: real interoperability reduces the need to move data at all.

Instead of treating each platform as a destination with its own copy of the data, interoperable architectures treat data as a shared asset stored in cloud object storage. Snowflake and Databricks become compute layers that operate on that shared foundation.

This is not just an optimization — it is a conceptual shift. Pipelines no longer define the architecture. The data does.

Why Interoperability Matters Now

Cost Has Become Architectural

Copying data was once cheap enough to ignore. At scale, it no longer is.

Large datasets duplicated across platforms multiply storage costs and force teams to maintain parallel pipelines that do the same work twice. Every additional copy introduces new monitoring requirements, new failure modes, and new governance concerns.

Interoperability simplifies this by design. Data is written once and reused many times.

Different Teams, Different Needs

No single engine is ideal for every workload. Forcing all analytics, transformations, and ML into one platform leads to compromises and frustration.

Interoperability allows:

Engineers and data scientists to stay productive in Databricks
Analysts and business users to stay productive in Snowflake
Both groups to work on the same datasets

This alignment reduces friction between teams and shortens the distance from raw data to insight.

Governance Improves When Copies Disappear

Every duplicated dataset increases risk. In regulated environments, each copy must be secured, audited, and governed independently.

A shared data layer does not eliminate governance challenges, but it narrows the surface area. Fewer copies mean fewer places where sensitive data can leak or drift out of compliance.

Open Data Is Strategic Data

Interoperability is also about control. When data lives in open formats on object storage, organizations retain freedom of choice. They can adopt new engines, evolve architectures, or rebalance workloads without being forced into a full platform migration.

This is not anti-vendor — it is pro-optionality.

The Technology That Makes This Possible

Delta Lake’s Role — and Its Limits

Delta Lake introduced reliability to data lakes by adding ACID transactions, schema enforcement, and time travel. It dramatically improved Spark-based data platforms and became foundational to Databricks.

However, Delta was initially optimized for a Spark-centric world, limiting its usefulness across heterogeneous engines.

Apache Iceberg as the Common Language

Apache Iceberg was designed for multi-engine access from day one. Its separation of data and metadata allows different compute engines to safely read the same tables while maintaining consistency.

This neutrality is why Iceberg has become the practical center of Snowflake–Databricks interoperability.

Snowflake’s Move Toward Shared Data

Snowflake’s native Iceberg support represents a fundamental change in posture. Snowflake no longer requires full ingestion to participate in analytics. It can query Iceberg tables stored externally while preserving its performance and governance characteristics.

This turns Snowflake from a closed destination into an active participant in shared-data architectures.

Databricks and the Bridge to Iceberg

Databricks has responded by embracing Iceberg directly and by introducing Delta UniForm, which exposes Iceberg metadata for Delta tables. This allows existing Delta Lake investments to participate in interoperable architectures without wholesale rewrites.

Unity Catalog further enables this by acting as a central governance and metadata layer.

Integration Patterns

Access Databricks Iceberg Tables from Snowflake

Snowflake setup

1. Snowflake external volume: Grant Snowflake restricted access to the Microsoft Azure container where the Databricks Iceberg tables and metadata are stored, using an external volume.

Note: Complete all the prerequisites in the Snowflake documentation so that Snowflake can access the storage account without issues.

Use the command below to create the Snowflake external volume.
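As a minimal sketch, an external volume statement for an Azure container generally has the shape below. The volume name, location name, storage account, container, and tenant ID are all placeholders chosen for illustration, and the options should be checked against the current Snowflake documentation. The Python helper only renders the SQL text so it can be reviewed before it is run in Snowflake.

```python
def external_volume_sql(volume: str, location_name: str, base_url: str, tenant_id: str) -> str:
    """Render a CREATE EXTERNAL VOLUME statement for an Azure container.

    Every argument value used below is a placeholder, not a real resource.
    """
    return (
        f"CREATE EXTERNAL VOLUME {volume}\n"
        "  STORAGE_LOCATIONS = (\n"
        "    (\n"
        f"      NAME = '{location_name}'\n"
        "      STORAGE_PROVIDER = 'AZURE'\n"
        f"      STORAGE_BASE_URL = '{base_url}'\n"
        f"      AZURE_TENANT_ID = '{tenant_id}'\n"
        "    )\n"
        "  );"
    )


volume_sql = external_volume_sql(
    volume="databricks_iceberg_vol",          # hypothetical volume name
    location_name="uc-iceberg-location",      # hypothetical location name
    base_url="azure://<storage_account>.blob.core.windows.net/<container>/",
    tenant_id="<azure_tenant_id>",
)
print(volume_sql)
```

The printed statement can be pasted into a Snowflake worksheet once the placeholders are replaced with real values.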

2. Snowflake external catalog: With this table type, Snowflake uses a catalog integration to retrieve information about Iceberg metadata and schema. The table data and metadata are stored in external cloud storage, which Snowflake accesses using an external volume.

The following diagram shows how an Iceberg table uses a catalog integration with an external Iceberg catalog.

Use the command below to create the Snowflake catalog integration that connects to Databricks Unity Catalog.

Replace the following values in the command, then execute it:

CATALOG_NAMESPACE = Schema in Databricks

CATALOG_URI = Replace the host portion before /api with your Databricks workspace URL; everything from /api onward stays the same.

CATALOG_NAME = Catalog name in Databricks

BEARER_TOKEN = Value of the bearer token created in Databricks
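The parameters above can be put together roughly as follows. This is a sketch under assumptions: the integration name, namespace, and token are placeholders, and the path after /api in the URI follows commonly documented examples and may differ for your workspace or release. The helper renders the SQL text only; it does not connect to Snowflake.

```python
def catalog_integration_sql(
    name: str, namespace: str, catalog_uri: str, catalog_name: str, bearer_token: str
) -> str:
    """Render a CREATE CATALOG INTEGRATION statement for an Iceberg REST catalog."""
    return (
        f"CREATE OR REPLACE CATALOG INTEGRATION {name}\n"
        "  CATALOG_SOURCE = ICEBERG_REST\n"
        "  TABLE_FORMAT = ICEBERG\n"
        f"  CATALOG_NAMESPACE = '{namespace}'\n"
        "  REST_CONFIG = (\n"
        f"    CATALOG_URI = '{catalog_uri}'\n"
        f"    CATALOG_NAME = '{catalog_name}'\n"
        "  )\n"
        "  REST_AUTHENTICATION = (\n"
        "    TYPE = BEARER\n"
        f"    BEARER_TOKEN = '{bearer_token}'\n"
        "  )\n"
        "  ENABLED = TRUE;"
    )


integration_sql = catalog_integration_sql(
    name="unity_catalog_int",                       # hypothetical integration name
    namespace="<databricks_schema>",
    # Assumed URI shape: only the host part before /api changes per workspace.
    catalog_uri="https://<databricks_workspace_url>/api/2.1/unity-catalog/iceberg",
    catalog_name="<databricks_catalog>",
    bearer_token="<databricks_token>",              # never hard-code a real token
)
print(integration_sql)
```

In practice the token would come from a secret store rather than a literal in code.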

Azure Setup

Allow Snowflake VNET subnet IDs in Azure Storage Account

If the Azure storage firewall is configured to block all unauthorized traffic to the storage account, the Snowflake VNet subnet IDs must be explicitly allowed.

Follow the Snowflake documentation to allow the Snowflake VNet subnet IDs in the storage account.

Create Database in Snowflake from Databricks

Once the previous steps are complete, a database backed by Unity Catalog (UC) can be created in Snowflake using the command in the screenshot below. This registers in the Snowflake database all schemas and tables (Iceberg format only) from the Databricks UC catalog for which the catalog integration was created.

EXTERNAL_VOLUME = EXTERNAL VOLUME created in Snowflake

CATALOG = CATALOG INTEGRATION created in Snowflake

SYNC_INTERVAL_SECONDS = Specifies the time interval (in seconds) that Snowflake should use for automatically discovering schemas and tables in your remote catalog.

Values: 30 to 86400 (1 day), inclusive

Default: 30 seconds

Both managed and external tables can be discovered once the database is set up.
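A hedged sketch of how the three parameters above fit into one statement is shown here. The database, integration, and volume names are hypothetical, and the exact clause punctuation should be verified against the Snowflake documentation. The helper also enforces the documented 30 to 86400 second range before rendering anything.

```python
def linked_database_sql(
    database: str,
    catalog_integration: str,
    external_volume: str,
    sync_interval_seconds: int = 30,
) -> str:
    """Render a CREATE DATABASE statement linked to a remote Iceberg catalog."""
    # The allowed range (30 to 86400 seconds) comes from the parameter notes above.
    if not 30 <= sync_interval_seconds <= 86400:
        raise ValueError("SYNC_INTERVAL_SECONDS must be between 30 and 86400")
    return (
        f"CREATE DATABASE {database}\n"
        "  LINKED_CATALOG = (\n"
        f"    CATALOG = '{catalog_integration}',\n"
        f"    SYNC_INTERVAL_SECONDS = {sync_interval_seconds}\n"
        "  )\n"
        f"  EXTERNAL_VOLUME = '{external_volume}';"
    )


database_sql = linked_database_sql(
    database="uc_linked_db",                      # hypothetical database name
    catalog_integration="unity_catalog_int",      # hypothetical integration name
    external_volume="databricks_iceberg_vol",     # hypothetical volume name
    sync_interval_seconds=300,
)
print(database_sql)
```

A longer sync interval reduces polling of the remote catalog at the cost of new tables appearing later in Snowflake.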

Note: A table must be an Iceberg table to be visible in Snowflake. Existing tables (managed or external) that are not Iceberg can be made visible by altering the table properties:
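For a Delta table, one way to do this is to enable Iceberg metadata generation through Delta UniForm. The sketch below renders the ALTER TABLE statement; the table name is a placeholder, and the property list reflects the commonly documented UniForm settings, which may have additional prerequisites depending on the Databricks runtime.

```python
def enable_iceberg_reads_sql(table: str) -> str:
    """Render the ALTER TABLE statement that turns on Iceberg metadata for a Delta table."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES (\n"
        "  'delta.columnMapping.mode' = 'name',\n"
        "  'delta.enableIcebergCompatV2' = 'true',\n"
        "  'delta.universalFormat.enabledFormats' = 'iceberg'\n"
        ");"
    )


uniform_sql = enable_iceberg_reads_sql("<catalog>.<schema>.<table>")
print(uniform_sql)
```

The statement would be run on the Databricks side, for example with spark.sql(uniform_sql) in a notebook.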

Optional — Create tables in Snowflake

If you do not want to create a full database from Databricks UC and only need specific tables, you can create them individually in Snowflake using the command shown in the screenshot below.

Replace the following values in the command, then execute it:

EXTERNAL_VOLUME = EXTERNAL VOLUME created in Snowflake

CATALOG = CATALOG INTEGRATION created in Snowflake

CATALOG_TABLE_NAME = Databricks table name
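A minimal sketch of such a per-table statement, using the three parameters above, could look like this. All object names are hypothetical, and the helper only renders the SQL text.

```python
def iceberg_table_sql(
    table: str, external_volume: str, catalog_integration: str, catalog_table_name: str
) -> str:
    """Render a CREATE ICEBERG TABLE statement that points at a table in a remote catalog."""
    return (
        f"CREATE ICEBERG TABLE {table}\n"
        f"  EXTERNAL_VOLUME = '{external_volume}'\n"
        f"  CATALOG = '{catalog_integration}'\n"
        f"  CATALOG_TABLE_NAME = '{catalog_table_name}';"
    )


table_sql = iceberg_table_sql(
    table="customers",                            # hypothetical Snowflake table name
    external_volume="databricks_iceberg_vol",     # hypothetical volume name
    catalog_integration="unity_catalog_int",      # hypothetical integration name
    catalog_table_name="customers",               # table name as it appears in Databricks
)
print(table_sql)
```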

What Still Isn’t Solved

Interoperability is not magic.

Write coordination remains difficult, and most architectures enforce a single-writer rule. Governance policies must still be implemented separately per engine. Operational complexity shifts from ETL pipelines to metadata, catalogs, and access control alignment.

These trade-offs are real — but they are increasingly preferable to the alternative.

When Interoperability Is Worth It

Interoperability delivers the most value when:

Data volumes are large
Teams span multiple analytics personas
Compliance requirements are strict
ML and BI must coexist without duplication

In these environments, shared data architectures stop being optional and start becoming inevitable.

Closing Thoughts

Snowflake–Databricks interoperability reflects a broader evolution in data architecture. The industry is moving away from siloed platforms toward shared data foundations built on open standards.

This shift does not eliminate complexity — but it puts that complexity where it belongs: in data governance and design, not in endless data movement.

The takeaway is simple:
Store data once. Govern it well. Let engines compete on compute — not ownership of the data.

Monitoring and Optimizing Databricks Jobs with DataFlint

Akshayprabhu — Sun, 07 Dec 2025 21:26:04 GMT

Optimizing and Monitoring Databricks Jobs with DataFlint

Contributors: Swapnilspra

Databricks, built on top of Apache Spark, has become the go-to platform for large-scale data engineering, analytics, and AI workloads. Every Spark job that runs on Databricks consumes compute resources — CPU cores, memory, I/O, and storage. For teams running hundreds of daily ETL pipelines, even small inefficiencies can multiply into major cost and performance problems.

Despite Spark’s flexibility, understanding what happens under the hood is notoriously difficult. The Spark UI exposes task metrics, shuffle stats, and stage timelines — but interpreting them manually is tedious. Each job may produce hundreds of stages and thousands of tasks, making it hard to pinpoint performance bottlenecks. Identifying issues like data skew, underutilized cores, unbalanced partitions, or join misconfigurations requires deep Spark internals knowledge. Optimization often happens reactively — only after jobs fail or exceed SLA thresholds. While Databricks provides metrics and logs, there’s no simple, human-readable summary of “what went wrong” or “how to fix it”.

DataFlint Makes Optimization Simple

DataFlint is an open-source Spark plugin that surfaces detailed runtime insights and practical optimization suggestions directly inside the Spark UI. It acts as an intelligent performance advisor for your Spark jobs — continuously monitoring and diagnosing them in real-time.

DataFlint hooks into Spark’s internal event listeners and execution metrics to detect common inefficiencies such as:

Small tasks / excessive partitions: Detects high task counts with low processing time, suggesting .coalesce() or .repartition().
Idle cores and underutilization: Identifies clusters that are oversized or waiting on slow tasks.
Skewed shuffles: Flags uneven data distribution and recommends salting or repartitioning.
Join inefficiencies: Recognizes Sort-Merge Joins and recommends broadcasting small tables for faster execution.
Spill warnings: Highlights when data exceeds memory and spills to disk.

DataFlint UI Overview (using a sample job)

For installation on Databricks, follow the DataFlint documentation page, which explains every step in detail. Once DataFlint is installed on your Databricks cluster, it automatically augments the Spark UI with an additional DataFlint tab.

Each job execution is summarized in the following sections:

Summary Dashboard: Lists alerts grouped by severity (warnings/errors).

SQL Analysis: Shows SQL queries, detected join types (Broadcast, Sort-Merge, etc.), and performance hints.

Stage Metrics: Displays task duration distributions, shuffle read/write sizes, and partition statistics.

Storage & Caching Insights: Highlights unpersisted RDDs, skewed cached tables, and potential memory pressure.

A simple job like:

from pyspark.sql import functions as F

spark.range(5_000_000).repartition(6000).groupBy(F.expr("id % 10")).count().collect()

generates multiple alerts such as:

Each alert category highlights a specific performance inefficiency and points directly to the optimization you can apply.

1. Small Tasks Alert

What it means
Your job has too many small partitions, leading to:

Excessive scheduling overhead
Low CPU utilization
Longer job runtime

What you can optimize

Increase partition size using:

df.coalesce(n)
df.repartition(n)

Adjust shuffle partitions:

spark.conf.set("spark.sql.shuffle.partitions", n)

Tune upstream operations to avoid unnecessary repartitions.
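To pick a value for n, a common rule of thumb (an assumption here, not something DataFlint prescribes) is to aim for partitions of roughly 128 MB. The sketch below applies that rule; the 8 bytes per value estimate ignores row overhead and is only meant to show the order of magnitude.

```python
import math


def suggested_partitions(total_bytes: int, target_partition_bytes: int = 128 * 1024**2) -> int:
    """Return a partition count that keeps each partition near the target size."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# The sample job above holds 5 million 8-byte longs: about 40 MB in total.
sample_bytes = 5_000_000 * 8
sample_partitions = suggested_partitions(sample_bytes)
print(sample_partitions)  # 1

# A 10 GiB dataset at the same target size.
large_partitions = suggested_partitions(10 * 1024**3)
print(large_partitions)  # 80
```

By this estimate the sample job needs a single partition, which is why spreading it over 6000 produces the small-tasks alert.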

2. Idle Cores / Under-Utilization

What it means
Executors are spending most of their time idle — the cluster is likely over-provisioned.

What you can optimize

Reduce cluster size (fewer nodes or smaller instance type).
Decrease executor counts:

spark.executor.instances
spark.executor.cores

Enable/adapt Dynamic Allocation settings.
Coalesce tasks to increase per-task workload.

3. Broadcast Join Recommendation

What it means
Spark is performing a Sort-Merge Join, shuffling large datasets across the cluster, even though one side is small enough to broadcast.

What you can optimize

Broadcast the smaller dataset:

from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), "key")

Or increase auto broadcast threshold:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "200MB")

This often produces significant speedups for join-heavy pipelines.
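The effect of the threshold can be illustrated with a small stand-in for Spark's size check. This is a simplified sketch, not Spark's planner: the size parser handles only a few suffixes, and the 150 MB table size is a made-up example. Spark's default threshold is 10 MB.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # Spark's default for autoBroadcastJoinThreshold


def parse_size(value: str) -> int:
    """Convert strings such as '200MB' to bytes (binary units, small subset of suffixes)."""
    units = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}
    text = value.strip().upper()
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def auto_broadcast_eligible(
    estimated_bytes: int, threshold_bytes: int = DEFAULT_THRESHOLD_BYTES
) -> bool:
    """A table is a broadcast candidate when its estimated size fits under the threshold."""
    return 0 <= estimated_bytes <= threshold_bytes


small_table_bytes = 150 * 1024**2  # hypothetical dimension table of about 150 MB

eligible_by_default = auto_broadcast_eligible(small_table_bytes)
eligible_at_200mb = auto_broadcast_eligible(small_table_bytes, parse_size("200MB"))
print(eligible_by_default)  # False
print(eligible_at_200mb)    # True
```

A broadcast table is copied to every executor and collected through the driver, so raising the threshold trades shuffle cost for memory pressure.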

4. Shuffle Heavy Alert

What it means
Large shuffle read/write volumes detected. This is a common symptom of:

Bad join keys
Unbalanced partitioning
Skew in data distribution

What you can optimize

Repartition on better keys.
Use salting for skewed joins.
Broadcast small tables where applicable.
Use bucketing (for stable schemas).

5. Spill Detected

What it means
Spark ran out of executor memory and spilled intermediate data to disk — a major performance penalty.

What you can optimize

Increase executor memory.
Increase shuffle partitions to reduce per-task load.
Cache or checkpoint intelligently.
Rebalance data to reduce skew.
Validate filter pushdown and prune unnecessary columns.

6. Skew Detection

What it means
One or more tasks processed disproportionately more data than others, indicating uneven data distribution.

What you can optimize

Salt join keys:

df.withColumn("key_salted", F.concat_ws("_", F.col("key").cast("string"), F.floor(F.rand() * 10).cast("string")))

Repartition by key instead of round-robin.
Use a skew join hint (on Databricks) or AQE’s skew join handling (spark.sql.adaptive.skewJoin.enabled) in Spark 3.x.
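The idea behind salting can be shown without Spark. In this sketch a round-robin salt stands in for F.rand() so the result is deterministic, and the key names and row counts are invented for illustration. The large side gets a salt suffix; the small side is replicated once per salt value so every salted key still finds its match.

```python
from collections import Counter

SALT_BUCKETS = 10  # illustrative; tune to the degree of skew

# Large side of the join: one hot key dominates.
large_keys = ["hot"] * 9000 + [f"k{i % 100}" for i in range(1000)]

# Salt the large side (round-robin instead of random, for a reproducible example).
salted_large = [f"{key}_{pos % SALT_BUCKETS}" for pos, key in enumerate(large_keys)]

# Replicate the small side once per salt value.
small_keys = ["hot"] + [f"k{i}" for i in range(100)]
salted_small = {f"{key}_{salt}" for key in small_keys for salt in range(SALT_BUCKETS)}

largest_before = max(Counter(large_keys).values())
largest_after = max(Counter(salted_large).values())
unmatched = [key for key in salted_large if key not in salted_small]

print(largest_before, largest_after, len(unmatched))  # 9000 900 0
```

The largest key group shrinks by the salt factor, while the small side grows by the same factor, so salting pays off only when the small side is genuinely small.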

In summary, DataFlint helps turn Spark’s complex execution patterns into crystal-clear, guided optimization paths.
Instead of manually digging through the Spark UI and interpreting task graphs, you get human-readable alerts, clear explanations, and concrete optimization steps that immediately tell you what needs tuning and why.

This brings job optimization from “expert-only tribal knowledge” to a systematic and repeatable process, enabling every data engineer to improve performance and reduce compute cost with confidence.

What’s Next: Persisting DataFlint Insights for Long-Term Monitoring

The open-source version of DataFlint doesn’t persist metrics beyond the lifetime of the job. Once the Spark UI session ends, all those insights disappear with it.

To bridge this gap, we built a custom Spark plugin extension that automatically captures all DataFlint alerts and writes them into a managed Delta table — with zero code changes to user jobs. This enables building BI dashboards to visualize job health and cluster efficiency and identify high-waste pipelines, inefficient clusters, and opportunities for right-sizing. All of this becomes possible once DataFlint metrics are persisted in Delta and made queryable.

In the next story, I will walk through how we built this custom utility, how it plugs into DataFlint’s event listeners, and how you can adopt it for your Databricks environment.

Data Security in Databricks: A Practical Framework for Classification, Encryption & Access…

Akshayprabhu — Sat, 06 Dec 2025 23:09:46 GMT

Data Security in Databricks: A Practical Framework for Classification, Encryption & Access Reporting

The cornerstone of data protection: encrypt everything, trust nothing by default.

Contributors — Sahil Sawant

Protecting sensitive information is one of the most important responsibilities in modern data platforms. Whether building enterprise analytics in banking, public sector, or philanthropy, organizations must ensure that PII, confidential, and regulated data are properly identified, classified, secured, and monitored.

In this article, we break down a full security framework for Databricks across three pillars:

Identification of Sensitive Data
Implementing Encryption / Anonymization Based on Data Classification
Reporting & Monitoring Access to Sensitive Data

The examples are grounded entirely in a real use case built in Databricks, where Microsoft Presidio, an open-source data protection and de-identification SDK from Microsoft, performs automated PII detection and Unity Catalog enforces governance using tags, comments, and table-level policies.

Full Source Code — The complete Databricks notebooks, PII detection pipeline, and anonymization framework used in this article are available here.

1. Identification of Sensitive Data (Data Classification Layer)

The first step in any secure data architecture is knowing where sensitive information exists. In a modern Lakehouse, PII can appear in structured fields, nested structures, arrays, or even free-text columns where identifiers are buried in natural language. Relying on manual reviews or naming conventions is never enough.

In this framework, sensitive data is identified using a fully automated Presidio-based scanner that evaluates every column across Unity Catalog. Each field is analyzed using entity recognizers for emails, phone numbers, credit cards, IPs, SSNs, IBANs, names, and dozens of other patterns. The scanner aggregates confidence scores and hit rates to determine which columns are likely to contain PII, then writes this intelligence back into Unity Catalog via tags and comments. This creates a living classification layer that continuously keeps the platform aware of where regulated data resides.

Goal:

Automatically detect, classify, and tag sensitive data so the governance system knows how to protect it.

1.1 Automated PII Detection Using Microsoft Presidio

In this use case, Microsoft Presidio is integrated directly into Databricks as a broadcasted AnalyzerEngine:

broadcasted_analyzer = sc.broadcast(AnalyzerEngine())

A custom Pandas UDF (analyze_udf) runs Presidio against each column:

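import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

def analyze_series(s: pd.Series) -> pd.Series:
    return s.astype(str).apply(analyze_text)

analyze_udf = pandas_udf(analyze_series, returnType=StringType())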

The scanner identifies PII patterns such as:

Emails
Phone numbers
Credit card numbers
SSNs, ITIN
IBAN, BBAN
Passports
IP addresses (IPv4, IPv6, IPv4_with_port)
Free-text PII embedded in long text fields

This ensures the scanner is validated against both explicit and hidden PII values.

The PIIScanner class automates:

Applying Presidio detection to every column
Parsing results via from_json()
Aggregating findings using Num entities, Average score, Hit rate
Comparing against thresholds (hit_rate ≥ 60%, avg_score ≥ 0.5)
Returning only likely-sensitive columns

This produces a consistent, repeatable classification layer across catalogs.
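The thresholding step can be sketched in plain Python. The definitions used here are assumptions for illustration: hit rate is the share of sampled rows with at least one detection, and average score is the mean confidence across all detected entities. The input shape (one list of scores per row) is also hypothetical and is not the PIIScanner's actual data structure.

```python
from statistics import mean


def summarize_column(detections, hit_rate_min=0.60, avg_score_min=0.5):
    """Aggregate per-row detection scores into the three metrics and a verdict.

    detections: one list of confidence scores per sampled row (empty if nothing was found).
    """
    scores = [score for row in detections for score in row]
    hits = sum(1 for row in detections if row)
    hit_rate = hits / len(detections) if detections else 0.0
    avg_score = mean(scores) if scores else 0.0
    return {
        "num_entities": len(scores),
        "avg_score": avg_score,
        "hit_rate": hit_rate,
        "likely_pii": hit_rate >= hit_rate_min and avg_score >= avg_score_min,
    }


email_col = [[0.9], [1.0], [], [0.8], [1.0]]  # 4 of 5 sampled rows matched
notes_col = [[], [0.4], [], [], []]           # 1 of 5 sampled rows matched

email_summary = summarize_column(email_col)
notes_summary = summarize_column(notes_col)
print(email_summary["likely_pii"], notes_summary["likely_pii"])  # True False
```

Requiring both thresholds keeps a single low-confidence match in a free-text column from flagging the whole column.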

1.2 Catalog-Wide PII Scanning Across Unity Catalog

The notebook doesn’t stop at a single dataframe — it scans every table across multiple catalogs:

all_tables = pii_scanner.get_all_uc_tables(spark, catalogs)

For each securable (table or view), the following happens:

The table is sampled
Presidio UDF is applied
Matching PII types are collected
Results are returned into a single unified dataframe (scan_results)

This produces a sensitive data inventory across the entire lakehouse.

1.3 Automatic Classification Tags in Unity Catalog

When PII is found, the notebook automatically tags the table and columns using SQL commands invoked programmatically via Spark:

ALTER TABLE <table_name> SET TAGS ('PII')
ALTER TABLE <table_name> ALTER COLUMN email SET TAGS ('EMAIL_ADDRESS')

It also adds warnings as comments, such as:

> # WARNING! This column contains PII:

These tags act as the classification layer that drives downstream security policies.

2. Implementing Encryption / Anonymization Based on Data Classification

Once sensitive fields are discovered and tagged, the next step is ensuring they are protected. In this framework, protection is enforced using entity-aware anonymization rules powered by Microsoft Presidio’s AnonymizerEngine.

Using Unity Catalog tags as the source of truth, the system automatically selects the right anonymization strategy for each column: masking credit card numbers, replacing emails with a placeholder value, scrubbing names, redacting IP addresses, or sanitizing free-text fields containing mixed PII. This process is completely automated: tables tagged with PII are dynamically transformed into sanitized _anonymized versions that retain analytical value while removing sensitive information. This enables safe analytics, data sharing, model development, and downstream consumption without exposing raw PII.

2.1 Using Tags to Determine Sensitive Columns

The method get_pii_tagged_columns(...) queries Unity Catalog’s internal system tables:

SELECT column_name, tag_name
FROM system.information_schema.column_tags

This identifies exactly which columns:

Are tagged as PII
Have specific entity tags such as EMAIL_ADDRESS, IP_ADDRESS, SSN, etc.

These tags directly inform how each column should be anonymized.

2.2 Presidio-Based Anonymization Rules

The anonymization behavior is implemented using Presidio’s AnonymizerEngine with entity-specific operator rules:

defaults = {
    "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": ""}),
    "IP_ADDRESS": OperatorConfig("replace", {"new_value": ""}),
    "URL": OperatorConfig("replace", {"new_value": ""}),
    "PERSON": OperatorConfig("replace", {"new_value": ""}),
}

A factory function constructs Pandas UDFs to apply these transformations:

col_udf = make_anonymize_pandas_udf(entities, language, operator_cfg)

2.3 Creating Anonymized Copies of PII Tables

The method _anonymize_table applies anonymizers only to PII-tagged columns:

anon_df = df.select(*[col_udf(col(c)).alias(c) if c in pii_cols else col(c) for c in df.columns])

Then writes the sanitized version back to Unity Catalog:

anon_df.write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.{table}_anonymized")

A higher-level method, anonymize_all_tagged_tables_in_catalog, loops through all PII-tagged tables in a catalog and applies anonymization automatically.

This creates:

A raw table (containing sensitive data)
A sanitized version (safe for analytics, sharing, or lower-privilege users)

3. Reporting & Monitoring Access to Sensitive Data

A secure platform not only protects sensitive data but also provides visibility into where it is, how it is handled, and how it is accessed. This framework produces structured scan results for every table, creating a complete inventory of PII across catalogs — entity types, confidence levels, column names, and scan timestamps.

Because Unity Catalog tags are applied during classification, downstream access logs can now be tied directly to sensitivity levels. This enables the creation of dashboards that answer questions like: Which tables contain PII? Where has anonymization been applied? Who is accessing sensitive datasets? Are users querying raw PII when anonymized versions exist? Combined with Databricks audit logs, this becomes the foundation for a robust governance, compliance, and monitoring layer, ensuring the organization maintains full oversight of how regulated data flows through the Lakehouse.

3.1 PII Scan Results as a Governance Report

The scanner aggregates every scan into a single dataframe:

scan_results = pd.concat([…])

Each record contains:

scan_date
securable (catalog.schema.table)
column
entity_type
num_entities
avg_score
hit_rate

This becomes a PII Inventory Report across the entire lakehouse. Teams can store this in Delta tables for dashboards like:

Most common PII types
Tables with highest PII density
Catalogs with PII drift over time
A weekly “PII Health Check” report

3.2 Reporting on Anonymization Activity

The anonymization pipeline logs which tables:

Were identified as PII
Were anonymized
Were skipped due to no PII
Had errors

These logs can be converted into operational dashboards such as:

Anonymization coverage report
Raw vs anonymized table comparison
Policy compliance validation

3.3 Access Monitoring (Integration Point)

The classification layer enables meaningful access reporting by correlating:

Unity Catalog access logs
PII tags
Table usage logs
Anonymized vs non-anonymized table access patterns

This supports compliance reporting such as:

“Which users accessed raw PII this month?”
“Is anyone querying PII tables without anonymization?”
“Are access patterns aligned with RBAC/ABAC policies?”
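The correlation behind the first two questions can be sketched with plain Python sets. Everything in this example is hypothetical: the table names, the users, and the shape of the access events, which in practice would come from audit logs joined with the tag inventory.

```python
# Tables carrying the PII tag, and all tables that exist in the catalog (hypothetical).
pii_tagged_tables = {"main.crm.customers", "main.hr.employees"}
existing_tables = {
    "main.crm.customers",
    "main.crm.customers_anonymized",
    "main.hr.employees",
    "main.sales.orders",
}

# Simplified access events: who read which table.
access_events = [
    {"user": "alice@example.com", "table": "main.crm.customers"},
    {"user": "bob@example.com", "table": "main.crm.customers_anonymized"},
    {"user": "carol@example.com", "table": "main.sales.orders"},
    {"user": "dave@example.com", "table": "main.hr.employees"},
    {"user": "alice@example.com", "table": "main.crm.customers"},
]

# Which users accessed raw PII?
raw_pii_access = [e for e in access_events if e["table"] in pii_tagged_tables]
users_on_raw_pii = sorted({e["user"] for e in raw_pii_access})

# Which of those reads could have used an anonymized copy instead?
avoidable_access = sorted(
    {
        (e["user"], e["table"])
        for e in raw_pii_access
        if f"{e['table']}_anonymized" in existing_tables
    }
)

print(users_on_raw_pii)
print(avoidable_access)
```

The second list is the more actionable one: it names reads of raw PII where a sanitized table already existed.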

FinOps in Action: Monitoring and Optimizing Compute Costs in Databricks Using System Tables

Akshayprabhu — Fri, 05 Dec 2025 18:02:38 GMT

In the era of cloud-native analytics, the flexibility of scaling compute on demand comes with a catch: unpredictable and rising costs. This is where FinOps steps in. If you’re working with Databricks and wondering how to get better visibility into your usage costs and optimize them, you’re in the right place.

In this blog, I’ll walk you through:
- What FinOps is and why it’s essential
- How to implement FinOps principles using Databricks system tables
- A detailed breakdown of a Databricks dashboard I built to monitor compute costs at a workspace and user level

What is FinOps and Why Should You Care?

FinOps (short for Financial Operations) is the practice of managing and optimizing cloud spending through collaboration between engineering, finance, and product teams. It’s about making informed decisions that balance performance, speed, and cost.

FinOps brings transparency to cloud billing, enabling teams to answer critical questions:

Who’s spending the most?
Which workloads are inefficient?
Are we exceeding budgets?

Without a FinOps strategy, cloud bills become unpredictable, leading to wasted spend and poor ROI.

Why FinOps is Especially Important in Databricks

Databricks offers a powerful platform for data engineering and machine learning, but costs under its on-demand pricing model can quickly spiral out of control:

Clusters running idle
Underutilized compute
Storage costs piling up

Without clear cost accountability, even the most optimized Spark pipelines can burn through budget. That’s where system tables come in handy.

Building a FinOps Dashboard Using Databricks System Tables

Databricks provides system tables that expose metadata around jobs, clusters, users, and billing. With these tables, you can create a powerful internal FinOps dashboard.

What My Dashboard Tracks

Total compute cost over 6 months
Monthly compute cost by workspace
Monthly compute cost by user
Filters for workspace and user selection

Let’s break down how each component works.

Dataset 1: Total compute cost by workspace for last 6 months

This dataset calculates the total compute cost for the last 6 months:
usage.usage_quantity * list_prices.pricing.default AS cost

We join system.billing.usage with system.billing.list_prices to calculate compute cost and associate it with workspaces from system.access.workspaces_latest.

Key Insight: You can now identify which workspaces are consuming the most compute and track trends over time.

Dataset 2: Monthly cost analysis per user

This query helps attribute compute cost to specific users:
usage.identity_metadata.run_as AS user

By aggregating cost by user and month, we surface insights into who is generating compute workloads, which can help uncover unused clusters or overprovisioned jobs.

Dataset 3: Total cost over the last 6 months

A simple aggregation that gives a single-number snapshot of total compute cost over the last six months:
SELECT SUM(list_cost) AS total_cost FROM usage_with_cost

This metric helps leadership understand high-level spend without drilling into the weeds.

Dataset 4: Monthly Cost Analysis per Workspace

This allows us to visualize how costs fluctuate by workspace and month, helping teams understand seasonality or the impact of new data initiatives.

How Compute Costs Are Calculated

All datasets follow a similar pattern:
1. Pull usage metrics from system.billing.usage
2. Join with price data in system.billing.list_prices
3. Join with workspace metadata via system.access.workspaces_latest
4. Optionally attribute usage to users via identity_metadata.run_as

Each cost line is:
usage_quantity * pricing.default

And is grouped by:

Workspace ID / Name
User
Month
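The same calculation can be sketched in plain Python to make the join logic explicit. The rows below are invented and only mimic the shape of the system tables: the SKU names are simplified, the prices are made up, and the assumption that a price applies when the usage end time falls inside its validity window is an illustration rather than the dashboard's exact query.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows shaped like system.billing.usage (only the fields used here).
usage = [
    {"workspace_id": "ws-1", "run_as": "alice@example.com", "sku_name": "JOBS_COMPUTE",
     "usage_end_time": datetime(2025, 10, 3), "usage_quantity": 10.0},
    {"workspace_id": "ws-1", "run_as": "bob@example.com", "sku_name": "JOBS_COMPUTE",
     "usage_end_time": datetime(2025, 11, 5), "usage_quantity": 4.0},
    {"workspace_id": "ws-2", "run_as": "alice@example.com", "sku_name": "ALL_PURPOSE",
     "usage_end_time": datetime(2025, 11, 9), "usage_quantity": 2.0},
]

# Hypothetical rows shaped like system.billing.list_prices.
# price_end_time is None while a price is still current.
list_prices = [
    {"sku_name": "JOBS_COMPUTE", "price_start_time": datetime(2025, 1, 1),
     "price_end_time": datetime(2025, 11, 1), "default": 0.10},
    {"sku_name": "JOBS_COMPUTE", "price_start_time": datetime(2025, 11, 1),
     "price_end_time": None, "default": 0.20},
    {"sku_name": "ALL_PURPOSE", "price_start_time": datetime(2025, 1, 1),
     "price_end_time": None, "default": 0.50},
]


def price_for(sku_name: str, at: datetime) -> float:
    """Find the list price that was in effect for a SKU at a given time."""
    for price in list_prices:
        ended = price["price_end_time"]
        if (price["sku_name"] == sku_name
                and price["price_start_time"] <= at
                and (ended is None or at < ended)):
            return price["default"]
    raise LookupError(f"no list price for {sku_name} at {at}")


cost_by_workspace_month = defaultdict(float)
cost_by_user_month = defaultdict(float)
for row in usage:
    # Each cost line is usage_quantity * pricing.default.
    cost = row["usage_quantity"] * price_for(row["sku_name"], row["usage_end_time"])
    month = row["usage_end_time"].strftime("%Y-%m")
    cost_by_workspace_month[(row["workspace_id"], month)] += cost
    cost_by_user_month[(row["run_as"], month)] += cost

total_cost = sum(cost_by_workspace_month.values())
print(dict(cost_by_workspace_month))
print(round(total_cost, 2))
```

Matching on the price validity window matters because list prices can change over time, so a join on SKU name alone could count the same usage row more than once.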

Visualizing in Databricks Dashboards

The compute_cost_analysis dashboard brings it all together:

Total Cost Counter: One glance at the total dollar amount.
Bar Charts: Costs by workspace and user across time.
Filters: Select workspace or user to zoom in.

Final Thoughts

Implementing FinOps isn’t just about saving money — it’s about enabling growth with control. By using Databricks system tables, we gain clear visibility into who is using compute, how it’s being used, and where optimizations can be made.

What’s Next?

Smart Cluster Configuration Recommendations
Use cost and utilization data to suggest right-sizing or autoscaling policies.
Proactive Alerts
Set up alerts when spending spikes unexpectedly or usage trends break historical patterns.

Want the code or template?
Drop a comment or DM me — happy to share a redacted version and collaborate.