<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Akshayprabhu on Medium]]></title>
        <description><![CDATA[Stories by Akshayprabhu on Medium]]></description>
        <link>https://medium.com/@akshayprabhu818?source=rss-a67fa6edef6c------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*idyv63V7o2j5yiMi.jpg</url>
            <title>Stories by Akshayprabhu on Medium</title>
            <link>https://medium.com/@akshayprabhu818?source=rss-a67fa6edef6c------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 23 May 2026 16:03:30 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@akshayprabhu818/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Enabling Snowflake–Databricks Interoperability]]></title>
            <link>https://medium.com/@akshayprabhu818/enabling-snowflake-databricks-interoperability-c74a6611fb65?source=rss-a67fa6edef6c------2</link>
            <guid isPermaLink="false">https://medium.com/p/c74a6611fb65</guid>
            <dc:creator><![CDATA[Akshayprabhu]]></dc:creator>
            <pubDate>Fri, 30 Jan 2026 17:08:39 GMT</pubDate>
            <atom:updated>2026-04-03T13:56:45.440Z</atom:updated>
            <content:encoded><![CDATA[<h3>Snowflake–Databricks Interoperability: From Parallel Platforms to a Shared Data Foundation</h3><p><a href="https://medium.com/@akshayprabhu818/enabling-snowflake-databricks-interoperability-c74a6611fb65?source=friends_link&amp;sk=f260fb137c858195f1bbc3e9eb219c3e"><strong><em>Free access link</em></strong></a></p><p>Contributor: <a href="https://medium.com/u/3c8f25c249ab">Swapnilspra</a></p><p>Modern enterprises rarely rely on a single data platform. Over time, different teams adopt tools that best serve their needs — data engineers lean toward scalable processing engines, analysts toward governed warehouses, and data scientists toward flexible ML environments. This reality has led many organizations to run Databricks and Snowflake side by side.</p><p>Historically, however, these platforms have operated in silos, connected through brittle ETL pipelines and redundant data copies. That model is increasingly unsustainable — both technically and financially.</p><p>This blog explores why Snowflake–Databricks interoperability matters, why enterprises are actively pursuing it, and how recent advances in open table formats and catalogs are making it practical today.</p><h3>Why Snowflake and Databricks End Up Together</h3><p>Despite frequent comparisons, Snowflake and Databricks solve different problems.</p><p>Databricks is optimized for <strong>data creation</strong>. It excels at large-scale ingestion, complex transformations, streaming workloads, and machine learning. Spark’s flexibility makes it ideal for evolving datasets and computationally intensive processing.</p><p>Snowflake is optimized for <strong>data consumption</strong>. It offers strong governance, predictable performance, and high concurrency for SQL analytics and BI. For enterprise reporting and regulated analytics, Snowflake remains a natural choice.</p><p>Most organizations don’t choose one over the other — they choose both. The real question becomes how to connect them <strong>without turning data movement into the dominant architectural concern</strong>.</p><h3>Redefining Interoperability</h3><p>Interoperability is often misunderstood as better connectors or faster data transfers. The article makes a stronger point: <strong>real interoperability reduces the need to move data at all</strong>.</p><p>Instead of treating each platform as a destination with its own copy of the data, interoperable architectures treat data as a shared asset stored in cloud object storage. Snowflake and Databricks become <strong>compute layers</strong> that operate on that shared foundation.</p><p>This is not just an optimization — it is a conceptual shift. Pipelines no longer define the architecture. The data does.</p><h3>Why Interoperability Matters Now</h3><h4>Cost Has Become Architectural</h4><p>Copying data was once cheap enough to ignore. At scale, it no longer is.</p><p>Large datasets duplicated across platforms multiply storage costs and force teams to maintain parallel pipelines that do the same work twice. Every additional copy introduces new monitoring requirements, new failure modes, and new governance concerns.</p><p>Interoperability simplifies this by design. Data is written once and reused many times.</p><h4>Different Teams, Different Needs</h4><p>No single engine is ideal for every workload. Forcing all analytics, transformations, and ML into one platform leads to compromises and frustration.</p><p>Interoperability allows:</p><ul><li>Engineers and data scientists to stay productive in Databricks</li><li>Analysts and business users to stay productive in Snowflake</li><li>Both groups to work on the <em>same datasets</em></li></ul><p>This alignment reduces friction between teams and shortens the distance from raw data to insight.</p><h4>Governance Improves When Copies Disappear</h4><p>Every duplicated dataset increases risk. In regulated environments, each copy must be secured, audited, and governed independently.</p><p>A shared data layer does not eliminate governance challenges, but it narrows the surface area. Fewer copies mean fewer places where sensitive data can leak or drift out of compliance.</p><h4>Open Data Is Strategic Data</h4><p>Interoperability is also about control. When data lives in open formats on object storage, organizations retain freedom of choice. They can adopt new engines, evolve architectures, or rebalance workloads without being forced into a full platform migration.</p><p>This is not anti-vendor — it is pro-optionality.</p><h3>The Technology That Makes This Possible</h3><h4>Delta Lake’s Role — and Its Limits</h4><p>Delta Lake introduced reliability to data lakes by adding ACID transactions, schema enforcement, and time travel. It dramatically improved Spark-based data platforms and became foundational to Databricks.</p><p>However, Delta was initially optimized for a Spark-centric world, limiting its usefulness across heterogeneous engines.</p><h4>Apache Iceberg as the Common Language</h4><p>Apache Iceberg was designed for multi-engine access from day one. Its separation of data and metadata allows different compute engines to safely read the same tables while maintaining consistency.</p><p>This neutrality is why Iceberg has become the practical center of Snowflake–Databricks interoperability.</p><h4>Snowflake’s Move Toward Shared Data</h4><p>Snowflake’s native Iceberg support represents a fundamental change in posture. Snowflake no longer requires full ingestion to participate in analytics. It can query Iceberg tables stored externally while preserving its performance and governance characteristics.</p><p>This turns Snowflake from a closed destination into an active participant in shared-data architectures.</p><h4>Databricks and the Bridge to Iceberg</h4><p>Databricks has responded by embracing Iceberg directly and by introducing <strong>Delta UniForm</strong>, which exposes Iceberg metadata for Delta tables. This allows existing Delta Lake investments to participate in interoperable architectures without wholesale rewrites.</p><p>Unity Catalog further enables this by acting as a central governance and metadata layer.</p><h3>Integration Patterns</h3><p><strong>Access Databricks Iceberg Tables from Snowflake</strong></p><p><strong>Snowflake setup</strong></p><ol><li><a href="https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-azure"><strong>Snowflake external volume</strong></a> — Grant Snowflake restricted access to Microsoft Azure container where the databricks Iceberg tables and metadata are stored using an external volume.</li></ol><p>Note: Please perform all the prerequisites in the Snowflake documentation mentioned above to establish access to storage account without any hurdles.</p><p>Use below command to create snowflake External volume.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/868/1*9waN9aMeqrBwj7V4XvmlpA.png" /></figure><p>2. <a href="https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-catalog-integration-rest-unity"><strong>Snowflake external catalog</strong></a> — With this table type, Snowflake uses a catalog integration to retrieve information about Iceberg metadata and schema. The table data and metadata are stored in external cloud storage, which Snowflake accesses using an external volume.</p><p>The following diagram shows how an Iceberg table uses a catalog integration with an external Iceberg catalog.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/770/1*_KbKLnNnaK04jNVDwuYtBg.png" /></figure><p>Use below command to create Snowflake External Catalog to connect to the Databricks Unity Catalog.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/956/1*-VlslcNGjrESCCvt_fy2Jg.png" /></figure><p>Replace below in command and Execute:</p><p><em>CATALOG_NAMESPACE = Schema in Databricks</em></p><p><em>Catalog URI = Replace Databricks URI before /api, after databricks URI everything will remain same.</em></p><p><em>CATALOG_NAME = Catalog name in Databricks</em></p><p><em>BEARER TOKEN = Value of bearer token to be created in Databricks</em></p><p><strong>Azure Setup</strong></p><p>Allow Snowflake VNET subnet IDs in Azure Storage Account</p><p>If Azure storage firewall is configured to block all unauthorized traffic to Azure storage account, allowing VNet subnet IDs is required.</p><p>Please follow this <a href="https://docs.snowflake.com/en/user-guide/data-load-azure-allow">documentation</a> to allow the VNET subnet IDs from snowflake in the Storage Account.</p><p><strong>Create Database in Snowflake from Databricks</strong></p><p>Once all the previous steps have been setup, database from UC can be setup in snowflake using command in below screenshot. This will register all schemas and tables (<strong>only ICEBERG format</strong>) in snowflake database from Databricks UC for which we created the catalog integration.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/562/1*MmtBlA9GJA7HL1eywkpErg.png" /></figure><p><em>EXTERNAL_VOLUME = EXTERNAL VOLUME created in Snowflake</em></p><p><em>CATALOG = CATALOG INTEGRATION created in Snowflake</em></p><p><em>SYNC_INTERVAL_SECONDS = Specifies the time interval (in seconds) that Snowflake should use for automatically discovering schemas and tables in your remote catalog.</em></p><p><em>Values: 30 to 86400 (1 day), inclusive</em></p><p><em>Default: 30 seconds</em></p><p>Managed and External both tables can be discovered once the database is set up.</p><p><strong>Note</strong> — The table should be an Iceberg table to be visible in Snowflake. Existing tables (Managed or External) if not Iceberg can be modified by altering the table properties -</p><p><strong>Optional — Create tables in Snowflake</strong></p><p>If you do not want to create a full database from Databricks UC and need to create specific tables, they can be created in Snowflake using the command shown in screenshot below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/752/1*-vzjDblkMx0c3YTIo9uMNA.png" /></figure><p>Replace below in command and Execute:</p><p><em>EXTERNAL_VOLUME = EXTERNAL VOLUME created in Snowflake</em></p><p><em>CATALOG = CATALOG INTEGRATION created in Snowflake</em></p><p><em>CATALOG_TABLE_NAME = Databricks table name</em></p><h3>What Still Isn’t Solved</h3><p>The article is clear that interoperability is not magic.</p><p>Write coordination remains difficult, and most architectures enforce a single-writer rule. Governance policies must still be implemented separately per engine. Operational complexity shifts from ETL pipelines to metadata, catalogs, and access control alignment.</p><p>These trade-offs are real — but they are increasingly preferable to the alternative.</p><h3>When Interoperability Is Worth It</h3><p>Interoperability delivers the most value when:</p><ul><li>Data volumes are large</li><li>Teams use multiple analytics personas</li><li>Compliance requirements are strict</li><li>ML and BI must coexist without duplication</li></ul><p>In these environments, shared data architectures stop being optional and start becoming inevitable.</p><h3>Closing Thoughts</h3><p>Snowflake–Databricks interoperability reflects a broader evolution in data architecture. The industry is moving away from siloed platforms toward <strong>shared data foundations built on open standards</strong>.</p><p>This shift does not eliminate complexity — but it puts that complexity where it belongs: in data governance and design, not in endless data movement.</p><p>The takeaway is simple:<br> <strong>Store data once. Govern it well. Let engines compete on compute — not ownership of the data.</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c74a6611fb65" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Monitoring and Optimizing Databricks Jobs with DataFlint]]></title>
            <link>https://medium.com/@akshayprabhu818/monitoring-and-optimizing-databricks-jobs-with-dataflint-b2694b9fa036?source=rss-a67fa6edef6c------2</link>
            <guid isPermaLink="false">https://medium.com/p/b2694b9fa036</guid>
            <dc:creator><![CDATA[Akshayprabhu]]></dc:creator>
            <pubDate>Sun, 07 Dec 2025 21:26:04 GMT</pubDate>
            <atom:updated>2026-04-03T13:58:38.342Z</atom:updated>
            <content:encoded><![CDATA[<h3>Optimizing and Monitoring Databricks Jobs with DataFlint</h3><p>Contributors: <a href="https://medium.com/u/3c8f25c249ab">Swapnilspra</a></p><p>Databricks, built on top of Apache Spark, has become the go-to platform for large-scale data engineering, analytics, and AI workloads. Every Spark job that runs on Databricks consumes compute resources — CPU cores, memory, I/O, and storage. For teams running hundreds of daily ETL pipelines, even small inefficiencies can multiply into major cost and performance problems.</p><p>Despite Spark’s flexibility, understanding what happens under the hood is notoriously difficult. The Spark UI exposes task metrics, shuffle stats, and stage timelines — but interpreting them manually is tedious. Each job may produce hundreds of stages and thousands of tasks, making it hard to pinpoint performance bottlenecks. Identifying issues like data skew, underutilized cores, unbalanced partitions, or join misconfigurations requires deep Spark internals knowledge. Optimization often happens reactively — only after jobs fail or exceed SLA thresholds. While Databricks provides metrics and logs, there’s no simple, human-readable summary of “what went wrong” or “how to fix it”.</p><h4>DataFlint Makes Optimization Simple</h4><p><a href="https://github.com/dataflint/spark?utm_source=chatgpt.com"><strong>DataFlint</strong></a> is an open-source Spark plugin that surfaces detailed runtime insights and practical optimization suggestions directly inside the Spark UI. It acts as an intelligent performance advisor for your Spark jobs — continuously monitoring and diagnosing them in real-time.</p><p>DataFlint hooks into Spark’s internal event listeners and execution metrics to detect common inefficiencies such as:</p><ul><li><strong>Small tasks / excessive partitions:</strong> Detects high task counts with low processing time, suggesting .coalesce() or .repartition().</li><li><strong>Idle cores and underutilization:</strong> Identifies clusters that are oversized or waiting on slow tasks.</li><li><strong>Skewed shuffles:</strong> Flags uneven data distribution and recommends salting or repartitioning.</li><li><strong>Join inefficiencies:</strong> Recognizes Sort-Merge Joins and recommends <strong>broadcasting small tables</strong> for faster execution.</li><li><strong>Spill warnings:</strong> Highlights when data exceeds memory and spills to disk.</li></ul><h4>DataFlint UI Overview (using a sample job)</h4><p>For installation on Databricks, you can follow <a href="https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-databricks">DataFlint</a> page which explains every step in detail. Once it is installed on your Databricks cluster, it automatically augments the Spark UI with an additional <em>DataFlint</em> tab.</p><p>Each job execution is summarized in the following sections:</p><ul><li>Summary Dashboard: Lists alerts grouped by severity (warnings/errors).</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9h8Jrm2RQLuVZfW239yhzA.png" /></figure><ul><li>SQL Analysis: Shows SQL queries, detected join types (Broadcast, Sort-Merge, etc.), and performance hints.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pW4TZF3lFqIKC8KxnfNc1A.png" /></figure><ul><li>Stage Metrics: Displays task duration distributions, shuffle read/write sizes, and partition statistics.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mQae5jRjGfudrgYB3RydPg.png" /></figure><ul><li>Storage &amp; Caching Insights: Highlights unpersisted RDDs, skewed cached tables, and potential memory pressure.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uBJshx3yQ9YESN1B9lYDWQ.png" /></figure><p>A simple job like:</p><pre>from pyspark.sql import functions as F</pre><pre>spark.range(5_000_000).repartition(6000).groupBy(F.expr(&quot;id % 10&quot;)).count().collect()</pre><p>generates multiple alerts such as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KTIvcI0N_FX98B0gXKDttQ.png" /></figure><p>Each alert category highlights a specific performance inefficiency and points directly to the optimization you can apply.</p><h4>1. Small Tasks Alert</h4><p><strong>What it means</strong><br> Your job has too many small partitions, leading to:</p><ul><li>Excessive scheduling overhead</li><li>Low CPU utilization</li><li>Longer job runtime</li></ul><p><strong>What you can optimize</strong></p><ul><li>Increase partition size using:</li></ul><pre>df.coalesce(n)<br>df.repartition(n)</pre><ul><li>Adjust shuffle partitions:</li></ul><pre>spark.conf.set(“spark.sql.shuffle.partitions”, &lt;lower value&gt;)</pre><ul><li>Tune upstream operations to avoid unnecessary repartitions.</li></ul><h4>2. Idle Cores / Under-Utilization</h4><p><strong>What it means</strong><br>Executors are spending most of their time idle — the cluster is likely over-provisioned.</p><p><strong>What you can optimize</strong></p><ul><li>Reduce cluster size (fewer nodes or smaller instance type).</li><li>Decrease executor counts:</li></ul><pre>spark.executor.instances<br>spark.executor.cores</pre><ul><li>Enable/adapt Dynamic Allocation settings.</li><li>Coalesce tasks to increase per-task workload.</li></ul><h4>3. Broadcast Join Recommendation</h4><p><strong>What it means</strong><br> Spark is performing a Sort-Merge Join, shuffling large datasets across the cluster, even though one side is small enough to broadcast.</p><p><strong>What you can optimize</strong></p><ul><li>Broadcast the smaller dataset:</li></ul><pre>from pyspark.sql.functions import broadcast<br>df_large.join(broadcast(df_small), “key”)</pre><ul><li>Or increase auto broadcast threshold:</li></ul><pre>spark.conf.set(“spark.sql.autoBroadcastJoinThreshold”, “200MB”)</pre><p>This often produces significant speedups for join-heavy pipelines.</p><h4>4. Shuffle Heavy Alert</h4><p><strong>What it means</strong><br> Large shuffle read/write volumes detected. This is a common symptom of:</p><ul><li>Bad join keys</li><li>Unbalanced partitioning</li><li>Skew in data distribution</li></ul><p><strong>What you can optimize</strong></p><ul><li>Repartition on better keys.</li><li>Use salting for skewed joins.</li><li>Broadcast small tables where applicable.</li><li>Use bucketing (for stable schemas).</li></ul><h4>5. Spill Detected</h4><p><strong>What it means</strong><br> Spark ran out of executor memory and spilled intermediate data to disk — a major performance penalty.</p><p><strong>What you can optimize</strong></p><ul><li>Increase executor memory.</li><li>Increase shuffle partitions to reduce per-task load.</li><li>Cache or checkpoint intelligently.</li><li>Rebalance data to reduce skew.</li><li>Validate filter pushdown and prune unnecessary columns.</li></ul><h4>6. Skew Detection</h4><p><strong>What it means</strong><br> One or more tasks processed disproportionately more data than others, indicating uneven data distribution.</p><p><strong>What you can optimize</strong></p><ul><li>Salt join keys:</li></ul><pre>df.withColumn(“key_salted”, F.concat(“key”, F.rand()*10))</pre><ul><li>Repartition by key instead of round-robin.</li><li>Use skewJoinHint or AQE’s skew join handling in Spark 3.x.</li></ul><p>In summary, DataFlint helps turn Spark’s complex execution patterns into crystal-clear, guided optimization paths.<br> Instead of manually digging through the Spark UI and interpreting task graphs, you get human-readable alerts, clear explanations, and concrete optimization steps that immediately tell you what needs tuning and why.</p><p>This brings job optimization from “expert-only tribal knowledge” to <strong>a </strong>systematic and repeatable process, enabling every data engineer to improve performance and reduce compute cost with confidence.</p><h4>What’s Next: Persisting DataFlint Insights for Long-Term Monitoring</h4><p>The open-source version of DataFlint doesn’t persist metrics beyond the lifetime of the job. Once the Spark UI session ends, all those insights disappear with it.</p><p>To bridge this gap, we built a custom Spark plugin extension that automatically captures all DataFlint alerts and writes them into a managed Delta table — with zero code changes to user jobs. This enables building BI dashboards to visualize job health and cluster efficiency and identify high-waste pipelines, inefficient clusters, and opportunities for right-sizing. All of this becomes possible once DataFlint metrics are persisted in Delta and made queryable.</p><p>In the next story, I will walk through how we built this custom utility, how it plugs into DataFlint’s event listeners, and how you can adopt it for your Databricks environment.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b2694b9fa036" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Data Security in Databricks: A Practical Framework for Classification, Encryption & Access…]]></title>
            <link>https://medium.com/@akshayprabhu818/data-security-in-databricks-a-practical-framework-for-classification-encryption-access-b1cd62258134?source=rss-a67fa6edef6c------2</link>
            <guid isPermaLink="false">https://medium.com/p/b1cd62258134</guid>
            <category><![CDATA[pii-data]]></category>
            <category><![CDATA[unity-catalog]]></category>
            <category><![CDATA[databricks]]></category>
            <category><![CDATA[presidio]]></category>
            <dc:creator><![CDATA[Akshayprabhu]]></dc:creator>
            <pubDate>Sat, 06 Dec 2025 23:09:46 GMT</pubDate>
            <atom:updated>2025-12-07T20:34:53.840Z</atom:updated>
            <content:encoded><![CDATA[<h3>Data Security in Databricks: A Practical Framework for Classification, Encryption &amp; Access Reporting</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/1*3qKgnhTmPX7n1t1uul5Ovg.jpeg" /><figcaption>The cornerstone of data protection: encrypt everything, trust nothing by default.</figcaption></figure><p>Contributors — <a href="https://medium.com/u/22f2044e3caf">Sahil Sawant</a></p><p><a href="https://medium.com/@akshayprabhu818/data-security-in-databricks-a-practical-framework-for-classification-encryption-access-b1cd62258134?source=friends_link&amp;sk=9067990aac6d34f861bebefc6b75f486"><em>Free access link</em></a></p><p>Protecting sensitive information is one of the most important responsibilities in modern data platforms. Whether building enterprise analytics in banking, public sector, or philanthropy, organizations must ensure that <strong>PII, confidential, and regulated data</strong> are properly identified, classified, secured, and monitored.</p><p>In this article, we break down a full security framework for Databricks across three pillars:</p><ol><li><strong>Identification of Sensitive Data</strong></li><li><strong>Implementing Encryption / Anonymization Based on Data Classification</strong></li><li><strong>Reporting &amp; Monitoring Access to Sensitive Data</strong></li></ol><p>The examples are grounded entirely in a <strong>real use case built in Databricks</strong>, where <a href="https://microsoft.github.io/presidio/"><strong>Microsoft Presidio</strong></a>, an open-source data protection and de-identification SDK from Microsoft, performs automated PII detection and <strong>Unity Catalog</strong> enforces governance using tags, comments, and table-level policies.</p><p><a href="https://github.com/akshayprabhu15/databricks-pii-governance-framework.git"><strong>Full Source Code</strong></a> — The complete Databricks notebooks, PII detection pipeline, and anonymization framework used in this article are available here.</p><h3>1. Identification of Sensitive Data (Data Classification Layer)</h3><p>The first step in any secure data architecture is knowing <strong>where sensitive information exists</strong>. In a modern Lakehouse, PII can appear in structured fields, nested structures, arrays, or even free-text columns where identifiers are buried in natural language. Relying on manual reviews or naming conventions is never enough.</p><p>In this framework, sensitive data is identified using a <strong>fully automated Presidio-based scanner</strong> that evaluates every column across Unity Catalog. Each field is analyzed using entity recognizers for emails, phone numbers, credit cards, IPs, SSNs, IBANs, names, and dozens of other patterns. The scanner aggregates confidence scores and hit rates to determine which columns are likely to contain PII, then writes this intelligence back into Unity Catalog via tags and comments. This creates a <strong>living classification layer</strong> that continuously keeps the platform aware of where regulated data resides.</p><h4>Goal:</h4><p>Automatically detect, classify, and tag sensitive data so the governance system knows how to protect it.</p><h4>1.1 Automated PII Detection Using Microsoft Presidio</h4><p>In the usecase, Microsoft Presidio is integrated directly into Databricks as a <strong>broadcasted AnalyzerEngine</strong>:</p><pre>broadcasted_analyzer = sc.broadcast(AnalyzerEngine())</pre><p>A custom <strong>Pandas UDF</strong> (analyze_udf) runs Presidio against each column:</p><p>def analyze_series(s: pd.Series) -&gt; pd.Series:<br> return s.astype(str).apply(analyze_text)</p><pre>analyze_udf = pandas_udf(analyze_series, returnType=StringType())</pre><h4>The scanner identifies PII patterns such as:</h4><ul><li>Emails</li><li>Phone numbers</li><li>Credit card numbers</li><li>SSNs, ITIN</li><li>IBAN, BBAN</li><li>Passports</li><li>IP addresses (IPv4, IPv6, IPv4_with_port)</li><li>Free-text PII embedded in long text fields</li></ul><p>This ensures the scanner is validated against both explicit and hidden PII values.</p><h4>The PIIScanner class automates:</h4><ul><li>Applying Presidio detection to <strong>every column</strong></li><li>Parsing results via from_json()</li><li>Aggregating findings using Num entities, Average score, Hit rate</li><li>Comparing against thresholds (hit_rate ≥ 60%, avg_score ≥ 0.5)</li><li>Returning only likely-sensitive columns</li></ul><p>This produces a consistent, repeatable classification layer across catalogs.</p><h4>1.2 Catalog-Wide PII Scanning Across Unity Catalog</h4><p>The notebook doesn’t stop at a single dataframe — it scans <strong>every table across multiple catalogs</strong>:</p><pre>all_tables = pii_scanner.get_all_uc_tables(spark, catalogs)</pre><p>For each securable (table or view), the following happens:</p><ul><li>The table is sampled</li><li>Presidio UDF is applied</li><li>Matching PII types are collected</li><li>Results are returned into a single unified dataframe (scan_results)</li></ul><p>This produces a <strong>sensitive data inventory</strong> across the entire lakehouse.</p><h4>1.3 Automatic Classification Tags in Unity Catalog</h4><p>When PII is found, the notebook automatically tags the table and columns using SQL commands invoked programmatically via Spark:</p><pre>ALTER TABLE &lt;table&gt; SET TAGS (‘PII’)<br>ALTER TABLE &lt;table&gt; ALTER COLUMN email SET TAGS (‘EMAIL_ADDRESS’)</pre><p>It also adds <strong>warnings as comments</strong>, such as:</p><pre>&gt; # WARNING! This column contains PII:</pre><p>These tags act as the <strong>classification layer</strong> that drives downstream security policies.</p><h3>2. Implementing Encryption / Anonymization Based on Data Classification</h3><p>Once sensitive fields are discovered and tagged, the next step is ensuring they are protected. In this framework, protection is enforced using entity-aware anonymization rules powered by Microsoft Presidio’s AnonymizerEngine.</p><p>Using Unity Catalog tags as the source of truth, the system automatically selects the right anonymization strategy for each column — masking credit card numbers, replacing emails with &lt;EMAIL&gt;, scrubbing names, redacting IP addresses, or sanitizing free-text fields containing mixed PII. This process is completely automated: tables tagged with PII are dynamically transformed into sanitized _anonymized versions that retain analytical value while removing sensitive information. This enables safe analytics, data sharing, model development, and downstream consumption without exposing raw PII.</p><h4>2.1 Using Tags to Determine Sensitive Columns</h4><p>The method get_pii_tagged_columns(...) queries Unity Catalog’s internal system tables:</p><pre>SELECT column_name, tag_name<br>FROM system.information_schema.column_tagsSELECT column_name, tag_name<br>FROM system.information_schema.column_tags</pre><p>This identifies exactly which columns:</p><ul><li>Are tagged as PII</li><li>Have specific entity tags such as EMAIL_ADDRESS, IP_ADDRESS, SSN, etc.</li></ul><p>These tags directly inform how each column should be anonymized.</p><h4>2.2 Presidio-Based Anonymization Rules</h4><p>The anonymization behavior is implemented using Presidio’s <strong>AnonymizerEngine</strong> with entity-specific operator rules:</p><pre>defaults = {<br> “EMAIL_ADDRESS”: OperatorConfig(“replace”, {“new_value”: “&lt;EMAIL&gt;”}),<br> “IP_ADDRESS”: OperatorConfig(“replace”, {“new_value”: “&lt;IP&gt;”}),<br> “URL”: OperatorConfig(“replace”, {“new_value”: “&lt;URL&gt;”}),<br> “PERSON”: OperatorConfig(“replace”, {“new_value”: “&lt;PERSON&gt;”}),<br>}</pre><p>A factory function constructs Pandas UDFs to apply these transformations:</p><pre>col_udf = make_anonymize_pandas_udf(entities, language, operator_cfg)</pre><h4>2.3 Creating Anonymized Copies of PII Tables</h4><p>The method _anonymize_table applies anonymizers only to PII-tagged columns:</p><pre>anon_df = df.select(*[col_udf(col(c)) if c in pii_cols else col(c)])</pre><p>Then writes the sanitized version back to Unity Catalog:</p><pre>anon_df.write.mode(“overwrite”).saveAsTable(f”{catalog}.{schema}.{table}_anonymized”)</pre><p>A higher-level method, anonymize_all_tagged_tables_in_catalog, loops through all PII-tagged tables in a catalog and applies anonymization automatically.</p><p>This creates:</p><ul><li>A <strong>raw table</strong> (containing sensitive data)</li><li>A <strong>sanitized version</strong> (safe for analytics, sharing, or lower-privilege users)</li></ul><h3>3. Reporting &amp; Monitoring Access to Sensitive Data</h3><p>A secure platform not only protects sensitive data but also provides <strong>visibility</strong> into where it is, how it is handled, and how it is accessed. This framework produces structured scan results for every table, creating a complete inventory of PII across catalogs — entity types, confidence levels, column names, and scan timestamps.</p><p>Because Unity Catalog tags are applied during classification, downstream access logs can now be tied directly to sensitivity levels. This enables the creation of dashboards that answer questions like: Which tables contain PII? Where has anonymization been applied? Who is accessing sensitive datasets? Are users querying raw PII when anonymized versions exist? Combined with Databricks audit logs, this becomes the foundation for a robust <strong>governance, compliance, and monitoring layer</strong>, ensuring the organization maintains full oversight of how regulated data flows through the Lakehouse.</p><h4>3.1 PII Scan Results as a Governance Report</h4><p>It aggregates every scan into a single dataframe:</p><pre>scan_results = pd.concat([…])</pre><p>Each record contains:</p><ul><li>scan_date</li><li>securable (catalog.schema.table)</li><li>column</li><li>entity_type</li><li>num_entities</li><li>avg_score</li><li>hit_rate</li></ul><p>This becomes a <strong>PII Inventory Report</strong> across the entire lakehouse. Teams can store this in Delta tables for dashboards like:</p><ul><li>Most common PII types</li><li>Tables with highest PII density</li><li>Catalogs with PII drift over time</li><li>A weekly “PII Health Check” report</li></ul><h4>3.2 Reporting on Anonymization Activity</h4><p>The anonymization pipeline logs which tables:</p><ul><li>Were identified as PII</li><li>Were anonymized</li><li>Were skipped due to no PII</li><li>Had errors</li></ul><p>These logs can be converted into operational dashboards such as:</p><ul><li><strong>Anonymization coverage report</strong></li><li><strong>Raw vs anonymized table comparison</strong></li><li><strong>Policy compliance validation</strong></li></ul><h4>3.3 Access Monitoring (Integration Point)</h4><p>The classification layer enables meaningful access reporting by correlating:</p><ul><li>Unity Catalog access logs</li><li>PII tags</li><li>Table usage logs</li><li>Anonymized vs non-anonymized table access patterns</li></ul><p>This supports compliance reporting such as:</p><ul><li>“Which users accessed raw PII this month?”</li><li>“Is anyone querying PII tables without anonymization?”</li><li>“Are access patterns aligned with RBAC/ABAC policies?”</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b1cd62258134" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[FinOps in Action: Monitoring and Optimizing Compute Costs in Databricks Using System Tables]]></title>
            <link>https://medium.com/@akshayprabhu818/finops-in-action-monitoring-and-optimizing-compute-costs-in-databricks-using-system-tables-12985814f2a6?source=rss-a67fa6edef6c------2</link>
            <guid isPermaLink="false">https://medium.com/p/12985814f2a6</guid>
            <dc:creator><![CDATA[Akshayprabhu]]></dc:creator>
            <pubDate>Fri, 05 Dec 2025 18:02:38 GMT</pubDate>
            <atom:updated>2026-02-24T18:49:49.647Z</atom:updated>
            <content:encoded><![CDATA[<p>Contributors —</p><ul><li><a href="https://medium.com/u/3c8f25c249ab">Swapnilspra</a></li><li><a href="https://medium.com/u/4b246283c7d3">Hitesh Parab</a></li></ul><p><a href="https://medium.com/@akshayprabhu818/finops-in-action-monitoring-and-optimizing-compute-costs-in-databricks-using-system-tables-12985814f2a6?source=friends_link&amp;sk=a7152a17e74e0787f6fba4f4cb800bfc"><em>Free access link</em></a></p><p>In the era of cloud-native analytics, the flexibility of scaling compute on demand comes with a catch: <strong>unpredictable and rising costs</strong>. This is where FinOps steps in. If you’re working with Databricks and wondering how to get better visibility into your usage costs and optimize them, you’re in the right place.</p><p>In this blog, I’ll walk you through:<br>- What FinOps is and why it’s essential <br>- How to implement FinOps principles using <strong>Databricks system tables</strong> <br>- A detailed breakdown of a Databricks dashboard I built to monitor compute costs at a <strong>workspace and user level</strong></p><h3><strong>What is FinOps and Why Should You Care?</strong></h3><p><strong>FinOps</strong> (short for <em>Financial Operations</em>) is the practice of managing and optimizing cloud spending through collaboration between engineering, finance, and product teams. It’s about making informed decisions that balance performance, speed, and cost.</p><p>FinOps brings transparency to cloud billing, enabling teams to answer critical questions:</p><ul><li>Who’s spending the most?</li><li>Which workloads are inefficient?</li><li>Are we exceeding budgets?</li></ul><p>Without a FinOps strategy, cloud bills become unpredictable, leading to wasted spend and poor ROI.</p><h3><strong>Why FinOps is Especially Important in Databricks</strong></h3><p>Databricks offers a powerful platform for data engineering and machine learning, but its <strong>on-demand pricing</strong> model can quickly spiral out of control:</p><ul><li>Clusters running idle</li><li>Underutilized compute</li><li>Storage costs piling up</li></ul><p>Without clear cost accountability, even the most optimized Spark pipelines can burn through budget. That’s where <strong>system tables</strong> come in handy.</p><h3><strong>Building a FinOps Dashboard Using Databricks System Tables</strong></h3><p>Databricks provides system tables that expose metadata around jobs, clusters, users, and billing. With these tables, you can create a powerful internal FinOps dashboard.</p><h3>What My Dashboard Tracks</h3><ul><li><strong>Total compute cost over 6 months</strong></li><li><strong>Monthly compute cost by workspace</strong></li><li><strong>Monthly compute cost by user</strong></li><li>Filters for workspace and user selection</li></ul><p>Let’s break down how each component works.</p><h4><strong>Dataset 1: Total compute cost by workspace for last 6 months</strong></h4><p>This dataset calculates the <strong>total compute cost</strong> for the last 6 months:<br>usage.usage_quantity * list_prices.pricing.default AS cost</p><p>We join <strong>system.billing.usage</strong> with <strong>system.billing.list_prices</strong> to calculate compute cost and associate it with workspaces from <strong>system.access.workspaces_latest</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EK097t6JwVmS-4dL_JJ78Q.png" /></figure><p><strong>Key Insight</strong>: You can now identify which workspaces are consuming the most compute and track trends over time.</p><h4>Dataset 2: Monthly cost analysis per user</h4><p>This query helps <strong>attribute compute cost to specific users</strong>:<br><strong>usage.identity_metadata.run_as</strong> AS user</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/919/1*fuTS23RN4TxKPCJK3oY2dA.png" /></figure><p>By aggregating cost by user and month, we surface insights into <strong>who </strong>is generating compute workloads — and potentially uncovering unused clusters or overprovisioned jobs.</p><h4>Dataset 3: Total cost 6 months</h4><p>A simple aggregation that gives a <strong>single-number snapshot</strong> of total compute cost over the last six months:<br>SELECT SUM(list_cost) AS total_cost FROM usage_with_cost</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*snrUDHvJT8Gbr9w2xo_ujA.png" /></figure><p>This metric helps leadership understand high-level spend without drilling into the weeds.</p><h4>Dataset 4: Monthly Cost Analysis per Workspace</h4><p>This allows us to visualize how costs fluctuate by <strong>workspace and month</strong>, helping teams understand seasonality or the impact of new data initiatives.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/933/1*jSj_PIVgRqojeOXLUqJYlg.png" /></figure><h3>How Compute Costs Are Calculated</h3><p>All datasets follow a similar pattern:<br>1. Pull usage metrics from system.billing.usage <br>2. Join with price data in system.billing.list_prices <br>3. Join with workspace metadata via system.access.workspaces_latest <br>4. Optionally attribute usage to users via identity_metadata.run_as</p><p>Each cost line is:<br>usage_quantity * pricing.default</p><p>And is grouped by:</p><ul><li>Workspace ID / Name</li><li>User</li><li>Month</li></ul><h3>Visualizing in Databricks Dashboards</h3><p>The compute_cost_analysis dashboard brings it all together:</p><ul><li><strong>Total Cost Counter</strong>: One glance at the total dollar amount.</li><li><strong>Bar Charts</strong>: Costs by workspace and user across time.</li><li><strong>Filters</strong>: Select workspace or user to zoom in.</li></ul><h4>Final Thoughts</h4><p>Implementing FinOps isn’t just about saving money — it’s about <strong>enabling growth with control</strong>. By using Databricks system tables, we gain clear visibility into who is using compute, how it’s being used, and where optimizations can be made.</p><h3>What’s Next?</h3><ul><li><strong>Smart Cluster Configuration Recommendations</strong><br> Use cost and utilization data to suggest right-sizing or autoscaling policies.</li><li><strong>Proactive Alerts</strong><br> Set up alerts when spending spikes unexpectedly or usage trends break historical patterns.</li></ul><p><strong>Want the code or template? </strong><br>Drop a comment or DM me — happy to share a redacted version and collaborate.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=12985814f2a6" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>