<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Harish K Rao on Medium]]></title>
        <description><![CDATA[Stories by Harish K Rao on Medium]]></description>
        <link>https://medium.com/@harishkrao?source=rss-5dfb120ae725------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*iG0wyd9BMSk3BKL9YnlC1Q.jpeg</url>
            <title>Stories by Harish K Rao on Medium</title>
            <link>https://medium.com/@harishkrao?source=rss-5dfb120ae725------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 25 May 2026 22:04:09 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@harishkrao/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Building a Governed Data Lakehouse]]></title>
            <link>https://medium.com/@harishkrao/building-a-governed-data-lakehouse-9c1707ebe0ed?source=rss-5dfb120ae725------2</link>
            <guid isPermaLink="false">https://medium.com/p/9c1707ebe0ed</guid>
            <category><![CDATA[data-lakehouse]]></category>
            <category><![CDATA[delta-lake]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[unity-catalog]]></category>
            <category><![CDATA[databricks]]></category>
            <dc:creator><![CDATA[Harish K Rao]]></dc:creator>
            <pubDate>Sat, 23 May 2026 11:06:02 GMT</pubDate>
            <atom:updated>2026-05-23T11:12:13.067Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introduction</h3><p>Most data teams building a data lakehouse treat governance, schema conformity, transactional integrity, and versioning as an afterthought. By the time they get to it, there are already three definitions of “active customer,” two pipelines implementing variations of the same logic for the same use case, and several dashboards reading from the same source but producing different numbers for the same metric, quietly eroding trust in the entire data platform.</p><blockquote><em>The payoff of getting data governance right is tangible. It prevents data swamps, compliance exposure, and AI systems that hallucinate because the data they were trained on was never reliable to begin with.</em></blockquote><p>The decisions you make in the early phase of a greenfield lakehouse project have a lasting impact on the data platform. I learned this by living through the consequences of both getting it right early and retrofitting it later. This post is an account of what building a governed lakehouse actually looks like in practice.</p><h3>Why do organisations need a data lakehouse?</h3><p>It is important to understand what a data lakehouse is and what problems it solves.</p><p>Earlier, organisations managed their data lifecycle through several separate systems:</p><ul><li>a data warehouse for structured analytics</li><li>a data lake for raw and unstructured data</li><li>separate ML platforms for model development</li><li>the BI layer on top of the above systems</li></ul><p>Each system had its own access control, audit trail and governance framework. Moving data between systems was cumbersome, and the overhead of maintaining multiple copies of the same data, in different formats and with different freshness guarantees, compounded over time.</p><p>The lakehouse pattern combines all of this and aims to simplify the data architecture. 
It combines the flexibility and cost-effectiveness of a data lake with the reliability and query performance of a data warehouse in a single, unified, open-format architecture. Structured, semi-structured, unstructured, and streaming data all coexist in the same platform. Analytics, ML, and AI workloads run against the same underlying data. And there is one governance layer for all of it.</p><p>This is technically possible because of the underlying storage format layer. Traditional data lakes lacked ACID transaction support, which meant failed jobs could corrupt files, concurrent writes from multiple pipelines could compromise data integrity, and schema enforcement was absent by default. Modern open table formats like Delta Lake and Apache Iceberg provide ACID guarantees, schema enforcement, time travel, and audit history on top of Parquet, thereby bringing data warehouse reliability to the flexibility of object storage. That is the foundation everything else is built on.</p><blockquote><em>In Databricks, Delta Lake’s transaction log, housed within the </em><em>_delta_log directory, is the mechanism that implements the above concepts of ACID guarantees, schema enforcement, time travel and audit history.</em></blockquote><p>Every write to a Delta table appends a JSON entry to the log recording exactly what changed, when, and by which operation. 
This means failed jobs leave the table in a clean state rather than corrupting it or leaving partial writes, and time travel queries (VERSION AS OF, TIMESTAMP AS OF) are first-class operations that help users restore previous versions.</p><p>However, having a reliable storage format does not automatically make it a governed lakehouse.</p><h3>What is “well-governed” in the context of a data lakehouse?</h3><p>A well-governed lakehouse:</p><ul><li>contains data that can be trusted</li><li>is secure</li><li>is discoverable</li><li>is reliable.</li></ul><p>Governance is baked into the architecture from the beginning, not retrofitted afterward.</p><p>It also means the organisation is at some stage of a data maturity curve.</p><ol><li>In the early stages, the focus is on historical reporting: structured data with batch loads, SQL queries, precanned dashboards.</li><li>As maturity increases, teams add streaming data, ML workloads, predictive analytics.</li><li>At the most mature end, organisations build AI systems that learn from their data and its surrounding metadata: RAG pipelines, agents, natural language interfaces.</li></ol><p>Each stage of that curve makes higher demands on the governance layer beneath it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/573/0*JwA5c1mHurYypWzg.png" /><figcaption>Image Courtesy: Databricks</figcaption></figure><p>Let us walk through what that looks like in practice.</p><h3>Layered architecture with clear data contracts</h3><p>The Medallion architecture, the Bronze / Silver / Gold pattern, is widely adopted across the industry.</p><ul><li>Bronze is raw data, stored exactly as it is received, append-only, and carrying ingestion metadata.</li><li>Silver is cleaned, conformed, and deduplicated data.</li><li>Gold is aggregated and optimised data, ready for consumption.</li></ul><p>However, the architecture is not effective if it is inconsistently implemented. 
The layers work only with a clear contract definition for each layer:</p><ul><li>what data enters</li><li>in what form</li><li>with what guarantees</li></ul><p>This is typically effective when enforced via a combination of process and system checks.</p><blockquote><em>In Databricks, Delta Lake schema enforcement makes this easy. Setting </em><em>mergeSchema to </em><em>false (the default) means a pipeline writing a column that does not exist in the target table will fail explicitly rather than silently drifting the schema to an undesired structure.</em></blockquote><p>Combined with Auto Loader’s schema evolution settings (for example, cloudFiles.schemaEvolutionMode set to failOnNewColumns), this enforces the schema at the Bronze ingestion layer so that upstream schema changes are caught there, rather than being discovered far downstream when the data has already landed in a Gold table.</p><p>For the Silver-to-Gold contract specifically, Delta Lake constraints add another enforcement layer:</p><pre>ALTER TABLE gold.daily_account_revenue<br>ADD CONSTRAINT revenue_non_negative<br>CHECK (daily_revenue_usd &gt;= 0);</pre><p>A constraint violation at write time blocks the operation and surfaces the problem to the pipeline owner immediately. This is enforcement built into the storage layer, not maintained via a separate data quality check or tooling.</p><p><strong>In practice, the contract between the Silver and the Gold layers is where ambiguity typically sets in and where most governance failures occur. Teams often debate what goes into which layer and what the contracts for the layers are. It is important to reach a clear consensus early on and stick to it.</strong></p><p>If a Gold table is rebuilt from a Silver table whose schema was changed without notice, every downstream consumer breaks. 
Clear contracts mean documented grain, documented SLAs, and a defined breaking-change policy (most importantly, Standard Operating Procedures for communicating changes to all stakeholders).</p><p>The layer boundaries might feel like overhead at the beginning. As the data platform matures, they keep the platform consistently governed as the number of producers and consumers multiplies.</p><h3>Global naming convention</h3><p>When several teams are building pipelines independently, naming conventions diverge over time. A column called customer_id in one domain turns out to be a different entity from customer_id in another: same name, different grain, different update cadence. Joins that look correct produce wrong results. Data quality checks pass, yet the data is inconsistent or corrupt downstream. Reports diverge.</p><p>The conventions that mattered most in practice: domain-prefixed table names, consistent use of surrogate versus natural keys across the platform, standardised date column naming (created_at, updated_at, effective_from, effective_to), and agreed abbreviation rules for column names in data products.</p><p>In Unity Catalog, the three-level namespace (catalog.schema.table) makes it possible to define domain boundaries. A table registered as prod.sales.sales_bookings is architecturally separated from prod.product.events. Users cannot join across domains without explicitly referencing both namespaces. This enforces domain separation at the metadata layer, but it is effective only when complemented by sound naming conventions.</p><p>When data products are defined declaratively, with the schema, column descriptions, grain, and SLAs in a specification, naming consistency becomes enforceable rather than advisory. Teams cannot easily drift from a standard that is codified in the data product specification.</p><p>On a slightly different note, such problems are usually discovered by Data Quality checks at several points in the pipeline. 
If a team does not have such checks in place, these problems are reported by customers or business users instead, which means delayed discovery, wrong decisions made on incorrect data, and further knock-on problems. Identifying the source of the problem also becomes tedious, since without Data Quality checks it is hard to know where the problem originates.</p><h3>Domain ownership</h3><p>It is not scalable to have a central data team own every table in the lakehouse. Data mesh principles guide such scenarios: the teams closest to the data understand it best, and should own not just the data, but the entire data lifecycle and its governance. In practice, domain ownership means that the team producing the data is responsible for its quality, its schema evolution, and its SLAs. The platform team provides the guardrails (Unity Catalog access policies, schema validation, lineage tracking and other tooling) but does not own the data itself.</p><p>Each table, schema, and catalog in Unity Catalog has an explicit owner, a user or a group, and permissions are configured and inherited hierarchically. Assigning ownership of the prod.sales schema to the Sales team means they control access grants for their own data products. The platform team retains metastore-level administration without owning the data.</p><p>In practice, the platform team also defines and enforces standards across the metastore: table property requirements, mandatory column tags for PII, required grain and owner properties at table registration. 
These are enforced via Unity Catalog table properties.</p><p>Enforcing these standards prevents teams from building pipelines that do not conform to agreed contracts or that use different naming conventions, and so prevents tables whose fields cannot be joined reliably across domains.</p><p>When a team is responsible for their data products:</p><ul><li>they write column descriptions that explain business meaning, not just data type.</li><li>they define the grain explicitly.</li><li>they version their schemas.</li><li>they own their SLA and are paged when they miss it.</li></ul><blockquote><em>The difference between a team that owns their data and a team that merely produces it is visible in the quality of the metadata.</em></blockquote><h3>Data contracts</h3><p>A data contract is the explicit agreement between a data producer and its consumers. It mainly covers four things: schema, grain, SLAs, and the breaking-change policy.</p><p><strong>Schema:</strong> what columns exist, their types, and which are nullable. In Delta Lake, the DESCRIBE HISTORY SQL command and the transaction log provide the history of changes, from which prior schema versions and the timestamp of each change can be recovered.</p><p><strong>Grain:</strong> what one row represents: one event, one account per day, one transaction. This is the most commonly undocumented property, which also means that Data Engineers can easily overlook it and implement inconsistent grains. 
In Databricks, the grain is best documented as a Unity Catalog table property:</p><pre>ALTER TABLE silver.transactions<br>SET TBLPROPERTIES (<br>  &#39;grain&#39; = &#39;one row per transaction_id&#39;,<br>  &#39;owner&#39; = &#39;payments-engineering@company.com&#39;,<br>  &#39;sla_hours&#39; = &#39;4&#39;<br>);</pre><p>These properties are queryable via SHOW TBLPROPERTIES and surfaced in the Unity Catalog UI, making the grain a first-class and discoverable attribute.</p><p><strong>SLA:</strong> when is the data expected to be available and how fresh is it?</p><p><strong>Breaking-change policy:</strong> what constitutes a breaking change, how much notice consumers receive, and what the migration path looks like.</p><p>An example scenario: a producer table’s column type was recently changed from integer to string during a routine pipeline refactor. The downstream Gold table schema data quality check caught it six hours later, after three dashboards had already served incorrect aggregations to business stakeholders. Tracing the root cause took two engineers almost a day. The fix took twenty minutes thereafter. A contract with a breaking-change policy would have prevented the type change from reaching downstream without a version increment and consumer notification.</p><h3>Governance and access</h3><p>Data classification precedes access control decisions. 
Before you can apply access controls, you need to know whether the incoming data contains PII, user-generated content, financial data, internal-only metrics, or publicly referenceable aggregates.</p><p>For instance, Databricks Unity Catalog’s tagging system handles this if used consistently. Tags applied at the column level propagate to downstream derived tables through lineage, meaning a column tagged pii:email in Bronze retains that classification when it appears in Silver and Gold, without requiring manual re-tagging at each layer:</p><pre>ALTER TABLE bronze.raw_events<br>ALTER COLUMN user_email<br>SET TAGS (&#39;pii&#39; = &#39;email&#39;, &#39;sensitivity&#39; = &#39;high&#39;);</pre><p>Column masking policies can then be driven by these tags. The simplest form attaches a mask function directly to the column, so a role without PII access sees user_email as **** without needing masking code to be implemented as part of the data pipeline:</p><pre>-- Mask function: members of pii-access see the value, everyone else sees ****<br>CREATE FUNCTION mask_pii_email(user_email STRING)<br>RETURNS STRING<br>RETURN CASE<br>  WHEN is_member(&#39;pii-access&#39;) THEN user_email<br>  ELSE &#39;****&#39;<br>END;<br>-- Attach the mask to the column<br>ALTER TABLE bronze.raw_events<br>ALTER COLUMN user_email SET MASK mask_pii_email;</pre><p>Access controls in most cloud platforms include RBAC (role-based) and ABAC (attribute-based) models, which provide controls at different granularities. RBAC typically operates at the table and schema level; ABAC at the row and column level. For example, ABAC can mask a PII column for one role while exposing it to another.</p><p>PII handling includes masking, but it also includes retention policies, audit logging of who accessed what, and ensuring that PII does not leak through derived columns or aggregations. Governance in regulated environments has to account for derived sensitivity, not just source sensitivity.</p><p>Delta Lake’s VACUUM command honours retention thresholds. For PII data with hard deletion requirements, the correct and complete way to delete the data is a row- or column-level deletion combined with REORG TABLE ... 
APPLY (PURGE) to physically rewrite the Parquet files without the deleted data, followed by VACUUM once the retention threshold has passed. Marking the data as deleted in the transaction log would not suffice.</p><h3>Data discovery and audits</h3><p>Data discovery is a hard problem, and it becomes harder as the platform grows. Part of the difficulty comes from missing or incomplete data lineage.</p><p>When a number in a board-level dashboard is wrong, the first question is always “where does this come from?” Without end-to-end data lineage, that question takes days to answer. With lineage, it is a simple query of the tables storing the lineage relationships.</p><p>Unity Catalog captures column-level lineage automatically for SQL operations on Databricks: a Gold table metric can be traced back to the Silver layer’s transformations and back to the Bronze layer’s ingestion, with every intermediate table and column visible in the lineage graph. No additional steps or configuration are required for setting up lineage for SQL workloads.</p><blockquote><em>Python and Spark DataFrame operations require the appropriate writing patterns (using </em><em>spark.sql() or Delta Lake merge operations) to appear in the lineage graph.</em></blockquote><p>What a catalogue is actually for, versus what people think it is for: most teams treat the data catalogue as a documentation project, a place to write descriptions and tag tables. That is the wrong framing. A catalogue is a runtime artefact. It reflects the live state of the data platform: what tables exist, what they contain, who owns them, when they were last updated, and how they relate to each other. Documentation that lives separately from the pipeline inevitably falls behind the pipeline. 
Descriptions written at data product creation time, as part of the schema definition, stay current because they are versioned alongside the schema.</p><p>Tagging and classification tiers work best when they are mandatory at data product registration time, not optional at consumption time. A sensitivity tag added when a table is first defined is almost always more accurate than one added retroactively when a compliance audit surfaces it.</p><h3>Data observability</h3><p>Data observability covers five properties: freshness, volume, distribution, schema, and lineage.</p><p><strong>Freshness:</strong> is the data as recent as defined in the SLA? On Delta Lake, DESCRIBE HISTORY provides the timestamp of the most recent write operation, making freshness checks queryable without external tooling:</p><pre>SELECT timestamp, operation, operationParameters<br>FROM (DESCRIBE HISTORY silver.transactions)<br>WHERE operation = &#39;WRITE&#39;<br>ORDER BY timestamp DESC<br>LIMIT 1;</pre><p><strong>Volume:</strong> are row counts within expected bounds? Are we recording the percentage deviation from the average row count over the past 30 days? Delta Lake table history also records the number of files added and removed per operation, giving a write-time volume signal without a full table scan.</p><p><strong>Distribution:</strong> are column value distributions stable? A shift in the distribution of a key dimension is often the first signal of an upstream data quality issue.</p><p><strong>Schema:</strong> did the schema change without a corresponding contract update? 
Delta Lake’s transaction log records every schema change with a timestamp, making it auditable by comparing the current schema against a prior schema version:</p><pre>from delta.tables import DeltaTable<br><br># `spark` is the active SparkSession (predefined in Databricks notebooks)<br>dt = DeltaTable.forName(spark, &quot;silver.transactions&quot;)<br># Last 10 commits, newest first; schema-changing operations show up here<br>history = dt.history(10)</pre><p><strong>Lineage:</strong> can you trace a value in a Gold table back to its source system?</p><p>The ownership question matters more than the tooling. A team that owns their data product owns the observability signals for that product.</p><ul><li>They set the expected volume numbers.</li><li>They define what “freshness” means for their SLA.</li><li>They are paged when the distribution shifts.</li></ul><p>By making observability declarative, we built it into pipelines rather than leaving it as a task to be performed later. Expected row count ranges, null rate thresholds, and distribution bounds are defined in the data product specification alongside the schema. The pipeline validates against them at each layer. Violations block promotion from Bronze to Silver. This is not a separate observability system; it is the pipeline enforcing the contract at runtime.</p><h3>Semantic layer</h3><blockquote><em>One governed definition per metric. This is the principle that most data teams agree with consistently.</em></blockquote><p>A semantic layer sits between the physical data model and the consumer, whether that consumer is a BI tool, an analyst running SQL, or an AI agent generating queries. It defines what “revenue” means, what “active user” means, what time zone “today” refers to. One definition, one place, applied consistently.</p><p>Without it, different teams build their own definitions in their own tools. The organisation ends up with five different revenue numbers depending on which dashboard you open. This is not a data quality problem; it is a governance problem. 
The data is correct; the definitions are inconsistent.</p><p>The semantic layer is also where the connection to AI readiness becomes concrete. An AI agent generating SQL against your lakehouse will use whatever metric definitions it can infer from column names and table descriptions. If those definitions are inconsistent, the agent produces inconsistent results.</p><h3>AI-readiness</h3><p>An LLM querying your lakehouse, whether through a natural language query interface, a RAG pipeline, or a custom agent, depends entirely on the quality of your metadata. Column descriptions that say “flag” or “id” are useless to a language model. Column descriptions that say “binary indicator set to 1 when the account has had an active subscription within the last 30 days” are genuinely useful. The difference between a Genie Space that returns accurate results and one that hallucinates is largely determined by the quality of the column-level annotations in Unity Catalog.</p><p>AI-readiness means:</p><ul><li>descriptions are written for a language model, not a data engineer.</li><li>embeddings exist for columns and tables so that semantic search over the data catalogue works.</li><li>agent-accessible metadata (what tables exist, what they contain, how they relate) is current and accurate. This is possible if it is maintained at data product definition time rather than retroactively.</li></ul><p>A lakehouse without this is not AI-ready. Point an LLM at undocumented tables with opaque column names and you will get confident, plausible, wrong answers. 
The data quality problem that used to surface as a wrong number in a dashboard now surfaces as a confident hallucination in an AI system that business users trust more, not less, than the dashboard it replaced.</p><h3>Walking through one example end to end</h3><p>Follow a single transaction entity from source through to the reporting layer to see how naming, contracts, lineage, and quality checks touch the same record at different points.</p><p><strong>Bronze:</strong> the transaction arrives from the source system as raw JSON and is appended, exactly as received, to the Bronze table with an ingested_at timestamp. No transformation. The schema is documented; the grain is one event per row.</p><p><strong>Silver:</strong> the transaction is cleaned, typed, deduplicated, and conformed to the platform naming convention: txn_id, account_sk, transaction_amount_usd, transaction_ts. The Silver contract specifies grain (one transaction per txn_id), nullable columns, and a freshness SLA of T+4 hours. A row count check validates that Silver contains within 1% of the expected Bronze volume. A null check validates that account_sk is never null.</p><p><strong>Gold:</strong> the transaction is aggregated to the account-day grain: daily_account_revenue, transaction_count, avg_transaction_value. The Gold contract specifies that &quot;revenue&quot; means sum(transaction_amount_usd) where transaction_status = &#39;settled&#39;. This definition is registered in the semantic layer. Every downstream consumer, such as a BI tool, an analyst SQL query or an AI agent, uses the same definition.</p><p><strong>Lineage:</strong> a Unity Catalog lineage trace from the Gold metric back to Bronze is available. 
When a reporting discrepancy surfaces, the trace takes minutes, not days.</p><blockquote><em>This is what governance built into the architecture looks like, as opposed to governance implemented afterward.</em></blockquote><h3>Ungoverned data platforms</h3><p>These are some of the problems that arise in a data platform without governance:</p><ul><li><strong>Definition drift</strong>: the same metric calculated differently by different teams, diverging silently over months</li><li><strong>Undocumented PII</strong>: personal data in columns that are not tagged, not masked, and not audited; a compliance exposure that nobody knows about until it matters</li><li><strong>Broken lineage</strong>: a pipeline was refactored six months ago but the lineage graph was never updated, so the dependency map is wrong</li><li><strong>Dashboard divergence</strong>: two dashboards showing different numbers for the same metric, both technically correct by their own definitions, neither trustworthy</li><li><strong>Small-file proliferation</strong>: hundreds of thousands of tiny Parquet files accumulating in a Bronze table with no compaction policy, degrading query performance progressively until someone notices.</li></ul><h3>Conclusion</h3><p>Based on past experience, implementing the items below up front simplifies your data platform and saves time and effort down the line:</p><ul><li>Defining the data contract structure before any team writes their first pipeline.</li><li>Agreeing on the naming conventions before any table is created.</li><li>Making column descriptions mandatory at schema definition time.</li><li>Instrumenting lineage from the start.</li></ul><blockquote><em>The hardest part of building a governed lakehouse is not the tooling. It is the organisational discipline to maintain standards as the number of teams, tables, and consumers grows. 
Platform guardrails help, but they do not substitute for teams that genuinely own their data products.</em></blockquote><p>Governance is an engineering discipline: built into the architecture, maintained by the teams that own the data, and enforced by the platform.</p><h3>Further reading</h3><p>For practitioners who want to go deeper on the topics covered in this post:</p><p><strong>Books</strong></p><p><a href="https://www.databricks.com/resources/ebook/data-lakehouse-for-dummies">The Data Lakehouse for Dummies, 2nd Databricks Special Edition</a> by Ari Kaplan and Amit Kara.</p><p><a href="https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/">Fundamentals of Data Engineering</a> by Joe Reis and Matt Housley.</p><p><a href="https://www.oreilly.com/library/view/data-management-at/9781492054771/">Data Management at Scale</a> by Piethein Strengholt.</p><p><strong>Official documentation and reference material</strong></p><p><a href="https://docs.delta.io/latest/index.html">Delta Lake Documentation</a></p><p><a href="https://docs.databricks.com/en/data-governance/unity-catalog/index.html">Unity Catalog Documentation</a></p><p><a href="https://iceberg.apache.org/docs/latest/">Apache Iceberg Documentation</a></p><p><strong>Blog posts and articles</strong></p><p><a href="https://www.databricks.com/blog/category/engineering">Databricks Engineering Blog</a>: Practitioner posts from the Databricks engineering team on Delta Lake internals, Unity Catalog features, Spark optimisation, and MLOps. The posts on Delta Lake deletion vectors, liquid clustering, and Unity Catalog lineage are particularly relevant to this post’s topics.</p><p><a href="https://dataintensive.net/">Martin Kleppmann, Designing Data-Intensive Applications (selected chapters)</a>: Please note, this is not a lakehouse book specifically, but Chapter 3 (storage engines) and Chapter 11 (stream processing) provide the systems foundation that makes Delta Lake’s design choices legible. 
Strongly recommended for anyone working at the storage and streaming layer.</p><p><a href="https://www.datacouncil.ai/talks">Data Council and Subsurface Conference Talks</a>: Practitioner talks from data engineers at companies operating lakehouses at scale.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Experimental Notebooks to Production: A Data Engineer’s perspective of Scaling Data Science…]]></title>
            <link>https://medium.com/@harishkrao/from-experimental-notebooks-to-production-a-data-engineers-perspective-of-scaling-data-science-6ca109a5ec1b?source=rss-5dfb120ae725------2</link>
            <guid isPermaLink="false">https://medium.com/p/6ca109a5ec1b</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[jupyter-notebook]]></category>
            <category><![CDATA[semantic-search]]></category>
            <dc:creator><![CDATA[Harish K Rao]]></dc:creator>
            <pubDate>Wed, 20 May 2026 01:59:47 GMT</pubDate>
            <atom:updated>2026-05-20T01:59:47.949Z</atom:updated>
            <content:encoded><![CDATA[<h3>From Experimental Notebooks to Production: A Data Engineer’s perspective of Scaling Data Science Applications</h3><p><em>Originally published at </em><a href="https://harishkesavarao.github.io/blog/2025/deploying-ds-apps/"><em>harishkesavarao.github.io</em></a></p><p>A Data Scientist’s notebook optimised for exploratory work on a small sample will often need significant rework before it can run reliably at production scale. This reflects a natural difference in priorities between Data Engineering (DE) and Data Science (DS): notebooks favour iteration speed, while production pipelines prioritise reliability, scalability, and cost efficiency. Bridging that gap requires both disciplines to have a working understanding of each other’s constraints.</p><p>Recently, I partnered with Data Scientists to scale a Semantic Search and Theming application from an exploratory notebook to a production data pipeline. This post describes the journey: what I had to learn, what we had to align on, and how the code evolved from an experimental notebook to a production pipeline.</p><h3>Learning Data Science concepts</h3><p>Over the past few years, Data Engineering’s primary use case has evolved from pure analytics for technical and business decision makers to a combination of analytics and machine learning model support for Data Scientists and data analysts.</p><p>This shift requires a Data Engineer to gain a deep understanding of what they are building. 
For semantic search and theming specifically, the questions I needed to answer before I could contribute meaningfully were:</p><ul><li>What does semantic search actually do, and how is it different from keyword search?</li><li>What is an embedding, and why does chunking large text documents before embedding matter?</li><li>What is the difference between cosine similarity and dot product search, and what does that mean for how you store and query vectors?</li><li>Which embedding models are appropriate for the kinds of text being processed?</li></ul><p>These are not questions a Data Engineer needs to answer at a research level. But understanding them well enough to make architecture decisions (which vector storage to use, how to design a chunking pipeline, which compute type is right for generating embeddings at scale) is a prerequisite for ensuring that the application works as intended for the user count and data volume in production.</p><p>A Data Engineer’s job in this collaboration is to educate Data Scientists on the nuances of production data engineering with regard to their use cases, agree on trade-offs and roadmap priorities together, and build the infrastructure that will scale their Search and Theming logic.</p><h3>Understanding business value</h3><p>Before any technical work, both Data Engineers and Data Scientists need alignment on three questions:</p><ul><li>Who are the customers, and what are they trying to do with the data?</li><li>What problem are we actually solving, not just technologically, but in terms of business value for the customers?</li><li>What does “good enough” look like for the first version?</li></ul><p>Once we had answers to these questions, we used them to formulate a problem statement. A closer look at the problem statement often reveals that it can be decomposed into smaller, prioritised use cases with clearer success criteria. 
Getting early and frequent feedback from both Data Scientists and business users, before the pipeline is complete, reduces the risk of building something that no one ends up using.</p><p>It is also possible that other users or teams are trying to solve the same technical problems or unlock the same kind of business value. Identifying them, sharing lessons with them, and reducing redundant work so that components can be reused is a valuable exercise for an organization.</p><p>For semantic search and theming, the business question was: can a user describe what they are looking for in natural language and get relevant and accurate results back, without needing to know the exact terminology used in the underlying data?</p><h3>What the Data Scientist needed vs. what production required</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/963/1*br0r66y4voRYEqxuMFoWbQ.png" /></figure><h3>Features and scalability</h3><h4>The standard pipeline structure</h4><p>For any production data pipeline, whether analytics or ML, these four steps apply:</p><ol><li><strong>Ingest</strong> data at an agreed latency (batch, micro-batch, or streaming)</li><li><strong>Transform</strong> data to meet the use case’s needs</li><li><strong>Load</strong> into appropriate storage with read/write optimisation</li><li><strong>Validate</strong> data quality at multiple points in the pipeline</li></ol><p>For ML pipelines, a fifth step applies: <strong>serve</strong>, that is, making the output (embeddings, features, model outputs) available to downstream consumers reliably and at low latency.</p><h3>The notebook-to-pipeline gap</h3><p>When a Data Scientist’s notebook becomes a production workload, the following changes are almost always required:</p><p><strong>pandas → Spark</strong></p><p>The most common refactor. pandas operates on a single node. 
As data volume scales, this becomes a compute bottleneck. The refactor requires replacing pandas operations with Spark DataFrame equivalents and registering Python functions as UDFs for operations that cannot be expressed natively.</p><p>Notebook-style, single-node:</p><pre>import pandas as pd<br><br>df = pd.read_csv(&quot;dataset.csv&quot;)<br>results = df[df[&quot;category&quot;] == &quot;billing&quot;][&quot;description&quot;].apply(embed_text)<br>print(results.head())</pre><p>Production-ready, Spark:</p><pre>from pyspark.sql import functions as F<br>from pyspark.sql.types import ArrayType, FloatType<br><br># embed_text is the same Python function the notebook uses<br>@F.udf(returnType=ArrayType(FloatType()))<br>def embed_text_udf(text: str):<br>    return embed_text(text)<br><br>df = (<br>    spark.table(&quot;dataset&quot;)<br>    .filter(F.col(&quot;category&quot;) == &quot;billing&quot;)<br>    .withColumn(&quot;embedding&quot;, embed_text_udf(F.col(&quot;description&quot;)))<br>)<br><br># Write directly; never collect() in production<br>df.write.format(&quot;delta&quot;).mode(&quot;overwrite&quot;).save(&quot;/mnt/embeddings/billing&quot;)</pre><p>Two things to note in the production version: the UDF wraps the embedding function so it runs distributed across the cluster, and the result is written directly to Delta Lake rather than collected to the driver. Collecting a large embedding dataset to the Spark driver is one of the most common causes of out-of-memory failures in production.</p><p><strong>Removing or converting print/display/show statements</strong></p><p>Every print(), display(), or show() call on DataFrame results triggers a Spark action and brings data back to the driver. In a notebook, this is fine because the dataset is small. 
In production, it is expensive and in some cases causes driver OOM errors.</p><p>Replace with logging:</p><pre>import logging<br><br>logger = logging.getLogger(__name__)<br><br># Instead of: print(df.count())<br>row_count = df.count()<br>logger.info(f&quot;Processed {row_count} records in embedding pipeline&quot;)<br><br># Instead of: df.show()<br># Log a pre-calculated summary, not the raw data<br>logger.info(f&quot;Embedding pipeline complete. Output schema: {df.schema}&quot;)</pre><p>Logging statements should display pre-calculated results or perform minimal actions on DataFrames. Avoid triggering new Spark actions inside logging calls.</p><p><strong>Data chunking</strong></p><p>As data volumes grow, processing everything in a single run becomes impractical, both for compute cost and for pipeline resumability. Chunk by date or another appropriate attribute:</p><pre>from datetime import date, timedelta<br><br>from pyspark.sql import functions as F<br><br># Yield half-open [start, end) date windows of at most chunk_days days<br>def get_date_chunks(start_date: date, end_date: date, chunk_days: int = 7):<br>    current = start_date<br>    while current &lt; end_date:<br>        chunk_end = min(current + timedelta(days=chunk_days), end_date)<br>        yield current, chunk_end<br>        current = chunk_end<br><br># start_date, end_date and process_and_write() are defined elsewhere in the pipeline<br>for chunk_start, chunk_end in get_date_chunks(start_date, end_date):<br>    chunk_df = (<br>        spark.table(&quot;dataset&quot;)<br>        .filter(<br>            (F.col(&quot;created_date&quot;) &gt;= chunk_start) &amp;<br>            (F.col(&quot;created_date&quot;) &lt; chunk_end)<br>        )<br>    )<br>    process_and_write(chunk_df, chunk_start, chunk_end)</pre><p>Chunking also makes pipeline failures recoverable: a failed chunk can be reprocessed without rerunning the entire historical load.</p><p><strong>Statistics and profiling</strong></p><p>Data profiling operations (row counts, null checks, distribution summaries) are useful during development but expensive at scale. 
Make them optional or conditional:</p><pre>ENABLE_PROFILING = False  # set via config or environment variable<br><br>if ENABLE_PROFILING:<br>    logger.info(f&quot;Null count in description: {df.filter(F.col(&#39;description&#39;).isNull()).count()}&quot;)<br>    logger.info(f&quot;Category distribution: {df.groupBy(&#39;category&#39;).count().collect()}&quot;)</pre><h3>API reliability</h3><p>ML pipelines typically call external APIs: embedding model endpoints, LLM APIs, enrichment services. Notebooks call these once per cell and fail fast. Production pipelines need to handle transient failures gracefully.</p><p>Implement retries with exponential backoff and explicit handling for rate limit responses:</p><pre>import requests<br>from requests.adapters import HTTPAdapter<br>from urllib3.util.retry import Retry<br><br>def get_session_with_retries(<br>    retries: int = 3,<br>    backoff_factor: float = 2.0,<br>    statuses: tuple = (429, 500, 502, 503, 504),<br>) -&gt; requests.Session:<br>    session = requests.Session()<br>    retry = Retry(<br>        total=retries,<br>        backoff_factor=backoff_factor,<br>        status_forcelist=statuses,<br>        allowed_methods=frozenset({&quot;POST&quot;}),  # POST is not retried by default (urllib3 1.26+)<br>        respect_retry_after_header=True,  # honours 429 Retry-After headers<br>    )<br>    adapter = HTTPAdapter(max_retries=retry)<br>    session.mount(&quot;https://&quot;, adapter)<br>    return session<br><br><br>def embedding_api(text: str, session: requests.Session) -&gt; list[float]:<br>    response = session.post(<br>        &quot;https://your-embedding-endpoint/embed&quot;,<br>        json={&quot;text&quot;: text},<br>        timeout=30,<br>    )<br>    response.raise_for_status()<br>    result = response.json()<br><br>    # Validate result structure before returning<br>    if &quot;embedding&quot; not in result or not isinstance(result[&quot;embedding&quot;], list):<br>        raise ValueError(f&quot;Unexpected API response structure: {result.keys()}&quot;)<br><br>    return result[&quot;embedding&quot;]</pre><p>Three details worth calling out: allowed_methods includes POST because urllib3 does not retry POST requests on error status codes by default, and an embedding call is safe to repeat; respect_retry_after_header=True honours the Retry-After header that rate-limited APIs return (common with OpenAI, Anthropic, and many embedding endpoints); and the result validation after raise_for_status() catches cases where the API returns 200 with malformed output, which happens more often than it should in production.</p><h3>The pipeline architecture</h3><p>The journey from notebook to production pipeline follows a consistent path:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/513/1*qAKLeMXpSJz28b2g3lEgww.png" /></figure><p>The collaboration between Data Engineers and Data Scientists is iterative, not a single handoff. Data Scientists validate that the refactored pipeline produces the same results as the notebook. Data Engineers validate that the pipeline meets production reliability and cost requirements. Several iterations are normal before both criteria are satisfied simultaneously.</p><h3>Conclusion</h3><p>The hardest part of productionising a Data Science application is not the technical refactoring. It is aligning on what “production-ready” means between two disciplines that define the term differently.</p><p>For a Data Scientist, production-ready means the model works correctly. For a Data Engineer, production-ready means the pipeline runs reliably at scale, fails gracefully, and can be maintained by someone other than the person who built it. Both definitions are correct. Getting to a version that satisfies both requires trade-offs and a few rounds of discussion; the table in this post is a useful starting point for that conversation.</p><p>Data Engineers and Data Scientists solve complementary problems. The notebook provides the logic and the validated approach. The production pipeline provides the infrastructure that makes that approach work at scale. 
Neither is complete without the other.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[2015, a memoir]]></title>
            <link>https://medium.com/@harishkrao/2015-a-memoir-9e5a718d4aa9?source=rss-5dfb120ae725------2</link>
            <guid isPermaLink="false">https://medium.com/p/9e5a718d4aa9</guid>
            <category><![CDATA[2015]]></category>
            <category><![CDATA[2016]]></category>
            <category><![CDATA[new-year]]></category>
            <dc:creator><![CDATA[Harish K Rao]]></dc:creator>
            <pubDate>Thu, 31 Dec 2015 20:49:38 GMT</pubDate>
            <atom:updated>2016-01-07T17:45:02.386Z</atom:updated>
            <content:encoded><![CDATA[<p>The year was quite eventful in some respects, just the same as previous years in others.</p><h4>Looking ahead…</h4><p>Every new year gives us an opportunity for a fresh, promising perspective on existing aspects of our lives: career, personal goals, habit building and the world around us overall.</p><p>I think I can sum up my 2015 with the words grateful, enduring, thought-provoking and maturing.</p><p>While technology made great strides in 2015, what lies ahead of us in 2016 and beyond is more exciting than ever.</p><h4>Space travel</h4><p>SpaceX achieved a milestone by getting a launched rocket back to the Earth. That, in itself, drastically reduces the cost of putting man and material in space. I will not be surprised if space tourism becomes mainstream before the middle of this century. More than making space travel possible, learning more about how humans adapt to it physically and psychologically is key. Making space travel safer and more reliable is extremely important for it to develop as an industry and for people to spend money on it.</p><p>It took the better part of a century for us to master air travel. Space travel may not take that long, but we are, for the first time, looking beyond the Earth. That is literally uncharted territory. It will be interesting to wait and see what unfolds in 2016 and beyond.</p><h4>Technology</h4><p>Drones have become more commonplace than ever before. I saw drones being used to create a timeline video of the entire construction of a new office space. Drones saved lives by surveying inaccessible locations during floods and other disasters.</p><p>Hover-boards are interesting too. 
Once the manufacturers sort out the explosion and fire problem and make the devices reasonably safe, I think they will replace or complement Segways and help people who have a lot of walking to do as part of their jobs, such as in warehouses or parks.</p><h4>Connected living</h4><p>An aisle at Best Buy features a host of gadgets to control the home with a smartphone/tablet. This, I would think, is just the beginning of connecting every conceivable electronic item in your home to the smartphone. Life will certainly be different with the home constantly surrounded by wi-fi enabled devices. We still have a long way to go before mainstream adoption. Time will tell.</p><h4>Nature</h4><p>2015, especially the second half, proved to be very different from other years for most parts of the world in terms of weather. Unprecedented warm temperatures, snowfall, rain, floods, heat waves, etc. shattered weather records in multiple continents.</p><p>We did see a significant step towards combating climate change with the agreement in France this year. Pollution in some of the most densely populated cities in the world is one reason to cut down on carbon emissions, even with the ongoing debate about the impact of humans on climate change. Beijing and New Delhi especially presented a grim picture of how bad pollution can get.</p><h4>Information</h4><p>In 2015, people were able to receive and share information much faster than before. Thanks to Facebook and Twitter, my home city was able to get back on its feet after devastating and unprecedented floods and cyclones. Volunteers, relief work, rescue operations, etc. were quickly coordinated using Facebook, WhatsApp and Twitter. 
For the first time, people commented that Facebook was not just for sharing content about the fun times we all have; Facebook saves lives too.</p><h3>Closing the year 2015, welcoming 2016…</h3><p>Thinking back on 2015, I, just like all of you, have a lot to learn from, reflect upon and grow out of, while continuing what we have all been doing. It may well be just a landmark in a calendar system, but the New Year can be an opportunity for us to reflect, refine and reform, and to get better at what we all do.</p><p>Good luck and best wishes for a Happy and Joyous 2016!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[First on Medium]]></title>
            <link>https://medium.com/@harishkrao/first-on-medium-a22a5cf62b14?source=rss-5dfb120ae725------2</link>
            <guid isPermaLink="false">https://medium.com/p/a22a5cf62b14</guid>
            <category><![CDATA[medium]]></category>
            <category><![CDATA[writing]]></category>
            <dc:creator><![CDATA[Harish K Rao]]></dc:creator>
            <pubDate>Wed, 14 Oct 2015 04:06:29 GMT</pubDate>
            <atom:updated>2015-10-14T17:37:25.217Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RpFDHPLZ2MBaIR9X0Hv1WA.jpeg" /></figure><p>The English language always fascinates me. More than anything else, I love to read: news, stories from business leaders and successful entrepreneurs, and so on. I have always wanted to write, and I did for quite some time in 2008. After that, time got the better of me.</p><p>I stumbled upon Medium while reading an article by one of the most renowned names in Carnatic Music, <a href="https://medium.com/u/57f9bc38618c">Sanjay Subrahmanyan</a>. I started liking Medium’s theme, its structure, its tags and, most importantly, the people writing on Medium.</p><p>I hope to continue the streak of writing and, in the process, interact with the wonderful people here.</p>]]></content:encoded>
        </item>
    </channel>
</rss>