Massively Accelerated AI Migrations with Medallion Mesh

Franco Patano
DBSQL SME Engineering
6 min read · Aug 7, 2024

The opinions expressed in this blog are mine, crafted with a unique blend of wit and wisdom to entertain and advise. Any complaints, disagreements, or raised eyebrows should be directed at me. My employer, family, and friends bear no responsibility for my musings, ramblings, or rare bouts of brilliance. They have their own lives to live, comedies to laugh at, and dramas to eat popcorn to.

Migrations are hard, primarily due to the technical debt that has accumulated over years and sometimes decades, with many overlapping source systems from mergers and acquisitions. Most of the time they are under-estimated and over-promised, with unclear requirements that lead to scope creep. Costs can run into the millions, and timelines can stretch into years. The fact that IT and business partners must work together on a solution adds yet another layer of complexity. The death march these poor souls must endure is something I don't wish on any enemy. Databricks can offer impressive cost savings and enable AI use cases at the same time, but the multi-year, multi-million-dollar migration stands in the way. This is the future; shouldn't we have a better approach?

When discussing the current state of the data analytics landscape with CDOs and Architects, the common theme is the need for more transparency into lineage and utilization to develop a cohesive strategy. Most believe that only a portion of the data gets utilized, but which portion? If the business isn't using all of the transformed data in the legacy system, should we be migrating or lifting and shifting everything? Most leaders want to assess assets and inventory before modernizing. The desirable state is an approach flexible enough to let business units move at their own pace while ensuring data is available for modern use cases.

At the Data and AI Summit, I presented Medallion Mesh, a flexible, business-led approach to modernizing data architecture for AI. Instead of blindly migrating the sins of the past into the future, we use Lakehouse Federation to federate to legacy data warehouse systems. Unity Catalog handles authentication and access controls, and it tracks consumption down to the column level. Materialized Views reduce the stress on legacy systems by materializing the federated sources in the cloud. Then, when we are ready to migrate, we can move to low-latency, high-throughput incremental CDC from operational database sources with Databricks LakeFlow Connect.
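To make that concrete, here is a minimal sketch of the federation and materialization steps in Databricks SQL. The connection, catalog, schema, and table names (legacy_dw_conn, legacy_dw, sales.customer_orders, and so on) are placeholders for illustration, and the exact connection options depend on your source system.

```sql
-- Federate to the legacy warehouse (SQL Server in this sketch) with Lakehouse Federation.
CREATE CONNECTION IF NOT EXISTS legacy_dw_conn
  TYPE sqlserver
  OPTIONS (
    host 'legacy-dw.example.com',          -- placeholder host
    port '1433',
    user secret('migration', 'dw_user'),   -- credentials pulled from a secret scope
    password secret('migration', 'dw_password')
  );

-- Expose the legacy database as a foreign catalog governed by Unity Catalog.
CREATE FOREIGN CATALOG IF NOT EXISTS legacy_dw
  USING CONNECTION legacy_dw_conn
  OPTIONS (database 'EnterpriseDW');

-- Materialize a heavily used federated source in the cloud
-- to take read pressure off the legacy system.
CREATE MATERIALIZED VIEW main.bronze.customer_orders AS
SELECT *
FROM legacy_dw.sales.customer_orders;
```

From here, consumers query the foreign catalog or the materialized view through Unity Catalog, which is what makes the utilization tracking in the next step possible.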

This approach allows us to track the utilization of data sources to understand business demand. One customer found that 50% of their pipelines led to data no one was consuming. They were able to cut their migration timeline in half and also had a prioritized list of popular data sources to start with. With business buy-in and alignment, this enabled higher confidence in the migration outcome. Now that we have data to support the migration, can we leverage AI to speed up the process?
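We will get to the AI part in a moment; first, here is roughly what that utilization check can look like once federated reads flow through Unity Catalog. This is a sketch that assumes the lineage system tables are enabled and that the foreign catalog from the federation step is named legacy_dw; your column names and retention settings may differ.

```sql
-- Which federated tables is anyone actually reading, and how recently?
SELECT
  source_table_full_name,
  COUNT(*)                   AS read_events,
  COUNT(DISTINCT created_by) AS distinct_readers,
  MAX(event_time)            AS last_read
FROM system.access.table_lineage
WHERE source_table_catalog = 'legacy_dw'
  AND event_date >= date_sub(current_date(), 90)
GROUP BY source_table_full_name
ORDER BY read_events DESC;
```

Sources that never show up here are candidates to retire rather than migrate, and the ones at the top of the list become the prioritized backlog.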

Once we are ready to migrate the portion of the legacy enterprise data warehouse that the business actually uses, we can accelerate with automation and AI. When I discussed with various practitioners how they were leveraging AI, I often found they were trying to translate ETL and SQL between systems directly, with mixed results. I think this is because things get lost in translation. Anyone who speaks multiple languages can attest that there isn't a word for every word in every language. Except for German. At the Data and AI Summit 2024, I theorized and presented the concept of using metadata-driven frameworks as an intermediary. AI extracts metadata such as source-to-target mappings, transformation and business logic, and data quality rules. That metadata is mapped to a framework file, which then drives automation or AI to create a Workflow for that ETL process. My friend and collaborator Thomas Zhang implemented this strategy for SSIS and SQL Server and found that it accelerated the migration by up to 80%, making it faster and cheaper with high confidence.
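To illustrate the idea (not Thomas's actual implementation), a metadata-driven framework often boils down to a control table: each row captures the extracted source-to-target mapping, transformation logic, and quality rules for one pipeline, and the automation reads those rows to generate the corresponding Workflow. All names below are hypothetical.

```sql
-- Hypothetical control table holding the metadata extracted from the legacy ETL.
CREATE TABLE IF NOT EXISTS main.migration.framework_mappings (
  pipeline_name  STRING,  -- logical pipeline, e.g. the original SSIS package name
  source_object  STRING,  -- fully qualified source table in the federated catalog
  target_object  STRING,  -- target table in the lakehouse
  transformation STRING,  -- transformation / business logic expressed as SQL
  quality_rules  STRING,  -- data quality rules to enforce on load
  load_pattern   STRING   -- full, incremental, or cdc
);

-- One extracted mapping; automation or AI turns rows like this into Workflow tasks.
INSERT INTO main.migration.framework_mappings VALUES (
  'daily_sales_load',
  'legacy_dw.sales.orders',
  'main.silver.orders',
  'SELECT order_id, customer_id, CAST(order_ts AS TIMESTAMP) AS order_ts, amount FROM legacy_dw.sales.orders',
  'order_id IS NOT NULL AND amount >= 0',
  'incremental'
);
```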

The first step is the hardest, and that is never more true than with migrations. Yet the show must go on, so how will we handle changes on the new platform? In legacy enterprise data warehousing, CI/CD was an afterthought, often a collection of different products held together loosely with scripts and fraught with manual steps. In the modern era, point solutions spawned, like Git, Airflow, dbt, Liquibase, Azure DevOps, and AWS CodePipeline, to bring modern software development practices like agile and CI/CD into the warehouse world. In the software world, you are dependent on users and how they interact with your software. In the data world, you depend on the data models from those software solutions. Quite often, the data team has no idea changes are coming, like additional fields, data type changes, or restructuring of fields. I've heard people say, "We have five fields with similar names; the correct field is different depending on the time frame." Again, this is the modern era; is there a simpler way to manage changes in data while minimally disrupting the users' flow? Enter SQLMesh.

Change happens. What matters is how your data management system reacts to it. SQLMesh was developed by Tobiko, founded by industry veterans from Netflix, Apple, Airbnb, and Google, to deliver scalable and collaborative approaches to building and maintaining data pipelines. Two core pieces make SQLMesh stand out: its novel Virtual Data Environments, which use views as a virtual layer to minimize data movement and computation costs, and its ability to understand SQL semantically with SQLGlot, so it can determine whether a change is breaking or non-breaking for its downstream dependents. If you want a data platform that brings order to the chaos of the data ecosystem, Medallion Mesh with SQLMesh offers a powerful combination, akin to a master architect designing a structurally sound skyscraper that can withstand the test of time.
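For a feel of what this looks like in practice, here is a minimal SQLMesh model sketch (the model name and columns are made up). The MODEL block declares the name, materialization kind, and schedule; SQLMesh parses the SELECT with SQLGlot, so removing or renaming a column it knows is referenced downstream gets flagged as a breaking change, while non-breaking changes can be promoted virtually without rebuilding everything.

```sql
MODEL (
  name silver.orders,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column order_ts
  ),
  cron '@daily'
);

SELECT
  order_id,
  customer_id,
  CAST(amount AS DECIMAL(18, 2)) AS amount,
  order_ts
FROM bronze.orders
-- SQLMesh supplies the interval bounds for each incremental run
WHERE order_ts BETWEEN @start_ts AND @end_ts;
```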

The next blog in this series, in collaboration with Tobiko, will offer an opinionated approach to building an efficient, collaborative, and resilient data platform with Databricks SQL and Tobiko Cloud, leveraging the Medallion Mesh principles.


Franco Patano
DBSQL SME Engineering

I spend my time learning, practicing, and having fun with data in the cloud.