Top 10 reasons why your cloud data platform migration should leave stored procedures in the past, with a perspective on the future

Franco Patano
DBSQL SME Engineering
5 min read · Jan 31, 2024

Stored procedures rose to prominence in the 2000s as a way to get better price-performance out of SQL warehouses. The SQL language is traditionally declarative, meaning you tell it what you want from the data and it figures out how to get it. Programming languages like Python, however, are imperative, meaning you spell out exactly what to do and how to do it at every step. Bringing procedural execution to SQL required bridging that gap between declarative and imperative by adding conditional control, loops, and error handling.

At the time, ingesting and transforming data from raw sources into the curated data models used for reports and dashboards required additional software and hardware in the form of ETL tools like Informatica, Talend, Ab Initio, and the like. When your data warehouse was on-premises, where capital was spent on hardware and software was licensed for unlimited use, the idea of using the same warehouse that served data in the presentation layer to also run the transformations made sense. Warehouse engines were not built to handle the complexity of those transformations and often performed poorly, but the money saved on ETL tools far exceeded the cost of that inefficiency, making a compelling business argument to avoid dedicated ETL tooling and maximize utilization of the warehouse. Before we get into the top 10 reasons why you should consider leaving these vestiges of the past behind in your migration, let's take a look at some alternatives.

https://docs.getdbt.com/blog/migrating-from-stored-procs

Redshift, Synapse/Fabric, BigQuery, and Snowflake all allow you to lift and shift your stored procedure code into the cloud, often producing cost curves that grow far faster than the underlying data, which is good for the vendor but bad for company budgets. One alternative I have seen lately is dbt, leveraging its robust model framework to shine some light on these black-box stored procedures. Along with the freebies dbt offers, like automatic documentation and lineage, this makes a compelling alternative to traditional stored procedure processing.

With Databricks Workflows, you can create individual tasks of ANSI-compliant SQL code that visibly separate the units of work, making them easier to understand and debug. With git supported within Workflows, version control and CI/CD are much simpler and more intuitive. Because the logic has been extracted from vendor-specific code into ANSI-compliant SQL, you also have better portability than you had in the past. And because a Workflow is expressed as a DAG (Directed Acyclic Graph), dependencies are much easier to reason about, while independent tasks can run concurrently for higher throughput. With these benefits, the overall complexity is lowered, allowing you to iterate faster, recover from failures more easily, and reduce your overall technical debt. With Workflows in Databricks, you can even call a dbt job if you want to leverage both modern solutions.
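To make that concrete, here is a minimal sketch of what a single Workflows SQL task might contain. The schema and table names (bronze.orders_raw, silver.orders) and the columns are hypothetical, not taken from any real pipeline; the point is that one visible, self-contained statement replaces one step that would otherwise be buried inside a monolithic procedure.

```sql
-- Hypothetical Workflows task: "upsert_silver_orders"
-- One ANSI-style MERGE as a single, visible unit of work in the job's DAG.
MERGE INTO silver.orders AS tgt
USING (
  SELECT order_id, customer_id, order_ts, amount
  FROM bronze.orders_raw
  WHERE ingest_date = current_date()
) AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
  tgt.customer_id = src.customer_id,
  tgt.order_ts    = src.order_ts,
  tgt.amount      = src.amount
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, order_ts, amount)
  VALUES (src.order_id, src.customer_id, src.order_ts, src.amount);
```

Downstream tasks that read silver.orders simply declare this task as a dependency in the DAG, so a failure surfaces at exactly this unit of work instead of somewhere in the middle of a thousand-line procedure.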

Without further ado, let's get into the top 10 reasons you should reconsider lifting and shifting your procedures into the cloud.

Cloud costs grow disproportionately to data volume due to inefficiencies

  • Cloud environments expose stored procedure inefficiencies, increasing costs under consumption-based pricing
  • Cloud vs. on-premises cost structures reveal stored procedures’ flaws, particularly the creation of temporary objects and redundant code, leading to costs that escalate with data growth and time

The separation of storage and compute diminishes many of the benefits of stored procedures

  • Stored procedures’ performance benefits came from the colocation of logic execution and data, but the separation of storage and compute in the cloud nullifies those benefits
  • Many of the patterns leveraged in procedures carry extra costs: using tempdb on-prem was fast, but in the cloud those temporary objects cost compute and storage and require cleanup, as sketched below
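As a hedged illustration (T-SQL-style syntax, with hypothetical sales and customers tables), here is the familiar stage-then-join pattern next to an equivalent single statement that avoids the intermediate object entirely.

```sql
-- Typical on-prem pattern (hypothetical tables): stage an aggregate into a temp
-- object, then join it back. Cheap on local tempdb, but in the cloud each
-- materialization is billed compute and storage, and has to be cleaned up.
SELECT customer_id, SUM(amount) AS total_amount
INTO #customer_totals
FROM sales
GROUP BY customer_id;

UPDATE c
SET c.lifetime_value = t.total_amount
FROM customers AS c
JOIN #customer_totals AS t ON t.customer_id = c.customer_id;

DROP TABLE #customer_totals;

-- Equivalent single statement with a CTE: no intermediate object to pay for
-- or clean up, and one unit of work the engine can optimize as a whole.
WITH customer_totals AS (
  SELECT customer_id, SUM(amount) AS total_amount
  FROM sales
  GROUP BY customer_id
)
UPDATE c
SET c.lifetime_value = t.total_amount
FROM customers AS c
JOIN customer_totals AS t ON t.customer_id = c.customer_id;
```

On local tempdb the first pattern was essentially free; under consumption-based cloud pricing, every materialization, re-read, and cleanup step shows up on the bill.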

Debugging is difficult and the logic is a black box

  • Stored procedures become harder to debug as they grow larger due to numerous temporary variables, tables, and table states
  • Frequent use of PRINT statements for tracking adds to the complexity
  • Recovering from mid-execution failures requires step-by-step execution, increasing complexity and time

Migration headaches with vendor lock-in syntax

  • Vendors have unique SQL extensions for imperative procedural logic, creating challenges in migrating to different vendors
  • Migrating requires significant code refactoring due to syntax and feature differences, which is time-consuming and costly, as the example below illustrates
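As a small, hedged example of how even trivial logic diverges across vendors (a hypothetical orders table, not tied to any real system), compare a T-SQL procedure with a PL/pgSQL counterpart.

```sql
-- SQL Server (T-SQL): hypothetical procedure flagging large orders
CREATE PROCEDURE dbo.flag_large_orders
AS
BEGIN
  IF EXISTS (SELECT 1 FROM dbo.orders WHERE amount > 10000)
    UPDATE dbo.orders SET is_large = 1 WHERE amount > 10000;
END;

-- PostgreSQL (PL/pgSQL): different procedural shell, delimiters, and boolean handling
CREATE OR REPLACE FUNCTION flag_large_orders() RETURNS void AS $$
BEGIN
  IF EXISTS (SELECT 1 FROM orders WHERE amount > 10000) THEN
    UPDATE orders SET is_large = true WHERE amount > 10000;
  END IF;
END;
$$ LANGUAGE plpgsql;
```

Neither version runs on the other engine without rewriting the procedural shell, and real procedures compound these differences across thousands of lines.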

Maintenance overhead of large, lengthy procedures

  • The prevalence of stored procedures leads to increased complexity as code grows from hundreds to thousands of lines
  • Reliance on consultants for patchwork solutions further compounds this complexity and defers the resolution of technical debt, postponing necessary long-term solutions

CI/CD with automated deployment pipelines is complex

  • Continuous Integration and Continuous Delivery (CI/CD) practices accelerate application development, but they are complicated or impeded when applied to stored procedures
  • Automation and version control of stored procedures are challenging due to their integration with the warehouse
  • Rolling back from failures in stored procedures involves multiple steps and scripts, adding complexity

Difficult to test and iterate

  • The tight coupling of code and data in the warehouse creates obstacles in testing and effective iteration
  • The necessity for different testing environments with production-like data increases complexity and incurs duplicative costs for maintaining multiple environments

Collaboration is challenging

  • Tools for creating/modifying stored procedures usually support only single users, lacking concurrent editing or validation capabilities
  • Manual validation post-execution creates bottlenecks in solution delivery
  • Stored procedure logic is isolated, with no automatic linkage to documentation, necessitating manual updates for documentation and lineage

Security risks

  • Stored procedures are susceptible to SQL injection attacks, especially when using dynamic SQL with user input parameters
  • Some vendors offer methods to mitigate this risk, such as parameterized dynamic SQL, but they depend on correct implementation by engineers, as the sketch below shows
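As a hedged, T-SQL-style sketch (hypothetical customers table and procedure names), here is the same lookup written with injectable string concatenation and again with parameterized execution.

```sql
-- Vulnerable: the parameter is concatenated into dynamic SQL, so input such as
--   '; DROP TABLE dbo.customers; --
-- is executed as code rather than treated as a value.
CREATE PROCEDURE dbo.get_customer_unsafe @name NVARCHAR(100)
AS
BEGIN
  DECLARE @sql NVARCHAR(MAX) =
    N'SELECT * FROM dbo.customers WHERE name = ''' + @name + N'''';
  EXEC (@sql);
END;
GO

-- Mitigated: sp_executesql passes the value as a typed parameter,
-- so it can never be interpreted as SQL text.
CREATE PROCEDURE dbo.get_customer_safe @name NVARCHAR(100)
AS
BEGIN
  EXEC sp_executesql
    N'SELECT * FROM dbo.customers WHERE name = @name',
    N'@name NVARCHAR(100)',
    @name = @name;
END;
GO
```

The mitigation only works if every engineer remembers to apply it to every dynamic statement, which is exactly the kind of discipline that erodes inside thousand-line procedures.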

Loops and cursors are anti-patterns in modern set-based processing

  • Loops and cursors in stored procedures process data row by row, which is slow and complex to manage
  • Set-based, distributed, and parallel processing are far more efficient for handling large data sets, as the comparison below shows
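As a hedged, T-SQL-style illustration (hypothetical orders table and tax rate), here is a cursor doing row-by-row work next to the single set-based statement that replaces it.

```sql
-- Row-by-row: a cursor fetches and updates one order at a time.
DECLARE @order_id INT, @amount DECIMAL(18, 2);
DECLARE order_cursor CURSOR FOR
  SELECT order_id, amount FROM dbo.orders WHERE status = 'open';
OPEN order_cursor;
FETCH NEXT FROM order_cursor INTO @order_id, @amount;
WHILE @@FETCH_STATUS = 0
BEGIN
  UPDATE dbo.orders SET tax = @amount * 0.07 WHERE order_id = @order_id;
  FETCH NEXT FROM order_cursor INTO @order_id, @amount;
END;
CLOSE order_cursor;
DEALLOCATE order_cursor;

-- Set-based: one statement the engine can plan, distribute, and parallelize.
UPDATE dbo.orders
SET tax = amount * 0.07
WHERE status = 'open';
```

The set-based version hands the optimizer a single operation over the whole set; the cursor forces the engine to execute one update per row, one row at a time.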

What do you think? Should we convert these monolithic black box stored procedures into glass boxes of visible units of work, or continue to use the sins of the past in our future data platform architectures?


Franco Patano
DBSQL SME Engineering

I spend my time learning, practicing, and having fun with data in the cloud.