Converting Chained Stored Procedures to Databricks

Ryan Chynoweth
DBSQL SME Engineering
6 min read · Feb 22, 2024
SQL workflows on DBSQL!

Author: Ryan Chynoweth

Introduction

As the demand for more efficient data processing grows, so does the need to rethink our traditional methods. Recently, I was asked to provide more detail on how best to convert chained stored procedures to Databricks. In my Converting Stored Procedures to Databricks blog, I state that stored procedures “… should be converted to a task within a Databricks workflow”. Databricks Workflows is a robust orchestration product that allows users to build scalable pipelines supporting data engineering, data science, and data warehousing. Workflows are used to create jobs that can have many tasks.

Due to the availability of a native orchestration engine, the legacy method of chaining stored procedures is no longer required. When migrating stored procedures to Databricks, I encourage customers to remove the orchestration from the code and define it in Workflows instead. This gives users far more flexibility and observability, along with the features needed to build production-grade pipelines.

Practical Example

In the past, it was very common to create multiple SQL scripts and register each of them as a stored procedure object in a data warehouse. This approach aimed to encapsulate logic, promote code reusability, and streamline maintenance. However, it often led to challenges due to the dependent and black box nature of stored procedures.

Consider a scenario where a pipeline consists of three stored procedures: ingesting data from an external location into a staging table, merging data into a set of tables in a data model, and refreshing aggregation tables used for reporting. Each stored procedure relies on the successful execution of the previous one. To manage this dependency, users would typically create a fourth stored procedure, my_schema.parent_procedure, to orchestrate the execution sequence.
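As a rough sketch of one of these child procedures (the table and column names here are purely illustrative), the merge step could be a T-SQL procedure along these lines:

CREATE PROCEDURE my_schema.merge_procedure
(
    @var1 nvarchar(50) = 'default param value 1',
    @var2 nvarchar(50) = 'default param value 2'
)
AS
BEGIN
    -- Upsert staged rows into a dimension table, filtered by the supplied parameters
    MERGE my_schema.dim_customer AS target
    USING (
        SELECT * FROM my_schema.stg_customer
        WHERE region = @var1 AND source_system = @var2
    ) AS source
        ON target.customer_id = source.customer_id
    WHEN MATCHED THEN
        UPDATE SET customer_name = source.customer_name,
                   updated_at = GETDATE()
    WHEN NOT MATCHED THEN
        INSERT (customer_id, customer_name, updated_at)
        VALUES (source.customer_id, source.customer_name, GETDATE());
END
GO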

The parent_procedure definition may look like the following:

CREATE PROCEDURE my_schema.parent_procedure
(
    -- Parameters for the stored procedure, with default values
    @var1 nvarchar(50) = 'default param value 1',
    @var2 nvarchar(50) = 'default param value 2'
)
AS
BEGIN

    -- Each child procedure must complete before the next one runs
    exec my_schema.ingest_procedure @var1 = @var1, @var2 = @var2;

    exec my_schema.merge_procedure @var1 = @var1, @var2 = @var2;

    exec my_schema.aggregate_procedure @var1 = @var1, @var2 = @var2;

END
GO

While this method provided the ability to orchestrate code, it introduced complexities in scheduling, maintenance, and scalability. Data engineers would often find themselves making ad-hoc and unapproved changes in production to deal with the constant issues that would arise.

When migrating this pipeline to Databricks, each child stored procedure maps to a task within a workflow. Databricks Workflows supports the following task types: notebooks, Python scripts, Python wheels, JARs, Spark Submit, SQL files, queries, dbt, dashboard refreshes, other jobs, and Delta Live Tables.

For our use case, I will migrate each child procedure to a Databricks SQL notebook. Once the three notebooks are created, I can then create a Databricks workflow that looks like the following.

Databricks Workflow UI
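Each task in the workflow above simply runs one of the converted notebooks. As a rough sketch (with hypothetical catalog, table, and column names), the Databricks SQL version of the merge procedure might look like this, with the ${var1} and ${var2} references picking up task parameters described below:

-- Databricks SQL notebook for the merge task (names are illustrative only)
-- The orchestration now lives in the workflow, so the notebook contains only the transformation
MERGE INTO my_catalog.my_schema.dim_customer AS target
USING (
  SELECT * FROM my_catalog.my_schema.stg_customer
  WHERE region = '${var1}' AND source_system = '${var2}'
) AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET customer_name = source.customer_name,
             updated_at = current_timestamp()
WHEN NOT MATCHED THEN
  INSERT (customer_id, customer_name, updated_at)
  VALUES (source.customer_id, source.customer_name, current_timestamp());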

If I click on one of the tasks, the UI shows the task settings, which include a set of parameters like those in the previous SQL Server example. This allows users to define default parameter values in the job definition while still allowing runtime values to override them as needed.

Individual Task UI

Notice above that each of these SQL notebooks runs on a Databricks SQL warehouse instead of a job cluster. Users can take advantage of the excellent query concurrency offered by warehouses, simplified compute deployment, and the ability to run jobs on the same compute they used to develop the procedures. Additionally, while many people love SQL editors like SQL Server Management Studio or the SQL editor available in Databricks, notebooks are a great fit for SQL workloads. Databricks notebooks offer an interface that makes debugging, reading, and managing SQL code much easier, and they help you avoid those nasty 3,000-line stored procedures we are all guilty of creating and hate inheriting.

To schedule tasks on a SQL warehouse, simply click the compute drop-down and scroll to the section containing the available warehouse options. Please note that if you do not see this option, it should become available in the coming weeks as it rolls out to all customers.

Users can access widget values passed as parameters from Databricks Workflows to a SQL notebook or Databricks query by using the ${param_name} syntax. In the future, we will support the :param_name syntax used in the SQL Execution API endpoint.
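For example, a cell in the aggregation notebook might reference the same task parameters (again, the table and column names are hypothetical):

-- ${var1} and ${var2} are replaced with the task parameter values at run time
INSERT OVERWRITE my_catalog.my_schema.daily_sales_agg
SELECT order_date,
       region,
       SUM(amount) AS total_amount
FROM my_catalog.my_schema.fact_orders
WHERE region = '${var1}'
  AND source_system = '${var2}'
GROUP BY order_date, region;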

Some of the benefits of using Databricks workflows (and really any orchestration engine) over chaining stored procedures are:

  • Schedules and Triggers: Databricks offers a range of scheduling options, including a built-in CRON scheduler, file arrival triggers, continuous jobs, job triggers, and API-triggered executions, eliminating the need for external schedulers.
  • Isolated and Shared Compute: Databricks Workflows allow users to isolate tasks within a workload to ensure performance guarantees. Additionally, users can reuse and share compute resources across multiple tasks, streamlining resource management.
  • Git Integration: Workflows seamlessly integrate with code repositories, enabling CI/CD practices to ensure that the most up-to-date code is always executed.
  • Job Queue: If a user has a parameterized job that needs to be executed with 100 different parameter sets, they can submit them all at once and Databricks will automatically handle the backlog according to the job's concurrency settings.
  • Hard and Soft Timeouts: Users can set a soft timeout to be alerted when a job is running longer than expected, and a hard timeout to stop the job entirely.
  • Run If Dependencies: Workflows offer built-in support for running tasks based on dependencies, allowing users to execute tasks only when specified conditions are met.
  • Repair Run: If a job has many tasks and all the tasks succeed except for the last one, then users can fix the failed task and easily re-run the job which will only execute the failed or skipped tasks. In stored procedures, this would likely require making edits to production to avoid re-running successful tasks.
  • Built in Dynamic Values: Databricks Workflows provide built-in support for dynamic values such as task names, job start time, and job IDs, enhancing flexibility and automation capabilities.
  • Observability: Databricks offers comprehensive observability features, tracking the status and progress of each task, as well as providing data lineage information for related datasets and objects.

While observability is last in the list above, it is a critical feature that is missing from stored procedures. Databricks offers great observability and tracking; below is a screenshot of one of the views available in the user interface.

Databricks Jobs Run Page

Conclusion

As explored in this blog, stored procedures were once relied upon not only for transforming data, but also for scheduling and orchestrating data workloads. While the approach worked, its limitations often led to drawbacks and perpetuated outdated practices within many data organizations. For more insights, I encourage you to read Franco Patano's blog, "Top 10 reasons why your cloud data platform migration should leave stored procedures in the past with a perspective on the future".

Transitioning from coordinating code with stored procedures to Databricks Workflows represents a significant step towards modernizing data pipelines and embracing efficient code orchestration.

Disclaimer: these are my own thoughts and opinions and not a reflection of my employer
