Using Databricks Notebooks for Production Data Pipelines

Ryan Chynoweth
DBSQL SME Engineering
10 min read · Mar 5, 2024


Introduction

Have you heard that notebooks do not belong in production? There is a belief that production notebooks should never see the light of day. This blog aims to explore the other side of the discussion, presenting reasons why utilizing notebooks in production is not only a sensible choice, but a better choice for many data developers.

Popular articles will reference the downfalls of running production notebooks and dismiss them as only good for ad hoc work and demos. However, I would like to focus on a quote from one such article that I fully agree with:

Data scientists do not really understand the concerns of professional software developers such as automated, reproducible, and auditable builds, or the need and process of thorough testing, or the importance of good design in making codebases supportable and flexible. In turn, many software developers do not really understand what data scientists are doing.

Over the last decade, the interaction between these two distinct personas has significantly increased, primarily due to the growing deployment of Machine Learning solutions. While it might be tempting for a software engineer to overhaul data science code to make it ‘production-ready,’ this approach does not bridge the gap effectively. Instead of labeling data scientists’ code as subpar, a more constructive solution is to guide them in developing production-level code within notebooks, using their preferred IDE.

For reference, I encourage you to check out this article from Databricks’ official blog page.

What is a Databricks Notebook?

Databricks Notebooks are simply Python files in version control. Let’s take a quick look at the source code of a basic Databricks notebook stored in a git repository, version controlled as a Python file named sample_notebook.py:

# Databricks notebook source

# MAGIC %md
# MAGIC # Heading 1 in Markdown

a = 1
b = 2

# COMMAND ----------

c = a + b

# COMMAND ----------

print(c)

Looking at the example above, can you tell the difference between a notebook and a Python file saved to a repository? Obviously, there is a # Databricks notebook source comment at the top, with # COMMAND ---------- separating the command cells. You may also be aware that Databricks notebooks allow for magic commands, enabling the use of different languages and markdown for added documentation; in that case each line is prefixed with # MAGIC. Note that the # operator is replaced with the appropriate comment notation for the selected default language of the notebook. Lastly, to simplify the user experience, notebooks instantiate and set up common developer requirements like the SparkSession.

To qualify my assertion that notebooks have a place in production, it is important to note that I’m not suggesting exclusive use of notebooks or claiming they are the best deployment method. Notebooks are just one tool in the toolbox of production workflows, and they facilitate fast development, promote reproducible high-quality data pipelines, and encourage code that is easy to maintain and read.

Data Teams

The Databricks platform provides persona-based UI experiences with unified governance and a shared compute engine, allowing for enhanced collaboration that streamlines production deployments. The sustained success of notebooks over the last decade has elevated data scientists to the forefront of enterprise software. Yet the question remains: what about notebooks for other data developers?

Organizations typically have a dedicated data platform team of engineers and architects who build and maintain the data platform and ensure accessibility for various consumers. These platform engineers tend to adhere to traditional software development practices: modularizing their code and following software engineering best practices, which helps reduce monolithic applications and decouple the core platform objects.

On the other side are the data consumers — individuals who focus on building solutions and extracting analytical insights. This group of domain data engineers, scientists, and analysts focuses on delivering tangible business value beyond data availability and is incentivized to get solutions done rather than to make sure the pipeline is fully optimized. While many domain teams follow similar practices to the core platform team, they often have a heavier reliance on SQL over other languages. Notebooks have a distinct advantage for SQL-based workflows over traditional SQL editors.

Ways to improve your production notebooks:

  • Notebooks as the entry point for your code: use Python imports to keep heavy logic in modules while retaining the interactive interface (see the sketch after this list).
  • Leverage markdown and command cell titles: markdown cells with headers show up in the Table of Contents, allowing easy navigation to different parts of your code and making it clear what each section of the process is doing.
  • Split code into many cells: especially with legacy SQL, data pipeline code can be a never-ending functional script with a ton of common table expressions. With notebook cells, users can easily separate sections of the code and leverage temporary views to make it more manageable, readable, and maintainable.
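
As a minimal sketch of what this can look like as notebook source (the pipelines.orders module and the raw.orders and curated.daily_orders tables are hypothetical names for illustration):

# Databricks notebook source
# MAGIC %md
# MAGIC # Daily Orders Pipeline
# MAGIC Entry-point notebook: heavy logic lives in importable modules, and the
# MAGIC markdown headers below show up in the Table of Contents for navigation.

# COMMAND ----------

# Hypothetical module kept alongside the notebook in the repo.
from pipelines.orders import transformations

# COMMAND ----------

# MAGIC %md
# MAGIC ## Load and transform

# COMMAND ----------

# The notebook stays a thin entry point; the transformation logic is imported.
df = transformations.enrich(spark.table("raw.orders"))

# COMMAND ----------

# MAGIC %md
# MAGIC ## Write curated output

# COMMAND ----------

df.write.mode("append").saveAsTable("curated.daily_orders")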

SQL Users

The amount of SQL code running in production is absurd (not in a bad way). Every organization that processes data uses SQL in its data pipelines. As data practitioners, we have all either written or inherited a 5,000-line stored procedure saved as a single script. Is this a great way to deploy production code? No, it is not. Monolithic SQL code is difficult to read, maintain, and understand. Therefore, stating that notebooks do not belong in production while accepting that these stored procedures are currently running in production ignores the reality that not all engineers are able to follow traditional software development practices. Notebooks give users an interactive and easy development environment that promotes modular, easily maintainable code and collaboration.

As a specific example, below is a view of what some stored procedures look like. Monolithic SQL stored procedures can become overwhelming very quickly, and the only documentation is in-line code comments, which often get jumbled with random legacy logic that is no longer required but that engineers decide to comment out instead of delete.

-- This stored procedure contains a series of CTEs that work together, creating a seemingly compact but intricate logic.

CREATE PROCEDURE GetEmployeeData
AS
BEGIN
    -- CTE 1: Fetch initial employee details
    WITH InitialEmployeeCTE AS (
        SELECT EmployeeID, FirstName, LastName, Department, Salary
        FROM Employees
        WHERE IsActive = 1
    ),

    -- CTE 2: Calculate average salary based on initial employee details
    AverageSalaryCTE AS (
        SELECT AVG(Salary) AS AverageSalary
        FROM InitialEmployeeCTE
    ),

    -- CTE 3: Generate a report using average salary
    ReportCTE AS (
        SELECT 'Employee Report' AS ReportType,
               EmployeeID,
               FirstName,
               LastName,
               Department,
               AverageSalary
        FROM InitialEmployeeCTE
        CROSS JOIN AverageSalaryCTE
    )

    -- This CTE is no longer needed as it references an old table;
    -- instead of just removing it, we will just comment it out,
    -- making it way harder to maintain this sproc

    -- CTE 3: Generate a report using average salary
    -- ReportCTE AS (
    --     SELECT 'Employee Report' AS ReportType,
    --            employee_id,
    --            first_name,
    --            last_name,
    --            department,
    --            average_salary
    --     FROM OLD_TABLE
    --     INNER JOIN AverageSalaryCTE
    -- )


    -- ... Additional CTEs and logic going on for thousands of lines
    -- ... An endless sea of SQL
    -- ... Going on FOREVER....
    -- ... Just when you think it's about to end, another CTE appears


    -- FINALLY! Produce a result set
    SELECT *
    FROM FinalDataset;
END;

Compare that to the same script in a notebook, which makes it much easier to section off portions of code and provide extremely detailed comments.

Example Databricks Notebook

Please note that I converted the CTEs to temporary views in a notebook to make a point about how organized and easy to read notebooks can be for SQL users. Databricks fully supports the use of common table expressions. For more information on stored procedure conversion to Databricks, check out my previous blog.
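
For illustration, here is a minimal sketch of what that converted notebook could look like as notebook source, reusing the hypothetical Employees table from the stored procedure above; the view names are made up.

# Databricks notebook source
# MAGIC %md
# MAGIC # Employee Report
# MAGIC Each command cell below replaces one CTE from the stored procedure with a temporary view.

# COMMAND ----------

# MAGIC %md
# MAGIC ## Step 1: Fetch initial employee details

# COMMAND ----------

# MAGIC %sql
# MAGIC CREATE OR REPLACE TEMPORARY VIEW initial_employees AS
# MAGIC SELECT EmployeeID, FirstName, LastName, Department, Salary
# MAGIC FROM Employees
# MAGIC WHERE IsActive = 1;

# COMMAND ----------

# MAGIC %md
# MAGIC ## Step 2: Calculate the average salary

# COMMAND ----------

# MAGIC %sql
# MAGIC CREATE OR REPLACE TEMPORARY VIEW average_salary AS
# MAGIC SELECT AVG(Salary) AS AverageSalary FROM initial_employees;

# COMMAND ----------

# MAGIC %md
# MAGIC ## Step 3: Generate the report

# COMMAND ----------

# MAGIC %sql
# MAGIC SELECT 'Employee Report' AS ReportType, e.*, a.AverageSalary
# MAGIC FROM initial_employees e
# MAGIC CROSS JOIN average_salary a;

Each intermediate view can be inspected cell by cell, and the markdown headers document the pipeline directly in the Table of Contents.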

Lastly, existing SQL code can even be modularized by using the Scala/Python spark.sql("<insert sql code here>") method to wrap SQL code and execute it from various classes.
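
As a rough sketch of that pattern (the class, method, table, and column names below are hypothetical), the SQL can live inside an importable Python class:

# A hypothetical wrapper class; the table and columns are made up for illustration.
class EmployeeReport:
    def __init__(self, spark):
        self.spark = spark

    def active_employees(self):
        # Wrapping the statement in spark.sql() lets notebooks, jobs, and tests reuse it.
        return self.spark.sql(
            """
            SELECT EmployeeID, FirstName, LastName, Department
            FROM Employees
            WHERE IsActive = 1
            """
        )

A notebook can then call display(EmployeeReport(spark).active_employees()), while a unit test can assert on the returned DataFrame.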

Counterpoints to Software Engineering View of Notebooks

The article referenced in the introduction describes specific downfalls of running production notebooks to support the idea that they are only good for ad hoc work and demos. Please note that the reference does not discuss Databricks Notebooks specifically, and it was written four years ago, so while it may have been very accurate in the past, it may not be so anymore. I am also cherry-picking a single post, but a simple Google search will show you that there are many blogs with this sentiment. I am not stating that the blog is incorrect but simply providing the counterpoints that align with my view of notebooks.

The article specifically lists the following issues with production notebooks:

  1. Inevitably, experimental code will get into production
  2. Notebooks are a security risk because they are “fully powered shell”
  3. Code Modularization and Testing is difficult
  4. Versioning
  5. Hidden state due to out of order execution
  6. Parameters

Inevitably, experimental code will get into production

This is always a danger, no matter which IDE or programming language you choose. Establishing processes such as code reviews, testing, and coding standards prevents this from happening. Often organizations will have data scientists develop the code, while machine learning engineers are responsible for its production deployment. So while the quality of the code is often left to the data scientist, it is the responsibility of the machine learning engineer to ensure it is production ready. Therefore, if there is bad code in production, do not blame the notebook or the data scientist, but the individual who approved the production deployment.

Notebooks are a security risk because they are “fully powered shell”

This simply does not apply to Databricks notebooks. The permissions the code is executed under are set at the cluster and user level. Databricks allows individuals to write and deploy code with specific access modes that restrict the types of operations that can be performed.

Additionally, Databricks runs access controls through Unity Catalog to ensure that data operations adhere to the established permission model, further enhancing the security posture of notebooks. This applies to all code that is executed under this framework, not just notebooks.
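
As a small, hedged example (assuming a Unity Catalog-enabled workspace; the catalog, schema, table, and group names are hypothetical), a notebook can only read a table if the executing principal has been granted access:

# Databricks notebook source
# MAGIC %sql
# MAGIC -- Run by the table owner or a privileged admin; object names are hypothetical.
# MAGIC GRANT SELECT ON TABLE main.hr.employees TO `data_engineers`;

# COMMAND ----------

# Any notebook running as a principal without this grant fails the read with a permission error.
df = spark.table("main.hr.employees")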

Code Modularization and Testing is difficult

I recommend treating Databricks notebooks as the entry point to the application. In Databricks, users can import libraries from Python files to modularize and improve automated testing. For instance, consider the following Python class saved in a file (in this example, PythonImports/import_me.py):

class import_me():

    def __init__(self, a):
        self.a = a

    def import_print_func(self, str):
        print(str)

Then, in a Databricks notebook or in another Python file, I can do the following to modularize:

# Databricks notebook source
from PythonImports.import_me import import_me

im = import_me('a')

im.import_print_func("Hello! This is how you do python imports using repos!")

# COMMAND ----------

Furthermore, if I want to perform tests I can follow a similar pattern. Let’s take a file called simple_math.py whose contents are shown below.

class SimpleMath:
    def multiply(self, a, b):
        return a * b

    def is_even(self, num):
        return num % 2 == 0

Then I can create another file, test_simple_math.py, which contains my pytest tests for the previous class. Please see below.

from simple_math import SimpleMath

def test_multiply():
    math = SimpleMath()
    result = math.multiply(3, 4)
    assert result == 12

def test_is_even():
    math = SimpleMath()
    assert math.is_even(6) == True
    assert math.is_even(7) == False

Then in a Databricks notebook, I can run the following code to actually execute the pytest.


# Databricks notebook source
# MAGIC %pip install pytest

# COMMAND ----------

import pytest
import sys

# Skip writing pyc files on a readonly filesystem.
sys.dont_write_bytecode = True

# Run pytest.
retcode = pytest.main(["test_simple_math.py", "-v", "-p", "no:cacheprovider"])

# Fail the cell execution if there are any test failures.
assert retcode == 0, "The pytest invocation failed. See the log for details."

# COMMAND ----------

Versioning

Databricks notebooks seamlessly integrate with git version control. They appear as Python files in git repositories, allowing for easy tracking of changes. For example, here is a link to a Databricks notebook in one of my personal repositories. Furthermore, Databricks provides a solution with Databricks Repos, enabling version control for entire projects. This capability ensures that changes to the notebooks can be tracked, rolled back if necessary, and collaborated upon efficiently.

Additionally, if you wish to deploy your Python files and classes as a package and/or wheel, this is fully supported. Simply add a release pipeline that builds the package from a release branch, install the resulting library on the cluster just as you would any other Python library, and use the notebook to orchestrate the task within your job.
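
A minimal sketch of that flow, assuming a hypothetical package named my_pipeline built from a release branch:

# setup.py -- a minimal packaging sketch for the modules the notebooks import.
# The package name and version are hypothetical.
from setuptools import setup, find_packages

setup(
    name="my_pipeline",
    version="0.1.0",
    packages=find_packages(),
)

The release pipeline builds the wheel (for example with python -m build), the wheel is installed on the cluster like any other Python library (cluster libraries or %pip install), and the production notebook simply imports the package and orchestrates the job task.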

Hidden state due to out of order execution

A common concern associated with notebooks is hidden state due to out-of-order execution. While it is true that individuals can execute notebook cells interactively in any order during development, leading to unexpected states of variables and data, this concern does not translate to a production environment. Production code is automated, and a notebook run as a job executes its cells sequentially from top to bottom, just like any other script. This eliminates the risk of hidden-state issues.

Parameters

Databricks Notebooks can easily be parameterized and changed at runtime using Widgets. Moreover, notebooks can leverage environment variables that can be set at deployment. This versatility helps make Databricks Notebooks suitable for a range of production scenarios.
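
For example, here is a minimal sketch of a parameterized notebook; the widget names, defaults, and environment variable are hypothetical:

# Databricks notebook source
# Define widgets with defaults; job tasks can override these values at runtime.
dbutils.widgets.text("catalog", "dev")
dbutils.widgets.text("run_date", "2024-03-05")

catalog = dbutils.widgets.get("catalog")
run_date = dbutils.widgets.get("run_date")

# COMMAND ----------

import os

# An environment variable set on the cluster or in the deployment configuration.
environment = os.environ.get("PIPELINE_ENV", "dev")

print(catalog, run_date, environment)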

Conclusion

To be clear, promoting poorly written and disorganized code to production is never a good idea. The IDE the code was developed in does not determine its quality. Poor code quality, processes, and communication are the reason why bad code ends up in production. As discussed in this article, notebooks present numerous benefits, and with the right practices, running notebooks in production can be a viable and efficient choice.

Databricks notebooks are an excellent entry point for your code and a great way to modernize legacy SQL pipelines. They provide a productive environment for interactive development and debugging, and they make production solutions easier to manage.

Edit — March 8, 2024: It is worth extending this blog with a note that I did write that “I’m not suggesting exclusive use of notebooks or claiming they are the best deployment method.” This blog targets an audience of “never notebooks” readers to show that notebooks can and should be used. However, if notebooks are not your IDE of choice, Databricks does offer the ability to use other IDEs; for example, the Visual Studio Code extension and Databricks Asset Bundles are excellent alternatives. With that being said, notebooks have a place in production. Engineers need to stop blaming notebooks for the poor code they put in production and hold themselves accountable for code quality by implementing many of the best practices discussed here.

Disclaimer: these are my own thoughts and opinions and not a reflection of my employer
