Bricksflow: Databricks development made convenient

Jiří Koutný
Published in DataSentics
May 29, 2020 · 4 min read

For data engineers, analysts and scientists who used to work with on-premise big data technologies, Databricks seems like a miracle. Data pipelines that used to take hours now finish in a few minutes thanks to the almost unlimited scalability of the cloud. The online notebook-based IDE lets you develop code from anywhere, without installing anything on your local computer.

At DataSentics, we have been working with Databricks for more than three years. We started by building simple data pipelines and models, and over time our innovative clients have been demanding increasingly complex solutions.

Being a software developer by nature, I have realized that it’s actually pretty hard to develop a nontrivial app on Databricks due to the following limitations:

  • Sharing code across notebooks is hard
  • No proper Git support
  • Third-party package management is hard
  • No proper search (and replace)
  • No code checking/auto-completion
  • No debugger
  • Code testing is almost impossible

Most of the aforementioned limitations arise from the fact that the web-based Databricks IDE is very limited and lacks many of the standard features of local IDEs (PyCharm, Visual Studio Code, etc.).

I’ve spent many years working with PHP and JavaScript, where frameworks like Symfony or NestJS have standardized the whole development process and made the creation of complex solutions simpler. Nothing similar existed in the (Py)Spark/Databricks world, so I decided to take all my previous experience and create a more convenient and standardized way to develop apps in (Py)Spark and Databricks. That’s how Bricksflow was born.

Bricksflow’s vision & focus

Bricksflow is a set of best practices for Databricks project development, automated in Python and focused on the following paradigms:

  • one codebase for all environments (your favorite IDE + the Databricks UI),
  • anyone with basic Python skills can create pipelines and improve the business logic,
  • developing a standard DataLake project requires almost no engineers,
  • consistency is pursued even as the project grows.

Base components to be used by everyone:

  • Configuration in YAML
  • Tables & schema management
  • Automated deployment to Databricks
  • Documentation automation

Advanced components to be mostly used by engineers:

  • Production releases workflow
  • Unit & pipeline testing
  • Extensions API

The Master Package

A typical Bricksflow project consists of two main parts: data pipelines in Databricks notebooks and “The Master Package”.

A typical Bricksflow project structure
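The original structure diagram is not reproduced here. Purely as an illustration of the two parts described above, a hypothetical project layout might look roughly like this (all folder and package names are made up):

```
bricksflow-demo/
├── notebooks/                     # data pipelines, editable locally or in the Databricks UI
│   └── customer/
│       └── transactions_pipeline/
├── src/
│   └── myproject/                 # "The Master Package": shared, testable Python code
│       └── __init__.py
└── config/
    └── config.yaml                # app-wide YAML configuration
```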

When using this architecture, all business logic is placed in the Databricks notebooks which can be easily modified either using the Databricks web UI or in your local IDE.

The Master Package itself contains the bare Bricksflow framework, which can be extended with the following bundles:

  • databricks-bundle: Automated SparkSession initialization based on the running environment (local or Databricks UI); see the sketch after this list
  • datalake-bundle: DataLake as a configuration
  • dbx-deploy: Deploy your locally-developed code to the Databricks cluster, release it into production
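To give a feel for what databricks-bundle automates, here is a minimal, hypothetical sketch of environment-aware SparkSession initialization. The get_spark helper and the use of the DATABRICKS_RUNTIME_VERSION environment variable are illustrative assumptions, not the bundle’s actual API:

```python
import os

from pyspark.sql import SparkSession


def get_spark() -> SparkSession:
    """Return the active SparkSession on Databricks, or build a local one.

    Hypothetical helper illustrating the idea behind databricks-bundle.
    """
    if "DATABRICKS_RUNTIME_VERSION" in os.environ:
        # Running on a Databricks cluster: reuse the session provided by the runtime
        return SparkSession.builder.getOrCreate()

    # Running locally (IDE, unit tests): spin up a small local session
    return (
        SparkSession.builder
        .master("local[2]")
        .appName("bricksflow-local-dev")
        .getOrCreate()
    )
```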

Each bundle can be configured using YAML along with the app-wide configuration:

Example of a Bricksflow configuration
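The configuration screenshot is not shown here; the following YAML is only a rough, hypothetical sketch of what such an app-wide configuration could look like (keys, table names and paths are invented and do not reflect the bundles’ real schema):

```yaml
parameters:
  environment: dev                      # hypothetical app-wide setting

  datalakebundle:
    tables:
      customer.transactions:            # hypothetical table definition
        schema_path: myproject/customer/transactions/schema.py
        target_path: /mnt/datalake/customer/transactions.delta

  dbxdeploy:
    workspace_dir: /bricksflow-demo     # hypothetical deployment targets
    dbfs_dir: dbfs:/bricksflow-demo
```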

Most of The Master Package’s code & configuration is developed locally using standard IDEs like PyCharm or VS Code. When deploying to Databricks:

  1. The Master Package Python wheel file is built and uploaded to DBFS
  2. Databricks notebooks are uploaded to the Databricks workspace

Notebooks can be developed both locally and in the Databricks UI

After being deployed to the Databricks workspace, the notebooks look very similar to what you are used to. The only, yet most important, difference lies in the first line: by running the %run ../app/install_master_package command, The Master Package’s wheel gets installed into the notebook’s scope and you can start using it right away.

Typical Bricksflow-powered data pipeline broken into functions to simplify testability
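The notebook screenshot is not reproduced here, but the pattern is simply a pipeline split into small, individually testable functions. Below is a rough sketch in plain PySpark; the function names, paths and columns are invented for illustration, and Bricksflow’s own helpers are not shown:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as f

# The notebook's first cell contains: %run ../app/install_master_package
# Everything below is split into plain functions so each step can be unit-tested.


def read_transactions(spark: SparkSession, path: str) -> DataFrame:
    # hypothetical raw input stored as parquet
    return spark.read.parquet(path)


def add_vat(transactions: DataFrame, vat_rate: float) -> DataFrame:
    # pure transformation: trivially testable with a tiny in-memory DataFrame
    return transactions.withColumn("amount_with_vat", f.col("amount") * (1 + vat_rate))


def save_report(report: DataFrame, path: str) -> None:
    report.write.mode("overwrite").parquet(path)


# In the notebook, the steps are then chained cell by cell:
# df = read_transactions(spark, "/mnt/raw/transactions")
# report = add_vat(df, vat_rate=0.21)
# save_report(report, "/mnt/reports/transactions_with_vat")
```

Because each step is a plain function over DataFrames, it can be exercised locally with a handful of rows, which is what the “Unit & pipeline testing” component builds on.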

Create your first Bricksflow-based app!

Bricksflow does NOT aim to completely change the way you build solutions in Databricks. It’s an extra layer that gives you more flexibility & maturity for development, configuration and testing of your Databricks-powered apps.

Let’s now create your first Bricksflow-powered app to feel the difference!
