Handy Databricks Features for Development

Başak Tuğçe Eskili
Published in Marvelous MLOps
Jul 31, 2024 · 6 min read

We know how important it is for ML practitioners to be able to develop locally. Local development environments provide a familiar, customizable, and efficient workspace that accelerates the coding process. Similarly, notebooks are a commonly preferred choice for writing and testing code interactively. Databricks recognizes these needs and provides several features designed to improve and support both local and notebook-based development.

We’ll look into 3 features that enhance and support the development process.

  1. Git Folders
  2. Databricks Connect
  3. Databricks VS Code Extension

Git Folders (previously Git Repos)

Databricks Git folders is a Git integration that allows you to clone your repository, commit, pull, and push just like you would locally. This way, you can continue working in the Databricks environment while using version control. It’s designed to foster collaborative work and supports both cloud and enterprise Git providers. See the official documentation for a detailed guideline.

It’s quite easy to get started. In your workspace, either under the Repos folder or your personal username, create a Git folder and pass your Git repository URL and provider. Depending on your provider and repository privacy settings, you will need to authenticate either before cloning your repo (if it’s a private repo) or before commit & push. This is mostly done via a PAT (personal access token).

Authentication

Go to your profile settings, find Linked accounts, and click Git Integration. You will need to choose your Git provider, get your personal access token, and paste it there.
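If you prefer to script this step, the same link can also be made with the Databricks SDK for Python. This is a minimal sketch, assuming the databricks-sdk package is installed and workspace authentication is already configured; the provider value, username, and token below are placeholders:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up your existing workspace authentication

# Link a personal access token for your Git provider (placeholder values).
w.git_credentials.create(
    git_provider="gitHub",              # e.g. gitHub, gitLab, azureDevOpsServices
    git_username="<your-git-username>",
    personal_access_token="<your-pat>",
)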

In your workspace, either on the home page or in the Repos folder, you can create a Git Folder that will serve as your synchronized code base.
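The UI is the usual route, but for completeness, here is a rough sketch of creating the same Git folder with the Databricks SDK for Python; the URL, provider, and path are illustrative placeholders:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Clone a repository into a Git folder under your user (illustrative values).
repo = w.repos.create(
    url="https://github.com/<org>/movie-recommender.git",
    provider="gitHub",
    path="/Repos/<your-username>/movie-recommender",
)
print(repo.id, repo.branch)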

Once you create the folder and clone your repository, you will see the content. I just cloned our example movie-recommender repo.

It will show you which branch you are on. Now, let’s say we want to run some code using our modules from the repo. Create a notebook and attach a cluster. If you are creating a cluster from scratch, make sure to adjust the termination time, as the default is very long and can cost you money.

I created an example notebook in which I import my modules from the "topn" folder and execute the main script.
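For context, a cell in that notebook might look roughly like the snippet below. The module and function names are hypothetical placeholders rather than the actual movie-recommender code; the point is that, inside a Git folder, the repo root is typically on the notebook’s sys.path, so repo packages can be imported directly:

# Hypothetical imports: "topn" is the repo folder; the module and function
# names are placeholders for whatever your repo actually provides.
from topn.recommender import train_recommender  # hypothetical

# Run the main logic on the attached cluster, straight from the Git folder.
model = train_recommender(top_n=10)  # hypothetical signature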

If you want to make changes to your code, you can develop here and then commit & push back to your Git repository. Let’s say I want to add this example notebook to my repository. Click where you see the branch name (master); it will show you the changes you are about to push.

Why use Git Folders?

  • Easily get your code base available in the Databricks workspace and continue developing while using the Databricks cluster.
  • Collaborate with other developers in the same environment.

Of course, this feature has some drawbacks. Merge conflicts are not easy to resolve if you hit one while pushing, and debugging is harder because you don’t get the debugging features offered by IDEs.

Databricks Connect

Databricks Connect is a library that allows you to connect from your IDE to Databricks compute and run code remotely using the Spark API. This way, you can write and run large-scale Spark code against the Databricks runtime without needing Spark installed locally. It’s pretty straightforward to set up and use. It is built on Spark Connect. For Python applications, you must use Databricks Runtime 13.3 or above.

This guideline explains how to get started with it for Python applications. I will briefly show the setup to give you some context, but feel free to follow the original docs.

Install the Databricks CLI and configure your authentication.

brew tap databricks/tap
brew install databricks
databricks auth login --configure-cluster --host <workspace-url>

If you have one cluster in your workspace, it will automatically be assigned to your Databricks profile; if you have multiple, you need to select the cluster you want to run your code on via the CLI prompt.
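For reference, this login stores a profile in your ~/.databrickscfg file, which Databricks Connect can later reference by name. A sketch of what such a profile might look like; all values are illustrative:

[dbc-7ccf7e51-3b7b]
host       = https://<workspace-url>
auth_type  = databricks-cli
cluster_id = <your-cluster-id>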

Following the tutorial, I set up the Python project and executed the code below.

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.profile("dbc-7ccf7e51-3b7b").getOrCreate()
df = spark.read.table("samples.nyctaxi.trips")
df.show(5)

The cluster’s Spark UI will show executions in progress and completed.

To use Databricks Connect, we import DatabricksSession and create a Spark session. Behind the scenes, the Spark code is sent to the Databricks cluster and executed there, and the results are retrieved back to your local environment. This means only your Spark code runs in the Databricks environment, while the rest of your Python code executes on your local machine.
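As a rough illustration of that split, reusing the profile and sample table from above (the aggregation itself is just an example):

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.profile("dbc-7ccf7e51-3b7b").getOrCreate()

# These DataFrame operations are shipped to the cluster and executed there.
trips = spark.read.table("samples.nyctaxi.trips")
summary = trips.groupBy("pickup_zip").count()

# toPandas() pulls the (small) result back to your machine;
# everything from here on is plain local Python.
local_df = summary.limit(20).toPandas()
print(local_df.head())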

Why use Databricks Connect?

  • Debug and step through your code in your IDE while working with a Databricks cluster.
  • Easily iterate when developing libraries without needing to restart the cluster.
  • Change Python or Scala library dependencies in Databricks Connect without cluster restarts, as the client sessions are isolated within the cluster.

Databricks Connect is a handy tool, but keep in mind that some code adjustments might be necessary when you transition between running locally and within a Databricks workflow.

Databricks VS Code Extension

Visual Studio Code (VS Code) has a Databricks extension that allows you to connect to your Databricks workspace and run your local code files on Databricks clusters directly from your VS Code environment.

This is especially beneficial if you need more compute power, require access to data available to your cluster (e.g., mounted Azure Blob Storage), or want to execute your code in an environment that closely mirrors your production setup.

There is an official step-by-step guide as well.

Start by installing the Databricks extension. Once it’s installed, you’ll see the Databricks logo icon in the left bar. Click it to configure authentication. You need to add your workspace URL and log in via the browser.

Next, you can attach a cluster and start.

Next, you need to sync your files. This step uploads your files to a location under your username in the Databricks workspace. Once the sync is done, you can see that the files are available.

It’s under my username, in the .ide folder.

We have two options. The first is to upload and run the file on Databricks, which will run it on your attached cluster.

Or you can submit it as a workflow (job).
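In both cases, the file being run is an ordinary local Python script. A minimal sketch of such a file, assuming Databricks Connect is set up in your Python environment as in the previous section (the profile name and table are reused from the earlier example):

from databricks.connect import DatabricksSession

# A sketch only: assumes the Databricks Connect profile configured earlier.
# Adjust session creation to however your target environment provides Spark.
spark = DatabricksSession.builder.profile("dbc-7ccf7e51-3b7b").getOrCreate()

trips = spark.read.table("samples.nyctaxi.trips")
print(f"Row count: {trips.count()}")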

I executed it on the Databricks cluster and followed the progress in the Debug console:

When you run it as a workflow, you will instantly see under workflow runs that a job run has been initiated via the submit API.
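For context, that submit API is the one-time job run endpoint; the extension builds the request for you. A rough sketch of an equivalent call with the Databricks SDK for Python, with placeholder cluster ID and file path:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Submit a one-time run of the uploaded file on the attached cluster
# (illustrative values; the extension does this under the hood).
run = w.jobs.submit(
    run_name="vscode-example-run",
    tasks=[
        jobs.SubmitTask(
            task_key="main",
            existing_cluster_id="<your-cluster-id>",
            spark_python_task=jobs.SparkPythonTask(
                python_file="/Workspace/Users/<you>/.ide/<project>/main.py",
            ),
        )
    ],
).result()  # wait for the run to finish

print(run.state.life_cycle_state)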

The job run details and output are also shown in your VS Code. It was executed on the attached cluster as well.

Why use the VS Code Databricks Extension?

  • Execute local Python code files from VS Code on Databricks clusters in your remote workspaces.
  • Execute your scripts in a job run that can mirror the production environment.

If you’re already a Databricks customer, why not take full advantage of its features? Databricks offers several features that allow users to develop their applications smoothly, and utilizing those capabilities can really ease your development process. We do our best to try out what’s available and share it with you. Happy coding!
