Migrating from Databricks Notebooks to IDE for Development

Hang Xu · Published in data-surge · Dec 1, 2021

Why Develop in an IDE instead of in Databricks?

There are many limitations when working with notebooks in Databricks, such as:

  1. Difficulty managing version control
  2. No linting or code corrections via IDE integration
  3. No support for test-driven development (notebooks are great for exploration, but not so much for production)

Why use an IDE (PyCharm, Visual Studio Code, etc.)?

  1. Intelligent code completion for SQL and database queries
  2. Debugging with breakpoints
  3. A better environment for testing functions and refactoring code
  4. An abundance of well-maintained packages for any need

Getting Started/Databricks Connect

To get started with developing outside of Databricks, we first need to identify the library we will use to connect to our clusters.

We will be using a library called Databricks Connect, which allows us to connect to our Databricks cluster and run Spark jobs on it.

The first step is to choose a cluster runtime environment that Databricks Connect supports.

The Databricks Connect documentation lists the supported runtimes, each of which requires a matching local Python version. We will be using the latest supported runtime, 9.1, along with the required Python version, 3.8.

Creating a Virtual Environment

One of the greatest benefits of developing through an IDE is that coordinating with fellow developers via git and version control becomes much easier. But not everyone will have the same development environment, so it is important to create a virtual one that contains all the necessary packages at their pinned versions.

Conda is an open-source package manager that makes it easy to create new virtual environments and install the required dependencies, so you never have to sweat over whether your team is working in the same environment with the same package versions.

You can create that virtual environment with a few simple lines: conda creates a virtual env called "dbconnect" with Python 3.8 installed and activates it, and then we uninstall pyspark and install databricks-connect at the version matching the Databricks cluster runtime.
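The original snippet did not survive here, but a minimal sketch of those commands, following the standard Databricks Connect installation steps, looks like this:

    conda create --name dbconnect python=3.8
    conda activate dbconnect

    # databricks-connect conflicts with a standalone pyspark install,
    # so remove pyspark before installing the client
    pip uninstall -y pyspark

    # pin the client to the cluster's runtime version (9.1 here)
    pip install -U "databricks-connect==9.1.*"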

Configuring the Connection Properties

Before configuring the connection, you will need the tokens and IDs required for connecting to the cluster, which the Databricks Connect article linked above covers. These parameters include the Host, personal access token (PAT), Org ID, Port (the default is fine), and Cluster ID, all of which can be found or generated on the Databricks cluster page.
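With those values in hand, the standard databricks-connect configure command stores them locally. The prompts below are placeholders, not values from the original post:

    databricks-connect configure

    # You will be prompted for each connection property:
    #   Databricks Host   https://<your-workspace>.cloud.databricks.com
    #   Databricks Token  <your personal access token (PAT)>
    #   Cluster ID        <your cluster ID>
    #   Org ID            <your org ID; 0 if your workspace has none>
    #   Port              15001 (the default)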

To test the connection, simply type:
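    # spins up the cluster and runs the built-in connectivity checks
    databricks-connect test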

This will boot up your cluster and run a couple of simple Python and Scala scripts against it.

Running a Databricks Example

The Databricks Connect article also provides examples that you can execute from your IDE, so go ahead and try some of them.
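As one illustration, here is a minimal sketch in that spirit (not the article's original snippet); it assumes the dbconnect environment is active, the connection is configured, and your workspace includes the standard /databricks-datasets sample data:

    from pyspark.sql import SparkSession

    # databricks-connect routes this session to the configured remote cluster
    spark = SparkSession.builder.getOrCreate()

    # a trivial job: the count executes on the Databricks cluster
    df = spark.range(100)
    print(df.count())

    # read one of the sample datasets shipped with most Databricks workspaces
    flights = spark.read.csv(
        "/databricks-datasets/flights/departuredelays.csv",
        header=True,
        inferSchema=True,
    )
    flights.groupBy("origin").count().show(5)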

Conclusion

Now that you have seen your IDE connect to your Databricks cluster and run jobs there, you no longer have to deal with the headaches of developing and debugging in notebooks, and you can spend less time refactoring and more time doing the things you love.

If you would like us to evaluate and review your current data architecture or help you implement a modern architecture, please email us at info@datasurge.com or complete the form on our contact us page.
