Connecting to Spark

Alexander Suvorov
Version 1
Feb 24, 2022

Motivation

Spark is a great engine for data engineering, data science and machine learning tasks. Spark makes jobs possible that would not be feasible on an ordinary workstation by providing a managed, distributed environment in which code can be executed and results extracted. This is great, but how does one go about using Spark's capabilities from an application that runs in a Docker container?

Options

What are our options for the task at hand? There are a few options available:

· Apache Livy, which claims to allow us to interact with the Spark cluster and run jobs or have interactive sessions.

· Databricks Connect, which claims to allow us to connect custom applications directly to a Databricks cluster.

Both have benefits and drawbacks, some of which are outlined below.

Apache Livy

Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It requires some setup and access to the Spark installation directory.

However, this service provides REST access that is agnostic to the language the consuming application is built in. Livy also provides other features such as security, fault tolerance and sharing of cached data. The main use case for this service is the scenario where one cannot use cloud services and must use an onsite Spark instance.

From https://livy.apache.org
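For illustration, below is a minimal sketch of driving Livy's REST interface from Python. The host name is an assumption (8998 is Livy's default port), and production code would need proper error handling; see the Livy REST API documentation for the full contract.

import json
import time

import requests

LIVY_URL = "http://livy-host:8998"  # assumed host; 8998 is Livy's default port
HEADERS = {"Content-Type": "application/json"}

# Ask Livy to start an interactive PySpark session on the cluster.
session = requests.post(f"{LIVY_URL}/sessions",
                        data=json.dumps({"kind": "pyspark"}),
                        headers=HEADERS).json()
session_url = f"{LIVY_URL}/sessions/{session['id']}"

# Wait until the session is ready to accept statements.
while requests.get(session_url, headers=HEADERS).json()["state"] != "idle":
    time.sleep(5)

# Submit a statement; Livy executes it on the cluster.
stmt = requests.post(f"{session_url}/statements",
                     data=json.dumps({"code": "spark.range(100).count()"}),
                     headers=HEADERS).json()

# Poll until the result is available, then print it.
statement_url = f"{session_url}/statements/{stmt['id']}"
while True:
    result = requests.get(statement_url, headers=HEADERS).json()
    if result["state"] == "available":
        break
    time.sleep(1)
print(result["output"])

# Clean up the session when done.
requests.delete(session_url, headers=HEADERS)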

Databricks Connect

Databricks Connect allows you to connect your IDE or a custom application to a Databricks cluster and run code as needed. The setup is very easy and does not require many configuration steps or installations.

One glaring drawback of this approach is the dependency on Databricks, as it cannot be used with plain Spark. If for one reason or another you cannot use the cloud and must use onsite Spark, then this approach is unsuitable. However, if one can use the cloud and Databricks, then this approach is relatively simple and easy to use.

Implementation

For this evaluation, we selected Databricks Connect for the connection to Spark, as we do not have an onsite Spark installation and we can use the cloud for our data and application.

Configuration

The installation and configuration are covered here and will not be replicated, so you need to run those configuration steps yourself. However, it is important to ensure that all existing pyspark libraries are removed from the Python virtual environment that will contain the connector.

The configuration creates the .databricks-connect file in the user home directory. The content of this file during configuration looks somewhat like this:

{
  "host": "https://adb-453245345.4.azuredatabricks.net/?o=45645645645645#",
  "token": "ewrtdfg56456dhgdfgh5dgjjkljkldfgd",
  "org_id": "45645645645645",
  "port": "15001",
  "cluster_id": "34534-345345-ices56567"
}

All the values can be obtained in the Databricks portal, which is also where the user token is created. Following successful configuration, copy the .databricks-connect file into the project directory.
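As a small, optional sanity check (not something Databricks Connect itself requires), the application can verify that the copied file is valid JSON and contains the expected keys:

import json
from pathlib import Path

# Load the copied configuration and confirm the expected keys are present.
config = json.loads(Path(".databricks-connect").read_text())
for key in ("host", "token", "org_id", "port", "cluster_id"):
    assert key in config, f"missing '{key}' in .databricks-connect"
print("Configuration present for cluster", config["cluster_id"])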

Docker

You also need to update your Dockerfile to include one line, which copies the .databricks-connect file into the container's home directory.

COPY .databricks-connect /root

Depending on your setup, you may also need to include a JDK in the image if you do not have one already.
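As another optional, illustrative check (not part of the original POC), the application can fail fast at startup if no Java runtime is visible inside the container:

import shutil

# Databricks Connect needs a Java runtime on the PATH inside the container.
if shutil.which("java") is None:
    raise RuntimeError("No Java runtime found; install a JDK in the Docker image.")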

Application

The application part is the last step. The approach is not prescriptive; it is simply a description of how the POC was done.

The application consists of a Flask app that acts as a serving module by exposing a REST endpoint. There is no particular reason for this to exist, except that we need a button to push so we can see the process working.

The serving module creates an instance of Broker from the Broker package. This Broker instance is the one that creates and encapsulates the Spark session:

self.spark = SparkSession.builder.getOrCreate()

The same Broker instance exposes functions to interact with the Spark session. When the Spark session is created, the connection is made to the cluster defined in .databricks-connect using the appropriate credentials. You can now use this Spark session as you need it.
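To make the wiring concrete, below is a minimal sketch of what such a serving module and Broker might look like. The endpoint, method and table names are illustrative assumptions, not the exact code of the POC.

from flask import Flask, jsonify
from pyspark.sql import SparkSession


class Broker:
    # Creates and encapsulates the Spark session. With databricks-connect
    # installed, the builder connects to the cluster defined in
    # .databricks-connect instead of starting a local Spark.
    def __init__(self):
        self.spark = SparkSession.builder.getOrCreate()

    def row_count(self, table_name):
        # Example of a function the Broker exposes: run a simple job remotely.
        return self.spark.table(table_name).count()


app = Flask(__name__)
broker = Broker()


@app.route("/count/<table_name>")
def count(table_name):
    # The REST endpoint is just the "button to push" to see the process working.
    return jsonify({"table": table_name, "rows": broker.row_count(table_name)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)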

Conclusion

There are a couple of ways to connect to a Spark cluster from a local environment and either get access to the Spark session or run code via a REST interface. The choice depends mostly on external factors (such as whether the cloud can be used) and to a lesser extent on personal preference.

The Apache Livy route appears to be more suitable for onsite environments, provides a common interface for heterogeneous consumers, and is geared towards certain products.

Databricks Connect, used and described above, gives one access to the Spark session and provides quick, direct access to the Spark cluster.

Apache Livy and Databricks Connect embody different strategies for dealing with access to the Spark cluster.

About The Author
Alexander Suvorov is a Consultant here at Version 1.
