Data Sharing between multiple Spark Jobs in Databricks

Using Global temporary view

Karthikeyan Siva Baskaran
4 min read · Feb 22, 2020

There are multiple use cases where you want to share data across multiple Spark jobs when the data is not too huge. Using createOrReplaceGlobalTempView, the data can be shared between Spark jobs instead of persisting intermediate data to disk and cleaning it up later.

createOrReplaceTempView or createOrReplaceGlobalTempView creates a lazily evaluated “view” from the DataFrame that you can then use like a Hive table in Spark SQL. But it does not persist the data in memory unless you cache the underlying dataset.

Both methods create an in-memory reference to the DataFrame in use. The temporary view is not persisted at this point, but you can run SQL queries on top of it.
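A minimal sketch, assuming a Databricks notebook where spark and its implicits are available (the view name is illustrative):

// Build a small DataFrame and register a session-scoped temp view
import spark.implicits._

val df = Seq(1, 2, 3, 4, 5).toDF("value")
df.createOrReplaceTempView("my_temp_view")
spark.sql("SELECT * FROM my_temp_view").show()

// The view itself is lazy; cache the data only if you query it repeatedly
spark.table("my_temp_view").cache()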

The lifetime of a temp view created by createOrReplaceTempView() is tied to the SparkSession in which the DataFrame was created.

The lifetime of a global temp view created by createGlobalTempView() is tied to the Spark application, so this in-memory reference can be used across SparkSessions and is automatically dropped when the application terminates. It is tied to a system-preserved database, global_temp (configurable via SparkConf), and you must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.my_view.
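Continuing the sketch above, the same DataFrame can be exposed to a second session in the same application (my_view is an illustrative name):

// Register a global temp view from the same DataFrame
df.createGlobalTempView("my_view")

// A new session in the same application can query it through the global_temp database
val anotherSession = spark.newSession()
anotherSession.sql("SELECT * FROM global_temp.my_view").show()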

In Databricks, you can use a global temp view to share data between different notebooks when each notebook has its own SparkSession. If all notebooks share the same SparkSession, a normal temp view is enough to share data across notebooks; however, for security reasons this sharing option is set to false by default. You can turn it on based on your requirements.

Databricks command to view the configuration value
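A rough sketch of reading the flag discussed below (the second argument is a default returned when the key is unset):

// Check whether notebooks on this cluster share a single SparkSession
spark.conf.get("spark.databricks.session.share", "false")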

Spark Session Isolation is enabled by default. With session isolation, different notebooks attached to a cluster run in different sessions with isolated runtime configurations and current database settings. To share temporary views across notebooks when session isolation is enabled, use global temporary views. You can still disable session isolation by setting spark.databricks.session.share to true; with this option enabled, createOrReplaceTempView itself shares the data between different SparkSessions (different notebooks). From Spark 2.0.2-db1 onwards, for security reasons and user stability, session isolation is enabled by default, so every notebook uses its own SparkSession.

To disable session isolation, set the option at the cluster level and then restart the cluster. As a good practice, though, session isolation shouldn’t be disabled.

Disable Spark Session Isolation
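A sketch of the cluster-level entry, added under the cluster’s Spark Config in the Databricks UI and followed by a cluster restart:

spark.databricks.session.share true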

Without a Databricks Cluster

If you are not using a Databricks cluster, a Spark application can be considered a single batch job, and it can contain more than one SparkSession. Global temporary views can be used to share data between multiple SparkSessions.

Since Spark 2.0, SparkSession is the unified entry point of a Spark application. It provides a way to interact with Spark’s various functionality with a smaller number of constructs.
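For example, a minimal sketch of creating one (the app name is illustrative):

import org.apache.spark.sql.SparkSession

// getOrCreate() returns an existing session or builds a new one
val spark = SparkSession.builder()
  .appName("data-sharing-demo")
  .getOrCreate()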

Why do we need multiple SparkSessions in the same Spark job?

One scenario is when two different configurations need to be used to create two different SparkSessions, for example two different Hive metastores. If you have a requirement to combine data coming from two different Hive metastores, you need two SparkSessions with different configurations:
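A rough sketch under these assumptions: the metastore URIs and table name are hypothetical, newSession() shares the underlying SparkContext but keeps its own configuration and temp views, and whether a per-session metastore setting takes effect depends on your Hive client setup.

// First session, pointing at the first Hive metastore
val spark1 = SparkSession.builder()
  .appName("two-metastores")
  .config("hive.metastore.uris", "thrift://metastore-a:9083")
  .enableHiveSupport()
  .getOrCreate()

// Second session with its own conf, pointing at the second metastore
val spark2 = spark1.newSession()
spark2.conf.set("hive.metastore.uris", "thrift://metastore-b:9083")

// A global temp view makes data read in one session visible to the other
spark1.table("source_table").createOrReplaceGlobalTempView("shared_data")
spark2.sql("SELECT * FROM global_temp.shared_data").show()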

Here, the usage is mostly within a single Spark job. But in Databricks, you can share the same data between different Spark jobs.

Using a Databricks Cluster

In Databricks, because notebooks share the same cluster, we can share data between different Spark applications using Notebook Workflows.

Notebook workflows in Databricks allow you to easily build complex workflows and pipelines with dependencies and conditional routing based on the previous job’s status (success/failure).
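As a sketch (the notebook path and view name are hypothetical), a driver notebook can run a producer notebook that registers a global temp view, then query it; the output below comes from such a run:

// Run the producer notebook and wait up to 60 seconds for it to complete
dbutils.notebook.run("/Users/me/producer_notebook", 60)

// The producer registered a global temp view named shared_view; read it here
spark.sql("SELECT * FROM global_temp.shared_view").show()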

// Output of data created in one notebook, accessed in another notebook:
+-----+
|value|
+-----+
|    1|
|    2|
|    3|
|    4|
|    5|
+-----+

Wrapping up

In a nutshell, TEMPORARY skips persisting the view definition in the underlying metastore, if any. If GLOBAL is specified, the view can be accessed by different sessions and is kept alive until the application ends; otherwise, temporary views are session-scoped and are automatically dropped when the session terminates. All global temporary views are tied to a system-preserved temporary database, global_temp. The database name is reserved, so users are not allowed to create, use, or drop this database. You must use the qualified name to access a global temporary view.

Happy Learning !!
