Create Spark Session In Scala

BigData-ETL
Jul 13, 2022



Creating a Spark Session object, which tells Spark how to access a cluster, is the first thing a Spark application must do. The SparkSession holds the details of your application, and from it you obtain the SparkContext and SQLContext instances that open up Spark's functionality.

At its core, every Spark application consists of a driver program that runs the user's main function and executes a number of parallel operations on a cluster.

This post is a part of Free Spark Tutorial!


What Is RDD?

The primary abstraction Spark offers is the resilient distributed dataset (RDD), a collection of elements partitioned across the cluster's nodes that can be processed in parallel. RDDs are created by parallelizing an existing Scala collection in the driver application, or by referencing a file in the Hadoop file system (or any other file system supported by Hadoop) as the starting point for a new RDD.

Additionally, users can ask Spark to keep an RDD in memory so that it can be reused efficiently across parallel operations. RDDs also recover automatically from node failures.
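For illustration, here is a minimal sketch of both ways to create an RDD and of caching one in memory (the collection and the file path are made-up examples):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rdd-example").getOrCreate()
val sc = spark.sparkContext

// Create an RDD from an existing Scala collection in the driver application...
val numbers = sc.parallelize(1 to 100)

// ...or from a file in a Hadoop-supported file system (hypothetical path).
val lines = sc.textFile("hdfs:///data/input.txt")

// Keep an RDD in memory so it can be reused efficiently by later operations.
val squares = numbers.map(n => n * n).cache()
println(squares.reduce(_ + _)) // the first action computes and caches the RDD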

Spark Driver VS Executor

Each Spark application must consist of:

  • One Spark Driver application
  • One or more Spark Executors

Spark Driver is like a boss: it manages the whole application. It decides which part of the job will be done on which Executor and collects information from the Executors about task statuses.

The communication must be bidirectional. In the Hadoop world, when the application is submitted to YARN for acceptance, the requested resources are allocated. The Spark Driver is set up on one of the Hadoop nodes and the executors on other nodes (the Spark Driver can also run on the same machine as one of the executors).
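As a rough illustration, the resources requested from YARN can be declared through standard Spark properties when the session is built (the values below are only assumptions, not recommendations):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("driver-and-executors")
  .master("yarn") // let YARN allocate the driver and executor containers
  .config("spark.executor.instances", "4") // how many executors to request
  .config("spark.executor.cores", "2")     // cores per executor
  .config("spark.executor.memory", "4g")   // memory per executor
  .config("spark.driver.memory", "2g")     // memory for the driver
  .getOrCreate()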


Spark Session VS Spark Context

Spark Session is the main object in Spark: it is the entry point of every Spark application.

Spark Context is a variable of the Spark Session object and is used to operate on RDDs.

SQL Context, like Spark Context, is a variable of the Spark Session object; it is used to execute operations on DataFrames and Datasets.

To visualize these dependencies, take a look at the following diagram:

[Diagram: the SparkSession object with its SparkContext and SQLContext (source: https://bigdata-etl.com/create-spark-session-in-scala/)]
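In code, the same relationship looks roughly like this (a minimal sketch; the sample data is made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("contexts").getOrCreate()

// SparkContext: exposed on the session, used for RDD operations.
val sc = spark.sparkContext
val rdd = sc.parallelize(Seq(1, 2, 3))

// SQLContext: also exposed on the session, used for DataFrame/Dataset operations
// (kept mostly for backward compatibility; the session itself offers the same methods).
val sqlContext = spark.sqlContext

import spark.implicits._
val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
df.show()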

Create SparkSession Object

The first thing we need to do to start working with Spark is to create the SparkSession instance. You need just a few lines of code to create it. That's all! Now you can start your journey with Apache Spark!

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession
  .builder()
  .master("local[*]")
  .appName("BigData-ETL.com")
  .getOrCreate()

// master() sets the Spark master URL to connect to, such as "local" to run locally, "local[4]" to run locally with 4 cores, "local[*]" to run with all available cores, or "spark://master:7077" to run on a Spark standalone cluster.
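With the session in place, you can start working with data right away; a small assumed example that continues from the code above:

import spark.implicits._

val df = Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")
df.show()

spark.stop() // release cluster resources when the application is done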

Summary

That’s all about how to create a Spark Session in Scala. Now you are ready to write your first Spark application. Don’t waste time, let’s go to the next section!

See other Apache Spark related articles on my blog: https://bigdata-etl.com/big-data-articles/big-data-topic/apache-spark-articles/
