Apache Spark is a powerful, fast, and cost-efficient tool for Big Data problems, with components such as Spark Streaming, Spark SQL, and Spark MLlib. That makes Spark something like the Swiss army knife of the Big Data world.
Spark also has another very important feature: horizontal scaling. It supports a standalone (deploy) cluster mode, in which a single Spark cluster has one Master and any number of Slaves, also called Workers. Workers run as their own processes, either on separate machines (horizontal scaling) or several on the same machine (vertical scaling).
We can start a standalone cluster either by hand or with the launch scripts provided by Spark (in the sbin folder). We can also create and run a cluster on a single machine for testing purposes.
Today I will try to show you how to create a single Spark standalone cluster with 1 Master and 2 Slave nodes. I will use 2 different machines for this cluster: the first one will host the Master and 1 Slave node, and the second one will host the other Slave node. This way we will see both a multi-machine standalone cluster and a cluster running on a single machine.
To create a Spark standalone cluster, we have to install a compiled version of Apache Spark, Python, and a Java JDK on every machine in the cluster. It is also very important to install the same versions on each machine.
After the installation, we need to edit the spark-env.sh.template and slaves.template files on the Master machine. First open the Spark conf folder, create a copy of spark-env.sh.template, and rename it to “spark-env.sh”. Then edit the following parameters:
Next, in the same conf folder, create a copy of slaves.template and rename it to “slaves”. Then edit the following parameters:
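As a rough sketch of what these two files usually contain, here is a small Python helper that renders their text. The settings shown (SPARK_MASTER_HOST, SPARK_WORKER_CORES, SPARK_WORKER_MEMORY) are standard Spark standalone variables, but the IP addresses and values are placeholders I made up for illustration, not the exact parameters of my setup. Note also that newer Spark releases (3.1 and later) name the second file “workers” instead of “slaves”.

```python
def render_spark_env(master_host, worker_cores=None, worker_memory=None):
    """Return the text of conf/spark-env.sh for the given (placeholder) values."""
    lines = [f"export SPARK_MASTER_HOST={master_host}"]
    # Optional limits on what each Worker offers to the cluster.
    if worker_cores is not None:
        lines.append(f"export SPARK_WORKER_CORES={worker_cores}")
    if worker_memory is not None:
        lines.append(f"export SPARK_WORKER_MEMORY={worker_memory}")
    return "\n".join(lines) + "\n"


def render_slaves(worker_hosts):
    """Return the text of conf/slaves: one Worker hostname or IP per line."""
    return "\n".join(worker_hosts) + "\n"


# Hypothetical addresses: first machine is Master + Worker, second is Worker only.
print(render_spark_env("192.168.1.10", worker_cores=2, worker_memory="4g"))
print(render_slaves(["192.168.1.10", "192.168.1.11"]))
```

These files matter mostly when the cluster is started with the launch scripts; when starting the Master and Workers by hand, as I do below, the Master URL is passed on the command line instead.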
Lastly we need to create a data folder, which will hold the data we use in Spark, at the same path on both machines (e.g. C:\data). If the cluster has no shared storage such as Hadoop HDFS or Amazon S3, every node needs its own copy of the data folder. Otherwise the Workers cannot find the data and the job fails with an error. In the real world, Spark clusters normally read from storage systems like HDFS or S3, but since our goal is to understand the basics of a Spark standalone cluster, we will simply create a data folder on each machine.
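A minimal sketch of this step, to be run on each machine: it creates the folder if it is missing and prints the file URI under which Spark could read it. The helper names are my own, and C:\data is just the example path from above.

```python
from pathlib import Path


def ensure_data_dir(path):
    """Create the data folder (and any parents) if it does not exist yet."""
    data_dir = Path(path)
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir


def to_file_uri(path):
    """Return a file:// URI for a local path, e.g. for use in spark.read calls."""
    return Path(path).resolve().as_uri()


# Example usage (run with the same path on every node):
# data_dir = ensure_data_dir(r"C:\data")
# print(to_file_uri(data_dir))
```

The important part is not the code itself but that the path is identical on every node, because each Worker resolves it on its own local disk.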
Starting the Cluster
After completing the installation and editing steps, we are ready to create our lovely 😀 Spark standalone cluster. First open a terminal on the Master machine, move into the bin folder of Spark, and run the command:
“spark-class org.apache.spark.deploy.master.Master”
The console output shows that our Master started successfully at “spark://<MASTER-IP>:7077”. We can also monitor the cluster from the Spark standalone web-based user interface on port 8080 of the Master machine.
At this point the web UI shows that we still do not have any Workers, so it is time to raise our first Worker node. Open another terminal on the Master machine and run the command:
“spark-class org.apache.spark.deploy.worker.Worker spark://<MASTER-IP>:7077”
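If you prefer to start the processes from a script, the two commands above can be assembled in Python as in the sketch below. The class names are the ones used above, and --cores and --memory are options of the Worker launcher; the function names and the default “spark-class” executable name are my own assumptions (on Windows you may need to pass the full path to spark-class.cmd).

```python
import subprocess

MASTER_CLASS = "org.apache.spark.deploy.master.Master"
WORKER_CLASS = "org.apache.spark.deploy.worker.Worker"


def master_command(spark_class="spark-class"):
    """Build the argument list that starts a standalone Master."""
    return [spark_class, MASTER_CLASS]


def worker_command(master_url, cores=None, memory=None, spark_class="spark-class"):
    """Build the argument list that starts a Worker and registers it with the Master."""
    cmd = [spark_class, WORKER_CLASS]
    if cores is not None:
        cmd += ["--cores", str(cores)]  # cores this Worker offers to the cluster
    if memory is not None:
        cmd += ["--memory", memory]  # memory this Worker offers, e.g. "4g"
    cmd.append(master_url)
    return cmd


def start(cmd):
    """Launch the process in the background and return its handle."""
    return subprocess.Popen(cmd)


# Example usage (the address is a placeholder):
# worker = start(worker_command("spark://192.168.1.10:7077", cores=2, memory="4g"))
```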
So far we have started our Master node and one Worker node on the same machine. Now let’s start our second Worker node on the second machine. We open a terminal there and run the same command as we did for the first Worker.
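If the second Worker does not register, a common cause is that the Master port is not reachable from the second machine (for example because of a firewall). The sketch below is a simple connectivity check, not part of Spark itself; the helper names are hypothetical and 7077 is the Master port we saw above.

```python
import socket


def master_url(host, port=7077):
    """Build the spark:// URL that Workers and applications connect to."""
    return f"spark://{host}:{port}"


def can_reach(host, port=7077, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage on the second machine (placeholder address):
# print(master_url("192.168.1.10"), can_reach("192.168.1.10"))
```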
After starting the second Worker, we can check the current state of our cluster in the web UI of the Spark standalone cluster.
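The same information can also be read programmatically. The sketch below assumes that the Master web UI serves its state as JSON under the /json/ path and that each worker entry carries host, cores, memory, and state fields; treat the endpoint and field names as assumptions to verify against your Spark version.

```python
import json
from urllib.request import urlopen


def fetch_master_state(host, port=8080, timeout=5.0):
    """Download the Master state as a dict (assumes the /json/ endpoint exists)."""
    with urlopen(f"http://{host}:{port}/json/", timeout=timeout) as resp:
        return json.load(resp)


def summarize_workers(state):
    """Return (host, cores, memory, state) for each worker entry in the state dict."""
    return [
        (w.get("host"), w.get("cores"), w.get("memory"), w.get("state"))
        for w in state.get("workers", [])
    ]


# Example usage (placeholder address):
# for row in summarize_workers(fetch_master_state("192.168.1.10")):
#     print(row)
```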
We managed to create our Spark standalone cluster with 2 Worker nodes. As you can see, our Workers have different amounts of memory (6.9 and 14.8 GiB). This is because the two machines are configured differently: the Master machine has 16 GB of memory and the other one has 8 GB, and by default a Worker offers all of its machine's memory minus 1 GB. Separately, Spark gives each executor only 1 GB of memory by default, which we can change later in our PySpark script.
Running an Application in Cluster
Our Spark standalone cluster is ready for use. Let’s test it with an example PySpark script in Jupyter Notebook. I will use the same example that I used in the previous article about AWS. The only difference is in how we create the Spark session: this time we pass “spark://<MASTER-IP>:7077” instead of “local” to the master method. Also, as I mentioned before, I will change the memory settings (4 GB for each Worker's executor and for the driver) and assign the number of cores. You can adjust the configuration with other options as well, which you can find in the Spark documentation.
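The session setup described above can be sketched as follows. The configuration keys are standard Spark properties and the application name is the one used in this article; the core counts and the IP address are illustrative assumptions, so adjust them to your machines. The pyspark import is kept inside the function so the configuration can be inspected even where PySpark is not installed.

```python
def cluster_conf(master_host, memory="4g", executor_cores=2, max_cores=4, port=7077):
    """Build the Spark properties for connecting to the standalone Master."""
    return {
        "spark.master": f"spark://{master_host}:{port}",
        "spark.app.name": "spark-standalone-cluster",
        "spark.executor.memory": memory,  # memory per executor on each Worker
        "spark.driver.memory": memory,  # memory for the driver (our notebook)
        "spark.executor.cores": str(executor_cores),  # assumed value
        "spark.cores.max": str(max_cores),  # assumed total cores for this app
    }


def create_session(conf):
    """Create (or reuse) a SparkSession configured with the given properties."""
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Example usage in a notebook cell (placeholder address):
# spark = create_session(cluster_conf("192.168.1.10"))
```

Setting “spark.master” through config has the same effect as calling the master method on the builder.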
We managed to run our query successfully on our new cluster. Finally, we can look at the Running Applications section of the Spark UI web page to monitor it.
Our Spark application (spark-standalone-cluster) is running without any problem on our 2 Workers, each of which is using 4 GB of memory as we specified in the SparkSession.
In this article, I have tried to briefly explain how to create and run an Apache Spark standalone cluster with Jupyter Notebook. In this example we used 2 Workers, but you can increase this to any number with the same method described in the article. You can also learn more about Spark standalone clusters from the Spark documentation.
I hope you will find this article helpful. In the next article, I will write about how to integrate Spark applications with the famous NoSQL database “MongoDB”.
I will be happy to hear any comments or questions from you. May the data be with you!