Creating an Apache Spark Standalone Cluster on Windows

Sercan Karagoz
Jan 27

Apache Spark is a powerful, fast and cost-efficient tool for Big Data problems, with components like Spark Streaming, Spark SQL and Spark MLlib. That is why Spark is like the Swiss army knife of the Big Data world.

Spark also has another very important feature: horizontal scaling. In other words, Spark supports a standalone (deploy) cluster mode. A single Spark cluster has one Master and any number of Slaves or Workers. Workers can run their own individual processes on separate machines in a horizontal Spark cluster, as well as on the same machine with vertical scaling.


We can start a standalone cluster either manually by hand or with the launch scripts provided by Spark (in the sbin folder). We can also create and run a cluster on a single machine for testing purposes.

Today I will show you how to create a Spark standalone cluster with 1 Master and 2 Slave nodes. I will use 2 different machines for this cluster: the first one for the Master and 1 Slave node, and the second one for another Slave node. That way we will have both a standalone cluster and a single-machine cluster.

Prerequisites

To create a Spark standalone cluster, we have to install a compiled version of Apache Spark, Python and the Java JDK on each node of the cluster. It is also very important to install the same versions on each machine.

After completing the installation, we need to edit the spark-env.sh.template and slaves.template files on the Master machine. First open the Spark conf folder, create a copy of spark-env.sh.template and rename it “spark-env.sh”. Then edit the following parameters:

export SPARK_MASTER_HOST=<MASTER-IP>
export JAVA_HOME=<Path_of_JAVA_installation>
export PYSPARK_PYTHON=python3

Then, in the same conf folder, create a copy of slaves.template and rename it “slaves”. Then add the following entries:

<SLAVE01-IP>
<SLAVE02-IP>

Lastly, we need to create a data folder, which we will use for our data in Spark, at the same path on both machines (e.g. C:\data). If we do not have Hadoop HDFS on our cluster and we are not using AWS instances, we need to create this data folder on each node; otherwise the cluster cannot find the data and will raise an error. In the real world, storage systems like HDFS or S3 are normally used with Spark clusters, but since we are only covering the basics of a Spark standalone cluster, we will demonstrate it by creating a data folder on each machine.
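As a small sketch of how this shared folder is used later from PySpark (sample.csv is just a hypothetical file name, and spark is assumed to be the SparkSession we will create in the last section), data is read with a local file URI, which only works because the same path exists on every node:

# Read a CSV file from the shared local data folder
# (sample.csv is a hypothetical file name used only for illustration)
df = spark.read.csv("file:///C:/data/sample.csv", header=True, inferSchema=True)
df.show(5)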

Starting the Cluster

After completing the installation and editing steps, we are now ready to create our lovely 😀 Spark standalone cluster. First open a terminal on the Master machine, move into the bin folder of Spark and run the following command:

spark-class org.apache.spark.deploy.master.Master

[Screenshot: terminal output of the Master node starting up]

As you can see from the screenshot above, our Master started successfully at “spark://<MASTER-IP>:7077”. We can also monitor our cluster from the Spark standalone web-based user interface on port 8080.

[Screenshot: Spark standalone web UI on port 8080, with no Workers registered yet]

You can easily see that we still do not have any Workers, so it is time to start our first Worker node by opening another terminal on the Master machine and running:

spark-class org.apache.spark.deploy.worker.Worker spark://<MASTER-IP>:7077

[Screenshots: the first Worker starting and registering with the Master]

So far we have started our Master node and one Worker node on the same machine. Now let’s start our second Worker node from the second machine. We will again open a terminal and run the same command as we did for the first Worker.

After starting the second Worker, we can check the latest state of our cluster from the web UI of the Spark standalone cluster:

[Screenshot: Spark standalone web UI showing the Master with 2 registered Workers]

We managed to create our Spark standalone cluster with 2 Worker nodes. As you can see, our Workers have different amounts of memory (6.9 and 14.8 GiB). This is because of the different configurations of the two machines: the Master machine has 16 GB of memory and the other one has 8 GB. Also, Spark allocates 1 GB of memory to each executor by default, which we can change later in our PySpark script.

Running an Application in Cluster

Our Spark standalone cluster is ready for use. Let’s test it with an example PySpark script in a Jupyter Notebook. I will use the same example which I used in the previous article about AWS. The only difference is in creating the Spark session: this time we will pass “spark://<MASTER-IP>:7077” instead of “local” to the master method. Also, as I mentioned before, I will change the memory usage of the Workers (4 GB for each executor and the driver) and will assign core numbers to them. You can also adjust the configuration with other options, which you can find in the Spark documentation.
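Here is a minimal sketch of such a session (the memory values are the ones described above, the application name spark-standalone-cluster matches the one shown in the Spark UI below, and the number of cores is only an example value):

from pyspark.sql import SparkSession

# Connect to the standalone Master instead of running in local mode
spark = (
    SparkSession.builder
    .appName("spark-standalone-cluster")
    .master("spark://<MASTER-IP>:7077")
    .config("spark.executor.memory", "4g")  # 4 GB per executor
    .config("spark.driver.memory", "4g")    # 4 GB for the driver
    .config("spark.executor.cores", "2")    # example number of cores per executor
    .getOrCreate()
)

Once the session is created, the application should appear under Running Applications in the Master web UI on port 8080.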

[Screenshots: the example PySpark script and the query output in the Jupyter Notebook]

We managed to run our query successfully on our new cluster. Finally, we can look at our cluster in the Spark UI web page to monitor Running Applications.

[Screenshot: Spark standalone web UI listing the running application]

Our Spark application (spark-standalone-cluster) is running without any problems on our 2 Workers, each of which is using 4 GB of memory as we specified in the SparkSession.

Conclusion

In this article, I have tried to briefly show how to create and run an Apache Spark standalone cluster with a Jupyter Notebook. In this example we used 2 Workers, which you can increase to any number with the same method as mentioned in the article. You can also improve your knowledge of Spark standalone clusters with the help of the Spark documentation.

I hope you will find this article helpful. In the next article, I will write about how to integrate Spark applications with the famous NoSQL database “MongoDB”.

I will be happy to hear any comments or questions from you. May the data be with you!
