Spark on a cluster (Standalone/YARN)
Here I’m going to run a standalone Spark cluster, and then use my existing YARN cluster to run a Spark application.
Feel free to check my previous posts on how to configure the Hadoop cluster.
SPARK standalone cluster
In a standalone cluster, we can run Spark in either the client or the cluster deploy mode.
Basically, the deploy mode decides where the driver program runs. In client mode, the driver runs in the same process as the client that submits the application; in cluster mode, the driver starts on one of the worker nodes in the cluster.
Google it; there are plenty of resources that explain how this works.
I’m using my Hadoop master node VM as the Spark master node and all the Hadoop data node VMs as Spark worker nodes.
Install spark on our master node
Download Spark distribution
wget https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
Extract to any location
I extracted it to the /usr/local/spark/ folder.
Set environment variables
nano .bash_profile
Then add these entries
export SPARK_HOME=/usr/local/spark/
export HADOOP_CONF_DIR=/home/shehan/hadoop/etc/hadoop
export LD_LIBRARY_PATH=/home/shehan/hadoop/lib/native:$LD_LIBRARY_PATH
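Optionally, putting the Spark binaries on the PATH makes spark-shell and spark-submit callable from anywhere; a minimal sketch, assuming the SPARK_HOME set above:
# optional: add the Spark binaries to the PATH (assumes SPARK_HOME from above)
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
Then reload the profile in the current shell:
source ~/.bash_profile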
Configure Spark
Create a copy of spark-defaults.conf.template
cp /usr/local/spark/conf/spark-defaults.conf.template /usr/local/spark/conf/spark-defaults.conf
Open spark-defaults.conf & add properties.
nano /usr/local/spark/conf/spark-defaults.conf
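For example, these are the event-log entries I’d expect here so the history server (started later in this post) has something to read; the HDFS path is an assumption that matches the /spark-events folder created below:
# write event logs to HDFS so the history server can read them
# (the hdfs path is an assumption; it must match the folder created below)
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///spark-events
spark.history.fs.logDirectory hdfs:///spark-events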
Note
Make sure you add these entries; they are required when Spark runs on the YARN cluster.
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
Now, copy the shuffle library spark-3.0.0-preview2-yarn-shuffle.jar (this jar lives in $SPARK_HOME/yarn) into the hadoop/share/hadoop/yarn/ folder:
cp /usr/local/spark/yarn/spark-3.0.0-preview2-yarn-shuffle.jar hadoop/share/hadoop/yarn
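The external shuffle service runs inside every YARN NodeManager, so the same jar needs to land on each data node as well; a hedged sketch using the node names from this cluster:
# copy the shuffle jar to every NodeManager node
# (assumes passwordless SSH and the same relative hadoop layout on each node)
for node in node01.hadoop.smf node02.hadoop.smf node03.hadoop.smf; do
  scp /usr/local/spark/yarn/spark-3.0.0-preview2-yarn-shuffle.jar "$node":hadoop/share/hadoop/yarn/
done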
Configure the Spark environment
Open spark-env.sh & add properties
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/
export HADOOP_CONF_DIR=/home/shehan/hadoop/etc/hadoop
export SPARK_MASTER_HOST=master.hadoop.smf
Create a folder in HDFS for Spark event logs
hadoop fs -mkdir /spark-events
Verify
Run spark-shell --version to check that Spark is installed correctly and to see the version.
Run Spark Master.
/usr/local/spark/sbin/start-master.sh
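If the master came up cleanly, a Master process should be visible via jps, and the standalone master web UI should answer on its default port 8080; a quick check:
# the Master daemon should appear in the JVM process list
jps | grep Master
# 8080 is Spark's default standalone-master UI port
curl -s -o /dev/null -w "%{http_code}\n" http://master.hadoop.smf:8080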
Add workers.
Create a copy of “slaves.template” as “slaves”, then add the node names
cp /usr/local/spark/conf/slaves.template /usr/local/spark/conf/slaves
My entries
node01.hadoop.smf
node02.hadoop.smf
node03.hadoop.smf
Finally, copy the Spark folder to all the worker nodes (copy it to the same location as on the master, so no configuration changes are needed), and make sure .bash_profile is updated on each node accordingly; a sketch of this step is below.
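A hedged sketch of that distribution step, reusing the worker names from the slaves file; it assumes passwordless SSH and write access to /usr/local on the workers:
# push the Spark installation and the profile to every worker
for node in node01.hadoop.smf node02.hadoop.smf node03.hadoop.smf; do
  rsync -a /usr/local/spark/ "$node":/usr/local/spark/
  scp ~/.bash_profile "$node":~/
done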
Run Spark Slaves.
/usr/local/spark/sbin/start-slaves.sh
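To verify, a quick check on any worker node (Worker is the standard daemon name for Spark standalone):
# each slave node should now run a Worker JVM
jps | grep Worker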
Everything seems okay, and all the workers are registered. Now it’s time to run some samples and see whether the cluster works as expected.
Example ( deploy-mode=cluster)
spark-submit --master spark://master.hadoop.smf:7077 --deploy-mode cluster --class org.apache.spark.examples.SparkPi /usr/local/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar
We can see the results in the driver logs (in cluster mode the driver runs on a worker node, so its logs are reachable from the master UI).
Example ( deploy-mode=client)
spark-submit --master spark://master.hadoop.smf:7077 --deploy-mode client --class org.apache.spark.examples.SparkPi /usr/local/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar
While an application is running we can see its progress via the Spark UI at http://<nodename>:4040/jobs/. However, once the job finishes that UI disappears, so we can run the Spark history server to see status/progress even after a job has completed. The history server runs on port 18080 (defined in spark-defaults.conf).
/usr/local/spark/sbin/start-history-server.sh
SPARK on YARN
Spark supports two modes for running on a YARN cluster, “yarn-cluster” mode and “yarn-client” mode, analogous to the deploy modes of the standalone cluster above. In “yarn-cluster” mode the driver runs on one of the data nodes, inside a YARN container called the “application master”. The application master is just another Java process, and its purpose is to run the driver of the given application.
Open yarn-site.xml & add/update these properties
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle,mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
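Since every NodeManager reads its own copy of yarn-site.xml, the updated file needs to reach all the data nodes too; a minimal sketch, assuming the Hadoop layout used earlier in this post:
# distribute the updated yarn-site.xml to the data nodes
for node in node01.hadoop.smf node02.hadoop.smf node03.hadoop.smf; do
  scp /home/shehan/hadoop/etc/hadoop/yarn-site.xml "$node":hadoop/etc/hadoop/
done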
We don’t need the Spark master and its worker daemons running here.
Stop spark cluster
/usr/local/spark/sbin/stop-slaves.sh
/usr/local/spark/sbin/stop-master.sh
Start the YARN daemons
start-yarn.sh
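Once YARN is back up, the registered node managers can be listed to confirm that every node rejoined:
# list the NodeManagers registered with the ResourceManager
yarn node -list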
Example
Cluster mode
spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi /usr/local/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar
Client mode
spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi /usr/local/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar
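In cluster mode the SparkPi result ends up in the driver’s container logs on YARN rather than in the local console; with log aggregation enabled it can be fetched afterwards (the application ID comes from the spark-submit output or the ResourceManager UI):
# fetch the aggregated logs (and the SparkPi result) for a finished application
yarn logs -applicationId <application-id>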
Next :-