Setting up a Multi-Node Apache Spark Cluster on Apache Hadoop and Apache Hive

Saeid Dadkhah · Published in CodeX · Apr 12, 2022 · 4 min read

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. In this post, I will set up Apache Spark 3.0.1 on Apache Hadoop 3.3.0. Since a running Apache Hadoop cluster is required before an Apache Spark cluster can be set up, I recommend following my previous blog post Running a Multi-Node Hadoop Cluster to set up your Apache Hadoop cluster first.

Apache Spark

I may use some Scala code and CentOS (RHEL) commands.

Set Up the Apache Spark Environment

Once the Hadoop framework is running, we can start setting up Apache Spark. To make sure we are on the same page, I will assume that you have set up the Hadoop framework using my post Running a Multi-Node Hadoop Cluster. Now, download Apache Spark and copy the archive to all workers.

[hduser@master ~]# wget {Apache Spark LINK (I downloaded spark-3.0.1-bin-hadoop3.2.tgz)}
[hduser@master ~]# scp spark-3.0.1-bin-hadoop3.2.tgz hduser@worker1:/home/hduser/spark-3.0.1-bin-hadoop3.2.tgz
[hduser@master ~]# scp spark-3.0.1-bin-hadoop3.2.tgz hduser@worker2:/home/hduser/spark-3.0.1-bin-hadoop3.2.tgz

On all servers, extract Apache Spark and create a soft link so that you can easily change its version in the future. I will extract all files into the /opt directory.

[hduser@{server} ~]# cd /opt
[hduser@{server} /opt]# sudo tar xzf /home/hduser/spark-3.0.1-bin-hadoop3.2.tgz
[hduser@{server} /opt]# sudo ln -s spark-3.0.1-bin-hadoop3.2/ spark
[hduser@{server} /opt]# sudo chown -R hduser:hadoop spark
[hduser@{server} /opt]# sudo chown -R hduser:hadoop spark-3.0.1-bin-hadoop3.2

Configure Apache Spark

Before starting the configuration process, I should mention that configuring an Apache Spark cluster on an Apache Hadoop cluster is a two-step process. First, Spark needs to know how Hadoop works; then, it needs to know how it is supposed to work itself! Let me show you the code. Copy the Hadoop config files to the Spark config directory on all servers. All Spark config files are available under the directory /opt/spark/conf.

[hduser@{server} ~]# cd /opt/spark/conf
[hduser@{server} /opt/spark/conf]# cp /opt/hadoop/etc/hadoop/core-site.xml .
[hduser@{server} /opt/spark/conf]# cp /opt/hadoop/etc/hadoop/hdfs-site.xml .
[hduser@{server} /opt/spark/conf]# cp /opt/hadoop/etc/hadoop/yarn-site.xml .

On the master server, edit the config files of Spark and copy them to workers.

[hduser@master ~]# cd /opt/spark/conf
[hduser@master /opt/spark/conf]# cp slaves.template slaves
[hduser@master /opt/spark/conf]# cp spark-env.sh.template spark-env.sh
[hduser@master /opt/spark/conf]# cp log4j.properties.template log4j.properties
[hduser@master /opt/spark/conf]# vi slaves
# Add the names of all workers
master
worker1
worker2
[hduser@master /opt/spark/conf]# vi spark-env.sh
# Add these lines
export JAVA_HOME=/opt/jdk
export SPARK_HOME=/opt/spark
export SPARK_CONF_DIR=$SPARK_HOME/conf
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_WORKER_DIR=/tmp/spark/work
[hduser@master /opt/spark/conf]# vi log4j.properties
# Add this line
log4j.logger.org=OFF
[hduser@master /opt/spark/conf]# scp log4j.properties spark-env.sh hduser@{slaves}:/opt/spark/conf/

Finally, edit the .bashrc file and add these lines.

# Spark
export SPARK_HOME=/opt/spark
export SPARK_CONF_DIR=$SPARK_HOME/conf
export PATH=$SPARK_HOME/bin:$PATH

Provide Apache Spark JAR Files

Configuring Apache Spark is finished, but before submitting any Spark application, we should provide the Spark JAR files on HDFS. Simply run these commands on any server!

[hduser@{server} ~]# hdfs dfs -mkdir -p /user/spark/share/lib
[hduser@{server} ~]# hadoop fs -put /opt/spark/jars/* /user/spark/share/lib/

At this point, you can submit your Spark applications and use Apache HDFS and Apache YARN with a command like the following on any server.

[hduser@{server} ~]# /opt/spark/bin/spark-submit \
--master yarn --deploy-mode cluster \
--class com.github.saeiddadkhah.hadoop.spark.Application \
--driver-memory 2G --driver-cores 2 \
--executor-memory 1G --executor-cores 1 \
--name MyApp \
--num-executors 10 \
MyApp.jar

You should also provide some other configurations to Apache Spark inside your application. The following code is written in Scala, but Apache Spark has a similar API for the other languages.

Spark Config
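
A minimal sketch of such a configuration in Scala might look like this; the spark.yarn.jars path points at the HDFS directory created in the previous section, the package and class names match the spark-submit command above, and the rest is illustrative rather than an exact copy of the original code.

package com.github.saeiddadkhah.hadoop.spark

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object Application {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MyApp")
      // Reuse the Spark JARs already uploaded to HDFS instead of
      // shipping them from the client on every submission.
      .set("spark.yarn.jars", "hdfs:///user/spark/share/lib/*.jar")

    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()

    // ... your application logic goes here ...

    spark.stop()
  }
}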

Add Apache Hive support

Whether or not to use Apache Hive in your Spark application is up to you; it depends entirely on your use case and preferences. Here, I will just configure Apache Spark to use Apache Hive 3.1.2. Naturally, we need a running Apache Hive, and the good news is that you can use my previous post “Setting up an Apache Hive Data Warehouse” to set it up.

Now, copy the Hive config file to the Spark config directory on all servers.

[hduser@{server} ~]# cd /opt/spark/conf
[hduser@{server} /opt/spark/conf]# cp /opt/hive/conf/hive-site.xml .

You should also add some other configurations inside your application to enable Hive support.

Spark Config with Hive
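
A minimal Scala sketch of this, assuming the hive-site.xml copied above points at your running Hive metastore; enableHiveSupport() is the essential call, and the query is only a placeholder.

package com.github.saeiddadkhah.hadoop.spark

import org.apache.spark.sql.SparkSession

object Application {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets Spark SQL use the Hive metastore
    // described by the hive-site.xml in the Spark conf directory.
    val spark = SparkSession.builder()
      .appName("MyApp")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder query to verify the connection to Hive.
    spark.sql("SHOW DATABASES").show()

    spark.stop()
  }
}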

Stay tuned for the next posts, in which I will set up more tools. I hope you enjoyed this tutorial. Thank you.
