Distributed Hadoop Cluster — 1 (Spark) with all dependencies

AHMET FURKAN DEMIR
17 min read · Dec 18, 2023


Impressive Apache Hadoop Cluster Installation: Step-by-Step Guide

Cluster Architecture

Hello! In this article, we will explore the exciting steps of creating a Hadoop cluster. This cluster will be implemented on virtual machines created using Multipass, consisting of a master and two nodes. Throughout this process, our goal is to bring together essential components such as Hadoop, YARN, and HDFS to build a robust data processing infrastructure. Additionally, we will configure the Apache Spark cluster to further enhance big data processing capabilities.

Our Roadmap:

  1. What is Hadoop?
  2. Cluster configuration
  3. Hadoop & YARN installation
  4. Apache Spark installation
  5. Sample project with Hadoop and Spark

What is Hadoop?

Hadoop is an open-source software framework designed for the distributed storage and processing of large datasets across clusters of machines. The Apache Hadoop project consists of a set of tools and libraries used to store and process large datasets. Hadoop is commonly employed by large-scale organizations seeking to process and store complex and massive datasets.

Key features of Hadoop include:

  1. HDFS (Hadoop Distributed File System): It is a distributed file system used for storing large datasets. Data is replicated across many nodes in the cluster to ensure resilience.
  2. MapReduce: It is a programming model for processing large datasets in parallel. MapReduce enables applications running on Hadoop to process large datasets across the cluster. In this article, instead of MapReduce, we will use Apache Spark, a more modern and advanced engine that performs the same kinds of operations.
  3. YARN (Yet Another Resource Negotiator): It is a resource management layer that manages resources in a Hadoop cluster. YARN is used to run multiple applications efficiently by effectively utilizing resources in the cluster.
  4. Hadoop Ecosystem: The Hadoop ecosystem comprises a set of projects and tools that extend the core features of Hadoop. For example, projects like Apache Hive (database query language), Apache Pig (data analysis tool), Apache HBase (distributed database), Apache Spark (fast big data processing), Apache Kafka (stream processing), and many others collectively form the Hadoop ecosystem.

Hadoop is designed to meet the needs of big data analytics and processing with scalability, resilience, and cost-effective storage. Its ability to perform parallel processing on vast datasets makes Hadoop a crucial tool for large-scale data analysis and mining.
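To make the HDFS side of this concrete, here is a small sketch of standard HDFS command-line operations of the kind we will use later in the article. The paths and file names are only examples, and the commands assume the cluster we build below is already running.

hdfs dfs -mkdir -p /data/raw # Create a directory in HDFS
hdfs dfs -put local-file.csv /data/raw/ # Upload a local file; its blocks are distributed across DataNodes
hdfs dfs -ls /data/raw # List files together with their replication factor
hdfs dfs -setrep -w 2 /data/raw/local-file.csv # Change the replication factor of a file
hdfs fsck /data/raw -files -blocks -locations # Show which DataNodes hold each block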

Cluster configuration

In this section, we will configure port communication and SSH connections between the Ubuntu machines. Before proceeding with this step, please create three machines using Multipass: one master and two nodes, as shown in the image below.

If you wish, you can use VirtualBox, virtual machines from any cloud provider, or physical servers directly instead of Multipass.

Use the following commands to create virtual machines using Multipass.

multipass launch --name master --cpus 4 --memory 5120M --disk 40G
multipass launch --name node1 --cpus 4 --memory 5120M --disk 40G
multipass launch --name node2 --cpus 4 --memory 5120M --disk 40G

# list your machines
multipass list
List of Ubuntu machines created with Multipass

Machines

  • master (172.20.93.59) | Ubuntu 22.04 LTS | 4 CPU, 5 GB RAM, 40 GB SSD
  • node1 (172.20.82.41) | Ubuntu 22.04 LTS | 4 CPU, 5 GB RAM, 40 GB SSD
  • node2 (172.20.81.188) | Ubuntu 22.04 LTS | 4 CPU, 5 GB RAM, 40 GB SSD
Features of master machine

Once your machines are up and running, open the necessary ports using ufw. These ports will enable communication between the nodes in the Hadoop cluster.

ufw (Uncomplicated Firewall) is a firewall management tool used on Ubuntu. It is designed to make firewall configuration simpler and more user-friendly, and it is widely used on Ubuntu and Debian-based distributions.

You can open the necessary ports with the following commands:

sudo ufw allow 8020
sudo ufw allow 8080
sudo ufw allow 50090
sudo ufw allow 8021
sudo ufw allow 50060
sudo ufw allow 1004
sudo ufw allow 9000
sudo ufw allow 9001
sudo ufw allow 22
sudo ufw allow ssh
sudo ufw allow 8032
sudo ufw allow 9870
sudo ufw allow 9866
sudo ufw allow 9864
sudo ufw allow 9868
sudo ufw allow 8088
sudo ufw allow 8042
sudo ufw allow 50020
sudo ufw allow 8030
sudo ufw allow 8031
sudo ufw allow 8040
sudo ufw allow 7077


sudo ufw enable
sudo ufw reload
sudo ufw status
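For reference, here is what a few of the most important ports above are used for (these are the standard default port assignments; the remaining ports cover older Hadoop services and auxiliary daemons):

# 22    SSH
# 9000  HDFS NameNode RPC (matches fs.default.name in core-site.xml below)
# 9870  NameNode web UI
# 9864 / 9866 / 9868  DataNode web UI / DataNode data transfer / SecondaryNameNode web UI
# 8088  YARN ResourceManager web UI
# 8030 / 8031 / 8032  ResourceManager scheduler / resource tracker / client ports
# 8042  NodeManager web UI
# 7077  Spark standalone master
# 8080  Spark master web UI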

After opening the ports, we must assign hostnames separately for each machine.

# master
sudo hostnamectl set-hostname master

# node1
sudo hostnamectl set-hostname node1

# node2
sudo hostnamectl set-hostname node2

For example, when you run the sudo hostnamectl hostname command, you should get output like the following; it will differ on each machine.

Checking the hostname

You can install SSH-related packages to enable communication between machines using the commands below. These commands must be run for each machine.

sudo apt-get install ssh
sudo apt-get install ssh*

To establish SSH connections successfully, we need to add the IP addresses and DNS information of all machines to the /etc/hosts file. This file allows name-based connections between machines on the network by specifying the corresponding IP address for each machine name. This enables us to establish SSH connections securely and seamlessly.

The /etc/hosts file manages local name resolution on a computer. It provides mappings between IP addresses and hostnames. It is commonly used to associate IP addresses with hostnames before a DNS server is consulted, especially in small networks or on a single computer.

This file serves various purposes, such as organizing incoming network traffic to your computer, mapping a specific IP address to a name, or providing local resolution when reliable access to DNS servers is not available.

Open and edit the /etc/hosts file separately on each machine. You can edit this file with the following command: sudo nano /etc/hosts

Paste the following lines into the opened file. Repeat this for each machine, pasting the same text, and adjust the hostnames and IP addresses to match your own machines.

172.20.93.59 master
172.20.82.41 node1
172.20.81.188 node2

Important warning!

Be sure to delete the “127.0.0.1 localhost” line on the master machine and the nodes. Otherwise, the master and nodes will not be able to see each other in the later stages. Delete this line on all machines; if you do not have such a line, continue.

/etc/hosts file
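If you prefer to script this step instead of editing the file by hand, a minimal sketch could look like the following (run it on every machine and adjust the IP addresses to your own cluster):

sudo sed -i '/127.0.0.1 localhost/d' /etc/hosts # Remove the localhost line mentioned in the warning above
echo '172.20.93.59 master
172.20.82.41 node1
172.20.81.188 node2' | sudo tee -a /etc/hosts # Append the cluster entries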

Then, we will create a new user named hadoop-user on all machines, and we will perform all the remaining operations as this user. After running the commands, you should automatically be switched to the new user.

# Run the same commands separately for all machines
sudo adduser hadoop-user
sudo usermod -aG hadoop-user hadoop-user
sudo adduser hadoop-user sudo
su hadoop-user
cd /home/hadoop-user
Switching to hadoop-user on the master machine

Run the following commands on the master machine. Thanks to SSH, all operations will be performed from the master from now on; all remaining commands will be run only on the MASTER.

sudo apt-get purge openssh-server
ssh-keygen
cat /home/hadoop-user/.ssh/id_rsa.pub

Copy the generated public key to all machines with the following commands for password-free login.

ssh-copy-id hadoop-user@node1
ssh-copy-id hadoop-user@node2
ssh-copy-id hadoop-user@master

Use the following commands for control purposes. If the connection is established without a password and without any problems, all operations are successful and you have successfully completed the cluster configuration.

ssh node1
exit # Exit from machine node1

ssh node2
exit # Exit from machine node2

ssh master
exit # Exit from master
Password-free login from the master machine to node1 via SSH

If you receive a “Permission denied (publickey)” error on the Multipass SSH side, please review this article and follow the solution method described there. Post link

Hadoop & YARN installation

In this step, we will complete the Hadoop & YARN installation and finish creating the cluster.

Hadoop Configuration

First, complete the Java installation by running the following commands on all machines.

# for master machine
sudo apt install openjdk-8-jre
sudo apt install openjdk-8-jdk
java -version

# for node1 machine
ssh node1
sudo apt install openjdk-8-jre
sudo apt install openjdk-8-jdk
java -version
exit

# for node2 machine
ssh node2
sudo apt install openjdk-8-jre
sudo apt install openjdk-8-jdk
java -version
exit
Make sure Java installation is complete on all machines

Complete the Java configuration on all machines with the following commands.

# for master machine
cd /usr/lib/jvm
sudo ln -sf java-8-openjdk* current-java

echo 'export JAVA_HOME=/usr/lib/jvm/current-java' >> ~/.bashrc

source ~/.bashrc # Reload the changed bashrc file
echo $JAVA_HOME

# for node1 machine
ssh node1
cd /usr/lib/jvm
sudo ln -sf java-8-openjdk* current-java

echo 'export JAVA_HOME=/usr/lib/jvm/current-java' >> ~/.bashrc

source ~/.bashrc # Reload the changed bashrc file
echo $JAVA_HOME
exit

# for node2 machine
ssh node2
cd /usr/lib/jvm
sudo ln -sf java-8-openjdk* current-java

echo 'export JAVA_HOME=/usr/lib/jvm/current-java' >> ~/.bashrc

source ~/.bashrc # Reload the changed bashrc file
echo $JAVA_HOME
exit

Next, download Hadoop on the master machine and copy it to all node machines. You can check and download the latest Hadoop version from this link: https://archive.apache.org/dist/hadoop/common/. Since the latest Hadoop version was 3.3.6 at the time this article was published, version 3.3.6 will be used in the commands that follow.

cd ~ # Go to home directory
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xvf hadoop-3.3.6.tar.gz # Untar Hadoop files
rm hadoop-3.3.6.tar.gz # Delete redundant archive file
scp -r ./hadoop-3.3.6 hadoop-user@node1:~/ # Copy files to node1
scp -r ./hadoop-3.3.6 hadoop-user@node2:~/ # Copy files to node2

On all machines, move the Hadoop files to the /opt directory for easier maintenance and permission handling.

# For master Machine
sudo mv hadoop-3.3.6 /opt/ # Move Hadoop files to /opt/ directory
sudo ln -sf /opt/hadoop-3.3.6 /opt/hadoop # Create symbolic link for abstraction
sudo chown hadoop-user:root /opt/hadoop* -R # Change owner to hadoop-user, group to root
sudo chmod g+rwx /opt/hadoop* -R # Allow group to read-write-execute

# For node1 Machine
ssh node1
sudo mv hadoop-3.3.6 /opt/ # Move Hadoop files to /opt/ directory
sudo ln -sf /opt/hadoop-3.3.6 /opt/hadoop # Create symbolic link for abstraction
sudo chown hadoop-user:root /opt/hadoop* -R # Change owner to hadoop-user, group to root
sudo chmod g+rwx /opt/hadoop* -R # Allow group to read-write-execute
exit # Logout from node1

# For node2 Machine
ssh node2
sudo mv hadoop-3.3.6 /opt/ # Move Hadoop files to /opt/ directory
sudo ln -sf /opt/hadoop-3.3.6 /opt/hadoop # Create symbolic link for abstraction
sudo chown hadoop-user:root /opt/hadoop* -R # Change owner to hadoop-user, group to root
sudo chmod g+rwx /opt/hadoop* -R # Allow group to read-write-execute
exit # Logout from node2

Adding hadoop paths to $PATH variable on all machines

# master
echo 'export PATH=$PATH:/opt/hadoop/bin:/opt/hadoop/sbin' >> ~/.bashrc
source ~/.bashrc # Reload the changed bashrc file
echo $PATH # Confirm that $PATH variable is changed properly

# node1
ssh node1
echo 'export PATH=$PATH:/opt/hadoop/bin:/opt/hadoop/sbin' >> ~/.bashrc
source ~/.bashrc # Reload the changed bashrc file
echo $PATH # Confirm that $PATH variable is changed properly
exit

# node2
ssh node2
echo 'export PATH=$PATH:/opt/hadoop/bin:/opt/hadoop/sbin' >> ~/.bashrc
source ~/.bashrc # Reload the changed bashrc file
echo $PATH # Confirm that $PATH variable is changed properly
exit

Export Hadoop variables on all machines

# master
echo '
# Bash Variables for Hadoop
export HADOOP_HOME="/opt/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
' >> ~/.bashrc

source ~/.bashrc # Reload the changed bashrc file
echo $HADOOP_HOME $HADOOP_COMMON_HOME $HADOOP_CONF_DIR $HADOOP_HDFS_HOME $HADOOP_MAPRED_HOME

# node1
ssh node1
echo '
# Bash Variables for Hadoop
export HADOOP_HOME="/opt/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
' >> ~/.bashrc

source ~/.bashrc # Reload the changed bashrc file
echo $HADOOP_HOME $HADOOP_COMMON_HOME $HADOOP_CONF_DIR $HADOOP_HDFS_HOME $HADOOP_MAPRED_HOME
exit

# node2
ssh node2
echo '
# Bash Variables for Hadoop
export HADOOP_HOME="/opt/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
' >> ~/.bashrc

source ~/.bashrc # Reload the changed bashrc file
echo $HADOOP_HOME $HADOOP_COMMON_HOME $HADOOP_CONF_DIR $HADOOP_HDFS_HOME $HADOOP_MAPRED_HOME
exit

We will configure the master first and then transfer the configuration files to the other nodes.

On the master machine, we will edit the $HADOOP_HOME/etc/hadoop/core-site.xml file. The “configuration” tag in this file should look like this:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
$HADOOP_HOME/etc/hadoop/core-site.xml

Then we will edit the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file on the master machine. The “configuration” tag in this file should look like this.

The dfs.webhdfs.enabled and dfs.permissions properties here allow files to be put into HDFS by any user and through external APIs.

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
$HADOOP_HOME/etc/hadoop/hdfs-site.xml

We will edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh on the master machine. Find the line containing export JAVA_HOME and change it to:

export JAVA_HOME=/usr/lib/jvm/current-java

$HADOOP_HOME/etc/hadoop/hadoop-env.sh

We will edit $HADOOP_HOME/etc/hadoop/workers on the master machine. If there is a localhost line in the file, delete it. Just add the following lines:

node1
node2
$HADOOP_HOME/etc/hadoop/workers

Transfer all the changes we made to the other nodes.

scp $HADOOP_HOME/etc/hadoop/* node1:$HADOOP_HOME/etc/hadoop/ # For node1
scp $HADOOP_HOME/etc/hadoop/* node2:$HADOOP_HOME/etc/hadoop/ # For node2

Formatting the HDFS file system on the master machine

hdfs namenode -format

Starting HDFS on the master machine

start-dfs.sh # or $HADOOP_HOME/sbin/start-dfs.sh

Verify that everything started correctly by running the jps command on all machines. On the master node, you should see SecondaryNameNode and NameNode as shown below:

14259 NameNode
14581 Jps
14472 SecondaryNameNode
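On the worker machines, the picture is different: node1 and node2 should each be running a DataNode process. A quick check from the master could look like this (if jps is not on the remote PATH, log in with ssh node1 and run jps there):

ssh node1 jps # Expect a DataNode process in the output
ssh node2 jps # Expect a DataNode process in the output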

When you go to the web UI (http://master:9870, or ip:port, e.g. http://172.20.93.59:9870), you should see the number of live nodes as 2.

http://master:9870
datanodes

YARN Configuration

YARN (Yet Another Resource Negotiator) is a cluster resource management system that is part of the Apache Hadoop ecosystem and aims to manage big data processing jobs more effectively. YARN was introduced in Hadoop 2.0 and provides a more flexible and general-purpose big data processing infrastructure than earlier versions of Hadoop.

YARN primarily consists of two main components:

ResourceManager: Manages all resources on the cluster. It accepts job applications, tracks available resources, and schedules jobs to be sent to the appropriate nodes.

NodeManager: Acts as an agent running on each node. It communicates with the ResourceManager, monitors resource usage on the node, and manages the applications to run.
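Once YARN is started later in this section, both components can be inspected from the command line; for example:

yarn node --list # NodeManagers registered with the ResourceManager (also used below)
yarn application -list # Applications currently known to the ResourceManager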

On all machines, export the $HADOOP_YARN_HOME variable.

# master
echo '# Bash Variables for Yarn
export HADOOP_YARN_HOME=$HADOOP_HOME' >> ~/.bashrc

source ~/.bashrc # Reload the changed bashrc file
echo $HADOOP_YARN_HOME

# node1
ssh node1
echo '# Bash Variables for Yarn
export HADOOP_YARN_HOME=$HADOOP_HOME' >> ~/.bashrc

source ~/.bashrc # Reload the changed bashrc file
echo $HADOOP_YARN_HOME
exit

# node2
ssh node2
echo '# Bash Variables for Yarn
export HADOOP_YARN_HOME=$HADOOP_HOME' >> ~/.bashrc

source ~/.bashrc # Reload the changed bashrc file
echo $HADOOP_YARN_HOME
exit

On the master machine, we will edit the $HADOOP_HOME/etc/hadoop/yarn-site.xml file. The “configuration” tag in this file should look like this.

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>
$HADOOP_HOME/etc/hadoop/yarn-site.xml

Transfer yarn-site.xml from the master machine to the other machines.

scp $HADOOP_HOME/etc/hadoop/yarn-site.xml node1:$HADOOP_HOME/etc/hadoop/yarn-site.xml # Copy yarn config to node1
scp $HADOOP_HOME/etc/hadoop/yarn-site.xml node2:$HADOOP_HOME/etc/hadoop/yarn-site.xml # Copy yarn config to node2

Start YARN on the master machine

start-yarn.sh # or $HADOOP_HOME/sbin/start-yarn.sh

Check the nodes registered in YARN

yarn node --list

yarn node --list

Test Hadoop & YARN

Calculate an approximation of the number pi with the following command (16 map tasks, 1000 samples per map).

yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar pi 16 1000

You can go to the YARN web UI and see the number of active nodes and running jobs: http://master:8088, or ip:port, e.g. http://172.20.93.59:8088/cluster/cluster

Yarn&Hadoop UI

We have successfully completed the Hadoop, HDFS, and YARN installation. The next step will be to run Apache Spark on the cluster.

Apache Spark installation

Apache Spark is an open-source, fast, and general-purpose framework for big data processing and analysis. Developed as an alternative to the Hadoop MapReduce model, Spark can perform fast and distributed operations on large datasets stored in the Hadoop Distributed File System (HDFS). Spark stands out for its high performance, ease of use, and wide range of applications.

The key features of Apache Spark are:

Fast Processing: Spark can perform data processing operations much faster than Hadoop MapReduce. This fast processing is based on Spark’s ability to keep data in memory.

Distributed Data Processing: Spark can process large data sets in a parallel and distributed manner. It provides high performance on large data sets and completes operations faster.

In-Memory Processing: Spark provides significant performance gains with its ability to keep data in memory. This feature provides the advantage of storing intermediate results in memory and using these results when necessary.

Various Language Support: Spark supports various programming languages such as Java, Scala, Python, and R. This gives developers the flexibility to use the language of their choice.

Wide Range of Applications: Spark can be used in various application areas such as data processing, querying, machine learning, graph analysis, and streaming data processing. Therefore, it is a preferred big data framework for a wide range of applications.

Spark is compatible with the Hadoop ecosystem and is typically run on Hadoop YARN. However, Spark can also run on its own Standalone Cluster Manager, as we will do in this article. Apache Spark is an important tool that provides fast and flexible solutions in the world of big data processing.
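As a small illustration of these two modes (you can try it once the Spark installation below is finished), the SparkPi example that ships with the Spark distribution can be submitted either to the standalone master we start in this article or to YARN. The exact example-jar file name depends on the Spark version you download.

# Standalone cluster manager (the mode used in this article)
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://master:7077 $SPARK_HOME/examples/jars/spark-examples_*.jar 100

# The same application submitted through YARN (uses the HADOOP_CONF_DIR exported earlier)
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn --deploy-mode cluster $SPARK_HOME/examples/jars/spark-examples_*.jar 100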

Install Scala on all machines with the following commands

# master
sudo apt install scala
scala -version

# node1
ssh node1
sudo apt install scala
scala -version
exit

# node2
ssh node2
sudo apt install scala
scala -version
exit

Download & Install Spark 3.5.0

We will perform the download on the master machine and then copy the files to the nodes. For other versions of Spark, go to https://archive.apache.org/dist/spark/

cd ~ # Go to home directory
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xvf spark-3.5.0-bin-hadoop3.tgz # Untar Spark files
rm ./spark-3.5.0-bin-hadoop3.tgz # Delete redundant archive file
scp -r ./spark-3.5.0-bin-hadoop3 hadoop-user@node1:~/ # Copy files to node1
scp -r ./spark-3.5.0-bin-hadoop3 hadoop-user@node2:~/ # Copy files to node2

On all machines, we will move the Spark files under the /opt directory for easier maintenance and permission handling.

# For master Machine
sudo mv spark-3.5.0-bin-hadoop3 /opt/ # Move Spark files to /opt/ directory
sudo ln -sf /opt/spark-3.5.0-bin-hadoop3 /opt/spark # Create symbolic link for abstraction
sudo chown hadoop-user:root /opt/spark* -R # Change owner to hadoop-user, group to root
sudo chmod g+rwx /opt/spark* -R # Allow group to read-write-execute

# For node1 Machine
ssh node1
sudo mv spark-3.5.0-bin-hadoop3 /opt/ # Move Spark files to /opt/ directory
sudo ln -sf /opt/spark-3.5.0-bin-hadoop3 /opt/spark # Create symbolic link for abstraction
sudo chown hadoop-user:root /opt/spark* -R # Change owner to hadoop-user, group to root
sudo chmod g+rwx /opt/spark* -R # Allow group to read-write-execute
exit # Logout from node1

# For node2 Machine
ssh node2
sudo mv spark-3.5.0-bin-hadoop3 /opt/ # Move Spark files to /opt/ directory
sudo ln -sf /opt/spark-3.5.0-bin-hadoop3 /opt/spark # Create symbolic link for abstraction
sudo chown hadoop-user:root /opt/spark* -R # Change owner to hadoop-user, group to root
sudo chmod g+rwx /opt/spark* -R # Allow group to read-write-execute
exit # Logout from node2

Add Spark paths to $PATH variable on all machines and export $SPARK_HOME

# master
echo '
# For Spark
export PATH=$PATH:/opt/spark/bin
export SPARK_HOME=/opt/spark
' >> ~/.bashrc
source ~/.bashrc # Reload the changed bashrc file
echo $PATH $SPARK_HOME # Confirm that $PATH and $SPARK_HOME variable is changed properly

# node1
ssh node1
echo '
# For Spark
export PATH=$PATH:/opt/spark/bin
export SPARK_HOME=/opt/spark
' >> ~/.bashrc
source ~/.bashrc # Reload the changed bashrc file
echo $PATH $SPARK_HOME # Confirm that $PATH and $SPARK_HOME variable is changed properly
exit

# node2
ssh node2
echo '
# For Spark
export PATH=$PATH:/opt/spark/bin
export SPARK_HOME=/opt/spark
' >> ~/.bashrc
source ~/.bashrc # Reload the changed bashrc file
echo $PATH $SPARK_HOME # Confirm that $PATH and $SPARK_HOME variable is changed properly
exit

Configure Spark

We will edit $SPARK_HOME/conf/spark-env.sh on the master machine. But first, rename spark-env.sh.template to spark-env.sh.

mv $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh

Open $SPARK_HOME/conf/spark-env.sh with a text editor (e.g. GNU Emacs, Vim, Gedit, Nano) and add the following lines to set the required parameters.

export SPARK_MASTER_HOST=master
export JAVA_HOME=/usr/lib/jvm/current-java
$SPARK_HOME/conf/spark-env.sh
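Optionally (this is not part of the configuration above, just a suggestion), you can also cap the resources each standalone worker advertises by adding the standard SPARK_WORKER_CORES and SPARK_WORKER_MEMORY settings to the same file. The values below are assumptions based on the VM sizes used in this article.

export SPARK_WORKER_CORES=4 # Assumed value matching the 4-CPU VMs used here
export SPARK_WORKER_MEMORY=4g # Assumed value, leaving headroom on the 5 GB VMs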

On the master machine, we will edit the $SPARK_HOME/conf/workers file. But first, rename workers.template to workers.

mv $SPARK_HOME/conf/workers.template $SPARK_HOME/conf/workers

Open the $SPARK_HOME/conf/workers file with a text editor. If there is a localhost line in the file, delete it. Just add the following lines

node1
node2
$SPARK_HOME/conf/workers

Start Spark on the master machine

$SPARK_HOME/sbin/start-all.sh
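You can also verify the Spark daemons with jps: on the master you should now additionally see a Master process, and on node1 and node2 a Worker process.

jps # On the master: Master (plus the Hadoop/YARN daemons started earlier)
ssh node1 jps # On the workers: Worker
ssh node2 jps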

To test the Spark installation, go to http://master:8080/ (or ip:port, e.g. http://172.20.93.59:8080/).

Spark UI

Thus, we have successfully completed the Hadoop & YARN and Spark cluster installation. Our cluster is now operational with one master and two nodes. In the next and final step, we will develop a sample project with HDFS and Spark.

Sample project with Hadoop and Spark

In this example, we will write a Spark application that counts the words in a text file, using the Spark cluster and HDFS that we just created.

First, create a file named string.txt on the master machine and write the following in it.

Apache Spark is an open-source distributed computing system that can process large datasets quickly.
It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.
Spark is designed for speed and ease of use.
It can perform in-memory computations to increase processing speed.
The core abstraction of Spark is the resilient distributed dataset (RDD), a fault-tolerant collection of elements that can be processed in parallel.

string.txt file

Send/write the string.txt file to HDFS with the following command.

hdfs dfs -copyFromLocal string.txt /string.txt

View string.txt file via Hadoop UI
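You can also verify the upload from the command line:

hdfs dfs -ls / # string.txt should be listed at the HDFS root
hdfs dfs -cat /string.txt # Print the file contents back from HDFS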

Then, let’s complete the PySpark installation on the master machine with the following commands.

sudo apt install python3-pip
pip3 install pyspark

After the installation, open a shell with the python3 command. From now on, we will continue through this shell.

Python shell

Write the following code in the Python shell and run it.

from pyspark.sql import SparkSession

# Connect to the standalone Spark master and limit executor resources
spark = SparkSession.builder \
    .appName("PythonWordCount") \
    .master("spark://master:7077") \
    .config("spark.executor.cores", "1") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()

hdfs_path = "hdfs://master:9000/string.txt"

# Read the file from HDFS and take the text of each line
lines = spark.read.text(hdfs_path).rdd.map(lambda r: r[0])

# Split the lines into words and count the occurrences of each word
word_counts = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x + y)

for (word, count) in word_counts.collect():
    print(f"{word}: {count}")

After running the Python-Spark commands, you should get an output like the following.

Spark Job output

In the picture below, you can see information about the operation of the application via SparkUI.

Information of the running Spark application

After the commands completed successfully and the output was obtained, we observed the running application through the Spark UI. All operations were completed without any problems.
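If you want to run the same job non-interactively instead of through the Python shell, you could save the code above to a file (the name wordcount.py below is just an example) and submit it with spark-submit; calling spark.stop() at the end of the script releases the executors when the job finishes.

$SPARK_HOME/bin/spark-submit --master spark://master:7077 wordcount.py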

Thus, we have successfully completed the process of running Hadoop & YARN and Apache Spark as a distributed cluster. Take care, see you in the next articles :)
