Install Apache Spark (pyspark) — Standalone mode

Siva Chaitanya
2 min read · May 18, 2018

While there is a lot of documentation on how to use Spark, I could not find a post that walks through installing Apache Spark from scratch on a machine to set up a standalone cluster. Hence, this post.

A mighty flame followeth a tiny spark — Dante

If you are here, you are already aware of the advantages of distributed computing. So, let’s get started. Here is everything you need to set up your own cluster:
1. One or more machines with Ubuntu 16.04
2. Internet connectivity
3. Snacks to eat while you wait

This post is divided into three parts as below:
1. Installing the necessary dependencies
2. Installing pyspark and hadoop
3. Setting up pyspark configuration

Installing necessary dependencies:

Update your system and install tmux using the following commands:

sudo apt-get update
sudo apt-get install tmux

If you are not aware of tmux and what it does, my friend, there is a lot more for you to explore in this world, and it is out of scope of this post.

Install Oracle Java 8:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
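
Once the installer finishes, a quick sanity check confirms that Java 8 is the default (the exact build number will vary):

java -version
# expect output mentioning: java version "1.8.0_..."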

While you can install Spark on its own, Python has more utility in the developer world, so I picked pyspark for this exercise. You can either install Python alone or get Anaconda. I prefer Anaconda as it comes with almost all the common libraries, Jupyter Notebook, etc. The steps to download and install it are as follows:

wget https://repo.anaconda.com/archive/Anaconda3-5.1.0-Linux-x86_64.sh
sh Anaconda3-5.1.0-Linux-x86_64.sh

At the end of the installation, you can let the installer add Anaconda to your PATH, or skip that step and add it manually (the better choice):

export PATH=~/anaconda3/bin:$PATH
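
If you skipped the installer’s PATH prompt, the export above does the same thing for the current shell. Assuming the default ~/anaconda3 install location, a quick check confirms the Anaconda Python is the one being picked up:

which python      # should point to ~/anaconda3/bin/python
python --version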

Update pip and install pyspark:

pip install --upgrade pip
pip install pyspark
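
To confirm pyspark installed correctly, you can import it and print its version:

python -c "import pyspark; print(pyspark.__version__)"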

On the Apache Spark website, you can download Spark alone or bundled with Hadoop. A lot of tasks need Hadoop, and it is better to configure it early on rather than figure it out later. Currently, Apache Spark ships with Hadoop 2.7.0, and even though there are more recent Hadoop versions, I have faced issues while using those:

wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.0/hadoop-2.7.0.tar.gz
tar xvfz hadoop-2.7.0.tar.gz

Use the below configuration steps to set up Hadoop in your path:

export HADOOP_HOME=~/hadoop-2.7.0
export PATH=$HADOOP_HOME/bin:$PATH
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
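
With those exports in place, the hadoop binary should be on your PATH for the current shell; a quick way to verify:

hadoop version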

YARN can be used to run the cluster as well. Configuring it is easy:

export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
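
As an aside, once a YARN ResourceManager is up and running (starting YARN’s daemons is beyond the scope of this post), spark-submit can target YARN instead of the standalone master; my_app.py below is a placeholder for your own application:

spark-submit --master yarn --deploy-mode client my_app.py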

Hadoop 2.7.0 ships with all the necessary jar files. We need to add them to the Hadoop classpath:

export HADOOP_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/*

Adding the Hadoop classpath to the Spark distribution classpath is a necessary step for running in cluster mode. This step is necessary when running on AWS EMR too.

export SPARK_DIST_CLASSPATH=`hadoop classpath`
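
Note that all of the exports above only last for the current shell session. To persist them across new terminals and reboots, you can append them to ~/.bashrc; a minimal sketch, assuming the home-directory install locations used throughout this post:

cat >> ~/.bashrc << 'EOF'
export PATH=~/anaconda3/bin:$PATH
export HADOOP_HOME=~/hadoop-2.7.0
export PATH=$HADOOP_HOME/bin:$PATH
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/*
export SPARK_DIST_CLASSPATH=`hadoop classpath`
EOF
source ~/.bashrc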

Now you can start your master and worker nodes and submit your applications in either cluster or client mode to distribute your workloads.
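
As a sketch of what that looks like on the standalone cluster (assuming Spark’s sbin scripts are on your PATH; with a pip install of pyspark they live under the package’s sbin directory), with my_app.py as a placeholder for your own application:

# on the master machine (the master URL is printed in its log and on the web UI at port 8080)
start-master.sh

# on each worker machine, pointing at the master (7077 is the default master port)
start-slave.sh spark://<master-host>:7077

# submit an application in client mode
spark-submit --master spark://<master-host>:7077 my_app.py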

Happy Distributing!
