Installing Apache Spark 2.x on Cloudera Quickstart VM

Ali Yesilli
Sep 23, 2018


This tutorial guides you through downloading and running Apache Spark 2.x on the Cloudera Quickstart VM. Apache Spark 2.3.1 was released on 8 June 2018, so if you want to practice and work with Spark 2.x features, it is time to upgrade.
Let’s start.

I am using Cloudera Quickstart VM 5.13.0, which ships with Spark 1.6.0. Before we install Spark 2.x, we need to know the current Java and Hadoop versions. First, let’s check the Java version:

$ java -version
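On Quickstart VM 5.13.0 it reports something like this (the exact build number on your VM may differ):

java version "1.7.0_67"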

It is 1.7, but Spark 2.x needs 1.8, so we have to upgrade Java. First we should remove the current Java:

$ sudo yum remove java

It is removed. To install the OpenJDK 8 JRE using yum, run this command:

$ sudo yum install java-1.8.0-openjdk

The installation has finished, so now you can set the JAVA_HOME environment variable as below:

$ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
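Note that export only applies to the current shell session. To make the setting permanent, you can append the same line to ~/.bashrc:

$ echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64' >> ~/.bashrc
$ source ~/.bashrc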

It is done! Let’s check the Java version again:
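$ java -version

The output should now report an OpenJDK 1.8.0 build.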

It is time to download the newest Spark version, but before we do, we need to know the Hadoop version:

$ hadoop version
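On Quickstart VM 5.13.0 the first line of the output reads something like:

Hadoop 2.6.0-cdh5.13.0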

Hadoop 2.6 is installed, which means we should download a Spark build that is prebuilt for Hadoop 2.6. Go to the Downloads page of the Apache Spark website.

You can choose any 2.x version; the newest is 2.3.1 at the time of writing, so I select it. Be sure to choose Apache Hadoop 2.6 as the package type in the second option. The default is Apache Hadoop 2.7, so don’t forget to change it. OK, let’s click the download link. The archive is 213 MB.
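If you prefer to stay in the terminal, the same archive can also be downloaded directly from the Apache archive (the URL below assumes the 2.3.1 / Hadoop 2.6 build; run it from the Downloads directory to match the next step):

$ wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.6.tgz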

Unpack the archive and move the folder to /usr/local:

$ cd ~/Downloads/
$ tar -xvf spark-2.3.1-bin-hadoop2.6.tgz
$ sudo mv spark-2.3.1-bin-hadoop2.6 /usr/local/spark

Let’s check if it works:

$ /usr/local/spark/bin/pyspark

Great! It is working, but we should make it the default Spark. Otherwise we would have to run it with its full path every time.

To do this, we should edit the three files below, replacing the old Spark home path with the new version’s path:

/usr/bin/pyspark
/usr/bin/spark-shell
/usr/bin/spark-submit

So let’s change “/usr/lib/spark” to “/usr/local/spark” in each file:

$ sudo vi /usr/bin/pyspark

After you open the file with the vi editor, move the cursor to the “l” of “lib” and press “x” three times to delete it, move back one character (Backspace), then press “a” and type “local”. When you are finished, press Esc and save with “:wq” followed by Enter. (If something goes wrong, quit without saving with “:q!” and try again.) That’s it!
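If you are not comfortable with vi, a sed one-liner can make the same replacement in all three files at once (this assumes each file contains the literal path /usr/lib/spark, as it does on the Quickstart VM):

$ sudo sed -i 's#/usr/lib/spark#/usr/local/spark#g' /usr/bin/pyspark /usr/bin/spark-shell /usr/bin/spark-submit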

Do the same for the other two files (or use the sed one-liner above). After that, we are ready. Run pyspark:
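$ pyspark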

As you can see, it is 2.3.1! Great, now let’s check spark-shell:
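$ spark-shell

The startup banner should again report version 2.3.1.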

Perfect! The upgrade has finished successfully, and we are ready to work with the newest version of Apache Spark!
