How to install Apache Spark on Windows 10

This guide is for beginners who want to install Apache Spark on a Windows machine. I will assume that you have a 64-bit Windows version and that you already know how to add environment variables on Windows.

Note: you don’t need any prior knowledge of the Spark framework to follow this guide.

1. Install Java

First, we need to install Java to execute Spark applications. Note that you don't need the full JDK if you just want to run Spark applications and won't develop new ones in Java; the JRE is enough. However, in this guide we will install the JDK.

To do that, go to this page and download the latest version of the JDK. After you install it, add a JAVA_HOME variable to your System Variables and make sure its value points to the parent folder of the JDK (see the figure below for a demonstration).

After you add this variable, it's time to modify the Path system variable and add a new entry: %JAVA_HOME%\bin. This lets the Windows command line recognize Java commands.
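If you prefer the command line to the Environment Variables dialog, the same thing can be done with Windows' setx command from an elevated Command Prompt. The JDK path below is just an example; point it at your own installation:

```shell
:: Set JAVA_HOME as a system-wide variable (/M requires an elevated prompt).
:: The JDK path is an example; adjust it to your installed version.
setx /M JAVA_HOME "C:\Program Files\Java\jdk1.8.0_144"

:: Append %JAVA_HOME%\bin to the system Path.
:: Caution: setx truncates values longer than 1024 characters,
:: so the graphical editor is safer if your Path is already long.
setx /M PATH "%PATH%;%JAVA_HOME%\bin"
```

Open a new Command Prompt afterwards; setx changes don't affect windows that are already open.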

Now, start the Command Line and type:

java -version

to check that Java was installed correctly.

2. Install Scala

Download the Scala Windows installer from this page: scroll down to the “Other resources” section and download the MSI file for Windows (see the figure below). Install it, then add a new variable named SCALA_HOME to your System Variables, pointing to the parent folder of Scala. Finally, add %SCALA_HOME%\bin to the Path system variable.

3. Spark binaries

Since it's not easy to build Spark from source on Windows, we will download a pre-built package that contains all the binaries needed to run Spark. Go to this page and choose the latest stable version pre-built for Hadoop 2.7 and later (see the figure below). Extract the compressed file to any location you choose, making sure the path to that location doesn't contain any spaces. I suggest placing the Spark folder directly at the root of a partition (C:\ for example).

Add a new variable to your System Variables and name it SPARK_HOME. This variable holds the path of the Spark parent directory (C:\spark-2.2.0-bin-hadoop2.7 for example). After that, add %SPARK_HOME%\bin to the Path system variable.

4. Hadoop WinUtils

Since we are using Spark binaries pre-built for Hadoop, we also need Hadoop's Windows binaries to run them. Create a new folder named “WinUtils” at the root of any partition (C:\WinUtils for example). Then go to this page and download the repository by clicking the green button on the right and choosing the “Download ZIP” option. After the download finishes, extract the ZIP file and copy the contents of the “hadoop-2.7.1” folder into the WinUtils folder (don't copy the whole directory, just its content: the bin folder).

Note that you can use another folder/location for the Hadoop Windows binaries; we used this layout simply to keep things organized.

Now, add a HADOOP_HOME variable to your System Variables and make it point to the WinUtils folder (C:\WinUtils in this case).
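As with Java, the variables from sections 2, 3 and 4 can also be set with setx from an elevated Command Prompt. The paths below are examples and should match where you actually installed Scala, Spark and WinUtils:

```shell
:: Run from an elevated Command Prompt; all paths are examples.
setx /M SCALA_HOME "C:\Program Files (x86)\scala"
setx /M SPARK_HOME "C:\spark-2.2.0-bin-hadoop2.7"
setx /M HADOOP_HOME "C:\WinUtils"

:: Only Scala and Spark need their bin folders on the Path;
:: Spark finds winutils.exe through HADOOP_HOME itself.
setx /M PATH "%PATH%;%SCALA_HOME%\bin;%SPARK_HOME%\bin"
```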

Note: make sure that the variables we added above point to parent directories and not to bin folders!

5. Run Spark shell

Run Command Line as an administrator and type:

spark-shell

If everything works, you will end up with output like this:

6. Run a sample Spark application

Spark comes with various examples that you can run directly from the command line using this command:

run-example

Let's run a sample app that computes an approximate value of pi:

run-example SparkPi
run-example SparkPi 10

We ran it twice: the first run uses the default number of partitions (2) and the second uses 10.
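For intuition, SparkPi estimates pi with a Monte Carlo method: it throws random points at a square and counts how many land inside the inscribed circle. Here is a minimal single-machine sketch of the same idea in plain Scala (no Spark needed; estimatePi is a name chosen for illustration, not part of Spark's API):

```scala
import scala.util.Random

// Monte Carlo estimate of pi: sample points uniformly in the square
// [-1, 1] x [-1, 1] and count those inside the unit circle.
// The ratio inside/total approaches pi/4 as the sample count grows.
def estimatePi(samples: Int, seed: Long = 42L): Double = {
  val rng = new Random(seed)
  val inside = (1 to samples).count { _ =>
    val x = rng.nextDouble() * 2 - 1
    val y = rng.nextDouble() * 2 - 1
    x * x + y * y <= 1.0
  }
  4.0 * inside / samples
}

println(estimatePi(1000000))
```

SparkPi does essentially this, but splits the samples across partitions and sums the per-partition counts, which is why changing the partition count mainly changes parallelism, not accuracy.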

Results:

Right: default partitions value, Left: partitions = 10

That's all. Thank you for reading this post, and I hope this simple guide helps you install Apache Spark on your own Windows machine.


Originally published at guendouz.wordpress.com on July 18, 2017.
