How to set up Apache Spark (PySpark) on a Jupyter/IPython Notebook

Learn how to set up Apache Spark on Windows or macOS in under 10 minutes!

Ashish Shah
3 min read · May 1, 2018

To know more about Apache Spark, check out my other post!

Let’s get started!

  1. Prerequisite: You should have Java installed on your machine. If you don’t have Java on your machine, please go to this site and download Java.
  2. You can verify that Java is installed by running java -version in the terminal. If Java is already installed on your system, you will see a response listing the installed version.
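If you’d rather check from inside Python, here is a minimal sketch that just shells out to the same command:

import subprocess

# Runs `java -version`; Java writes its version info to stderr,
# so a version string appearing here means Java is on your PATH.
subprocess.run(["java", "-version"], check=True)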

3. Download Apache Spark from this site and extract it into a folder. I extracted mine to ‘C:/spark/spark’.

4. You need to set 3 environment variables. You can set them system-wide, or just for the current session as shown in the sketch after this list.
a. HADOOP_HOME (create this path even if it doesn’t exist)

b. JAVA_HOME

c. SPARK_HOME (this should be the same location as the folder you extracted Apache Spark to in Step 3)
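These are normally set system-wide (on Windows, through the Environment Variables dialog). If you only need them for the notebook workflow, a minimal per-session sketch follows; the paths are examples, so substitute your own locations (the JAVA_HOME shown is hypothetical):

import os

# Example values only -- adjust every path to match your own machine.
os.environ["HADOOP_HOME"] = "C:/spark/spark"  # folder whose bin/ holds winutils.exe (Step 5)
os.environ["JAVA_HOME"] = "C:/Program Files/Java/jdk1.8.0_171"  # hypothetical JDK location
os.environ["SPARK_HOME"] = "C:/spark/spark"  # the folder from Step 3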

5. Windows users, download this file and extract it to the path ‘C:\spark\spark\bin’

This is a Hadoop binary for Windows, taken from Steve Loughran’s GitHub repo. In that repo, open the folder matching the Hadoop version your Spark distribution was built for and grab winutils.exe from its /bin directory. For example: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe

6. Open the terminal, go to the path ‘C:\spark\spark\bin’ and type ‘spark-shell’.

Spark is up and running!

Now let’s run this in a Jupyter notebook.

7. Install the ‘findspark’ Python module through the Anaconda Prompt or terminal by running python -m pip install findspark.
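findspark locates your Spark installation at runtime and adds it to Python’s module path, which is what makes import pyspark work in a plain notebook. If SPARK_HOME is set, findspark.init() needs no arguments; otherwise you can pass the Spark folder from Step 3 explicitly (the path below is an example):

import findspark

# If SPARK_HOME isn't set, point findspark at the Spark folder directly.
findspark.init("C:/spark/spark")  # example path -- use the folder from Step 3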

8. To run Jupyter notebook, open the command prompt/Anaconda Prompt/Terminal and run jupyter notebook.

If you don’t have Jupyter installed, I’d recommend installing the Anaconda distribution.

Open a new Python 3 notebook.

import findspark
findspark.init()  # locates Spark via SPARK_HOME and adds pyspark to the module path

import pyspark  # only import this after findspark.init()
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession and run a trivial query as a smoke test
spark = SparkSession.builder.getOrCreate()

df = spark.sql("select 'spark' as hello")
df.show()

Paste this code into a cell and run it. If you see the following output, then PySpark is installed and working on your system!
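+-----+
|hello|
+-----+
|spark|
+-----+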

Please leave a comment in the section below if you have any questions.

Stay tuned for more on Apache Spark!
