Installing and using PySpark on Windows machine

Installation steps simplified (and automated to certain extent…)

Image source: https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Apache_Spark_logo.svg/1280px-Apache_Spark_logo.svg.png

Installing Prerequisites

  1. Java
    Spark runs on the Java Virtual Machine, so a Java 8 runtime is required. Check for an existing installation by running the below command in a command prompt:
    java -version
    java version "1.8.0_271"
    Java(TM) SE Runtime Environment (build 1.8.0_271-b09)
    Java HotSpot(TM) 64-Bit Server VM (build 25.271-b09, mixed mode)
    If Java is not installed, download and install JRE 8, or get the .tar.gz archive and extract it to a folder of your choice:
    tar -xvzf jre-8u271-windows-x64.tar.gz
  2. Python
    A Python installation is also required; the Anaconda distribution works well. Verify it with:
    python --version
    Python 3.7.9

Scripted setup

The following steps can be scripted as a batch file and run in one go. The script is provided after the walkthrough below.

Getting the Spark files

Download the required Spark version from the Apache Spark downloads website. Get the ‘spark-x.x.x-bin-hadoop2.7.tgz’ file, e.g. spark-2.4.3-bin-hadoop2.7.tgz, and extract it using the below command (or a tool such as 7-Zip):

tar -xvzf spark-2.4.3-bin-hadoop2.7.tgz

Putting everything together

Setup folder

Create a folder for the Spark installation at a location of your choice, e.g. C:\spark_setup, and move the extracted spark-2.4.3-bin-hadoop2.7 folder into it.

Adding winutils.exe

From this GitHub repository, download the winutils.exe file corresponding to your Spark and Hadoop version, and place a copy of it in each of the below folders (create the hadoop\bin folder if it does not exist):

  • C:\spark_setup\spark-2.4.3-bin-hadoop2.7\bin
  • C:\spark_setup\spark-2.4.3-bin-hadoop2.7\hadoop\bin
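Assuming winutils.exe was downloaded to the current working directory, the same can be done from a command prompt:

:: create the hadoop\bin folder, then copy winutils.exe into both bin folders
mkdir C:\spark_setup\spark-2.4.3-bin-hadoop2.7\hadoop\bin
copy winutils.exe C:\spark_setup\spark-2.4.3-bin-hadoop2.7\bin
copy winutils.exe C:\spark_setup\spark-2.4.3-bin-hadoop2.7\hadoop\bin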

Setting environment variables

We have to set up the below environment variables to let Spark know where the required files are. Add them as user variables via ‘Edit the system environment variables’ → ‘Environment Variables…’. A scripted alternative is shown after this list.

  1. Variable name: SPARK_HOME
    Variable value: C:\spark_setup\spark-2.4.3-bin-hadoop2.7 (path to the setup folder)
  2. Variable name: HADOOP_HOME
    Variable value: C:\spark_setup\spark-2.4.3-bin-hadoop2.7\hadoop
    OR
    Variable value: %SPARK_HOME%\hadoop
  3. Variable name: JAVA_HOME
    Variable value: Set it to the Java installation folder, e.g. C:\Program Files\Java\jre1.8.0_271
    Find it in ‘Program Files’ or ‘Program Files (x86)’ based on which version was installed above. In case you used the .tar.gz version, set the path to the location where you extracted it.
  4. Variable name: PYSPARK_PYTHON
    Variable value: python
    This environment variable is required to ensure that tasks involving Python workers, such as UDFs, work properly. Refer to this StackOverflow post.
  5. Select the ‘Path’ variable and click on ‘Edit…’, then add a new entry: %SPARK_HOME%\bin. This makes the spark-class, spark-submit and pyspark commands available from any prompt.

To have pyspark launch a Jupyter notebook instead of the plain shell, additionally set the below two variables:

  1. Variable name: PYSPARK_DRIVER_PYTHON
    Variable value: jupyter
  2. Variable name: PYSPARK_DRIVER_PYTHON_OPTS
    Variable value: notebook
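The same variables can also be set from a command prompt with setx; a minimal sketch, assuming the paths used above (setx changes only take effect in newly opened prompts):

setx SPARK_HOME "C:\spark_setup\spark-2.4.3-bin-hadoop2.7"
setx HADOOP_HOME "C:\spark_setup\spark-2.4.3-bin-hadoop2.7\hadoop"
setx JAVA_HOME "C:\Program Files\Java\jre1.8.0_271"
setx PYSPARK_PYTHON "python"
:: appends to the combined user+system Path; note that setx truncates values beyond 1024 characters
setx Path "%Path%;C:\spark_setup\spark-2.4.3-bin-hadoop2.7\bin"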

Scripted setup

Edit and use the below script to (almost) automate the PySpark setup process.
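A minimal sketch of such a batch script, assuming the Spark archive and winutils.exe have already been downloaded to the folder the script is run from (edit the version, setup path and Java location before use; the built-in tar requires Windows 10 or later):

@echo off
:: edit these two values to match your download and preferred location
set SPARK_PKG=spark-2.4.3-bin-hadoop2.7
set SETUP_DIR=C:\spark_setup

:: extract the Spark archive into the setup folder
mkdir "%SETUP_DIR%"
tar -xvzf %SPARK_PKG%.tgz -C "%SETUP_DIR%"

:: place winutils.exe in both bin folders
mkdir "%SETUP_DIR%\%SPARK_PKG%\hadoop\bin"
copy winutils.exe "%SETUP_DIR%\%SPARK_PKG%\bin"
copy winutils.exe "%SETUP_DIR%\%SPARK_PKG%\hadoop\bin"

:: persist the environment variables (edit JAVA_HOME to your Java folder)
setx SPARK_HOME "%SETUP_DIR%\%SPARK_PKG%"
setx HADOOP_HOME "%SETUP_DIR%\%SPARK_PKG%\hadoop"
setx JAVA_HOME "C:\Program Files\Java\jre1.8.0_271"
setx PYSPARK_PYTHON "python"
setx Path "%Path%;%SETUP_DIR%\%SPARK_PKG%\bin"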

Using PySpark in standalone mode on Windows

If the below commands don’t work, restart your machine so that the new environment variables take effect.

Commands

Run each of the below commands in a separate Anaconda Prompt.

  1. Deploying the master
    spark-class.cmd org.apache.spark.deploy.master.Master -h 127.0.0.1
    Open your browser and navigate to http://localhost:8080/. This is the Spark UI; by default, the master listens for workers on port 7077.
  2. Deploying a worker
    spark-class.cmd org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077
    The Spark UI will now show the worker’s status.
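With the master and worker running, a PySpark session can be attached to the standalone cluster from a third prompt:

:: connect the PySpark shell to the local standalone master
pyspark --master spark://127.0.0.1:7077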

Alternative

Run the below command to start a pyspark session (shell or Jupyter notebook, depending on the driver variables set earlier) using all the resources available on your machine. Activate the required Python environment before running the pyspark command.
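A minimal sketch, using Spark’s local mode:

:: local[*] runs Spark in-process with one worker thread per logical core
pyspark --master local[*]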

