Installation of Apache Spark — PySpark on Linux — Ubuntu

Ankit Mistry
Sep 6, 2018

Different Ways of Installation

Getting a Linux Machine Ready with Spark

Windows, Linux, Mac

If you are on Linux, you are good to go.

If you are on Windows or Mac, you have two options:

Virtual machine: Linux + VirtualBox

Cloud setup: an on-demand instance with Hadoop + Spark pre-installed

Rent a Linux virtual machine from a cloud provider and install Apache Spark yourself:
Amazon EC2
GCP
Databricks
Digital Ocean
Rackspace
Linode

This guide takes the Windows route: a DigitalOcean Linux instance + Spark.

If you are looking for installation on a Windows machine, check:

https://medium.com/@ankit.25587/installation-of-apache-spark-on-windows-4e4e4141f877

Installation Part — 1 Cloud Digital Ocean Setup

Create a DigitalOcean account (a credit card is required)

Create a Droplet (a virtual machine instance)

Connect to the virtual instance over SSH (port 22):
ssh root@159.65.177.136   (use your own droplet's IP address)

Installation Part — 2 Jupyter notebook + Python3

Check whether Python 2 and Python 3 are installed. If Jupyter notebook is not installed, try:

sudo apt install python3
sudo apt install python3-pip
pip3 install jupyter
jupyter notebook --ip=$ip --allow-root
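
A quick sanity check from the python3 REPL (a minimal sketch; the notebook package is what pip3 install jupyter pulls in):

import sys
print(sys.version)            # expect a 3.x version string

import notebook               # Jupyter's notebook server package
print(notebook.__version__)   # confirms the pip3 install worked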

Installation Part — 3 Install Scala, Java, Py4j, Spark

Why Scala?
Spark is written in Scala.
sudo apt-get install scala
scala -version

Why Java?
Scala code compiles to JVM bytecode, so Spark needs a Java runtime.
The full JDK is not needed; the JRE is sufficient.
sudo apt-get install default-jre
java -version

Why Py4j?
Py4j is a bridge from Python to Java; PySpark uses it to talk to the JVM (see the sketch after this section).
pip3 install py4j

Spark:
wget http://redrockdigimark.com/apachemirror/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
(change the URL to your nearest mirror)

tar -xvf spark-2.2.1-bin-hadoop2.7.tgz
mv spark-2.2.1-bin-hadoop2.7 spark
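
Under the hood, PySpark uses Py4j to drive the JVM. A minimal standalone sketch of the idea (it assumes a Java GatewayServer is already listening on Py4j's default port; PySpark starts and manages one for you):

from py4j.java_gateway import JavaGateway

# Connect to a running Java GatewayServer on the default port.
gateway = JavaGateway()
# Instantiate a JVM object and call its methods from Python.
random = gateway.jvm.java.util.Random()
print(random.nextInt(10))   # the integer is produced on the Java side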

Installation Part — 4 Set Path and start Jupyter notebook

export SPARK_HOME='/root/spark'   # change to the directory where you moved Spark
export PATH=$SPARK_HOME/bin:$PATH

export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3

Add these lines to ~/.bashrc, then reload it:
. ~/.bashrc
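
Alternatively, an optional sketch: the findspark package (not part of the steps above; pip3 install findspark) can locate Spark at runtime, so the PYTHONPATH export is not needed:

import findspark
findspark.init('/root/spark')   # same path as SPARK_HOME above

import pyspark                  # now importable without editing PYTHONPATH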

Verify :

From the command line:

python3
import pyspark

From a Jupyter notebook:

jupyter notebook --ip=$ip --allow-root   (--allow-root is needed only if you are running as the root user)
import pyspark
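
Once import pyspark succeeds, a short end-to-end check (a minimal sketch; the app name and numbers are arbitrary):

from pyspark import SparkContext

# Start a local Spark context using all available cores.
sc = SparkContext('local[*]', 'InstallCheck')
rdd = sc.parallelize(range(100))
print(rdd.sum())   # expect 4950 if Spark is working
sc.stop()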

Check out the complete course on Apache Spark with Python (PySpark) at

https://www.udemy.com/machine-learning-and-bigdata-analysis-with-apache-spark-python-pyspark/?couponCode=KNOWLEDGEISPOWER
