PySpark Setup

Kaushik G
8 min read · Mar 4, 2024


PySpark is a Python API for Apache Spark. It allows users to write Python and SQL-like commands to analyse and manipulate data in a distributed processing environment 💻.
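For a flavour of what that looks like, here is a minimal sketch of mixing Python and SQL in a PySpark session (the table name and sample rows are invented for illustration); you will be able to run it once the setup below is complete:

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("demo").getOrCreate()

# A tiny DataFrame, registered as a temporary SQL view
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")

# The same data can now be queried with SQL-like commands
spark.sql("SELECT name FROM people WHERE age > 30").show()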

PySpark excels at handling enormous, complex datasets, particularly those emerging from new sources that can quickly accumulate gigabytes or more. While you can run it on your local computer, its true potential is unlocked when utilized in a distributed computing environment, like the one provided by Amazon Web Services (AWS) ☁️. Installing and experimenting with PySpark locally is comparable to owning a sports car limited by a track’s constraints. Achieving unparalleled performance on massive datasets requires a distributed system like AWS to harness PySpark’s full potential.

[Header image: Implementing Expected Percentile Rank metric in Spark | Polar Tropics]

The following outlines the steps for installing PySpark on AWS:

We can do all of this using AWS services that fall within the free tier.

The details of setting up the EC2 instance are covered in a separate story; please refer to that story for the EC2 setup information.

After following the steps provided in the EC2 setup story, we can proceed to install PySpark.

Following the successful creation of an instance, we now need to connect to it in order to install PySpark and its components. Establish the connection to the Ubuntu AMI as specified in the setup information, for example:

$ ssh -i pysparksetup.pem ubuntu@ec2-44-206-224-236.compute-1.amazonaws.com

The key file name and hostname will vary from person to person.

Run the following commands in the terminal session that is connected to our virtual Ubuntu AMI.

Downloading Anaconda:

$ wget http://repo.continuum.io/archive/Anaconda3-2020.02-Linux-x86_64.sh

wget is a command-line utility for downloading files from the web.

Anaconda bundles Python together with Jupyter Notebook, which is what lets us access JUPYTER NOTEBOOK remotely later on.

Installation Command:

$ bash Anaconda3-2020.02-Linux-x86_64.sh

Once the installer script has been downloaded with wget, this command executes the Anaconda installer script, which starts the installation of Anaconda on the Linux system.

Changing Environment:

$ source ~/.bashrc

When you run $ source ~/.bashrc, it tells the current Bash session to re-read and execute the commands from the .bashrc file. This is useful when you've made changes to the .bashrc file, and you want those changes to take effect without having to close and reopen the terminal.

Alternatively, you can use the . (dot) command as a shorthand for source, so the command can also be written as:

$ . ~/.bashrc

Both commands achieve the same result of applying changes from the .bashrc file to the current Bash session.
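If the next step still shows the system Python instead of Anaconda’s, one common fix (a sketch, assuming Anaconda was installed to the default ~/anaconda3 location) is to append Anaconda’s bin directory to the PATH in .bashrc yourself and re-source the file:

$ echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc

$ source ~/.bashrc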

Check the version of Python Installed with Anaconda:

$ which python

/home/ubuntu/anaconda3/bin/python

[Screenshot: checking the Python path]

The path indicates that the Python interpreter is located in the specified path within the Anaconda distribution.

$ python --version

This command displays the version of the Python interpreter currently active on the system (the Anaconda3-2020.02 build ships a Python 3.7 release). We need to make sure Python is available because PySpark is a Python API and cannot run without it.

Setting up SSL/TLS encryption for Jupyter Notebook:

$ jupyter notebook --generate-config

This command generates a default Jupyter Notebook configuration file if it doesn’t already exist. The configuration file (jupyter_notebook_config.py) is needed to customize the behaviour of Jupyter Notebook.

$ mkdir certs

This command creates a directory named “certs” to store the SSL/TLS certificate files.

$ cd certs

Changes our current working directory to the newly created “certs” directory.

$ sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem

We use the OpenSSL tool to generate a self-signed SSL/TLS certificate (mycert.pem). This certificate is necessary for encrypting the communication between our web browser and the Jupyter Notebook server, providing a secure HTTPS connection. The command includes options to create a certificate that is valid for 365 days (-days 365), is not encrypted with a passphrase (-nodes), and uses a 1024-bit RSA private key (-newkey rsa:1024). The private key (mycert.pem) is generated without a passphrase, making it more suitable for automated or non-interactive usage.
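If you want to sanity-check the certificate that was just generated, OpenSSL can print its validity window (this step is purely optional):

$ openssl x509 -in mycert.pem -noout -dates

This prints the notBefore and notAfter dates of the certificate.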

$ sudo chmod 777 mycert.pem

Modifies the permissions of the generated SSL/TLS certificate (mycert.pem) to allow broad read, write, and execute permissions. This may not be the most secure setting but is done here to ensure that Jupyter Notebook can read the certificate.

$ cd ~/.jupyter/

Changes the current working directory to the Jupyter configuration directory.

$ vi jupyter_notebook_config.py

Opens the Jupyter Notebook configuration file (jupyter_notebook_config.py) using the vi text editor. You would typically make manual edits to this file to configure Jupyter Notebook settings, including specifying the path to the SSL/TLS certificate and enabling HTTPS.

[Screenshot: opening the config file in vi]

After you open the configuration file, press “i” to enter insert mode and add the following snippet.

# Configuration file for jupyter-notebook.
c = get_config()

# Notebook config: this is where you saved your pem cert
c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'

# Listen on all IPs
c.NotebookApp.ip = '0.0.0.0'

# Allow all origins
c.NotebookApp.allow_origin = '*'

# Do not open a browser by default
c.NotebookApp.open_browser = False

# Fix the port to 8891
c.NotebookApp.port = 8891

[Screenshot: the completed config file]

After adding this, press Esc and then type “:wq”, which writes the file and quits the editor.

Get back to the home directory:

Now return to the home directory and run the Jupyter Notebook command:

$ cd # to get back to the home directory from the config directory

# To run the Jupyter Notebook run the below command

$ jupyter notebook

Now open your browser; I used Google Chrome, but you can use whichever browser you prefer. If you use Google Chrome, a security warning screen will appear.

The URL to reach our virtual machine is:

https://ec2-18-206-121-52.compute-1.amazonaws.com:8891 (for me)

where:

https:// is the same for everyone,

the middle part is your instance’s public IPv4 DNS, which you can copy from the instance details in the EC2 console, and

the number after the colon is the port you specified in jupyter_notebook_config.py (8891 in my case).

Because the certificate is self-signed, the browser shows a warning; click on “Proceed (unsafe)” to go to Jupyter Notebook.

After you open Jupyter Notebook, it will ask for a token to authenticate. The authentication screen looks like this:

You can find your unique token in the terminal where you started jupyter notebook; copy it and paste it into the text box to log in.
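If the token has scrolled out of view, a standard Jupyter command run from another SSH session on the instance will list the running notebook servers along with their tokens:

$ jupyter notebook list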

[Screenshot: ImportError because PySpark is not installed yet]

The image above shows the import error that occurs because PySpark is not installed yet. Our aim is to install PySpark, and the following steps achieve it.

We need to install Java and Scala before proceeding further.

The commands to do so are listed below.

$ sudo apt-get update

sudo apt-get update is used to update the local package index with the latest information from the repositories, ensuring that the system is aware of the most recent package versions and dependencies.

$ sudo apt-get install default-jre

sudo apt-get install default-jre is used to install the default Java Runtime Environment (JRE) on a Debian-based Linux system.

$ java -version

To check the version of Java that has been installed.

$ sudo apt-get install scala

sudo apt-get install scala is used to install the Scala programming language on a Debian-based Linux system, such as Ubuntu.

$ scala -version

To check the version of Scala that has been installed.

Now we have to install pip:

$ export PATH=$PATH:$HOME/anaconda3/bin

$ conda install pip

These commands set up the Anaconda environment to include the Anaconda bin directory in the system's PATH, and then use Conda to install pip within the Anaconda environment.

Note: During the execution, if prompted to upgrade Conda, please decline the upgrade by entering “No” when the prompt appears.

$ pip install py4j

The command pip install py4j is used to install the Py4J Python library. Py4J is a Python library that enables Python programs running in a Python interpreter to dynamically access Java objects in a Java virtual machine (JVM). This library facilitates communication and interaction between Python and Java code, allowing them to work together seamlessly.

Py4J is commonly used in environments where there’s a need to integrate Python and Java components, such as when using Apache Spark. It allows Python code to interact with and leverage Java libraries and functionalities, enabling interoperability between the two programming languages.
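A quick, optional sanity check that the library is importable from the Anaconda Python we installed earlier:

$ python -c "import py4j; print(py4j.__version__)"

If this prints a version number, Py4J is in place.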

Now install Hadoop & Spark:

$ wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

The official download page is linked below in case this particular version link becomes stale one day.

https://spark.apache.org/downloads.html (keep the defaults and simply click the Download Spark link to reach the download URL).

wget is a command-line utility for downloading files from the internet

Extracting the files downloaded:

$ sudo tar -zxvf spark-3.5.1-bin-hadoop3.tgz

We will not be able to install Spark & Hadoop directly after wget, we have to extract the contents of the download. The overall purpose of the command is to extract the contents of the Spark distribution archive (spark-3.5.1-bin-hadoop3.tgz) into the current directory. This is a common step when installing Spark on a Unix-like system. The extracted contents will typically include the Spark binaries, libraries, and other necessary files and directories for running Apache Spark on the system. After extraction, you might need to configure Spark and set environment variables as needed for your specific use case.
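After extraction you should see a spark-3.5.1-bin-hadoop3 directory in your home folder containing bin, python, jars and the other Spark files; a quick look to confirm:

$ ls spark-3.5.1-bin-hadoop3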

Setting the ENV for our use:

$ export SPARK_HOME='/home/ubuntu/spark-3.5.1-bin-hadoop3'

$ export PATH=$SPARK_HOME/bin:$PATH

$ export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH

In summary, these commands are part of the setup process for configuring the environment to work with Apache Spark. They ensure that the system knows where Spark is installed (SPARK_HOME), include Spark binaries in the system's executable path (PATH), and include Spark's Python modules in the Python path (PYTHONPATH). These configurations are essential for running and interacting with Apache Spark on your system.
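Note that these exports only last for the current shell session; if you log out of the instance and reconnect, they are gone. One way to make them permanent (a sketch, assuming the same paths as above) is to append them to ~/.bashrc:

$ echo 'export SPARK_HOME=/home/ubuntu/spark-3.5.1-bin-hadoop3' >> ~/.bashrc

$ echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc

$ echo 'export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH' >> ~/.bashrc

$ source ~/.bashrc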

Opening Jupyter Notebook:

FINALLY, Spark and Hadoop are installed, and we can use them from Jupyter Notebook.

$ jupyter notebook

Run the command above to open the Jupyter Notebook we will use to manipulate BIG DATA.

After you open the Jupyter Notebook the home tree will look like this:

[Screenshot: the Jupyter Notebook home page]

Open a new notebook and run these commands:

from pyspark import SparkContext

sc = SparkContext()

If those cells run without errors, PySpark has been installed successfully.

[Screenshot: the SparkContext created successfully]
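As a slightly more thorough (and optional) smoke test, you can ask the new SparkContext to run a tiny distributed computation:

rdd = sc.parallelize(range(1000))

print(rdd.sum())  # should print 499500

If that returns 499500, Spark is executing jobs correctly.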

Yaay! We have successfully set up everything we need to manipulate BIG Data with distributed computing.

I have also attached my GitHub repository, which contains some PySpark data manipulations.

GitHub link: https://github.com/kaushik-5/PySpark

Thank you for reading this far! 🙌❤️

