PySpark is a Python API for Apache Spark. It allows users to write Python and SQL-like commands to analyse and manipulate data in a distributed processing environment 💻.
PySpark excels at handling enormous, complex datasets, particularly those emerging from new sources that can quickly accumulate gigabytes or more. While you can run it on your local computer, its true potential is unlocked in a distributed computing environment, like the one provided by Amazon Web Services (AWS) ☁️. Running PySpark locally is like driving a sports car on a cramped track: to get its full performance on massive datasets, it needs the open road of a distributed system such as AWS.
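To make the distributed model concrete before installing anything, here is a minimal pure-Python sketch of the map-reduce pattern that Spark parallelizes across machines (the function names here are illustrative only, not Spark APIs):

```python
from collections import Counter

def map_phase(lines):
    # "Map": each line is processed independently into (word, 1) pairs,
    # which is why Spark can split this work across many machines.
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    # "Reduce": counts for the same word are merged back together.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark makes big data easy", "big data needs big tools"]
result = reduce_phase(map_phase(lines))
print(result["big"])  # 3
```

On a real cluster, Spark performs the same two phases with operations like `flatMap` and `reduceByKey`, shipping the map work to wherever the data lives.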
The following outlines the steps for installing PySpark on AWS:
To stay within the AWS free tier, we build the setup using free-tier-eligible AWS services.
The steps for setting up the EC2 instance are covered in a separate story; please follow the link there for the EC2 setup information. Once the instance is ready, we can proceed to install PySpark.
Following the successful creation of an instance, we now need to access it to install PySpark and its components. Establish the connection with the Ubuntu AMI as specified in the setup story, for example:
$ ssh -i pysparksetup.pem ubuntu@ec2-44-206-224-236.compute-1.amazonaws.com
(the key file name and hostname vary from instance to instance).
Run the following commands in the terminal session connected to our virtual Ubuntu AMI.
Downloading Anaconda:
$ wget http://repo.continuum.io/archive/Anaconda3-2020.02-Linux-x86_64.sh
wget is a command-line utility for downloading files from the web. Anaconda bundles Python and Jupyter Notebook, which we will later access remotely from the browser.
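For comparison, the same download step can be sketched with Python's standard library; this is just an illustration of what wget does, with the fetch of the real installer left commented out:

```python
import urllib.request

def download(url, dest):
    # Fetch the resource at `url` and write its bytes to `dest` --
    # roughly what `wget <url>` does from the shell.
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        out.write(resp.read())

# Uncomment to fetch the actual installer (it is a large file):
# download("http://repo.continuum.io/archive/Anaconda3-2020.02-Linux-x86_64.sh",
#          "Anaconda3-2020.02-Linux-x86_64.sh")
```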
Installation Command:
$ bash Anaconda3-2020.02-Linux-x86_64.sh
Once the installer script is downloaded with wget, the command $ bash Anaconda3-2020.02-Linux-x86_64.sh executes the Anaconda installer script. This initiates the installation process for Anaconda on the Linux system.
Changing Environment:
$ source ~/.bashrc
When you run $ source ~/.bashrc, it tells the current Bash session to re-read and execute the commands from the .bashrc file. This is useful when you've made changes to .bashrc and want them to take effect without having to close and reopen the terminal.
Alternatively, you can use the . (dot) command as a shorthand for source, so the command can also be written as:
$ . ~/.bashrc
Both commands achieve the same result of applying changes from the .bashrc file to the current Bash session.
Check the version of Python Installed with Anaconda:
$ which python
/home/ubuntu/anaconda3/bin/python
The path indicates that the Python interpreter is located in the specified path within the Anaconda distribution.
$ python --version
This command displays the version of the Python interpreter currently installed on your system. We check this because PySpark is a Python API and requires a working Python installation.
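The same check can also be performed from inside Python via sys.version_info; a small sketch:

```python
import sys

# sys.version_info holds the interpreter version -- the programmatic
# counterpart of running `python --version` on the shell.
major, minor = sys.version_info[:2]
print(f"Python {major}.{minor}")
assert major == 3, "PySpark 3.x requires Python 3"
```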
Setting up SSL/TLS encryption for Jupyter Notebook:
$ jupyter notebook --generate-config
This command generates a default Jupyter Notebook configuration file if it doesn’t already exist. The configuration file (jupyter_notebook_config.py) is needed to customize the behaviour of Jupyter Notebook.
$ mkdir certs
This command creates a directory named “certs” to store the SSL/TLS certificate files.
$ cd certs
Changes our current working directory to the newly created “certs” directory.
$ sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
We use the OpenSSL tool to generate a self-signed SSL/TLS certificate (mycert.pem). This certificate encrypts the communication between our web browser and the Jupyter Notebook server, providing a secure HTTPS connection. The command creates a certificate valid for 365 days (-days 365), skips passphrase protection of the key (-nodes), and uses a 1024-bit RSA private key (-newkey rsa:1024). Generating the key without a passphrase makes it suitable for automated, non-interactive use; note that 1024-bit RSA is considered weak by modern standards, so rsa:2048 is a safer choice if you can use it.
$ sudo chmod 777 mycert.pem
Modifies the permissions of the generated SSL/TLS certificate (mycert.pem) to allow broad read, write, and execute permissions. This may not be the most secure setting but is done here to ensure that Jupyter Notebook can read the certificate.
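The permission bits that chmod sets can also be inspected and changed from Python; a throwaway-file sketch, using the tighter mode 600 (owner read/write only) that is generally preferred for key material:

```python
import os
import stat
import tempfile

# Create a throwaway file standing in for mycert.pem.
fd, path = tempfile.mkstemp()
os.close(fd)

# chmod 600: readable and writable by the owner only -- a safer
# alternative to the wide-open 777 used above.
os.chmod(path, 0o600)
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))  # 0o600

os.remove(path)
```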
$ cd ~/.jupyter/
Changes the current working directory to the Jupyter configuration directory.
$ vi jupyter_notebook_config.py
Opens the Jupyter Notebook configuration file (jupyter_notebook_config.py) in the vi text editor. Here you make manual edits to configure Jupyter Notebook settings, including specifying the path to the SSL/TLS certificate and enabling HTTPS.
After you open the configuration file, press “i” to enter insert mode and add the following snippet.
# Configuration file for jupyter-notebook.
c = get_config()
# Notebook config: this is where you saved your pem cert
c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'
# Listen on all IPs
c.NotebookApp.ip = '0.0.0.0'
# Allow all origins
c.NotebookApp.allow_origin = '*'
# Do not open browser by default
c.NotebookApp.open_browser = False
# Fix port to 8891
c.NotebookApp.port = 8891
After adding these lines, press Esc and type “:wq” to write the file and quit the editor.
Get back to main directory:
Now return to the home directory and run the Jupyter Notebook command:
$ cd # to get back to the main directory from config file
# To run the Jupyter Notebook run the below command
$ jupyter notebook
Now open your browser; I used Google Chrome, but use whichever you prefer. If you use Google Chrome, this screen will appear.
The URL to reach the virtual machine is:
https://ec2-18-206-121-52.compute-1.amazonaws.com:8891 (in my case)
where:
https:// is common to everyone,
the middle part is your instance’s public IPv4 DNS (copy it from your instance menu),
and after the public IPv4 DNS comes the port number you specified in the config file; I used port 8891.
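In other words, the address is assembled from three parts; a tiny sketch (the DNS name below is my example value, substitute your own):

```python
# Example values -- replace with your instance's public IPv4 DNS and
# the port you set in jupyter_notebook_config.py.
public_dns = "ec2-18-206-121-52.compute-1.amazonaws.com"
port = 8891

url = f"https://{public_dns}:{port}"
print(url)  # https://ec2-18-206-121-52.compute-1.amazonaws.com:8891
```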
Click on proceed (unsafe) to go to jupyter notebook.
After you open the jupyter notebook, it will ask for a token authentication to proceed into jupyter notebook. The authentication screen will look like this:
You can find your unique token in the terminal output; copy it and paste it into the textbox to log in.
The above image shows the import error that occurred because PySpark was not yet installed. Our aim is to install PySpark, so the following steps achieve it.
We need to install Java and Scala before proceeding. The commands are listed below.
$ sudo apt-get update
sudo apt-get update updates the local package index with the latest information from the repositories, ensuring that the system is aware of the most recent package versions and dependencies.
$ sudo apt-get install default-jre
sudo apt-get install default-jre installs the default Java Runtime Environment (JRE) on a Debian-based Linux system.
$ java -version
To check the version of Java that has been installed.
$ sudo apt-get install scala
sudo apt-get install scala installs the Scala programming language on a Debian-based Linux system, such as Ubuntu.
$ scala -version
To check the version of Scala that has been installed.
Now we have to install PIP:
export PATH=$PATH:$HOME/anaconda3/bin
conda install pip
These commands add the Anaconda bin directory to the system's PATH and then use Conda to install pip within the Anaconda environment.
Note: During the execution, if prompted to upgrade Conda, please decline the upgrade by entering “No” when the prompt appears.
$ pip install py4j
The command pip install py4j installs the Py4J library. Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java virtual machine (JVM), facilitating communication between Python and Java code so the two can work together seamlessly.
Py4J is commonly used in environments where there’s a need to integrate Python and Java components, such as when using Apache Spark. It allows Python code to interact with and leverage Java libraries and functionalities, enabling interoperability between the two programming languages.
Now install Hadoop & Spark
$ wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
This download link points to a specific Spark build, and the URL may stop working one day as versions rotate. If it does, visit https://spark.apache.org/downloads.html, keep the defaults, and click the Download Spark link to get the current download URL. As before, wget is the command-line utility that downloads the file from the internet.
Extracting the files downloaded:
$ sudo tar -zxvf spark-3.5.1-bin-hadoop3.tgz
We cannot use Spark & Hadoop straight after wget; we first have to extract the contents of the download. The purpose of the command is to extract the Spark distribution archive (spark-3.5.1-bin-hadoop3.tgz) into the current directory, a common step when installing Spark on a Unix-like system. The extracted contents include the Spark binaries, libraries, and the other files and directories needed to run Apache Spark. After extraction, you may need to configure Spark and set environment variables for your specific use case.
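What `tar -zxvf` does can be mirrored with Python's tarfile module; a self-contained sketch that builds a tiny stand-in archive and unpacks it:

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()

# Build a tiny gzipped tarball standing in for spark-3.5.1-bin-hadoop3.tgz.
member = os.path.join(workdir, "README.txt")
with open(member, "w") as f:
    f.write("spark placeholder")
archive = os.path.join(workdir, "demo.tgz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(member, arcname="demo/README.txt")

# Extract it -- the Python equivalent of `tar -zxvf demo.tgz`.
outdir = os.path.join(workdir, "out")
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(outdir)

extracted = os.path.join(outdir, "demo", "README.txt")
print(open(extracted).read())  # spark placeholder
```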
Setting the ENV for our use:
$ export SPARK_HOME='/home/ubuntu/spark-3.5.1-bin-hadoop3'
$ export PATH=$SPARK_HOME/bin:$PATH
$ export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
In summary, these commands configure the environment for Apache Spark. They tell the system where Spark is installed (SPARK_HOME), put the Spark executables on the PATH, and add Spark's Python modules to the PYTHONPATH. These configurations are essential for running and interacting with Apache Spark on your system.
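Note that these exports last only for the current shell session (add them to ~/.bashrc to make them permanent). From inside Python they correspond to entries in os.environ; a sketch using the same machine-specific path as above:

```python
import os

# The install location from this tutorial -- adjust to your own path.
spark_home = "/home/ubuntu/spark-3.5.1-bin-hadoop3"

os.environ["SPARK_HOME"] = spark_home
# Prepend Spark's bin directory so spark-submit and pyspark are found.
os.environ["PATH"] = os.path.join(spark_home, "bin") + os.pathsep + os.environ.get("PATH", "")
# Let Python import the pyspark package shipped with the distribution.
os.environ["PYTHONPATH"] = os.path.join(spark_home, "python") + os.pathsep + os.environ.get("PYTHONPATH", "")

print(os.environ["SPARK_HOME"])  # /home/ubuntu/spark-3.5.1-bin-hadoop3
```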
Opening Jupyter Notebook:
FINALLY, we have installed Spark & Hadoop and can use them from Jupyter Notebook.
$ jupyter notebook
Run the command to open the Jupyter Notebook we will use to manipulate BIG DATA.
After you open the Jupyter Notebook the home tree will look like this:
Open a new notebook and run these commands
from pyspark import SparkContext
sc = SparkContext()
If those cells run without an error, PySpark has been installed successfully.
Yaay! We have successfully installed everything we need to manipulate BIG Data with distributed computing.
I have also attached my GitHub repository, which contains some PySpark manipulations on data.
Github Link- https://github.com/kaushik-5/PySpark
Thank you for reading this, this far! 🙌❤️