Build ETL Pipeline With PySpark on AWS EC2 (1) — Setup PySpark Environment

Chuan Zhang · Published in The Startup · Oct 31, 2020 · 8 min read

Apache Spark is a powerful analytics engine for big-data processing that has recently become more and more popular across the big data industry. Spark includes built-in modules for streaming, SQL, machine learning, and graph processing, and it provides high-level APIs in Java, Scala, Python, and R.

The PySpark package is the Python API for Spark. It is great for performing exploratory data analysis at scale, building machine learning pipelines, creating ETL pipelines for data platforms, and much more. In this post, I demonstrate how to set up a PySpark environment on an AWS EC2 instance; in the next several posts, I am going to demonstrate how to build ETL pipelines on AWS.
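For readers who have not used the API before, the snippet below is a minimal, illustrative sketch of what PySpark code looks like (it assumes the environment set up in the rest of this post is already in place):

from pyspark.sql import SparkSession

# Start a local SparkSession.
spark = SparkSession.builder.appName('preview').getOrCreate()

# Build a tiny DataFrame in memory and run a simple aggregation.
df = spark.createDataFrame([('a', 1), ('b', 2), ('a', 3)], ['key', 'value'])
df.groupBy('key').sum('value').show()

spark.stop()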

Setup PySpark on AWS EC2

Various instance types and images are available on AWS EC2. Here I choose a SLES 15.2 image; the process for other Linux images is very similar.

Install Spark on AWS EC2

First, we need to install Java and Scala. After installing Java using the commands below,

ec2-user@ip-172-33-1-122:~> cd /usr/local/
ec2-user@ip-172-33-1-122:/usr/local>
ec2-user@ip-172-33-1-122:/usr/local> sudo wget --no-cookies --no-check-certificate --header "Cookie: oraclelicense=accept-securebackup-cookie" https://javadl.oracle.com/webapps/download/AutoDL?BundleId=242980_a4634525489241b9a9e1aa73d9e118e6 -O jre-8u261-linux-x64.tar.gz
ec2-user@ip-172-33-1-122:/usr/local> sudo tar -xzvf jre-8u261-linux-x64.tar.gz
ec2-user@ip-172-33-1-122:/usr/local> sudo mv jre1.8.0_261/ java
ec2-user@ip-172-33-1-122:/usr/local> sudo rm jre-8u261-linux-x64.tar.gz

we add JAVA_HOME to the /etc/profile file

ec2-user@ip-172-33-1-122:/usr/local> cd /etc/
ec2-user@ip-172-33-1-122:/etc> sudo vi profile
ec2-user@ip-172-33-1-122:/etc> tail profile
# yast in Public Cloud images fix
NCURSES_NO_UTF8_ACS=1
export NCURSES_NO_UTF8_ACS
# include JAVA_HOME
JAVA_HOME=/usr/local/java
export JAVA_HOME
PATH=$PATH:$JAVA_HOME/bin
export PATH

Next, we download and install Scala and SBT (Simple Build Tool for Scala and Java projects) using the following commands

ec2-user@ip-172-33-1-122:~/Downloads> wget http://downloads.lightbend.com/scala/2.12.1/scala-2.12.1.rpm
ec2-user@ip-172-33-1-122:~/Downloads> sudo zypper in scala-2.12.1.rpm
ec2-user@ip-172-33-1-122:~/Downloads> wget https://dl.bintray.com/sbt/native-packages/sbt/0.13.13/sbt-0.13.13.tgz
ec2-user@ip-172-33-1-122:~/Downloads> tar -xvf sbt-0.13.13.tgz
ec2-user@ip-172-33-1-122:~/Downloads> mv sbt-launcher-packaging-0.13.13/ sbt
ec2-user@ip-172-33-1-122:~/Downloads> sudo cp -r sbt /opt
ec2-user@ip-172-33-1-122:~/Downloads> sudo ln -s /opt/sbt/bin/sbt /usr/local/bin

After all three are installed, we are ready to download Spark, then extract and move the files into the /usr/local/spark folder.

ec2-user@ip-172-33-1-122:~/Downloads> wget http://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz
ec2-user@ip-172-33-1-122:~/Downloads> sudo tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz
ec2-user@ip-172-33-1-122:~/Downloads> sudo mv spark-2.1.1-bin-hadoop2.7/ /usr/local/spark
ec2-user@ip-172-33-1-122:/usr/local/spark> sudo vi .bashrc
ec2-user@ip-172-33-1-122:/usr/local/spark> cat .bashrc
export PATH=$PATH:/usr/local/spark/bin:/usr/local/spark/sbin
ec2-user@ip-172-33-1-122:/usr/local/spark>

Then we add the group and user spark, switch to the user spark, and generate an SSH key pair for it.

ec2-user@ip-172-33-1-122:~> sudo groupadd spark && sudo useradd -M -g spark -d /usr/local/spark spark
Group 'mail' not found. Creating the user mailbox file with 0600 mode.
ec2-user@ip-172-33-1-122:~>
ec2-user@ip-172-33-1-122:~> sudo chown -R spark:spark /usr/local/spark
ec2-user@ip-172-33-1-122:~>
ec2-user@ip-172-33-1-122:~> sudo su spark
spark@ip-172-33-1-122:~> ssh-keygen -t rsa -P ""
spark@ip-172-33-1-122:~> cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
spark@ip-172-33-1-122:~>

The commands below verify that both Scala and Spark have been installed successfully.

spark@ip-172-33-1-122:~> scala -version
Scala code runner version 2.12.1 -- Copyright 2002-2016, LAMP/EPFL and Lightbend, Inc.
spark@ip-172-33-1-122:~>
spark@ip-172-33-1-122:~> spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/10/29 08:26:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/10/29 08:26:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/10/29 08:26:47 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://172.33.1.122:4041
Spark context available as 'sc' (master = local[*], app id = local-1603960000539).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_261)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
scala> :quit
spark@ip-172-33-1-122:~>

Install PySpark and Jupyter

Before installing PySpark, Python needs to be installed. Below, I demonstrate how to install both Python 3.7 and Jupyter on the AWS EC2 instance.

First, install some necessary tools

ec2-user@ip-172-33-1-122:~> sudo zypper in zlib-devel bzip2 libbz2-devel libffi-devel libopenssl-devel readline-devel sqlite3 sqlite3-devel xz xz-devel gcc make

then download, build, and install Python 3.7 and set up the required symlinks.

ec2-user@ip-172-33-1-122:~> cd Downloads/
ec2-user@ip-172-33-1-122:~/Downloads> wget https://www.python.org/ftp/python/3.7.1/Python-3.7.1.tar.xz
ec2-user@ip-172-33-1-122:~/Downloads> tar -xf Python-3.7.1.tar.xz
ec2-user@ip-172-33-1-122:~/Downloads> cd Python-3.7.1/
ec2-user@ip-172-33-1-122:~/Downloads/Python-3.7.1> ./configure
ec2-user@ip-172-33-1-122:~/Downloads/Python-3.7.1> make
ec2-user@ip-172-33-1-122:~/Downloads/Python-3.7.1> sudo make altinstall
ec2-user@ip-172-33-1-122:~/Downloads/Python-3.7.1> sudo ln -s /usr/local/lib64/python3.7/lib-dynload /usr/local/lib/python3.7/lib-dynload
ec2-user@ip-172-33-1-122:/usr/bin> sudo ln -s -f /usr/local/bin/python3.7 python
ec2-user@ip-172-33-1-122:~> sudo ln -s -f /usr/local/bin/pip3.7 /usr/bin/pip

Next, we install Py4J, findspark, and PySpark

ec2-user@ip-172-33-1-122:~> sudo pip install py4j
ec2-user@ip-172-33-1-122:~> sudo pip install findspark
ec2-user@ip-172-33-1-122:~> sudo pip install pyspark

Being able to import the PySpark package in Python verifies that PySpark has been installed successfully.

spark@ip-172-33-1-122:~> python
Python 3.7.1 (default, Oct 29 2020, 01:55:39)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import findspark
>>> findspark.init('/usr/local/spark')
>>> import pyspark
>>> exit()
spark@ip-172-33-1-122:~>
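As an additional check, you can confirm that a SparkSession actually starts and runs a job from a plain Python interpreter. The snippet below is a minimal sketch, assuming Spark was installed to /usr/local/spark as above:

import findspark
findspark.init('/usr/local/spark')

from pyspark.sql import SparkSession

# Start a local SparkSession and run a trivial job end to end.
spark = SparkSession.builder.master('local[*]').appName('smoke-test').getOrCreate()
print(spark.range(5).count())  # should print 5
spark.stop()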

Alternatively, you can verify the installation by running some simple tasks, such as loading a CSV file, directly in the pyspark shell.

spark@ip-172-33-1-122:~> pyspark
Python 3.7.1 (default, Oct 29 2020, 01:55:39)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/10/31 05:16:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/10/31 05:16:08 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/
Using Python version 3.7.1 (default, Oct 29 2020 01:55:39)
SparkSession available as 'spark'.
>>>
>>> from pyspark.sql import SparkSession as ss
>>> from pyspark.sql import SQLContext as sqlc
>>>
>>> spark = ss.builder.appName('reading csv').getOrCreate()
>>> csv_file = '/usr/local/spark/Notebooks/SampleData/health_effects.csv'
>>> spark_df = spark.read.csv(csv_file, header=True, sep=',').cache()
>>> spark_df.printSchema()
root
|-- column_a: string (nullable = true)
|-- column_b: string (nullable = true)
|-- column_c: string (nullable = true)
|-- column_d: string (nullable = true)
>>> spark_df.describe().show()
20/10/31 05:17:42 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
+-------+------------------+--------------------+--------------------+--------------------+
|summary| column_a| column_b| column_c| column_d|
+-------+------------------+--------------------+--------------------+--------------------+
| count| 2049| 1435| 1435| 821|
| mean| 718.0| null| null| null|
| stddev|414.39313057369407| null| null| null|
| min| ,NULL| (+)-inotropic|A DNA binding ant...|"A substance used...|
| max| 999|xanthine oxidase ...|Vitamin A funcito...|The role played b...|
+-------+------------------+--------------------+--------------------+--------------------+
>>>
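For a slightly more thorough check, you can run a few basic DataFrame operations in the same session. This is a sketch that reuses spark_df and the column names from the output above:

# Still in the same pyspark session.
spark_df.count()                                           # number of rows
spark_df.select('column_a', 'column_b').show(5, truncate=False)
spark_df.filter(spark_df['column_b'].isNotNull()).count()  # non-null rows in column_b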

Next, we install and configure Jupyter. The commands below install Jupyter and verify that the installation is successful.

ec2-user@ip-172-33-1-122:~> sudo pip install jupyter
ec2-user@ip-172-33-1-122:~> jupyter --version
jupyter core : 4.6.3
jupyter-notebook : 6.1.4
qtconsole : 4.7.7
ipython : 7.18.1
ipykernel : 5.3.4
jupyter client : 6.1.7
jupyter lab : not installed
nbconvert : 6.0.7
ipywidgets : 7.5.1
nbformat : 5.0.8
traitlets : 5.0.5
ec2-user@ip-172-33-1-122:~>

After generating a self-signed SSL certificate and handing its ownership to the user spark

ec2-user@ip-172-33-1-122:/usr/local/spark> mkdir certs && cd certs
ec2-user@ip-172-33-1-122:/usr/local/spark/certs> sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
ec2-user@ip-172-33-1-122:/usr/local/spark/certs> sudo chown spark:spark mycert.pem

we switch to the user spark and generate the Jupyter notebook config file

ec2-user@ip-172-33-1-122:/usr/local/spark> sudo su spark
spark@ip-172-33-1-122:~>
spark@ip-172-33-1-122:~> jupyter notebook --generate-config
Writing default config to: /usr/local/spark/.jupyter/jupyter_notebook_config.py
spark@ip-172-33-1-122:~>

and add the five lines shown in the command-line output below to the Jupyter notebook config file

spark@ip-172-33-1-122:~> vi .jupyter/jupyter_notebook_config.py
spark@ip-172-33-1-122:~> head .jupyter/jupyter_notebook_config.py
# Configuration file for jupyter-notebook.
c = get_config()
c.NotebookApp.certfile = u'/usr/local/spark/certs/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
#------------------------------------------------------------------------------
# Application(SingletonConfigurable) configuration
spark@ip-172-33-1-122:~>
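Optionally, you can also protect the notebook server with a password instead of the generated token. The sketch below uses the notebook package's password helper; run it once in Python and paste the resulting hash into the config file:

# Run once, type a password when prompted, and copy the printed hash.
from notebook.auth import passwd
print(passwd())
# Then add the hash to .jupyter/jupyter_notebook_config.py:
# c.NotebookApp.password = u'sha1:...'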

Now we are ready to use PySpark in a Jupyter notebook. First, start the Jupyter notebook service on the AWS EC2 instance.

spark@ip-172-33-1-122:~>
spark@ip-172-33-1-122:~> jupyter notebook
[I 23:51:28.716 NotebookApp] Serving notebooks from local directory: /usr/local/spark
[I 23:51:28.716 NotebookApp] Jupyter Notebook 6.1.4 is running at:
[I 23:51:28.716 NotebookApp] https://ip-172-33-1-122:8888/?token=fd2e8e2cc5954a7222b04f13c3a24bdf5fcd12c552374c83
[I 23:51:28.716 NotebookApp] or https://127.0.0.1:8888/?token=fd2e8e2cc5954a7222b04f13c3a24bdf5fcd12c552374c83
[I 23:51:28.716 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 23:51:28.721 NotebookApp]
To access the notebook, open this file in a browser:
file:///usr/local/spark/.local/share/jupyter/runtime/nbserver-2439-open.html
Or copy and paste one of these URLs:
https://ip-172-33-1-122:8888/?token=fd2e8e2cc5954a7222b04f13c3a24bdf5fcd12c552374c83
or https://127.0.0.1:8888/?token=fd2e8e2cc5954a7222b04f13c3a24bdf5fcd12c552374c83

Then copy one of the two URLs from the command-line output (see the snippet above for an example) into the web browser on your local machine, and replace the domain name or IP address (in the example above, ip-172-33-1-122 and 127.0.0.1) with the public IP address of your AWS EC2 instance. You will then be able to create Jupyter notebooks and work with PySpark from the browser on your local machine. In the screenshot below, a CSV file is successfully loaded using PySpark.
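For reference, such a notebook cell looks roughly like the sketch below; the file path is the sample CSV used earlier, so adjust it to your own data:

# In a Jupyter notebook cell on the EC2 instance:
import findspark
findspark.init('/usr/local/spark')

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('notebook demo').getOrCreate()
df = spark.read.csv('/usr/local/spark/Notebooks/SampleData/health_effects.csv',
                    header=True, sep=',')
df.show(5)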

Conclusion

In this post, I have demonstrated how to set up a PySpark environment on an AWS EC2 instance. In the next post, I am going to demonstrate how to build an ETL pipeline using PySpark.
