How to install PySpark locally: Connecting to AWS S3 & Redshift
Many find installing PySpark locally and reading/writing to AWS S3 or connecting to Redshift challenging. This guide walks through the steps we used to do so. Note: these steps have only been tested on macOS.
Prerequisites
- Install Homebrew if you don't have it; otherwise update your brew with `brew update` and then `brew upgrade`.
- We recommend using a Python virtual environment. Here we will be using conda.
- We used Python 3.9.7; make sure it is installed in your activated virtual environment.
- We used an IAM role with read/write privileges on AWS Redshift and S3. Ensure you have the permissions required to perform the operations, otherwise you will receive `Access Denied` errors.
- For connecting to AWS, make sure you have the correct access and the values for `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_SESSION_TOKEN`. Here we generate these values by logging into AWS using the shell. They are then stored in a file at `~/.aws/credentials`. The format of this file looks like (for more info on this have a look here):
[default]
aws_access_key_id = <KEY_INFO>
aws_secret_access_key = <KEY_INFO>
aws_security_token = <TOKEN_INFO>
aws_session_expiration = <DATE_TIME>
aws_session_token = <TOKEN_INFO>
- For connecting to AWS Redshift you need the Redshift database name, port number, host name, cluster identifier, username and password. Some of this information can be found by logging in to the AWS Redshift console and looking at the JDBC URL.
Step 1: Install required packages
Install the packages below one by one; this installs Java, Scala, and Spark using Homebrew.
brew install openjdk@11
brew install scala
brew install apache-spark
Using these commands we installed and tested the current versions of Java, Scala, and PySpark, which are: PySpark 3.3.0, Scala 2.13.8 and OpenJDK 11.0.15.
Now, open the shell and activate your Python environment (example below) and use that for the rest of this installation:
conda activate <environment_name>
Now let's validate the PySpark installation from the shell. In the terminal, type `pyspark`; you should see the Spark ASCII-art banner with the version information, followed by an interactive Python prompt.
Then try the commands below:
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
df = spark.createDataFrame(data)
df.show()
You should see the three rows printed as a two-column table (columns `_1` and `_2`, since no schema was given).
Now that you have validated the PySpark installation from the shell, let's move on to setting it up for use with Python.
Step 2: Configure Spark with Python (PySpark)
Here we will install and configure PySpark for Python (tested with Python 3.9.7). Execute the commands below in the terminal to configure the environment variables.
echo 'export HOMEBREW_OPT="/opt/homebrew/opt"' >> ~/.zshrc
echo 'export JAVA_HOME="$HOMEBREW_OPT/openjdk/"' >> ~/.zshrc
echo 'export SPARK_HOME="$HOMEBREW_OPT/apache-spark/libexec"' >> ~/.zshrc
echo 'export PATH="$JAVA_HOME:$SPARK_HOME:$PATH"' >> ~/.zshrc
echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
Make sure you are in the correct Python environment (e.g. `conda activate <environment_name>`), and then install the PySpark package for Python:
python -m pip install pyspark
After installation, start `python` in the shell and check that the code below works:
import pyspark
from pyspark import SparkContext
sc = SparkContext()
n = sc.parallelize([4,10,9,7])
n.take(3)
You should see `[4, 10, 9]` as the output.
Step 3: Configuration for Connecting PySpark to AWS
Until now we have installed PySpark locally, but the aim of this article is to connect PySpark to AWS in order to read/write to S3 and also read data from Redshift. Let's begin some setup steps for this:
First, download the jar files below:
- https://repo1.maven.org/maven2/com/google/guava/guava/23.1-jre/guava-23.1-jre.jar
- https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.12/3.3.0/spark-hadoop-cloud_2.12-3.3.0.jar
- https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/aws-java-sdk-bundle-1.11.1026.jar
- https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.2/hadoop-aws-3.3.2.jar
- https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.36.1060/RedshiftJDBC42-no-awssdk-1.2.36.1060.jar
- https://repo1.maven.org/maven2/com/eclipsesource/minimal-json/minimal-json/0.9.4/minimal-json-0.9.4.jar
Then move all the jar files above into the `$SPARK_HOME/jars` folder, but read the Important note below before moving anything.
Important: if there are any other versions of `guava`, `hadoop-aws-x.x.x` or `aws-java-sdk-bundle-x.xx.xxx` in the `$SPARK_HOME/jars` folder, delete them, and place the new jar files from above in this folder.
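The download-and-move sequence above can be sketched as a small shell script. This is only a convenience sketch: it assumes `$SPARK_HOME` is already set (as in Step 2), and the version numbers simply mirror the links listed above.

```shell
#!/bin/zsh
set -e

# Download each jar into a temporary directory first.
tmpdir=$(mktemp -d)
cd "$tmpdir"

jars=(
  "https://repo1.maven.org/maven2/com/google/guava/guava/23.1-jre/guava-23.1-jre.jar"
  "https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.12/3.3.0/spark-hadoop-cloud_2.12-3.3.0.jar"
  "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/aws-java-sdk-bundle-1.11.1026.jar"
  "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.2/hadoop-aws-3.3.2.jar"
  "https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.36.1060/RedshiftJDBC42-no-awssdk-1.2.36.1060.jar"
  "https://repo1.maven.org/maven2/com/eclipsesource/minimal-json/minimal-json/0.9.4/minimal-json-0.9.4.jar"
)
for url in "${jars[@]}"; do
  curl -L -O "$url"
done

# Remove conflicting versions that ship with Spark (see Important note above).
rm -f "$SPARK_HOME"/jars/guava-*.jar
rm -f "$SPARK_HOME"/jars/hadoop-aws-*.jar
rm -f "$SPARK_HOME"/jars/aws-java-sdk-bundle-*.jar

# Place the downloaded jars where Spark will pick them up.
mv ./*.jar "$SPARK_HOME/jars/"
```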
Re-run the two test codes above (in Step 1 and Step 2) and make sure they are still working.
Step 4: Set Up AWS Credentials
In the `$SPARK_HOME/conf` folder, create a new file called `spark-env.sh`. If a template already exists in that folder, do: `cp spark-env.sh.template spark-env.sh`
This file reads the `~/.aws/credentials` file and assigns the AWS login environment variables. If your AWS login details are not temporary, you can remove the first part (reading from `~/.aws/credentials`) and simply assign those variables directly.
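The original contents of `spark-env.sh` are not reproduced here; a minimal sketch that naively reads the first (`[default]`) profile of `~/.aws/credentials` might look like the following. It assumes a single profile with `key = value` lines, as in the file format shown in the Prerequisites.

```shell
#!/bin/sh
# spark-env.sh -- sourced by Spark on startup.
# Naively pull credentials for the first profile found; adjust the patterns
# if you keep multiple profiles in ~/.aws/credentials.
export AWS_ACCESS_KEY_ID=$(awk -F ' *= *' '/^aws_access_key_id/ {print $2; exit}' ~/.aws/credentials)
export AWS_SECRET_ACCESS_KEY=$(awk -F ' *= *' '/^aws_secret_access_key/ {print $2; exit}' ~/.aws/credentials)
export AWS_SESSION_TOKEN=$(awk -F ' *= *' '/^aws_session_token/ {print $2; exit}' ~/.aws/credentials)
```

If your keys are long-lived, replace the three `awk` calls with direct assignments.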
Step 5: Validate Reading Parquet Files from S3
Let's validate that we can read parquet files from an AWS S3 bucket using the s3a protocol. Add the following lines to a Python file called `test_aws_pyspark.py` and make sure you set the correct path for `PATH_TO_S3_PARQUET_FOLDER`. In the shell, with the correct Python environment activated, run `python test_aws_pyspark.py`. You should see the top 3 rows of the table.
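The file's original contents are not shown above; a minimal sketch of what `test_aws_pyspark.py` could look like follows. The bucket path is a placeholder, and the `TemporaryAWSCredentialsProvider` setting assumes temporary credentials with a session token (as in our setup).

```python
from pyspark.sql import SparkSession

# Placeholder -- replace with an s3a path to a parquet folder you can read.
PATH_TO_S3_PARQUET_FOLDER = "s3a://<your-bucket>/<path-to-parquet-folder>/"

spark = (
    SparkSession.builder
    .appName("test_aws_pyspark")
    # Needed for temporary credentials (AWS_SESSION_TOKEN); drop this line
    # if you use long-lived access keys.
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    )
    .getOrCreate()
)

df = spark.read.parquet(PATH_TO_S3_PARQUET_FOLDER)
df.show(3)  # print the top 3 rows
```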
Step 6: Validate Writing Parquet Files to S3
Let's check that writing to S3 works. Add the following lines to a Python file called `test_aws_pyspark_write.py` and define the correct path in `WRITE_PATH`. Then, in the shell in the correct conda environment, run `python test_aws_pyspark_write.py`. You should then be able to see the parquet files in the AWS console under `WRITE_PATH`.
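Again, the file's original contents are not shown above; a minimal sketch of what `test_aws_pyspark_write.py` could look like follows, reusing the small demo dataset from Step 1. `WRITE_PATH` is a placeholder, and the credentials-provider setting assumes temporary credentials.

```python
from pyspark.sql import SparkSession

# Placeholder -- replace with an s3a path you are allowed to write to.
WRITE_PATH = "s3a://<your-bucket>/<test-write-folder>/"

spark = (
    SparkSession.builder
    .appName("test_aws_pyspark_write")
    # Needed for temporary credentials (AWS_SESSION_TOKEN).
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    )
    .getOrCreate()
)

# Write the small demo dataset from Step 1 out as parquet files.
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
df = spark.createDataFrame(data, ["language", "users_count"])
df.write.mode("overwrite").parquet(WRITE_PATH)
```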
Step 7: Validate Connection to Redshift
Let's check that we can run SQL queries by connecting to Redshift. Add the following lines to a Python file called `test_redshift_sql.py`, making sure you update the variables at the start of the file with the correct Redshift information. Then, in the shell in the correct conda environment, run `python test_redshift_sql.py`. You should see the output of your query in the shell.
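The file's original contents are not shown above; a minimal sketch of what `test_redshift_sql.py` could look like follows, using Spark's generic JDBC data source. All connection values are placeholders; the driver class name corresponds to the RedshiftJDBC42 jar downloaded in Step 3.

```python
from pyspark.sql import SparkSession

# Placeholders -- fill in from your Redshift console / JDBC URL.
REDSHIFT_HOST = "<cluster-id>.<suffix>.<region>.redshift.amazonaws.com"
REDSHIFT_PORT = "5439"
REDSHIFT_DATABASE = "<database>"
REDSHIFT_USER = "<username>"
REDSHIFT_PASSWORD = "<password>"

jdbc_url = f"jdbc:redshift://{REDSHIFT_HOST}:{REDSHIFT_PORT}/{REDSHIFT_DATABASE}"

QUERY = "SELECT * FROM <schema>.<table> LIMIT 10"  # your SQL query

spark = SparkSession.builder.appName("test_redshift_sql").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    # Driver class provided by the RedshiftJDBC42 jar from Step 3.
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .option("user", REDSHIFT_USER)
    .option("password", REDSHIFT_PASSWORD)
    .option("query", QUERY)
    .load()
)
df.show()  # print the query result
```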
Summary
Here we first gave instructions on how to set up PySpark for Python. We then walked through the steps for connecting to AWS S3 in order to read and write parquet files in S3 buckets, and finally provided an example of running SQL queries against Redshift.
References
- https://sparkbyexamples.com/pyspark/
- https://sparkbyexamples.com/pyspark-tutorial/
- https://maelfabien.github.io/bigdata/SparkInstall/#step-6-modify-your-bashrc
- https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-redshift.html
- https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html