How to install PySpark locally: Connecting to AWS S3 & Redshift
Many find installing PySpark locally and reading/writing to AWS S3 or connecting to Redshift challenging. This guide walks through the steps we used to do so. Note: these steps have only been tested on macOS.
Prerequisites
- Install Homebrew if you don't have it; otherwise update your brew with `brew update` and then `brew upgrade`.
- We recommend using a Python virtual environment. Here we will be using conda.
- We used Python 3.9.7; make sure it is installed in your activated virtual environment.
- We used an IAM role with read/write privileges on AWS Redshift and S3. Ensure you have the permissions required to perform the operations, otherwise you will receive `Access Denied` errors.
- For connecting to AWS, make sure you have the correct access and the values for `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_SESSION_TOKEN`. Here we generate these values by logging into AWS using the shell. They are then stored in a file at `~/.aws/credentials`. The format of this file looks like (for more info on this have a look here):
[default]
aws_access_key_id = <KEY_INFO>
aws_secret_access_key = <KEY_INFO>
aws_security_token = <TOKEN_INFO>
aws_session_expiration = <DATE_TIME>
aws_session_token = <TOKEN_INFO>
- For connecting to AWS Redshift you need the Redshift database name, port number, host name, cluster identifier, username and password. Some of this information can be found by logging in to the AWS Redshift console and looking at the JDBC URL.
Step 1: Install required packages
Install the packages below one by one; this installs Java, Scala, and Spark using Homebrew.
brew install openjdk@11
brew install scala
brew install apache-spark
Using these commands we installed and tested the current versions of Java, Scala, and PySpark, which are: PySpark 3.3.0, Scala 2.13.8 and OpenJDK 11.0.15.
Now, open the shell and activate your Python environment (example below) and use that for the rest of this installation:
conda activate <environment_name>
Now let's validate the PySpark installation from the shell. In the terminal, type `pyspark`; you should see the Spark ASCII-art banner with the version information, followed by an interactive Python prompt.
Then try the commands below:
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
df = spark.createDataFrame(data)
df.show()
You should see the three rows printed as a two-column table (columns `_1` and `_2`, since no schema was given).
Now that you have validated the PySpark installation from the shell, let's move on to setting it up for use with Python.
Step 2: Configure Spark with Python (PySpark)
Here we will install and configure PySpark for Python (tested with Python 3.9.7). Execute the commands below in the terminal to configure the environment variables.
echo 'export HOMEBREW_OPT="/opt/homebrew/opt"' >> ~/.zshrc
echo 'export JAVA_HOME="$HOMEBREW_OPT/openjdk/"' >> ~/.zshrc
echo 'export SPARK_HOME="$HOMEBREW_OPT/apache-spark/libexec"' >> ~/.zshrc
echo 'export PATH="$JAVA_HOME:$SPARK_HOME:$PATH"' >> ~/.zshrc
echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
Make sure you are in the correct Python environment (e.g. `conda activate <environment_name>`), and then install the PySpark package for Python:
python -m pip install pyspark
After installation, start `python` in the shell and check that the code below works:
import pyspark
from pyspark import SparkContext
sc = SparkContext()
n = sc.parallelize([4,10,9,7])
n.take(3)
You should see `[4, 10, 9]` as the output.
Step 3: Configuration for Connecting PySpark to AWS
Until now we have installed PySpark locally, but the aim of this article is to connect PySpark to AWS in order to read/write to S3 and also read data from Redshift. Let's begin some setup steps for this:
First, download the jar files below:
- https://repo1.maven.org/maven2/com/google/guava/guava/23.1-jre/guava-23.1-jre.jar
- https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.12/3.3.0/spark-hadoop-cloud_2.12-3.3.0.jar
- https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/aws-java-sdk-bundle-1.11.1026.jar
- https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.2/hadoop-aws-3.3.2.jar
- https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.36.1060/RedshiftJDBC42-no-awssdk-1.2.36.1060.jar
- https://repo1.maven.org/maven2/com/eclipsesource/minimal-json/minimal-json/0.9.4/minimal-json-0.9.4.jar
Then move all the jar files above into the `$SPARK_HOME/jars` folder, but read the Important note below before moving anything.
Important: if there are any other versions of `guava`, `hadoop-aws-x.x.x` or `aws-java-sdk-bundle-x.xx.xxx` in the `$SPARK_HOME/jars` folder, delete them, and place the new jar files from above in this folder.
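The download-and-move sequence above can be sketched as a small shell script. This is only a convenience sketch: it assumes `$SPARK_HOME` is already set (as in Step 2), and the version numbers simply mirror the links listed above.

```shell
#!/bin/zsh
set -e

# Download each jar into a temporary directory first.
tmpdir=$(mktemp -d)
cd "$tmpdir"

jars=(
  "https://repo1.maven.org/maven2/com/google/guava/guava/23.1-jre/guava-23.1-jre.jar"
  "https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.12/3.3.0/spark-hadoop-cloud_2.12-3.3.0.jar"
  "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/aws-java-sdk-bundle-1.11.1026.jar"
  "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.2/hadoop-aws-3.3.2.jar"
  "https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.36.1060/RedshiftJDBC42-no-awssdk-1.2.36.1060.jar"
  "https://repo1.maven.org/maven2/com/eclipsesource/minimal-json/minimal-json/0.9.4/minimal-json-0.9.4.jar"
)
for url in "${jars[@]}"; do
  curl -L -O "$url"
done

# Remove conflicting versions that ship with Spark (see Important note above).
rm -f "$SPARK_HOME"/jars/guava-*.jar
rm -f "$SPARK_HOME"/jars/hadoop-aws-*.jar
rm -f "$SPARK_HOME"/jars/aws-java-sdk-bundle-*.jar

# Place the downloaded jars where Spark will pick them up.
mv ./*.jar "$SPARK_HOME/jars/"
```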
Re-run the two test codes above (in Step 1 and Step 2) and make sure they are still working.
Step 4: Set Up AWS Credentials
In the `$SPARK_HOME/conf` folder, create a new file called `spark-env.sh`. If a template already exists in that folder, do: `cp spark-env.sh.template spark-env.sh`
This file reads the `~/.aws/credentials` file and assigns the AWS login environment variables. If your AWS login details are not temporary, you can remove the first part (reading from `~/.aws/credentials`) and simply assign those variables directly.
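The original contents of `spark-env.sh` are not reproduced here; a minimal sketch that naively reads the first (`[default]`) profile of `~/.aws/credentials` might look like the following. It assumes a single profile with `key = value` lines, as in the file format shown in the Prerequisites.

```shell
#!/bin/sh
# spark-env.sh -- sourced by Spark on startup.
# Naively pull credentials for the first profile found; adjust the patterns
# if you keep multiple profiles in ~/.aws/credentials.
export AWS_ACCESS_KEY_ID=$(awk -F ' *= *' '/^aws_access_key_id/ {print $2; exit}' ~/.aws/credentials)
export AWS_SECRET_ACCESS_KEY=$(awk -F ' *= *' '/^aws_secret_access_key/ {print $2; exit}' ~/.aws/credentials)
export AWS_SESSION_TOKEN=$(awk -F ' *= *' '/^aws_session_token/ {print $2; exit}' ~/.aws/credentials)
```

If your keys are long-lived, replace the three `awk` calls with direct assignments.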
Step 5: Validate Reading Parquet Files from S3
Let's validate that we can read parquet files from an AWS S3 bucket using the s3a protocol. Add the following lines to a Python file called `test_aws_pyspark.py` and make sure you set the correct path for `PATH_TO_S3_PARQUET_FOLDER`. In the shell, with the correct Python environment activated, run `python test_aws_pyspark.py`. You should see the top 3 rows of the table.
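The file's original contents are not shown above; a minimal sketch of what `test_aws_pyspark.py` could look like follows. The bucket path is a placeholder, and the `TemporaryAWSCredentialsProvider` setting assumes temporary credentials with a session token (as in our setup).

```python
from pyspark.sql import SparkSession

# Placeholder -- replace with an s3a path to a parquet folder you can read.
PATH_TO_S3_PARQUET_FOLDER = "s3a://<your-bucket>/<path-to-parquet-folder>/"

spark = (
    SparkSession.builder
    .appName("test_aws_pyspark")
    # Needed for temporary credentials (AWS_SESSION_TOKEN); drop this line
    # if you use long-lived access keys.
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    )
    .getOrCreate()
)

df = spark.read.parquet(PATH_TO_S3_PARQUET_FOLDER)
df.show(3)  # print the top 3 rows
```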
Step 6: Validate Writing Parquet Files to S3
Let's check that writing to S3 works. Add the following lines to a Python file called `test_aws_pyspark_write.py` and define the correct path in `WRITE_PATH`. Then, in the shell in the correct conda environment, run `python test_aws_pyspark_write.py`. You should then be able to see the parquet files in the AWS console under `WRITE_PATH`.
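Again, the file's original contents are not shown above; a minimal sketch of what `test_aws_pyspark_write.py` could look like follows, reusing the small demo dataset from Step 1. `WRITE_PATH` is a placeholder, and the credentials-provider setting assumes temporary credentials.

```python
from pyspark.sql import SparkSession

# Placeholder -- replace with an s3a path you are allowed to write to.
WRITE_PATH = "s3a://<your-bucket>/<test-write-folder>/"

spark = (
    SparkSession.builder
    .appName("test_aws_pyspark_write")
    # Needed for temporary credentials (AWS_SESSION_TOKEN).
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    )
    .getOrCreate()
)

# Write the small demo dataset from Step 1 out as parquet files.
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
df = spark.createDataFrame(data, ["language", "users_count"])
df.write.mode("overwrite").parquet(WRITE_PATH)
```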
Step 7: Validate Connection to Redshift
Let's check that we can run SQL queries by connecting to Redshift. Add the following lines to a Python file called `test_redshift_sql.py`, making sure you update the variables at the start of the file with the correct Redshift information. Then, in the shell in the correct conda environment, run `python test_redshift_sql.py`. You should see the output of your query in the shell.
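The file's original contents are not shown above; a minimal sketch of what `test_redshift_sql.py` could look like follows, using Spark's generic JDBC data source. All connection values are placeholders; the driver class name corresponds to the RedshiftJDBC42 jar downloaded in Step 3.

```python
from pyspark.sql import SparkSession

# Placeholders -- fill in from your Redshift console / JDBC URL.
REDSHIFT_HOST = "<cluster-id>.<suffix>.<region>.redshift.amazonaws.com"
REDSHIFT_PORT = "5439"
REDSHIFT_DATABASE = "<database>"
REDSHIFT_USER = "<username>"
REDSHIFT_PASSWORD = "<password>"

jdbc_url = f"jdbc:redshift://{REDSHIFT_HOST}:{REDSHIFT_PORT}/{REDSHIFT_DATABASE}"

QUERY = "SELECT * FROM <schema>.<table> LIMIT 10"  # your SQL query

spark = SparkSession.builder.appName("test_redshift_sql").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    # Driver class provided by the RedshiftJDBC42 jar from Step 3.
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .option("user", REDSHIFT_USER)
    .option("password", REDSHIFT_PASSWORD)
    .option("query", QUERY)
    .load()
)
df.show()  # print the query result
```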
Summary
Here we first gave instructions on how to set up PySpark for Python. We then walked through the steps for connecting to AWS S3 in order to read and write parquet files in S3 buckets, and finally provided an example of running SQL queries against Redshift.
References
- https://sparkbyexamples.com/pyspark/
- https://sparkbyexamples.com/pyspark-tutorial/
- https://maelfabien.github.io/bigdata/SparkInstall/#step-6-modify-your-bashrc
- https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-redshift.html
- https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html