Install Spark on macOS and Load Data from AWS S3 with PySpark
In this post, we walk through getting started with Spark on a local macOS machine and exploring and analyzing data with PySpark in a Jupyter Notebook. We also cover how to configure PySpark to read and write data in AWS S3. PySpark is a powerful tool for large-scale data analytics and ETL processes, and those with pandas experience will find its programmatic interface very similar. As data volumes grow and the demand for engineers with big data skills grows at least proportionally, Spark and PySpark experience is valuable to have. This post aims to get you started with PySpark on a local machine so you can begin analyzing reasonably large datasets without paying for an AWS EMR cluster or another cloud Spark provider.
Install Java
Spark requires Java 8 or higher to run on your local machine. To download and install Java, go to the Java download page and install the JRE (Java Runtime Environment). The JDK (Java Development Kit) is also required for Spark to run successfully, so download and install the macOS installer available on the JDK downloads page. With the JRE and JDK installed, we can proceed to downloading and installing Spark.
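To confirm Java is visible from the command line before continuing, check the reported versions in a terminal; the exact version strings will depend on which releases you installed.
# Verify the runtime and the compiler are both on the PATH
java -version
javac -version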
Download and Install Spark
With Java installed, Spark can now be installed on our local machine. Proceed to the Spark download page, select the latest release (3.1.2 at the time of this writing) with the pre-built Hadoop package, and download. This will give us a “basic” Spark installation with just the necessary components for running Spark and PySpark on our local machine. For more tool support, such as Hive and Kubernetes, the Spark distribution must be built manually.
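If you prefer the terminal, the same release can also be fetched from the Apache archive with curl. The URL below assumes the 3.1.2 build packaged with Hadoop 3.2, so adjust it to match the release you selected.
# Download the pre-built Spark 3.1.2 (Hadoop 3.2) distribution to ~/Downloads
curl -o ~/Downloads/spark-3.1.2-bin-hadoop3.2.tgz https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz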
In a terminal, go to the directory containing the downloaded Spark distribution. Extract the archive and move it to your desired installation location. In the example below, we assume the archive is in the Downloads/ directory and place the extracted distribution in our user folder.
cd ~/Downloads
tar -xzf spark-3.1.2-bin-hadoop3.2.tgz
mv spark-3.1.2-bin-hadoop3.2 /Users/aaronschlegel/spark-3.1.2
Although not strictly necessary, creating a symbolic link to the installation lets us keep multiple Spark versions side by side and switch between them by repointing the link.
ln -s /Users/aaronschlegel/spark-3.1.2 /Users/aaronschlegel/spark
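For example, if a newer build were later installed alongside the current one, switching versions would only require repointing the link. The spark-3.2.0 path below is purely illustrative.
# Hypothetical: repoint the symlink at a different Spark installation
ln -sfn /Users/aaronschlegel/spark-3.2.0 /Users/aaronschlegel/spark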
In the next section, we set the needed environment and $PATH variables so our system knows where to find the Java and Spark installations.
Setting the Environment Variables
Before running Spark and PySpark on our local machine, we first need to set several environment variables so our system can find the Spark and Java installations. In your ~/.bashrc or similar file, set the SPARK_HOME variable to the symbolic link directory we created earlier and the JAVA_HOME variable to the location where the JRE (Java Runtime Environment) was installed in a previous step. This location is typically /Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/.
export SPARK_HOME=/Users/aaronschlegel/spark
export JAVA_HOME="/Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home"
The created environment variables are then added to the path.
export PATH=$PATH:$JAVA_HOME
export PATH=$SPARK_HOME/bin:$PATH
Lastly, to run PySpark in a Jupyter Notebook, we add two more environment variables that point the PySpark driver at Jupyter.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
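After sourcing the updated ~/.bashrc (or opening a new terminal), a quick check confirms the variables resolve and the Spark binaries are on the $PATH; the version banner will match whichever release you installed.
echo $SPARK_HOME
spark-submit --version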
We are now ready to run PySpark on our local macOS machine in a Jupyter Notebook! In the next section, we walk through an example of using PySpark in a notebook to read data from AWS S3.
Running PySpark in Jupyter Notebook
Either source the updated ~/.bashrc file or open a new terminal, then install the pyspark and findspark libraries. The findspark library allows us to start PySpark based on the SPARK_HOME environment variable we set in the previous section.
pip install pyspark
pip install findspark
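To avoid version mismatches between the pip package and the downloaded distribution, it can help to pin pyspark to the same release; adjust the version number below to whatever you installed.
pip install pyspark==3.1.2 findspark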
Open a Jupyter Notebook as usual (generally by running jupyter notebook in a terminal) and import findspark and pyspark.
import findspark
findspark.init()
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
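As an optional sanity check (not needed for the S3 example that follows), you can spin up a throwaway local session, build a tiny DataFrame, and stop the session so it does not conflict with the SparkContext we configure in the next section.
# Optional: verify the local installation with a throwaway session.
spark_check = SparkSession.builder.master('local[*]').appName('sanity_check').getOrCreate()
spark_check.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'letter']).show()
spark_check.stop()  # stop it so it does not clash with the S3-configured context below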
Example: Reading Data from AWS S3 with PySpark
Before we can read data stored in AWS S3, we must first add the hadoop-aws package to the spark-submit arguments used when running PySpark in a Jupyter Notebook. One approach is to set the PYSPARK_SUBMIT_ARGS environment variable through the os.environ mapping.
import os

# The local installation of Hadoop is version 3.2.0, therefore the environment variable
# PYSPARK_SUBMIT_ARGS is set to use that version.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:3.2.0" pyspark-shell'
S3 uses an authentication method known as AWS Signature Version 4 to authenticate requests made to buckets. When using an AWS SDK such as boto3, the authentication steps are handled automatically; however, since we are not using an AWS SDK to read data from S3 with PySpark, we must set an additional configuration option on the SparkConf() configuration object. The option is com.amazonaws.services.s3.enableV4=true for both the Spark executor and driver.
conf = SparkConf().\
    set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true').\
    set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true').\
    setAppName('read_data_from_s3_example').setMaster('local[*]')

sc = SparkContext(conf=conf)
Lastly, we set some Hadoop configuration variables, as reading from and writing to AWS S3 in PySpark is handled internally by the Hadoop APIs. The variables we need to set are the AWS access and secret keys provided when creating a user in AWS, the endpoint for the region where the bucket is located, and the desired filesystem implementation to use when interacting with S3. We use the s3a filesystem as it is generally the most performant, can support file sizes up to 5TB, and is intended to replace the older native s3n filesystem. Please see the documentation for much more detail on the different filesystems available for working with S3 through Hadoop/Spark.
aws_access_key = os.environ.get('AWS_SECRET_ID')
aws_secret_key = os.environ.get('AWS_SECRET_ACCESS_KEY')

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', aws_access_key)
hadoopConf.set('fs.s3a.secret.key', aws_secret_key)
hadoopConf.set('fs.s3a.endpoint', 's3-us-west-2.amazonaws.com')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
Note, the settings above can also be added to the spark-defaults.conf file located in your Spark installation's conf/ directory to avoid having to set the configurations for each new notebook.
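As a rough sketch, the equivalent spark-defaults.conf entries might look like the following. Hadoop options take the spark.hadoop. prefix, the endpoint is the same us-west-2 example used above, and it is best to keep the credentials themselves in environment variables rather than in the file.
spark.driver.extraJavaOptions      -Dcom.amazonaws.services.s3.enableV4=true
spark.executor.extraJavaOptions    -Dcom.amazonaws.services.s3.enableV4=true
spark.jars.packages                org.apache.hadoop:hadoop-aws:3.2.0
spark.hadoop.fs.s3a.endpoint       s3-us-west-2.amazonaws.com
spark.hadoop.fs.s3a.impl           org.apache.hadoop.fs.s3a.S3AFileSystem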
Now we are ready to read data from S3! To start, we create a SparkSession object to act as the main entry point to programming with Spark.
spark = SparkSession(sc)
The data we will be loading is a nearly complete set of all animal welfare organizations in the United States, taken from the Petfinder.com API. The data is stored in AWS S3 as several JSON files with the same schema. To load all of the files into a single DataFrame, we can use a wildcard (*) in place of the file names in the directory path.
organizations = spark.read.json('s3a://pethub-data/petfinder/organizations/*.json')
To display the inferred schema of the loaded JSON files, we can use the printSchema() method.
organizations.printSchema()

root
 |-- _links: struct (nullable = true)
 |    |-- animals: struct (nullable = true)
 |    |    |-- href: string (nullable = true)
 |    |-- self: struct (nullable = true)
 |    |    |-- href: string (nullable = true)
 |-- address: struct (nullable = true)
 |    |-- address1: string (nullable = true)
 |    |-- address2: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- org_id: string (nullable = true)
 |    |-- postcode: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- adoption: struct (nullable = true)
 |    |-- org_id: string (nullable = true)
 |    |-- policy: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- distance: string (nullable = true)
 |-- email: string (nullable = true)
 |-- hours: struct (nullable = true)
 |    |-- friday: string (nullable = true)
 |    |-- monday: string (nullable = true)
 |    |-- saturday: string (nullable = true)
 |    |-- sunday: string (nullable = true)
 |    |-- thursday: string (nullable = true)
 |    |-- tuesday: string (nullable = true)
 |    |-- wednesday: string (nullable = true)
 |-- id: string (nullable = true)
 |-- mission_statement: string (nullable = true)
 |-- name: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- photos: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- full: string (nullable = true)
 |    |    |-- large: string (nullable = true)
 |    |    |-- medium: string (nullable = true)
 |    |    |-- org_id: string (nullable = true)
 |    |    |-- small: string (nullable = true)
 |-- social_media: struct (nullable = true)
 |    |-- facebook: string (nullable = true)
 |    |-- instagram: string (nullable = true)
 |    |-- pinterest: string (nullable = true)
 |    |-- twitter: string (nullable = true)
 |    |-- youtube: string (nullable = true)
 |-- url: string (nullable = true)
 |-- website: string (nullable = true)
We have just read several JSON files at once into a single PySpark DataFrame! As a final example, we will show how to query data from the DataFrame. First, we create a temporary view that only exists while the SparkSession object is active.
organizations.createOrReplaceTempView("organizations")
We can now use standard SQL statements to query the data we just loaded from JSON! For example, suppose we want to query all of the addresses in the address node where the address1 value is not null.
spark.sql("SELECT address.address1 FROM organizations WHERE address.address1 IS NOT NULL").show()
+--------------------+
|            address1|
+--------------------+
|      P.O.Box 770992|
|   915 Benton Street|
|          PO Box 938|
|       P.O. Box 1321|
|    1023 Shirley St.|
|        P.O. Box 117|
|     2916 Wyola Ave.|
|      P. O. Box 1791|
|15050 County Road 49|
|       P O BOX 41551|
|       P.O. Box 8727|
|         PO BOX 1613|
|1555 S. Wincheste...|
|          PO Box 364|
|         PO Box 1135|
|        P.O. Box 126|
|       P.O. Box 2285|
|         PO Box 8014|
|        Mohegan Park|
|       P.O. Box 1303|
+--------------------+
only showing top 20 rows
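The same filter can also be written with the DataFrame API rather than SQL; a minimal equivalent using the organizations DataFrame from above:
from pyspark.sql.functions import col

# Equivalent of the SQL query above, expressed with the DataFrame API.
organizations.where(col('address.address1').isNotNull()) \
    .select('address.address1') \
    .show()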
Conclusion
In this post, we explored installing and using a local PySpark environment for development and testing. We also introduced PySpark's ability to interact with S3, with examples of reading and querying JSON data using standard SQL statements. Of course, this is only the tip of the iceberg of what Spark, PySpark, and the wider Apache ecosystem offer. Still, I hope it helped show that getting started with PySpark doesn't need to be complicated or expensive.