Install Spark on macOS and Load Data from AWS S3 with PySpark

Aaron Schlegel
Aug 20, 2021

In this post, we walk through how to get started with Spark on a local macOS machine and begin exploring and analyzing data with PySpark in a Jupyter Notebook. We also explore how to configure PySpark to interact with AWS S3 for reading and writing data. PySpark is a beautiful tool for performing large-scale data analytics and ETL processes, and those with pandas experience will find the programmatic interface very similar. As the volume of data grows and the demand for engineers with big data skills increases at least proportionally, it is valuable to have Spark and PySpark experience. This post aims to get you started with PySpark on a local machine so you can begin analyzing reasonably large datasets without needing to pay for AWS EMR or another hosted Spark provider.

Install Java

Java 8 or higher is required for Spark to run on your local machine. To download and install Java, go to the Java download page and install the JRE (Java Runtime Environment). The JDK (Java Development Kit) is also required for Spark to run successfully. Download and install the macOS Installer available on the JDK downloads page. With the JRE and JDK installed, we can proceed to downloading and installing Spark.
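To confirm that Java is available on the command line before moving on, we can check the installed versions in a terminal (the exact version strings will vary depending on which release you installed):

java -version
javac -version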

Download and Install Spark

With Java installed, Spark can now be installed on our local machine. Proceed to the Spark download page, select the latest release (3.1.2 at the time of this writing) with the pre-built Hadoop package, and download. This will give us a “basic” Spark installation with just the necessary components for running Spark and PySpark on our local machine. For more tool support, such as Hive and Kubernetes, the Spark distribution must be built manually.

In a terminal, go to the directory with the downloaded Spark distribution. Unzip the downloaded Spark installation and copy it to your desired installation location. In the below example, we assume the Spark distribution is in the Downloads/ directory and opt to place the Spark distribution in our user folder.

cd ~/Downloads
tar -xzf spark-3.1.2-bin-hadoop3.2.tgz
mv spark-3.1.2-bin-hadoop3.2 /Users/aaronschlegel/spark-3.1.2

Although not strictly necessary, creating a symbolic link to the installation directory makes it easy to keep and switch between multiple versions of Spark.

ln -s /Users/aaronschlegel/spark-3.1.2 /Users/aaronschlegel/spark

Lastly, in the next section we set the needed environment and $PATH variables so our system knows where to find the installed Java and Spark locations.

Setting the Environment Variables

Before running Spark and PySpark on our local machine, we first need to set several $PATH variables for our system to find the Spark and Java installations. In your ~/.bashrc, ~/.zshrc, or similar shell profile, set the SPARK_HOME variable to the symbolic link directory we created earlier and the JAVA_HOME variable to the location where the JRE (Java Runtime Environment) was installed in a previous step. This location is typically /Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/.

export SPARK_HOME=/Users/aaronschlegel/spark
export JAVA_HOME="/Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home"
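If that plugin path does not exist on your machine (for example, when only the JDK installer was used), macOS ships a java_home utility that can locate the installed JDK; setting JAVA_HOME from its output is a reasonable alternative:

export JAVA_HOME=$(/usr/libexec/java_home)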

The created environment variables are then added to the path.

export PATH=$PATH:$JAVA_HOME
export PATH=$SPARK_HOME/bin:$PATH

Lastly, to run PySpark in a Jupyter Notebook, we add a few more environment variables to update the default PySpark driver variables.

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

We are now ready to run PySpark on our local macOS machine and in a Jupyter Notebook! In the next section, we walk through an example of using PySpark to read data from AWS S3 in a notebook.

Running PySpark in Jupyter Notebook

Either source the updated ~/.bashrc file or open a new terminal and install the pyspark and findspark libraries. The findspark library allows us to start PySpark based on the SPARK_HOME environment variable we set in the previous section.

pip install pyspark
pip install findspark

Open a Jupyter Notebook as usual (generally by running jupyter notebook in a terminal) and import findspark and pyspark.

import findspark
findspark.init()
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

Example: Reading Data from AWS S3 with PySpark

Before we can read data stored in AWS S3, we must first add the hadoop-aws package to the spark-submit arguments used when running PySpark in a Jupyter Notebook. One approach is to set the PYSPARK_SUBMIT_ARGS environment variable using os.environ.

import os
# The local installation of Hadoop is version 3.2.0, therefore the environment variable PYSPARK_SUBMIT_ARGS is set to
# use that version.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:3.2.0" pyspark-shell'

S3 uses an authentication method known as AWS Signature Version 4 to authenticate requests made to buckets. When using AWS SDKs such as boto3, the authentication steps are done automatically; however, as we are not using an AWS SDK to read data from S3 with PySpark, we must first pass an additional configuration option chained to the SparkConf() configuration object. The configuration option is com.amazonaws.services.s3.enableV4=true for both the Spark executor and driver.

conf = SparkConf().\
    set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true').\
    set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true').\
    setAppName('read_data_from_s3_example').setMaster('local[*]')
sc = SparkContext(conf=conf)

Lastly, we set some Hadoop API configuration variables, as reading from and writing to AWS S3 in PySpark is handled internally by Hadoop APIs. The variables we need to set are the AWS access and secret keys provided when creating a user in AWS, the region where the bucket is located, and the desired filesystem to use when interacting with S3. We use the s3a filesystem as it is generally the most performant, supports file sizes up to 5TB, and is intended to replace the older s3n filesystem. Please see the documentation for much more detail on the different filesystems available for working with S3 through Hadoop/Spark.

aws_access_key = os.environ.get('AWS_SECRET_ID')
aws_secret_key = os.environ.get('AWS_SECRET_ACCESS_KEY')

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', aws_access_key)
hadoopConf.set('fs.s3a.secret.key', aws_secret_key)
hadoopConf.set('fs.s3a.endpoint', 's3-us-west-2.amazonaws.com')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')

Note, the settings above can also be added to the spark-defaults.conf file located in your Spark installation’s conf/ directory to avoid having to set the configurations for each new notebook.
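For reference, a sketch of what the equivalent entries in spark-defaults.conf could look like is shown below. The access and secret key values are placeholders for your own credentials and should not be committed to version control; the spark.hadoop.* prefix passes the settings through to the Hadoop configuration.

spark.jars.packages                org.apache.hadoop:hadoop-aws:3.2.0
spark.executor.extraJavaOptions    -Dcom.amazonaws.services.s3.enableV4=true
spark.driver.extraJavaOptions      -Dcom.amazonaws.services.s3.enableV4=true
spark.hadoop.fs.s3a.access.key     <your-access-key>
spark.hadoop.fs.s3a.secret.key     <your-secret-key>
spark.hadoop.fs.s3a.endpoint       s3-us-west-2.amazonaws.com
spark.hadoop.fs.s3a.impl           org.apache.hadoop.fs.s3a.S3AFileSystem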

Now we are ready to read data from S3! To start, we create a SparkSession object to act as the main entry point to programming with Spark.

spark = SparkSession(sc)

The data that we will be loading is a nearly complete set of all animal welfare organizations in the United States taken from the Petfinder.com API. The data is stored in AWS S3 as several JSON files with the same data schema. To load all files at once into a DataFrame, we can use a wildcard (*) in place of the actual file names in the directory.

organizations = spark.read.json('s3a://pethub-data/petfinder/organizations/*.json')

To display the inferred schema from the loaded JSON files, we can use the printSchema() method.

organizations.printSchema()

root
 |-- _links: struct (nullable = true)
 |    |-- animals: struct (nullable = true)
 |    |    |-- href: string (nullable = true)
 |    |-- self: struct (nullable = true)
 |    |    |-- href: string (nullable = true)
 |-- address: struct (nullable = true)
 |    |-- address1: string (nullable = true)
 |    |-- address2: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- org_id: string (nullable = true)
 |    |-- postcode: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- adoption: struct (nullable = true)
 |    |-- org_id: string (nullable = true)
 |    |-- policy: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- distance: string (nullable = true)
 |-- email: string (nullable = true)
 |-- hours: struct (nullable = true)
 |    |-- friday: string (nullable = true)
 |    |-- monday: string (nullable = true)
 |    |-- saturday: string (nullable = true)
 |    |-- sunday: string (nullable = true)
 |    |-- thursday: string (nullable = true)
 |    |-- tuesday: string (nullable = true)
 |    |-- wednesday: string (nullable = true)
 |-- id: string (nullable = true)
 |-- mission_statement: string (nullable = true)
 |-- name: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- photos: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- full: string (nullable = true)
 |    |    |-- large: string (nullable = true)
 |    |    |-- medium: string (nullable = true)
 |    |    |-- org_id: string (nullable = true)
 |    |    |-- small: string (nullable = true)
 |-- social_media: struct (nullable = true)
 |    |-- facebook: string (nullable = true)
 |    |-- instagram: string (nullable = true)
 |    |-- pinterest: string (nullable = true)
 |    |-- twitter: string (nullable = true)
 |    |-- youtube: string (nullable = true)
 |-- url: string (nullable = true)
 |-- website: string (nullable = true)

We have just read several JSON files at once into a single PySpark DataFrame! As a final example, we will introduce how to query data from the DataFrame. First, we create a temporary view that will only exist while the `SparkSession` object is active.

organizations.createOrReplaceTempView("organizations")

We can now use standard SQL statements to query the data that was stored as JSON just a few moments ago! For example, suppose we want to query all of the addresses in the address node where the address1 value is not null.

spark.sql("SELECT address.address1 FROM organizations WHERE address.address1 IS NOT NULL").show()

+--------------------+
|            address1|
+--------------------+
|      P.O.Box 770992|
|   915 Benton Street|
|          PO Box 938|
|       P.O. Box 1321|
|    1023 Shirley St.|
|        P.O. Box 117|
|     2916 Wyola Ave.|
|      P. O. Box 1791|
|15050 County Road 49|
|       P O BOX 41551|
|       P.O. Box 8727|
|         PO BOX 1613|
|1555 S. Wincheste...|
|          PO Box 364|
|         PO Box 1135|
|        P.O. Box 126|
|       P.O. Box 2285|
|         PO Box 8014|
|        Mohegan Park|
|       P.O. Box 1303|
+--------------------+
only showing top 20 rows
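The same filter can also be expressed with the DataFrame API instead of SQL. The snippet below is a minimal equivalent of the query above, assuming the organizations DataFrame loaded earlier is still in scope:

from pyspark.sql import functions as F

# Keep only rows where the nested address1 field is populated, then select it
organizations \
    .filter(F.col('address.address1').isNotNull()) \
    .select('address.address1') \
    .show()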

Conclusion

In this post, we explored installing and using a local PySpark environment for development and local testing purposes. We also introduced PySpark's ability to interact with and read data from S3, with several examples of how to read and query JSON data from S3 using standard SQL statements. Of course, this is only the tip of the iceberg of the capabilities offered by Spark/PySpark and other Apache tools. Still, I hope it helped show that getting started with PySpark doesn't need to be complicated or expensive.

