Accessing AWS S3 from PySpark Standalone Cluster

Siva Chaitanya
May 22, 2018

Before you proceed, ensure that you have installed and configured PySpark and Hadoop correctly; to cross-check, you can visit this link. Reading from S3 on AWS EMR is quite simple, but that is not the case on a standalone cluster. The information is available on the web, but it is scattered across multiple posts, so I am compiling it here to make it easier to follow.

Scope of this article:

Using s3a to read: Currently, there are three URI schemes for reading files from S3: s3, s3n and s3a. This post deals with s3a only, as it is the fastest of the three; note that the older s3 scheme will not be available in future releases.

v4 authentication: AWS S3 supports two signature versions, v2 and v4. Newer regions such as Mumbai (ap-south-1) and Frankfurt (eu-central-1) support v4 only, so the cluster must be configured for v4 signing. A short sketch after this list illustrates both points.
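As a minimal sketch (not from the original post) tying the two points together: the bucket and file names are placeholders, and the exact endpoint strings are assumptions you should verify for your region.

# Hypothetical paths, shown only to illustrate the three URI schemes.
path_s3 = "s3://my-bucket/data/file.csv"    # legacy scheme, being phased out
path_s3n = "s3n://my-bucket/data/file.csv"  # older native scheme
path_s3a = "s3a://my-bucket/data/file.csv"  # recommended: fastest of the three

# v4-only regions need a region-specific endpoint (used for fs.s3a.endpoint below).
endpoints = {
    "ap-south-1": "s3-ap-south-1.amazonaws.com",    # Mumbai
    "eu-central-1": "s3-eu-central-1.amazonaws.com",  # Frankfurt
}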

Importing the necessary modules in PySpark:

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkConf

Attributes to be added to SparkConf:

conf = (
    SparkConf()
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
)

Without these JVM options, the job will not run in cluster mode. If you are running in local mode, you need not specify them.
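As a quick sanity check (not part of the original post), you can confirm the options made it into the configuration before creating the context:

# Both should print the -Dcom.amazonaws.services.s3.enableV4=true option set above.
print(conf.get("spark.driver.extraJavaOptions"))
print(conf.get("spark.executor.extraJavaOptions"))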

Setting up SparkContext:

scT = SparkContext(conf=conf)
scT.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")

Hadoop needs to be installed on the master and all worker nodes of the cluster, and its configuration needs to be set as follows:

hadoopConf = scT._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "XXXXX")
hadoopConf.set("fs.s3a.secret.key", "XXXXX")
hadoopConf.set("fs.s3a.endpoint", "s3-ap-south-1.amazonaws.com")
hadoopConf.set("com.amazonaws.services.s3.enableV4", "true")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
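If you prefer, the same settings can be supplied up front when building the SparkConf, since Spark copies any property prefixed with spark.hadoop. into the Hadoop configuration. This variant is not from the original post; it simply reuses the placeholder values above and avoids reaching into scT._jsc.

# Sketch: build one SparkConf carrying both the JVM options and the S3A settings.
conf = (
    SparkConf()
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .set("spark.hadoop.fs.s3a.access.key", "XXXXX")
    .set("spark.hadoop.fs.s3a.secret.key", "XXXXX")
    .set("spark.hadoop.fs.s3a.endpoint", "s3-ap-south-1.amazonaws.com")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
)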

The preferred way is to read the AWS credentials already set up through awscli/Hadoop and retrieve them in your script, rather than hard-coding them. Refer to this link for the steps.
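As one hedged sketch of that approach, assuming awscli has written a default profile to ~/.aws/credentials, you can parse that file with Python's standard library instead of embedding the keys in the script:

import configparser
import os

# Read the credentials that `aws configure` wrote to ~/.aws/credentials
# (the [default] profile is assumed here).
parser = configparser.ConfigParser()
parser.read(os.path.expanduser("~/.aws/credentials"))
access_key = parser["default"]["aws_access_key_id"]
secret_key = parser["default"]["aws_secret_access_key"]

hadoopConf.set("fs.s3a.access.key", access_key)
hadoopConf.set("fs.s3a.secret.key", secret_key)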

sql = SparkSession(scT)
csv_df = sql.read.csv("s3a://bucket/file.csv")
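A short follow-up usage example; the reader options and the output path are illustrative assumptions, not from the original post.

# Read with a header row and inferred column types, inspect, then write back to S3.
csv_df = sql.read.csv("s3a://bucket/file.csv", header=True, inferSchema=True)
csv_df.printSchema()
csv_df.show(5)
csv_df.write.mode("overwrite").parquet("s3a://bucket/output/file_parquet")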

This should help you use your standalone cluster to read files from S3.
