Use IBM Data Science Experience to Read and Write Data Stored on Amazon S3

In this post, we will walk through how to read data from and write data to Amazon S3 using Python 2 with Spark 2.0; Scala has a similar API. You can use the code below in IBM Data Science Experience notebooks.

Prerequisite

You have an Amazon S3 account with credentials generated: you will need the Access Key ID and Secret Access Key from your AWS account.

Set your S3 credentials in the SparkContext's hadoopConfiguration so that Spark can use them when reading and writing data over the s3a protocol.

# Replace the placeholders with your Amazon Access Key ID and Secret Access Key
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "<put-your-access-key>")
hconf.set("fs.s3a.secret.key", "<put-your-secret-key>")
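
If your bucket lives in a region that uses a non-default S3 endpoint, you may also need to point s3a at that endpoint. This step is optional, and the region value below is a placeholder you would fill in for your own bucket.

# Optional: only needed if your bucket's region requires a non-default endpoint,
# e.g. s3.eu-west-1.amazonaws.com. <region> is a placeholder.
hconf.set("fs.s3a.endpoint", "s3.<region>.amazonaws.com")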

Reading from Amazon S3 Bucket

Use the SparkSession API introduced in Spark 2.0 to create your Spark session. The spark.read API reads CSV, Parquet, and other file types supported by Spark into a DataFrame, and the load method takes the path to the file in your S3 bucket. Replace your-bucket-name, foldername, and filename to match your bucket. Once the DataFrame is created, test it by checking the first 5 rows.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_data_1 = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load('s3a://<your-bucket-name>/<foldername>/<filename>.csv')
df_data_1.take(5)
[Row(id=u'10001', name=u'Tony'), Row(id=u'10002', name=u'Mike'), Row(id=u'10003', name=u'Pat'), Row(id=u'10004', name=u'Chris'), Row(id=u'10005', name=u'Paco')]

We can also check the resulting schema using printSchema().

df_data_1.printSchema()
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
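
By default the CSV reader treats every column as a string, which is why id appears as a string above. If you want Spark to infer numeric types, you can pass the inferSchema option. A minimal sketch, using the same placeholder path as before:

df_typed = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .load('s3a://<your-bucket-name>/<foldername>/<filename>.csv')
df_typed.printSchema()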

Writing to Amazon S3 Bucket

Use the same DataFrame to write to another bucket on Amazon S3, but instead of saving it as CSV, save it as Parquet, which is the default format for DataFrame writes. The path format is the same.

df_data_1.write.save("s3a://charlesbuckets31/FolderB/users.parquet")
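
If you rerun the notebook, the write will fail because the target path already exists unless you set a save mode. A hedged sketch with an explicit format and overwrite mode, reusing the example path above:

df_data_1.write\
  .format('parquet')\
  .mode('overwrite')\
  .save('s3a://charlesbuckets31/FolderB/users.parquet')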

Read the Parquet file back to confirm that the write was successful and the data is intact.

df_data_2 = spark.read\
  .format('org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat')\
  .option('header', 'true')\
  .load('s3a://charlesbuckets31/FolderB/users.parquet')
df_data_2.take(5)
[Row(id=u'10001', name=u'Tony'), Row(id=u'10002', name=u'Mike'), Row(id=u'10003', name=u'Pat'), Row(id=u'10004', name=u'Chris'), Row(id=u'10005', name=u'Paco')]
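
As a further sanity check, you can compare row counts between the original and reloaded DataFrames and inspect the reloaded schema. A small sketch:

# Both counts should match if the round trip preserved all rows.
print(df_data_1.count())
print(df_data_2.count())
df_data_2.printSchema()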

Reference to complete notebook


Originally published at datascience.ibm.com on January 17, 2017.
