How to read JSON files from S3 using PySpark and the Jupyter notebook

Bogdan Cojocar
1 min read · Apr 30, 2018

This is a quick, step-by-step tutorial on how to read JSON files from S3 using PySpark.

The prerequisites for this guide are PySpark and Jupyter installed on your system. Please follow this Medium post on how to install and configure them.

Step 1

First, we need to make sure the Hadoop AWS package (hadoop-aws) is available when we load Spark, so we set PYSPARK_SUBMIT_ARGS before the Spark session is created:

import os

# Pull in the hadoop-aws package so Spark can talk to S3.
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"

Step 2

Next, we need to make PySpark available in the Jupyter notebook:

import findspark
findspark.init()
from pyspark.sql import SparkSession

Step 3

We need the AWS credentials in order to access the S3 bucket. We can use the configparser package to read them from the standard AWS credentials file (~/.aws/credentials).

import configparser

aws_profile = "default"  # profile name in ~/.aws/credentials; "default" is assumed here, change if needed
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")
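For reference, the credentials file read above typically follows the standard AWS format shown below; the default profile name is assumed and the key values are placeholders:

[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY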

Step 4

We can start the Spark session and then pass the AWS credentials to the Hadoop configuration:

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
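From here, the credentials can be handed to Hadoop's S3A connector and the JSON file read straight from the bucket. A minimal sketch, assuming the hadoop-aws package from Step 1 is on the classpath; the bucket name and path are placeholders:

# Hand the credentials from Step 3 to the S3A filesystem used by hadoop-aws.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)

# Read the JSON file into a DataFrame (bucket and path are placeholders).
df = spark.read.json("s3a://my-bucket/path/to/file.json")
df.show()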
