What a simple task: accessing data on S3 through PySpark while assuming an AWS role. Well, I found that it was not that straightforward, due to the Hadoop dependency versions that are commonly used by all of us Spark users. I decided to write up a tutorial that will hopefully help the data engineers and architects out there who are enduring the same struggles I went through.
This all started when a data scientist from my company asked me for assistance with accessing data on S3 using PySpark. Easy enough, right? Well, I made the mistake of telling him "No problem, we can solve that within the hour". For all you new engineers in the IT field: never make a promise with a timeline attached. You will find yourself awake at 1 in the morning, unable to sleep, typing up a Medium article, because there is so much adrenaline running through your body from finally solving the problem.
There was a lot more to it than simply accessing data on S3 using PySpark, and I had completely overlooked those variables. For one, my organization has multiple AWS accounts, and we have been pretty good about following best practices: each account has roles defined that control access to its resources, and those roles can be assumed if you are granted access to do so. However, accessing data in S3 by assuming a role is a little different from just submitting your access key and secret key. Add Spark to the mix and you now have another application whose configuration you need to respect.
There are a lot of variables at play when dealing with Spark, AWS, and Hadoop as it is. The error messages we receive are not always clear, leaving us chasing solutions that are irrelevant to our problem. You will find yourself googling the problem over and over until every link is purple and you have no idea what to do next. So let's work through this together. I want to explain it in detail, because I believe understanding the solution also helps you understand how these complex libraries actually work. To an extent, that is.
Alright, let's lay out the problems I faced. The first is with Hadoop 2.7.3: it is the default version packaged with Spark, but unfortunately using temporary credentials to access S3 over the S3A protocol was not supported until Hadoop 2.8.0.
You can try to include the credentials in the URL (don't do this anyway) or even set them as environment variables, but it will not work. I suspect that temporary credentials retrieved by assuming a role are handled differently on the back end than the regular access keys we can create on AWS for our individual accounts. So let's just use a later version of Hadoop, right? Correct.
The next problem I encountered was the need to use an aws-java-sdk version that matches the Hadoop version being used. This is not so much a problem as a gotcha: it might be obvious to most, but I need to include it here in case someone missed it. Get it wrong and you will receive various ClassNotFoundExceptions with no straightforward explanation of how to solve them. I will explain how to figure out the correct version below.
The next problem is installing PySpark. A common way to install PySpark is with pip install pyspark. We do it this way because we are usually developing within an IDE and want to be able to import the package easily. Unfortunately, installing Spark this way leaves us a little limited.
While we are going to enable accessing data from S3 using Spark running locally in this example, be very careful about which data you choose to pull to your machine. We should not be pulling anything sensitive to our local machines. Whether you use this guide or not, only work with dev or non-sensitive data this way.
At a high level, the solution is to install the Spark build that expects user-provided Hadoop libraries, and then to place the dependency jars alongside the installation manually. Once we do that, we will have upgraded to a Hadoop version that can use temporary credentials with the S3A protocol.
First, let’s install Hadoop on our machine. I am running on a Mac so I used Homebrew to install Hadoop:
brew install hadoop
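Once Homebrew finishes, it is worth confirming which version you actually got, since we will need to match it against other dependencies later (mine was 3.1.2; yours may differ):

```shell
# Print the installed Hadoop version; note it down for
# picking the matching hadoop-aws jar later on.
hadoop version
```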
Next we need to get Spark installed the correct way. Doing a pip install of PySpark does not give us a version of Spark that allows us to provide our own Hadoop libraries. It's not impossible to upgrade the versions in place, but it can cause issues if not everything gets upgraded consistently. Fortunately, Spark offers a pre-built package with user-provided Hadoop libraries. We can find it here: https://spark.apache.org/downloads.html
Extract the contents to a directory of your choosing.
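For reference, the download-and-extract step looks roughly like this. The version and paths here are assumptions from my setup (Spark 2.4.3, extracted to ~/Downloads), so adjust them to whatever release and directory you chose:

```shell
cd ~/Downloads
# The "without-hadoop" build expects you to supply your own Hadoop classpath
curl -LO https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-without-hadoop.tgz
tar -xzf spark-2.4.3-bin-without-hadoop.tgz
```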
Next we need to configure the following environment variables so that Spark, Hadoop, and our shell all know where to find each other.
In a terminal window you can simply run the following commands, but you will end up having to do it for each new terminal window. To make the environment variables more permanent, add these three lines to your bash profile.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
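For completeness, the three lines in my bash profile look like the following. The SPARK_HOME path is an assumption based on extracting the tarball to ~/Downloads; point it at wherever your Spark installation actually lives:

```shell
# Assumed install location; change to match your extraction directory
export SPARK_HOME=~/Downloads/spark-2.4.3-bin-without-hadoop
export PATH=$SPARK_HOME/bin:$PATH
# Let the "without-hadoop" Spark build find our Homebrew Hadoop classes
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```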
Nice, now Spark and Hadoop are installed and configured. If we look in ~/Downloads/spark-2.4.3-bin-without-hadoop/jars, we will notice there are no Hadoop jars, because Spark is referencing our separate Hadoop installation for all of those classes.
However, we are missing hadoop-aws and its dependencies.
So let's go download them. But how do we know which versions we need? We head over to https://mvnrepository.com/ and search for hadoop-aws. There will be many versions; choose one at 2.8.0 or later. I chose 3.1.2 in my example, as that was the version of Hadoop I installed with Homebrew.
When we look at hadoop-aws on mvnrepository, we will notice this dependency listed with the version number:
Great, so now we know which version of aws-java-sdk-bundle the hadoop-aws library depends on. Let's go get that as well. You can simply click on View Files to manually download these two jars:
Now, let’s place them in the jars directory of our spark installation:
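If you would rather script this than click View Files, something like the following works. The 1.11.271 version is what hadoop-aws 3.1.2 listed for aws-java-sdk-bundle at the time; double-check it on mvnrepository against whatever hadoop-aws version you picked, and adjust the Spark path to your installation:

```shell
# Download both jars straight into Spark's jars directory
cd ~/Downloads/spark-2.4.3-bin-without-hadoop/jars
curl -LO https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.2/hadoop-aws-3.1.2.jar
curl -LO https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar
```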
At this point, we have installed Spark 2.4.3, Hadoop 3.1.2, and Hadoop AWS 3.1.2 libraries. We can now start writing our code to use temporary credentials provided by assuming a role to access S3.
For this example, we will start pyspark in a terminal and write our code there. If you opened a new window, don't forget your environment variables will need to be set again.
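Launching the shell looks like this (the Spark path is again an assumption from my setup; if you added it to your PATH via a bash profile, plain `pyspark` also works):

```shell
# Needed again in any fresh terminal window
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
~/Downloads/spark-2.4.3-bin-without-hadoop/bin/pyspark
```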
I have an AWS profile with access to the account that contains my user, but the data I need is on a different account. I have access to assume a role on that account that has permissions to access the data.
Here is the code snippet in a text editor (I'll post the code below to make it copy-paste friendly):
As you can see, we create a session using our user profile. Then we assume the role. We then tell Hadoop that we are going to use TemporaryAWSCredentialsProvider and pass in our AccessKeyId, SecretAccessKey, and SessionToken.
Running the code above gives us our beautiful dev Dataframe containing non-sensitive data:
Now that we have it working, it feels like it was not that difficult a task. Technically, it really wasn't: we simply made sure we were using the correct versions of the dependencies we were leveraging. However, at the time of writing this article/guide, I could not find a detailed walkthrough that explained all of the steps needed to make this work. I could find snippets here and there that explained certain sections, but nothing complete. There are so many different versions and configurations out there that you can actually do more damage than good when making changes. Hopefully this helps others out there who are trying to do something similar.
import boto3
session = boto3.session.Session(profile_name='MyUserProfile')
sts_connection = session.client('sts')
response = sts_connection.assume_role(RoleArn='ARN_OF_THE_ROLE_TO_ASSUME', RoleSessionName='THIS_SESSIONS_NAME', DurationSeconds=3600)
credentials = response['Credentials']
url = 's3a://bucket/key/data.csv'
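The lines above only cover the assume-role half. The Hadoop configuration half described earlier looks roughly like this sketch; it assumes you are inside the pyspark shell (so `spark` already exists) and that the bucket/key in `url` is a placeholder for your own dev data:

```python
# Assumes the pyspark shell, where `spark` is already defined,
# and that `credentials` and `url` come from the snippet above.
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.aws.credentials.provider',
                'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
hadoop_conf.set('fs.s3a.access.key', credentials['AccessKeyId'])
hadoop_conf.set('fs.s3a.secret.key', credentials['SecretAccessKey'])
hadoop_conf.set('fs.s3a.session.token', credentials['SessionToken'])

# Read the CSV over S3A using the temporary credentials
df = spark.read.csv(url, header=True)
df.show()
```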