Setting and accessing environment variables in Spark jobs on EMR

didier deshommes
Aug 25, 2017 · 1 min read

Our use case: we want to access a custom environment variable, set earlier, in our Spark job. This seemed like it should be simple, but I was surprised by how involved it got. In our case, the Spark job used this environment variable to access some paths protected by a secret key.
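To make the goal concrete, here is a minimal PySpark sketch of the kind of access we wanted. This is illustrative, not our actual code; note that the variable has to be visible on both the driver and the executors:

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("secret-key-demo").getOrCreate()

# On the driver: read the secret from the environment.
secret_key = os.environ["SECRET_KEY"]

# On the executors: the variable must be visible there too, e.g. when a
# task reads it inside a function shipped to the cluster.
def key_visible(_):
    return os.environ.get("SECRET_KEY") is not None

print(spark.sparkContext.parallelize([1]).map(key_visible).collect())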

First, let’s go through the options that didn’t work for us:

  • Setting the environment variable at bootstrap time, i.e. export SECRET_KEY=xxxx. This looks like it should work, but doesn't, probably because Hadoop/Spark jobs are launched in fresh processes that don't inherit that shell environment.
  • Submitting the Spark job with spark.executorEnv.SECRET_KEY=xxxx and spark.yarn.appMasterEnv.SECRET_KEY=xxxx. These options are specified in the documentation, so it looks like they should work (the submission looked roughly like the sketch below). Unfortunately we struck out here also. There's even a Stack Overflow thread about it.
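For reference, this is roughly how we passed those options; the job script name is a placeholder:

spark-submit \
  --conf spark.executorEnv.SECRET_KEY=xxxx \
  --conf spark.yarn.appMasterEnv.SECRET_KEY=xxxx \
  my_job.py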

Here’s what ended up working for us: at cluster creation time, set SECRET_KEY=xxxx in both hadoop-env.sh and spark-env.sh. You can do this using the EMR configuration API. The JSON request looks like this:

[ { "Classification": "hadoop-env", 
"Properties": { },
"Configurations": [
{ "Classification": "export",
"Properties": {
"SECRET_KEY": "xxxx" },
"Configurations": [ ] } ]},
{ "Classification": "spark-env",
"Properties": { },
"Configurations": [
{ "Classification": "export",
"Properties": {
"SECRET_KEY": "xxxx", },
"Configurations": [ ] } ]}]

The obvious drawback of this approach is that to change the value of this variable, we’d have to log into each node and edit spark-env.sh and hadoop-env.sh.
