GCS Authentication Using Apache Hadoop Credential Provider in Dataproc

Jordan Hambleton
Google Cloud - Community
6 min read · May 18, 2020

This post provides an overview and example of GCS authentication using the Apache Hadoop Credential Provider with service account credentials in Dataproc. With this technique, Hadoop applications can seamlessly retrieve secrets from credential files for a service account that has the required access to datasets in Cloud Storage.

Overview

The GCS connector provides interoperability for Dataproc Hadoop applications (e.g., Spark, MapReduce, Hive) to access datasets stored in external GCS buckets. By default, authentication and access to GCS is performed not as the Hadoop user submitting the job, but as the host service account of the VM instances running the Dataproc cluster.
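
To see which identity is used by default, you can inspect the service account attached to the cluster's VM instances. A quick check against the dataproc-015 cluster used later in this post (the field may be empty when the default Compute Engine service account is in use):

# show the host service account of the cluster's VM instances
gcloud dataproc clusters describe dataproc-015 \
--format="value(config.gceClusterConfig.serviceAccount)"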

The Apache Hadoop Credential Provider provides a mechanism to store and retrieve secrets in encrypted files (e.g., JCEKS files) on a local or Hadoop filesystem. The GCS connector, in turn, can retrieve a securely stored credential from the provider and use it to authenticate when a Hadoop application accesses GCS.
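
The credential provider is managed with the hadoop credential command line utility. A minimal sketch of the pattern, using a hypothetical alias and HDFS path (the exact aliases the GCS connector expects are shown later in this post):

# store a secret under an alias in an encrypted JCEKS file in HDFS (hypothetical alias/path)
hadoop credential create my.secret.alias \
-provider jceks://hdfs/path/to/secrets.jceks \
-value "my-secret-value"
# list the aliases stored in the file; secret values are not printed
hadoop credential list -provider jceks://hdfs/path/to/secrets.jceks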

[diagram 1 — high-level flow]

The above diagram shows two flows that distinguish how a Spark application's data access to GCS is authenticated: 1) through the cluster's default host service account, and 2) through service account secrets stored in the Hadoop Credential Provider. In this example, the default service account sa-dataproc-instance is not granted an IAM role to access data-bucket-a, but sa-data-access-a is, and can therefore access the data required to execute the job. Sequence 1 is the default flow when submitting jobs and accessing datasets stored in GCS. In sequence 2, the Spark application defines hadoop.security.credential.provider.path; when the application requests to read data from the Cloud Storage bucket, the GCS connector retrieves the service account credentials from the credential provider and authenticates with them to access data-bucket-a.

Later in this blog, we will walk through the exact steps to create the sa-data-access-a service account, store its credentials in an encrypted JCEKS file, and use it to access Google Cloud Storage.

Cluster Security Design

It is important to implement security measures to protect credentials that provide access to GCS data. In this section, we review high-level principles for locking these credentials down in HDFS.

Java KeyStore (JCEKS) files hold encrypted service account credentials and are created with the Hadoop credential utility. The utility can create the encrypted secrets directly in HDFS, where the file is written with default permission 600. HDFS permissions and ACLs should also be set on the file and its parent directory so that only the Hadoop users who need this credential can read it.

If a Dataproc cluster is used by a single tenant, standard GCP perimeter security measures may be implemented at the project and cluster level to prevent unwarranted access to the cluster and its secrets. This means granting access to the project and the Dataproc cluster only to users who should have access to the credentials. Additionally, ensure that the Dataproc Jobs API and the Hadoop APIs (such as the HDFS and YARN endpoints, including WebHDFS and the HTTP REST APIs) are similarly accessible only to users authorized to use the underlying credentials.

If a Dataproc cluster has multiple tenants, operators should not only ensure that OS Login is enabled to manage SSH access and sudo restrictions, but also enable Hadoop Secure Mode via Kerberos to enforce authentication of Hadoop users. With authenticated Hadoop users, the HDFS permissions and ACLs applied to the encrypted JCEKS files are enforced against the right identities. Without Kerberos authentication, Hadoop users can easily impersonate other users and bypass access controls.
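
As a sketch, both controls can be requested at cluster creation time; the flags and metadata below are one possible configuration and may need to be adapted (or supplemented with a full Kerberos configuration) for your environment:

# create a cluster with OS Login enabled on its VMs and Hadoop Secure Mode (Kerberos)
gcloud dataproc clusters create dataproc-015 \
--metadata enable-oslogin=true \
--enable-kerberos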

Securing JCEKS files with HDFS permissions and ownership ensures that only authorized users can use the credentials when accessing GCS. Additional authorization mechanisms such as HDFS ACLs and Apache Ranger can be layered on for further protection.

* HDFS directory and file permissions/ownership can be enabled by setting dfs.permissions.enabled to true in hdfs-site.xml.

* HDFS ACLs extend the permissions/ownership model with the ability to apply additional permissions for other users and groups on the same files/directories. ACLs can be enabled by setting dfs.namenode.acls.enabled to true in hdfs-site.xml (see the example just after this list).
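
On Dataproc, a sketch of setting both hdfs-site.xml properties at cluster creation time via cluster properties (the hdfs: prefix maps to hdfs-site.xml; these can be combined with the flags shown earlier):

# set hdfs-site.xml properties when creating the cluster
gcloud dataproc clusters create dataproc-015 \
--properties="hdfs:dfs.permissions.enabled=true,hdfs:dfs.namenode.acls.enabled=true"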

Lastly, as an additional best practice, service account keys can be rotated on a periodic basis. This can be done simply by creating a new key for the same service account, generating and deploying a new JCEKS file with the updated credentials, and removing the old service account key after all Hadoop applications have switched over to use the latest credential file.
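
A minimal rotation sketch, assuming the service account and JCEKS layout used later in this post (the key ID and file names are illustrative):

# 1. create a new key for the same service account
gcloud iam service-accounts keys create ~/sa-data-access-a-key-new.json \
--iam-account sa-data-access-a@jh-data-sandbox.iam.gserviceaccount.com
# 2. generate a new JCEKS file from the new key (same hadoop credential create
#    commands as in step 2 below) and point applications at the new file
# 3. after all applications have switched over, delete the old key
gcloud iam service-accounts keys delete OLD_KEY_ID \
--iam-account sa-data-access-a@jh-data-sandbox.iam.gserviceaccount.com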

Creating and Using Credentials

This section walks through the following steps:

  1. Create service account sa-data-access-a
    * Authorize access through IAM to GCS data-bucket-a
  2. Create JCEKS encrypted file with service account credentials using the Hadoop Credential Utility
    * Secure JCEKS file in HDFS
  3. Use the JCEKS credentials to access GCS with the Hadoop fs command, MapReduce, Spark, or Hive applications.

1. Create Service Account and Authorize Access

In this first step, we create a service account and apply an IAM role for it to access the GCS bucket data-bucket-a.

Create service account and JSON key

PROJECT_ID=jh-data-sandbox
SERVICE_ACCOUNT=sa-data-access-a
gcloud iam service-accounts create ${SERVICE_ACCOUNT} \
--description="sa-data-access-a description" \
--display-name=${SERVICE_ACCOUNT}
gcloud iam service-accounts keys \
create ~/${SERVICE_ACCOUNT}-key.json \
--iam-account ${SERVICE_ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com
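
The JCEKS entries created in step 2 are built from fields of this downloaded JSON key: client_email, private_key_id, and private_key. A quick way to inspect them, assuming jq is available:

# print the key fields that will be stored in the JCEKS file
jq -r '.client_email, .private_key_id' ~/${SERVICE_ACCOUNT}-key.json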

Authorize Access by applying IAM permissions

Grant the service account the storage.buckets.get permission (via roles/storage.legacyBucketReader) on the Dataproc cluster's GCS bucket:

BUCKET=`gcloud dataproc clusters describe dataproc-015 --format="value(config.configBucket)"`
gsutil iam ch serviceAccount:${SERVICE_ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com:roles/storage.legacyBucketReader gs://${BUCKET}

Grant the service account access to the other GCS buckets required for data access, in this case data-bucket-a:

gsutil iam ch serviceAccount:${SERVICE_ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com:roles/storage.legacyBucketWriter gs://data-bucket-a
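
To confirm the bindings, the bucket-level IAM policy can be inspected:

# verify the service account appears in the bucket's IAM bindings
gsutil iam get gs://data-bucket-a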

2. Create Hadoop Credential JCEKS File

# store the service account email, private key id, and private key
# (these values come from the client_email, private_key_id, and private_key
#  fields of the JSON key created in step 1)
hadoop credential create fs.gs.auth.service.account.email \
-provider jceks://hdfs/app/client-app-a/sa-data-access-a.jceks \
-value "sa-data-access-a@jh-data-sandbox.iam.gserviceaccount.com"
hadoop credential create fs.gs.auth.service.account.private.key.id \
-provider jceks://hdfs/app/client-app-a/sa-data-access-a.jceks \
-value "0a4e5cb521e2b7c75d082d7069f1cff75071f814"
hadoop credential create fs.gs.auth.service.account.private.key \
-provider jceks://hdfs/app/client-app-a/sa-data-access-a.jceks \
-value "-----BEGIN PRIVATE KEY-----\n redacted \n-----END PRIVATE KEY-----\n"
# apply HDFS ownership to appropriate users and groups requiring access
hadoop fs -chown -R ${USER}:${USER} /app/client-app-a
hadoop fs -chmod 500 /app/client-app-a
hadoop fs -chmod 400 /app/client-app-a/sa-data-access-a.jceks
# apply HDFS ACLs (extended acls for advanced setup) for read permission for users and groups requiring access
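
A sketch of the optional ACL step, using a hypothetical user analyst-a that should also be able to read the credential (adjust the principal to your environment):

# grant read access to an additional user via HDFS ACLs
hdfs dfs -setfacl -m user:analyst-a:r-x /app/client-app-a
hdfs dfs -setfacl -m user:analyst-a:r-- /app/client-app-a/sa-data-access-a.jceks
# review the effective ACLs on the credential file
hdfs dfs -getfacl /app/client-app-a/sa-data-access-a.jceks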

3. Use Secrets from the Credential Provider with the GCS Connector

Hadoop client

hadoop fs -Dhadoop.security.credential.provider.path=jceks://hdfs/app/client-app-a/sa-data-access-a.jceks -ls gs://data-bucket-a/

MapReduce Job

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen -Dhadoop.security.credential.provider.path=jceks://hdfs/app/client-app-a/sa-data-access-a.jceks 100000 gs://data-bucket-a/write-data-test/

Spark

spark-submit --class org.apache.spark.examples.DFSReadWriteTest \
--master yarn \
--deploy-mode cluster \
--num-executors 3 \
--conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/app/client-app-a/sa-data-access-a.jceks \
/usr/lib/spark/examples/jars/spark-examples.jar \
/var/log/google-dataproc-agent.0.log \
gs://data-bucket-a/test-dfs-read-write/

Hive

# hive - table x is an external table stored on gs://data-bucket-a/x
hive --hiveconf hadoop.security.credential.provider.path=jceks://hdfs/app/client-app-a/sa-data-access-a.jceks -e 'select count(1) from x;'
# beeline
beeline -u "jdbc:hive2://dataproc-015-m:10000/default;principal=hive/dataproc-015-m@DATAPROC-015.ACME.COM" --hiveconf hadoop.security.credential.provider.path=jceks://hdfs/app/client-app-a/sa-data-access-a.jceks -e 'select count(1) from x;'

Summary

In conclusion, we walked through an example of using the Apache Hadoop Credential Provider to seamlessly access encrypted service account credentials stored in Java KeyStore files when executing Hadoop applications. By securing the JCEKS files in HDFS with permissions and enforcing authentication through Kerberos, only the applications and users permitted to use the secrets can access the corresponding datasets in Cloud Storage. This technique provides an additional option for authenticating to GCS with credentials other than the cluster's default service account.
