Reading and Writing Files with Scala Spark and Google Cloud Storage

Fari Qodri
Published in Holy Dev · Sep 15, 2022
Google Cloud Storage and Apache Spark

HDFS has been the main big data storage tool for years. Its distributed nature makes it easy to scale horizontally: you only need to add more servers to accommodate a growing volume of data. However, as your data grows, so does your server fleet, and managing a large number of servers is no easy task. As with any distributed system, network issues occasionally occur between the servers, and anticipating them takes extra effort. Needless to say, managing lots of HDFS servers is expensive and difficult for small and medium projects, or even big projects with limited resources.

One alternative that can overcome this issue is an object storage service, such as Google Cloud Storage (GCS). With GCS, you don’t need to manage any HDFS servers while still getting the high availability that Google promises (99.95% availability in its SLA). Apart from that, GCS is also compatible with the Hadoop ecosystem, including Spark and Hive. In this article, I will give you an example of how to read and write files in GCS with Spark. The environment I used for this tutorial is the following:

  1. Spark 2.4.8 with Hadoop 2.7
  2. Scala 2.11
  3. WSL 2 Ubuntu
  4. Gradle as the build tool

Setting Up Your Google Cloud

The first step is to set up the GCS bucket and the service account that the Spark app will use to access the bucket.

First of all, you need to create the service account. You can do so from the Service Accounts page in the GCP IAM service. You can choose any service account name and ID and skip the access-granting steps. After the service account is created, you need to get its JSON key. You can do so by going to the Keys tab inside the service account, creating a new key, choosing JSON, and downloading the service account key. After this step, you should have the access key that your Spark app will use to access the GCS bucket.
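
If you prefer the command line over the console, roughly the same two steps can be sketched with the gcloud CLI. The service account name and project ID below are placeholders of my own, not values from the original setup, so replace them with yours.

# Create the service account (name and project ID are placeholders).
gcloud iam service-accounts create spark-gcs-demo --project=my-project-id

# Create and download a JSON key for the new service account.
gcloud iam service-accounts keys create spark-gcs-key.json \
  --iam-account=spark-gcs-demo@my-project-id.iam.gserviceaccount.com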

The second thing you need to do is create the GCS bucket. You can follow the details in the image below, changing only the name and region.

Create GCS Bucket
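
For reference, the bucket can also be created from the command line with gsutil. The bucket name and region below are placeholders, so adjust them to match what you chose in the console.

# Create the bucket (placeholder name and region).
gsutil mb -l asia-southeast2 gs://my-spark-gcs-bucket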

After the bucket is created, you need to allow the service account to access the objects in the bucket. You can do so by going to the Permissions tab and adding a permission: copy the service account email and paste it into the New Principals field, select Storage Object Admin as the Role, and save the permission. This role gives the service account full control of the objects in this bucket only. After these two steps, you are pretty much ready to create your Spark application.

Service Account Bucket Permission
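
The same permission can be granted from the command line as well; here is a sketch using the placeholder service account and bucket names from the earlier snippets.

# Grant the service account the Storage Object Admin role on this bucket only.
gsutil iam ch \
  serviceAccount:spark-gcs-demo@my-project-id.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://my-spark-gcs-bucket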

Setting Up Your Spark Application

I have written a simple Spark application example that demonstrates reading and writing a file in GCS. For the CSV dataset, I used a simple films dataset from this dataset repository.

As you can see in the code, there is nothing special about this Spark app. The only indication that it reads and writes files in GCS is the “gs://” prefix in the file paths, instead of the usual “hdfs://” or “file://” prefix.
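
Since the original gist is not reproduced here, below is a minimal sketch of what such an app looks like. The class name matches the spark-submit command later in the article, but the bucket name, file names, and CSV options are illustrative assumptions rather than the exact original code.

package com.example

import org.apache.spark.sql.SparkSession

object MainApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-gcs-example")
      .getOrCreate()

    // Read the films CSV straight from the bucket; only the "gs://" scheme
    // distinguishes this from reading a local or HDFS file.
    val films = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("gs://my-spark-gcs-bucket/films.csv")

    films.show(10)

    // Write the same data back to the bucket, this time as Parquet.
    films.write
      .mode("overwrite")
      .parquet("gs://my-spark-gcs-bucket/output/films")

    spark.stop()
  }
}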

Submitting this Spark app requires two things: the service account key JSON file and the GCS Hadoop connector JAR. We already have the first item from the previous step. As for the JAR, you can refer to this link from Google and download the JAR that matches your Hadoop version.
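
If you prefer not to download the JAR manually, another option (mentioned again below) is to bundle the connector into your application JAR through Gradle. The Maven coordinates below are my assumption based on the shaded JAR name used in the next section, so verify them against Maven Central before relying on them.

dependencies {
    // Shaded GCS connector for Hadoop 2 (coordinates assumed; please verify).
    implementation "com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.2.7:shaded"
}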

After you have the JSON key and the JAR file, you can execute a spark-submit command like the following example:

spark-submit \
  --class "com.example.MainApp" \
  --master "local[*]" \
  --conf spark.executor.extraClassPath=/home/fariqodri/Experiment/gcs-connector-hadoop2-2.2.7-shaded.jar \
  --conf spark.driver.extraClassPath=/home/fariqodri/Experiment/gcs-connector-hadoop2-2.2.7-shaded.jar \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/home/fariqodri/Experiment/data-experiment-21-418b180d6b34.json \
  --conf spark.hadoop.fs.gs.implicit.dir.repair.enable=false \
  build/libs/spark_gcs.jar

In the command, you can see that we include the JAR file in both the driver and executor classpaths. You can include the JAR file in the classpath another way, such as putting it in the Spark classpath directly or bundling it into your Spark app JAR. You can also see that we pass the service account key file through the Spark configuration. Alternatively, you can provide it through an environment variable, as in the example below.

export GOOGLE_APPLICATION_CREDENTIALS="/home/fariqodri/Experiment/data-experiment-21-418b180d6b34.json"

The “spark.hadoop.fs.gs.implicit.dir.repair.enable” configuration is not required, but I disable it anyway because I observed that it prevented the Spark application from shutting down even after all the jobs were completed. You can leave it enabled to see the behavior yourself. The full configuration is well documented on this page, so you may want to check it out yourself.
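
If you would rather not pass these settings through spark-submit at all, the same properties can also be set when the session is built. This is only a sketch that slots into the MainApp example above, reusing the key file path from the command; treat the exact set of properties as an assumption and check it against the connector documentation.

import org.apache.spark.sql.SparkSession

// Setting the GCS-related properties in code instead of via --conf flags.
val spark = SparkSession.builder()
  .appName("spark-gcs-example")
  .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
    "/home/fariqodri/Experiment/data-experiment-21-418b180d6b34.json")
  .config("spark.hadoop.fs.gs.implicit.dir.repair.enable", "false")
  .getOrCreate()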

Conclusion

The steps that we have walked through can be summarized as follows:

  1. Create a GCS bucket and a GCP service account.
  2. Allow the service account to access objects in the bucket by adding the service account as a principal with an adequate role.
  3. Create and download the service account key.
  4. Select and download the suitable GCS Hadoop connector JAR.
  5. Submit your Spark application by including the service account key file and the connector JAR.

In the next article, I will try something a little more complex by adding Hive into the equation. I will try to create a Spark application that writes and reads data from a Hive table, while the Hive table itself stores its data in a GCS bucket. Stay tuned for the next update!
