Unit-testing AWS S3-integrated Scala / Spark components using local S3 mocking tools

Branden Smith · 7 min read · Dec 18, 2019


AWS (center) with friends LocalStack (left) and Moto (right)

If your project relies on AWS S3 for object storage, you are likely to have encountered challenges unit-testing components which interface with Amazon’s cloud.

One approach, which helps to keep unit tests isolated from the cloud environment and minimizes AWS storage and transfer costs, is to substitute S3 during test runs with a local mocking framework such as Moto or LocalStack. Both frameworks provide the capability to stand up servers to emulate AWS features locally, for the benefit of functional testing.

This post aims to elucidate the steps required to perform local-mock unit-testing of S3-integrated components under 2 scenarios:

(1) JVM-based apps using the AWS Java API to interact with S3. Examples here use Scala, but should be generalizable to Java (and other JVM-based languages).

(2) Apps built atop a framework from the Hadoop ecosystem (such as Apache Spark) which perform distributed processing in conjunction with S3 as an input source and/or output destination. I use Spark here, but I would venture to guess that other Hadoop-integrated frameworks involve similar steps.

Under both scenarios, substituting a local mock for the S3 cloud requires some additional steps likely dissimilar from those in the production/cloud-facing app configuration.

tl;dr: sample code

The s3_mocktest_demo sample project contains a system-under-test consisting of:

(1) a Scala class SampleAppS3 which performs various S3 reading/writing actions via the AWS Java API, paired with a testing class SampleAppS3Spec which validates those operations

(2) a Spark app SampleSparkAppS3 which creates a Spark DataFrame, writes it to S3, then reads it back, paired with a testing class SampleSparkAppS3Spec.

The unit tests should run successfully using either Moto or LocalStack, provided that the mocking server’s S3 subsystem listens on localhost:9999.
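
For orientation, the round trip exercised in (2) amounts to something like the following. This is a hypothetical sketch rather than the sample project's code: the bucket/prefix names are made up, and the SparkSession is assumed to already be configured against the local mock as described later in this post.

import org.apache.spark.sql.SparkSession

// Hypothetical sketch of the write-then-read round trip the Spark test validates.
// Assumes the bucket already exists on the mock server and that the SparkSession's
// Hadoop configuration points S3A at the local mock (see below).
object SparkS3RoundTripSketch {

  def roundTrip(spark: SparkSession): Unit = {
    import spark.implicits._

    val path = "s3a://test-bucket/SparkSqlS3Test"   // made-up bucket/prefix

    val written = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    written.write.mode("overwrite").json(path)      // write the DataFrame to mock S3
    val readBack = spark.read.json(path)            // read it back via S3A

    assert(readBack.count() == written.count())     // minimal validation
  }
}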

Setting up the mocking server

You can use either Moto server mode or LocalStack as the S3 mocking server. Generally, I favor Moto for this use case, as I’ve found it slightly simpler to set up. (LocalStack requires a local Docker host, by default.)

Both tools are written in Python and install via pip. If you don’t have a Python 3 + pip3 environment already in place, pip installation will look something like this (using macOS + Homebrew; the process is similar on *nix, substituting your distro’s package manager for brew):

brew update
brew install python
curl https://bootstrap.pypa.io/get-pip.py -o /tmp/get-pip.py
chmod 755 /tmp/get-pip.py
python3 /tmp/get-pip.py

Option 1: moto_server

pip3 install moto[server]
moto_server s3 -p 9999

Option 2: LocalStack

pip3 install localstack
S3_PORT=9999 localstack start

Unit test implementation

For Java API components

If your application reads/writes S3 via the Java API (AmazonS3 / AmazonS3Client), then your test setup can construct an AmazonS3 instance which points to the S3 mocking server, then furnish that instance to the component-under-test prior to running tests:
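
A rough sketch of what such a utility might look like follows. This is a hypothetical reconstruction, not the sample project's exact code: the method name mockS3Client, the endpoint http://localhost:9999, and the signing region are assumptions.

import com.amazonaws.auth.{AWSStaticCredentialsProvider, AnonymousAWSCredentials}
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}

object AmazonS3TestUtil {

  // Builds an AmazonS3 client aimed at the local mocking server rather than the AWS cloud.
  def mockS3Client: AmazonS3 =
    AmazonS3ClientBuilder.standard()
      // Point the client at the local mock server's S3 endpoint.
      .withEndpointConfiguration(new EndpointConfiguration("http://localhost:9999", "us-east-1"))
      // Use path-style URLs (http://localhost:9999/bucket/key); see the notes below.
      .enablePathStyleAccess()
      // The mock servers cannot verify chunked (streaming) uploads; see the notes below.
      .disableChunkedEncoding()
      // The mock server does not check credentials, so anonymous credentials suffice.
      .withCredentials(new AWSStaticCredentialsProvider(new AnonymousAWSCredentials()))
      .build()
}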

AmazonS3TestUtil.scala: Creating a local-mock AmazonS3Client via AmazonS3ClientBuilder

AmazonS3TestUtil creates an AmazonS3 instance via AmazonS3ClientBuilder.standard(), using the following options for compatibility with the mocking server:

  • withEndpointConfiguration: points the client at the local mock endpoint (http://localhost:9999) rather than at the AWS cloud.
  • enablePathStyleAccess: the mock server is addressed by path-style URLs (http://localhost:9999/bucketname/key). With the SDK’s default virtual-hosted-style addressing, the client instead tries to resolve a per-bucket hostname and fails with:

com.amazonaws.SdkClientException: Unable to execute HTTP request: bucketname.localhost
...
Cause: java.net.UnknownHostException: bucketname.localhost

  • disableChunkedEncoding: the mocking servers appear unable to verify the SDK’s default chunked (streaming) uploads, which otherwise fail with:

com.amazonaws.SdkClientException: Unable to verify integrity of data upload. Client calculated content hash (contentMD5: kC+90rHfDE9wtKXSNSXpMg== in base 64) didn't match hash (etag: 3414ca89abf4cd109e6d2cf2f827bc3c in hex) calculated by Amazon S3.  You may need to delete the data stored in Amazon S3. (metadata.contentMD5: null, md5DigestStream: com.amazonaws.services.s3.internal.MD5DigestCalculatingInputStream@7e8eff37, bucketName: 3e185334-bf80-445a-a6ad-a480bb32dbb5, key: abc)

(Thanks to StackOverflow user Shadowman for this answer, upon which this section is based.)

For Spark/Hadoop components

Hadoop-ecosystem applications interact with S3 via the Hadoop S3A client, so configurations to accommodate the limitations of the S3 local mock must be added to the Hadoop Configuration. For a Spark application, the SparkSession for unit testing might be created as follows:
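
For example (a minimal sketch: the app name and master setting are assumptions, and the sample project's builder options may differ):

import org.apache.spark.sql.SparkSession

// A local, in-process SparkSession for unit tests (no cluster required).
val spark: SparkSession = SparkSession.builder()
  .appName("SampleSparkAppS3Spec")
  .master("local[*]")
  .getOrCreate()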

Subsequently, the necessary Hadoop S3A properties may be added to the SparkSession’s associated Hadoop configuration as follows:
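
A sketch of the kind of settings involved (the property keys are standard Hadoop S3A keys, but the exact set and values used in the sample project may differ; the change-detection key applies only to Hadoop versions that include that feature):

val hadoopConf = spark.sparkContext.hadoopConfiguration

// Route S3 URIs to the S3A FileSystem implementation.
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
// Aim S3A at the local mock rather than the AWS cloud.
hadoopConf.set("fs.s3a.endpoint", "http://localhost:9999")
// Path-style addressing: no per-bucket DNS on localhost.
hadoopConf.set("fs.s3a.path.style.access", "true")
// Arbitrary non-empty credentials (see notes below).
hadoopConf.set("fs.s3a.access.key", "mock-access-key")
hadoopConf.set("fs.s3a.secret.key", "mock-secret-key")
// The mock servers do not support the multi-object delete request (see notes below).
hadoopConf.set("fs.s3a.multiobjectdelete.enable", "false")
// Relax change detection, since the mocks may not return ETag/version attributes (see notes below).
hadoopConf.set("fs.s3a.change.detection.version.required", "false")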

Note that all configuration values are of type String.

Some notes on the properties assigned above:

  • fs.s3.impl: Indicates the FileSystem implementation to use for S3; should use org.apache.hadoop.fs.s3a.S3AFileSystem (as prior implementations are now deprecated).
  • fs.s3a.endpoint: Indicates the URI of the local mock S3; equivalent to the withEndpointConfiguration method when using the AmazonS3 API, and accepts a URI with the same syntax (e.g. http://localhost:9999).
  • fs.s3a.access.key, fs.s3a.secret.key: Must each be assigned to some arbitrary non-empty string. Even though the mock S3 server does not require access credentials (and the mock-client AmazonS3 instance created in the previous section appears to not require them), it seems that the S3A layer currently performs a non-empty check on both the AWS access and secret keys, and will throw an Exception like the following if either one is absent (or set to empty string ""):
org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider InstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Unable to load credentials from service endpoint

  • fs.s3a.multiobjectdelete.enable: Should be set to "false". The Hadoop documentation for this property carries an advisory:

Beware: legacy S3-compatible object stores might not support this request.

Moto and LocalStack both appear to bear out this advisory; if multiobjectdelete is left at its default true setting, the following NullPointerException is likely to be thrown on writes (~substituted for package org.apache.hadoop.fs.s3a for readability):

java.lang.NullPointerException:
at ~.S3AUtils.translateMultiObjectDeleteException(S3AUtils.java:455)
at ~.S3AUtils.translateException(S3AUtils.java:269)
at ~.Invoker.retryUntranslated(Invoker.java:334)
at ~.Invoker.retryUntranslated(Invoker.java:285)
at ~.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
at ~.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
at ~.S3AFileSystem.deleteUnnecessaryFakeDirectories
(S3AFileSystem.java:2785)
at ~.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
at ~.S3AFileSystem.putObjectDirect(S3AFileSystem.java:1589)
at ~.S3AFileSystem.lambda$createEmptyObject$13
(S3AFileSystem.java:2835)

  • fs.s3a.change.detection.*: On Hadoop versions that include the S3A change-detection feature, the defaults expect an ETag/version attribute which the mocking servers do not reliably return; relaxing the check (e.g. setting fs.s3a.change.detection.version.required to "false") appears necessary, as reads back from the mock endpoint otherwise fail with:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 (TID 3, localhost, executor driver): org.apache.hadoop.fs.s3a.NoVersionAttributeException: `s3a://a81479c3-9073-403d-9990-cf866a8c4690/SparkSqlS3Test/part-00000-8fe65a11-e4e1-4050-8365-679ec7b73481-c000.json': Change detection policy requires ETag
...
Cause: org.apache.hadoop.fs.s3a.NoVersionAttributeException: `s3a://a81479c3-9073-403d-9990-cf866a8c4690/SparkSqlS3Test/part-00000-8fe65a11-e4e1-4050-8365-679ec7b73481-c000.json': Change detection policy requires ETag

Finally, even with the settings above, S3A’s default chunked (streaming) upload encoding is not handled correctly by the mocking servers, and S3A currently exposes no configuration property equivalent to the AmazonS3 builder’s disableChunkedEncoding option used earlier. Writes through S3A are therefore likely to fail with:

org.apache.hadoop.fs.s3a.AWSClientIOException: PUT 0-byte object on SparkSqlS3Test/_temporary/0/: com.amazonaws.SdkClientException: Unable to verify integrity of data upload. Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: f7881095ab2dba4c1588ae614f9a269a in hex) calculated by Amazon S3.  You may need to delete the data stored in Amazon S3....
at ~.S3AUtils.translateException(S3AUtils.java:189)
at ~.Invoker.once(Invoker.java:111)
at ~.Invoker.lambda$retry$3(Invoker.java:265)
at ~.Invoker.retryUntranslated(Invoker.java:322)
at ~.Invoker.retry(Invoker.java:261)
at ~.Invoker.retry(Invoker.java:236)
at ~.S3AFileSystem.createEmptyObject(S3AFileSystem.java:2833)
at ~.S3AFileSystem.createFakeDirectory(S3AFileSystem.java:2808)
at ~.S3AFileSystem.innerMkdirs(S3AFileSystem.java:2129)
at ~.S3AFileSystem.mkdirs(S3AFileSystem.java:2062)
...Cause: com.amazonaws.SdkClientException: Unable to verify integrity of data upload. Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: f7881095ab2dba4c1588ae614f9a269a in hex) calculated by Amazon S3. You may need to delete the data stored in Amazon S3. (metadata.contentMD5: null, md5DigestStream: com.amazonaws.services.s3.internal.MD5DigestCalculatingInputStream@2241a17c, bucketName: fcbb9664-8dad-4111-8cbb-f0240ff3f958, key: SparkSqlS3Test/_temporary/0/)

(For readability above: some formatting applied, and ~substituted for package org.apache.hadoop.fs.s3a.)

There is an open JIRA case HADOOP-14695 to address this missing feature, but as yet, it does not have an assigned Fix Version, and I do not know how likely it is that any eventual implementation will be ported back to Hadoop 2.x versions which are still widely in use.

In the meantime, you can use the following workaround to disable chunked encoding within the S3A layer:

(1) S3A uses an S3ClientFactory in order to generate the internal AmazonS3 instance needed to communicate with the S3 endpoint. The default implementation is DefaultS3ClientFactory (instantiated here); extend this implementation and override createS3Client in order to apply the additional .disableChunkedEncoding option (as demonstrated in the “For Java API Components” section):
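
A sketch of such a factory follows. It is hypothetical and heavily version-dependent: the single-argument createS3Client signature shown matches the Hadoop 2.8–3.2 era, and newer hadoop-aws releases pass additional parameters, so the override must match the version on your classpath. The class name, signing region, and fallback values are assumptions; the sample project contains a working implementation.

import java.net.URI

import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}
import org.apache.hadoop.fs.s3a.DefaultS3ClientFactory

// Hypothetical S3ClientFactory that builds the AmazonS3 client with chunked
// encoding disabled; assumes the older createS3Client(URI) signature.
class NonChunkedEncodingS3ClientFactory extends DefaultS3ClientFactory {

  override def createS3Client(name: URI): AmazonS3 = {
    val conf = getConf
    // Reuse the endpoint and dummy credentials already configured for S3A.
    val endpoint = conf.getTrimmed("fs.s3a.endpoint", "http://localhost:9999")
    val credentials = new BasicAWSCredentials(
      conf.getTrimmed("fs.s3a.access.key", "mock-access-key"),
      conf.getTrimmed("fs.s3a.secret.key", "mock-secret-key"))

    AmazonS3ClientBuilder.standard()
      .withCredentials(new AWSStaticCredentialsProvider(credentials))
      .withEndpointConfiguration(new EndpointConfiguration(endpoint, "us-east-1"))
      .enablePathStyleAccess()
      // The reason this factory exists: disable chunked (streaming) uploads.
      .disableChunkedEncoding()
      .build()
  }
}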

(2) Direct S3A to use the non-chunked-encoding implementation by setting the fs.s3a.s3.client.factory.impl property to the new implementation’s fully-qualified class name:
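
For example (using the hypothetical class name from the sketch above; substitute your own implementation’s fully-qualified name):

// Tell S3A to construct its AmazonS3 client via the custom factory.
spark.sparkContext.hadoopConfiguration.set(
  "fs.s3a.s3.client.factory.impl",
  classOf[NonChunkedEncodingS3ClientFactory].getName)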

The s3_mocktest_demo sample project contains such an implementation; please feel free to reuse, if you wish.

Further steps?

The above configurations were sufficient to allow for local-mock unit testing of the (rather simple) Spark-S3 code written for demonstration purposes. If you happen to find a use case where additional steps are required (or the above configuration does not work with the mock S3 endpoint), please let me know by leaving a comment and/or opening a GitHub issue on s3_mocktest_demo; I would be happy to update this post with additional details where needed.
