Integrating Spark with Localstack S3

David Smith
3 min read · Aug 16, 2019

Localstack is an open-source mock AWS cloud stack that runs entirely on a local environment, letting developers code and test against many popular AWS services without ever touching the AWS cloud. At Evidation Health we developed a Big Data Platform that uses Apache Spark for high-volume, high-velocity, high-variety data processing and AWS S3 buckets as a scalable, secure storage layer for our data lake.

Although we have our own AWS test subaccounts for developing new data pipelines before they reach production, we have been focused on containerizing our distributed architecture with Docker so the team can interact with the product easily on local environments and speed up our development cycle. Combining Docker with Localstack looked like an easy win for moving quickly on new projects while improving product quality.

Turns out it wasn’t as straightforward as we expected…

Path-style request support

Our Spark clusters are configured to run with Hadoop 2.7.3. This version does support custom endpoints, which is needed so Spark/Hadoop sends requests to our Localstack S3 endpoint URL instead of the AWS cloud. However, path-style requests (e.g. http://localstack:4572/my-bucket/file rather than the virtual-hosted style http://my-bucket.localstack:4572/file) are not supported in 2.7, so Spark failed when it tried to make DNS-based bucket requests against our Localstack endpoint. A code change to S3AFileSystem.java in the hadoop-aws jar was needed to add the path-style request support present in later Hadoop versions.
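To make the addressing difference concrete, here is a minimal sketch using boto3 against the Localstack endpoint; the bucket name and dummy credentials are examples only, and this illustrates the request style rather than the Hadoop patch itself:

import boto3
from botocore.client import Config

# Path-style addressing sends requests to http://localstack:4572/my-bucket/key,
# so no wildcard DNS is needed for bucket subdomains like my-bucket.localstack.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localstack:4572",
    aws_access_key_id="test",      # Localstack accepts dummy credentials
    aws_secret_access_key="test",
    config=Config(s3={"addressing_style": "path"}),
)

s3.create_bucket(Bucket="my-bucket")
print(s3.list_buckets()["Buckets"])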

Avro Tools missing method

Loading Avro files from Localstack S3 in this configuration also turned out to be problematic. When I attempted to load a file into a Spark data frame in pyspark, I saw the following error:

java.lang.NoSuchMethodError: org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;

Replacing the avro-tools jar in Hadoop with version 1.9.0, which does contain the getLogicalType() method on the Schema class, allowed the Avro file to be read successfully.
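With the updated jar in place, the read itself is the ordinary pyspark call; a rough sketch, assuming the pyspark shell (where spark already exists), spark-avro on the classpath, and a placeholder bucket and path:

# Read an Avro file from the Localstack bucket into a DataFrame.
# Depending on the Spark version, the format name is "avro" (Spark 2.4+ with
# spark-avro) or "com.databricks.spark.avro" for earlier 2.x releases.
df = spark.read.format("avro").load("s3a://my-bucket/data/events.avro")
df.printSchema()
df.show(5)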

Checksum mismatch after write

After verifying that Avro files could be read in Spark from the Localstack S3 bucket, it was time to write the data frame back to S3. Unfortunately this did not work out of the box either: the Spark logs showed the data being transferred to Localstack S3, but once the transfer finished pyspark raised another exception:

Aborting task
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: null, AWS Error Code: InvalidDigest, AWS Error Message: The Content-MD5 you specified was invalid, S3 Extended Request ID: null

After researching the issue I found it was caused by a mismatch between the Localstack server and the Hadoop libraries: Localstack was not computing an MD5 digest for uploaded content, while the Hadoop libraries were supplying one, so the digest could never match Localstack's missing metadata. Since this setup is for development only, with smaller files on local environments, I wasn't too concerned about file transfer integrity checks, so I made another change to S3AFileSystem.java to disable setting the new object metadata altogether.
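With that change in place, the write that originally failed is just the usual DataFrame save back to the bucket; another sketch with placeholder paths:

# Write the DataFrame back to the Localstack bucket as Avro.
# Before the S3AFileSystem change this call aborted with InvalidDigest;
# with the object metadata no longer set, the upload completes.
df.write.format("avro").mode("overwrite").save("s3a://my-bucket/output/")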

A Happy Ending After All

Once these changes to the hadoop-aws jar and the newer avro-tools jar were bundled and deployed to Spark and Hadoop, I was able to run our data platform with Spark and Localstack S3 using s3a URLs and the following Spark configuration:

spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4 -Dcom.amazonaws.services.s3.enforceV4
spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4 -Dcom.amazonaws.services.s3.enforceV4
spark.hadoop.fs.s3a.endpoint=localstack:4572
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.connection.ssl.enabled=false

where localstack is the hostname of the Localstack S3 Docker container.
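For jobs that build their own session, the s3a settings can also be applied programmatically in pyspark; a sketch with an example app name and dummy credentials (the extraJavaOptions are best kept in spark-defaults.conf or on the spark-submit command line, since the driver JVM is already running by the time this code executes):

from pyspark.sql import SparkSession

# Point s3a at the Localstack container instead of the AWS cloud.
# Localstack does not validate credentials, so dummy values are fine here.
spark = (
    SparkSession.builder
    .appName("localstack-s3-example")
    .config("spark.hadoop.fs.s3a.endpoint", "localstack:4572")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .config("spark.hadoop.fs.s3a.access.key", "test")
    .config("spark.hadoop.fs.s3a.secret.key", "test")
    .getOrCreate()
)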

Being able to run Spark with Localstack has made developing and testing our data platform much faster and easier. Hopefully my experience working through these initial issues can help others looking to do the same.

My modified Hadoop 2.7.3 source code is available at https://github.com/suburbanmtman/hadoop-2.7.3-spark-localstack.

David Smith

Data, software, and cloud architecture consultant at Gentle Valley Digital, building scalable big data and SaaS solutions.