Getting started with EMRFS

Snigdhajyoti Ghosh
3 min read · Jul 11, 2020


The EMR File System (EMRFS) is an implementation of the Hadoop file system that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3.

How to access a file from S3 using EMRFS

# Using Java

Coming from HDFS, it is very easy to adopt EMRFS. You just need to pass a URI("s3://<bucket-name>") object when getting the FileSystem object.
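A minimal sketch of what that looks like on an EMR cluster, where the s3:// scheme resolves to EMRFS (the bucket and path here are placeholders):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EmrfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On an EMR cluster the s3:// scheme is backed by EMRFS;
        // "my-bucket" is a placeholder bucket name.
        FileSystem fs = FileSystem.get(new URI("s3://my-bucket"), conf);

        // From here on, the API is the same as with HDFS.
        for (FileStatus status : fs.listStatus(new Path("s3://my-bucket/data/"))) {
            System.out.println(status.getPath());
        }
    }
}
```

The rest of your code stays untouched: once the FileSystem object is bound to the s3:// URI, the usual open/create/listStatus calls work exactly as they would against HDFS.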

# Using Java-Spark

There is no configuration change needed in your Spark application. Just make sure the required EMRFS .jar is added to your classpath.
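A sketch of a Spark job reading from and writing to S3 through EMRFS — nothing EMRFS-specific appears in the code, only the s3:// paths (bucket and paths are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkEmrfsExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("emrfs-example")
                .getOrCreate();

        // On EMR, s3:// reads and writes go through EMRFS transparently.
        Dataset<Row> df = spark.read().parquet("s3://my-bucket/input/");
        df.write().mode("overwrite").parquet("s3://my-bucket/output/");

        spark.stop();
    }
}
```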

# Using CLI

You are probably familiar with the AWS S3 CLI.
But using the Hadoop FileSystem CLI you can do lots of other things as well. It uses EMRFS to read from and write to S3, so it is more compatible with consistent view.

A quick guide to what your new CLI commands will look like:

# Listing objects
hdfs dfs -ls s3://<bucket>/<object-path>
# Copy from local to s3
hdfs dfs -copyFromLocal <local-file> s3://<bucket>/<object-path>
# Copy to local from s3
hdfs dfs -copyToLocal s3://<bucket>/<object-path> <local-file>

What other challenges will you face with EMRFS

As Teepika R M mentioned in a blog:

EMRFS is not a separate file-system, it’s an implementation of HDFS that adds strong consistency to S3, whereas S3 without EMRFS implementation provides read after write consistency for PUTS of new objects and eventual consistency for other operations on objects in all regions.

# If you do not enable consistent view

S3 has a caveat with its read-after-write consistency model that if you make a HEAD or GET request to a key name before the object is created, then create the object shortly after that, a subsequent GET might not return the object due to eventual consistency.

If you fire a Hive query to insert data into some S3 location, Hive creates a staging directory .hive-staging_* under that location, and at the end it renames the staging directory to the actual one.

So when Hive PUTs an object to S3, it is not instantly available for HEAD or GET. Sometimes the operations happen so fast that, due to S3's eventual consistency, the query fails intermittently.
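The staging-and-rename pattern described above can be sketched with the Hadoop FileSystem API (bucket and file names are illustrative, not what Hive generates internally):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagingRenameSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("s3://my-bucket"), new Configuration());

        Path staging = new Path("s3://my-bucket/table/.hive-staging_example/part-00000");
        Path target  = new Path("s3://my-bucket/table/part-00000");

        // Step 1: the query writes its output under the staging directory.
        try (FSDataOutputStream out = fs.create(staging)) {
            out.writeBytes("row-data\n");
        }

        // Step 2: the staging file is renamed into place. On S3 a rename is
        // a copy plus delete of objects, and a reader that issued a GET on
        // the target key before it existed may still see "not found" for a
        // while afterwards - this is the intermittent failure mode.
        fs.rename(staging, target);
    }
}
```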

# If you enable consistent view

After you enable consistent view on an EMR cluster, it creates a DynamoDB table to keep track of S3 metadata. Basically it's a key-value map for every object that EMRFS tracks in S3.

After you enable this, if you do any PUT operation on an object using the aws-s3-sdk or boto3 (the AWS S3 CLI), the metadata won't be in sync. Now whenever you try to perform some operation on the same object through EMRFS, it will throw a ConsistencyException, because DynamoDB is not in sync. The same applies to changing file content, even if you didn't change the filename.

And if you did the same thing by mistake, you need to manually sync the metadata, or inspect the difference first:

# Sync the DynamoDB metadata with S3
emrfs sync s3://<bucket>/<path or path-to-object>
# Show the difference between the metadata and S3
emrfs diff s3://<bucket>/<path or path-to-object>

# Key takeaway

It's better to forget about boto3 or the aws-s3-sdk when you are working with EMRFS consistent view.

It also comes with its own cost for the DynamoDB table.
