Making Azure Data Lake Store the default file system for Hadoop

Configuring your Hadoop cluster to use Azure Data Lake Store as the default file system is really easy. You just need to set one property in core-site.xml.

Amit Kulkarni
Azure Data Lake
3 min read · Feb 21, 2017


In my previous post, I showed you how you can connect your own Hadoop and Spark clusters to Azure Data Lake Store (ADLS). If you follow the instructions, you will end up with a Hadoop installation that can successfully connect to ADLS.

Querying Azure Data Lake Store using hdfs dfs commands

In this post, we will take our installation one step further and make Azure Data Lake Store the default file system for our Hadoop installation.

What is the default file system for Hadoop?

Simply put, the default file system is what relative paths resolve against. What does this mean? There are two ways in which you can specify paths in Hadoop: absolute paths and relative paths. Absolute paths are simple. They contain a prefix that identifies the file system, followed by the full path to a file or folder. Here are some examples of absolute paths.
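To make this concrete, here are a few absolute paths, each carrying a scheme prefix that identifies the file system (the host and account names are made up for illustration):

```
hdfs://namenode:9000/user/filename.txt                    (HDFS)
adl://myaccount.azuredatalakestore.net/user/filename.txt  (Azure Data Lake Store)
file:///home/user/filename.txt                            (local file system)
```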

A relative path, on the other hand, simply provides a sub-path without the prefix identifying the file system. For example, a relative path may look like this: /user/filename.txt. At run time, Hadoop looks up the default file system and combines it with the relative path to come up with the absolute path.

Relative paths are resolved against the default file system on Hadoop

So to give a concrete example, if the default file system was hdfs://123.23.12.4344:9000, then /user/filename.txt would resolve to hdfs://123.23.12.4344:9000/user/filename.txt.
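The resolution step above can be sketched in a few lines of Python. This is a toy illustration of the idea, not Hadoop's actual resolution code:

```python
def resolve(default_fs: str, path: str) -> str:
    """Combine a path with the default file system, the way Hadoop
    conceptually does: a path that already names a file system (has a
    scheme like hdfs:// or adl://) is returned unchanged; a relative
    path is prefixed with the default file system."""
    if "://" in path:
        return path  # already absolute
    return default_fs.rstrip("/") + path

print(resolve("hdfs://123.23.12.4344:9000", "/user/filename.txt"))
# -> hdfs://123.23.12.4344:9000/user/filename.txt
```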

Why does the default file system matter? The first answer is pure convenience. It is a heck of a lot easier to simply say /events/sensor1/ than adl://amitadls.azuredatalakestore.net/ in code and configurations. Secondly, many components in Hadoop use relative paths by default. For instance, there is a fixed set of places, specified by relative paths, where various applications generate their log files. Finally, many ISV applications running on Hadoop specify important locations by relative paths.

Setting the default file system

OK, let’s cut to the chase. The property in Hadoop that specifies the default file system is called fs.defaultFS. You set it in core-site.xml, which is located in %HADOOP_HOME%\etc\hadoop. You can read more about this property in the Hadoop docs. Note: do not set fs.default.name, as it is now deprecated.

By default, this property points to the local file system in a single-machine install of Hadoop. For most Hadoop clusters, it is configured to point to the Hadoop Distributed File System (HDFS). On my machine, running hdfs dfs -ls /temp enumerates the local hard drive.

I am going to add the following lines to my core-site.xml. This will make ADLS the default file system.
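The property looks roughly like this, using the account name from earlier in this post (substitute your own account; the OAuth credential properties set up in the previous post must also still be present in core-site.xml):

```xml
<property>
  <name>fs.defaultFS</name>
  <value>adl://amitadls.azuredatalakestore.net</value>
</property>
```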

Now, if I run hdfs dfs -ls /, it enumerates my Azure Data Lake Store account.

Here’s a screenshot from my Azure portal to show that it is indeed enumerating the contents of my Data Lake Store account.

Summary

There. It’s that simple! You just need to point the fs.defaultFS property in core-site.xml at your Azure Data Lake Store account. Once you do, all relative paths will resolve directly against that account.

You can find me on Twitter here.
