Making Azure Data Lake Store the default file system for Hadoop
Configuring your Hadoop cluster to use Azure Data Lake Store as the default file system is really easy: you just need to set one property in your core-site.xml.
In my previous post, I showed you how you can connect your own Hadoop and Spark clusters to Azure Data Lake Store (ADLS). If you follow the instructions, you will end up with a Hadoop installation that can successfully connect to ADLS.
In this post, we will take our installation one step further and make Azure Data Lake Store the default file system for our Hadoop.
What is the default file system for Hadoop?
Simply put, the default file system is what relative paths resolve against. What does this mean? There are two ways to specify paths in Hadoop: absolute paths and relative paths. Absolute paths are simple: they contain a prefix that identifies the file system, followed by the full path to the file or folder. Here are some examples of absolute paths.
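For instance, an absolute path might name an HDFS namenode, an ADLS account, or the local file system (the host and account names below are just placeholders for illustration):

```
hdfs://namenode:9000/user/filename.txt
adl://myaccount.azuredatalakestore.net/user/filename.txt
file:///home/user/filename.txt
```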
A relative path, on the other hand, provides only the path itself, without the prefix identifying the file system. For example, a relative path may look like
/user/filename.txt. At run time, Hadoop looks up the default file system and combines it with the relative path to produce the absolute path.
Relative paths are resolved against the default file system on Hadoop
So, to give a concrete example, if the default file system were
hdfs://namenode:9000 then
/user/filename.txt would resolve to
hdfs://namenode:9000/user/filename.txt.
Why does the default file system matter? The first answer is pure convenience: in code and configurations, it is a heck of a lot easier to write /user/filename.txt than a full path prefixed with
adl://amitadls.azuredatalakestore.net/. Secondly, many components in Hadoop use relative paths by default. For instance, there is a fixed set of places, specified by relative paths, where various applications generate their log files. Finally, many ISV applications running on Hadoop specify important locations by relative paths.
Setting the default file system
OK, let’s cut to the chase. The property in Hadoop that specifies the default file system is called
fs.defaultFS. You set it in
core-site.xml, which is located in
%HADOOP_HOME%\etc\hadoop. You can read more about this property on this page in the Hadoop docs. Note: do not set
fs.default.name, as it is now deprecated.
By default, this property points to the local file system in a single-machine install of Hadoop. For most Hadoop clusters, this property is configured to point to the Hadoop Distributed File System (HDFS). On my machine, if I run
hdfs dfs -ls /temp, it enumerates the local hard drive.
I am going to add the following lines to my
core-site.xml. This will make ADLS the default file system:
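A minimal sketch of the property block looks like this. I am using the amitadls account from the earlier examples; substitute your own ADLS account name in the value:

```xml
<property>
  <!-- Make Azure Data Lake Store the default file system -->
  <name>fs.defaultFS</name>
  <value>adl://amitadls.azuredatalakestore.net</value>
</property>
```

This goes inside the existing &lt;configuration&gt; element of core-site.xml. Restart your Hadoop daemons after changing it so the new value takes effect.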
Now, if I run
hdfs dfs -ls / it enumerates my Azure Data Lake Store account.
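If you want to double-check which file system Hadoop now considers the default, you can ask Hadoop to print the effective value of the property:

```
hdfs getconf -confKey fs.defaultFS
```

On my setup this should print the adl:// URI of the Data Lake Store account rather than an hdfs:// or file:// URI.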
Here’s a screenshot from my Azure portal to show you that indeed it is enumerating the content of my data lake store account.
There. It’s that simple! You just need to set the
fs.defaultFS property in
core-site.xml to your Azure Data Lake Store account. Once you do that, all relative paths will resolve directly against your Azure Data Lake Store account.
You can find me on Twitter here.