Wiring your older Hadoop clusters to access Azure Data Lake Store
For Hadoop versions lower than 3.0, you need some additional setup in order to access the data lake store.
In a couple of recent articles, we discussed how to connect any Hadoop or Spark cluster to Azure Data Lake Store and how to make it the default file system. These instructions work great if you are using the latest versions of Hadoop (3.0 and later). The binaries for the newer versions come bundled with all the components that are required to access Azure Data Lake Store. However, if you are using an older version of Hadoop you need to manually install the components to access ADLS. In this article, I will show you how to wire it all up.
The key to getting your older versions to talk to Azure Data Lake Store is installing the required binaries.
How old is really old?
The Azure Data Lake Store binaries have been broadly certified for Hadoop distributions 3.0 and above. For lower versions, we are really in uncharted territory: the farther you go below 3.0, the higher the likelihood that things will not work. My personal recommendation is to go no lower than 2.6; beyond that, your mileage may really vary.
Getting Hadoop 2.6 installed
Similar to my prior posts, I am going to describe the concepts by installing Hadoop locally on a single box. Let’s download the binaries directly from the Hadoop Apache Releases page.
Get the binaries for version 2.6.5 and unpack them. On my machine, I extracted them to a local folder.
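On Linux, or in a Unix-style shell on Windows, the download and extraction can be sketched like this. The URL follows the standard Apache archive layout; verify it against the releases page before relying on it:

```shell
# Fetch the Hadoop 2.6.5 release tarball from the Apache archive.
HADOOP_VERSION=2.6.5
TARBALL="hadoop-${HADOOP_VERSION}.tar.gz"
URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/${TARBALL}"
echo "$URL"
# Download and unpack (commented out here; the tarball is a few hundred MB):
# curl -fSLO "$URL" && tar -xzf "$TARBALL"
```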
Now let us configure this setup to access Azure Data Lake Store.
Configure Hadoop to access Azure Data Lake Store
At this point you pretty much want to follow this previous post to configure your Hadoop setup. The steps you want to follow are:
- Download winutils.exe and hadoop.dll
- Set JAVA_HOME
- Create the identity and get the credentials to access Azure Data Lake Store
- Update core-site.xml to set these properties up
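For reference, the core-site.xml entries from that post look roughly like the following. The fs.adl.oauth2.* property names are the ones used by Hadoop's hadoop-azure-datalake module (some older builds used dfs.adls.* prefixes instead), and every value below is a placeholder for your own Azure AD application and account:

```xml
<configuration>
  <!-- Authenticate with OAuth2 client credentials from your Azure AD app -->
  <property>
    <name>fs.adl.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>fs.adl.oauth2.refresh.url</name>
    <value>https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/token</value>
  </property>
  <property>
    <name>fs.adl.oauth2.client.id</name>
    <value>YOUR_CLIENT_ID</value>
  </property>
  <property>
    <name>fs.adl.oauth2.credential</name>
    <value>YOUR_CLIENT_SECRET</value>
  </property>
  <!-- Optional: make Azure Data Lake Store the default file system -->
  <property>
    <name>fs.defaultFS</name>
    <value>adl://YOUR_ACCOUNT.azuredatalakestore.net</value>
  </property>
</configuration>
```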
These steps are all explained in full detail in that article. After you complete the steps above, you are not quite done yet: if you try to access Azure Data Lake Store at this point, you will see an error similar to the following.
The reason we see this is that the binaries required to access Azure Data Lake Store are not baked into the 2.6 builds. These are the two binaries that are needed:

- The Data Lake Hadoop file system client, hadoop-azure-datalake-&lt;version&gt;.jar
- The Azure Data Lake Store Java SDK
In the remainder of the article, we will talk about how to get them.
Getting the Data Lake Hadoop File System client
The proper way to get hadoop-azure-datalake-&lt;version&gt;.jar is to backport HADOOP-13037 to your Hadoop version and build it yourself. That, however, is not for the faint of heart, and describing it is outside the scope of this article.
Here, I am going to cheat a little and show a slightly easier way: we are going to take the binaries from newer builds of Hadoop directly from Apache. Similar to what I described in the previous article, download and extract a newer version of Hadoop (3.0 alpha 2 or above).
Then locate hadoop-azure-datalake-&lt;version&gt;.jar in the downloaded build. I extracted my build to C:\hadoop-3.0.0-alpha2; the jar lives in a sub-folder of that installation. Copy it to your older Hadoop version in the same sub-path.
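In a Unix-style shell, the copy step can be sketched as a small helper. The function name and example paths are mine, not part of any Hadoop tooling:

```shell
# Copy the hadoop-azure-datalake jar from a newer Hadoop install into the
# same relative sub-path of an older one.
copy_adls_jar() {
  new_home=$1
  old_home=$2
  # Find the ADLS file system client jar anywhere under the newer build.
  jar=$(find "$new_home" -type f -name 'hadoop-azure-datalake-*.jar' | head -n 1)
  [ -n "$jar" ] || { echo "no hadoop-azure-datalake jar under $new_home" >&2; return 1; }
  # Recreate the same relative sub-path in the older install, then copy.
  rel=${jar#"$new_home"/}
  mkdir -p "$old_home/$(dirname "$rel")"
  cp "$jar" "$old_home/$rel"
}

# Example (adjust both paths to your own extraction folders):
# copy_adls_jar /opt/hadoop-3.0.0-alpha2 /opt/hadoop-2.6.5
```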
Next, let’s get the latest Azure Data Lake Store Java SDK
Downloading the Azure Data Lake Store Java SDK from Maven
The latest version of the ADLS Java SDK can be found in the Maven repository. Go to https://search.maven.org/ and search for “data lake” to locate the ArtifactId. Download the latest version into the same relative location in your 2.6 folder.
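Once you have the coordinates from the Maven search, the download URL can be assembled from the standard Maven Central layout. The groupId and artifactId below (com.microsoft.azure:azure-data-lake-store-sdk) are what that search surfaces, and the version is only illustrative; confirm both on search.maven.org:

```shell
# Build the Maven Central download URL for the ADLS Java SDK jar.
GROUP=com.microsoft.azure
ARTIFACT=azure-data-lake-store-sdk
VERSION=2.1.5  # illustrative; pick the latest version listed on Maven
# Maven Central layout: group dots become path separators.
URL="https://repo1.maven.org/maven2/${GROUP//.//}/${ARTIFACT}/${VERSION}/${ARTIFACT}-${VERSION}.jar"
echo "$URL"
# Then fetch it with: curl -fSLO "$URL"
```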
Test that everything works
Now that everything is set up properly, access to Azure Data Lake Store should work for you.
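A quick smoke test is to list the root of the store from the Hadoop CLI. The account name below is a placeholder, and the listing commands are shown commented because they need a real, configured store:

```shell
# Root URI of the store; replace the account name with your own.
ADLS_ROOT="adl://YOUR_ACCOUNT.azuredatalakestore.net/"
echo "smoke test target: $ADLS_ROOT"
# Against a real account, using the credentials from core-site.xml:
# hadoop fs -ls "$ADLS_ROOT"
# If fs.defaultFS points at the store, a plain listing works too:
# hadoop fs -ls /
```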
There you go. Now you can enjoy connectivity to the data lake store from your Hadoop setup.
The main difference with older Hadoop versions is that, in addition to the usual setup steps, they do not come bundled with the binaries needed to access Azure Data Lake Store. Once those binaries are installed correctly, access works just as it does on the newer versions.
For questions and comments reach out to me on Twitter here.