Wiring your older Hadoop clusters to access Azure Data Lake Store

For Hadoop versions lower than 3.0, you need some additional setup in order to access the data lake store.

In a couple of recent articles, we discussed how to connect any Hadoop or Spark cluster to Azure Data Lake Store and how to make it the default file system. These instructions work great if you are using the latest versions of Hadoop (3.0 and later). The binaries for the newer versions come bundled with all the components that are required to access Azure Data Lake Store (ADLS). However, if you are using an older version of Hadoop, you need to manually install the components to access ADLS. In this article, I will show you how to wire it all up.


The key to getting your older versions to talk to Azure Data Lake Store is installing the required binaries.

How old is really old?

The Azure Data Lake Store binaries have been broadly certified for Hadoop distributions 3.0 and above. We are really in uncharted territory for lower versions, so the farther you go below 3.0, the higher the likelihood that they will not work. My personal recommendation is to go no lower than 2.6. Below that, your mileage may really vary.
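
If you manage several clusters, the guidance above can be turned into a quick check. This is only a sketch of my own rule of thumb, not an official support matrix; the function name and the three labels are made up for illustration.

```python
import re


def adls_support(version: str) -> str:
    """Classify a Hadoop version string using the rule of thumb above.

    Returns one of three illustrative labels (not official terms):
      "bundled"  - 3.0 and later, ADLS binaries ship with Hadoop
      "manual"   - 2.6 up to (not including) 3.0, binaries must be added by hand
      "untested" - lower than 2.6, your mileage may vary
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized Hadoop version: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    if (major, minor) >= (3, 0):
        return "bundled"
    if (major, minor) >= (2, 6):
        return "manual"
    return "untested"
```

Only the major and minor numbers are compared, so a string such as 3.0.0-alpha2 is treated like any other 3.0 build.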

Getting Hadoop 2.6 installed

Similar to my prior posts, I am going to describe the concepts by installing Hadoop locally on a single box. Let’s download the binaries directly from the Apache Hadoop Releases page.

Get the binaries for version 2.6.5 and unpack them. On my machine, I extracted them to C:\hadoop-2.6.5.


Now let us configure this setup to access Azure Data Lake Store.

Configure Hadoop to access Azure Data Lake Store

At this point you pretty much want to follow this previous post to configure your Hadoop setup. The steps you want to follow are:

  1. Download winutils.exe and hadoop.dll
  2. Set JAVA_HOME
  3. Add %HADOOP_HOME%\share\hadoop\tools\lib\* to HADOOP_CLASSPATH
  4. Create the identity and get the credentials to access Azure Data Lake Store
  5. Modify core-site.xml to set these properties:
       • dfs.adls.oauth2.access.token.provider.type
       • dfs.adls.oauth2.refresh.url
       • dfs.adls.oauth2.client.id
       • dfs.adls.oauth2.credential
       • fs.adl.impl
       • fs.AbstractFileSystem.adl.impl
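
To make step 5 a little more concrete, here is a small Python sketch that renders those six properties as a core-site.xml document. The helper name build_core_site is mine, and the three fixed values (ClientCredential and the two org.apache.hadoop.fs.adl class names) are what I understand the Hadoop ADLS module to expect; treat them as assumptions and check them against the previous post. The token endpoint, client id, and credential are placeholders for the values you obtained in step 4.

```python
import xml.etree.ElementTree as ET


def build_core_site(refresh_url: str, client_id: str, credential: str) -> str:
    """Return core-site.xml text containing the six ADLS properties."""
    properties = {
        # Assumed value for a service principal (client credential) login.
        "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
        # OAuth2 token endpoint of your Azure AD tenant (from step 4).
        "dfs.adls.oauth2.refresh.url": refresh_url,
        "dfs.adls.oauth2.client.id": client_id,
        "dfs.adls.oauth2.credential": credential,
        # Assumed class names from the hadoop-azure-datalake module.
        "fs.adl.impl": "org.apache.hadoop.fs.adl.AdlFileSystem",
        "fs.AbstractFileSystem.adl.impl": "org.apache.hadoop.fs.adl.Adl",
    }
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder arguments; substitute your own values before using the output.
    print(build_core_site("<token endpoint>", "<client id>", "<credential>"))
```

If your core-site.xml already contains other settings, copy only the property elements into the existing configuration element rather than overwriting the file.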

These steps are all explained in full detail in that article. After you complete the steps above, you are not quite done yet. At this point, if you try to access Azure Data Lake Store, you will get an error.

The reason is that the binaries required to access Azure Data Lake Store are not baked into the 2.6 builds. These are the two binaries that are needed:

  • hadoop-azure-datalake-<version>.jar
  • azure-data-lake-store-sdk-<version>.jar
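
A quick way to see which of the two is still missing is to look in the tools library folder of your install. The sketch below assumes the share\hadoop\tools\lib layout used throughout this article; the function name is hypothetical.

```python
from pathlib import Path

# File name prefixes of the two jars listed above.
REQUIRED_PREFIXES = ("hadoop-azure-datalake-", "azure-data-lake-store-sdk-")


def missing_adls_jars(hadoop_home: str) -> list:
    """Return the required jar name patterns that are absent from tools/lib."""
    lib_dir = Path(hadoop_home) / "share" / "hadoop" / "tools" / "lib"
    present = [p.name for p in lib_dir.glob("*.jar")] if lib_dir.is_dir() else []
    return [
        prefix + "<version>.jar"
        for prefix in REQUIRED_PREFIXES
        if not any(name.startswith(prefix) for name in present)
    ]


if __name__ == "__main__":
    # Path from this article; change it to match your machine.
    print(missing_adls_jars(r"C:\hadoop-2.6.5"))
```

An empty list means both jars are in place.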

In the remainder of the article, we will talk about how to get them.

Getting the Data Lake Hadoop File System client

The proper way to get hadoop-azure-datalake-<version>.jar is to backport HADOOP-13037 to your Hadoop version and build it. That, however, is not for the faint of heart, and describing how to do it is outside the scope of this article.

Here, I am going to cheat a little bit and show a slightly easier way. Basically we are going to get the binaries from newer builds of Hadoop directly from Apache. So similar to how I described in the previous article, let’s download and extract a newer version of Hadoop (3.0 alpha 2 or above).

Then locate the file in the downloaded build.

I extracted my build to C:\hadoop-3.0.0-alpha2. The file will be under share\hadoop\tools\lib.

Copy it to your older Hadoop version in the same sub-path.
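
If you would rather script the copy, something along these lines should do. It is a sketch that assumes both builds use the share\hadoop\tools\lib layout described above; copy_adl_client is a name I made up.

```python
import shutil
from pathlib import Path

# Sub-path that holds the tools jars in both builds.
TOOLS_LIB = Path("share") / "hadoop" / "tools" / "lib"


def copy_adl_client(new_home: str, old_home: str) -> list:
    """Copy hadoop-azure-datalake-*.jar from the newer build into the older one."""
    src_dir = Path(new_home) / TOOLS_LIB
    dst_dir = Path(old_home) / TOOLS_LIB
    jars = sorted(src_dir.glob("hadoop-azure-datalake-*.jar"))
    if not jars:
        raise FileNotFoundError(f"no hadoop-azure-datalake jar found in {src_dir}")
    dst_dir.mkdir(parents=True, exist_ok=True)
    # copy2 keeps timestamps and returns the path of each new file.
    return [Path(shutil.copy2(jar, dst_dir)) for jar in jars]


if __name__ == "__main__":
    # Paths from this article; change them to match your machine.
    print(copy_adl_client(r"C:\hadoop-3.0.0-alpha2", r"C:\hadoop-2.6.5"))
```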

Next, let’s get the latest Azure Data Lake Store Java SDK.

Downloading the Azure Data Lake Store Java SDK from Maven

The latest version of ADLS Java SDK can be found in the Maven repository. Go to https://search.maven.org/ and search for “data lake”. You want to locate the ArtifactId azure-data-lake-store-sdk.

Download the latest version into the same relative location in the 2.6 folder, i.e., C:\hadoop-2.6.5\share\hadoop\tools\lib.
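
The download can also be scripted. The sketch below builds the jar URL from the standard Maven Central repository layout and saves the file into the tools library folder. Two things here are my assumptions rather than something shown on the search page: the GroupId com.microsoft.azure and the repo1.maven.org base URL. The version is deliberately left as a parameter; use whichever version the search shows as latest.

```python
from pathlib import Path
from urllib.request import urlretrieve

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"  # assumed repository base URL


def sdk_jar_url(version: str,
                group_id: str = "com.microsoft.azure",  # assumed GroupId
                artifact_id: str = "azure-data-lake-store-sdk") -> str:
    """Build the jar URL using the standard Maven repository layout."""
    group_path = group_id.replace(".", "/")
    return (f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/"
            f"{version}/{artifact_id}-{version}.jar")


def download_sdk(version: str, hadoop_home: str) -> Path:
    """Download the SDK jar into <hadoop_home>/share/hadoop/tools/lib."""
    target_dir = Path(hadoop_home) / "share" / "hadoop" / "tools" / "lib"
    target = target_dir / f"azure-data-lake-store-sdk-{version}.jar"
    urlretrieve(sdk_jar_url(version), target)
    return target
```

Calling download_sdk with the version you picked and C:\hadoop-2.6.5 as the second argument places the jar next to the file system client.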

Test that everything works

Now that everything is set up properly, access to Azure Data Lake Store should work for you.
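
One simple smoke test is to list the root of your store with hadoop fs -ls. Here is a sketch that wraps that call in Python. The account name is a placeholder for your own store, and hadoop_cmd may need to be the full path to hadoop.cmd (for example under C:\hadoop-2.6.5\bin) on Windows.

```python
import subprocess


def adl_ls_command(account: str, path: str = "/", hadoop_cmd: str = "hadoop") -> list:
    """Build the 'hadoop fs -ls' command for a path in an ADLS account."""
    uri = f"adl://{account}.azuredatalakestore.net{path}"
    return [hadoop_cmd, "fs", "-ls", uri]


def list_adl(account: str, path: str = "/", hadoop_cmd: str = "hadoop") -> str:
    """Run the listing and return its output; raises if the command fails."""
    result = subprocess.run(
        adl_ls_command(account, path, hadoop_cmd),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

If the jars or the core-site.xml settings are still wrong, list_adl raises subprocess.CalledProcessError, and the stderr attribute of that exception carries the Hadoop error message.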

There you go. Now you can enjoy connectivity to the data lake store from your Hadoop setup.

Summary

In addition to all the other steps required, the main difference with older Hadoop versions is that they do not come bundled with the binaries needed to access Azure Data Lake Store. Once you have the binaries installed correctly, you can access Azure Data Lake Store.

For questions and comments reach out to me on Twitter here.