Connecting your own Hadoop or Spark to Azure Data Lake Store
Works with any cluster or even when running locally
Azure Data Lake Store (ADLS) is completely integrated with Azure HDInsight out of the box. Using HDInsight you can enjoy an awesome experience of fully managed Hadoop and Spark clusters on Azure. If you’re starting anew, I would strongly recommend beginning with HDInsight and then considering other choices if needed.
That said, if you are running your own Hadoop/Spark clusters, it is also really easy to access or store data in Azure Data Lake Store. This works whether your cluster is in Azure or on premises, and even if you don’t have a full-blown cluster and are just trying Hadoop/Spark locally. The instructions below are much the same in all these cases.
Configuring a Hadoop or Spark cluster to access Azure Data Lake Store is really easy.
In this post, I am going to show you how to connect Spark running locally on my machine to ADLS. Along the way you will learn how to do this for Hadoop as well.
What you need
Obviously, you need to first have an ADLS account set up in your Azure subscription. If you want to know more about setting up Azure Data Lake Store visit this documentation page.
Here, I have my ADLS account called
On your machine you also need to have Java installed. On mine, I have the latest Java SE Development Kit that I downloaded from here.
The setup below is on my Windows machine. However, it is not much different to get this going on other platforms like Linux or Mac. Simply skip the steps that I outlined which are Windows-specific.
With the pre-reqs taken care of, let’s get going with Spark and Hadoop.
Download the latest Spark distribution directly from the Apache page.
At the time of writing, the latest version is 2.1.0. Choose the package pre-built with user-provided Hadoop. You need to extract the binaries locally on your machine.
Useful Tip: On Windows if you need to extract tar.gz or .tgz files, you can use 7-Zip.
On my Windows machine, I extracted the binaries to a folder called
Download the latest Hadoop binaries directly from Apache at this location.
You need to get version 3.0.0-alpha2 or above. These versions are bundled with the client that can directly talk to Azure Data Lake Store. Note that you need to get the binaries tarball and not the sources. Extract the binaries on your local machine.
On my machine, I extracted them to
Download winutils.exe and hadoop.dll
To run Hadoop on Windows, there are some special considerations. The Apache page that describes them is located here. The bottom line is that you need two special binaries on Windows in order to get everything set up correctly. They are winutils.exe and hadoop.dll.
Typically, they are not packaged in the Hadoop build that you just downloaded from Apache. You need to get them from GitHub here. There is no special version there for Hadoop 3.0, so what you want to do is get the 2.7.1 version. It seems to work fine with the 3.0.0-alpha2 build that we previously downloaded.
Put the two files in the bin directory of your Hadoop.
Disclaimer: These are binaries off GitHub. Use them at your own risk. There is perhaps a way to build these from sources of prior Hadoop versions. However, how to do that is out of scope for this article.
Configure your Hadoop installation
Now that you’ve got all the binaries needed, you need to do some basic configuration. You are going to set a couple of parameters in hadoop-env.cmd.
- You need to set the JAVA_HOME property for Hadoop. Hadoop is kinda finicky about the way you specify this. Typically on Windows, Java gets installed in C:\Program Files\Java. Hadoop does not like spaces in JAVA_HOME, so it needs to be set to the DOS path, which is something like C:\PROGRA~1\Java\jdk1.8.0_121. Set it correctly based on the Java version that you have installed.
- HADOOP_CLASSPATH needs to include %HADOOP_HOME%\share\hadoop\tools\lib\*. You need this because that directory contains the two JAR files that are needed to access ADLS. These are the two JAR files of relevance here
hadoop-env.cmd needs to look like this after this step
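If you would rather generate those two settings than type them by hand, here is a minimal Python sketch. The helper name hadoop_env_lines is my own invention for illustration, and the sketch only produces the two lines discussed above, not a complete hadoop-env.cmd. It also refuses a JAVA_HOME that contains a space, since that is the mistake that bites most often.

```python
from typing import List


def hadoop_env_lines(java_home: str) -> List[str]:
    """Return the two `set` lines discussed above for hadoop-env.cmd.

    `java_home` must already be in the short (DOS) form; this helper
    does not convert it, it only rejects paths that contain spaces.
    """
    if " " in java_home:
        raise ValueError(
            "JAVA_HOME contains a space; use the short DOS path, "
            r"e.g. C:\PROGRA~1\Java\... instead of C:\Program Files\Java\..."
        )
    return [
        f"set JAVA_HOME={java_home}",
        # %HADOOP_HOME% is left unexpanded so cmd.exe resolves it at run time.
        r"set HADOOP_CLASSPATH=%HADOOP_HOME%\share\hadoop\tools\lib\*",
    ]


if __name__ == "__main__":
    # The JDK folder name depends on the Java version you installed.
    print("\n".join(hadoop_env_lines(r"C:\PROGRA~1\Java\jdk1.8.0_121")))
```

Paste the printed lines into hadoop-env.cmd, adjusting the JDK folder to match your machine.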
Test your Hadoop
By now, your Hadoop installation is pretty much all configured for local use. We have not yet completed everything that is needed to access ADLS. However, before going there it is prudent to ensure that the Hadoop-specific and Windows-specific parts are set correctly.
First, run bin\hadoop classpath. It should succeed and print the whole class path.
Next, we will see whether Hadoop can enumerate local files. This requires that hadoop.dll be installed correctly.
Run bin\hdfs dfs -ls /temp. This should show the listing of the local directory.
If these two steps work, everything is configured correctly locally. Now we are going to wire up our Hadoop setup to access Azure Data Lake Store.
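A class path that prints is not necessarily a class path that includes the ADLS client. As a rough sketch, the Python below checks that the output of hadoop classpath contains the tools\lib directory we added earlier. The function names are hypothetical, and I am assuming the launcher script is bin\hadoop.cmd, so adjust that for your platform.

```python
import os
import subprocess
from typing import Iterable, List


def split_classpath(text: str, sep: str = os.pathsep) -> List[str]:
    """Split class path text into its non-empty entries."""
    return [entry.strip() for entry in text.strip().split(sep) if entry.strip()]


def has_tools_lib(entries: Iterable[str]) -> bool:
    """Return True if any entry points into share/hadoop/tools/lib."""
    for entry in entries:
        normalized = entry.replace("\\", "/").lower()
        if "/share/hadoop/tools/lib" in normalized:
            return True
    return False


def check_local_hadoop(hadoop_cmd: str = r"bin\hadoop.cmd") -> bool:
    """Run `hadoop classpath` and check the result.

    The default command path is an assumption about a Windows layout;
    run this from the Hadoop installation directory.
    """
    result = subprocess.run(
        [hadoop_cmd, "classpath"], capture_output=True, text=True, check=True
    )
    return has_tools_lib(split_classpath(result.stdout))
```

If check_local_hadoop() returns False, go back and look at the HADOOP_CLASSPATH setting before moving on.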
Create the identity to access Azure Data Lake Store
Azure Data Lake Store uses Azure Active Directory (AAD) to manage identities that can access it. There are multiple types of identities that AAD supports such as users and service principal identities (SPI). For long running processes such as a Hadoop/Spark cluster, service principal identities are an elegant choice.
The general steps to set up identities and provide access to the right data in ADLS are as follows:
- Create an Azure AD web application.
- Retrieve the client ID, client secret, and token endpoint for the Azure AD web application.
- Configure access for the Azure AD web application on the Data Lake Store folders/files that you want to access from the cluster.
A step by step tutorial for how to perform the steps above is provided at this location.
After completing the steps above you should have obtained the following pieces of data to continue configuring your cluster.
- Client ID
- Client Secret
- Token Endpoint
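Before wiring these values into Hadoop, it can save some head scratching to confirm that they actually yield a token. Here is a minimal sketch of the OAuth2 client-credentials request using only the Python standard library. The function names are mine, and the resource URI https://datalake.azure.net/ is what I understand ADLS to expect, so treat it as an assumption and check the Azure documentation if the call is rejected.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed resource identifier for Azure Data Lake Store.
ADLS_RESOURCE = "https://datalake.azure.net/"


def build_token_request(token_endpoint: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) the client-credentials token request."""
    body = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "resource": ADLS_RESOURCE,
        }
    ).encode("ascii")
    return Request(
        token_endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def fetch_access_token(request: Request) -> str:
    """Send the request and return the access token from the JSON reply."""
    with urlopen(request) as response:
        return json.load(response)["access_token"]
```

Calling fetch_access_token(build_token_request(endpoint, client_id, secret)) with your three values should return a long token string. An HTTP error at this point usually means one of the three values is wrong.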
Modify the core-site.xml in your Hadoop cluster
The main step in configuring Hadoop to access ADLS is to set up the Azure Data Lake File System. This is achieved by editing your core-site.xml, which contains cluster-wide configuration. The core-site.xml is located in
Below are the settings in core-site.xml that you need to set. Be sure to substitute YOUR TOKEN ENDPOINT, YOUR CLIENT ID and YOUR CLIENT SECRET with the values that you obtained in the previous step.
<value>YOUR TOKEN ENDPOINT</value>
<value>YOUR CLIENT ID</value>
<value>YOUR CLIENT SECRET</value>
This configures Hadoop to use the Azure Data Lake Store file system client, which is the component that interacts with Azure Data Lake Store. To read more about the Azure Data Lake Store client in Hadoop, you can visit this page.
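For reference, here is a hedged Python sketch that generates a core-site.xml body with the three values filled in. The property names are my recollection of the hadoop-azure-datalake module and are an assumption: as far as I know, the early 3.0.0 alpha builds use the dfs.adls.oauth2 prefix, while later Hadoop releases use fs.adl.oauth2 instead. Check the module documentation for the version you downloaded before relying on them.

```python
import xml.etree.ElementTree as ET


def build_core_site(
    token_endpoint: str,
    client_id: str,
    client_secret: str,
    key_prefix: str = "dfs.adls.oauth2",  # assumed; later releases use "fs.adl.oauth2"
) -> str:
    """Return core-site.xml content as a string (assumed property names)."""
    properties = {
        f"{key_prefix}.access.token.provider.type": "ClientCredential",
        f"{key_prefix}.refresh.url": token_endpoint,
        f"{key_prefix}.client.id": client_id,
        f"{key_prefix}.credential": client_secret,
        # File system implementations registered for the adl:// scheme.
        "fs.adl.impl": "org.apache.hadoop.fs.adl.AdlFileSystem",
        "fs.AbstractFileSystem.adl.impl": "org.apache.hadoop.fs.adl.Adl",
    }
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_core_site("YOUR TOKEN ENDPOINT", "YOUR CLIENT ID", "YOUR CLIENT SECRET"))
```

Keep in mind that the client secret ends up in plain text in this file, so protect it accordingly.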
Now you have configured Hadoop and are ready to test your connection to Azure Data Lake Store.
Test connectivity to Azure Data Lake Store from Hadoop
Once everything above is set up, verifying that everything works and that you are able to connect to ADLS through Hadoop is really easy. You simply run the hdfs dfs shell commands that are built into Hadoop to check connectivity. Here are some examples.
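As an illustration, this small Python sketch builds a few hdfs dfs command lines that point at an ADLS account. The helper names are hypothetical, amitadls is my account name so substitute your own, and the bin\hdfs launcher path is an assumption about where you run it from.

```python
from typing import List


def adl_uri(account: str, path: str = "/") -> str:
    """Build an adl:// URI for a path inside an ADLS account."""
    if not path.startswith("/"):
        path = "/" + path
    return f"adl://{account}.azuredatalakestore.net{path}"


def hdfs_dfs_command(action: str, *args: str, hdfs_cmd: str = r"bin\hdfs") -> List[str]:
    """Return the argument list for `hdfs dfs -<action> <args...>`."""
    return [hdfs_cmd, "dfs", f"-{action}", *args]


if __name__ == "__main__":
    # List the root, create a folder, then upload a local file into it.
    for command in (
        hdfs_dfs_command("ls", adl_uri("amitadls")),
        hdfs_dfs_command("mkdir", adl_uri("amitadls", "/demo")),
        hdfs_dfs_command("put", "accounts.csv", adl_uri("amitadls", "/demo/accounts.csv")),
    ):
        print(" ".join(command))
```

Run the printed commands from the Hadoop directory. If the listing comes back without an authentication error, the credentials in core-site.xml are being picked up.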
Pretty cool, huh? Basically most of this article was all about getting Hadoop set up correctly. The Azure Data Lake specific portions were simply setting up the credentials and including the client in the Hadoop class path.
Once you set up the right credentials in core-site.xml, connecting to ADLS is straightforward.
Now that Hadoop is set up correctly, we can move onward to Spark.
Configuring Spark to connect to ADLS
Spark primarily relies on the Hadoop setup on the box to connect to data sources including Azure Data Lake Store. So the Spark configuration is primarily telling Spark where Hadoop is on the box. This is done by setting environment variables. Here are the variables to set
HADOOP_HOME= Point to the Hadoop installation on the box. In my case it is
SPARK_HOME= Point to the directory that contains the Spark binaries. For my machine it is
SPARK_DIST_CLASSPATH= You need to run “hadoop classpath”, then copy the output and set this variable to it.
Here is me setting the environment variables in my console
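If you prefer to script this rather than copy and paste the class path, the sketch below assembles the same three variables in Python. The function name and the C:\hadoop and C:\spark paths are placeholders of mine, not the locations used in this post.

```python
import os
from typing import Dict, Mapping, Optional


def spark_environment(
    hadoop_home: str,
    spark_home: str,
    hadoop_classpath: str,
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with the three Spark variables set.

    `hadoop_classpath` is the raw output of `hadoop classpath`.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HADOOP_HOME"] = hadoop_home
    env["SPARK_HOME"] = spark_home
    # Strip the trailing newline that the command output carries.
    env["SPARK_DIST_CLASSPATH"] = hadoop_classpath.strip()
    return env


if __name__ == "__main__":
    # Placeholder paths and class path; substitute your own values.
    env = spark_environment(r"C:\hadoop", r"C:\spark", r"C:\hadoop\etc\hadoop" + "\n")
    for key in ("HADOOP_HOME", "SPARK_HOME", "SPARK_DIST_CLASSPATH"):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as the env argument of subprocess.run when launching Spark from a script, which avoids changing your console session.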
Now you are all set to use Spark.
Using Azure Data Lake Store through Spark
You can try this step either through the Scala or PySpark shells. Being a Python fan, I personally prefer PySpark. This will also work equally well through Scala, Spark SQL and pretty much everything that runs on Spark. To launch PySpark simply run
Once PySpark is launched, you can run some basic code to check that you can indeed connect to your ADLS account and use data in it. Here is probably the most trivial piece of code that will test it
rdd = sc.textFile("adl://amitadls.azuredatalakestore.net/accounts.csv")
rdd.count()
This creates a simple RDD from a CSV file that is stored in ADLS and counts its lines. When the path starts with adl://, Spark calls into Hadoop to load the right file system client. Since we have properly configured Hadoop already, it knows where to load the client from and what credentials to use.
You’re now in business and can use Azure Data Lake Store as a file system from Spark.
In this post, we looked at how to connect both Hadoop and Spark to Azure Data Lake Store. The bulk of this article simply contains the gory details of getting Hadoop and Spark themselves set up on a Windows box. The ADLS-specific parts are pretty much setting up the credentials and including the client jars in the class path.
These core concepts will work equally well if you are trying on an actual Hadoop or Spark cluster instead of this single-box setup. Also you can connect to ADLS from anywhere that you can reach the public HTTPS endpoint of ADLS. So it will work on Azure VMs, on premises or other cloud providers.
Hopefully, this was easy to follow and you are able to get this working on your end. If you have any questions, please feel free to reach out to me. Enjoy!
You can find me on Twitter here.