Using Jupyter notebooks and Pandas with Azure Data Lake Store

Data science and advanced analysis using Python on data in your data lake store account

Here’s a question I hear every few days. How do I access data in the data lake store from my Jupyter notebooks? People generally want to load data that is in Azure Data Lake Store into a data frame so that they can analyze it in all sorts of ways.

Before we dive into the details, it is important to note that there are two ways to approach this depending on your scale and topology.

The easy way — using PySpark

If you already have a Spark cluster running and configured to use your data lake store then the answer is rather easy. You can simply open your Jupyter notebook running on the cluster and use PySpark.

Here is the document that shows how you can set up an HDInsight Spark cluster. My previous blog post also shows how you can set up a custom Spark cluster that can access Azure Data Lake Store. This method works great if you already plan to have a Spark cluster or the data sets you are analyzing are fairly large.

I will not go into the details of how to use Jupyter with PySpark to connect to Azure Data Lake Store in this post. This article in the documentation does an excellent job of it.
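For a rough idea of what the PySpark route looks like, here is a minimal sketch. It assumes the notebook kernel already provides a SparkSession (usually called spark) and that the cluster is configured for your store. The helper names adl_uri and read_csv_with_spark, the account name and the file path are all made up for illustration and do not come from the linked documentation.

```python
def adl_uri(account, path):
    """Build an adl:// URI for a file in an Azure Data Lake Store account."""
    return "adl://{0}.azuredatalakestore.net/{1}".format(account, path.lstrip("/"))


def read_csv_with_spark(spark, account, path):
    """Read a CSV file from the store into a Spark DataFrame.

    `spark` is the SparkSession of a cluster that is already configured
    to access the store (an assumption of this sketch).
    """
    return spark.read.csv(adl_uri(account, path), header=True, inferSchema=True)


# Hypothetical usage inside a notebook on the cluster:
# sdf = read_csv_with_spark(spark, "mystore", "/data/sample.csv")
# pdf = sdf.limit(1000).toPandas()  # pull a small slice into Pandas
```

Keeping the URI construction in its own function makes it easy to check the path you are about to read before Spark touches the cluster.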

Even easier — Using the Azure Data Lake Python SDK

On the other hand, sometimes you just want to run Jupyter in standalone mode and analyze all your data on a single machine. You simply want to reach over and grab a few files from your data lake store account to analyze locally in your notebook.

This is also a fairly easy task to accomplish using the Python SDK of Azure Data Lake Store. In this post I will show you all the steps required to do this.

Prerequisites

For the rest of this post, I assume that you have some basic familiarity with Python, Pandas and Jupyter.

On your machine, you will need all of the following installed:

  1. Python 2 or 3 with Pip
  2. Pandas
  3. Jupyter

You can install all these locally on your machine. A great way to get all of this and many more data science tools in a convenient bundle is to use the Data Science Virtual Machine on Azure. I really like it because it’s a one stop shop for all the cool things needed to do advanced data analysis.

I also frequently get asked about how to connect to the data lake store from the data science VM. So this article will try to kill two birds with one stone: I will show you how to do this both locally and from the data science VM.

The Data Science Virtual Machine is available in many flavors. I am going to use the Ubuntu version as shown in this screenshot.


Installing the Azure Data Lake Store Python SDK

If you want to learn more about the Python SDK for Azure Data Lake Store, the first place I recommend you start is here. Installing the Python SDK is really simple: you run a few pip commands to download the packages.

On the local machine

This is very simple. I am assuming you have only one version of Python installed and pip is set up correctly. You simply need to run these commands and you are all set.

pip install azure-mgmt-resource
pip install azure-mgmt-datalake-store
pip install azure-datalake-store

On the data science virtual machine

Here it is slightly more involved but not too difficult. There are multiple versions of Python installed (2.7 and 3.5) on the VM. You need to install the Python SDK packages separately for each version. Additionally, you will need to run pip as root or super user.

For Python 3.5

First run bash as root while retaining your path, which defaults to Python 3.5. Then check that you are using the right version of Python and Pip.

sudo env PATH=$PATH bash
python --version
pip --version

[Screenshot: Here I can see that everything is on Python 3.5]

Now install the three packages.

pip install azure-mgmt-resource
pip install azure-mgmt-datalake-store
pip install azure-datalake-store

You can validate that the packages are installed correctly by running the following command.

pip list | grep 'azure-datalake-store\|azure-mgmt-datalake-store\|azure-mgmt-resource'

[Screenshot: The packages are installed on my VM]

For Python 2.7

Run bash as root NOT retaining your path, so that it defaults to Python 2.7. Then check that you are using the right version of Python and Pip. To run pip you will need to load it from /anaconda/bin. This is the correct version for Python 2.7.

sudo bash
python --version
/anaconda/bin/pip --version

[Screenshot: Here everything is using Python 2.7]

Now install the three packages loading pip from /anaconda/bin.

/anaconda/bin/pip install azure-mgmt-resource
/anaconda/bin/pip install azure-mgmt-datalake-store
/anaconda/bin/pip install azure-datalake-store

Check that the packages are indeed installed correctly by running the following command.

/anaconda/bin/pip list | grep 'azure-datalake-store\|azure-mgmt-datalake-store\|azure-mgmt-resource'

[Screenshot: Everything seems to be installed correctly]

Start a Jupyter Notebook

On the data science VM you can navigate to https://<IP address>:8000. If you are running on your local machine, start Jupyter by running the jupyter notebook command.

Create a new Jupyter notebook with the Python 2 or Python 3 kernel. In this example, I am going to create a new Python 3.5 notebook. If you have installed the Python SDK for 2.7, it will work equally well in the Python 2 notebook.
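Because the VM carries more than one Python, it can be worth confirming from inside the notebook that the kernel you picked actually sees the SDK. The following is a small sketch for a Python 3 kernel. The helper name is_importable is my own, and the module names checked are the ones I would expect pandas and the azure-datalake-store package to expose.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if `module_name` can be found by the current interpreter."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises instead of returning None.
        return False


print("Kernel Python:", sys.version.split()[0])
for name in ("pandas", "azure.datalake.store"):
    print(name, "->", "found" if is_importable(name) else "MISSING")
```

If a package shows up as MISSING, the likely cause is that it was installed with the pip that belongs to the other Python version.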

Obtain an authentication token

In order to read data from your Azure Data Lake Store account, you need to authenticate to it. There are multiple ways to authenticate. In the example below, let us first assume you are going to connect to your data lake account as your own user account. The following method will work in most cases, even if your organization has enabled multi-factor authentication and Active Directory federation.
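A minimal sketch of the interactive sign-in could look like this. It relies on the auth function in the SDK's lib module. To my knowledge, calling it with only a tenant ID starts a device-code login, but treat that as an assumption and check it against the SDK version you installed. The wrapper get_interactive_token and its auth_fn parameter are hypothetical conveniences that let you try the helper with a stand-in, and the tenant ID is a placeholder.

```python
def get_interactive_token(tenant_id, auth_fn=None):
    """Run the interactive sign-in and return the resulting token.

    `auth_fn` is a hypothetical hook added for this sketch so the helper can
    be tried with a stand-in; by default it is the SDK's `lib.auth`.
    """
    if auth_fn is None:
        # Imported lazily so the SDK is only needed when you really sign in.
        from azure.datalake.store import lib
        auth_fn = lib.auth
    # With only a tenant ID (no password, no client secret) the SDK is
    # expected to print a sign-in URL and code, then wait for the login.
    return auth_fn(tenant_id=tenant_id)


# In the notebook (the tenant ID below is a placeholder):
# token = get_interactive_token("<your-tenant-id>")
```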

Running the authentication call in Jupyter prints an instruction containing a URL and a one-time code. Open that URL and follow the flow to authenticate with Azure.

Load an Azure Data Lake Store file into a Pandas data frame

Once you go through the flow, you are authenticated and ready to access data from your data lake store account.
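Here is a sketch of the load step, assuming you already hold a token from the sign-in step. It uses the SDK's AzureDLFileSystem class to open the file and hands the file object to Pandas. The helper names open_store and read_adls_csv, the reader parameter, the store name and the file path are illustrative placeholders rather than part of the SDK.

```python
def open_store(token, store_name):
    """Connect to a Data Lake Store account with an existing token."""
    from azure.datalake.store import core  # lazy import: needs the SDK
    return core.AzureDLFileSystem(token, store_name=store_name)


def read_adls_csv(fs, path, reader=None, **reader_kwargs):
    """Open `path` on the store and parse it into a Pandas data frame.

    `fs` is any object with an `open(path, mode)` method that returns a
    file-like context manager, such as the SDK's AzureDLFileSystem.
    `reader` is a hypothetical hook; it defaults to pandas.read_csv.
    """
    if reader is None:
        import pandas as pd
        reader = pd.read_csv
    with fs.open(path, "rb") as f:
        return reader(f, **reader_kwargs)


# In the notebook (store name and path are placeholders):
# adl = open_store(token, "<your-store-name>")
# df = read_adls_csv(adl, "/path/to/file.csv")
# df.head()
```

Any extra keyword arguments, such as sep or nrows, are passed straight through to the reader, so you can limit how much of a large file you pull onto a single machine.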

When you run the load step in Jupyter, you get a data frame built from your file in the data lake store account.

From here onward, you can panda-away on this data frame and do all your analysis.

Using a Service Principal Identity

There is another way to authenticate with Azure Data Lake Store: using a service principal identity (SPI). A step-by-step tutorial for setting up an Azure AD application, retrieving the client ID and secret, and configuring access using the SPI is available here.

Once you have all the details, pass the tenant ID, client ID and client secret to the authentication call instead of signing in interactively. After you have the token, everything from there onward to load the file into the data frame is identical.
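A sketch of the service principal variant follows. It calls the same auth function in the SDK's lib module, this time with a client ID and secret. The environment variable names, the wrapper get_service_principal_token and its env and auth_fn parameters are assumptions of mine, chosen so that the secret stays out of the notebook.

```python
import os


def get_service_principal_token(env=None, auth_fn=None):
    """Authenticate as a service principal and return the token.

    The environment variable names are hypothetical; use whatever secret
    store you prefer. `auth_fn` is a hook for this sketch and defaults to
    the SDK's `lib.auth`.
    """
    env = os.environ if env is None else env
    if auth_fn is None:
        from azure.datalake.store import lib  # lazy import: needs the SDK
        auth_fn = lib.auth
    return auth_fn(
        tenant_id=env["ADLS_TENANT_ID"],
        client_id=env["ADLS_CLIENT_ID"],
        client_secret=env["ADLS_CLIENT_SECRET"],
        resource="https://datalake.azure.net/",
    )


# In the notebook, after setting the three variables:
# token = get_service_principal_token()
```

Reading the secret from the environment rather than typing it into a cell means it does not end up saved in the notebook file.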

Summary

To sum it all up, you need to install the Azure Data Lake Store Python SDK, and from then on it is really easy to load files from the data lake store account into your Pandas data frame. It works with both interactive user identities and service principal identities.

Hopefully, this article helped you figure out how to get this working. If you have questions or comments, you can find me on Twitter here.