Using the Azure Data Lake Storage Gen2 (ADLS Gen2) API with C#

Dan Cokely
7 min read · Aug 30, 2019

Update 1/2/2020

  • I’ve added a second part to this story, diving into some basic CRUD operations, published here

Intro

Azure Data Lake Storage Gen2 is an interesting capability in Azure. As the name suggests, it started life as its own product (Azure Data Lake Store), an independent hierarchical storage platform mainly focused on providing a storage backing for big data platforms such as HDInsight through a WebHDFS-compatible API. Azure Data Lake Storage Gen2 (ADLS Gen2) takes the key advantage of the original ADLS, the hierarchical storage structure, and applies it to the ubiquitous Blob Storage; the merging of these two technologies has become ADLS Gen2 (docs here). The value prop is the scale, performance, and security of hierarchical storage, combined with the cost effectiveness, tiering, and prevalence of Blob Storage. Big data platforms like Databricks, Hadoop, HDInsight, Hortonworks, and others can be mounted, connected, integrated, or are otherwise supported by ADLS Gen2. While using ADLS Gen2 as a backing for a big data platform is a point-and-click operation (or a simple script), connecting to and leveraging ADLS Gen2 in a custom application through its API can be a bit trickier to get set up and running.

Demo

The API for ADLS Gen2 has all the standard commands you would expect for managing a hierarchical file system: CRUD, properties, metadata, directories, paths, etc. (docs here). Microsoft's docs say to use Shared Key authorization for API calls, which isn't a common authorization scheme, so we'll dig into setting it up correctly, then look at a library alternative that works just as well (probably better… and is easier to use).

We’ll use Visual Studio 2019 Community (16.1.1) to build a simple console app that can generate a token and create a basic file system through the ADLS Gen2 API: first with Shared Key authorization, then with the Microsoft.Azure.Services.AppAuthentication NuGet package.

The ADLS Gen2 Instance

First we need to create our storage instance; we’ll do this from the Azure Portal, starting with a standard Azure Storage Account:

Fill out the name, resource group, location, and other basic metadata however you like, keeping in mind that ADLS Gen2 requires the standard performance tier and either the Storage V2 or Blob account kind (as of this writing):

I’m enabling the secure transfer option (essentially forcing HTTPS for all connections), and the setting that makes this ADLS Gen2, “Hierarchical namespace”:

Lastly, Azure will make sure all the settings are valid, then we can create the account:

Once the storage account creation completes, we will need the access keys in order to use the Shared Key approach. “key1” (the key itself, not the connection string) is the one we’ll use later on, so make note of where to find it:

Lastly, let’s double-check that the hierarchical section of the account shows up; you should see the Data Lake Storage section, but no file systems yet:

The Shared Key

The ADLS Gen2 documentation references the Shared Key approach as the authorization mechanism when accessing the API (docs on Shared Key here). While this is not the only access mechanism that works, since it’s the only one referenced in the documentation (for now) we’ll start by implementing Shared Key. (Full disclosure: I think the second method we’ll look at is far more useful.)

The Shared Key is a keyed digest, where the parameters of the request are hashed along with the storage account key, and the result is base64 encoded and attached as the authorization header. The receiving side (Azure) will then take all the parameters of the request, and hash them along with the same storage account key to make sure the request was not altered, and that a valid storage account key was used.

The parameters of our request are concatenated into the string to be signed with the storage account key. Using the MSFT docs as a guide, we’ll create the following helper program to generate keys. Note that the method below is essentially hardcoded to create a key for the create-file-system operation only, but could easily be adapted to generate keys for other operations:
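The original embedded gist isn’t reproduced here, so below is a minimal sketch of such a helper, based on the Shared Key docs. The account name, file system name, and the `STORAGE_KEY1` environment variable are placeholders; substitute your own values (the key is “key1” from the portal):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class SharedKeyHelper
{
    // Placeholders -- substitute your own account and file system names.
    public const string AccountName = "mystorageaccount";
    public const string FileSystem  = "dansfilesystem";

    // String-to-sign for: PUT https://{account}.dfs.core.windows.net/{fs}?resource=filesystem
    // Layout: VERB, then the 11 optional standard headers (all empty for
    // this call, so each contributes only its trailing newline), then
    // CanonicalizedHeaders and CanonicalizedResource.
    public static string BuildStringToSign(string gmtDate) =>
        "PUT\n" +
        new string('\n', 11) +
        $"x-ms-date:{gmtDate}\nx-ms-version:2018-11-09\n" + // CanonicalizedHeaders
        $"/{AccountName}/{FileSystem}\nresource:filesystem"; // CanonicalizedResource

    // HMAC-SHA256 over the string-to-sign, keyed with the decoded account key.
    public static string Sign(string base64AccountKey, string stringToSign)
    {
        using var hmac = new HMACSHA256(Convert.FromBase64String(base64AccountKey));
        return Convert.ToBase64String(hmac.ComputeHash(Encoding.UTF8.GetBytes(stringToSign)));
    }

    public static void Main()
    {
        string now = DateTime.UtcNow.ToString("R"); // RFC 1123 format, always GMT
        string key = Environment.GetEnvironmentVariable("STORAGE_KEY1") ?? "";
        Console.WriteLine($"x-ms-date: {now}");
        Console.WriteLine($"Authorization: SharedKey {AccountName}:{Sign(key, BuildStringToSign(now))}");
    }
}
```

The signature is only valid for the exact parameter values it was computed over, so the `x-ms-date` printed here must be sent verbatim on the request.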

Notice how most of the parameters are not required; the required pieces are the HTTP verb (PUT), CanonicalizedHeaders, and CanonicalizedResource. The CanonicalizedHeaders for us are simply the x-ms-date (in GMT) and x-ms-version (2018-11-09) headers concatenated together with newlines (the full format and options are detailed in the docs linked at the beginning of this section, or here again). The CanonicalizedResource is a standardized way to represent the path to a resource and can include additional parameters about the resource. Since we will only be at the file system depth, it is pretty straightforward; it becomes more complicated as the depth and the resource being requested change, but for now it’s simple since the file system is the highest level of resource. We take all these pieces, along with the storage key pointed out in the earlier section, and produce the Shared Key.
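Concretely, for the create-file-system call the string to sign looks roughly like this (with `\n` marking newlines; the account name and date are placeholders):

```
PUT\n
\n            ← ×11, one per unused standard header (Content-Encoding … Range)
x-ms-date:Fri, 30 Aug 2019 12:00:00 GMT\n
x-ms-version:2018-11-09\n
/mystorageaccount/dansfilesystem\n
resource:filesystem
```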

Once we run the program, we can take the output (the Shared Key) and all the other parameters and plug them into a REST client of choice (I’m using Insomnia) to actually create a file system. We need the path to the hierarchical API (*.dfs.core.windows.net); the rest of the parameters come from the values used in the code above to create the key (this makes sense, because Azure will validate the parameters of our call against the key we send, so they must match). Plugging all of this in gives us the request we are going to send (left side), and once we send it, we should get a valid response, in this case a 201 (right side).
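The same request can also be issued from code rather than a REST client. Here is a sketch using HttpClient; the account name, file system name, date, and signature are placeholders, and the date must be the exact value the signature was generated for:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class CreateFileSystemRequest
{
    static async Task Main()
    {
        // Placeholders -- substitute your own account, file system, and
        // the x-ms-date / SharedKey signature pair produced by the helper.
        const string account    = "mystorageaccount";
        const string fileSystem = "dansfilesystem";
        string xmsDate          = "Fri, 30 Aug 2019 12:00:00 GMT";
        string authorization    = $"SharedKey {account}:<signature from the helper>";

        using var client = new HttpClient();
        var request = new HttpRequestMessage(
            HttpMethod.Put,
            $"https://{account}.dfs.core.windows.net/{fileSystem}?resource=filesystem");
        request.Headers.TryAddWithoutValidation("x-ms-date", xmsDate);
        request.Headers.TryAddWithoutValidation("x-ms-version", "2018-11-09");
        request.Headers.TryAddWithoutValidation("Authorization", authorization);

        HttpResponseMessage response = await client.SendAsync(request);
        Console.WriteLine((int)response.StatusCode); // a 201 indicates the file system was created
    }
}
```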

Now if we go back into the Azure Portal, our new dansfilesystem exists:

The Library

The Shared Key is fairly generic and can be used in most scenarios where you need to programmatically access the API. However, it can be cumbersome to build the strings and pass all the required parameters for every single call. It would be a lot easier if we could simply generate a token independent of all the call parameters. As it turns out, we can use the NuGet package Microsoft.Azure.Services.AppAuthentication to do just that! The library will use your local developer credentials, either through the account you’re logged into Visual Studio with or an active az login session (docs here). Even better, once the code is deployed to Azure it will use a managed identity (see more on managed identities for Azure resources here) to authenticate, so no other config (other than enabling managed identity on the resource) is required.

The first step is to verify that Visual Studio is using the correct identity when it’s running the program (it will need to be an account that has some type of access to the resource being requested, in this case storage). Under “Tools” -> “Options” you will find the following:

Make sure the account listed in the dropdown has access to the storage account.

This time our method to generate the file system is much simpler:
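Again the original gist isn’t reproduced here, so this is a minimal sketch of what that method can look like. AzureServiceTokenProvider.GetAccessTokenAsync is the real library API; the account and file system names are placeholders:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using Microsoft.Azure.Services.AppAuthentication;

class CreateFileSystemWithToken
{
    static async Task Main()
    {
        const string account    = "mystorageaccount"; // placeholder
        const string fileSystem = "dansfilesystem";   // placeholder

        // Locally this uses your Visual Studio / az login identity;
        // deployed to Azure, it uses the resource's managed identity.
        var tokenProvider = new AzureServiceTokenProvider();
        string token = await tokenProvider.GetAccessTokenAsync("https://storage.azure.com/");

        using var client = new HttpClient();
        var request = new HttpRequestMessage(
            HttpMethod.Put,
            $"https://{account}.dfs.core.windows.net/{fileSystem}?resource=filesystem");
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", token);
        request.Headers.Add("x-ms-version", "2018-11-09");

        HttpResponseMessage response = await client.SendAsync(request);
        Console.WriteLine((int)response.StatusCode); // a 201 indicates success
    }
}
```

Note how none of the call parameters participate in generating the credential: the bearer token works for any storage operation the identity is authorized for, which is what makes this approach so much less fiddly than Shared Key.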

Finally, running the above code and going back to the portal, we will again see our new dansfilesystem created.

Next steps

Now that a file system exists and we have multiple ways to add authorization to our API requests, we can begin creating directories, files, etc.

I’ll be following up with another article soon diving into those file and directory operations to put the authorization header to work.
