R and Azure’s Data Lake


We’ve been taking a closer look at Azure for recent experiments at Zero Gravity Labs. While there are differences from the familiar AWS, I’m finding the Azure Data Lake REST API a breeze to work with. In this blog post I’ll share a quick example in R.

Use Case

We have a preprocessing job where the data set is relatively small (a few million ndjson records). While Apache Spark is my typical go-to, it’s a little overkill for this job. All I need to do is download the records from Azure’s Data Lake onto a VM, parse and clean some XML fields (not covered in this blog post), and persist the cleaned data set back to the Data Lake.

Prerequisites

You’ll need to set up an Azure account before starting, along with an Azure Active Directory application that has permissions to access the Data Lake. I found the Azure documentation helpful here.

Take note of these three pieces of information:

  • Directory ID (Tenant ID)
  • Application ID (Client ID)
  • API key

Round Trip

To make things a little easier, I put together a small API wrapper in R. See the full repo here: https://github.com/zerogravitylabs/razdatalake/

Install and/or load

devtools::install_github("zerogravitylabs/razdatalake")

library(razdatalake)

Initialize configuration

# Directory ID (Tenant ID)
tenantID <- ""
# Application ID (Client ID)
clientID <- ""
# API key
apiKey <- ""

And authenticate

token <- getToken(tenantID, clientID, apiKey)
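
For context, getToken wraps Azure AD’s OAuth2 client-credentials flow. A minimal sketch of the equivalent raw request with httr (the resource URI below is the standard Data Lake Store one; the details are my illustration, not necessarily the package’s internals):

# Sketch only: the raw OAuth2 client-credentials request behind getToken
library(httr)

res <- POST(
  sprintf("https://login.microsoftonline.com/%s/oauth2/token", tenantID),
  body = list(
    grant_type    = "client_credentials",
    client_id     = clientID,
    client_secret = apiKey,
    resource      = "https://datalake.azure.net/"
  ),
  encode = "form"
)

# The bearer token used on subsequent Data Lake requests
accessToken <- content(res)$access_token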

From here we follow a process of bringing data onto the VM for processing only. Storage and persistence remain the responsibility of the Data Lake (giving us the advantage of only having to manage stateless VMs).

Given our use case, we support downloading the ndjson data from multi-file folders.

# Download contents of folder into single `data.table` ----
dt <- downloadFiles("DATA LAKE NAME", "FOLDER NAME", token)
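
Under the hood, the Data Lake exposes a WebHDFS-compatible REST API, so downloadFiles amounts to a LISTSTATUS call followed by an OPEN per file. A rough, hypothetical equivalent using httr directly (the function name and details here are mine, not the package’s):

# Sketch only: list a folder and stack its ndjson files into one data.table
library(httr)
library(jsonlite)
library(data.table)

downloadFolderSketch <- function(store, folder, accessToken) {
  base <- sprintf("https://%s.azuredatalakestore.net/webhdfs/v1", store)
  auth <- add_headers(Authorization = paste("Bearer", accessToken))

  # LISTSTATUS returns the folder's file names
  listing <- content(GET(sprintf("%s/%s?op=LISTSTATUS", base, folder), auth))
  fileNames <- vapply(listing$FileStatuses$FileStatus,
                      function(f) f$pathSuffix, character(1))

  # OPEN each file and parse the ndjson body, then stack the results
  rbindlist(lapply(fileNames, function(n) {
    res <- GET(sprintf("%s/%s/%s?op=OPEN&read=true", base, folder, n), auth)
    stream_in(textConnection(content(res, as = "text")), verbose = FALSE)
  }), fill = TRUE)
}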

And finally, once the data munging is complete, we persist back to the Data Lake. For simplicity, I first save a temporary file on the local VM’s disk.

# Persist in Azure Data Lake ----
jsonlite::stream_out(dt, file("dt.json"))
putFile("DATA LAKE NAME", "NEW FOLDER NAME", "dt.json", token)
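
For the curious, putFile maps onto the same WebHDFS-style API’s CREATE operation. A hedged sketch of the raw call (helper name and parameters are my assumption):

# Sketch only: upload a local file via the WebHDFS-style CREATE operation
library(httr)

putFileSketch <- function(store, folder, localPath, accessToken) {
  url <- sprintf(
    "https://%s.azuredatalakestore.net/webhdfs/v1/%s/%s?op=CREATE&overwrite=true&write=true",
    store, folder, basename(localPath)
  )
  # Upload the file body directly with the bearer token
  PUT(url,
      add_headers(Authorization = paste("Bearer", accessToken)),
      body = upload_file(localPath, type = "application/octet-stream"))
}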

Thoughts

While still early in my own tour of Azure, I am finding good support for my use cases. Working with the Data Lake as the primary storage has been straightforward and, while I have not formally benchmarked it, I’m seeing fast performance when transferring Data Lake data to and from an Azure VM.

Feel free to make use of or extend our repo https://github.com/zerogravitylabs/razdatalake/ and contact us (or talk to us on social) to tell us about your own use of these tools.

Noah Marconi is a Scientist, Engineering, and a Day 1 associate of Zero Gravity Labs.

Connect with Noah or any member of our team by following us on Facebook, Twitter, Instagram and LinkedIn.
