R and Azure’s Data Lake
We’ve been taking a closer look at Azure for recent experiments at Zero Gravity Labs. While there are differences from the more familiar AWS, I’m finding the Azure Data Lake REST API a breeze to work with. In this blog post I’ll share a quick example in R.
Use Case
We have a preprocessing job where the data set is relatively small (a few million ndjson records). While Apache Spark is my typical go-to, it’s a little overkill for this job. All I need to do is download the records from Azure’s Data Lake onto a VM, parse and clean some XML fields (not covered in this blog post), and persist the cleaned data set back to the Data Lake.
Prerequisites
You’ll need to set up an Azure account before starting, along with an Azure Active Directory application that has permission to access the Data Lake. I found the Azure documentation helpful here.
Take note of these three pieces of information:
- Directory ID (Tenant ID)
- Application ID (Client ID)
- API Key
Round Trip
To make things a little easier, I put together a small API wrapper in R. See the full repo here: https://github.com/zerogravitylabs/razdatalake/
Install and/or load
devtools::install_github("zerogravitylabs/razdatalake")
library(razdatalake)
Initialize configuration
# Directory ID (Tenant ID)
tenantID <- ""
# Application ID (Client ID)
clientID <- ""
# API Key
apiKey <- ""
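To keep secrets out of source control, you could pull these values from environment variables instead of hardcoding them. A minimal sketch (the variable names here are my own convention, not part of the package):
# Read credentials from environment variables (illustrative variable names)
tenantID <- Sys.getenv("AZURE_TENANT_ID")
clientID <- Sys.getenv("AZURE_CLIENT_ID")
apiKey <- Sys.getenv("AZURE_API_KEY")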
And authenticate
token <- getToken(tenantID, clientID, apiKey)
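For the curious, this boils down to an OAuth2 client credentials request against Azure Active Directory. A rough sketch with httr of what such a call looks like (illustrative, not necessarily the package’s exact internals):
# OAuth2 client credentials flow against Azure AD -- sketch only
library(httr)
resp <- POST(
  paste0("https://login.microsoftonline.com/", tenantID, "/oauth2/token"),
  body = list(
    grant_type = "client_credentials",
    client_id = clientID,
    client_secret = apiKey,
    resource = "https://datalake.azure.net/"
  ),
  encode = "form"
)
bearer <- content(resp)$access_token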
From here, we bring data onto the VM for processing only; storage and persistence remain the responsibility of the Data Lake, which leaves us with only stateless VMs to manage.
Given our use case, the wrapper supports downloading ndjson data from multi-file folders.
# Download contents of folder into single `data.table` ----
dt <- downloadFiles("DATA LAKE NAME", "FOLDER NAME", token)
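Under the hood, the Data Lake exposes a WebHDFS-compatible REST API, so fetching a folder comes down to a LISTSTATUS call followed by an OPEN call per file. A rough sketch (assuming bearer is the access token string from above; downloadFiles takes care of this loop, plus parsing into a data.table, for you):
# List a folder (LISTSTATUS), then read one file (OPEN) -- sketch only
base <- "https://DATA LAKE NAME.azuredatalakestore.net/webhdfs/v1/"
auth <- httr::add_headers(Authorization = paste("Bearer", bearer))
listing <- httr::content(httr::GET(paste0(base, "FOLDER NAME?op=LISTSTATUS"), auth))
files <- vapply(listing$FileStatuses$FileStatus, `[[`, "", "pathSuffix")
raw <- httr::content(httr::GET(paste0(base, "FOLDER NAME/", files[1], "?op=OPEN&read=true"), auth), as = "text")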
And finally, once the data munging is complete, we persist the results back to the Data Lake. For simplicity, I first save a temporary file to the local VM’s disk.
# Persist in Azure Data Lake ----
jsonlite::stream_out(dt, file("dt.json"))
putFile("DATA LAKE NAME", "NEW FOLDER NAME", "dt.json", token)
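On the wire, putFile corresponds to the WebHDFS CREATE operation. A minimal sketch reusing the base URL and auth header from above (the real function also handles details such as redirects):
# Upload the local file via WebHDFS CREATE -- sketch only
httr::PUT(
  paste0(base, "NEW FOLDER NAME/dt.json?op=CREATE&overwrite=true"),
  auth,
  body = httr::upload_file("dt.json")
)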
Thoughts
While I’m still early in my own tour of Azure, it has supported my use cases well so far. Working with the Data Lake as the primary storage has been straightforward, and, while I have not formally benchmarked it, transfers between the Data Lake and an Azure VM have been fast.
Feel free to make use of or extend our repo https://github.com/zerogravitylabs/razdatalake/ and contact us (or talk to us on social) to tell us about your own use of these tools.
Noah Marconi is Scientist, Engineering, and a Day 1 associate of Zero Gravity Labs.
Connect with Noah or any member of our team by following us on Facebook, Twitter, Instagram and LinkedIn.