A gift from the cloud

Using cloudml to store and retrieve data from Google Cloud Storage buckets with R.

Clayton Besaw
The die is forecast

--

This article is part of our code and methods series. Every so often we will post an overview of an R or Python approach to statistical analysis and data visualization that we use in our work at One Earth Future.

One Earth Future was recently awarded non-profit status for Google’s suite of services, including credits and learning tools for using the Google Cloud range of products for data storage and analysis.

In the forecasting project, we mainly use a combination of R (~85%) and Python (~15%). While it is still early days for our cloud storage plans, we needed to quickly determine how best to get the forecasting project’s local data products into organization-wide storage buckets. This helps us provide data to our comms department and various implementation programs across Colorado, Somalia, and Colombia.

cloudml is a nice package for uploading local data to Google Cloud Storage buckets and for retrieving data into a local directory or directly into an R data frame object.

This short vignette will walk us through the basics of cloudml and Google Cloud Storage operations on Windows 10.

First, install the cloudml and tfdatasets packages:

install.packages(c("cloudml", "tfdatasets"))
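
If you rerun this setup often, a guarded install avoids downloading packages that are already present. The following is a minimal sketch in base R. missing_packages() is a hypothetical helper written for this example and is not part of cloudml, and the interactive() guard is an assumption of the sketch so that nothing is installed when the script runs unattended.

```r
# Packages this vignette relies on.
needed <- c("cloudml", "tfdatasets")

# Hypothetical helper: return the subset of `pkgs` that cannot be loaded.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

to_install <- missing_packages(needed)

# Only install when running interactively and something is actually missing.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Drop the interactive() check if you want the same behavior from a scheduled script.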

Once these are installed, load the cloudml package and run the gcloud_install() function:

library(cloudml)
gcloud_install()

You should see something like this!

The gcloud_install() function begins downloading the Google Cloud SDK, which lets you easily interface with your Google Cloud account. The official documentation suggests that you accept the default path suggested by the installation process.

Once you’ve finished the SDK setup, you should see some instructions for initializing your installation to connect your local processes to a default Google Cloud account. You should see something like this pop up in the GC SDK shell.

Say yes at this prompt. You will then be asked to choose your default GC project to interface with.

The process will authenticate your GC account and log you in. From there you will have the choice of which specific cloud project you wish to set as your default pathway. You have the option to make a new project as well. For this example, we choose 3 for our already existing CoupCast GC project.
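
To confirm which project ended up as the default, you can ask the SDK from R. This is a rough sketch that assumes the gcloud executable is on your PATH, which the installer may or may not have arranged on your machine. It shells out to the gcloud command line directly and does not use any cloudml function.

```r
# Look for the gcloud executable on the PATH (an empty string means not found).
gcloud <- Sys.which("gcloud")

if (nzchar(gcloud)) {
  # `gcloud config get-value project` prints the default project ID.
  project <- system2(gcloud, c("config", "get-value", "project"), stdout = TRUE)
  message("Default project: ", paste(project, collapse = " "))
} else {
  message("gcloud was not found on the PATH; the SDK may be installed elsewhere.")
}
```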

Assuming no major hiccups, we can begin to upload and download data accordingly. First, let’s send our freshly updated February 2019 REIGN data to our rubicon2 storage bucket within the CoupCast project by using the gs_copy() function.

gs_copy("REIGN_2019_2.csv", "gs://rubicon2/REIGN_2019_2.csv")

The first argument is the file you wish to send (source) and the second is the output (destination). Here we are simply sending the data that is in our working directory to a new file called REIGN_2019_2.csv in our storage bucket. If the transfer is successful, you should see the following console output.

“I can do that Dave” — HAL 9000 if it wanted to help out with our data projects.
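
If you upload files regularly, it can help to build the destination path in one place and to fail early when the local file is missing. Below is a minimal sketch under those assumptions. gs_uri() and upload_to_bucket() are hypothetical helpers written for this example, and the only cloudml call they rely on is gs_copy() as used above.

```r
# Hypothetical helper: build a gs:// URI from a bucket name and an object name.
gs_uri <- function(bucket, object) {
  sprintf("gs://%s/%s", bucket, object)
}

# Hypothetical wrapper: upload a local file, keeping its file name in the bucket.
upload_to_bucket <- function(local_file, bucket) {
  if (!file.exists(local_file)) {
    stop("Local file not found: ", local_file)
  }
  cloudml::gs_copy(local_file, gs_uri(bucket, basename(local_file)))
}

gs_uri("rubicon2", "REIGN_2019_2.csv")
# [1] "gs://rubicon2/REIGN_2019_2.csv"

# Uncomment to run the upload (requires the SDK setup described above):
# upload_to_bucket("REIGN_2019_2.csv", "rubicon2")
```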

Retrieving data is as simple as changing the source to a file in our storage bucket. Let’s retrieve the January 2019 REIGN data from storage and save it to our working directory.

gs_copy("gs://rubicon2/REIGN_2019_1.csv", "REIGN_2019_1.csv")

If successful, the console should give a similar output.
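
Beyond reading the console output, a quick look at the file itself confirms that the download landed where you expected. The sketch below uses only base R. check_download() is a hypothetical helper written for this example, not a cloudml function.

```r
# Hypothetical helper: report the size and first few lines of a downloaded file.
check_download <- function(path, n = 3) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  list(
    size_bytes = file.size(path),    # size on disk
    head = readLines(path, n = n)    # header row plus the first records
  )
}

# Only inspect the file if the download above has been run.
if (file.exists("REIGN_2019_1.csv")) {
  print(check_download("REIGN_2019_1.csv"))
}
```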

Finally, we can use the tfdatasets package to import data in our storage bucket directly into an R data frame object. Let’s import our newly uploaded February REIGN data into R.

library(tfdatasets)
# set up a readable file path
gc_dir <- gs_data_dir("gs://rubicon2")
reign_file <- file.path(gc_dir, "REIGN_2019_2.csv")
# import data
reign_df <- read.csv(reign_file, header = TRUE)
dim(reign_df)
# [1] 132770     38

With the data in memory, we can begin to work with it. To recap the three steps: gs_data_dir(), which comes from cloudml, sets up the storage bucket root; file.path() combines that root with the specific file you wish to import; and the appropriate import function (here read.csv()) finishes the process.
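
The same root-plus-file pattern can be tried without touching a bucket. The sketch below is only an illustration: a temporary local folder stands in for the directory that gs_data_dir() would return, and the two-row data set and the file name example.csv are made up for the demonstration.

```r
# Assumption: a local temp folder stands in for the bucket root.
gc_dir <- file.path(tempdir(), "bucket_standin")
dir.create(gc_dir, showWarnings = FALSE)

# Write a small made-up CSV into the stand-in root.
example_file <- file.path(gc_dir, "example.csv")
write.csv(
  data.frame(country = c("A", "B"), year = c(2019, 2019)),
  example_file,
  row.names = FALSE
)

# Import it exactly as we would import a file under the real bucket root.
example_df <- read.csv(example_file, header = TRUE, stringsAsFactors = FALSE)
dim(example_df)
# [1] 2 2
```

Setting stringsAsFactors explicitly keeps the result the same across R versions, since the default changed in R 4.0.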

--


Research Associate at One Earth Future. Political violence, instability, forecasting, machine learning.