Beautiful data science with functional programming and R

If you follow what’s going on in the R ecosystem, you know that the language has been revolutionized by two packages: dplyr and magrittr. The former provides elegant methods for working with relational data, while the latter brought the pipe %>% to R.

Together, these packages make handling tabular data a snap. For those of you who haven’t seen these packages in action, here’s a little taste:

library(dplyr)  # attaching dplyr also makes the %>% pipe available

iris %>%
  group_by(Species) %>%
  summarize(Mean.Width = mean(Sepal.Width))

The syntax is super clear: even someone with no R experience can understand those three lines of code. No wonder these packages caught on so fast.

But what if you don’t have tabular data? What if you’re using JSON or maybe a time series? Sure, you can shoehorn those data types into a data.frame, but it’s not the most natural fit. This is where my new favorite R package, purrr, can help you out.

What’s purrr?

The goal of purrr is to bring more functional programming to R. It gives you easy-to-use mappers, reducers and filters that you can chain together. In short, it’s dplyr for non-tabular data.
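To make "mappers, reducers and filters" a bit more concrete, here is a minimal sketch on a plain list. The `readings` data is made up purely for illustration; `map_dbl`, `keep` and `reduce` are the purrr functions doing the mapping, filtering and reducing.

```r
library(purrr)  # purrr re-exports the %>% pipe

# Made-up example data: a list of numeric vectors of different lengths
readings <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

total <- readings %>%
  map_dbl(mean) %>%     # mapper: one mean per element
  keep(~ .x > 3) %>%    # filter: drop the means that are 3 or less
  reduce(`+`)           # reducer: add up whatever is left

total
```

Each step takes the output of the previous one, which is the same chaining style dplyr gives you for data.frames.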

That’s great in theory, you’re probably thinking, but how do I use it in practice? What does the code look like? Well, I say, the best explanation is always an example.

An Example: Forecasting Gross Domestic Product

Let’s say you want to forecast the GDP of a bunch of countries using R. You could force this data into a data.frame but it’s going to be a bit weird. Or you could use R’s built-in time series class but then you don’t have dplyr’s easy syntax. This is where purrr really excels: you can use time series objects and have dplyr-like syntax.

Getting the data from Quandl

We’re going to use the awesome Quandl API to get a bunch of GDP data and then forecast the GDP for 9 different countries for the next 10 years. First, let’s load some libraries and define the datasets that we’re interested in.

library(Quandl)
library(purrr)
library(forecast)
library(dplyr)
datasets = c(Germany="FRED/DEURGDPR",
             Singapore="FRED/SGPRGDPR",
             Finland="FRED/FINRGDPR",
             France="FRED/FRARGDPR",
             Italy="FRED/ITARGDPR",
             Ireland="FRED/IRLRGDPR",
             Japan="FRED/JPNRGDPR",
             Netherlands="FRED/NLDRGDPR")  # the code for the ninth country is missing from this listing

Next, let’s use the map function from purrr to pull down the data as a time series object for each of these countries.

time_series = datasets %>%
  map(Quandl, type="ts")

What’s going on with those two lines of code? The map function takes a list or vector and a function as inputs. It then applies the function to each element and returns the results as a list. So, for each of the elements in the datasets vector, it’s calling the Quandl function to retrieve the data. The type="ts" at the end is passed along to Quandl on every call and ensures that we’re getting a time series object instead of a data.frame.
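If you want to see how map forwards extra arguments without hitting the Quandl API, here is a small offline sketch. Note that `fake_fetch` is a hypothetical stand-in I made up for this example, not part of Quandl; it just builds a dummy yearly series from the code string.

```r
library(purrr)

# Hypothetical stand-in for Quandl(): returns dummy values, no network needed
fake_fetch <- function(code, type = "raw", start_year = 1960) {
  values <- seq_len(nchar(code))  # dummy numbers, one per character of the code
  if (type == "ts") ts(values, start = start_year) else data.frame(value = values)
}

codes <- c(Germany = "FRED/DEURGDPR", Japan = "FRED/JPNRGDPR")

# Equivalent to calling fake_fetch(code, type = "ts") once per element
series <- map(codes, fake_fetch, type = "ts")

str(series)
```

The names of the input vector carry over to the resulting list, which is why the countries show up as list names later on.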

You may be thinking that map is awfully similar to lapply, and you’re right. In fact, purrr is wrapping the standard lapply. But it’s providing an improved syntax and expanded capabilities that we’ll see later.
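Here is a quick side-by-side of the two, as a sketch on toy data. It shows that map and lapply give the same list, and previews two of the conveniences: the `~ .x` formula shorthand for anonymous functions and the typed variants such as `map_dbl`.

```r
library(purrr)

x <- list(a = c(1, 2, 3), b = c(4, 5, 6))  # toy data

via_lapply <- lapply(x, function(v) sum(v) * 2)
via_map    <- map(x, ~ sum(.x) * 2)         # formula shorthand for the same function

same <- identical(via_lapply, via_map)

# Typed variant: returns a named numeric vector instead of a list
totals <- map_dbl(x, sum)
```

The typed variants are handy because they fail loudly if a result isn’t the type you asked for, instead of quietly handing you a list.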

If you look at time_series, you’ll see that it’s a list of time series, one for each of the countries. (I truncated the output to save space.)

> str(time_series)
List of 9
$ Germany : Time-Series [1:52] from 1960 to 2011: 684721 716424 749838 ...
$ Singapore : Time-Series [1:52] from 1960 to 2011: 7208 7795 8349 ...
$ Finland : Time-Series [1:37] from 1975 to 2011: 85842 86137 86344 ...

Forecasting GDP

Next, we’re going to train an ARIMA model on each of these time series and then forecast for the next 10 years. The code for this is super easy with purrr. Check out my other blog post on time series if you want to find out more about these models.

forecasts = time_series %>%
  map(auto.arima) %>%
  map(forecast, h=10)

With purrr, chaining together these maps becomes super easy. No messy loops or confusing lapply calls to deal with. It’s functional programming at its best.
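If you don’t have a Quandl key handy, you can try the same chain on data that ships with R. This is only a sketch: Nile and lynx are built-in annual series (not GDP), used here as stand-ins so the example runs offline.

```r
library(purrr)
library(forecast)

# Stand-in data: two annual time series bundled with R (not GDP figures)
series <- list(Nile = Nile, Lynx = lynx)

models    <- map(series, auto.arima)         # one fitted ARIMA model per series
forecasts <- map(models, forecast, h = 10)   # 10 steps ahead for each model

# Each forecast object stores its point forecasts in $mean
horizons <- map_int(forecasts, ~ length(.x$mean))
horizons
```

Splitting the chain into `models` and `forecasts` is optional, but keeping the fitted models around makes it easy to inspect them later.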

Plotting the forecasts

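Now that we have our forecasts, it would be good to visualize them. We’re going to use the map2 function to do this. It works like map, but walks over two inputs in parallel, here the forecasts and their country names. Here’s the code: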

par(mfrow=c(3,3)) # set grid for plots
map2(forecasts, names(forecasts),
     function(forecast, country) plot(forecast,
                                      main=country,
                                      bty="n",
                                      ylab="GDP",
                                      xlab="Year"))

Which produces a 3-by-3 grid of forecast plots, one panel per country.
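One optional tweak: since plotting is done purely for its side effect, purrr’s walk2 may be a slightly better fit than map2. It calls the function the same way but returns its first input invisibly, so nothing gets printed to the console. A minimal sketch, again using the built-in Nile and lynx series as stand-ins for the forecasts:

```r
library(purrr)

# Stand-in data: two annual series bundled with R
series <- list(Nile = Nile, Lynx = lynx)

par(mfrow = c(1, 2))  # one row, two panels

# walk2 is map2 for side effects: it returns `series` invisibly
result <- walk2(series, names(series),
                function(s, title) plot(s, main = title, bty = "n", xlab = "Year"))
```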

Turning the forecasts into a data.frame

Of course, we wouldn’t be doing functional programming if we didn’t have a reducing step. Thankfully, purrr has a reduce function made for doing this. I’m going to take each of my forecasts, convert them to data.frames and then stack them together with rbind. The code for this is again really simple.

forecasts_df = forecasts %>%
  map(as.data.frame) %>%
  reduce(rbind)

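The reduce step sequentially binds each data.frame to the result so far: the first two are bound together, then the third is bound to that, and so on. Once you have the final data in forecasts_df you can start working with dplyr, since the data is in a nice tabular format.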
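One thing to watch for when stacking like this is that the list names (the countries) don’t end up as a column of the result. A possible way around that, sketched here on made-up numbers rather than real forecasts, is to add the name as a column with imap before reducing:

```r
library(purrr)

# Made-up data.frames standing in for the per-country forecast tables
pieces <- list(
  Germany = data.frame(year = 2012:2013, gdp = c(1.0, 1.1)),
  Japan   = data.frame(year = 2012:2013, gdp = c(2.0, 2.1))
)

stacked <- pieces %>%
  imap(~ cbind(country = .y, .x)) %>%  # .x is the data.frame, .y is its list name
  reduce(rbind)                        # bind the pieces together one after another

stacked
```

With the country in its own column, a later group_by(country) in dplyr works as you would expect.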

Summary

Next time you find yourself struggling with non-tabular data in R, check out purrr. We’ve only just scratched the surface of the package, but it’s already making my R code cleaner and faster to write. Until next time, happy coding!