Big Data with R !!

Apoorva Jain
Published in Analytics Vidhya · Jun 11, 2020

Every now and then we hear that R is sluggish with big data. Here we are talking about terabytes or petabytes, and one of R's biggest limitations is that the data must fit within RAM.

To work around this, we use out-of-memory processing, which handles the data in chunks rather than processing it all at once. Two packages that support this approach are shown below.

#install.packages("ff")
library(ff)
#install.packages("ffbase")
library(ffbase)
  1. The ff package chunks the data and stores it as an encoded raw flat file on the hard disk, while still giving you fast access through the usual functions. Its data structure, the ff data frame, keeps only a small mapping to the partitioned dataset in RAM. As an example of how the chunking works: reading a 2 GB file takes about 460 seconds and produces one ff data frame of about 515 KB in RAM plus 28 ff data files of roughly 50 MB each (about 1.37 GB) on disk.
  2. For basic operations such as merging, finding duplicates and missing values, and creating subsets, we use the ffbase package. We can also perform clustering, regression, and classification directly on ff objects. A small sketch of chunked processing follows this list.
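To make the chunking idea concrete, here is a minimal sketch using the built-in iris data (the dataset and column are only placeholders for illustration, not part of the original example):

library(ff)
library(ffbase)

# Store a regular data frame as an ff data frame backed by files on disk;
# only a small mapping structure stays in RAM.
iris.ff <- as.ffdf(iris)

# Process one column chunk by chunk instead of loading everything at once.
sepal <- iris.ff$Sepal.Length
total <- 0
for (idx in chunk(sepal)) {
  total <- total + sum(sepal[idx])   # only the current chunk is pulled into RAM
}
total   # same result as sum(iris$Sepal.Length)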

Let's look at some R code for the operations described above.

# Uploading from flat files
system("mkdir ffdf")
options(fftempdir = "./ffdf")
system.time(fli.ff <- read.table.ffdf(file = "flights.txt", sep = ",",
                                      VERBOSE = TRUE, header = TRUE, colClasses = NA))
system.time(airln.ff <- read.csv.ffdf(file = "airline.csv",
                                      VERBOSE = TRUE, header = TRUE, colClasses = NA))

# Merging the datasets
flights.data.ff <- merge.ffdf(fli.ff, airln.ff, by = "Airline_id")

Subsetting

# Subset
subset.ffdf(flights.data.ff, CANCELLED == 1,
            select = c(Flight_date, Airline_id, Ori_city, Ori_state,
                       Dest_city, Dest_state, Cancellation))

Descriptive statistics

# Descriptive statistics
mean(flights.data.ff$DISTANCE)
quantile(flights.data.ff$DISTANCE)
range(flights.data.ff$DISTANCE)

Regression with biglm (dataset: the Chronic Kidney Disease dataset from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/index.html)
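The regression code below assumes the CKD data has already been read into an ff data frame called ckd.ff. A minimal loading sketch (the file name and options here are assumptions, not part of the original code):

# Assumed loading step for the CKD data (file name is hypothetical)
library(ff)
ckd.ff <- read.csv.ffdf(file = "ckd.csv", VERBOSE = TRUE, header = TRUE, colClasses = NA)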

# Regression requires installation of the biglm package
library(ffbase)
library(biglm)
model1 <- bigglm.ffdf(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo,
                      data = ckd.ff, family = binomial(link = "logit"),
                      na.action = na.exclude)
model1
summary(model1)
# The model can be refined according to the significance levels obtained for model1

Linear Regression with biglm and bigmemory

# Regression with the bigmemory and biglm packages
library(bigmemory)       # read.big.matrix()
library(biglm)
library(biganalytics)    # bigglm.big.matrix()
ckd.mat <- read.big.matrix("ckd.csv", header = TRUE, sep = ",", type = "double",
                           backingfile = "ckd.bin", descriptorfile = "ckd.desc")
regression <- bigglm.big.matrix(class ~ bgr + hemo + age, data = ckd.mat,
                                fc = c("bgr", "hemo"))
summary(regression)

Diving a little deeper: so far we have only talked about storing data, but when we need to process or analyze it we also need parallel computing. A simple way to picture this is counting how often a particular colour appears in a YouTube video: a mapper splits the input into pieces that are processed in parallel, and the results are then reduced into key-value pairs.
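As a rough illustration of that map/reduce idea in R (not the video example itself; the colour data here is simulated for illustration), base R's parallel package can split the work across cores:

library(parallel)

# Simulated "frames": a long vector of colour labels (an assumption for illustration)
frames <- sample(c("red", "green", "blue"), 1e6, replace = TRUE)
chunks <- split(frames, cut(seq_along(frames), 4))   # split the input into 4 pieces

cl <- makeCluster(4)
# Map step: each worker counts the colours in its own chunk
partial <- parLapply(cl, chunks, function(x)
  table(factor(x, levels = c("red", "green", "blue"))))
stopCluster(cl)

# Reduce step: combine the per-chunk counts into one set of key-value pairs
Reduce(`+`, partial)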

This is where H2O comes in: a fast and scalable platform for parallel and big data processing in R.
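A minimal sketch of getting started with H2O from R (this assumes the h2o package and a Java runtime are installed; the data and columns are only illustrative):

library(h2o)
h2o.init(nthreads = -1)      # start a local H2O cluster using all available cores

iris.hex <- as.h2o(iris)     # push an R data frame into the H2O cluster
model <- h2o.glm(x = c("Sepal.Length", "Sepal.Width"),
                 y = "Petal.Length",
                 training_frame = iris.hex)
summary(model)

h2o.shutdown(prompt = FALSE)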

I hope you find this article helpful for working with big data in R. Thank you for reading.

"Data is the new science, Big data holds the answer" - Pat Gelsinger

Image credit: informationweek.com
