Pleasingly Parallel: Accelerate Your Workflow with DSX

A Case Study

How we reduced modeling time by 80% with Data Science Experience

We used IBM Data Science Experience (DSX) to substantially improve an R modeling workflow with R Notebooks, Spark, and parallel computing. In this case study, I’ll walk through the tools we chose and the decisions we made along the way. Importantly, the client was strong in SQL and R, but less experienced with Spark and other approaches to building models in parallel.

The client

  • A multinational retailer with many hundreds of stores and millions of monthly customers.

The team and their role

  • An analytics team builds a customer churn model for each of several hundred customer segments. These models do not depend on one another — they do not need to communicate any intermediate results. This task is pleasingly parallel.
  • Each model answers a question like: will this particular credit card holder make a purchase at one of our stores in month x? And which factors are most important?

The issues

  • Modeling execution time. The full R process takes over 90 minutes to complete, which makes it difficult to experiment and build prototypes.
  • Access control and authorization. The client wants to give certain people access to some of the data while guarding direct access to the database.


Comparing approaches

The diagram above provides a high-level comparison of the original and the new approaches to the client’s workflow.

Original Approach

The client’s analytics team builds one churn model for each of several hundred customer segments. Contained in one long R script, their original workflow was straightforward, logical, and slow.

In pseudocode, the logic looked something like this:

for (i in 1:800) {
  # query the database, selecting only the rows for the current segment
  data <- dbGetQuery(con, paste0("select * from tbl where segment = ", segments[i]))
  testData <- subsetOf(data)
  # aggregate and preprocess the data
  processedData <- preprocess(data)
  processedTestData <- preprocess(testData)
  # build a model
  myModel <- model(processedData)
  # predict on the test data and append the results to a local file
  results <- predict(myModel, processedTestData)
  write.table(results, localFile, append = TRUE)
}

For each customer segment, the original approach would query the database, perform a standard set of transformations, build a model, and write the results to a text file. The whole process takes over 90 minutes to run.

We modified each of these steps in order to dramatically reduce execution time and to provide access control along the way.

Modified Approach

Let’s examine the changes.

Data Ingestion

We first tackled data ingestion and the standard transformations. We replaced the database query inside each loop iteration with a single call to read.jdbc() from SparkR in a DSX notebook. This creates a Spark DataFrame representing the database table, accessed via a JDBC URL.

  • This is advantageous because we can leverage the power of Spark to perform the aggregations and transformations on all of the data at once, with one query, rather than on a subset of the data in each iteration. Using a combination of Spark SQL, Spark DataFrame operations, and user-defined functions, we can preprocess all of the data needed for modeling much more quickly (see the sketch after this list).
  • Another advantage is that we find out up front whether any transformation fails on a particular subset of the data, rather than partway through a long loop. This is very helpful for debugging and for making sure that everything runs smoothly.
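
For concreteness, here’s a minimal sketch of what that ingestion step can look like in a SparkR notebook. The JDBC URL, table name, and column names are illustrative placeholders, and the aggregation query simply stands in for the kind of preprocessing described above:

library(SparkR)
sparkR.session()
# one JDBC read replaces the hundreds of per-segment queries
jdbcUrl <- "jdbc:db2://<host>:<port>/<database>"
data <- read.jdbc(jdbcUrl, "tbl", user = "<user>", password = "<password>")
# register the DataFrame so all of the preprocessing can happen at once with Spark SQL
createOrReplaceTempView(data, "tbl")
prepData <- sql("
  SELECT segment,
         cust_id,
         SUM(spend) AS total_spend,
         COUNT(*)   AS n_transactions
  FROM tbl
  GROUP BY segment, cust_id
")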

Save and Access Control

Next, we write the aggregated and transformed data to a text file. This step is important because it serves both goals: speed and access control.

  • We write the data as a CSV in a DSX Project because we can persist the cleaned data, collaborate, and control permissions. This means that analysts can load the data into a BI tool, and other data scientists can audit the data and collaborate, all without needing database credentials (a short sketch follows this list).
  • On DSX, you can connect a database to a Project directly and control permissions that way as well, if you so choose.
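
If you take the CSV route, both the write and any later reads are one-liners with SparkR’s write.df() and read.df(). The path below is illustrative; on DSX it would typically point at the storage associated with the Project:

# persist the prepared data so collaborators can use it without database credentials
write.df(prepData, path = "prepared_segments.csv", source = "csv",
         mode = "overwrite", header = "true")
# anyone with access to the Project data can reload it later, no JDBC required
prepData <- read.df("prepared_segments.csv", source = "csv",
                    header = "true", inferSchema = "true")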

Parallel Processing

Now that we’ve prepared the data for modeling, it’s time to take advantage of the parallel computing capabilities we’ve enabled on DSX.

  • We converted the for-loop into a function. This way, we could run this function over a list of elements, distributing the computations with Spark.
  • The parallel R package also allowed us to take advantage of the cores available in our environment and get results quickly.
  • Each worker evaluates the function on a different subset of the data. The most important point is that these computations do not depend on one another, so we can parallelize them easily. After its function call completes, each worker sends its results back to the master node. All of the model results are small enough to fit in the master node’s large memory.

After revising our approach, we went with something similar to this:

# read the entire table into a Spark DataFrame
data <- read.jdbc(jdbcUrl, "tbl")
# preprocess the data all at once with a combination of Spark SQL,
# user-defined functions, and Spark transformers: converting datatypes,
# datetime transformations, aggregating columns, imputing values, text processing, etc.
prepData <- sparkPreprocess(data)
# write the data as a csv
# the aggregated data is OK to share in DSX projects and is ready to use for modeling
write.df(prepData, path = "prepared_segments.csv", source = "csv", header = "true")
# convert the for loop from the previous approach into a function,
# dropping the aggregation and preprocessing steps already done with Spark
# with a function, we can use `spark.lapply` (from `SparkR`) or `parLapply`
# (from `library(parallel)`) to distribute the function across a cluster
paraFunc <- function(segmentId) {
  # build the model for a single segment and return its results
}
# distribute the function across the cluster
# the results are small enough to fit in the memory of the large master node on DSX
results <- spark.lapply(1:800, paraFunc)
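
To make that last call concrete, here is a rough sketch of what the per-segment function and the two distribution options might look like. The segment column, the glm() formula, and the collect() step are illustrative assumptions, not the client’s actual model:

library(SparkR)
library(parallel)
# assume the aggregated data is small enough to collect into a plain R data.frame
allPrepData <- collect(prepData)
# hypothetical per-segment modeling function: fit a model on one segment's data
# and return a small result that is cheap to ship back to the master
paraFunc <- function(segmentId) {
  segData <- subset(allPrepData, segment == segmentId)
  fit <- glm(churned ~ total_spend + n_transactions,
             data = segData, family = binomial)
  list(segment = segmentId, coefs = coef(fit))
}
# option 1: distribute with Spark (requires an active sparkR.session();
# the function and the data it references are serialized to the workers)
results <- spark.lapply(1:800, paraFunc)
# option 2: distribute across the cores of a single node with library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, "allPrepData")
results <- parLapply(cl, 1:800, paraFunc)
stopCluster(cl)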

We’ve successfully executed very similar logic in a fast, distributed way. The original approach took over 90 minutes, while the modified approach runs in under 15 minutes.

Review the issues

  • Execution time: With the revised method, we reduced execution time by over 80%, from more than 90 minutes to under 15.
  • Access control: We govern data access through DSX Projects, and we share only the preprocessed (anonymized, aggregated) data that we use for modeling.

Rethink your workflow with Data Science Experience to improve performance, gain control over collaboration and permissions across the modeling pipeline, and develop better solutions for your team and for yourself.


Originally published at datascience.ibm.com.