Data Transformations in R and Go

Hannah Voelker
Published in grail-eng
Dec 12, 2020

The first step in most data science and machine learning workflows is obtaining and preparing data for analysis. Imperfect data jeopardizes the end result, and in the case of detecting cancer, can have numerous negative consequences. At GRAIL, many data sources are factored into our cancer predictions, including a patient’s clinical information, lab processing metrics, genomic data, and bioinformatics pipeline output. This plethora of inputs, coupled with the volume of data and the scale of the problem we are trying to solve, requires a robust data-engineering pipeline.

The TidyData team at GRAIL was founded to support the bioinformatics scientists by reducing the data preparation workload and ensuring consistent analysis. Rather than having data scientists prepare the data themselves, worry about data permissions, locate all relevant source data, and translate them into one datastore, we developed a data generation pipeline to handle this process, as well as client libraries for our end-users to fetch the data. This pipeline ensures that:

  • The relevant information required for analysis is available
  • A schema is enforced to create a consistent representation across datasets
  • Data is presented with controlled terms, units, or ranges depending on the column type
  • Dataset access is strictly controlled to perform blinding and ensure robust classifier development
  • Relevant filters can be materialized with the dataset upon retrieval

These principles allow us to present data in a consistent, explicit manner for analysis, while also providing flexibility in how the data are represented beyond their original form. One way we do this is through a series of automated data transformations that we run to create clinical variables for the classifier.

When a participant enrolls in a GRAIL-sponsored study, information is collected at the clinical level. This information includes prior history of cancer, activity level, demographics, etc. Patient-provided background health information is captured digitally in forms called electronic case report forms, or eCRFs. While there is controlled input for most of these fields, much of the information collected requires derivation beyond the original input to determine an individual’s particular risk factors for various cancer types, put the participant into a specific analysis population, or create meaningful filters for the dataset. GRAIL’s team of clinical data engineers works closely with the Bioinformatics team, who perform the analysis, to determine which derived variables to implement and exactly how they should be implemented.

These data engineers are typically well versed in R, and with hundreds of derived variables needed per study, they must be able to implement data munging functions quickly using dplyr and the other tidyverse packages. However, with hundreds of dataframe columns to generate for our large-scale clinical studies, R's performance becomes a bottleneck. R is optimized for scripting, quick analysis, and plotting, and is rarely used in large-scale engineering systems because of its slow speed and integration difficulties. Furthermore, when an R script fails, there is often little indication of where or why it failed. Despite these shortcomings, R is the more widely used language for statistical analysis, with plenty of functionality to support native data types (dataframes, named lists, etc.) and third-party packages. Therein lies the challenge of making data transformations scalable: we want to generate tidy data on a daily (or even more frequent) cadence and continuously develop these pipelines as user needs change, while still allowing some of these data derivations to be performed in R.

We make clinical data generation efficient by using a tool within our pipeline that we call transformers. A transformer is a code module in R, Go, or Python. A single transformer takes a named list of dataframes as input and outputs a single dataframe with the derived variables of interest. A transformer's output can depend on the eCRFs, data cleaning information, or the results of other transformers. Primary keys are enforced on both the inputs and outputs of transformers to ensure data quality. Once all transformers have run, we combine their results to create the clinical data output, in which each row reflects a participant along with their related clinical variables. A transformer has two components: the code module, which contains the logic for the variables of interest, and the manifest, which explicitly defines the transformer's inputs and outputs along with its name, description, and the path to the code module. Manifests are generally straightforward to develop; each is a JSON config, which also makes the inputs and outputs easy to read.
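In Go terms, the transformer contract described above can be sketched roughly as follows. The type and function names here are illustrative assumptions, not GRAIL's actual API:

```go
package main

import "fmt"

// DataFrame is a minimal stand-in for a columnar table keyed by
// column name; a real implementation would support typed columns.
type DataFrame struct {
	Columns map[string][]string
}

// Transformer is a sketch of the contract: a named list of input
// dataframes goes in, a single dataframe of derived variables comes out.
type Transformer interface {
	Name() string
	// Inputs lists the named dataframes this transformer depends on
	// (eCRFs, data cleaning info, or other transformers' outputs).
	Inputs() []string
	Run(inputs map[string]DataFrame) (DataFrame, error)
}

// checkPrimaryKey mirrors the primary-key enforcement on transformer
// inputs and outputs: the key column must contain no duplicate values.
func checkPrimaryKey(df DataFrame, key string) error {
	seen := map[string]bool{}
	for _, v := range df.Columns[key] {
		if seen[v] {
			return fmt.Errorf("duplicate primary key %q", v)
		}
		seen[v] = true
	}
	return nil
}
```

Enforcing the key check on both sides of every transformer is what lets the combined output remain one row per participant.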

An example transform manifest for deriving a single variable
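A manifest of this kind might look something like the following sketch; the field names are illustrative assumptions, not GRAIL's actual schema:

```json
{
  "name": "age_at_enrollment",
  "description": "Derive participant age at study enrollment from the demographics eCRF.",
  "code": "r/transformers/age_at_enrollment.R",
  "inputs": ["ecrf_demographics"],
  "output": "age_at_enrollment"
}
```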

Using this transformer runner, we allow statisticians to continue developing in a language that has full functionality for data derivations, while also allowing us to run our pipeline in Go. We use transformers not only because they let us run R functions within our Go runner, but also because of some key features the infrastructure brings. Because the runner executes each data analysis function as its own entity, we can order transformers based on their data dependencies and run these functions concurrently. Go is an excellent language for concurrency, since goroutines are cheap to create and allow for safe communication between routines. Furthermore, since a transformer targets a function that creates a small number of variables, debugging becomes much faster when a failure occurs due to a change in source data or a bug: we can pinpoint the particular code module that failed.

R does introduce some overhead: transformer derivations need to be exported functions within an R package, and every transformer must take a named list of dataframes and return a single dataframe. While this initially meant a large refactor of our clinical data derivation scripts, it created a standard for future derivations. Moreover, because the runner works on a function-by-function basis, we can have derivations in both languages. Thanks to the in-memory datastore, a transformer written in Go can depend on the output of a transformer that ran in R.

We plan to implement more clinical data derivations in Go as we build out our dataframe library. Adding functionality that mimics current R data analysis libraries will make it easier for data scientists to write derivations. Additionally, we are making improvements in speed and memory management as we find new ways to enhance the rest of the data generation pipeline. As we constantly tweak how we derive clinical variables, transformer infrastructure is a concept that we have found effective in allowing R and Go to exist in harmony.
