Best Practices for Unit Testing and Linting with R

Written by Kristjana Popovski and Hannah Kennedy — Members of Microsoft’s Commercial Software Engineering (CSE) team

Oct 9, 2019 · 11 min read
Collaborative coding was the backbone of our project — Photo by Alvaro Reyes on Unsplash

Over the past nine months, we have been working with a team of clinicians and data scientists from Great Ormond Street Hospital (GOSH) on Project Fizzyo, which focuses on researching the efficacy of various treatments for children with Cystic Fibrosis (CF). We focused on two information streams: Fitbit activity data and customized airway clearance technique (ACT) devices. Our goal was to create a maintainable and robust data-processing pipeline to enable clinicians’ ongoing research. Our pipeline was written in R, and we wanted to bring CSE’s commitment to good engineering practices to the project by incorporating unit testing, linting, and continuous integration. In this post we share how we added these practices to our project.

Unit Testing

STOP! … in the name of testing — Photo by Michael Mroczek on Unsplash

The principle behind unit testing is that every small section of code has an expected output or behavior for a given input, whether that output is an error, a warning, or something else, and a unit test passes or fails accordingly. This method is useful for data science projects that may grow quickly and expand in scope as new features are added. Tests should continue to pass when changes to the underlying code are correct, and fail when they are not. Tests should also verify behavior independently of other functions and therefore not rely on another function’s output; we initially encountered hurdles in our testing because we did not have mock data from the start. In a growing codebase, having tests (especially unit tests) makes debugging easier.

We implemented unit testing for our R-based machine learning project to ensure the accuracy and robustness of all parts of the processing pipelines for ACT and Fitbit data. The ACT pipeline consisted of three main components that we ran as logically numbered pipeline steps. The first ACT pipeline step was Data Cleaning, where we dropped NAs, deduplicated the data if it contained any duplicate sessions, and used the BEADS algorithm to remove any drift in the ACT pressure measurements. The second ACT pipeline step was Labelling, where we identified and labeled breaths in the clean ACT data. The last pipeline step was Featurisation, where we extracted ACT breath features such as breath amplitude, breath duration, and break duration. The Fitbit pipeline was composed of two main steps: a Validation step, where NAs were dropped from the raw footsteps and heart rate data, and a Featurisation step to extract features from the clean footsteps and heart rate Fitbit data.

We tested utility functions for the three main data processing pipeline steps: data cleaning, data labelling to identify breaths, and feature extraction. Because the accuracy of feature extraction was essential for getting meaningful clustering results, we wrote our tests in a way that validated isolated components along the data science pipeline. For example, we had a utility script cleaning_utils.R which contained functions to clean the data before featurisation. We made sure that each function had a corresponding unit test in test_cleaning_utils.R. We compared the cleaned results with .csv files of generated expected outcomes and tested edge case scenarios.
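To illustrate that pattern, a test of this shape might look like the sketch below; the function name, mock dataset, and expected-output path are hypothetical stand-ins rather than our actual names:

library(testthat)

# Illustrative only: removePressureDrift(), mockActData, and the expected-output
# path stand in for the real cleaning utility and its fixtures
test_that("cleaning produces the expected pressure values", {
  cleaned <- removePressureDrift(mockActData)
  expected <- read.csv("tests/data/expected_cleaned_act.csv")
  expect_equal(cleaned$pressure, expected$pressure, tolerance = 1e-6)
})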

During our unit testing setup, we faced the challenge of how to validate utility functions in a continuous integration build pipeline in Azure DevOps without using real Fitbit and airway clearance device data, to comply with the EU General Data Protection Regulation (GDPR). We created mock data generators for the two different data streams with completely synthetic data for our unit tests, ensuring that we were able to run our Azure DevOps CI build pipeline for every pull request to master.

We used the testthat package developed by Hadley Wickham to set up our unit tests. This package is widely used for testing R packages, but since we had not written our code as an R package and had a nested directory structure, we had to loosely tailor our setup to use testthat. Writing unit tests for our functions helped us easily debug any errors in the codebase whenever we added new features to the feature extraction scripts, thanks in part to the testthat package’s detailed error messages in case of test failures.

We set up a ./tests directory in the root folder of our project.

.
├── testthat
│   ├── test_labelling_utils.R
│   ├── test_cleaning_utils.R
│   └── test_featurisation_utils.R
├── testthat.R
├── testthat_load_mock_data.R
└── testthat_source_input.R

Within ./tests, we created a main testthat.R script that:

  1. Loads all necessary libraries.
  2. Sources all test files and utility files.
  3. Loads the .csv mock datasets used for the unit tests.
  4. Runs each test script in a for loop and halts execution in case of test failure.

A sample format of the testthat.R script:

# load the necessary libraries
library(testthat)
library(tidyverse)

# source all files, testing utilities
source("tests/testthat_source_input.R")

# load mock datasets
source("tests/testthat_load_mock_data.R")

# source each utility file gathered by testthat_source_input.R
for (file in files) {
  source(file)
}

# mimic "stop_on_failure" here so that the run stops
# if there's a test file with a failure
for (testFile in testFiles) {
  testResult <- test_file(testFile, reporter = "Summary")
  testResult <- as.data.frame(testResult)
  if (any(testResult$failed > 0)) {
    stop(paste("Failure on", testFile, sep = " "))
  }
}

Within ./tests, we also created a script testthat_source_input.R that loads all the utility files we want to test from their respective directories and the associated unit tests found in the ./tests/testthat directory.

The testthat_source_input.R script simply gathers the test files and the source files to be tested.
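The original snippet is not reproduced here, but a minimal sketch of what such a script might look like follows; the src/ directory layout is an assumption, while the test directory matches the layout above:

# Gather the utility (source) files under test; the src/ directory here is an assumption
files <- list.files("src", pattern = "_utils\\.R$", recursive = TRUE, full.names = TRUE)

# Gather the unit test scripts from ./tests/testthat
testFiles <- list.files("tests/testthat", pattern = "^test_.*\\.R$", full.names = TRUE)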

Within a test we load mock data (see the next section for details) and then use the library’s various functions to test equality, truthiness, etc. Check out the testthat package to see all the functions you can use!
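For instance, a single test typically has this shape; deduplicateSessions() and mockActData below are hypothetical stand-ins for one of our utilities and its mock input:

test_that("duplicate sessions are removed", {
  duplicatedInput <- rbind(mockActData, mockActData)
  deduped <- deduplicateSessions(duplicatedInput)        # hypothetical utility under test
  expect_equal(nrow(deduped), nrow(mockActData))         # equality
  expect_false(any(duplicated(deduped)))                 # truthiness
  expect_error(deduplicateSessions("not a data frame"))  # expected error behaviour
})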

Uniform styling makes for elegant code — Photo by Edgar Chaparro on Unsplash

Linting

Linting is a means of stylistically standardizing code, which typically makes it easier to read. Having a linter provides polish and uniformity, which is essential for projects with many contributors. Some languages provide their preferred styling; for example, Python has PEP 8, and C# has its own conventions. There is no “official” style guide for R, but there are conventions that most people follow, along with style guides provided by the makers of the tidyverse and by Google.

Having set conventions for code style ensures that a project is written predictably. We primarily based our styling choices on Google’s guide; however, the two guides have informed one another’s conventions over the course of their development. Our rules fell into a few categories:

  • Spacing.
    - Having spaces after commas
    - Having spaces around infix operators
    - Using spaces instead of tabs
    - No spaces immediately inside an opening or closing parenthesis
    - No trailing whitespace at the end of lines *AND* at the end of files
  • Style.
    - Left arrow (<-) assignments
    - Having a maximum line length of 100 characters
    - Using camelCase for variable names and CamelCase for function names
    - Enforcing double quotation mark use
    - No open curly braces on their own line
    - Closing curly braces always on their own line
  • Coding Conventions.
    - Only relative paths
    - Having no commented-out code
    - Explicitly typing out TRUE and FALSE
    - Each step of a piping operation on its own line unless the entire operation fits on one line

The code snippet below includes only a few of the linters as an example:

library("lintr")# We feed the R script a file to lint
args <- commandArgs(trailingOnly = TRUE)
# The linters to be used
linterList <- list(useRelPaths = lintr::absolute_path_linter,
useArrowAssignment = lintr::assignment_linter,
closedCurly = lintr::closed_curly_linter,
spaceCommas = lintr::commas_linter,
noCommentedCode = lintr::commented_code_linter,
infixSpaces = lintr::infix_spaces_linter,
lineLength = lintr::line_length_linter(100),
spacesOnly = lintr::no_tab_linter,
...
)
# Method to run the linters against a file, prints any caught code
# Then returns the number of instances of caught code
runLinterOnFile <- function(file, lintList = linterList) {
result <- lintr::lint(file,
linters = lintList,
exclude_start = "# Exclude Start",
exclude_end = "# Exclude End")
print(result)
return(length(result))
}
# Apply the linting to provided file
lintingOutput <- runLinterOnFile(args[1])

We supported these rules by using the R package lintr. Using this library was appropriate since so much of the R we were writing was built on the tidyverse, which is helmed by Hadley Wickham. Jim Hester — the primary maintainer of lintr — works right alongside Hadley at RStudio. Because we already relied so heavily on the tidyverse, using the linter from the same ecosystem was the right choice.

We came up with the list of linting rules above and then used lintr’s lint(file, linters = list(…)) function to lint files. We opted to lint only R files that were flagged when running a git diff against our master branch (and had not been deleted).

git diff origin/master --staged --name-only --diff-filter=d | grep -i '.*\.R$'
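To sketch how these pieces could fit together (this is an illustration, not our exact wiring), one could drive runLinterOnFile() over the output of that git diff and fail the build when anything is flagged:

# Illustrative wiring: collect the changed R files and lint each one
changedFiles <- system(
  "git diff origin/master --staged --name-only --diff-filter=d | grep -i '.*\\.R$'",
  intern = TRUE
)
lintCounts <- vapply(changedFiles, runLinterOnFile, integer(1))
if (sum(lintCounts) > 0) {
  stop(paste("Found", sum(lintCounts), "linting issue(s)"))
}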

For example, if we put the following code snippet into a file called tests.R, the linter would flag a number of linting errors.

THE_VARIABLE=4-( 10*3 )

We invoke the linter with the file tests.R and see the results in the command line:

The results of running `Rscript linter.R tests.R`

Mock Data

Time to play with random chance — Photo by Jonathan Petersson on Unsplash

There’s the common practice in unit testing to use mock data. In some cases, one can come up with fake data that looks and behaves enough like the real thing that it suits testing purposes. For our project, the signals we were attempting to recreate needed to look and behave like signals received from an airway clearance device or from the Fitbit device. The analysis of these signals needed to be appropriately similar, not pseudo-random. In cases like breath signals, the signal needed to look like a breath and heart rates to look like an actual heart rate graph for a person and not arbitrary signals.

Additionally, we held patient privacy requirements as one of the top priorities in our project, so personally identifiable information (PII) could not be exposed. Any mock data we extracted needed to not be traceable back to patients. We used a digital research environment as the first layer of such protection, in that any manipulation or use of patient data remained on the host and not locally on any computer.

An example of the mock heart rate data
An example of one type of mock ACT device data
For comparison, a pseudo-random signal… or literal white noise — courtesy of Omegatron [CC BY-SA 3.0], via Wikimedia Commons

The signals in our data set fell into two categories: predictable, partially-aggregated values from a Fitbit device (heart rate, footsteps), and less-predictable raw values from an airway clearance device. We followed a similar pattern for each. Using isolated subsections of breathing data and Fitbit data taken from real signals of real patients, we could apply four layers of obfuscation to create new mock data (a sketch in R follows the list):

  1. From the source set of data, randomly select some of the data points (using R’s built-in sample()). The sample rate could be tweaked.
    a.) For breaths, we would need previously isolated breaths from a recording.
    b.) For heart rate, we would sample recorded values from the specific hour in a 24-hour day.
  2. Within the subset of data, shuffle the order around where feasible.
    a.) For breaths, we would shuffle the order of the breaths.
    b.) For heart rate, we would shuffle the patient from which we selected the measurements, but within the same time slot, i.e. pick a measurement from the 15:00 slot from one of x patients, pick another from the 16:00 slot, and so on.
  3. To make the mock data harder to identify — a priority for us as we dealt with real patient data — we would modulate the values to be very close to, but not quite the same as, the base values.
    a.) For breaths, we used R’s jitter(), since the values we recorded had some decimal precision.
    b.) For heart rate, we would adjust the values by +/- 3, keeping activity and heart rate similar enough that in most cases the recorded intensity level would not change drastically.
  4. The schemas for our data had an associated patient ID and session (recording) ID. We would create a new patient ID and a new session ID using the uuid library.
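To make the recipe concrete, here is a minimal sketch of how those layers could be combined for the ACT stream, assuming breaths is a list of data frames (one isolated breath each) with a numeric pressure column; the column names, sample rate, and function name are illustrative rather than our exact implementation:

library(uuid)

# Sketch of the four obfuscation layers applied to one mock ACT session
makeMockActSession <- function(breaths, sampleRate = 0.5) {
  # 1. + 2. Randomly select a subset of the isolated breaths; sample() also shuffles their order
  selected <- sample(breaths, size = ceiling(sampleRate * length(breaths)))
  # 3. Modulate the pressure values so they are close to, but not identical to, the originals
  selected <- lapply(selected, function(breath) {
    breath$pressure <- jitter(breath$pressure)
    breath
  })
  # 4. Attach freshly generated patient and session identifiers
  mockSession <- do.call(rbind, selected)
  mockSession$patientId <- UUIDgenerate()
  mockSession$sessionId <- UUIDgenerate()
  mockSession
}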

By using all of these techniques, we had data we could safely extract from our environment and use in unit testing, knowing that it would behave similarly to real data when put through our analysis pipeline.

Continuous Integration

The stream of information — Photo by Emre Karataş on Unsplash

We implemented a Continuous Integration pipeline in Azure DevOps to automatically run the linting and unit tests for every pull request to the master branch. To be merged into master, a branch must pass the CI pipeline build. We set up our own infrastructure instead of using the Microsoft-hosted agents in Azure DevOps, since R and its packages are not preinstalled on those agents and there is currently no caching available. Installing these packages on a Microsoft-hosted agent would require installing r-base and all project-related dependencies in every run, which takes a substantial amount of time, and the run time of the build would keep increasing as the scope and number of libraries expanded.

For Azure Pipelines, you can use a YAML file to configure your continuous integration settings (see the Azure Pipelines documentation for details). We set up our pipeline trigger for the master branch, so that the build runs automatically for every pull request to master (see the code snippet below). The following snippet is from our azure-pipelines.yml file:

trigger:
  branches:
    include:
    - master

variables:
- group: docker-repo-settings

In the next code snippet, we set up a job to update our container if any packages or devops files change:

jobs:
- job: UpdateContainerImage
  pool:
    vmImage: 'ubuntu-16.04'
  displayName: Update the container if any packages or devops files changed

  steps:
  - script: |
      # Check if devops files changed
      devops_files=(install_packages.R package_list.txt build/Dockerfile build/azure-pipelines.yml)
      changed=0
      for file in $(git diff HEAD HEAD~ --name-only); do
        if [[ " ${devops_files[@]} " =~ " ${file} " ]]; then
          changed=$((changed+1))
          echo "$file $changed"
        fi
      done
      echo "##vso[task.setvariable variable=devops_files_changed]$((changed))"
    displayName: Set devops_files_changed to the number of devops files changed

  - script: |
      docker build -t $(docker_repo_user)/$(docker_repo_name):$(Build.BuildNumber) -t $(docker_repo_user)/$(docker_repo_name) -f build/Dockerfile .
      docker image ls
      docker login -u $(docker_repo_user) -p $(docker_repo_pwd)
      docker push $(docker_repo_user)/$(docker_repo_name)
    displayName: Build and push the container of the R environment
    condition: gt(variables['devops_files_changed'], 0)

To set up our own infrastructure for the CI pipeline, we created a Docker container of the R environment with the project package dependencies (lintr, testthat, covr, etc…). This was built on top of Rocker, the pre-made R image for Docker. In the same ./build directory as our Dockerfile, we also created the azure-pipelines.yml file, where we defined the Continuous Integration pipeline setup. Using a container decreased our pipeline run time from more than twenty minutes to less than two minutes!
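The install_packages.R and package_list.txt files referenced in the pipeline snippet above handled the dependency installation inside that image. Here is a minimal sketch of what such a script could look like; the one-package-per-line file format is an assumption:

# Illustrative sketch: install every package named in package_list.txt (one name per line),
# skipping any that are already present in the image
packages <- readLines("package_list.txt")
missing <- setdiff(packages, rownames(installed.packages()))
if (length(missing) > 0) {
  install.packages(missing, repos = "https://cloud.r-project.org")
}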

Conclusion

This setup is how we found the most success for our R data science project in a DevOps context. We hope that this can be an inspiration and reference for others when attempting something similar.

See below for how to get in touch if you have any questions!
