Flexible reproducibility in data workflows
“Like families, tidy datasets are all alike, but every messy dataset is messy in its own way.” — Hadley Wickham, 2014
The first, largest, and most unsung step of any analytical project is usually getting data into a format that can be effectively analyzed. Sometimes, if the data doesn’t change, or changes only slowly, tidying is a one-time investment that produces a static dataset that can be thoroughly analyzed, or even shared. Other times, raw data changes by the hour or minute, requiring automated systems for extraction, transformation, and loading into a data warehouse for further analysis. But what of the case in the middle?
In the legal world, data is quite commonly updated by humans on a semi-regular basis. Legal motions often require the parties to report on class members on a monthly or quarterly cadence, which can involve many manual and mutable steps. How can we build reproducible analysis pipelines that are flexible enough to evolve and adapt over time?
This is the first of a mini-series on data workflows. The following posts will go deeper into packages that can help us implement these principles in R.
1. Stay as general as possible.
When solving data hygiene problems, hyper-specific solutions can be a siren call. They’re often the quickest and simplest option when you first encounter a problem, but they may be band-aid fixes that are no longer useful when data is updated. Consider a spreadsheet in which a single date value is entered as “1/1/2019” while all others are entered in the format “2019-01-01”. You could do any of the following to handle this problem:
1. Open the spreadsheet in Excel, change the offending value, and re-save it.
2. Find the index of the offending value in your data frame, and hard-code that value to “2019-01-01” in your analysis pipeline, without touching the raw data.
3. Encode an if statement in your analysis pipeline that changes all instances of “1/1/2019” to “2019-01-01”.
4. Use RegEx and other tools to detect the structure of the values in the date column, and format each based on its structure rather than based on its literal value.
As we move down the list, the solutions become more challenging to implement, but more robust over time. Solution #1 — altering the raw data — should be avoided at all costs, but what is wrong with solutions #2 and #3? If your data will never change, either option works as a quick fix that is readable and interpretable in the code (with #3 remaining robust to sorting or filtering decisions further up in your analysis). But, if there’s any chance you’ll need to update this analysis, you’ll want to stick with #4. It may be impossible to anticipate all problems and changes in future datasets, but by writing code that addresses multiple use cases, you’ll likely have to do less pruning in the future.
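As a minimal sketch of solution #4 in base R, the hypothetical helper `parse_mixed_dates()` below (our name, not part of any package) checks each value against a pattern and parses it with the matching format. It assumes slash-separated dates are month-first, and it leaves anything it doesn't recognize as `NA` rather than guessing.

```r
# Hypothetical helper: parse a character vector of dates by detecting each
# value's structure instead of matching literal values.
parse_mixed_dates <- function(x) {
  x <- trimws(x)
  out <- rep(as.Date(NA), length(x))

  # Detect the layout of each value.
  is_iso   <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)    # yyyy-mm-dd
  is_us_4y <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$", x) # (m)m/(d)d/yyyy
  is_us_2y <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}$", x) # (m)m/(d)d/yy

  # Parse each group with the format that matches its layout.
  # Assumption: slash-separated dates are month-first (US convention).
  out[is_iso]   <- as.Date(x[is_iso],   format = "%Y-%m-%d")
  out[is_us_4y] <- as.Date(x[is_us_4y], format = "%m/%d/%Y")
  out[is_us_2y] <- as.Date(x[is_us_2y], format = "%m/%d/%y")

  out  # unrecognized layouts stay NA
}

# All three values below parse to 2019-01-01.
parse_mixed_dates(c("1/1/2019", "2019-01-01", "1/1/19"))
```

Because the function keys on structure, a second or third slash-formatted date in next month's file would be handled without any change to the code.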
2. Encode your assumptions so your code fails loudly.
All computer code contains assumptions. In the example above, even the best solution assumes that dates come in one of two forms: (m)m/(d)d/(yy)yy or yyyy-mm-dd. Month-first ordering is the norm in the United States, but many other countries write the day first. If we saw a date like 31/1/19, our assumption would not hold, and we would need new strategies for dealing with dates where the month and day are ambiguous.
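One way to make that assumption explicit is a check that stops the pipeline when a value breaks it. The sketch below is base R only, and `check_date_formats()` is a hypothetical name. It rejects values that match neither layout, and it also rejects slash-separated dates whose first field is greater than 12, since those cannot be month-first.

```r
# Hypothetical check: stop with an informative error if any non-missing value
# violates the date layouts the pipeline assumes.
check_date_formats <- function(x) {
  known <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x) |
    grepl("^[0-9]{1,2}/[0-9]{1,2}/([0-9]{2}|[0-9]{4})$", x)
  bad <- !is.na(x) & !known
  if (any(bad)) {
    stop("Unrecognized date format in ", sum(bad), " value(s), e.g. '",
         x[which(bad)[1]], "'", call. = FALSE)
  }

  # Month-first assumption: a first field above 12 suggests day-first data.
  slash <- !is.na(x) & grepl("/", x, fixed = TRUE)
  first_field <- as.integer(sub("/.*$", "", x[slash]))
  if (any(first_field > 12)) {
    stop("Slash-separated date with first field > 12, e.g. '",
         x[slash][which(first_field > 12)[1]],
         "'; the month-first assumption may not hold", call. = FALSE)
  }

  invisible(x)  # return the input unchanged so the check can sit in a pipeline
}

check_date_formats(c("1/1/2019", "2019-01-01"))  # passes silently
# check_date_formats("31/1/19")                  # would stop with an error
```

Note the limit of this check: a value like 5/6/19 passes either way, so the check catches some violations of the month-first assumption but cannot prove the assumption is right.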
Sometimes, when new data fails to meet our assumptions, the results of the analysis will be obviously wrong or impossible. For example, in our work on Ms. L vs. ICE, date data can come in many different formats. If we assume everything has a mm-dd-yyyy format, our analysis pipeline may inadvertently calculate that some children have a negative age. Impossible values are an obvious flag, but just as often, results won’t fall outside a plausible range, even if new data doesn’t fit the model of what you had before. Just because a result is plausible doesn’t mean it’s correct.
In R, a statistical programming language used by much of the ACLU’s Analytics team, the `assertthat` and `assertr` packages can be used to verify that our filtered dataset has the number of rows we expect, or to assert that all variables with “date” in their names are in fact of the Date class. When your assumptions fail, you’ll have to go back into the raw data to understand what went wrong. This final tip can be your guide, and may bring you back to where you started.
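To show the shape of these two checks without extra dependencies, here is a sketch in base R. The function `check_filtered()` and the example data frame are hypothetical, and `assertthat` and `assertr` provide their own assertion functions that would replace the hand-written `stop()` calls.

```r
# Hypothetical check: confirm the row count and the class of every column
# whose name contains "date", stopping loudly if either assumption fails.
check_filtered <- function(df, expected_rows) {
  if (nrow(df) != expected_rows) {
    stop("Expected ", expected_rows, " rows after filtering, found ",
         nrow(df), call. = FALSE)
  }

  date_cols <- grep("date", names(df), ignore.case = TRUE, value = TRUE)
  is_date   <- vapply(df[date_cols], inherits, logical(1), what = "Date")
  if (any(!is_date)) {
    stop("Column(s) not of class Date: ",
         paste(date_cols[!is_date], collapse = ", "), call. = FALSE)
  }

  invisible(df)
}

# Illustrative data only: one date column was read in as text.
records <- data.frame(
  id           = 1:3,
  entry_date   = as.Date(c("2019-01-01", "2019-02-01", "2019-03-01")),
  release_date = c("2019-01-05", "2/9/19", NA),
  stringsAsFactors = FALSE
)

# check_filtered(records, expected_rows = 3)
# would stop: release_date is character, not Date
```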
3. Look at the changes in your data.
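One simple way to look at changes between two deliveries of the same dataset is to compare them on a shared identifier. The base R sketch below assumes both versions have a unique key column (here a hypothetical `id`). The function name and example data are illustrative only.

```r
# Hypothetical helper: report which keys were added, removed, or changed
# between an old and a new version of a dataset. Assumes `key` is unique.
summarize_changes <- function(old, new, key = "id") {
  added   <- setdiff(new[[key]], old[[key]])
  removed <- setdiff(old[[key]], new[[key]])

  # Compare the columns both versions share, for rows present in both.
  shared_cols <- setdiff(intersect(names(old), names(new)), key)
  both <- merge(old, new, by = key, suffixes = c(".old", ".new"))

  changed <- rep(FALSE, nrow(both))
  for (col in shared_cols) {
    a <- as.character(both[[paste0(col, ".old")]])
    b <- as.character(both[[paste0(col, ".new")]])
    # A value differs if exactly one side is missing or both exist and differ.
    differs <- (is.na(a) != is.na(b)) | (!is.na(a) & !is.na(b) & a != b)
    changed <- changed | differs
  }

  list(added = added, removed = removed, changed = both[[key]][changed])
}

old <- data.frame(id = 1:3, status = c("open", "open", "closed"),
                  stringsAsFactors = FALSE)
new <- data.frame(id = 2:4, status = c("closed", "closed", "open"),
                  stringsAsFactors = FALSE)

summarize_changes(old, new)
# added: 4, removed: 1, changed: 2
```

A summary like this gives you a short list of records to inspect by eye before rerunning the rest of the analysis.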
Our team has seen some truly unique approaches to data management from various government agencies, but guard rails like generalized scripts, encoded assumptions, and visual inspections of data changes can make it easier to reliably update analyses. Stay tuned for part two of this series, where we’ll take a deeper dive into assertions and data validation.