How to Manage Your Research Data?

Empirical research often proceeds through numerous iterations, involving lots of source documents, datasets and other files. Unless you are prepared to do quite a bit of digital housekeeping, burgeoning materials can easily spin out of control, turning research from an intellectual adventure into an administrative nightmare. In this post, I describe a simple system for taking care of raw data and datasets and, finally, for archiving analyses.

My motivation to consolidate a set of personal practices into a system emerged from frustration with increasing digital housekeeping and an aspiration to fully exploit my datasets wherever I work in the future. Yet, the word ‘system’ sounds somewhat bureaucratic and tends to put some people off — I acknowledge that under certain circumstances the system may not be suitable for you. If you know for sure that you are only ever going to do a single research project, ad hoc is probably a better approach for you than any rule-based way of organizing materials.

I wanted a system that is both robust and shamelessly practical. Most importantly, it should be possible to take care of any foreseeable project within the same framework. The system should also be relatively agnostic to philosophical, methodological and ethical questions related to research data. This is not to say that those matters are unimportant, just that it would be nice to have a system that can digest all kinds of empirical investigations. These are admittedly ambitious aims that could easily lead to complex rules and lots of meta-work, which, in turn, would defeat the purpose of my system.

The solution is to approach data management as file management. Whether you work within positivist, realist or interpretivist methodology, your empirical research is going to involve lots of files. The filing system should make it easy to work with multiple collaborators and large datasets consisting of many different kinds of files and observations; materials should be stored so that they make analyses replicable, and, finally, the system should be independent of organizational support and free of technological lock-ins. Overall, digital housekeeping should take as little time as possible away from intellectual work, maximize data reusability and help construct a chain of evidence from the results back to the data.

But isn’t there an app for that…

No. Putting research data under control is not about technology. It is about understanding the workflow of empirical research, and there are good reasons not to surrender the control of your workflow to a piece of software. For instance, I use version control tools such as Git (hosted on GitHub) in my projects, yet they simply do not meet the above criteria. This is because any smart technology adds a layer of complexity that quickly becomes unmanageable if that piece of technology stops working. Even when the technology works, the extra complexity can hit back in a number of ways and lock you into a particular system, not to mention that it can be nearly impossible to make valuable collaborators learn Git.

Complex technologies such as version control systems can be indispensable for a research project, but they are not a substitute for mastering the research workflow. This is not to say that data management does not depend on some technological foundations. Indeed, those foundations must be carefully chosen to be as future-proof as possible.

The workflow of empirical research

The empirical part of research generally proceeds from data collection to analysis, findings, and, eventually, to the dissemination of results. The first step is to obtain observations about the phenomenon of interest, that is, the ‘world out there’. This may entail setting up a laboratory experiment, observing people and things in their natural setting, or running a mathematical simulation. The observations are captured as raw data. The data are ‘raw’ in that they represent the initial input to an empirical analysis, not necessarily in the philosophical sense of objective facts free from conceptual content. The data are then typically processed in one or more steps into research datasets that can be analysed using various methods and tools. Finally, the study should be archived. These four steps are illustrated in Figure 1.

Figure 1. Schematic representation of research workflow

A linear, waterfall-like description of empirical research is, of course, a gross simplification and cannot capture the multiplicity of practical research experience. In reality, research may stumble forward through dead ends, endless iterations and sidetracks until its results suddenly start to crystallize. It is important to understand that Figure 1 is not an attempt to summarize how empirical investigations (should) proceed in practice. Instead, it depicts schematic steps that can help manage materials throughout the investigation by always linking them to preceding operations.

The point of thinking about research as if it were a linear process is that it helps to link together those specific operations that, in the end, produced the results. A sceptical colleague who wants to assess or replicate the process does not want to repeat all the iterations and dead ends that you as the original investigator had to take. Rather, he or she wants to see clear steps connecting the results to the data. The chain of effective research operations can be constructed by linking each research operation backward to its predecessor(s) and finally backtracking the chain from the results to the data. This is best done as you stumble forward in your research. Trying to reconstruct the chain once you have arrived at the results is more laborious and can easily miss important details.

MIDAS system

The solution is minimalist but not puritan — let us call it a MInimal Data Archival System. The idea is to organize all materials into packages that are stored in a simple filesystem structure and linked together using stable unique identifiers.

Rule 1: Organize all materials into packages that are labelled with stable unique identifiers.

MIDAS depends on basic features available in all common filesystems, which makes it free from technological lock-ins. There are merely three more rules that specify how to use packages and identifiers in more detail. The rest can be adapted to your own circumstances.

Package

A package is a fancy name for a directory (folder) that contains files. The difference from an ordinary directory is that a package is labelled with a stable unique identifier.

The contents of individual packages can be organized to suit the type of study. It would be very difficult to design generally applicable rules for the myriad of materials and types of studies different researchers engage with during their careers. However, you must strive to make packages as self-explanatory as possible. This means that each package must provide enough information on how its contents were created and connect to previous research operations using package identifiers. The metadata should be understandable to the sceptical colleague or, at minimum, to yourself years after you completed the research. Once a package has been created, any change to its contents may invalidate an inbound reference from a later package.

Rule 2: Do not change package contents (unless you know that it does not invalidate any incoming references).
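One way to notice accidental violations of Rule 2 is to record a checksum for every file when a package is finished and compare against it later. The following is only a minimal sketch in Python: the manifest file name MANIFEST.sha256 and the function names are illustrative assumptions, not part of MIDAS, and the system works just as well without any such tooling.

```python
import hashlib
from pathlib import Path

MANIFEST_NAME = "MANIFEST.sha256"  # hypothetical file name, not prescribed by MIDAS


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(package: Path) -> dict[str, str]:
    """Map each file's relative path to its digest, skipping the manifest itself."""
    return {
        path.relative_to(package).as_posix(): file_digest(path)
        for path in sorted(package.rglob("*"))
        if path.is_file() and path.name != MANIFEST_NAME
    }


def write_manifest(package: Path) -> None:
    """Record the state of the package once its contents are considered final."""
    text = "".join(f"{digest}  {name}\n" for name, digest in snapshot(package).items())
    (package / MANIFEST_NAME).write_text(text, encoding="utf-8")


def changed_files(package: Path) -> list[str]:
    """List files added, removed or modified since the manifest was written."""
    recorded = {}
    for line in (package / MANIFEST_NAME).read_text(encoding="utf-8").splitlines():
        digest, name = line.split("  ", 1)
        recorded[name] = digest
    current = snapshot(package)
    return sorted(
        name
        for name in recorded.keys() | current.keys()
        if recorded.get(name) != current.get(name)
    )
```

Calling write_manifest once and changed_files whenever in doubt keeps the check independent of any particular tool; the manifest is itself a plain text file that can be read without the script.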

Text files are the best way to store material. Note that ‘text file’ is an umbrella term that covers a broad variety of file formats such as plain text, eXtensible Markup Language (XML), Comma-Separated Values (CSV), HyperText Markup Language (HTML), Rich Text Format (RTF), Structured Query Language (SQL), JavaScript Object Notation (JSON), TeX, and many others. What is common to all these is that if you open them in a text editor, they show up more or less human readable. Some types of data such as images and audio cannot be stored as text files. For such files, it is recommended to use open, standard file formats that are widely supported by many different applications. For instance, PDF (ISO 32000-1:2008) for documents and JPEG (ISO/IEC 10918) for images should be fairly safe options.

Identifier

The purpose of package identifiers is to allow pointing to specific materials without the risk that the link may become invalid later. An identifier is a unique name, meaning that there must not be two different packages with the same directory name. In more technical terms, there must be a one-to-one relationship between identifiers and packages. An identifier must also be stable, so that once a package has been created its identifier will not be changed. You can format your identifiers in any way, or opt for no common format at all, as long as they are unique and stable. If you decide to change your naming practice for new packages, do not change the identifiers of already existing packages.

Rule 3: Do not give the same name to different packages.
Rule 4: Do not change an identifier once it has been created.

I have a habit of naming my packages yyyymmdd-name such as ‘20150707-enwiki-dump’. The date in front (note those leading zeros) makes it easy to ensure that the identifier is unique. Also, it provides useful information about the package and enables sorting packages chronologically in case filesystem timestamps get accidentally updated. Using dashes instead of spaces makes the identifier more compatible with URLs, and they also help the package stand out in prose. Compare “we used enwiki dump to…” vs. “we used 20150707-enwiki-dump to…”. In the former case, it is not clear if enwiki dump is a specific set of files or refers to such dumps in general, whereas 20150707-enwiki-dump points more clearly to a specific set of files.
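The naming habit and Rule 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and lower-casing the name and collapsing other characters into dashes is one possible choice, not something the system requires.

```python
import re
from collections import Counter
from datetime import date
from pathlib import Path
from typing import Iterable


def make_identifier(created: date, name: str) -> str:
    """Build a yyyymmdd-name identifier from a date and a free-form name."""
    # Assumption: lower-case the name and turn runs of other characters into single dashes.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    if not slug:
        raise ValueError("name must contain at least one letter or digit")
    return f"{created:%Y%m%d}-{slug}"


def duplicate_identifiers(directories: Iterable[Path]) -> list[str]:
    """Return package names that occur more than once across the given directories."""
    counts = Counter(
        package.name
        for directory in directories
        for package in directory.iterdir()
        if package.is_dir()
    )
    return sorted(name for name, count in counts.items() if count > 1)


print(make_identifier(date(2015, 7, 7), "enwiki dump"))  # 20150707-enwiki-dump
```

Within a single directory the filesystem already refuses two entries with the same name, so a check like duplicate_identifiers only matters once packages are spread over several directories.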

Filesystem structure

All packages can be stored in the same directory, because they cannot conflict with each other due to their unique identifiers. I have nevertheless found it convenient to put packages into separate directories according to their position in the workflow. You can easily try different approaches and even change your approach as many times as you wish — as long as you stick to the Rules.

Figure 2. Filesystem structure. Note that I have not changed old package names (identifiers) with spaces and underscores despite recently opting for dashes. The rawdata directory contains three Wikipedia database dumps whose names already included the date, so I decided not to prepend it to the package name again.

rawdata contains raw data and datasets from external sources. The rationale is to put here material that I cannot reproduce from its source. For example, in my Wikipedia research I store database dumps that I have downloaded from the Wikimedia Foundation servers in this directory. If I were to do some interviews, I would store the audio recordings here.

datasets contains datasets that have been processed from external sources, from packages in the rawdata directory, and from other packages in the datasets directory. For example, interview transcripts processed from audio recordings in rawdata would be stored here. If I then further code the interview transcripts, the resulting dataset would also be stored here. Package identifiers make it easy to trace data processing steps back to earlier packages, raw data and external observations (note that the step “Processing data into research datasets” in Figure 1 is iterative).
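Tracing a dataset back to its raw data can be sketched as follows, assuming each package contains a small text file that lists the identifiers of the packages it was processed from. The file name SOURCES.txt and the function names are hypothetical; MIDAS itself only asks that the links exist somewhere in the package metadata.

```python
from pathlib import Path

SOURCES_NAME = "SOURCES.txt"  # hypothetical metadata file: one predecessor identifier per line


def index_packages(root: Path) -> dict[str, Path]:
    """Map every package identifier to its directory, whichever subdirectory holds it."""
    return {
        package.name: package
        for stage in root.iterdir() if stage.is_dir()
        for package in stage.iterdir() if package.is_dir()
    }


def predecessors(package: Path) -> list[str]:
    """Read the identifiers this package says it was derived from."""
    sources = package / SOURCES_NAME
    if not sources.exists():
        return []  # nothing recorded: the chain ends here, e.g. at raw data
    lines = sources.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def trace_back(identifier: str, packages: dict[str, Path]) -> list[str]:
    """Return the identifier followed by all its ancestors, nearest first."""
    chain, queue = [], [identifier]
    while queue:
        current = queue.pop(0)
        if current in chain:
            continue  # already visited; also guards against accidental cycles
        chain.append(current)
        if current in packages:
            queue.extend(predecessors(packages[current]))
    return chain
```

Identifiers that do not match any local package, such as a reference to an external source, still appear in the returned chain; they are simply not followed any further.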

anarchive is abbreviated from ‘analysis archive’. My practice is to archive analyses that underpin published research and identify the archival package with the corresponding publication. This way the archive maps directly to claims that I have made in public and may have to defend in the future. The archive package should allow replicating the analysis by storing all relevant materials and by adding necessary metadata. You can copy and link to other packages, write notes about the research process, describe the environment in which research operations took place — whatever helps understand and trace back the steps that were taken to achieve the results.
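As a rough illustration of the metadata such an archive package might carry, the sketch below creates a package directory and writes a plain text note recording the publication, the linked packages and the computing environment. The file name README.txt, the chosen fields and the function name are assumptions made for the example, not a prescribed format.

```python
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def create_archive_package(
    archive_dir: Path, identifier: str, publication: str, linked: list[str]
) -> Path:
    """Create a new archive package with a short README and return the README path."""
    package = archive_dir / identifier
    # mkdir raises FileExistsError if the identifier is already taken (Rule 3).
    package.mkdir(parents=True)
    lines = [
        f"Package: {identifier}",
        f"Publication: {publication}",
        f"Created (UTC): {datetime.now(timezone.utc):%Y-%m-%d %H:%M}",
        f"Python: {sys.version.split()[0]}",
        f"Platform: {platform.platform()}",
        "Linked packages:",
        *[f"  {name}" for name in linked],
    ]
    readme = package / "README.txt"
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

A note like this covers only the mechanical part of the metadata. The description of how the analysis was actually carried out still has to be written by hand.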

Limitations

One might argue that the system is too minimal. MIDAS lacks clear, positive rules on what to do in different situations. Three out of its four rules merely tell you what you are not supposed to do. This is an intentional decision based on the following assumptions. First, minimizing the number of rules makes the system easy to learn and transparent in practice. Second, adding more rules would increase complexity and, consequently, the likelihood that the rules conflict with each other in practice. Third, no number of rules will ultimately help if you do not understand the research workflow and what you are trying to achieve. Fourth, minimal rules leave space for adapting the system to many different circumstances.

Also, the system is not completely free from philosophical ideals. Most obviously, I assume that replicability is worth pursuing in research. This is probably not too controversial and, I argue, a relatively harmless assumption. Perfect replicability is in any case merely an ideal, since it is impossible to store the entire research environment inside a package and thus enable replicability in practice under all circumstances. Nevertheless, an archival package should allow replicability in principle by describing how the analysis could be replicated if the environment were available.

Finally, the system does not say anything about information security and privacy. You need to protect your data processing and storage environment from unauthorized access and malfunctions up to an appropriate standard. However, the appropriate standard depends on so many things that I felt unable to give any guidance on the topic. There are lots of generic information security materials that provide a good starting point for your individual assessment.

There are probably many other limitations, such as the fact that the system is strictly limited to the part of the research process that involves actual empirical data. It says nothing about the design and planning of empirical research operations. Let me stress that the objective of MIDAS is to provide only core principles that hopefully allow an individual researcher to develop an effective data management practice.

Final thoughts

I have found the system workable and transparent in my day-to-day research work, and my collaborators have not complained either. There is no guarantee that the system will work for others or even for me in the long run, but, after all, no system can give such a guarantee. The rules and other guidance are there to help construct your way of managing research data, not to be blindly followed.

I would like to thank Attila Marton from Copenhagen Business School and Niccoló Tempini from University of Exeter for their helpful comments on a draft version of this post.

If you have a better system, please let me know!