The pursuit of reliability in research: can code testing make a difference?

Julien Rechenmann
Published in Nerd For Tech
9 min read · Sep 1, 2022


Ensuring the reliability of findings has long been one of the highest priorities among researchers. It is the foundation that allows us to build our knowledge of the world brick by brick. Our trust in this system suffers terrible consequences when the conditions of reliability are not met. For reasons ranging from data manipulation to honest mistakes, billions of dollars and years of work have been wasted because of unreliable results.

“The immediate, obvious damage is wasted NIH funding and wasted thinking in the field because people are using these results as a starting point for their own experiments.”

Thomas Südhof | Stanford University

The solution thus far has been to rely on the reproducibility of the results by carrying out small iterations. To validate new hypotheses, researchers first need to confirm the results their peers published. However, the infamous “publish or perish” trend in research is preventing the practice of this healthy methodology. Researchers are compelled to hastily present new results, failing which they might lose funding. This race to get published also implies that little is being done to ensure the quality of the results.

To add to this bleak portrait of the research world, research has significantly increased in complexity: researchers have more knowledge and skills to acquire and minimal time to learn and improve. As mentioned in my previous article, more tools have been built for data collection and experiments have become more complex, but little has been done to facilitate the production of reliable results. In the absence of oversight due to the current laboratory structure, researchers accumulate small mistakes in the code they write to collect and analyze data. Their code is not programmatically tested, peer-reviewed, or documented.

Therefore, the data, central to research, become less and less dependable.

While other sectors (product management, human resources, software engineering, etc.) have developed rules and strived to optimize their working techniques and processes, public research hasn’t picked up the pace. Although research has existed for more than half a century in its contemporary form, its lack of consensual methodologies is evident. Few laboratories express interest in discussing human resources management, data acquisition, analysis, or project management. Therefore, there is no consensus.

Which methodology can we develop to improve the reliability of our results? Can we borrow ideas from other fields? Let’s examine how software engineering offers reliability and robustness.

Testing code

Manual testing is what every software developer learns to do. It means running the piece of code you just wrote and checking through the debugger, logs, or output whether the code meets the required functionality. For example, checking whether the button you just added to the graphical user interface is initiating the experiment or that the data are saved at the right place and in the right format. As the code evolves, developers must manually ensure that their new pieces of code don’t disrupt previously developed functionalities.

Instead of manually re-testing every bit of code, which would be a huge waste of time, developers decided to test their code programmatically. They write code that tests the functionalities of the previously written code. When they add new code, they can thus verify that it doesn’t break previously developed functionalities. In this approach, the functional code is written first and the test code second. As the codebase increases in complexity and size, testing becomes rarer and rarer because developers are required to switch quickly to the next functionality. Consequently, it becomes more and more difficult to write tests (I highly recommend Working Effectively With Legacy Code by Michael Feathers).
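To make the idea concrete, here is a minimal sketch of a programmatic test for the "data are saved at the right place and in the right format" example above. The function names (`save_trials`, `load_trials`) and the CSV layout are hypothetical and chosen only for illustration; they are not taken from any real acquisition software.

```python
import csv
import tempfile
from pathlib import Path


def save_trials(trials, path):
    """Write (trial_id, reaction_time_ms) rows to a CSV file with a header."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["trial_id", "reaction_time_ms"])
        writer.writerows(trials)


def load_trials(path):
    """Read the rows back, converting each field to its numeric type."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return [(int(r["trial_id"]), float(r["reaction_time_ms"])) for r in reader]


def test_saved_data_round_trips():
    """The file must exist at the requested path and hold the same values."""
    trials = [(1, 312.5), (2, 298.0), (3, 401.25)]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "session.csv"
        save_trials(trials, path)
        assert path.exists()
        assert load_trials(path) == trials


test_saved_data_round_trips()
```

Once such a test exists, it can be re-run after every change to the saving code instead of opening the output file by hand each time.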

Test-driven development (TDD) is a software development methodology that includes three main steps: writing the test first, writing enough code to make the test pass, and cleaning the code. The developer should repeat these three small steps to obtain consistent and reliable results.
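The three steps can be sketched on a toy problem. The example below assumes a hypothetical requirement, "count how many times a signal rises above a threshold", and a hypothetical function name `count_crossings`; it only illustrates the order of work, not a recommended signal-processing method.

```python
# Step 1: write the tests first. At this point count_crossings does not
# exist yet, so running the tests would fail.
def test_counts_upward_threshold_crossings():
    signal = [0.0, 0.2, 1.5, 1.7, 0.1, 2.0]
    assert count_crossings(signal, threshold=1.0) == 2


def test_empty_signal_has_no_crossings():
    assert count_crossings([], threshold=1.0) == 0


# Step 2: write just enough code to make the tests pass.
# Step 3: clean the code (here, a single readable expression) while the
# tests keep passing.
def count_crossings(signal, threshold):
    """Count samples where the signal rises from below to at-or-above threshold."""
    return sum(
        1
        for previous, current in zip(signal, signal[1:])
        if previous < threshold <= current
    )


test_counts_upward_threshold_crossings()
test_empty_signal_has_no_crossings()
```

Each new requirement (for example, handling missing samples) would start again at step 1 with a new failing test.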

While this technique seems like a magical tool for all the research problems in the world, it relies on strong implicit assumptions to achieve efficiency.

The first assumption is that the project has detailed requirements. The development team receives functionality requirements defined by clients or product managers. There will be modifications over time, but the developers are given a clear direction to take.

The second assumption is that the code will be run far more often than it is written or edited. This sounds obvious to all developers because that is the main reason for writing code in companies.

A third assumption is that the written code will have a long lifespan. TDD was developed keeping sustainability in mind. The code will be updated and upgraded over time by several generations of developers.

Having reliable, tested, and robust code is a target that every researcher should strive toward. Research thrives on reproducible experiments, and obtaining corrupted findings because of a silly bug is not a situation any researcher would want to be in. So, should you force yourself to follow such a rigid methodology and spend twice the amount of time developing the same amount of functional code?

The short answer is yes, but the implementation might depend on what you are trying to achieve.

Applying TDD in research

Software development in research is different from the work done in companies. It is a relatively minor part of the research work and might not constitute a priority. Applying the TDD methodology will double the time spent on coding for possibly no improvement in the research quality if the basic assumptions for TDD are unmet. Let’s explore the objectives, requirements, and conditions of the two types of codebase researchers work on: data collection and data analysis (cleaning, processing, and reporting).

Data collection or experiment

As mentioned in the introduction, researchers usually work on their research projects by themselves. They thus have little to no reusable code to help them in data collection, as they must define what kind of data they want to record and under which conditions. While some can use no-code tools provided by the manufacturer of their recording devices, many need to develop the whole experiment by themselves. This is the case for experiments involving multiple sensor recordings, closed-loop experiments, etc. In this process, the code they write needs to be 100% bug-free as the entirety of their research relies on it. If the data are shifted or partially corrupted, the entire project is at stake!

Before starting an experiment, a researcher properly defines one or several hypotheses and an experimental protocol, which are validated by peers or by the researchers themselves. As in the software engineering industry, researchers follow requirements and specifications before starting to code an experiment. The existence of those requirements validates the first assumption of TDD. With those, the researcher-developer has all the necessary information to follow the test-driven development methodology.

One key aspect of research is the reproducibility of the results, which requires recording enough data to give the results adequate statistical power. From longitudinal studies to cohort studies, the code developed to collect data will be run many more times than it is written or edited. Consequently, it meets the second assumption of TDD.

The lifespan of a data collection code is usually quite short compared with the expected lifespan of a company’s software. It ranges from a single experiment (1 month) to a whole Ph.D. (~4 years). The data collection code’s lifespan can be suddenly extended if the laboratory decides it is more beneficial to the team to have one shared code for data collection. The process then draws nearer to a company’s software development approach.

The code developed to achieve reliable and reproducible results must be proven bug-free through intensive testing. As we have seen, all three of TDD’s assumptions are met in this case. Therefore, it would be highly beneficial for researchers to apply TDD to their data collection code.

Data cleaning, processing, and analysis

In many cases, researchers will write code to clean, process, and analyze the data collected after they have collected it. They will “explore” the data with new hypotheses and data processing techniques. Data analysis development, a unique form of software engineering, is closer to scripting than common software engineering and suffers from a clear lack of requirements and objectives. Plots and graphs are made on the go with the data analyses. The code is written and run once. Most of the time, it is the work of a single data scientist (developer), and its codebase usually doesn’t grow. Since the code is made to analyze one experiment only, its lifespan is relatively short. Therefore, TDD doesn’t seem relevant here as its assumptions are not met.

Then, should we stop relying on trustworthy and reliable results for our research? To answer that question, we, researchers, should discuss what kind of methodology can be developed to ensure the reliability of our results.

Let me give you a real-life experience from my own career as a data scientist where I could proudly claim that my code was bug-free.

I worked for a laboratory that needed advanced visualizations for a complex neuroscience project. There, I was tasked with developing some data processing and analysis tools. A few days later, I had completed all the plotting and post-processing code. Considering the massive amount of work and data, how could I prove to my client that my code was reliable? How could they trust it to validate their hypothesis? The experiment had no precedents, so I could not verify the reliability of my code based on previous experiments and results. I could not trust the data either since anything could have gone wrong during the recording. Thus, I had to create mock data that validated the collaborator’s hypothesis, effectively turning the hypothesis into software requirements.

Starting from the conclusion (well-defined requirements), I was able to write, test-first, the code that generated ideal data. It then became easy to verify that my analysis code was bug-free, as the plots showed the expected results.
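The approach can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from the project described above: the generator (`generate_ideal_data`), the analysis step (`mean_difference`), the Gaussian model, and all parameter values are hypothetical.

```python
import random
import statistics


def generate_ideal_data(n_per_group, baseline_mean, effect, sd, seed):
    """Generate two groups in which the hypothesized effect is true by construction."""
    rng = random.Random(seed)  # fixed seed keeps the mock data reproducible
    control = [rng.gauss(baseline_mean, sd) for _ in range(n_per_group)]
    treated = [rng.gauss(baseline_mean + effect, sd) for _ in range(n_per_group)]
    return control, treated


def mean_difference(control, treated):
    """Stand-in for the analysis pipeline under test."""
    return statistics.fmean(treated) - statistics.fmean(control)


def test_analysis_recovers_planted_effect():
    """With ideal data, the analysis must return the effect that was planted."""
    control, treated = generate_ideal_data(
        n_per_group=5000, baseline_mean=10.0, effect=2.0, sd=1.0, seed=0
    )
    assert abs(mean_difference(control, treated) - 2.0) < 0.1


def test_analysis_reports_no_effect_when_none_is_planted():
    control, treated = generate_ideal_data(
        n_per_group=5000, baseline_mean=10.0, effect=0.0, sd=1.0, seed=1
    )
    assert abs(mean_difference(control, treated)) < 0.1


test_analysis_recovers_planted_effect()
test_analysis_reports_no_effect_when_none_is_planted()
```

Because the generator encodes the hypothesis, any disagreement between the planted effect and the pipeline's output points to a bug in the analysis code rather than to a problem in the recording.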

I call this methodology Test-Driven Research (TDR).

To develop the aforementioned tool, I also had to dig into the literature to find the expected type of distribution (Gaussian distribution, Poisson distribution, etc.) and its associated parameters. This also allowed me to estimate the number of subjects (animals/clusters/neurons) to record in order to achieve sufficient statistical power, that is, to be able to validate or reject the null hypotheses.
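One way such a generator can inform the number of subjects is a simulation-based power estimate. The sketch below is only an illustration under assumptions of my choosing: Gaussian data, two independent groups, and a normal approximation to the two-sample test statistic (a real study would use the test planned for the actual analysis). All function names and parameter values are hypothetical.

```python
import math
import random
import statistics


def detects_effect(control, treated, z_critical):
    """Two-sample test using a normal approximation (an assumption of this sketch)."""
    standard_error = math.sqrt(
        statistics.variance(control) / len(control)
        + statistics.variance(treated) / len(treated)
    )
    z = (statistics.fmean(treated) - statistics.fmean(control)) / standard_error
    return abs(z) > z_critical


def estimate_power(n_per_group, effect, sd, alpha=0.05, n_simulations=1000, seed=0):
    """Fraction of simulated experiments in which the planted effect is detected."""
    rng = random.Random(seed)
    z_critical = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_simulations):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        if detects_effect(control, treated, z_critical):
            hits += 1
    return hits / n_simulations


def smallest_sample_size(effect, sd, target_power=0.8, max_n=200, step=5):
    """Smallest tested group size whose estimated power reaches the target."""
    for n in range(step, max_n + 1, step):
        if estimate_power(n, effect, sd) >= target_power:
            return n
    return None  # target not reached within max_n subjects per group


if __name__ == "__main__":
    print(smallest_sample_size(effect=1.0, sd=1.0))
```

The effect size and standard deviation passed to `smallest_sample_size` are exactly the parameters that have to come from the literature, which is why this step forces the hypothesis to be stated precisely before any recording starts.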

Here is a quick summary of what TDR enabled me to do:

  • Properly define the hypotheses well before starting an experiment
  • Define the statistically adequate number of subjects for an experiment (statistical power)
  • Acquire all the information regarding neuronal patterns (field-specific knowledge)
  • Develop a highly tested, reproducible, and reusable ideal data generator
  • Create more reliable data processing pipelines, plots, and figures


We have seen how test-driven development, a software engineering methodology, can be applied to research to obtain reliable results from data acquisition and analysis. Despite differences in structure and goals, the high reliability that TDD provides clearly makes it a must-have for all researchers.

While data analysis is difficult to test because we write code and investigate the relevant data in parallel, I propose the adoption of test-driven research (TDR) to mitigate bugs in the data analysis pipeline. The reliability of the data analysis is greatly improved by injecting ideal, trustworthy data generated artificially with a well-tested generator (written with TDD).

This article demonstrates that researchers should not stop looking outside of their field to solve their own issues. It’s through the unyielding efforts of everyone that research will bring forth knowledge and continuous innovation.

Test-driven research pipeline. The artificial experiment is a data generator that can be used to test the code of the data analysis pipeline.

What are your solutions to ensure the reliability of your results in research? Have you encountered a similar experience? Please share your thoughts about this new methodology!

Recommended reads

Clean Code by Robert C. Martin. The book covers the foundation of good practices for modern software engineering for all languages. It is a must-have for everyone writing code.

Code That Fits in Your Head by Mark Seemann. The book offers tips and tricks on how to optimize your way of coding. It demands more knowledge of software engineering than Clean Code by Robert C. Martin does.



Julien Rechenmann

Data science consultant for research laboratories and startups.