Scientific methodology & Test-Driven Development

Etienne Pierrot
Checkout.com-techblog
May 4, 2022

Four years ago, I started sorting my father’s books, trying to find some that could interest me. Not an easy task: my father was a mathematician, an engineer in strength of materials, a physics enthusiast, and also a pioneer of software engineering. So most of these books were about topics outside my area of understanding, or about outdated programming languages (I don’t think it’s a good career plan to start learning Fortran now). But I found a little book with an intriguing title: “What Is This Thing Called Science?” by Alan Chalmers. At least I was able to understand the title of this one, so I picked it up and started to read. This concise book is an introduction to the philosophy of science and explains the experimental scientific method.

While I was reading it, I started to find a lot of similarities with the development discipline I chose to follow five years ago: Test-Driven Development (TDD). And the more I compared the two methods, the more obvious the analogy became. Since the scientific method is the best way of producing knowledge, can we deduce that TDD is the best way of producing software? But to understand why this analogy is relevant, we need to understand what a scientific theory is (and what it is not), and how the way we improve it is similar to TDD.

Two theories that make predictions

Let’s look at two “theories” that try to make predictions: quantum physics and astrology. I trust quantum physics more than astrology. But why do I? It’s not because I understand quantum physics (I have spent some time trying to get it, but sadly I failed), or because scientists look serious or wear a white coat (most don’t, in fact). And why don’t I believe in astrology? Not because I can’t see why the position of Venus should affect my life: quantum physics tells us things far more mind-blowing than that. I don’t know astrology well, but I’m sure it is built around principles that are coherent with each other, and the people who believe in and practice astrology are most likely just as smart; they have plenty of reasons to do so. Maybe, if I spent time trying to understand astrology, I would even find that some of these principles make some sense. But my intuition about the principles of quantum physics or astrology is not relevant to deciding whether I should trust them.

I trust quantum physics because this theory makes actual predictions, and those predictions are very accurate. The accuracy of this theory has allowed us to build incredible technology that profoundly affects my life. When people heard about Heisenberg’s uncertainty principle, some thought quantum mechanics was not very reliable. Shady people even try to use the weird concepts of quantum physics to sell you scams about quantum consciousness (run away when someone tries to sell you things mixing quantum physics and wellness). The knowledge around quantum physics is built with actual experiments, not philosophical considerations. And these experiments are not approximate: quantum mechanics is the field of science where theoretical predictions have been confirmed experimentally with the highest precision. On the other hand, astrology doesn’t perform so well at prediction. Sometimes it works. Sometimes not.

The method

So if we need to build predictable systems, maybe we should look at the method scientists use to produce such robust theories, and mostly stay away from the astrologer’s approach. Even people who believe in astrology don’t make critical decisions based on its predictions: most of them will not buy an expensive car just because astrology predicted they would earn a lot of money. As developers we are not exactly scientists, but we should still look at how scientists work. And if we look closely, most of the successful practices that emerged in the last 20 years are, in some way, adaptations of the scientific method. As most software pioneers had scientific backgrounds, it was natural to use almost the same methodology for building trustworthy software. Peer review is one of the most prominent examples. Science uses constructive criticism to progress: scientists submit papers to a scientific journal, the papers are reviewed by other experts in the field, and once an article is published, other scientists will try to reproduce the experiment, find errors, comment, and publish further papers. Even if this system is often criticised (cf. “Publish or Perish”), we haven’t found a better one so far. The COVID-19 episode proved that when scientists try to bypass this reviewing process to speed up research, we ultimately lose a lot of precious time.

On the other hand, I don’t hear many astrologers criticise their theory; there is not much controversy among astrologers. If astrology were very efficient, I would not blame them. But astrology performs poorly. The last time I was amazed by a prediction, it was not one made by an astrologer, but by an octopus: Paul the Octopus, during the 2010 World Cup in South Africa. This kind of poor performance should trigger an internal controversy. But no, astrology’s principles didn’t change. I’m also surprised that all the significant scientific discoveries about the universe during the last century had no impact on astrology.

Honest controversy is the fuel of improvement. It takes time and modesty, but it’s needed to produce trusted knowledge. It’s because some software engineers found the waterfall method not efficient enough that the Agile Manifesto appeared. I remember when I first heard about Extreme Programming (maybe in 2005). This stuff about Pair Programming, peer review, TDD, refactoring, and Continuous Integration was very controversial. It was kind of punk! When many teams didn’t even use version control, Continuous Integration sounded crazy. These practices were marginal; the trending topics were code generation from UML, complicated “Enterprise” tools, and big upfront design with long-term planning (never really implemented). Honestly, being a developer in 2022 is way better. Extreme Programming questioned how we were used to thinking by proposing counterintuitive ideas that went against common sense. This other way of building software has made our lives better. We can be thankful to the intelligent people who challenged a lot of dogma: Kent Beck, Martin Fowler, Rebecca Wirfs-Brock, Ward Cunningham (and many others).

Searching failure

While an astrologer spends his time finding confirmations of his successful predictions, the scientist spends his time searching for errors in existing theories. And this is the main difference between science and pseudo-science: science studies falsifiable statements. For example, imagine an astrologer tells you, “You will be rich one day.” This statement is not falsifiable: it’s impossible to produce evidence that would prove it false. First, what does it mean to be rich? Some may think that being rich is not only about money. And even if it’s about money, there is no unambiguous definition of how much money you should have to be considered rich (I see evidence of that every day in political debate). And what about “one day”, with no explicit time limit on the prediction? The astrologer takes absolutely no risk by making this statement. And when astrologers are constrained to make falsifiable statements under a rigorous protocol, they fail to produce salient results.

If someone tells me, “Every Sunday, it’s raining”, this statement may be wrong, but it’s in the domain of science: to falsify it, I just need to find one Sunday when it isn’t raining and show that evidence. If a scientist formulates a falsifiable hypothesis and no one manages to falsify it over the years, we can trust it and use it as a theory. Falsification is at the heart of the scientific method. If you want to play the game of science, you need to take the risk of being wrong. Therefore, it’s essential to use well-defined concepts without ambiguous meaning: each word chosen should have an exact meaning.

This notion of falsifiability is very interesting when applied to specifications. If specifications leave room for interpretation, how can the client and the developer agree on whether a defect is a bug or a feature? For example, a client asks a developer to build an accounting system. The client should not require: “The balance of the account should be accurate”, because this is open to interpretation (especially if the developer is not a domain expert). Instead, the client should propose falsifiable statements like: “For a transaction, the sum of the entries should be equal to zero” or “The balance of an account should be the sum of all entries.” To prove that the developer did a bad job, the client just needs to find one case where such a statement does not hold. Behaviour-Driven Development tries to propose a way for client and developer to communicate that reduces this gap of understanding: the Given-When-Then formalism attempts to establish a way to write unambiguous, falsifiable scenarios. We also need to take care of the meaning of words. If clients and developers don’t share the same understanding of the words they’re using, how could we expect to build something that works as the client expects? Eric Evans, the inventor of Domain-Driven Design (DDD), spotted this with the very insightful notion of “Ubiquitous Language”: one language about a specific domain that domain experts and developers share and communicate within.
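Those two falsifiable statements can be written directly as executable checks. The `Transaction` and `Account` types below are hypothetical, a minimal sketch of what such a specification could look like as Given-When-Then style tests:

```python
# A minimal sketch (hypothetical model) of the two falsifiable
# accounting statements, written as Given-When-Then style tests.
from dataclasses import dataclass, field

@dataclass
class Transaction:
    entries: list  # signed amounts: debits negative, credits positive

@dataclass
class Account:
    entries: list = field(default_factory=list)

    def balance(self):
        # "The balance of an account should be the sum of all entries."
        return sum(self.entries)

def test_transaction_entries_sum_to_zero():
    # Given a transfer recorded as two opposite entries
    tx = Transaction(entries=[-100, 100])
    # Then the sum of the entries should be equal to zero
    assert sum(tx.entries) == 0

def test_balance_is_sum_of_entries():
    # Given an account with three entries
    account = Account(entries=[50, -20, 30])
    # Then the balance should be the sum of all entries
    assert account.balance() == 60
```

Each test is falsifiable in exactly the sense above: one counterexample (a transaction whose entries don’t cancel out, a balance that isn’t the sum) makes it fail.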

Test-Driven Development

Regarding falsification, the second law of Test-Driven Development (“You are not allowed to write any more of a unit test than is sufficient to fail”) asks the developer to propose a modest falsifiable statement (a test) that the system does not yet satisfy. While the non-TDD practitioner tries to confirm that he wrote a correct implementation by testing it afterwards, the TDD practitioner starts by proving the system’s deficiencies. Only after that are you allowed to write the simplest solution to the problem you have established. A paper published in 2003 (“Test Driven Development and the Scientific Method”) suggests that TDD is a translation of the scientific method for software engineers. Quoting Rick Mugridge, the author of the paper:

S.M. (scientific method) main elements: Theories evolve in large or small steps and must be consistent with experiment results. Experiments examining the predictions of the theory must be repeatable so that the results can be verified and the theory refined.

T.D.D.: The design of a system corresponds to the theories of S.M., and the tests that are written are the experiments. Tests, like experiments, are repeatable and can be run (and rerun) to test new versions of the design.

The test-first approach is a fantastic safety net against confirmation bias. By writing tests after the implementation, a developer is highly influenced by his implementation. So influenced that sometimes we find traces of the implementation everywhere in the tests. Implementation details that leak into test suites become a huge problem when you discover that your implementation wasn’t the ideal one: you will need to change your implementation and your tests at the same time. And in the end, your tests will not help you when you seriously need them.
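The leak described above can be sketched in a few lines. The `PriceCalculator` class and its per-item rounding are hypothetical names, purely for illustration:

```python
# A sketch of implementation details leaking into an after-the-fact test.

class PriceCalculator:
    def total(self, prices):
        # current implementation detail: each item is rounded to 2 decimals
        return sum(round(p, 2) for p in prices)

# Implementation-leaking test: written after the code, it repeats the
# internal rounding expression, so it pins the algorithm, not the
# requirement. Replace the rounding strategy and this test must change too.
def test_total_rounds_each_item():
    calc = PriceCalculator()
    assert calc.total([1.005]) == round(1.005, 2)

# Behaviour-focused test: it only states the requirement, and survives
# any refactoring of the implementation that preserves the behaviour.
def test_total_of_two_items():
    calc = PriceCalculator()
    assert calc.total([1.50, 2.25]) == 3.75
```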

Occam’s razor

Occam’s razor helps scientists decide between two theories that explain the same phenomenon: prefer the more parsimonious one. Maybe the less parsimonious one will turn out to be the right one. But until you prove that the more parsimonious theory is wrong, you should stick to it.

The third law of TDD is just an adaptation of Occam’s razor to code: “You can’t write more production code than is sufficient to pass the currently failing unit test.” That’s another great advantage of TDD: it prevents you from falling into the worst plague of software engineering, over-engineering. It’s easy to find complex solutions to simple problems, but it’s very hard to find a relevant test that justifies an over-complicated design. By following this discipline, you never end up in the situation: “I don’t remember why I added this code, but I won’t remove it because I must have had a good reason to put it there.” After years of that, you end up with a toxic legacy codebase where nobody wants to take the risk of removing code, because nobody knows the requirement behind it, and the software works in production and makes actual revenue. If you follow the TDD mantra, each piece of code is justified by a requirement. If you remove a line of code that you think is useless, you run the test suite, and a red test will show you the requirement that justifies this piece of code. It’s then up to you to decide whether this requirement is still needed.
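A sketch of the third law in action (the `Stack` class and its requirement are a hypothetical example): the test below was written first and failed, and the production code contains only just enough to make it pass, with no speculative features like capacity limits or iteration “just in case”:

```python
# The failing test came first (Red); this is the simplest production
# code sufficient to pass it (Green). Nothing more is allowed yet.

class Stack:
    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        # Justified by test_pop_returns_last_pushed_item: deleting this
        # method (or any line of it) turns that test red again.
        return self._items.pop()

def test_pop_returns_last_pushed_item():
    stack = Stack()
    stack.push(42)
    assert stack.pop() == 42
```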

Experiment / Test

As you can see in the figures of Rick Mugridge’s paper, the TDD process is slightly different: we run our tests much more often than scientists run their experiments. In each TDD loop, we run our tests many times. That’s obviously because running a test is cheaper than running an experiment. It’s so cheap that we decided to industrialise it with Continuous Integration. Each time we want to modify our system, we re-run all the experiments we did in the past, to guarantee that the new system still behaves like the old one on the previously defined use cases while implementing the new one. Scientists cannot afford this kind of luxury. Instead, the scientific community organises controversy through scientific publications to reach a consensus. And yes, this process takes time, more than just running a set of automated tests. By the way, the COVID-19 situation was an epic demonstration that the mainstream media had not understood how scientists reach a consensus. Many journalists and editorialists were shocked when they discovered that scientists didn’t agree on everything. Many even claimed that science had failed (and changed their minds when Pfizer announced an efficient vaccine). And sometimes, in talk shows, scientists were blamed because they didn’t agree with the hosts and didn’t have a definitive, straightforward answer to journalists’ questions like “Does this treatment work, yes or no?” or “Can I go to the beach this summer?”

One of the most challenging things for a new TDD practitioner is writing the test ahead of the implementation. Developers often feel more comfortable jumping into the solution than stating clearly, with a test, the problem to solve. But why? It doesn’t sound reasonable to start looking for a solution to a problem that is not clearly defined. Maybe the reason the developer is not able to write the test first is that their understanding of the problem is not good enough. Perhaps the developer needs to spend more time inside the “problem space” rather than inside the “solution space”. One obvious reason for this is that, as developers, we are passionate about tech and less about the business domain. In the solution space, we debate which kind of database to use, RabbitMQ or Kafka, REST or gRPC, which design pattern to apply, anemic model or rich model… That’s a space where we are comfortable; we have dedicated a lot of time to improving our hard skills to be good at it. The problem space, on the other hand, deals with a specific industry that we often don’t know well enough. We even struggle to understand the meaning of the domain experts’ words. Worse, sometimes one domain expert will use the same word as another domain expert but with a different meaning. So, as developers, we prefer to rush to that comfortable place that is the solution space and leave the problem space as soon as possible. The consequence? Software that doesn’t solve the initial problem, and we only discover it once the software is in the hands of users. TDD is not a miracle cure, but at least it pushes developers to spend more time in the problem space, by writing a failing test, rather than jumping directly into the solution space. Strategic DDD gives you excellent tools for exploring the problem space.
And one interesting thing I have noticed while frequenting the developer community: people interested in TDD are also interested in DDD. Both communities are essentially composed of the same people.

Conclusion

First of all, I want to mention that my description of the scientific method is an oversimplification. The method varies depending on the field of science, especially in the human and social sciences. But even in these fields, studies should conform to a strict methodology. In terms of software implementation methodology, TDD is the most advanced proposal on the table. If you ask TDD practitioners how to build software, all of them will answer that everything starts with a failing test. Even if there are different schools inside the TDD community and not every practitioner has the same approach, the big picture is the same: Red, Green, Refactor… Red, Green, Refactor… Red, Green, Refactor…
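One full turn of that cycle can be sketched in a few lines (the leap-year requirement and the `is_leap_year` name are a hypothetical example): Red, the tests are written first and fail because the function does not exist; Green, the simplest code that passes; Refactor, the condition is cleaned up for readability while the tests stay green.

```python
# Red: these tests were written first and failed, because is_leap_year
# did not exist yet.
def test_2024_is_a_leap_year():
    assert is_leap_year(2024)

def test_1900_is_not_a_leap_year():
    assert not is_leap_year(1900)

# Green, then Refactor: the simplest passing code, later rewritten with
# a small helper for readability; behaviour unchanged, tests still green.
def is_leap_year(year):
    def divisible_by(n):
        return year % n == 0
    return divisible_by(4) and (not divisible_by(100) or divisible_by(400))
```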

And these different TDD schools (London, Detroit) are very well documented. If you ask the same question to non-TDD practitioners, each one will have a specific recipe, and it will be hard to extract a generic methodology. For example: at which point of the implementation process does a non-TDD practitioner start writing tests? Never? Only when the implementation is complete? Per module? Per class? Per function? There is no clear definition of this methodology, and so it’s hard to criticise it. At least TDD proposes something that can be taught and criticised. And it’s imperative to take these criticisms into account to improve the method, or to answer them to convince non-TDD practitioners to adopt TDD. But that will be a topic for the next article…


Etienne Pierrot

Software Engineer interested in TDD, DDD and Functional Programming