Code, Models and Covid-19

Why Peer Review and Open Source matter

Henry Story
May 13, 2020

This article is a follow-up to an earlier blog post, Co-Immunology and the Web, which looked at the many dimensions of immunological thinking.
Here I look critically at how those concepts apply to code: first the controversy around the Imperial College model, then the reactions to it, and finally some glimpses of how Category Theory can help.

The Imperial College Model and Peer Review

The Imperial College Report by Prof. Ferguson, published on 16 March 2020, was widely credited with the UK government's change of strategy from Herd Immunity to lockdown (though of course things are more complicated). The model had been used over the past 20 years to inform policy, yet amazingly enough, this was the first time a request had been made to review the code behind it. The code was released reluctantly to various companies such as Microsoft, who then published an improved version on GitHub. The first review of this code was very damning. I will here try to put that critique into context.

Of course, it is not up to me, a software engineer with 30 years' experience, to critique the model that the code was attempting to implement, as I am not a specialist in epidemiological modelling. But with the considerable advantage of hindsight, we can ask whether the model, or the interpretation of its results, may have underestimated the feasibility of more flexible measures to combat the epidemic.

We can see this from the above diagram of confirmed Covid-19 cases, which shows that the disease peaked in a wide selection of affected countries around the end of March. Was this due to the lockdown policies? It would be natural to come to that conclusion, were it not for the fact that Sweden implemented a strategy of light social distancing and herd immunity, as explained by Prof. Johan Giesecke in a lengthy interview on UnHerd's Lockdown TV. Yet Sweden seems to have fewer confirmed cases than the UK. Of course, Sweden is a different country, with a different culture, social system, weather and population distribution.

It is also true that countries such as France, Germany, Switzerland, Austria and Norway, which introduced lockdown policies earlier, have done a lot better. But what will happen when their lockdowns are relaxed? Will the virus re-emerge and force those countries back into lockdown, as suggested by Prof. Giesecke, leading to longer-term damage to the economy for no real gain? We will have to wait and see (see the discussion The Cause of Death).

Diagrams from Imperial College Report 13

It turns out that at the end of March the Imperial College team did make predictions for 11 European countries, including Sweden, in their Report 13. There they considered how different interventions could change the growth of the virus and the mortality rates. In the diagram published on page 8 of that report, we can compare how the model evaluated the UK's and Sweden's situations. The UK had just imposed a lockdown on 23 March, and so the model predicted the reproduction number R of Covid-19 dropping quickly towards 1 and below. Sweden, by contrast, should according to the model have remained on a reproduction number above 2, that is, with infections and deaths growing quickly and consistently. But as we saw, there was no such difference between the two countries. In fact, quite remarkably, all countries seemed to stabilise at around the same time, at the end of March, just as the report was published.

Now that the code is published, it should be possible to find out how the model would have predicted Sweden's situation two weeks down the line. And indeed it looks like Clive Best did just this, as shown in the tweet to the left.
Reflections of this kind are leading to the emergence of a position, expressed in a Foreign Affairs article, that Sweden's response to the epidemic may be the more sustainable one.

Other doubts have been voiced in the article “Six Questions that Neil Ferguson should be asked”, regarding the history of his model over-estimating the danger of various epidemics. It is all the more surprising that it took this epidemic — 20 years after the model's predictions first influenced policy — for the code behind it to be requested. How could a serious peer review take place without access to the code?

Peer review is the key to testing assumptions behind an argument. It is especially important when dealing with models designed to predict a one-time event since, in that case, it is impossible to re-run the experiment. Did the model lead to a change of behaviour that resulted in the fatal incident being avoided? Or did the model exaggerate the dangers, leading to unnecessary steps being taken?
The testing through questioning that is the essence of peer review is a prerequisite for intellectual hygiene. It is the way to separate good ideas from bad ones and is at the foundation of academia, originating with Socrates 2389 years ago in the Theaetetus, a dialogue on the nature of knowledge. The dialogical nature of reasoning can be traced through the medieval games of Obligationes to current work on logic such as Meaning in Dialogue, or, in terms of constructive logic, to Immanent Reasoning or Equality in Action. The last is particularly interesting as it elaborates a dialogical view of constructive logic, which is the logic behind mathematically grounded programming languages that emerged from work in academia, such as Haskell, Idris, Agda, Rust and Scala (notably, Twitter is programmed in Scala). The key insight is that one can see such languages based on constructive logic as being in dialogue with their mathematical dual: falsificationist logic. This is the logic on which Popper based his scientific methodology, as explained in Dual Intuitionistic Logic and a Variety of Negations: The Logic of Scientific Research.

So the first remarkable thing about Neil Ferguson's code is that it came with no tests. This is particularly problematic in C or C++, the languages it was written in, as these come with almost no built-in safety features: the programmer has to know a lot about the architecture of the machine the code will run on to understand how it will behave. It is easy in C to make basic programming mistakes where one part of the running program overwrites memory used by another part, leading to completely erroneous results. Avoiding such errors requires extremely conscientious programming. But such conscientious programming would include tests!
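To make that concrete, here is a deliberately broken little C sketch (the variable names are invented for illustration): an off-by-one loop writes one slot past the end of an array, C performs no bounds check, and the stray write may silently corrupt whatever happens to sit next to the array in memory while the program carries on as if nothing were wrong.

#include <stdio.h>

int main(void) {
    double infection_rate = 0.5;   /* an unrelated variable */
    int counts[3] = {0, 0, 0};

    /* Off-by-one: i runs from 0 to 3, but counts has only slots 0..2.
       The write to counts[3] is undefined behaviour: no error is raised,
       and it may silently overwrite neighbouring memory. */
    for (int i = 0; i <= 3; i++)
        counts[i] = i * 10;

    printf("infection_rate = %f\n", infection_rate);
    return 0;
}

A compiler will happily accept this, and on many machines it will run and print something plausible; only a test checking the expected values, or a tool such as a compiler's address sanitiser, would flag the corruption.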

Why would anyone program in C? This is easily explained historically. Dennis Ritchie created C at Bell Labs in the early 1970s as an abstraction above even less friendly machine code, in order to write the Unix operating system (descendants of which you will find running Google, Android and macOS). For those tasks, knowing how the hardware works is a prerequisite, so it makes sense to give the programmer full access to the machine. In the intervening 50 years, things have evolved a lot. If one were to write an operating system today, one might first look to a language like Rust, which has integrated 30 years of research, especially the concept of linear types, to allow programmers to write low-level code without needing a garbage collector.
In the 1980s and 1990s, C++ was also used for application-level code, because consumer-grade computers were severely limited in power. In the early 1990s, 20 MHz chips counted as fast, which left little CPU headroom for clever compilers or garbage collectors. By the time Java came out in 1995, things were starting to change, but it took another four years before Java's Just-In-Time compiler was good enough to run code at close to the speed of C.

We can therefore quite clearly date the early origins of Prof. Ferguson's code to the 1990s, a time when it would have made sense to write it that way.

But an epidemiological model should not need to be tied that closely to the hardware. What matters is for the code to clearly encode the ideas of the model, test those ideas, and improve them in quick iterations, allowing the model to evolve. Writing code in a low-level language such as C goes entirely against such requirements, as it forces one to find programmers who know both low-level machine details and high-level modelling issues. Such programmers are rare and thus very expensive, and will probably prefer to work on more exciting things.

Memory management in C is complicated by the lack of protections and by pointer arithmetic. This is compounded when we move to multi-threaded code, i.e. code that runs in parallel. Parallelism has become vital as the rate of increase in single-CPU speed has slowed over the past 15 years; instead, machines have grown the number of cores and threads per core. The UltraSPARC T2, released by Sun Microsystems in 2007, came with eight cores and eight threads per core, for a total of 64 threads per CPU, each running at 1.6 GHz, only half the speed at which threads run in this year's laptops. In 2018 Oracle released the successor SPARC M8 chip, with 32 cores for a total of 256 threads running at 5 GHz on one processor. Systems can then be built with several such processors, increasing even further the need for parallelisation: an 8-CPU SPARC M8-8 server can handle 2048 threads.
This growth in parallelisation transformed the way developers had to program. In the 1990s, stateful object-oriented programming was in fashion. Java was one of the first languages to come with threads and object-locking features out of the box. The locking was important to prevent two threads overwriting data on the same object, which, when it happens, can create so-called Heisenbugs that are often very difficult to spot and reproduce. But these locking mechanisms turned out to be very difficult to reason about correctly, even in a language with garbage collection such as Java. The programmer had to navigate carefully between the Scylla of Heisenbugs and the Charybdis of deadlocked code. As a result, functional programming with stateless, immutable data structures grew in popularity, with languages such as Scala offering a way to move between both paradigms. This led to a huge growth of interest in Category Theory, with books aimed directly at programmers such as Bartosz Milewski's Category Theory for Programmers. It is thus not surprising that various bugs were soon found in the multi-threaded code of the Imperial College model.
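As an illustration of the kind of Heisenbug meant here, consider the following minimal C sketch (the shared counter and the iteration count are made up for the example): two threads increment a shared variable with no locking, their read-modify-write steps interleave, and the final total changes from run to run.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;               /* shared, unprotected state */

static void *work(void *arg) {
    for (int i = 0; i < 1000000; i++)
        counter++;                     /* read-modify-write, not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* One would expect 2000000, but lost updates make the result vary
       between runs; the bug also tends to vanish when one adds logging
       or a debugger, since that changes the timing. */
    printf("counter = %ld\n", counter);
    return 0;
}

Wrapping the increment in a lock fixes this example but introduces the opposite risk of deadlock once several locks interact, which is precisely the dilemma that pushed the industry towards immutable data.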

We are thus presented with an interesting case of a program written in a language offering little protection, with no tests, which furthermore came in a single file of 15 thousand lines of code, indicating that basic principles of code modularity had not been considered. It turns out that parts of the code may have been translated from earlier Fortran code which, comments in the code suggest, was little understood. This lack of modularity and transparency would have made it all the more difficult to alter or adapt the model to new concepts: any attempt would have raised the spectre of new and unknown bugs entering the code base that would not have been detectable, as no tests were available.

All of this indicates an emphasis on efficiency over safety, which could only lead, over time, to the project being unable to scale. To scale, one has to build on tools that are as far as possible mathematically proven, so that one can move up an abstraction hierarchy without fear that one's foundations will collapse. This problem is not uncommon, it seems, as indicated by the 2014 position paper “A Computational Science Agenda for Programming Language Research”, which argues that new higher-level abstractions need to be developed to make it easier to write clear and verifiable scientific models.

The above should help put in context the criticisms made in the Code Review of Ferguson's Model. Sadly, that review ends with a personal suggestion which shows that its author understands neither the philosophical reasons why testing and peer review are important, nor the origin of the problem. “Sue Denim” writes:

On a personal level, I’d go further and suggest that all academic epidemiology be defunded. This sort of work is best done by the insurance sector. Insurers employ modellers and data scientists, but also employ managers whose job is to decide whether a model is accurate enough for real world usage and professional software engineers to ensure model software is properly tested, understandable and so on. Academic efforts don’t have these people, and the results speak for themselves.

But this ignores major engineering projects developed at universities and polytechnics, such as BSD Unix (Berkeley), the X Window System (MIT), the Mach kernel (Carnegie Mellon), the Glasgow Haskell Compiler GHC, Scala (EPFL), the Coq proof assistant (INRIA), Agda (Chalmers), and many more. Many of these emerged from work to bring mathematical certainty to programming or, dually, to automate and verify mathematical proofs. Universities are therefore quite capable of producing code of world-class quality. More should be done to help others join them. This may require changing how academics are evaluated, for example by letting peer-reviewed code that is used or re-used count as a citation.

Instead of giving all the work to private enterprise, research institutions should work on open source frameworks, written in languages that would enable the code to be developed at much lower cost, and to be much more secure, flexible and scalable, making it easy to test new modelling ideas informed by as much quality Open Data as they can get hold of. Such frameworks could then be peer-reviewed, compared, improved and evolved for the better over time. A famous law named after Linus Torvalds, the originator of Linux, the Unix-like operating system that runs Google and Android, goes: “given enough eyeballs, all bugs are shallow”. At present, eyeballs may not be enough. We also need provability, and so logic and mathematics.

Relying only on private companies, which may not understand the value of their code being open, reduces the critical ability of the state to reach decisions in a health crisis, with severe costs to society and to private enterprise. Governments need to be well informed by independent researchers as well as by companies, and companies need universities to develop the talent that improves their models. If the lockdown sceptics turn out to be right, not having such an open platform could have cost the world a tremendous amount of money for no real gain. If they are wrong, then such an open platform would have stemmed their criticism and allowed the right decisions to be made earlier.

In conclusion, as argued in the earlier article Co-Immunology and the Web, the Covid-19 pandemic has shown us how our body's immune system — which has evolved over millions of years and keeps evolving over our own lifetime, learning from past experience fighting viruses — can be complemented by personal hygienic practices (washing one's hands), by social or co-immunitary ones (wearing masks, distancing, protecting the old and vulnerable), and by geopolitical immunities (trade and travel restrictions). We have seen in this essay that immunological thinking also applies to the programming world, where it is usual in the industry to speak of code smells, of rotten or dead code, and of healthy ecosystems. We have seen how peer review — a form of co-immunology — requires Open Source for the necessary critique to take place, if the code is to evolve in a healthy manner. Most surprisingly of all, we have seen in this pandemic that the models on which national-level decisions were made had not undergone these co-immunitary checks. Still, the first step of inspection has now been taken, which, if followed up on and enlarged, should help better-informed decisions be made in a future crisis.

Reactions in the news and blogosphere

The story has now (17 May 2020) reached the wider press. Here are some of the articles I have been pointed to.

The first, in the Telegraph, is balanced, looking at the different models that were available, with a quote from Sir Nigel Shadbolt, head of the Open Data Institute, who emphasised the importance of having different models.

The next emphasises the problem of having a single source file of 15,000 lines of code, a failure to respect the “separation of concerns”. One of the authors is Dr Konstantin ‘Cos’ Boudnik, presented as vice-president of architecture at WANdisco, author of 17 US patents in distributed computing and a veteran developer of the Apache Hadoop framework. Note that Apache is the Open Source foundation that produces the server running most websites in the world.
If the software is so problematic, this points to an even bigger problem: there was no requirement that it be part of the peer review, a simple requirement that should be made of all academic work.
Of course, other countries came to impose similar lockdowns even before Ferguson's model appeared. If all those countries were wrong, then why so many went so wrong would be the first question in need of investigation.

The next article comes to the defence of Neil Ferguson. Phil Bull, lecturer in cosmology and author of “Ubuntu Made Easy”, argues that not every software developer can review every piece of software: understanding Java graphical user interface programming does not give one the background to criticise Linux kernel code. I agree. Academic software modelling, Phil argues, has different criteria for success and different types of testing, such as peer review. Still, peer review without the option to re-run the simulated experiment cannot be a full peer review. Since the experiment here is a simulation, the code is a very important part of it. (Another important part, not discussed here, is that the data also needs to be inspectable.)
Furthermore, open-sourcing the code would have immediately led to improvements, as many of the needed ones are quite obvious. When Linus Torvalds started writing the Linux kernel in 1991, it was by his own admission a sketch of an OS. Putting it online allowed it to grow through community feedback, in the form of patches that came in from all over the world. That improvement led to it being adopted by the Stanford students who went on to start Google, where it runs, much improved, to this day, a cornerstone of their whole infrastructure. It now runs all Android phones and a huge percentage of servers, and has been ported to every conceivable chip.

The idea, though, that scientific code reviews by commercial software developers should be ignored is strongly rebutted by Chris von Csefalvay, an epidemiologist specialising in the virology of bat-borne illnesses. He is actually in a position to turn Phil Bull's argument against him, since Phil works in cosmology, a field arguably as distant from epidemiology as the experience of someone who has worked at Google.

It may be that the major problem scientists face is that code does not contribute to their standing in terms of citation counts. But why should it not? Why are widely used software libraries written by academics not counted in their assessment? This in a way works against good software developers in academia, whose work, formally speaking, counts for nothing. This aspect of the situation is considered by Ben Lewis, an astrophysicist also very critical of Ferguson's code. This is, in my view, the right place to look for change in academia: the system needs to require that the code and data behind an experiment be part of the peer review, and citation counts need to extend to software re-use. This is what the movement for recognising the position of Research Software Engineer is trying to fix.

In the meantime, the very good news is that, thanks to the release of the code, others have been able to run it, verify the results, and see what other results it could have given.

As mentioned, the code is only one aspect of the model. Other aspects are the assumptions and algorithms the model uses, as well as the interpretation of its results. The problem revealed by the code is mainly that it suggests that changing models would have been difficult. For example, the following post argues that herd immunity is reached much sooner if one takes social networks into account. Just as some people online have huge social networks while most have very small followings, so in the real world some are superspreaders and most others are light spreaders. Superspreaders quickly pass the virus on to large numbers of people, and the spread then soon slows down as less socially connected people fail to pass it on widely. So instead of needing 80% herd immunity before SARS-CoV-2 stops spreading, it may be that this barrier is already reached at around 16%.
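For context, the usual back-of-the-envelope threshold comes from the textbook homogeneous-mixing formula 1 - 1/R0. The tiny C sketch below just evaluates that formula for a few assumed values of R0 (illustrative values, not estimates for SARS-CoV-2); the much lower 16% figure does not come from this formula but from the network-heterogeneity assumptions of the post cited above.

#include <stdio.h>

int main(void) {
    /* Classic homogeneous-mixing herd-immunity threshold: 1 - 1/R0. */
    double r0s[] = {2.0, 2.5, 3.0, 5.0};
    for (int i = 0; i < 4; i++)
        printf("R0 = %.1f  ->  threshold = %.0f%%\n",
               r0s[i], 100.0 * (1.0 - 1.0 / r0s[i]));
    return 0;
}

With the R0 values of 2 to 3 usually quoted for SARS-CoV-2, the formula gives thresholds of 50% to 67%; heterogeneous spreading lowers the effective threshold further, because the most connected people tend to be infected, and become immune, first.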

A similar point is made by this post.

How Category Theory can help

It turns out that there are ways to model pandemics in terms of Category Theory using Petri nets, which would allow one to put together tools to quickly model complex, multi-dimensional scenarios (travel between cities and states, transmission rates, evolving theories of immunity such as potential T-cell immunity, and so on) in a verifiable and understandable way.
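The simplest such Petri net is the classic SIR model: three places (Susceptible, Infected, Recovered) and two transitions (infection and recovery) firing at mass-action rates. As a rough, hand-written sketch of the dynamics such a net generates, here is a discrete-time version in plain C, with made-up rates beta and gamma that are not calibrated to Covid-19.

#include <stdio.h>

int main(void) {
    /* Illustrative rates only: beta = transmission, gamma = recovery. */
    double beta = 0.3, gamma = 0.1;
    double S = 0.99, I = 0.01, R = 0.0;    /* fractions of the population */

    for (int day = 0; day <= 200; day++) {
        if (day % 25 == 0)
            printf("day %3d  S=%.3f  I=%.3f  R=%.3f\n", day, S, I, R);
        double infections = beta * S * I;  /* transition: S + I -> I + I */
        double recoveries = gamma * I;     /* transition: I -> R         */
        S -= infections;
        I += infections - recoveries;
        R += recoveries;
    }
    return 0;
}

The appeal of the category-theoretic approach presented in the talks below is that the model lives at the level of the net itself, where places and transitions can be composed and extended (say, with an Exposed compartment, or with travel between cities) and then executed or analysed, instead of being buried in hand-written update loops like this one.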

Petri nets are brilliantly explained in the first 10 minutes of this talk:

These concepts have been taken up by a team including James Fairbanks, Evan Patterson, Sophie Libkind and Andrew Baas to bring Category Theory to the scientific programming language Julia (see the GitHub repository). The following talk explains how this can be used to model pandemics in a quick, iterative manner, responsive to the evolution of our understanding of the phenomenon. This looks like the way to go.


Henry Story

is writing his PhD on http://co-operating.systems/. A Social Web Architect, he develops ideas in Scala, guided by Philosophy and a little Category Theory.