Why I’m sceptical of Lockdown Sceptics’ COVID-19 code review

Michal Tusnio
19 min read · May 10, 2020


Lockdown Sceptics have called for the defunding of all epidemiology research based on a review of Neil Ferguson's GitHub repository, which holds the model behind the famous Imperial College paper. I recommend reading through that review first. Now that you've definitely done so, the question remains: is the codebase all doom and gloom? To be honest… not really, no, far from it.

The techy bit

Let’s kick things off with a look through all the technical arguments. Going from the top we have…

The code

[The Imperial code] isn’t the code Ferguson ran to produce his famous Report 9. What’s been released on GitHub is a heavily modified derivative of it, after having been upgraded for over a month by a team from Microsoft and others. This revised codebase is split into multiple files for legibility and written in C++, whereas the original program was “a single 15,000 line file that had been worked on for a decade”.

Code written by researchers or academics is, in my experience, lacking. It often suffers from being readable only by the people who wrote it, has little to no documentation, and treats automated tests as an afterthought. It's all quite understandable: it takes a few years of exposure to a professional environment to learn industry standards and how to write code that others can read. Software engineers usually aren't doubling as researchers, and researchers rarely become software engineers, that is, they do not end up working for companies where they pick up those trade skills. It's therefore no surprise that academics use their software skills to support their research, and the codebase ends up being developed in isolation with all the quirks, issues and problems inherent in that approach. Yes, 15k lines of code in one file is bad practice. The last sentence in Sue Denim's quote above comes from John Carmack, and it's a shame he's not quoted further, because he also states on Twitter:

it turned out that it fared a lot better going through the gauntlet of code analysis tools I hit it with than a lot of more modern code. There is something to be said for straightforward C code. Bugs were found and fixed, but generally in paths that weren’t enabled or hit.

This is our only non-research source that comments on the original code, and if we are to trust this opinion in its entirety, then Carmack is, all in all, saying: even though the codebase looks a bit like crap, automatic code analysis tools were more optimistic about it than about a lot of modern industry code he's seen. Hell, he even downplays the significance of the issues by noting they were found in the less commonly run (or even completely disabled!) parts of the code.

A request for the original code was made 8 days ago but ignored, and it will probably take some kind of legal compulsion to make them release it.

I agree it would be nice to have access to the original file; there's no harm in some extra information. However, if the results can be replicated with the current codebase, we don't strictly need that file. There's also the question of whether the research team is in any way obligated to release this source code or whether it's their own initiative. We do not know under what circumstances this code was written, who owns the copyright, and whether it had to be made public to begin with. It's not stupid to assume the release was pure goodwill on the part of the research team.

Non-deterministic output

Non-deterministic outputs. Due to bugs, the code can produce very different results given identical inputs. They routinely act as if this is unimportant.

Well that sounds bad, doesn’t it? Then the article makes a few extra claims we’ll examine below:

Even if their original code was released, it’s apparent that the same numbers as in Report 9 might not come out of it.

That is a very strong assertion that needs some backing, so let's keep it in mind and see what they say next.

The documentation says:

The model is stochastic. Multiple runs with different seeds should be undertaken to see average behaviour.

“Stochastic” is just a scientific-sounding word for “random”.

This is an extremely bold claim and accusation based solely on the dictionary definition of “stochastic”, without any context whatsoever. If you look at the linked Wikipedia article, you can seemingly find more proof that the researchers imply their code returns random results:

but now in mathematics the terms stochastic process and random process are considered interchangeable.

I'm neither a statistician nor an expert in pandemic modelling, and I imagine neither is the author, but I'm curious as to why they rejected the idea that the Imperial research team used “stochastic” in the sense of stochastic modelling, rather than as a synonym for random outputs. Especially since this is further backed up by the abstract of a paper linked in the GitHub readme, which states “Three research groups using different individual-based, stochastic simulation models have examined…”. Stochastic is not just a keyword for “we return random data”. It's a method where you perform a large number of runs with random variables in them to generate a distribution of all possible outcomes, and by examining that distribution you get an idea of the most likely course of the pandemic (some runs will return similar results, some will vary wildly, but a lot will fall within the same range of deaths, infections and so on). You don't run one scenario, you run tens of thousands. As Wikipedia explains:

Outputs of the model are recorded, and then the process is repeated with a new set of random values. These steps are repeated until a sufficient amount of data is gathered. In the end, the distribution of the outputs shows the most probable estimates as well as a frame of expectations regarding what ranges of values the variables are more or less likely to fall in.

Which makes perfect sense considering there is a regression test in the code that produces consistent results for a single run, something that should not be possible if Sue Denim's claim of randomness were true. And that's why the Imperial researchers advise you to change the seeds as you go: otherwise you'd get exactly the same result tens of thousands of times, and analysing the distribution would be completely pointless. With different seeds you end up with an aggregation of all potential scenarios. This explanation makes far more sense than picking the Wikipedia article for “stochastic” and claiming the results are random.
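To make the seed argument concrete, here's a minimal sketch of how a stochastic simulation gets used. It has nothing to do with the actual Imperial code; the toy model, its numbers and its parameters are entirely made up for illustration.

# A toy stochastic "epidemic" (purely illustrative, not the Imperial model).
import random
import statistics

def toy_run(seed, days=80, growth=1.1):
    """One run: each day's new cases are drawn at random around
    `growth` times the previous day's cases."""
    rng = random.Random(seed)
    cases, total = 10.0, 0.0
    for _ in range(days):
        cases = max(0.0, rng.gauss(mu=cases * growth, sigma=cases * 0.3))
        total += cases
    return total

# Same seed, same result: this is what a checksum-style regression test relies on.
assert toy_run(seed=42) == toy_run(seed=42)

# Different seeds give different trajectories; the *distribution* over many
# runs, not any single run, is what gets analysed and reported.
results = [toy_run(seed=s) for s in range(10_000)]
print(statistics.mean(results), statistics.stdev(results))

With ten thousand seeds, perturbing any single run barely moves the summary statistics, which is exactly the property the researchers lean on.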

Clearly, the documentation wants us to think that, given a starting seed, the model will always produce the same results.

Well yeah, and the existing regression test uses checksums on the output files to verify the results are exactly as expected. The way checksums work, ANY change in the file gives a vastly different checksum, so it stands to reason that the output is consistent for the same seed. However, the author then mentions a ticket that raised an issue with non-determinism:

I’ll illustrate with a few bugs. In issue 116 a UK “red team” at Edinburgh University reports that they tried to use a mode that stores data tables in a more efficient format for faster loading, and discovered — to their surprise — that the resulting predictions varied by around 80,000 deaths after 80 days

The Imperial team’s response is that it doesn’t matter: they are “aware of some small non-determinisms”, but “this has historically been considered acceptable because of the general stochastic nature of the model”. Note the phrasing here: Imperial know their code has such bugs, but act as if it’s some inherent randomness of the universe, rather than a result of amateur coding. Apparently, in epidemiology, a difference of 80,000 deaths is “a small non-determinism”.

There is a lot happening here, so let's tackle the first point: the small non-determinism. That part is true. According to the Imperial team, when multiple CPUs are used to run the code there can be small variations per run. Keep in mind what we said earlier, though: it's one run out of tens of thousands, so it gets averaged out and small per-run differences do not matter in the grand scheme of things. The final pandemic model is not based on a single run but on running thousands of them, which is why you end up with pretty much the same results; the law of large numbers kicks in. So yes, an 80k difference in deaths is “a small non-determinism” as far as a single run out of thousands is concerned.
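For a sense of where benign multi-CPU non-determinism can come from, here is one classic mechanism, sketched with made-up numbers rather than anything taken from the Imperial code: floating-point addition is not associative, so splitting the same work across threads and combining the partial results can change the answer in the last few decimal places.

# Summing the same numbers in a different order (as happens when work is
# split across threads) can give a slightly different total, because
# floating-point addition is not associative. Toy example only.
import random

random.seed(0)
values = [random.uniform(0.0, 1.0) for _ in range(100_000)]

sequential = sum(values)                            # one "thread", fixed order
chunked = sum(sum(values[i::4]) for i in range(4))  # four "threads", then combined

print(sequential, chunked, sequential - chunked)
# The totals typically differ by a hair: harmless for results that get
# averaged over thousands of runs, but enough to break an exact comparison.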

So why did the Edinburgh team get a massive difference in results? According to the ticket, it was indeed caused by a bug in the reuse of networks that had been generated for the simulation. The bug was subsequently fixed, and we have no information about it skewing the final results in any sizeable fashion.

For a simulation of a country, using only a single CPU core is obviously a dire problem — as far from supercomputing as you can get.

That's just the author's opinion; they don't present any evidence that the simulation is unacceptably slow on a single core. If a single-threaded full simulation finishes in a reasonable amount of time, and we have to remember that adding multithreading also adds complexity, then I see no reason why this is such a dire problem. The whole point of multithreading is to cut down on time-to-finish, and if that isn't a glaring issue there's little point obsessing over performance fixes.

It’s clear from reading the code that in 2014 Imperial tried to make the code use multiple CPUs to speed it up, but never made it work reliably. This sort of programming is known to be difficult and usually requires senior, experienced engineers to get good results.

Fair point, yes it does! Sue Denim then goes on to explain that the bug was fixed, and further notes:

In other words, in the process of changing the model they made it non-replicable and never noticed.

Well, no, they made a single run non-replicable in some circumstances. The model is composed of many such runs, and if they still average out correctly then everything is fine.

Why didn’t they notice? Because their code is so deeply riddled with similar bugs and they struggled so much to fix them that they got into the habit of simply averaging the results of multiple runs to cover it up…

Yeah, well, combining the results of multiple runs is the point of stochastic modelling. We might as well accuse them of using math.

In issue #30, someone reports that the model produces different outputs depending on what kind of computer it’s run on (regardless of the number of CPUs). Again, the explanation is that although this new problem “will just add to the issues” … “This isn’t a problem running the model in full as it is stochastic anyway”.

This is waaaay less severe than it sounds; it's a very well-known issue in computer science. It boils down to this: most decimal fractions cannot be represented exactly in binary, so you end up with approximations. That often leads to divisions of seemingly simple numbers, e.g. 7.6/0.8, giving results that are ‘slightly off’. See below a run on a Python 3.6 interpreter I had readily available:

>>> 7.6/0.8
9.499999999999998

That means results might differ depending on the CPU architecture of the machine where you run your test. How does that impact the issue? Well, remember the checksums in the regression tests? The issue opener noticed their checksums were failing due to minimal fluctuations in variables within the code when using different architectures but identical inputs. Those small fluctuations cause small changes in the final output, and any change in the final output kills the test. The responder then notes that yes, this can happen, but since the final result is based on a large number of runs those errors average out. One run doesn't have to be perfect; the sheer quantity of runs compensates for that.
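To illustrate just how strict a checksum comparison is, here's a minimal sketch. The file contents and values are invented; the point is only that a change in the last decimal place of a single number produces a completely different digest.

# A checksum flags *any* difference in the output, however tiny.
import hashlib

def checksum(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

baseline = "day,deaths\n80,510000.000000\n"
rerun    = "day,deaths\n80,510000.000001\n"  # a last-decimal-place wobble

print(checksum(baseline))
print(checksum(rerun))
print(checksum(baseline) == checksum(rerun))  # False: the test fails loudly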

Although the academic on those threads isn’t Neil Ferguson, he is well aware that the code is filled with bugs that create random results. In change #107 he authored he comments: “It includes fixes to InitModel to ensure deterministic runs with holidays enabled”. In change #158 he describes the change only as “A lot of small changes, some critical to determinacy”.

And this makes perfect sense: they are aware of the issues, they know the sheer quantity of runs will mitigate them, and they keep working to fix them anyway. That sounds like any other software project.

Imperial are trying to have their cake and eat it. Reports of random results are dismissed with responses like “that’s not a problem, just run it a lot of times and take the average”, but at the same time, they’re fixing such bugs when they find them. They know their code can’t withstand scrutiny, so they hid it until professionals had a chance to fix it, but the damage from over a decade of amateur hobby programming is so extensive that even Microsoft were unable to make it run right.

As explained multiple times above, running it multiple times is literally the whole point of the method. If they had also been hiding their code to fix it covertly, why would they leave out in the open all that seemingly damning evidence in code commits/PRs? Worst conspiracy ever.

No tests

To set things straight from the start: I agree that the test coverage is horrendous. The one regression test only ensures that any difference in output gets flagged and checked manually, and other than that there's no testing whatsoever. I imagine it's the legacy of academic code.

Regressions like that are common when working on a complex piece of software, which is why industrial software-engineering teams write automated regression tests.

Very true! The author makes a lot of good arguments for proper funding and for embedding software engineers alongside academics. Let's remember that researchers are not software engineers with full-blown industry experience; similarly, you'd struggle to find a competent developer with experience in statistics and pandemic modelling. I'm a software engineer myself, but I'm (slowly) getting around to accepting that people have professional interests other than full-time IT careers, and if we don't recognise that we are setting the bar extremely high for anyone working at universities. Private companies hire researchers side by side with engineers; they don't expect them to be experts in both fields equally.

The Imperial code doesn’t seem to have working regression tests. They tried, but the extent of the random behaviour in their code left them defeated. On 4th April they said: “However, we haven’t had the time to work out a scalable and maintainable way of running the regression test in a way that allows a small amount of variation, but doesn’t let the figures drift over time.”

The link points to the issue described earlier, relating to CPU architecture differences when rounding floating-point numbers.

What Imperial is saying here isn't that they were defeated and just let all hell break loose, but that their test is very volatile and fails easily whenever the output data changes. That is a good thing: at worst, on top of catching real regressions, they also get some false positives. The test is not tuned to hide issues; it's tuned to highlight any small change in output and report it back, even one that's minimal and has no impact on the full simulation.

The research team also clarifies that ideally they would like to introduce some tolerance for small variations in the test, but they haven't had the time to build a check that allows harmless variation without letting genuine changes in output drift past unnoticed.
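Here's a minimal sketch of that kind of check, assuming outputs can be compared value by value against a fixed baseline; the function name, numbers and tolerance are mine, not the repository's.

# Compare a run's outputs against a fixed baseline with a small relative
# tolerance, so harmless floating-point noise passes but real changes fail.
import math

def outputs_match(expected, actual, rel_tol=1e-6):
    """True if every value is within a small relative tolerance of the baseline."""
    return len(expected) == len(actual) and all(
        math.isclose(e, a, rel_tol=rel_tol) for e, a in zip(expected, actual)
    )

baseline = [1200.0, 34500.0, 510000.0]

print(outputs_match(baseline, [1200.0000001, 34500.0000002, 509999.9999999]))  # True
print(outputs_match(baseline, [1200.0, 34500.0, 590000.0]))                    # False

Because the comparison is always made against the same stored baseline rather than the previous run, small deviations can't accumulate into the slow drift the team is worried about.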

Beyond the apparently unsalvageable nature of this specific codebase, testing model predictions faces a fundamental problem, in that the authors don’t know what the “correct” answer is until long after the fact, and by then the code has changed again anyway, thus changing the set of bugs in it. So it’s unclear what regression tests really mean for models like this — even if they had some that worked.

This is an extremely good point! And it is tackled nicely by this comment. We do not know what other methods the authors used to verify their findings. Of course that doesn't mean we cannot ask or question it, but it seems that Sue Denim decided jumping to conclusions was the best course of action. The broader questions remain: how do you test it, how do you verify it, and what do you compare the output to? None of those questions were directed at the researchers, though, and we are offered no evidence they were ever sent over via email or GitHub issues; they're just left to linger as if the academics were purposefully avoiding them.

Undocumented equations & continuing development

Much of the code consists of formulas for which no purpose is given. John Carmack (a legendary video-game programmer) surmised that some of the code might have been automatically translated from FORTRAN some years ago.

One of the GitHub issues mentions that the codebase relies on the RANLIB library, which comes in multiple flavours including Fortran, and that may be what Carmack was referencing. However, we do get an example:

For example, on line 510 of SetupModel.cpp there is a loop over all the “places” the simulation knows about. This code appears to be trying to calculate R0 for “places”. Hotels are excluded during this pass, without explanation.

Yeah, that is weird, and the researchers give you the option to open an issue and ask that very question. What seems to be missing here is context: this hasn't been an open-source project until very recently. For years it was a closed-source research project, and only recently was it adapted for public release. That means some things will be undocumented, odd and unexplained, often because the programmers knew what they meant, or because decisions were discussed between contributors, or because they're obvious to anyone familiar with the method. I'm not sure how the fact that hotels are omitted is evidence of a FORTRAN translation, though.

This bit of code highlights an issue Caswell Bligh has discussed in your site’s comments: R0 isn’t a real characteristic of the virus. R0 is both an input to and an output of these models, and is routinely adjusted for different environments and situations. Models that consume their own outputs as inputs is problem well known to the private sector — it can lead to rapid divergence and incorrect prediction.

There's no evidence in this statement that the model consumes its own output; it just has similar inputs and outputs. I'm happy to be proven wrong, but a quick glance at the documentation provided by the authors seems to imply there's no consuming of output: you specify an input R0 and get calculations of R0 in specific contexts on the output (different in households, different in what the authors call “places”). Sue Denim at least got to link a paper by Google, even though it seems completely irrelevant.
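To spell out the distinction, here's a toy sketch of what the documented interface appears to describe versus what the critique alleges. The function, numbers and split across settings are entirely hypothetical and are not taken from the Imperial code.

# Hypothetical illustration: an input R0 is reported back broken down by
# setting; nothing from the output is fed into the next run's input.
def run_model(r0_input):
    return {
        "households": r0_input * 0.4,
        "places":     r0_input * 0.4,
        "community":  r0_input * 0.2,
    }

# What the documentation appears to describe: fixed input, contextual outputs.
print(run_model(r0_input=2.0))

# A genuine feedback loop, which is what the critique alleges, would look
# more like this; there's no sign of it in the documented interface:
# r0 = 2.0
# for _ in range(10):
#     r0 = sum(run_model(r0).values())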

Despite being aware of the severe problems in their code that they “haven’t had time” to fix, the Imperial team continue to add new features; for instance, the model attempts to simulate the impact of digital contact tracing apps.

Adding new features to a codebase with this many quality problems will just compound them and make them worse. If I saw this in a company I was consulting for I’d immediately advise them to halt new feature development until thorough regression testing was in place and code quality had been improved.

First things first: as we've found out above, the issues are not as severe as the author has made them out to be. With that in mind, it's not surprising that a small team of researchers in the middle of a pandemic is desperately trying to balance bug fixing with adding new, potentially life-saving features. Beyond that, this sounds more like a solid argument for increased funding so that actual consultants and engineers can help with those problems.

The not-so-techy bit

Apart from the article missing the mark by a wide margin on the technical side, there are also some bits that look at best odd, and at worst like signs of the author's agenda. First we get this quote:

All papers based on this code should be retracted immediately. Imperial’s modelling efforts should be reset with a new team that isn’t under Professor Ferguson, and which has a commitment to replicable results with published code from day one.

There's no reason to retract all the papers, as Sue Denim hasn't really proved anything substantial in the end. There's no backing to their statements, and the lack of automated tests in the repo does not prove a lack of replicability or verification. With 30 years of professional experience they will surely have seen projects that generate valid results yet are mainly verified via manual testing and cross-referencing of output data. I'm not saying that's good practice, but there's more to it than “no tests = nothing works”. As mentioned before, this explanation tackles that issue well. To make matters worse for the argument, if you simply read the first few sentences of the abstracts linked in the project's readme you'll find this is not the only model used in the research papers. If independent models don't contradict each other, that's also a good sign of correctness.

It also seems odd that they call for the removal of Professor Ferguson from the team. Even if we assume he's a horrendous software engineer (and, to be honest, deciding whether someone is a good developer is a whole other complex debate entirely), he's still a world-class epidemiologist. There's no reason to doubt his credentials in that field; the most the author could call for is for Prof. Ferguson to never code again. There's no proof he was malicious or incompetent in his field, just that his C code is not up to industry standards.

On a personal level, I’d go further and suggest that all academic epidemiology be defunded. This sort of work is best done by the insurance sector.

Finally we get to the bit that convinces me there's a wider agenda to discredit scientists and erode public trust in research. To start off, the very reason this discussion is possible is that the research code was open sourced. The papers are open access and can be scrutinised by members of the public. The logical step for Sue Denim would be to demand that research in general become easier to access, that code be open sourced from the start with contributions welcomed, and ideally supported by software industry veterans. Instead they demand that the private sector take over, which means we would have no access to the methodology or codebase and would be at the mercy of whatever comes out of that black box. How can you accuse the researchers of trying to “hide the original file” while also promoting the idea of defunding open research in favour of closed-source research run by insurance companies?! It's not as if the private sector has the best track record of running complex systems, especially when profit trumps safety. Moreover, the only country (or at least one of the few) that has responded with appropriate measures is one that boasts an epidemiologist as its vice-president, not an insurance executive. I guess Taiwan has been lucky to get the good scientists while we have the ones straight from the Evil League of Evil.

Insurers employ modellers and data scientists, but also employ managers whose job is to decide whether a model is accurate enough for real world usage and professional software engineers to ensure model software is properly tested, understandable and so on. Academic efforts don’t have these people, and the results speak for themselves

This is true! I fully agree, they do not have those people. The reason they don't have them, though, is not that the academic world harbours a deep hatred of software engineers, but that good expertise costs money they don't have. The lack of funding is the very reason the codebase is in the state it's in; it's mad to expect a team of a few researchers to suddenly produce high-quality software that survives every angle of scrutiny from industry professionals. The actual solution here is to increase funding, embed software teams within academia, and constructively support positive changes to the way academics build and maintain their codebases. Software-wise, we got far more than we paid for, and it's extremely unfair to argue for a funding squeeze on the grounds that a piece of code without proper money behind it is bad. If I underpay people I can't then underpay them more because they underdeliver; it's pure madness.

Even more importantly, companies that do cutting-edge research employ researchers alongside engineers to handle the actual development; they don't expect people to be experts in every field.

And finally on this point, this heavy-handed attack makes it even less appealing for academics to release their code, as it suggests that whatever they publish will be misused to stage attacks.

The last thing I want to mention is “credentialism” and Sue Denim's use of anonymity to fight it. In the bio the author states:

I’ve chosen to remain anonymous partly because of the intense fighting that surrounds lockdown, but there’s also a deeper reason. This situation has come about due to rampant credentialism and I’m tired of it.

Fair enough, criticism is criticism regardless of where it comes from. So I guess it makes no sense, then, to start the article off with:

My background. I wrote software for 30 years. I worked at Google between 2006 and 2014, where I was a senior software engineer working on Maps, Gmail and account security. I spent the last five years at a US/UK firm where I designed the company’s database product, amongst other jobs and projects.

And when it suits the author we learn that John Carmack is

“a legendary video-game programmer”

which is meant to further boost their point. What happened to that evil credentialism? Even worse, if you go anonymous to avoid leaning on your own credentials, but then use your experience as an argument while ensuring that no one apart from Toby Young can verify any of it… you've made the whole exercise a bit redundant, haven't you?

Conclusions

Is the codebase beyond all repair? It doesn't look like it, and kudos to the team for releasing it, especially in light of the attacks they've sustained. If anything, all of this proves we need more funding and more collaboration between academia and professional software engineers, not less.

What's more concerning is that I feel the intentions behind Sue Denim's analysis were malicious. Lockdown Sceptics' code review appears sound on the surface; at first glance I was convinced Sue Denim was onto something. The more you dig into it, though, the more it looks like a disguised attempt to undermine trust in researchers, and frankly that's quite troubling. The author reaches conclusions completely at odds with their own complaints and seems to be pushing a specific world view. All in all, I don't like it.

One final thing to be aware of: the Lockdown Sceptics would have you believe the model overestimates fatalities resulting from the pandemic. Even if you do believe that Prof. Ferguson messed up horribly and the results are skewed, keep in mind that there is also the possibility he underestimated the impact of COVID-19. Both conclusions are equally plausible, but you don't see the Lockdown Sceptics entertaining the one unfavourable to their interests.

Big shout-out and thanks to my friend Sining Yeoh for reading through all the drafts and helping shape the final article with a good few dozen suggestions!

