Immunological dark matter

Jun 3, 2020

by Karl Friston (Professor of Neuroscience, University College London)
and
Deenan Pillay (Professor of Virology, University College London)

‘Immunological dark matter’ headlined a recent question-and-answer piece that incited a series of challenging and important exchanges on social media. Foregrounding ‘dark matter’ was probably an astute editorial move, but one that calls for some careful scientific qualification.

On dark matter

‘Dark matter’ was used to convey the notion that, in epidemiological models, there exist certain causes of epidemiological and sociodemographic data that may not be easily observed and may therefore need to be inferred. In short, there are latent (a.k.a. hidden) causes that cannot be seen directly but are necessary to explain what can be seen. The particular dark matter referred to above comprises a subset of the population that participates in the epidemic in a way that renders its members less susceptible to infection, or less likely to transmit the virus. Entertaining this kind of dark matter represents a departure from basic infectious-disease epidemiological approaches that assume 100% population susceptibility. Technically, the evidence for this dark matter is overwhelming, in the sense that the evidence (a.k.a. marginal likelihood) of models with this subpopulation is much greater than the evidence of equivalent models without it. This raises a key question:

What is the nature of this subpopulation? This question is important because including a non-susceptible proportion in epidemiological models determines the (asymmetric) course of the outbreak, particularly its tail. In turn, this becomes important in terms of ‘unlocking’ policies and the potential for rebounds. Furthermore, it interacts with the putative mechanisms for a second wave, which, under the models in question, follow from a loss of population immunity. Finally, the prevalence of a non-susceptible population has quantitative implications for the efficacy and selectivity of testing and tracking.

In the dynamic causal modelling of the coronavirus outbreak, the foundational models considered a susceptible population whose size was estimated from the data. The susceptible population was defined stipulatively as those individuals who would eventually develop (an enduring) immunity, an assumption that can now be challenged. The report describing these models was prepared for the RAMP initiative, to showcase the potential of variational Bayes in epidemiological modelling. It concluded with a narrative for what might happen in London, originally intended simply to illustrate the kinds of predictions dynamic causal modelling could make. These predictions were subsequently realised, lending the model a predictive validity that was not anticipated. Of particular interest here was the estimate that London's susceptible population was about 2.5 million, 80% of whom would be immune by 8 May:

“Improvements should be seen by May 8, shortly after the May bank holiday, when social distancing will be relaxed. At this time herd immunity should have risen to about 80%, about 12% of London’s population will have been tested”

This corresponds to a seroprevalence of about 19% of London's (9.5 million) population, which is close to the estimates of 17% for London at this time. On 10 May, lockdown was relaxed following an announcement by the Prime Minister. If one assumes that the total population of London is sufficiently well mixed to license the usual epidemiological assumptions, this means that a substantial proportion of London's population was not susceptible. Subsequent models therefore treated the total population as a mixture of susceptible and non-susceptible individuals, for example when considering population flux between regions. At this point, the non-susceptible (dark matter) component became necessary to explain the time series of new cases and deaths. In terms of model structure, this meant estimating the proportion of the total population that was susceptible, instead of the size of the susceptible population.
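
As a purely illustrative sketch of that structural difference (this is not the dynamic causal model itself; the function, parameter names and values below, including the roughly 26% susceptible proportion implied by 2.5 million out of 9.5 million, are placeholders), one could write:

```python
import numpy as np

# Illustrative SEIR-type recursion in which the *proportion* of the population
# that is susceptible is itself a free parameter; the remainder is the
# non-susceptible ('dark matter') component. Not the DCM code; values are placeholders.
def seir_with_nonsusceptible(N=9.5e6, p_susceptible=0.26, beta=0.3,
                             incubation=5.0, infectious=7.0, days=180, seed=100):
    S = p_susceptible * N - seed               # susceptible individuals
    E, I, R = float(seed), 0.0, 0.0            # exposed, infected, removed
    non_susceptible = (1 - p_susceptible) * N  # mixes with everyone but is never infected
    daily_infected = []
    for _ in range(days):
        new_E = beta * S * I / N               # contacts are spread over the whole population
        new_I = E / incubation
        new_R = I / infectious
        S, E, I, R = S - new_E, E + new_E - new_I, I + new_I - new_R, R + new_R
        daily_infected.append(I)
    return np.array(daily_infected)
```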

There are several clear candidates for the non-susceptible (dark matter) proportion, many of them central constructs in epidemiology. A short list could include:

  • A subpopulation that is sequestered from infection by virtue of being shielded or geographically isolated from infected cases. Both of these subpopulations may be time-dependent; for example, advice to people who are self-isolating may change. This means that even individuals embedded in a susceptible population will change their transmission characteristics over time. Those people who have yet to experience a local outbreak may, at some point, move from a non-susceptible to a susceptible population as a wave of infection encroaches on their region. The modelling of this sort of dark matter clearly depends upon many geospatial factors, such as population density and population fluxes among regions (e.g. commuting).
  • A second candidate for dark matter comprises those individuals who show some pre-existing immunity, possibly via cross-immunity with other betacoronaviruses. More recent data support this possibility, and such pre-existing immunity may be measurable. In addition, certain host factors, such as differential expression of the SARS-CoV-2 receptor, may render these individuals less susceptible to infection. This would reduce their ability to transmit the virus, either by making them resistant to infection or through reduced viral replication in the upper respiratory tract.
  • This is distinct from the notion of super-spreaders, which speaks to a heterogeneity in transmission dynamics characterised in terms of overdispersion. This aspect of epidemiological spread (underwriting clusters of outbreaks and institutional amplification) has important implications for self-limiting mechanisms. For example, a subpopulation of super-spreaders would imply a more curtailed expression of the epidemic, with successive clustering events that ‘fizzle out’ (a toy simulation of this follows the list). Another perspective is that the subpopulation that matters is the subpopulation of people with a propensity to transmit the virus, while the rest of the population could be construed as at least grey, if not dark, matter. The notion of a ‘super-spreader’ can be misconstrued as a purely individual characteristic; however, it may equally reflect the environment an infected individual inhabits. For instance, of two individuals with the same viral load (infectivity), one may be a super-spreader by way of being a key worker in contact with many others, while the second may be able to work at home.
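
As promised above, here is a toy branching-process simulation of overdispersed transmission; the parameters (R0, dispersion k) are illustrative and are not estimates from the models under discussion:

```python
import numpy as np

rng = np.random.default_rng(2)

def outbreak_size(R0=2.5, k=0.1, max_cases=10_000):
    """One introduction; secondary cases per case are negative binomial with
    mean R0 and dispersion k (small k: a few super-spreaders, many dead ends)."""
    cases, active = 1, 1
    while active and cases < max_cases:
        # number of secondary cases produced by each currently active case
        offspring = rng.negative_binomial(n=k, p=k / (k + R0), size=active)
        active = int(offspring.sum())
        cases += active
    return cases

sizes = [outbreak_size() for _ in range(200)]
print(sum(s < 10 for s in sizes), "of 200 introductions fizzle out below 10 cases")
```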

These hypotheses are important from the point of view of forward or generative models of data, especially when trying to reconcile the relatively low number of infected and seropositive cases with fatality rates. In principle, all of these hypotheses can be installed into generative models (for example, by adapting code available here) and assessed in terms of their respective model evidence. However, this rests upon a key modelling issue:

On modelling

Discussions about whether to include states like the so-called dark matter necessarily call upon epidemiological modelling. The key contribution of the variational procedures that underwrite dynamic causal modelling is that they enable model comparison. In brief, it is not sufficient just to use Bayesian procedures to invert epidemiological models; this inversion has to furnish an estimate of model evidence, so that one model can be compared with another. The Bayesian techniques that prevail in the epidemiological modelling literature are almost universally based on sampling procedures. This has three important consequences:

First, sampling procedures are slow and computationally intensive, requiring hours on high-performance computers. Variational procedures exploit mean-field approximations to provide posterior densities and model evidence within a minute or so on a laptop.
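
As a toy illustration of the kind of shortcut involved (a schematic stand-in only; this is not the variational Laplace scheme used in dynamic causal modelling, and the model and numbers are made up), a fixed-form Gaussian approximation returns a posterior density and an approximate log evidence almost instantly:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.poisson(lam=20.0, size=30)             # synthetic daily case counts

def neg_log_joint(theta):
    """Negative log joint: Poisson likelihood times a Gaussian prior on the log-rate."""
    log_rate = theta[0]
    rate = np.exp(log_rate)
    log_lik = np.sum(y * log_rate - rate)       # Poisson log-likelihood (up to constants)
    log_prior = -0.5 * (log_rate - 3.0) ** 2    # N(3, 1) prior on the log-rate
    return -(log_lik + log_prior)

opt = minimize(neg_log_joint, x0=np.array([2.0]), method="BFGS")
posterior_mean = opt.x                          # mode of the Gaussian approximation
posterior_cov = opt.hess_inv                    # curvature at the mode -> posterior covariance

# Laplace-style estimate of the log evidence (up to the constants dropped above)
log_evidence = -opt.fun + 0.5 * np.log(np.linalg.det(2 * np.pi * posterior_cov))
```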

Second, the complexity and expressivity of the models are necessarily constrained by the computational complexity of their inversion or fitting. Expressivity is a notion from machine learning: roughly, the range of data a model is able to generate. A key example here is the use of vanilla SEIR models. In most instances, these are not expressive enough to generate the kind of data at hand. A simple example is data on new cases. To generate these data, one would need to know latent causes, such as the selectivity of testing for people who are and are not infected, and the underlying prevalence of infection. In turn, these latent causes depend upon social distancing and so on. In short, the call here is for models of sufficient expressivity to generate the available data.
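
For instance, even a minimal generative sketch of daily reported cases (with made-up numbers) needs three latent causes: the prevalence of infection, testing capacity and the selectivity of testing towards infected people:

```python
import numpy as np

rng = np.random.default_rng(1)
days = 120
prevalence = 0.05 * np.exp(-((np.arange(days) - 40) / 15.0) ** 2)  # latent prevalence of infection
tests_per_day = np.linspace(1_000, 20_000, days).astype(int)       # testing capacity ramps up
selectivity = 20.0  # assumed odds ratio: infected people are far more likely to be tested

# Probability that any given test is administered to an infected person
p_test_positive = selectivity * prevalence / (selectivity * prevalence + 1 - prevalence)

# Reported new cases conflate the epidemic with the testing regime
new_cases = rng.binomial(tests_per_day, p_test_positive)
```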

The third problem with sampling-based approaches is that it is notoriously difficult to evaluate the model evidence from sample distributions. This usually means that people resort to large-sample approximations like the Bayesian Information Criterion. Conversely, variational Bayes offers a more veridical handle on model evidence (known as the variational free energy or an evidence bound). This enables one to score different models and hypotheses and find the right level of ‘expressivity’ for the available data. This is (and was) the primary motivation for considering dynamic causal modelling and related approaches based upon density dynamics.
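
In standard notation (assumed here rather than quoted from the piece: y for data, θ for parameters, q(θ) for the approximate posterior, k for the number of parameters and n for the number of data points), the free energy bounds the log evidence of a model m from below, whereas the Bayesian Information Criterion only approximates it in the large-sample limit:

```latex
\ln p(y \mid m) \;\geq\; F(m) \;=\; \mathbb{E}_{q(\theta)}\!\left[\ln p(y,\theta \mid m)\right] - \mathbb{E}_{q(\theta)}\!\left[\ln q(\theta)\right],
\qquad
\ln p(y \mid m) \;\approx\; \ln p(y \mid \hat{\theta}, m) - \tfrac{k}{2}\ln n \;=\; -\tfrac{1}{2}\,\mathrm{BIC}.
```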

Bayesian model comparison is at the heart of dynamic causal modelling. It brings certain perspectives that can sometimes seem counterintuitive. For example, there is no notion of falsification: claims like “this model can be falsified because of this and that” have no meaning. All models are evaluated in relation to each other, in terms of their relative evidence. Crucially, models change as they assimilate new data (via a process of Bayesian belief updating). In short, having the ability to compare different models means that one has the ability to explore a model space, as prior beliefs become more precise and more data become available.
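
In the same (assumed) notation, relative evidence is simply the log Bayes factor, which variational schemes approximate by a difference in free energies:

```latex
\ln \mathrm{BF}_{12} \;=\; \ln p(y \mid m_1) - \ln p(y \mid m_2) \;\approx\; F(m_1) - F(m_2).
```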

The best model has the highest evidence and provides an accurate account of the data in the simplest way possible. The implicit penalisation of complexity is important because it precludes overfitting and underwrites generalisation — and predictive validity. When using Bayesian model comparison to identify the most plausible model, one is committing to a formal explanation of the data at hand. For example, the predictions about social distancing being relaxed on 12 June refer to social distancing as parameterised in the model (i.e., the probability of leaving home).
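
The complexity penalty mentioned here is implicit in the usual decomposition of the evidence bound (again in assumed, standard notation):

```latex
F(m) \;=\; \underbrace{\mathbb{E}_{q(\theta)}\!\left[\ln p(y \mid \theta, m)\right]}_{\text{accuracy}} \;-\; \underbrace{D_{\mathrm{KL}}\!\left[q(\theta)\,\|\,p(\theta \mid m)\right]}_{\text{complexity}}.
```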

On playing the Feynman card

The variational procedures above rest upon the work of Richard Feynman on path integrals and of Ray Solomonoff and others on algorithmic complexity. The Feynman legacy is particularly interesting: the variational free energy (or evidence bound) was first introduced in the path integral formulation of quantum electrodynamics. This is interesting because Feynman was contending with exactly the same problem that epidemiological modellers confront today: he was interested in evaluating the probability density over the trajectory taken by an electron from an initial state to a final state. The implicit integration problem was clearly intractable. He therefore introduced a variational bound on the marginal likelihood. In effect, this converted an insoluble integration or sampling problem into a tractable optimisation problem. This is analogous to the problem we are currently facing; namely, evaluating the probability density over epidemiological trajectories from the current point in time to some endemic equilibrium.
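
In the same spirit, the step from an intractable integral to a tractable optimisation is Jensen's inequality applied to the marginal likelihood (standard notation, assumed rather than quoted):

```latex
\ln p(y \mid m) \;=\; \ln \int p(y, \theta \mid m)\, d\theta \;\geq\; \int q(\theta)\, \ln \frac{p(y, \theta \mid m)}{q(\theta)}\, d\theta \;=\; F(m),
```

with equality when q(θ) is the true posterior; maximising F with respect to q is the tractable optimisation problem referred to above.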
