COVID-19 — Hospitals Can Make Better Decisions With Data

Andy Mandrell
Published in The Startup
4 min read · Apr 17, 2020

During the outbreak of an infectious disease, a central challenge is making accurate, reliable decisions as quickly as possible.

source: Nathalie Lees, via The Economist

Complementing the many diverse and comprehensive models being built to understand the development of COVID-19, I will apply and analyze a time-series model of the growth of COVID-19 hospitalizations. A common task in time-series modeling is smoothing, a technique used to remove noise from data. In this context, the goal is to determine whether there is a smooth, underlying growth process of COVID-19 hospitalizations behind the noisy count data.

First, we need to get the data. We will use count data from the early stage of the outbreak — January 1st, 2020 to February 28th, 2020. Why? We want to approximate the growth of hospitalizations at the beginning of the outbreak — we will see why this is important later.
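As a rough sketch of this step (the file name and column names here are assumptions, not the data source used in the original analysis), loading and windowing the counts might look like:

```python
import pandas as pd

# Hypothetical CSV of daily COVID-19 hospitalization counts with columns
# "date" and "hospitalizations"; substitute your own data source.
df = pd.read_csv("covid19_hospitalizations.csv", parse_dates=["date"])

# Keep only the early stage of the outbreak: Jan 1, 2020 through Feb 28, 2020.
early = df[(df["date"] >= "2020-01-01") & (df["date"] <= "2020-02-28")].copy()

# Encode time as the number of days since the first observation, so the
# model sees a simple numeric covariate t.
early["t"] = (early["date"] - early["date"].min()).dt.days
counts = early["hospitalizations"].to_numpy()
```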

To avoid a sleepy read of a mathematical proof, I will briefly explain the method used to model the count data. My goal is to use observed data to model 𝑋(𝑡), the number of hospitalizations due to COVID-19 on day 𝑡. Ideally, we would like to predict the value of 𝑋 for unobserved 𝑡 in the future.

In a generalized linear model (GLM), we can model 𝑋 with any distribution belonging to the exponential family whose mean (the latent variable 𝑍) is related to a linear function of 𝑇 (a random variable denoting the possible values of 𝑡) through an invertible link function. Once we have specified the output distribution and link function, the goal is to find the maximum likelihood estimate of the parameters β. With that estimate in hand, we can predict the values of 𝑋 for new inputs 𝑇.

Fast-forwarding through the proof, we find the following formulas:
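The formulas appear as an image in the original post; a rough reconstruction consistent with the surrounding text (exponential growth for 𝑍, a log link, and a Poisson output distribution) looks something like:

$$\mathbb{E}[X(t)] = Z(t) = e^{\beta_0 + \beta_1 t} \qquad (1)$$
$$g(Z) = \log Z = \beta_0 + \beta_1 t \qquad (2)\ \text{(link function)}$$
$$\boldsymbol{\beta} = (\beta_0, \beta_1) \qquad (3)\ \text{(parameters)}$$
$$X(t) \mid Z(t) \sim \mathrm{Poisson}\big(Z(t)\big) \qquad (4)\ \text{(output distribution)}$$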

So far, this is promising. Intuitively, we would expect 𝑋 to have a mean 𝑍 on day 𝑡, where we assume that exponential growth for 𝑍 is reasonable (according to epidemiology). Formulas 2–4 are the derived link function, parameters, and output distribution for this generalized linear model. In this scenario, a natural choice for the output distribution is Poisson.

Now, let’s fit the model and also include 95% confidence intervals on 𝑋 and 𝑍.
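Here is a minimal sketch of that fit using statsmodels (an assumption on my part; the original article does not show its code), reusing early and counts from the loading sketch above:

```python
import statsmodels.api as sm
from scipy.stats import poisson

# Design matrix: intercept plus day index t. With a log link this gives
# the exponential-growth mean Z(t) = exp(beta_0 + beta_1 * t).
X_design = sm.add_constant(early["t"].to_numpy())

poisson_fit = sm.GLM(counts, X_design, family=sm.families.Poisson()).fit()

# 95% confidence interval for the mean Z(t).
z_summary = poisson_fit.get_prediction(X_design).summary_frame(alpha=0.05)

# A rough 95% interval for the counts X(t) themselves, taken from Poisson
# quantiles at the fitted mean.
x_lower = poisson.ppf(0.025, z_summary["mean"])
x_upper = poisson.ppf(0.975, z_summary["mean"])

print(poisson_fit.params)  # estimates of beta_0 and beta_1
print(z_summary[["mean", "mean_ci_lower", "mean_ci_upper"]].head())
```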

Model with Output Distribution as Poisson

Something is unusual about this model: the confidence intervals for the modeled growth rate are far too narrow and do not capture the variability of the observed count data. This happens because the Poisson distribution forces the variance to equal the mean, while the observed counts are more variable than that, a phenomenon called over-dispersion. To fix this issue, we can replace the output distribution (currently Poisson) with one that accounts for over-dispersion. The Negative Binomial distribution also models count data, but allows a wider spread/variance relative to a Poisson distribution with the same mean.
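As a sketch of that change (again assuming statsmodels, and reusing counts and X_design from the Poisson sketch above), one way to estimate the dispersion and refit is:

```python
import statsmodels.api as sm

# Estimate the Negative Binomial dispersion parameter alpha along with the
# regression coefficients (alpha is the last entry of params).
nb_discrete = sm.NegativeBinomial(counts, X_design).fit(disp=False)
alpha_hat = nb_discrete.params[-1]

# Refit as a GLM with that alpha so we can reuse get_prediction() for the
# 95% confidence interval on the mean Z(t).
nb_glm = sm.GLM(
    counts, X_design,
    family=sm.families.NegativeBinomial(alpha=alpha_hat),
).fit()
nb_summary = nb_glm.get_prediction(X_design).summary_frame(alpha=0.05)

print(nb_glm.params)  # beta estimates under the wider-variance model
print(nb_summary[["mean", "mean_ci_lower", "mean_ci_upper"]].head())
```

With the Negative Binomial family, the variance is Z + αZ² rather than Z, so the fitted intervals widen to better match the spread in the observed counts.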

Model with Output Distribution as Negative Binomial

Even after using the Negative Binomial distribution to improve our model, we need to be careful about how we interpret our results. Can we actually use this model to understand whether the growth rate of COVID-19 hospitalizations will be exponential in the future? Some factors to be mindful of when interpreting our results include:

  1. Subpopulations with faster growth may quickly hit saturation in that subpopulation, and then the remaining growth in the overall number of hospitalizations would be due to the slower growth of the other subpopulation(s).
  2. The number of confirmed COVID-19 cases is confounded by underreporting (most infected people never get tested) and by a rapid increase in testing capacity (the total number of people tested keeps growing).
  3. An exponential growth model implicitly assumes unlimited hospital capacity, since it does not account for saturation in the number of hospitalizations. It also does not incorporate deaths in the population.
  4. COVID-19 data (confirmed cases, hospitalizations, etc.) come on a time lag (which itself is likely stochastic).

It is important to highlight that both of these models, despite their limitations, are still powerful tools for generating insight for the public. Although exponential growth does not usually describe an infectious disease outbreak as time goes on, it has reliably been shown to model the initial stages of an outbreak. If we can approximate and bound the underlying growth of hospitalizations at the initial stage of an outbreak, then we can proactively prepare equipment and estimate demand for hospitals. If hospital needs can be fulfilled before an explosion in hospital visits, lives can be saved.

Andy Mandrell
Data Engineer at Capital One, Data Science @ UC Berkeley