Making the O-zone plots

I publish O-zone plots and estimates of infection density for states. How are these calculated, and why? It became obvious early on that using confirmed cases alone was problematic. Suppose state one tests 100 people and finds 25 positives, and state two tests 1000 people and finds 25 positives. Which one has more infections? The intuitively obvious answer is the one that only tested 100 people. But how can we estimate the true number of infections from the number of tests and the number of positives? I was already playing with this problem when I saw epidemiologist Marc Lipsitch retweet a blog post by Peter Ellis. Peter is a statistician and data science guru from New Zealand, sort of a Nate Silver from that part of the world, but more schooled in the theory of stats. Here is a link to that blog post.

Peter shows published data suggesting that the actual number of infections is proportional to the number of confirmed cases multiplied by the square root of the proportion of tests that were positive. This test positivity term can range from 0 to 1. I looked at US declines in infections and deaths using R (more on that later), which is a ratio of infections at two different times. It seemed to confirm his supposition, and also the proposal that deaths lag confirmed cases by 27 days.

That’s great, but we would really like to know the real number of infections. So, I can use test positivity to adjust a state’s confirmed cases to a new number of total infections. Many educated guesses, informed by people like Trevor Bedford and others, suggested the actual number of infections was 10–20 times higher than the number of confirmed cases. I figured I could make a guess and make a plot. So I did, and here is an example plot, using Indiana, of how confirmed cases become infection density.

The bottom plot shows total infections. I can add those numbers up to get the proportion of people in the state who have been infected, and match that to serum antibody data. In the last week of April, 2.8% of Hoosiers were antibody positive. Add up the numbers in the above plot, and they match, because I used that antibody figure to find the constant that scales confirmed cases and test positivity to total infections. The formula

Total infection = 19.6 * Confirmed Cases * sqrt(test positivity)
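As an illustration only, the formula can be written as a few lines of Python and applied to the two hypothetical states from the opening example. This is a sketch, not the analysis code itself: the function name is made up, and it assumes that test positivity is simply confirmed cases divided by tests.

```python
import math

SCALE = 19.6  # constant from the formula above


def estimate_infections(confirmed_cases, tests, scale=SCALE):
    """Estimate total infections from confirmed cases and test positivity.

    Assumption: positivity = confirmed cases / tests, a fraction from 0 to 1.
    """
    if tests <= 0:
        raise ValueError("tests must be positive")
    positivity = confirmed_cases / tests
    return scale * confirmed_cases * math.sqrt(positivity)


# The two hypothetical states: same 25 positives, different testing volume.
state_one = estimate_infections(25, 100)    # 25% positivity
state_two = estimate_infections(25, 1000)   # 2.5% positivity
print(round(state_one, 1), round(state_two, 1))
```

With the same 25 positives, the state that tested only 100 people comes out at roughly three times the estimated infections of the state that tested 1000, which matches the intuition described earlier.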

defines my estimation process. Using this, you can infer the US infection fatality risk is 0.6%, and you can overlay deaths and infections in the USA. However, along the way I found that the lag between cases and deaths changed. In March, when cases were shooting up, the lag from reported confirmed cases to deaths was about 7–8 days. Now, it is 27 days, as seen in my R example above. Why? Well, very early on, you could only be tested if the hospital was about to admit you, and you could only be admitted if you needed oxygen. Today, you get tested at first symptoms, and there should be about a 20 day difference between those two approaches. Anyway, here is that overlay of deaths and infection density in the US, but keep in mind that the left edge of the infection density curve should be stretched about 20 days earlier.

These data suggest that deaths are about to turn up. Now, on to the details of R and infection density.

To calculate infection density, I take the sum of confirmed cases over the 10 most recent days and the number of tests over the same 10 days, and use the confirmed cases and the resulting test positivity to calculate infections with my formula from above. Then, I normalize by the population in thousands.
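The infection density step can be sketched as below. The function and parameter names are hypothetical, the inputs are assumed to be lists of daily counts with the most recent day last, and this is a reading of the description above rather than the actual analysis code.

```python
import math


def infection_density(daily_cases, daily_tests, population,
                      window=10, scale=19.6):
    """Estimated infections per 1000 residents over the last `window` days.

    Assumptions: daily_cases and daily_tests are aligned lists of daily
    counts, most recent day last; positivity is cases / tests in the window.
    """
    recent_cases = sum(daily_cases[-window:])
    recent_tests = sum(daily_tests[-window:])
    if recent_tests <= 0:
        raise ValueError("no tests in the window")
    positivity = recent_cases / recent_tests
    infections = scale * recent_cases * math.sqrt(positivity)
    return infections / (population / 1000.0)  # normalize per thousand people


# Made-up example: 100 cases and 1000 tests a day in a state of one million.
density = infection_density([100] * 10, [1000] * 10, population=1_000_000)
print(round(density, 2))
```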

To calculate R, I take the 10 most recent days of confirmed cases and apply a Hamming window to them. This is a smoothing function that makes R track better from day to day. Then, I multiply the windowed sum by the square root of the test positivity over the same 10 day period. I then take the ratio between this 10 day sum and the same sum 14 days earlier. Case counts follow a 7 day periodicity, and I use two periods, or 14 days, to keep things tracking more smoothly. So far so good, but R is an epidemiological term, and it is evaluated over the reproductive cycle of the virus. The mean cycle for COVID-19 is 5.2 days. So, I pro-rate this ratio to a 5.2 day ratio by raising it to the power (5.2/14).
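A minimal sketch of the R calculation described above, using only the Python standard library. The names are hypothetical. Two details are assumptions rather than something stated in the text: the window uses the standard Hamming weights 0.54 - 0.46*cos(2*pi*k/(N-1)), and positivity is computed from the unweighted case and test totals in each window.

```python
import math


def hamming_weights(n):
    """Standard Hamming window weights for n points."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def windowed_signal(cases, tests, end, window=10):
    """Hamming-weighted case sum times sqrt(positivity) for days [end-window, end)."""
    c = cases[end - window:end]
    t = tests[end - window:end]
    weighted_cases = sum(w * x for w, x in zip(hamming_weights(window), c))
    positivity = sum(c) / sum(t)  # assumption: unweighted totals
    return weighted_cases * math.sqrt(positivity)


def estimate_r(cases, tests, window=10, lag=14, cycle=5.2):
    """Ratio of the current windowed signal to the one `lag` days earlier,
    pro-rated to one reproductive cycle by raising it to (cycle / lag)."""
    n = len(cases)
    if len(tests) != n or n < window + lag:
        raise ValueError("need at least window + lag days of aligned data")
    now = windowed_signal(cases, tests, n, window)
    before = windowed_signal(cases, tests, n - lag, window)
    return (now / before) ** (cycle / lag)


# Flat cases and flat testing should give an R of 1.
flat_r = estimate_r([100] * 30, [1000] * 30)
print(flat_r)
```

As a sanity check on the pro-rating, cases that double every 14 days at constant positivity give a 14 day ratio of 2 and an R of 2 raised to (5.2/14), or about 1.29.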

If you do this for all 50 states, you can plot the position of each state based on its most recent R and infection density. States can be fairly compared to each other using these methods. Then, I added a goal, suggested by my friend Dave O. I shaded in the region in which infection density is 10 times lower than the level that begins to cause hospital stress, and in which R is under 0.9. This is the safety O-zone. I recently added a second zone, whose border is where the current R, if maintained for two weeks, would result in an infection density of 45, which is roughly the peak seen in NJ. I give my analysis code to anyone who wants to see it, and followers have suggested several improvements: color coding the states by geography, adding a shadow marker showing each state's position three days ago with a line drawn to its current position, and changing the shading to red for trouble and green for safety.
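One way to express the two zones in code is sketched below. Several things here are assumptions and not taken from the actual analysis code: the function names, the use of 15 as the infection density at which hospital stress begins, and the projection rule, which compounds R once per 5.2 day cycle over 14 days.

```python
def project_density(density, r, days=14, cycle=5.2):
    """Infection density after `days` if R stays constant.

    Assumption: density is multiplied by R once per reproductive cycle.
    """
    return density * r ** (days / cycle)


def classify_state(density, r, stress_density=15.0, peak_density=45.0):
    """Place a state in the safety zone, the trouble zone, or in between.

    stress_density=15 is an assumed hospital-stress level; peak_density=45
    is the rough NJ peak mentioned in the text.
    """
    if density < stress_density / 10 and r < 0.9:
        return "safety"
    if project_density(density, r) >= peak_density:
        return "trouble"
    return "between"


# Made-up (density, R) pairs for illustration.
for density, r in [(1.0, 0.8), (5.0, 1.0), (20.0, 1.4)]:
    print(density, r, classify_state(density, r))
```

The thresholds are keyword arguments so that a different stress level or peak can be tried without touching the logic.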

These definitions are imprecise. States rarely get into trouble before the infection density reaches 15, but parts of states can. We’ve seen Montgomery, Alabama in trouble when the state was not, and southern Florida and Tampa in trouble now, and San Antonio and Houston in trouble now (July 3). But we are also seeing South Carolina not in as much trouble as I thought with an infection density of 20. Elective procedures are only canceled in Charleston, and in other regions the hospitals added surge capacity early and kept performing elective procedures. Here is the current O-zone plot.

I started this to be able to fairly compare one state with another, and to let people see how much trouble their state was or was not in. It defines metrics that are usable by hospitals. Several large banks and hedge funds have asked for my code to help guide their fundamentals. And some real good Samaritans have used the code to conduct county analysis in their states to identify problem areas. It was my hope only that this work would be helpful. My area of research is brain physiology, where I use every bit as many statistical tools as I use here, but for very different reasons. My Biomedical Engineering PhD involved a lot of statistics and signal processing, and I’ve been an avid coder since I was in the 7th grade. I hope you find it helpful. If you want the code, email dblake AT augusta DOT edu.