**Aggressive and widespread: On systematically estimating the number of tests to detect COVID-19 infections**

## Srijan Bansal, Aadi Swadipto Mondal, Vishal Garimella, Abir De*, Animesh Mukherjee, Mainack Mondal

Indian Institute of Technology, Kharagpur, India, *Indian Institute of Technology, Bombay, India

*Disclaimer: This is a scientific exposition. We understand and appreciate the growing public concern about the relevance and the diversity of the studies being performed on the novel Coronavirus. Our intention is not to further intimidate the audience. We leave it up to the free will of the audience to proceed any further from this point.*

**Testing: The vanguard in the fight against COVID 19**

It was a bright sunny February week in India. Suddenly, over a span of a mere three or four days, we came to know about a disease that was (and is) sweeping across the globe, leaving patients gasping for breath and dead bodies in its wake. That disease did not (and still does not) have a cure, and humanity had not encountered the responsible virus before this outbreak. Within a month, that disease, by then named Coronavirus disease 2019 (COVID-19), was declared a pandemic. The word “pandemic” officially established the impact of COVID-19 on the whole world. COVID-19 also started to spread through India soon after this declaration. From that time till now, the people in the front line of human resistance throughout the world (the medical professionals, the policymakers and the data analysts) have been putting up a very strong fight against COVID-19.

However, in spite of the best efforts of those brave men and women, COVID-19 has infected millions of people worldwide and killed tens of thousands so far. The rapid rate of infection as well as the novelty of the Coronavirus created an acute problem for health authorities around the world. Consequently, the confusion and worry among the general public are immense. Everyone has obvious and unanswered questions: “Am I infected?”, “What happens if I am infected?”, “Is there a cure?” In fact, medical professionals, lacking an established treatment protocol, have emphasized aggressive and widespread testing as the vanguard in the fight against COVID-19. For example, Dr. Claire Standley, Assistant Research Professor at the Center for Global Health Science and Security at Georgetown University, nicely summarized the importance of testing [1]:

“*…widespread testing can help identify mild or asymptomatic cases early, and prevent those individuals from unknowingly spreading the disease further. Aggressive testing and intensive case management, when conducted hand in hand, can be effective at limiting (although not always stopping completely) community transmission…*”

The same sentiment is reflected by the Organisation for Economic Co-operation and Development (OECD), which mentioned Testing-Tracking-Tracing as a way to stop local outbreaks [2]. Thus, testing is currently the best strategy to control the pandemic.

# But… How many tests should there be?

However, even this avenue of defending against COVID-19 is not without practical problems. Widespread testing for Coronavirus is expensive and requires healthcare infrastructure (unlike, say, testing for flu, which can be done at home). In fact, in countries like India, due to the scale, it is nigh impossible to match the number of tests per million population performed in countries like Singapore, South Korea or Germany. So policymakers have to think through the testing strategy: *“Who should be tested?”, “How do we know the required amount of testing in a country?”, “Should we do random testing to know the extent of infection in our society, or should we confine ourselves to testing only the people who have developed severe symptoms?”* These are all pertinent policy questions which need to be answered before policymakers can create strategies for widespread testing.

Unfortunately, all countries today are running a tight ship: they are hard-pressed for time and resources. So most countries are currently following either a *more-tests-are-better* approach or a *we-will-test-you-only-when-you-show-symptoms* approach [3] (or both or neither, depending on the available resources and the severity of disease spread). However, in either of these approaches a key question needs to be answered: what should the number of tests be? In fact, in the past few months some countries like India have attracted significant flak from the international community and media because their number of tests is not on par with that of other countries [8]. Unfortunately, estimating the required number of tests in a country is not simple.

To demonstrate the complexity of the problem, consider the following: Is the same proportion of tests necessary in India as in Italy? Should Germany follow in the footsteps of Singapore in terms of the number of tests required? Should we concentrate on the number of infected patients or on the total population of a country while settling on the number of tests? Intuitively, from the point of view of defending against COVID-19, each country is unique. The interactions between people of different countries follow different cultural norms, the virus strains can be different, and even the immunity levels of the populations are different. Moreover, in the case of India, a country with 1.3 billion people, scaling the number of tests proportionately with the number of people is out of the question. However, scaling the tests with the number of detected infections might create a vicious cycle: #tests will be low since #detected infections are low, and vice versa. To that end, we acknowledge the requirement of testing and put forward a question which is both important and unanswered:

How to systematically estimate the minimum number of tests required in a given location?

Please note that our exploration is not about the type of testing (random/targeted) or the testing technology to be used. Those questions are already tackled by medical professionals and policy analysts. Rather our question is complementary to those efforts — given a location and its history of *reported infections* is there a need for an additional number of tests? If yes, how many tests at a minimum should be carried out?

**Idea: Modeling pandemic spread to the rescue**

**The current conundrum:** The spread of COVID-19 and the development of statistical models to predict the disease spread are very highly correlated. To demonstrate this phenomenon we did a preliminary check on https://arxiv.org: we compared the number of papers written on COVID-19 with the number of those papers that are on modeling COVID-19. As of 24th May 2020, the search query “*covid-19 site:arxiv.org*” yields 5,490 results on Google and the search query “*covid-19 model site:arxiv.org*” yields 3,330 results on Google. In other words, going by these search counts, 3,330 out of a total of 5,490 papers, i.e., 60.7% of COVID-19 research on https://arxiv.org, is simply on modeling the pandemic spread.

**The mounting skepticism:** Consequently, the academic community is currently a bit skeptical about the efficacy of many of these models in predicting the effect of the pandemic. We share the same skepticism; however, we also feel that a well-crafted model can help us, not in accurately predicting the future number of infections, but in estimating the need for testing. The accuracy of the estimate will essentially depend on how well the model takes key real-world dynamics into consideration.

**Our efforts:** As a proof of concept we will demonstrate in detail how we can build such a model in the later part of this investigation. However, for a moment assume that we have such a model and let us first present how that model will help us estimate the number of tests required in a population.

Let us assume the following setup for a country, e.g., India. We have a *good* statistical model Y(D; ꆪ) which mimics the spread of the pandemic at day D after the start of the pandemic. Further assume that Xᵢ patients are actually reported on day Dᵢ in that country and that the actual number of tests on that day is Tᵢ. We can safely assume that these Xᵢ patients are detected by the Tᵢ tests. In other words, there is a positive-test factor fᵢ which signifies the proportion of tests which detected actual COVID-19 patients, i.e., Xᵢ = fᵢ · Tᵢ. Then our intuition is that, on day Dᵢ, the minimum number of extra tests is the gap between the model estimate Yᵢ = Y(Dᵢ; ꆪ) and the reported count Xᵢ, scaled by the positive-test factor: #extra tests ≈ (Yᵢ - Xᵢ) / fᵢ.

**Rationale:** Our rationale for using a predictive disease-spreading model to estimate #tests is actually pretty straightforward. Let’s assume the model Y(D; ꆪ) takes into consideration the primary dynamics relevant to the pandemic spread. Then, to create the model, we first need to use the sequence X₁, X₂, X₃, … , Xᵣ, i.e., the data till day Dᵣ, to fully set the model parameters.

Once the model is created, by design it should follow the dynamics of disease spreading. Consequently, the difference between the prediction of the model on the (r+i)th day, i.e., Yᵣ₊ᵢ, and the actual #reported infections on the (r+i)th day, i.e., Xᵣ₊ᵢ, will provide us an estimate of the unreported (and likely asymptomatic) cases. Thus the policy makers should consider running tests on the population at least to capture the hitherto undetected (Yᵣ₊ᵢ - Xᵣ₊ᵢ) cases.

We note that, naturally, not all tests will identify COVID-19 infections (e.g., in case of random testing). Thus we considered a positive-test factor fᵣ₊ᵢ which expresses the fraction of Tᵣ₊ᵢ tests which effectively identified Xᵣ₊ᵢ COVID-19 infections. Consequently, we scale our number of minimum extra tests using fᵣ₊ᵢ and arrive at the estimated number of minimum extra tests policy makers should prescribe to find (Yᵣ₊ᵢ - Xᵣ₊ᵢ) hitherto undetected cases. We propose estimating fᵣ₊ᵢ as the running average of per-day proportions of tests till (r+i) th day which actually detected COVID-19 infections. Putting it all together, the scaled model residual (difference between predicted and actual value) should give us an estimate of the number of extra tests.
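The estimate described above can be sketched in a few lines of Python. This is a minimal sketch, not our actual pipeline: the function names are hypothetical, the inputs are assumed to be per-day counts, and flooring a negative residual at zero (and returning infinity when no test has been positive so far) are assumptions made only to keep the example well defined.

```python
from typing import List, Sequence


def running_positive_factor(reported: Sequence[float],
                            tests: Sequence[float]) -> List[float]:
    """Running average of the per-day positive fraction X_d / T_d."""
    factors = []
    total = 0.0
    for day, (x, t) in enumerate(zip(reported, tests), start=1):
        if t <= 0:
            raise ValueError("each day needs a positive number of tests")
        total += x / t
        factors.append(total / day)
    return factors


def extra_tests(predicted: Sequence[float],
                reported: Sequence[float],
                tests: Sequence[float]) -> List[float]:
    """Scaled model residual (Y_d - X_d) / f_d for every day d."""
    factors = running_positive_factor(reported, tests)
    result = []
    for y, x, f in zip(predicted, reported, factors):
        residual = max(y - x, 0.0)  # assumption: no "negative" extra tests
        if residual == 0.0:
            result.append(0.0)
        elif f == 0.0:
            result.append(float("inf"))  # no positives so far: undefined
        else:
            result.append(residual / f)
    return result


# Made-up numbers, purely for illustration (not taken from our dataset).
print(extra_tests(predicted=[15, 50], reported=[10, 20], tests=[100, 100]))
```

Any disease-spread model can supply the `predicted` sequence; the scaling step itself does not depend on how the model was built.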

**Our proof of concept:** Naturally, we wanted to verify our idea and actually provide an estimate of the amount of extra testing that should be done in a populace. To that end, we created a *mobility-aware* interaction-graph-based model to use with our approach. This mobility-aware model is our modest effort to simulate the phenomenon of *social distancing*, which is nothing new in the context of COVID-19 but is known to have existed since the 5th century BC [4]. However, we have not seen much effort to model this phenomenon *from a mobility-aware perspective*, neither in the earlier studies nor in the renewed context of COVID-19 research. We do not claim that this model is absolutely accurate, nor do we claim that it has the magical power to predict the future. However, we strongly believe that our model is expressive enough to mimic the proper dynamics of pandemic spreading and unearth latent patterns in the data. Consequently, we posit that our model is suitable for demonstrating the usefulness of our approach to estimating the number of tests.

# Proof of concept: Developing and Leveraging a mobility-aware graph-based model for estimating #extra tests

We again emphasize that we do not want to create a super-accurate and extremely predictive disease-spreading model. In fact, developing such a model is complementary to our efforts, and we can directly use any such model in our approach. However, for the purpose of demonstrating usefulness, we just wanted to create a model which takes into account the population dynamics relevant to the spread of COVID-19.

**Basic considerations in our model**

We needed to ensure that our model takes care of the underlying dynamics of spreading a disease. To that end we note that population mobility plays a crucial role in spreading an infection. In fact, we can model the general social distancing measures enforced in multiple countries and even lockdown as putting constraints on the mobility of the population. Thus, we needed to add a mobility component to our model.

Furthermore, we strongly felt that established models like SIR or SIRS, while useful for modeling the spread of disease, do not inherently provide an easy way to incorporate factors like population immunity, virus strain, population mobility etc. Thus, rather than those models we decided to conduct a ground-infection data driven simulation on interaction graphs to model the spread of pandemic. This approach intrinsically takes care of virus strains as well as other latent conditions (since the ground-infection data already inherently consider those factors). So we ultimately end up with a model which is quite possibly richer than simply assuming SIR or SIRS dynamics and estimating parameters.

**Data collection**

We leveraged the dataset from the “*COVID Rest API for India data, using Cloudflare Workers*”, a service that provides an API returning the requested data in JSON format. The maintainers state that they source the data from the Ministry of Health and Family Welfare and from unofficial sources, namely *covid19india.org*.

There are different end-points each returning particular information. They are as follows:

- State-wise distribution history from 14th March till the current date. Link
- State-wise distribution for the current date. Link
- List of patients (*seems to be no longer maintained*) from the start, i.e., 31st January 2020. Link
- # samples tested and # positive cases found. Link

Our model also requires population data at the most granular level possible for creating the graph. We have used the 2011 census data. Since almost 10 years have passed since the last census, we have assumed that the population at each granular level has increased by a factor of *1.12*. The 2011 census data can be found compiled in the following sheet.

We have also gathered the latitude-longitude (as accurately as available) of each district/city in the population census using the GeoPy API.

**How did we create our mobility-aware graph-based model?**

**The underlying graph:** The model is based on a graph that depicts the whole of India. Each node in the graph represents a city/district as available in the 2011 population census data, and the edges represent the distances between them. The distance referred to here is the Manhattan or L₁ distance between the latitude-longitude coordinates. Thus, two cities that are geographically far apart have a larger distance than two nearby cities.
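As a small illustration of the edge weight, here is a sketch of the L₁ distance on latitude-longitude pairs. The function name is hypothetical and the coordinates are approximate values used only for the example; note that the result is in degrees, not kilometres.

```python
from typing import Tuple

LatLon = Tuple[float, float]


def l1_distance(a: LatLon, b: LatLon) -> float:
    """Manhattan (L1) distance between two (latitude, longitude) pairs."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


# Approximate coordinates, for illustration only.
kharagpur = (22.35, 87.23)
kolkata = (22.57, 88.36)
mumbai = (19.08, 72.88)

# A nearby city yields a smaller edge weight than a far-away one.
print(l1_distance(kharagpur, kolkata) < l1_distance(kharagpur, mumbai))
```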

The graph formed is not fully connected. We hypothesize that people usually travel between two states by passing through the capital or the mega-cities of those states. For each state, we have considered the three most populated cities as the mega-cities of that state, through which people can move between states. Each city/district of a particular state is connected to all the mega-cities of that state, and all the mega-cities are themselves connected to one another. The graph is built using the *NetworkX* framework for Python and is undirected. A visualization of the graph is presented in Figure 1.

**Initialization:** Each graph node has three parameters depicting the infected population, the recovered + dead population and the total population. All the infected patients reported up to day 40 (when the first death of a COVID-19 patient was reported in India) are seeded into the graph, and then our model starts to simulate. Since we feel that the initial data under-represents some states, we adopt a fall-back strategy where we place some infected patients in the mega-cities of the under-represented states based on their population. The geographical coordinates of each seed patient are found from the GeoPy API using the approximate address given, and the patient is allocated to the node which represents the nearest city.
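The connectivity rule can be sketched without any graph library. This is a simplified illustration, not our NetworkX code: the function name and the toy city names are hypothetical, city names are assumed to be unique, and we read "connected among themselves" as linking every mega-city to every other mega-city, including those of other states.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

City = Tuple[str, str, int]  # (name, state, population)


def build_edges(cities: List[City], top_k: int = 3):
    """Return (megacities per state, set of undirected edges)."""
    by_state: Dict[str, List[Tuple[int, str]]] = {}
    for name, state, population in cities:
        by_state.setdefault(state, []).append((population, name))

    # The top_k most populated cities of each state act as its mega-cities.
    megacities: Dict[str, List[str]] = {}
    for state, members in by_state.items():
        members.sort(reverse=True)
        megacities[state] = [name for _, name in members[:top_k]]

    edges: Set[FrozenSet[str]] = set()
    # Every city/district is linked to all mega-cities of its own state.
    for name, state, _ in cities:
        for mega in megacities[state]:
            if mega != name:
                edges.add(frozenset((name, mega)))
    # All mega-cities are linked to one another (assumed: across states too).
    all_megas = [m for names in megacities.values() for m in names]
    for a, b in combinations(all_megas, 2):
        edges.add(frozenset((a, b)))
    return megacities, edges


# Toy data with made-up names and populations.
toy = [("a1", "A", 100), ("a2", "A", 90), ("a3", "A", 80), ("a4", "A", 10),
       ("b1", "B", 50), ("b2", "B", 5)]
megas, edges = build_edges(toy)
print(megas["A"], len(edges))
```

The resulting pairs could then be loaded into a NetworkX `Graph`, with the L₁ distance between the two endpoints stored as the edge weight.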
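A minimal sketch of the seeding step follows, assuming each seed patient has already been geocoded to a (latitude, longitude) pair. The function names are hypothetical, and using the same L₁ distance as for the edges to decide which city is "nearest" is an assumption of this sketch.

```python
from typing import Dict, Iterable, Tuple

LatLon = Tuple[float, float]


def nearest_city(patient: LatLon, cities: Dict[str, LatLon]) -> str:
    """Name of the city whose coordinates are closest to the patient (L1)."""
    return min(
        cities,
        key=lambda name: abs(cities[name][0] - patient[0])
        + abs(cities[name][1] - patient[1]),
    )


def seed_infections(patients: Iterable[LatLon],
                    cities: Dict[str, LatLon]) -> Dict[str, int]:
    """Count the seed patients allocated to each city node."""
    infected = {name: 0 for name in cities}
    for coords in patients:
        infected[nearest_city(coords, cities)] += 1
    return infected


# Made-up coordinates, for illustration only.
cities = {"city_x": (10.0, 10.0), "city_y": (20.0, 20.0)}
patients = [(11.0, 9.5), (19.0, 21.0), (12.0, 12.0)]
print(seed_infections(patients, cities))
```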

**Handling asymptomatic cases (and incubation period):** A COVID-19 patient does not show symptoms as soon as they contract the virus (and so initially appears asymptomatic). There is an incubation period which varies from 1–14 days depending on the patient’s immunity and several other factors. To factor this into our model, we have used a delay mechanism. When a group of infected patients infects a set of healthy people, the newly infected count is passed through a delay array that delays the day on which each newly infected patient will be counted as infected. The delay array is populated using a Gaussian distribution. For example, if 100 people newly contract the virus on a given day, the dates on which these 100 patients are counted as infected are spread over the subsequent 14 days following a Gaussian distribution.
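The delay mechanism can be illustrated as follows. This is a sketch under stated assumptions: the function name is hypothetical, the mean and standard deviation of the Gaussian are placeholder values (they are not given above), and the largest-remainder rounding is only one way to make the daily counts add up to the original number.

```python
import math
from typing import List


def delay_schedule(new_cases: int, horizon: int = 14,
                   mean: float = 7.0, std: float = 3.0) -> List[int]:
    """Spread new_cases over the next `horizon` days with Gaussian weights."""
    weights = [math.exp(-0.5 * ((day - mean) / std) ** 2)
               for day in range(1, horizon + 1)]
    total = sum(weights)
    shares = [new_cases * w / total for w in weights]
    counts = [int(s) for s in shares]  # floor, since shares are non-negative
    # Hand the leftover cases to the days with the largest fractional parts.
    leftover = new_cases - sum(counts)
    order = sorted(range(horizon),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


schedule = delay_schedule(100)
print(schedule, sum(schedule))
```

Each entry of the returned list is added to the infected count of the node on the corresponding future day.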

**Infection methodology:** For each node depicting a city/district, the infected patients in that node are made to perform random walks, the length of which is governed by the Levy distribution [6]. This strategy is motivated by the strong parallels found between Levy walks and human mobility [7]. The Levy parameters are assigned differently for each of the three phases into which we have divided the entire simulation period: the *normal phase*, the *lockdown phase* and the *post-lockdown phase* (see Figure 2). The parameter α determines the sharpness of the peak of the distribution. Semantically, the Levy parameter controls the extent of social distancing.

After the Levy walk, the new node positions of the infected patients are noted and the infected patients are removed from their old nodes. At each new node, a certain number of healthy people are infected. This number is determined by the following expression:

*#new infected = #healthy population × infection_rate + 1*, where the infection rate is a hyper-parameter of our model.

This set of newly infected patients is passed through the delay mechanism explained earlier and added to the node where the infection took place. Further, we consider that some infected patients die or recover. To take care of this, a certain percentage of the infected patients in each node is declared inactive. We did not consider the possibility that recovered patients are susceptible to the infection a second time, although such a consideration would be a straightforward extension of our model. Moreover, recent findings suggest that such *re-positive* patients do not spread infection; thus, in the context of this work, such patients behave like people who are not susceptible to the infection [10].
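To make the role of the Levy parameter concrete, here is a sketch of how walk lengths could be sampled. It assumes that α acts as the scale parameter of the Lévy distribution [6], which is one possible reading and not a statement of our exact parameterisation; the function name is hypothetical. The sampler uses the standard fact that if Z is standard normal, then scale / Z² follows a Lévy distribution with location 0 and that scale. How a sampled length is turned into hops on the graph is not shown.

```python
import random
import statistics
from typing import List


def levy_step(scale: float, rng: random.Random) -> float:
    """One draw from Levy(location=0, scale) via scale / Z**2, Z ~ N(0, 1)."""
    z = 0.0
    while z == 0.0:  # guard against a division by zero
        z = rng.gauss(0.0, 1.0)
    return scale / (z * z)


def sample_steps(scale: float, n: int, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    return [levy_step(scale, rng) for _ in range(n)]


# 0.6 and 0.8 are the values used in the text for the strict and the
# relaxed lockdown; a smaller value gives shorter typical walks.
strict = sample_steps(0.6, 1000)
relaxed = sample_steps(0.8, 1000)
print(statistics.median(strict) < statistics.median(relaxed))
```

Under this reading, a smaller parameter concentrates the walk lengths near zero, i.e., stronger social distancing.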
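The per-node update can be sketched as below. The function name and the argument names are hypothetical; computing the healthy population as total minus infected minus inactive, truncating to whole persons, and capping new infections at the number of healthy people are assumptions of this sketch.

```python
from typing import Tuple


def step_node(total: int, infected: int, removed: int,
              infection_rate: float,
              inactive_fraction: float) -> Tuple[int, int]:
    """Return (#newly infected, #newly inactive) for one node and one day."""
    healthy = max(total - infected - removed, 0)
    # #new infected = #healthy population * infection_rate + 1
    new_infected = min(int(healthy * infection_rate) + 1, healthy)
    # A fixed fraction of the infected recovers or dies.
    newly_inactive = int(infected * inactive_fraction)
    return new_infected, newly_inactive


# Made-up numbers, for illustration only.
print(step_node(total=1000, infected=20, removed=0,
                infection_rate=0.01, inactive_fraction=0.25))
```

The first value goes through the delay mechanism before being counted; the second is moved from the infected to the recovered + dead population of the node.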

**Special events:** To simulate rare events like the “*Tablighi Jamaat*” gathering, we have introduced sudden bursts of infected people which are added to the current node. A random number drawn from a Poisson distribution is multiplied by the #infected patients in that particular node and added to the current number of infected persons. This is not done in every iteration; a uniform probability distribution governs whether a burst occurs.
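A sketch of such a burst is given below, using only the standard library. The function names, the event probability and the Poisson mean are hypothetical placeholders, and Knuth's multiplication method is used here simply because Python's `random` module has no Poisson sampler.

```python
import math
import random


def poisson(lam: float, rng: random.Random) -> int:
    """Knuth's method for a Poisson draw; adequate for small lam."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def maybe_burst(infected: int, event_prob: float, lam: float,
                rng: random.Random) -> int:
    """With probability event_prob, add Poisson(lam) * infected to the node."""
    if rng.random() >= event_prob:
        return infected  # no special event in this iteration
    return infected + poisson(lam, rng) * infected


rng = random.Random(42)
print(maybe_burst(infected=50, event_prob=0.05, lam=2.0, rng=rng))
```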

**How accurately does our model follow the real data?**

We used the data reported up to the 41st day as warm-up data for our model (as mentioned under *Initialization* above). Next, we use a feedback strategy and tune the model up to the 75th day. We freeze the model parameters on day 75 (which corresponds to 14th April, 2020, the end of a strict-lockdown phase in India). As a first step, we checked how well our model fits the actual data till day 75 (i.e., while tuning the model).

**Model-fit till end of strict-lockdown phase in India:** Figure 3 shows the temporal variation of the total #actual cases reported up to day 75 (end of the strict-lockdown phase), as well as the #cases estimated by our infection simulation model up to that day (with a 5% tolerance band for each day). The two curves follow each other very closely, which implies that our mobility-aware graph-based infection simulation model closely follows the actual reported data. In fact, the R-squared value for our model till day 75 is 0.998, signifying a *very good fit* with the actual infection data.

**Model fit beyond last lockdown phase in India:** Next, we checked how our model performs beyond the initial strict lockdown (i.e., in the post-lockdown phase). We considered a spectrum of values for the α parameter, which signify different intensities of human mobility beyond the strict lockdown. The values range from 0.6 (if the strict lockdown persists) to values like 1.0, 1.2, 1.4 (if the lockdown is relaxed). In practice, yet another lockdown phase (although relatively *relaxed*) followed the strict lockdown beyond 14th April, 2020 (i.e., day 75). The result is shown in Figure 4. Naturally, out of all simulated (i.e., predicted) values, the curve corresponding to α = 0.6 shows the lowest growth, and as α increased the #predicted infections increased. However, *all* of these simulated values overestimate the actual #infections. In other words, the #actual infected cases is substantially lower than the #cases predicted by our simulation.

However, we also note that till day 75 the #actual and #simulated (i.e., predicted) cases were closely matched (shown in Figure 3). Thus we conclude that beyond the strict-lockdown period (i.e., after day 75, which is also the end of our tuning period), the significant deviation of #actual cases from our model (i.e., the model residual) indicates a large number of undetected cases (asymptomatic or otherwise). Consequently, we use the scaled-residual formulation described earlier to estimate the required number of extra tests from the model residual.
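For readers who want to reproduce this kind of check on their own series, here is a minimal sketch of the two quantities involved. It assumes the usual coefficient of determination (one minus the ratio of residual to total sum of squares) and reads the tolerance band as a relative error of at most 5% per day; the function names and the numbers are illustrative only.

```python
from typing import List, Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def within_band(actual: Sequence[float], predicted: Sequence[float],
                tolerance: float = 0.05) -> List[bool]:
    """Per-day check that the actual value lies within the tolerance band."""
    return [abs(a - p) <= tolerance * p for a, p in zip(actual, predicted)]


# Made-up cumulative counts, not our data.
actual = [100.0, 210.0, 400.0, 820.0]
simulated = [104.0, 205.0, 410.0, 800.0]
print(round(r_squared(actual, simulated), 4), within_band(actual, simulated))
```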

**Putting it all together: Estimating the number of extra tests**

We simulate our model to determine the predicted number of infected patients and then use the scaled-residual formulation described earlier to find the #expected tests from the model residual. We leveraged the day-wise #tests and #actual cases data in this experiment. Figure 5 shows the #expected tests according to the mobility-aware interaction-graph-based simulation after the lockdown period. In this graph we took α = 0.8 to capture the effect of the relaxed lockdown. However, the results remain similar for all other α values. We also show the #actual tests per day from our dataset (the blue curve) and the #extra tests (the green curve). We make two key observations from this graph from the point of view of policy makers.

**Projected number of estimated tests over time:** First, the #daily tests were somewhat appropriate in the early period of the disease. This is shown by the fact that the green curve is below the blue curve at the beginning of Figure 5. Interestingly, from that point, both the green and blue curves increased with time. However, the rate of increase of the green curve (#extra tests) is higher than that of the blue curve (#actual tests). Consequently, around day 87 (i.e., 12 days after the end of the strict lockdown) the closeness of the green and blue curves suggests that, according to the simulation, India needed a 100% increase in the #actual tests.

**Invest and be aggressive early on:** Second, the increase of the green curve with time suggests an *avalanche effect*. The number of extra tests in the early days of the disease will be low, but as the number of undetected cases rises (due to lack of tests), many infected individuals roam free and infect more people (even within lockdown). Consequently, as time passes the green curve grows steeply and around the 100th day shows a need for an increase of around 400% (200,000 extra tests vs. 50,000 actual tests) in the #actual tests. Thus, the policy makers should consider aggressive and systematic testing (using a principled estimate) to detect COVID-19 in the early days of the disease; otherwise they might need to exponentially increase the #tests in later periods.
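The percentages quoted above follow from simple ratios of the two curves in Figure 5. In the worked arithmetic below, T_actual and T_extra are shorthand introduced only for this illustration (blue and green curve, respectively), and the day-100 values are the approximate readings quoted in the text.

```latex
\[
\text{day } 87:\quad
\frac{T_{\text{extra}}}{T_{\text{actual}}} \approx 1
\;\Rightarrow\; \text{a } 100\% \text{ increase}
\]
\[
\text{day } 100:\quad
\frac{T_{\text{extra}}}{T_{\text{actual}}} \approx \frac{200{,}000}{50{,}000} = 4
\;\Rightarrow\; \text{a } 400\% \text{ increase},
\qquad
\frac{T_{\text{actual}} + T_{\text{extra}}}{T_{\text{actual}}} \approx \frac{250{,}000}{50{,}000} = 5
\]
```

A 400% increase therefore means that the total number of daily tests is about five times the number actually performed.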

# Moral of the story and road ahead

We started with the question of whether we can systematically estimate the minimum number of tests required in a given location. We answer this question affirmatively. Indeed, we show that we can leverage a model to systematically estimate the minimum #required tests. To that end, our approach hinges on a model which is rich enough to encapsulate population dynamics (i.e., mobility, interaction and the effect of lockdown) and can take real infection data into account. We created such a model and demonstrated how it can be useful for policy makers to understand the requirement for the number of tests to detect and isolate COVID-19 patients. In fact, our model, even in the face of limited available data, demonstrated the need for early testing and recommended a five-fold increase in COVID-19 testing over time. One of the most intriguing messages that all these analyses together portray, at least to us, is “*if you do not do aggressive testing early on, you will be destined to do exponentially more testing as time progresses since every individual test not done can potentially give birth to the requirement of hugely many tests in the immediate future.*” We do not claim that our proof-of-concept model is the best one, but we would definitely underline that it is rich enough to consider the factors responsible for spreading COVID-19.

However, we envision this exploration as a first step in the direction of systematic, data-driven estimation of the number of COVID-19 tests. In fact, we identify a number of considerations that could improve the model as well as the estimate.

**Go for pooled testing:** We note that improvements in testing methodology will affect the #tests. In this exploration we assumed that a single test finds out whether a single individual is infected. However, recent work has explored pooled sampling for COVID-19 [5]. It shows that such pooled sampling can reduce the number of tests twofold given a 7% infection rate. Such a technique will definitely improve the efficiency of the current testing strategies. In fact, our exploration makes a strong case for policy intervention in the #tests (and possible use of pooled testing) given the exponential rise in #required individual tests.

In fact, we note that very recently the Indian government decided to significantly ramp up the number of tests to 200,000 tests per day [9]. *This number is quite in line with our current recommendation.* This announcement provides a strong validation of the efficacy of our approach as well as of how this research can help policy makers in the near future.
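To give a feel for why pooling helps, here is a sketch based on classic two-stage pooling (test a pool of k samples once, and retest every member individually only if the pool is positive). This is a textbook scheme used for illustration and not necessarily the scheme studied in [5]; it assumes independent infections and a perfectly accurate test, and the function names are hypothetical.

```python
def expected_tests_per_person(prevalence: float, pool_size: int) -> float:
    """Expected number of tests per person under two-stage pooling."""
    if pool_size == 1:
        return 1.0  # no pooling: one test per person
    prob_pool_positive = 1.0 - (1.0 - prevalence) ** pool_size
    # One pooled test shared by pool_size people, plus one individual
    # retest per person whenever the pool turns out positive.
    return 1.0 / pool_size + prob_pool_positive


def best_pool_size(prevalence: float, max_pool: int = 32) -> int:
    """Pool size that minimises the expected tests per person."""
    return min(range(1, max_pool + 1),
               key=lambda k: expected_tests_per_person(prevalence, k))


k = best_pool_size(0.07)
print(k, round(expected_tests_per_person(0.07, k), 3))
```

At a 7% infection rate this simple scheme needs roughly half a test per person at its best pool size, which is in line with the twofold reduction mentioned above.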

**Create smaller sub-population specific models:** Yet another avenue to make our model richer is to consider stratification of the population. For example, in this analysis we fit a single simulated model for the whole of India. A next step would be to fit different simulations for different sub-samples of the population and then build a final model as a composition of all these smaller models. We can think of scenarios where the aged as well as those with pre-existing conditions (diabetes, cardiac issues) need different simulation models and the #expected tests need to be distributed differently across differently vulnerable sub-populations. This would still be a straightforward extension of exactly the same approach, where different models are simply trained with different seed data. We strongly believe any reasonable policy-making authority will have access to such sub-population data to create such smaller models.

**Use more granular data to enhance the model:** We used the city-wise mobility profiles of a population. However, authorities can use a more granular, i.e., locality- or ward-level, mobility profile as well as the locality-level number of infections to build an enhanced model, which can in turn make the prediction (as well as the expected number of tests) at the locality level instead of the country level.

To conclude, our work complements the ongoing work on finding a cure and devising new and more effective testing strategies. We started with a simple question — estimating the number of tests. Consequently we took into consideration the social dynamics (social distancing, lockdown), mobility profiles and interactions between people to create a data-driven simulation model to better understand the COVID-19 infection spread. This model helped us to create a proof-of-concept and show how we can systematically engineer a data-driven estimate of the number of required tests specifically tailored to a population.

# References

[1] https://www.sciline.org/covid/expert-quotes-testing#q2

[2] Testing for COVID-19: A way to lift confinement restrictions

[3] https://www.mohfw.gov.in/pdf/ICMRstrategyforCOVID19testinginIndia.pdf

[4] https://en.wikipedia.org/wiki/Social_distancing

[5] https://www.medrxiv.org/content/10.1101/2020.04.06.20052159v2.full.pdf

[6] https://en.wikipedia.org/wiki/L%C3%A9vy_distribution

[7] I. Rhee, M. Shin, S. Hong, K. Lee, S. J. Kim and S. Chong, “On the Levy-Walk Nature of Human Mobility,” in *IEEE/ACM Transactions on Networking*, vol. 19, no. 3, pp. 630–643, June 2011, doi: 10.1109/TNET.2011.2120618.