How a simple algorithm can increase the throughput of standard COVID-19 tests 10-fold

Lior Zimmerman
Published in Analytics Vidhya
3 min read · Mar 17, 2020

COVID-19 was declared a pandemic last week and has already cost thousands of lives, with no end in sight. One of the most critical hurdles in stopping the spread of the disease is the availability and cost of test kits for the virus.

The standard assay for detecting the virus in a specimen is called Real-Time PCR (RT-PCR). Simply put, RT-PCR involves a chemical reaction that produces fluorescent light if viral genetic material (RNA, in the case of this virus) is present in the sample.

In the standard protocol, one sample per test tube/well is analyzed. Pooling several samples in one well is a solution that has been explored in several studies. In one such study, it was shown that 37% fewer RT-PCR tests were needed when pooling several samples in one well/test tube. A typical pooling method combines several samples per well. If a well is found to contain a positive sample (by emitting a fluorescent signal above a predefined threshold), each of the samples in that pool is then assayed separately, which results in only a modest decrease in the number of tests.
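To see why this two-stage scheme saves only a modest number of tests, here is a rough back-of-the-envelope sketch. It assumes samples are independent, each positive with the same probability, and that the test makes no errors. The function name and the pool sizes are illustrative choices, not taken from the studies mentioned above.

```python
def two_stage_tests_per_sample(prevalence, pool_size):
    """Expected number of tests per sample for pool-then-retest pooling.

    Assumptions (illustrative only): independent samples, identical
    prevalence for every sample, and an error-free test.
    """
    # Probability that a pool contains at least one positive sample.
    p_pool_positive = 1.0 - (1.0 - prevalence) ** pool_size
    # One pooled test shared by pool_size samples, plus one individual
    # retest per sample whenever the pool comes back positive.
    return 1.0 / pool_size + p_pool_positive


# Compare a few hypothetical pool sizes at 1% prevalence.
for pool_size in (5, 10, 20):
    cost = two_stage_tests_per_sample(0.01, pool_size)
    print(f"pool size {pool_size}: {cost:.3f} tests per sample")
```

Under these assumptions the cost per sample cannot fall below 1/pool_size, and the retest term grows as pools get larger, so the two effects pull against each other.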

Here I propose a very simple algorithm for identifying the positive samples in a larger collection of samples using only 1/10 of the number of tests required by the standard protocol. My method assumes 1% of the samples are positive for the virus, which is a reasonable assumption given the published data. An analysis of the cost-effectiveness of pooling as a function of the ratio of positive samples can be found here.

First, let 1 … n be the set of samples to be tested, assuming only 1% of them are positive. Define Xᵢ = 1 iff sample i is positive (0 otherwise).

Given m wells (or test tubes), with m = n/10 (remember, we want to perform only 1/10 of the tests!), we construct the following matrix:
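Written out under the assumption that rows index wells and columns index samples (so that the matrix can multiply the length-n vector X), it has the form:

```latex
S =
\begin{pmatrix}
S_{1,1} & S_{1,2} & \cdots & S_{1,n} \\
S_{2,1} & S_{2,2} & \cdots & S_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
S_{m,1} & S_{m,2} & \cdots & S_{m,n}
\end{pmatrix},
\qquad S_{j,i} \in \{0, 1\}
```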

In the above matrix, each row corresponds to a well and each column to a sample: Sⱼ,ᵢ = 1 iff sample i was put in well/test tube j. In each well, we put a random subset of n/2 of the samples (such that the sum of each row = n/2). Putting n/2 samples in each well is a choice that can be optimized in later implementations.

This matrix allows us to construct the following system of linear equations:
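In matrix form, and assuming an idealized assay in which every positive sample contributes one unit of signal to each well it is placed in, the system reads:

```latex
\begin{pmatrix} W_1 \\ W_2 \\ \vdots \\ W_m \end{pmatrix}
=
S
\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix},
\qquad
W_j = \sum_{i=1}^{n} S_{j,i} \, X_i \quad (j = 1, \dots, m)
```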

Here, W₁ … Wₘ are the strengths of the fluorescent signals in wells 1 … m (each being the sum of the signals of the positive samples in that well), and X is defined as above: Xᵢ = 1 iff sample i is positive for the virus.

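Now we have a system of equations in which we need to find the vector X. The S matrix and the vector W are given to us by the experiment. Because m < n, the system is underdetermined, so recovering X relies on X being sparse (only about 1% of its entries are 1) and bounded between 0 and 1. Under these conditions, such a system can be solved by a simple algorithm like least squares with high accuracy. Here is an example of Python code using the above assumptions, which results in 100% accuracy (Link):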
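The following is not the linked code but a minimal sketch of the same setup. It assumes a noise-free signal in which each positive sample adds exactly one unit, and it uses `scipy.optimize.lsq_linear` with bounds 0 ≤ Xᵢ ≤ 1 as one possible least-squares variant. The function name `simulate_pooling`, the sample count, and the 0.5 rounding threshold are illustrative choices.

```python
import numpy as np
from scipy.optimize import lsq_linear


def simulate_pooling(n=200, prevalence=0.01, seed=0):
    """Simulate one pooled experiment and estimate X by bounded least squares."""
    rng = np.random.default_rng(seed)
    m = n // 10                         # number of wells: 1/10 of the samples
    k = max(1, round(prevalence * n))   # number of positive samples

    # Ground truth: x_true[i] = 1 iff sample i is positive.
    x_true = np.zeros(n)
    x_true[rng.choice(n, size=k, replace=False)] = 1.0

    # Pooling matrix: row j marks the n/2 samples placed in well j.
    S = np.zeros((m, n))
    for j in range(m):
        S[j, rng.choice(n, size=n // 2, replace=False)] = 1.0

    # Idealized, noise-free signal (assumption): one unit per positive sample.
    W = S @ x_true

    # Least squares restricted to 0 <= X_i <= 1, then rounded to 0/1.
    result = lsq_linear(S, W, bounds=(0.0, 1.0))
    x_hat = (result.x > 0.5).astype(int)
    return S, W, x_true.astype(int), x_hat


S, W, x_true, x_hat = simulate_pooling()
mismatches = int(np.sum(x_true != x_hat))
print(f"wells: {S.shape[0]}, samples: {S.shape[1]}, mismatches: {mismatches}")
```

The script reports the number of mismatches instead of assuming it is zero, since how well the rounded estimate matches the ground truth depends on the random draw, the solver, and (in a real assay) the measurement noise that this sketch leaves out.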

Lior Zimmerman
Computational Biologist, Head of Protein Design @ Enzymit