Accelerating COVID-19 qPCR analysis with machine learning.
How to apply Machine Learning to detect SARS-CoV-2, or any other target, in qPCR runs.
Hello, my name is Santiago Goncalves and I’m a 26-year-old Software Engineer from Montevideo, Uruguay.
This piece is a continuation of the previous article, in which I describe how this research began and briefly explain what qPCR is and how it works. If you haven’t read it yet and are unfamiliar with how quantitative polymerase chain reaction works, please do so before proceeding with this article.
So, why use machine learning to pre-validate test results? For one, a model can process large volumes of tests in seconds with minimal error, which makes it possible to estimate the overall prevalence of the virus without manually verifying each test result. Machine learning algorithms also don’t get tired of staring at graphs all day, as scientists do.
Breaking down qPCR analysis as a supervised learning problem.
The PCR process is made up of a sequence of temperature changes that is repeated 25–50 times; each repetition is called a cycle. When the PCR machine (also known as a thermal cycler) finishes, it exports an array of floating-point values for each sample processed, one value per cycle, with each number representing the weighted fluorescence emitted by the excited fluorophore at that cycle.
We can use those cycles as features to develop any kind of Machine Learning or Deep Learning model: we simply label each sequence as positive or negative and feed the labeled sequences into the model.
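To make the idea of "cycles as features" concrete, here is a minimal sketch that uses only the standard library. The curves are synthetic: the sigmoid shape, the `ct` midpoint, the noise level and the `synthetic_curve` helper are all assumptions made for illustration, not values taken from real runs.

```python
import math
import random

N_CYCLES = 40  # one fluorescence reading per cycle, i.e. one feature per cycle


def synthetic_curve(positive, ct=25.0, noise=0.02, seed=0):
    """Return N_CYCLES made-up fluorescence readings for a single sample."""
    rng = random.Random(seed)
    readings = []
    for cycle in range(1, N_CYCLES + 1):
        # Assumption: positives rise along a sigmoid centered on cycle `ct`,
        # while negatives stay flat around zero.
        signal = 1.0 / (1.0 + math.exp(-(cycle - ct) / 1.5)) if positive else 0.0
        readings.append(signal + rng.gauss(0.0, noise))
    return readings


# Each sample is a (features, label) pair: 1 = positive, 0 = negative.
samples = [
    (synthetic_curve(True, seed=1), 1),
    (synthetic_curve(False, seed=2), 0),
]
X = [features for features, _ in samples]  # one row of 40 floats per sample
y = [label for _, label in samples]        # one label per sample
```

Every row of `X` is one sample's exported array, and `y` holds the matching label, which is the shape a supervised model expects.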
In the following section, we will go through the code blocks used to make it possible. I promise I’ll try to keep it as simple as I can.
About the dataset.
As mentioned before, our dataset has 40 input features, each corresponding to one cycle, and the “RESULT” column as our model’s output:
- RESULT → 1 = positive PCR test.
- RESULT → 0 = negative PCR test.
Importing libraries.
Here, we import the libraries we will use: pandas and several modules from sklearn.
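The import block could look roughly like the sketch below. The exact list is an assumption: I picked the scikit-learn classes that the model abbreviations in the results (LR, LDA, KNN, CART, GB, NB, SVM) most plausibly stand for.

```python
# Data handling.
import pandas as pd

# Cross-validation utilities.
from sklearn.model_selection import KFold, cross_val_score

# Candidate classifiers (assumed from the abbreviations in the results).
from sklearn.linear_model import LogisticRegression                    # LR
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis   # LDA
from sklearn.neighbors import KNeighborsClassifier                     # KNN
from sklearn.tree import DecisionTreeClassifier                        # CART
from sklearn.ensemble import GradientBoostingClassifier                # GB
from sklearn.naive_bayes import GaussianNB                             # NB
from sklearn.svm import SVC                                            # SVM
```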
Load the data.
Next, we load the data in our spreadsheet as a pandas DataFrame, and split our input and output features to feed into our model.
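A minimal sketch of this step, assuming the spreadsheet has been exported as a CSV file in which every column except “RESULT” is a cycle reading. The `load_dataset` helper and the CSV format are my assumptions for illustration.

```python
import pandas as pd

TARGET_COLUMN = "RESULT"  # 1 = positive PCR test, 0 = negative PCR test


def load_dataset(path):
    """Read the data (CSV assumed) and split it into features X and labels y."""
    df = pd.read_csv(path)
    X = df.drop(columns=[TARGET_COLUMN])  # the per-cycle fluorescence columns
    y = df[TARGET_COLUMN]                 # the label column
    return X, y
```

Calling `X, y = load_dataset("qpcr_runs.csv")` (a hypothetical file name) would return a DataFrame of cycle features and a Series of labels. For an Excel workbook, `pd.read_excel` could be swapped in for `pd.read_csv`.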
Checking classification algorithms.
In this code block, we evaluate several classification algorithms with 5-fold cross-validation (KFold), then print the mean cross-validated roc_auc for each model and plot the scores for comparison.
Mean roc_auc per cross validation score:
LR: 0.999607
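The comparison could be sketched as follows. Several things here are assumptions: the classifier classes are inferred from the abbreviations in the results, the hyperparameters are scikit-learn defaults (apart from `max_iter` and the random seeds), the folds are shuffled, and the data is a synthetic stand-in produced by a made-up `make_synthetic_runs` helper so the snippet runs on its own. Scores from this sketch are therefore not the ones reported in this article, and plotting is left out to keep it short.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def make_synthetic_runs(n_per_class=100, n_cycles=40, seed=0):
    """Stand-in data: sigmoid curves for positives, flat noise for negatives."""
    rng = np.random.default_rng(seed)
    cycles = np.arange(1, n_cycles + 1)
    ct = rng.uniform(18, 32, size=(n_per_class, 1))  # made-up curve midpoints
    positives = 1.0 / (1.0 + np.exp(-(cycles - ct) / 1.5))
    negatives = np.zeros((n_per_class, n_cycles))
    noise = rng.normal(0.0, 0.02, size=(2 * n_per_class, n_cycles))
    X = np.vstack([positives, negatives]) + noise
    y = np.concatenate([np.ones(n_per_class, dtype=int),
                        np.zeros(n_per_class, dtype=int)])
    return X, y


def build_models(seed=0):
    """Candidate classifiers, keyed by the abbreviations used in the results."""
    return {
        "LR": LogisticRegression(max_iter=1000),
        "LDA": LinearDiscriminantAnalysis(),
        "KNN": KNeighborsClassifier(),
        "CART": DecisionTreeClassifier(random_state=seed),
        "GB": GradientBoostingClassifier(random_state=seed),
        "NB": GaussianNB(),
        "SVM": SVC(),
    }


def compare_models(X, y, n_splits=5, seed=0):
    """Return the mean cross-validated ROC AUC for each candidate model."""
    # Shuffling is an assumption; it keeps both classes present in every fold
    # when the rows are ordered by label, as they are in the synthetic data.
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {}
    for name, model in build_models(seed).items():
        fold_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
        scores[name] = float(fold_scores.mean())
    return scores


X, y = make_synthetic_runs()
for name, score in compare_models(X, y).items():
    print(f"{name}: {score:.6f}")
```

With the real dataset, `X` and `y` would come from the loaded DataFrame instead of `make_synthetic_runs`.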
LDA: 0.997862
KNN: 0.999929
CART: 0.993527
GB: 0.996121
NB: 0.978364
SVM: 0.999964
In the previous illustration, we can see that many of the models performed well on our well-curated dataset, even without feature engineering or data pre-processing.
I’m sure that there’s still room to explore and improve these models’ metrics, but the main purpose of this article is merely to break down how to apply Machine Learning to treat qPCR analysis as a supervised learning problem. With that explained, I’ll wrap it up here.
I simply wanted to express my gratitude to everyone who was a part of this process in some way, as well as to you for taking the time (and interest) to read this article. I hope you enjoyed it and found it informative. Please do not hesitate to contact me if you have any questions or comments.
→ LinkedIn.
→ Twitter.