Data Science

Predicting COVID-19 with Data Science

Join NASA in developing a model to predict COVID positivity

Benedict Neo
bitgrit Data Science Publication


Interested in space exploration and medical diagnostics? Want to make an impact?
Put your data science skills to the test by participating in this challenge!

Problem Statement

(The following information is taken from the NASA Breath Diagnostics Challenge description)

In the critical arena of space exploration and medical diagnostics, developing innovative tools for rapid and accurate health assessment has never been more crucial. With NASA’s focus on long-duration space missions and the global need for quick, non-invasive diagnostic methods, the quest for groundbreaking solutions is paramount.

This backdrop sets the stage for a groundbreaking initiative by NASA’s Science Mission Directorate (SMD). As part of its commitment to advancing space exploration and improving human health, NASA is proud to launch the Breath Diagnostics Challenge to leverage E-Nose sensor data to predict COVID-19 status. This key indicator of respiratory health stands at the core of our competition, reflecting the broader implications for both space medicine and terrestrial healthcare.

This challenge emerges at a pivotal moment, inviting participants from around the globe to delve into a rich dataset that includes sensor readings from NASA’s E-Nose device, collected from COVID-positive and COVID-negative individuals. With the dual goals of advancing scientific understanding and promoting practical applications, this competition is a clarion call to data scientists, medical researchers, and AI enthusiasts. Whether you’re deeply entrenched in the field of medical diagnostics or a data enthusiast eager to apply your skills to a cause of global importance, the NASA Breath Diagnostics Challenge offers a unique platform to contribute to meaningful advancements in health monitoring and disease detection.

The data 💾

Get the data by registering for the competition.

The dataset is composed of E-Nose sensor readings from 63 patients, divided into training and test sets. Each patient’s data is organized into a separate text file, containing readings from 64 sensors over a 14-minute period. The data collection process followed a specific protocol to ensure consistency:

  1. 5 min baseline measurement using ambient air
  2. 1 min breath sample exposure and measurement
  3. 2 min sensor “recovery” using ambient air
  4. 1 min breath sample exposure and measurement
  5. 2 min sensor “recovery” using ambient air
  6. 1 min breath sample exposure and measurement
  7. 2 min sensor “recovery” using ambient air
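The protocol above maps each second of the 14-minute recording to a phase. A small helper like this (our own sketch, not part of the official starter code) is handy later when building temporal features:

```python
def phase_for_second(t):
    """Map elapsed seconds into the measurement protocol phases above."""
    # (end_second, label) boundaries derived from the 14-minute protocol
    boundaries = [
        (300, "baseline"),    # 0-5 min: ambient air baseline
        (360, "breath_1"),    # 1 min: first breath sample
        (480, "recovery_1"),  # 2 min: ambient air recovery
        (540, "breath_2"),    # 1 min: second breath sample
        (660, "recovery_2"),  # 2 min: ambient air recovery
        (720, "breath_3"),    # 1 min: third breath sample
        (840, "recovery_3"),  # 2 min: ambient air recovery
    ]
    for end, label in boundaries:
        if t < end:
            return label
    return "post"  # past the 840-second (14-minute) protocol
```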
The files are organized as follows:

├── dataset
│   ├── train
│   │   ├── NTL E-Nose - Patient 1.txt
│   │   ├── NTL E-Nose - Patient 4.txt
│   │   └── …
│   └── test
│       ├── NTL E-Nose - Patient 2.txt
│       ├── NTL E-Nose - Patient 3.txt
│       └── …
├── submission_example.csv
└── train_test_split_order.csv

Each patient file contains:

  • Patient ID
  • COVID-19 Diagnosis Result (POSITIVE or NEGATIVE)
  • Numeric measurements for 64 sensors (D1 to D64)
  • Timestamp for each measurement (Min:Sec format)

The data is distributed as follows:

  • Train: 45 patients
  • Test: 18 patients

The goal 🥅

The objective of this challenge is to develop a classification model that can accurately diagnose patients with COVID-19 based on the E-Nose sensor data. Participants must create a model that:

1. Effectively utilizes the limited dataset of 63 patients
2. Accurately classifies patients as COVID-positive or COVID-negative
3. Generalizes well to unseen data

Submissions should be in the format specified in the `submission_example.csv` file, with predictions for the test set patients in the order specified by the `train_test_split_order.csv` file.

The evaluation metric is Accuracy, with the final rankings determined by performance on a private leaderboard revealed at the end of the competition.

The code can be developed using your preferred data science tools and platforms.

The full code can be found on Deepnote.

Read data

We set up necessary imports and configure logging.

Logging is crucial for tracking the progress and catching any issues during execution.

This is what our data looks like in txt files.

NTL E-Nose — Patient 1.txt

Let’s write a function that reads data from individual patient files.

It should handle both training and test files, checking for the proper format and extracting the patient ID and COVID-19 result.

For test files, the result is set to “UNKNOWN”. It reads the sensor data into a DataFrame, ensuring the column names have no leading or trailing spaces.

Apply function

Let’s use that function to read all patient data files from the train and test directories. It uses glob to get all file paths and then calls the read_data function to process each file. The results are stored in train_data and test_data lists.

We have 45 training files and 18 test files.
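A sketch of this reading step, assuming each file starts with header lines for the patient ID and diagnosis followed by a tab-separated sensor table (the exact file layout and directory paths may differ from the real data):

```python
import glob
import logging
from io import StringIO

import pandas as pd

logger = logging.getLogger("enose")

def read_data(filepath, is_test=False):
    """Read one patient file into (patient_id, result, sensor DataFrame)."""
    with open(filepath) as f:
        lines = f.readlines()

    # Assumed header format, e.g. "Patient ID: 1" and "Result: POSITIVE"
    patient_id = lines[0].split(":")[-1].strip()
    result = "UNKNOWN" if is_test else lines[1].split(":")[-1].strip()

    # Remaining lines: a timestamp column plus the sensor columns D1..D64
    df = pd.read_csv(StringIO("".join(lines[2:])), sep="\t")
    df.columns = df.columns.str.strip()  # drop leading/trailing spaces

    logger.info("Read %d rows for patient %s", len(df), patient_id)
    return patient_id, result, df

# Gather every patient file from the (assumed) directory layout
train_files = sorted(glob.glob("dataset/train/*.txt"))
test_files = sorted(glob.glob("dataset/test/*.txt"))

train_data = [read_data(p) for p in train_files]
test_data = [read_data(p, is_test=True) for p in test_files]
```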

Preprocess

Let’s preprocess the sensor data by extracting summary statistics (mean, standard deviation, max, min) for each sensor in the DataFrame.

It constructs a feature dictionary for each patient, including the patient ID and the binary COVID-19 result (1 for positive, 0 for negative), and returns a DataFrame of these features.

We drop the result column from the test features since it's unknown. The patient_id column is also dropped as it’s not a useful feature for training.

For the training data, it separates the features (X_train) and target (y_train).
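A sketch of this feature extraction, assuming each patient is a (patient_id, result, sensor DataFrame) tuple as produced by the reading step:

```python
import pandas as pd

def extract_features(patients):
    """Build one row of summary statistics per patient.

    `patients` is assumed to be a list of (patient_id, result, df) tuples,
    where df holds the numeric sensor columns D1..D64.
    """
    rows = []
    for patient_id, result, df in patients:
        feats = {"patient_id": patient_id,
                 "result": 1 if result == "POSITIVE" else 0}
        for col in df.select_dtypes("number").columns:
            feats[f"{col}_mean"] = df[col].mean()
            feats[f"{col}_std"] = df[col].std()
            feats[f"{col}_max"] = df[col].max()
            feats[f"{col}_min"] = df[col].min()
        rows.append(feats)
    return pd.DataFrame(rows)
```

From the resulting frames, `X_train` would drop `patient_id` and `result`, `y_train` would keep `result`, and the test frame would drop both columns.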

Data

Here’s what our data looks like now.

Scale

We normalize the training and test data using StandardScaler from scikit-learn. Normalization ensures that all features contribute equally to the model by scaling them to have a mean of 0 and a standard deviation of 1.
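A sketch of the scaling step, with toy stand-ins for the real feature matrices; the key point is fitting the scaler on the training data only and reusing its statistics for the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the real X_train / X_test feature matrices
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 25.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```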

Fit model

We fit a simple logistic regression model and evaluate it using 5-fold cross-validation to estimate its performance. Cross-validation helps in assessing how well the model generalizes to unseen data.

The model is then trained on the entire training set and used to predict the outcomes for the test set.
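A sketch of this modeling step, again with random toy data standing in for the scaled features and labels (45 training and 18 test rows, matching the competition split):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-ins for the scaled features and labels
rng = np.random.default_rng(0)
X_train_scaled = rng.normal(size=(45, 10))
y_train = np.array([0, 1] * 22 + [0])  # 23 negatives, 22 positives
X_test_scaled = rng.normal(size=(18, 10))

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation to estimate generalization accuracy
scores = cross_val_score(model, X_train_scaled, y_train, cv=5,
                         scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Refit on the entire training set, then predict the test set
model.fit(X_train_scaled, y_train)
test_predictions = model.predict(X_test_scaled)
```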

Predict
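Assembling the submission could look like the sketch below; the real column names and row order must come from `submission_example.csv` and `train_test_split_order.csv`, so the names here are placeholders:

```python
import pandas as pd

# Placeholder IDs and predictions; in practice these come from the test
# files and the trained model, ordered per train_test_split_order.csv
patient_ids = [2, 3, 5]
test_predictions = [1, 0, 1]

submission = pd.DataFrame({"patient_id": patient_ids,
                           "result": test_predictions})
submission.to_csv("submission.csv", index=False)
```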

That’s all for this starter solution!

Let’s head over to the platform and submit!

Here’s how our simple model did!

Next steps

This was a simple baseline solution. There’s plenty of room for improvement:

  • interpolate the other measurement values
  • create temporal features from the measurement timestamps
  • experiment with hyperparameter tuning and use different models

Thanks for reading

Be sure to follow the bitgrit Data Science Publication to keep updated!

Want to discuss the latest developments in Data Science and AI with other data scientists? Join our discord server!

Follow Bitgrit below to stay updated on workshops and upcoming competitions!

Discord | Website | Twitter | LinkedIn | Instagram | Facebook | YouTube
