Journal Club Review: “Towards an Artificial Intelligence Framework for Data-Driven Prediction of Coronavirus Clinical Severity”

Wade Schulz, MD, PhD
Published in COVID Reviews
Apr 3, 2020

The goal of COVID Reviews is to provide a series of reviews related to recent analytics-focused articles on COVID-19. Our hope is to provide short summaries and critiques in a journal club-oriented format that can be quickly digested to assess the methodologies used, populations/outcomes assessed, impact of work, and strengths/weaknesses of each article.

Article

Jiang X, Coffee M, Bari A, Wang J, Jiang X, Huang J, et al. Towards an Artificial Intelligence Framework for Data-Driven Prediction of Coronavirus Clinical Severity. Computers, Materials & Continua 2020;62(3):537–551.

Review by: Fred Warner, PhD; Bobak Mortazavi, PhD; Wade Schulz, MD, PhD*


Authors’ Aim

There are two related aims: 1) to determine which clinical characteristics predict outcomes, and 2) to predict from baseline data which patients are at high risk for severe illness. The authors also aspire to develop a general framework, but do not include such a framework within the manuscript.

Why Article Was Selected

This article was selected as it is one of the first peer-reviewed manuscripts for a COVID-19 predictive model. The article was jointly published by investigators from a number of institutions, with a population of patients from two hospitals in Wenzhou, China.

Methods Employed by Authors

The authors predict a severe outcome — acute respiratory distress syndrome (ARDS) — from a set of 11 features in a total cohort of 53 subjects (5 of whom had the outcome of interest).

Features were selected from a collection of demographic, vital sign, laboratory, and symptom data and were ranked using information gain, the Gini index, and chi-squared measures, as well as via forward variable selection, resulting in a final set of 11 features: ALT, myalgias, hemoglobin, gender, temperature, Na+, K+, lymphocyte count, creatinine, age, and white blood cell count.

The accuracy of various predictive algorithms (logistic regression, KNN, decision trees, random forests, and SVM) was assessed via 10-fold cross validation.
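The paper does not describe how its folds were constructed. A minimal pure-Python sketch (using hypothetical labels that mirror the cohort's class balance, not the authors' actual data or code) shows why plain, non-stratified 10-fold splits behave oddly with so few events:

```python
import random

random.seed(0)

# Hypothetical labels mirroring the cohort: 5 ARDS cases, 48 without.
labels = [1] * 5 + [0] * 48
random.shuffle(labels)

# Plain (non-stratified) 10-fold split: every 10th label per fold.
k = 10
folds = [labels[i::k] for i in range(k)]

# By the pigeonhole principle, 5 positives spread over 10 folds leave
# at least 5 test folds with no positive case to evaluate on.
empty = sum(1 for fold in folds if sum(fold) == 0)
print(f"test folds with zero positive cases: {empty} of {k}")
```

However the positives fall, at least half of the test folds contain no event at all, which makes per-fold accuracy hard to interpret for the outcome of interest.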

Results and Conclusions

Of note, only 5 subjects (out of 53 total with a positive SARS-CoV-2 test) developed ARDS, the outcome of interest. The accuracy of the algorithms varied greatly, ranging from 50% (logistic regression) to 80% (KNN and SVM). The most predictive features were ALT, hemoglobin, and myalgias; other features, including gender, temperature, and age, contributed to a lesser degree. Some clinically important features, such as radiological findings, were not predictive because they were common to many subjects.

Strengths of Article

The article addresses an important problem: determining early on which patients might require extra resources or attention while pandemic care capacity is strained. It also makes the point that some unsuspected clinical markers may indicate increased risk for negative outcomes.

Limitations of Article

Several major limitations exist in the article, primarily stemming from the small study size, as data-driven predictive models typically require larger populations:

  • There are only 53 subjects and only 5 positive ARDS cases.
  • In building their models, the authors use 10-fold cross-validation, which may not be the most effective approach for this dataset. With only 5 positive cases, at least 5 of the folds must contain only negatives in the test set, and it is also unclear how accuracy was measured in the process (particularly given the lack of error measurement in the reported accuracy results). No separate validation set was withheld for result reporting; this is understandable given the limited number of cases, but it limits the generalizability and interpretability of the results.
  • There is a lack of clarity around the baseline measurements. Many of the labs — including ALT and hemoglobin, both determined to be important features — were only measured for the 40 subjects coming from the main hospital. However, we do not know the hospital location of the 5 positive cases of ARDS. So, for example, it might be the case that only 3 of the 5 ARDS patients had an ALT measurement.
  • There is little indication of how the feature selection was actually performed beyond noting that features were ranked in various ways, and no explanation of why ALT, hemoglobin, and myalgia were considered the most predictive features.
  • Thresholds for determining actual classifications (from, say, a probability output) appear to have been determined in some algorithmic way, but this process is not described.
  • For evaluating the results, accuracy isn’t the most informative measure for a situation like this, with a small and imbalanced data set. Other measures, like recall and precision, would provide additional detail on performance. Similarly, given the small number of cases, a close analysis of individual cases would provide important details regarding performance of the model and impact of specific variables.

Take-Home Message

This article is an important first step in developing COVID-19 predictive models and flags potentially novel features that may be relevant. However, the small number of subjects, and the even smaller number of events, limits the practical relevance and generalizability of the findings. Similarly, the performance evaluation, which relies on accuracy alone, prevents a full assessment of the model's performance.


Dr. Schulz is an Assistant Professor of Laboratory Medicine and computational healthcare researcher at Yale School of Medicine