(NICHeD)ata Science Solutions: Classifying Pregnancy Morbidity Risk using Health Metric Deltas with IBM Cloud Pak for Data

Ainesh Pandey
IBM Data Science in Practice
11 min read · Mar 11, 2022

--

Author: Ainesh Pandey
Dev Team:
Ainesh Pandey (Team Lead), Gabriel Gilling, Demian Gass

a pair of hands wearing blue nitrile gloves, holding a small jar labeled “Vaccine Covid-19” and a pen. in the background, the person’s body is shown partially wearing a lab coat, and in the foreground, one can see a clipboard and stethoscope
Photo by Towfiqu barbhuiya on Unsplash

The United States ranks last overall among industrialized countries on maternal mortality, with a rate of 17.4 deaths per 100,000 live births. To gain a better understanding of child and birth parent health, the National Institute of Child Health and Human Development (NICHD) posted a national challenge on the Freelancer platform with the following charter: use the nuMoM2b dataset to identify future areas of research that could reduce the occurrence of adverse pregnancy outcomes.

This blog explores the IBM Data Science and AI Elite team’s submission to the challenge, which ranked 1st for both the main Innovation Award and the secondary Health Disparities Award. The solution was developed on IBM Cloud Pak for Data.

Awardees

The Data

As stated by the Pregnancy and Perinatology Branch of the NIH, the nuMoM2b (Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-Be) initiative launched in 2010 to track nulliparous individuals (those delivering for the first time). This prospective cohort study evaluated the underlying and interrelated mechanisms of several common adverse pregnancy outcomes (APOs), which can be unpredictable in people with little or no pregnancy history, with the goal of helping guide their treatment.

This initiative addressed a critical group of at-risk individuals who were understudied and represented 40% of U.S. births each year. The study aimed to help inform healthcare providers and their patients who are pregnant or considering pregnancy, and to support future research to improve care and outcomes in this group. The study focused on racially, ethnically, and geographically diverse pregnant persons through 8 clinical research sites and 12 sub-sites around the country. At four points during pregnancy, these individuals participated in a variety of tests to identify potential mechanisms of adverse outcomes and predictive factors for the outcomes.

The following were provided to participants in the challenge:

  • the nuMoM2b dataset as a CSV file
    - 9,289 rows representing nulliparous individuals
    - 11,633 columns of data for each individual
  • a spreadsheet containing summary information on each of the 79 nuMoM2b datasets
  • a data dictionary detailing each variable’s originating dataset, name, label, type, unit, and code list (if applicable)
  • database documentation for the public release of the data (descriptions of data collection forms, analysis methods, and other information)
  • a publication list of studies already carried out using the nuMoM2b data

No external data beyond these materials was allowed in the solution.

Data Processing: Parsing Covariates, Deltas, and Targets

For this challenge, we decided to assess how changes in features measured across maternal hospital visits relate to different adverse pregnancy outcomes. To that end, we divided the challenge’s dataset into three components:

  • a covariates dataset with demographic and socio-economic information,
  • a deltas dataset capturing changes in features across multiple visits, and
  • a targets dataset with outcome variables related to maternal morbidity.

We were interested in understanding how changes in these features might predict APOs. This technique could open an exciting new area of research in the medical field: how the rate of change in standardized measures of health and body metrics throughout active pregnancies relates to APOs. The intent of this approach was to identify delta features that showed signs of being predictive of certain APOs after controlling for covariates such as demographics and socio-economic status. If we were to find evidence of this predictive power, it would tell us which metrics doctors and health professionals should track during the antepartum and intrapartum phases of pregnancy when screening for APOs.

Creating the Delta Features Dataset

First, we created the deltas dataset, which captured changes in features measured across multiple visits (clinical measurements, sleep monitoring, and fetal biometry, among others).

Before calculating these delta features, we standardized all numeric features so that the deltas represented the change in a metric with respect to the population. Null values in both the encoded and numeric features were temporarily left untouched. Once this preprocessing step was complete, we began creating the deltas.

For numeric features, we simply calculated the difference in measurements between two visits.

For instance, systolic resting blood pressure was measured at Visit 2 (V2BA02a1) and Visit 3 (V3BA02a1). These two measurements were used to create a new feature, V2BA02a1_delta_V3BA02a1, which is the difference in standardized blood pressure measurements between Visit 3 and Visit 2, or $V3BA02a1 - V2BA02a1$.
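To make this concrete, here is a minimal pandas sketch of the numeric-delta computation; the values are made up and the snippet is illustrative, not the team’s actual pipeline:

```python
import pandas as pd

# Toy frame with the two systolic blood pressure measurements from the
# example (values are made up for illustration).
df = pd.DataFrame({
    "V2BA02a1": [112.0, 125.0, 118.0],  # resting systolic BP at Visit 2
    "V3BA02a1": [118.0, 120.0, None],   # resting systolic BP at Visit 3
})

# Z-score standardize each visit's measurement against the population,
# so a delta expresses change relative to the population spread.
for col in ["V2BA02a1", "V3BA02a1"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

# The delta is simply Visit 3 minus Visit 2; NaNs propagate here and
# are mean-imputed in a later step.
df["V2BA02a1_delta_V3BA02a1"] = df["V3BA02a1"] - df["V2BA02a1"]
```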

For encoded categorical features, we took a similar approach, tracking the different combinations of level changes that can occur within a feature across visits.

For instance, U1CD01 and U2CD01 track whether the placenta is implanted on the ipsilateral side for the right uterine artery during the first and second visits, respectively. Say that for a given patient, the U1CD01 value is 1.0 (Yes) and the U2CD01 value is 2.0 (No). We create a delta feature, U1CD01_delta_U2CD01, and give this patient the value “1.0-2.0”. We also treated missing values for encoded features as their own level.
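A matching sketch for encoded features, again with made-up values; casting to string keeps missing codes as their own level before the two visits are joined:

```python
import pandas as pd

# Toy frame for the placenta-implantation example (made-up values).
df = pd.DataFrame({
    "U1CD01": [1.0, 2.0, None],  # Visit 1: 1.0 = Yes, 2.0 = No
    "U2CD01": [2.0, 2.0, 1.0],   # Visit 2
})

# Casting to string turns NaN into the literal "nan", which acts as the
# missing level; the two visits then join into a combined level such as
# "1.0-2.0".
v1 = df["U1CD01"].astype(str)
v2 = df["U2CD01"].astype(str)
df["U1CD01_delta_U2CD01"] = v1 + "-" + v2
```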

Finally, once the delta features were created, we imputed the missing values within the numeric delta features using mean imputation. Because this happened after calculating the deltas between standardized features, it was akin to assuming that patients with missing values had the average amount of change found in the population.
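Continuing the numeric-delta sketch above, the post-delta imputation could look like this:

```python
# Pick out the numeric *_delta_* columns from the toy frame above and
# mean-impute their missing values: since the inputs were standardized
# first, a missing delta becomes the population's average amount of change.
delta_cols = [c for c in df.columns if "_delta_" in c and df[c].dtype != object]
df[delta_cols] = df[delta_cols].fillna(df[delta_cols].mean())
```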

Creating the Covariates Dataset

When running predictive models, it is important to adjust/control for important covariates that are likely to account for the variation observed in the target variable. For this purpose, we created the covariates dataset, which sought to capture characteristics of pregnant individuals before their pregnancies.

As such, our covariates dataset consisted of the following information:

  • Demographic variables:
    - gestational age at screening
    - age
    - race
    - BMI
    - education level
    - gravidity
    - smoking history
    - insurance status
  • Other important variables we identified:
    - total family income
    - alcohol history
    - illegal drug history

We imputed the mean for numerical features and the mode for categorical features. We then performed Z-score standardization on numerical features.
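As a sketch, assuming the covariates sit in one pandas frame and the numeric/categorical column groupings are known (the helper below is hypothetical):

```python
import pandas as pd

def preprocess_covariates(df: pd.DataFrame,
                          numeric_cols: list[str],
                          categorical_cols: list[str]) -> pd.DataFrame:
    """Mean/mode imputation followed by z-scoring of the numeric columns."""
    out = df.copy()
    # Mean imputation for numeric covariates (e.g. age, BMI).
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    # Mode imputation for categorical covariates (e.g. race, insurance status).
    for col in categorical_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    # Z-score standardization of the numeric covariates.
    out[numeric_cols] = ((out[numeric_cols] - out[numeric_cols].mean())
                         / out[numeric_cols].std())
    return out
```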

Creating the Targets Dataset

Finally, we needed to identify targets to train our models on. We started by identifying variables related to pregnancy outcomes, zeroing in on those related to maternal morbidity. We then manually reviewed and added further features linked to complications arising from pregnancy.

This left us with a total of 18 potential targets coded as binary variables:

  • Pregnancy Outcomes:
    - Stillbirth
    - Termination
    - Miscarriage
    - Preeclampsia/Gestational hypertension
    - Chronic hypertension
  • Postpartum Complications:
    - Postpartum hemorrhage requiring transfusion
    - Retained placenta
    - Endometritis
    - Wound infection
    - Wound dehiscence requiring debridement, packing, etc.
    - Cardiomyopathy
    - Hysterectomy
    - Surgery other than for delivery of baby or hysterectomy
  • Postpartum Mental Health Conditions:
    - Postpartum depression
    - Postpartum anxiety
    - Postpartum bipolar disorder
    - Postpartum post-traumatic stress disorder
    - Postpartum schizophrenia/schizoaffective disorder

The Modeling: Classifying APOs with Underutilized Modeling Approaches

The covariate and delta features formed the base dataset for modeling. To compare the importance of coefficients or features associated with the delta variables in our results, we normalized the base dataset using min-max normalization so that every feature’s range was [0, 1]. This approach is not robust to outliers, but we had already standardized the numeric features before creating the deltas, so the impact of outliers was minimized. We then looped through the targets we had selected and trained models for each.
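A minimal sketch of this normalization step with scikit-learn, assuming the covariate and delta features have already been made fully numeric (e.g. the categorical delta levels one-hot encoded); the helper name is ours:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_base_dataset(covariates: pd.DataFrame,
                       deltas: pd.DataFrame) -> pd.DataFrame:
    """Concatenate covariate and delta features and min-max scale to [0, 1]."""
    base = pd.concat([covariates, deltas], axis=1)
    scaled = MinMaxScaler().fit_transform(base)
    return pd.DataFrame(scaled, columns=base.columns, index=base.index)
```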

The outcome variables are categorical: we predicted a Boolean outcome indicating whether an observed pregnancy would result in the selected target morbidity. Historically, medical research on classification problems has relied heavily on logistic regression techniques. However, we contend that there is much more value in ensemble and boosting methods, which usually have higher predictive power.

Ensemble methods, like random forests, bring a lot of benefits to the table.

  • Single models are usually subject to a bias/variance tradeoff. For example, an unpruned decision tree can classify every single training data point perfectly, leading to low bias and high variance (overfitting). However, a single decision stump would result in high bias and low variance (underfitting). In practice, we find that employing random forests (an ensemble of trees) leaves bias unaffected while reducing variance, allowing us to get the best of both worlds.
a graph labeled “the bias vs. variance trade-off” the x-axis is labeled “model complexity”. the graph is split in half between the “underfitting zone” to the left side of the graph and the “overfitting zone” to the right side. bias is shown to be high with underfitting and low with overfitting, while variance is low with underfitting and grows with overfitting. the generalization error reaches a minimum where the bias and variance curves meet
  • Ensemble methods can take advantage of many different types of models, each having a “vote” on the prediction of the output variable. This allows us to take advantage of the benefits of each included model, expecting the law of large numbers (or, in this case, larger numbers) to more often identify the correct classification.
n trees shown in a random-forest diagram. each tree is shown as having a different set of inputs and different decisions, and each tree casts one vote toward the final combined prediction
  • In models like random forests, bootstrapping allows individual decision trees in the random forest to “specialize” on different parts of the feature space.

Boosting algorithms similarly offer high predictive power.

  • Weak learners, like logistic regression or shallow decision trees, are good at finding general “rules of thumb” because of the associated low variance. On their own, they are not good at solving complicated problems. However, a bunch of weak classifiers that specialize in different parts of the input space can do much better than a single classifier. This is the basis of boosting.
  • Each consecutive weak learner in a boosting algorithm specializes in the part of the feature space the previous learners performed poorly on. The resulting classification is found by taking a weighted “vote” among all of the learners, with classifiers that are more “sure” of their prediction having a higher weight. In practice, we see that these boosted weak models outperform individual classifiers.
  • Boosting is often robust to overfitting. In practice, we often see the test set error continue to decrease even while the train set error holds constant (or even sits at 0!).
a boosting diagram shown with four shallow trees, h1 to h4. each tree is shown to have different weights in each boosting round, as well as when each tree is correct in each round.

For the purposes of our analysis, we ran the following three models for each target variable:

  • A Logistic Regression model, with the regularization parameter tuned
  • A Random Forest model, with various hyperparameters tuned
  • A LightGBM model (light gradient-boosting machine), with various hyperparameters tuned

In practice, we generally see random forests and boosting algorithms outperform logistic regression on complicated problems with large amounts of data. However, many of our target variables have a small number of true cases, so it was plausible that logistic regression could outperform the more complicated models for some of the output variables. We therefore let the model performance metrics themselves tell us which models did best (specifically, the F-1 score on the True class).
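A hedged sketch of that per-target comparison is below; the hyperparameter values are placeholders rather than the tuned settings, and the minimum-support check anticipates the filter described in the results section:

```python
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def compare_models(X: pd.DataFrame, y: pd.Series, seed: int = 42) -> dict:
    """Fit all three candidate models on one target; score F-1 on the True class."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    # Skip targets with fewer than 50 positive cases in the test set
    # (the support filter discussed in the results section).
    if int((y_te == 1).sum()) < 50:
        return {}
    models = {
        "logistic_regression": LogisticRegression(C=1.0, max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=500, random_state=seed),
        "lgbm": LGBMClassifier(n_estimators=500, random_state=seed),
    }
    return {name: f1_score(y_te, m.fit(X_tr, y_tr).predict(X_te), pos_label=1)
            for name, m in models.items()}
```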

The Results: Identifying Successful Models

In understanding the results of our models, we were most interested in identifying:

  • which models appeared to have the most predictive power for certain target features, and
  • which features were most impactful in predicting certain APOs

Models that perform well in predicting our target features suggest that a machine learning approach could be useful for predicting APOs in future patients. Finding features that appear to be important for those predictions provides some explainability and points to potential areas for future research, such as investigating why these features are so predictive of certain APOs.

First, we focused on models where the support for the minority class in our test set was greater than 50 (we took a 70/30 split between our train set and test set). This means that we dropped models where there were fewer than 50 observations of the “positive” instances in our test set (e.g. there were fewer than 50 cases of miscarriage in our test set, so we decided to drop those models from our analysis). We decided to take this approach because model results can be misleading and uncertain when the support is so low.

We were left with 4 target features: chronic hypertension, postpartum depression, postpartum anxiety, and preeclampsia. For these 4 targets, we selected the modeling approach (Logistic Regression, Random Forest, or LGBM) that achieved the highest F-1 score. We decided to focus on the F-1 score because of the large class imbalance in our data (most patients did not have any APOs occur). In this context, false negatives would have been extremely costly to us, and so we wanted to place an emphasis on recall. However, we also did not want to dilute our predictions with false positives. The F-1 score allowed us to optimize on both recall and precision.

For the models that remained, we extracted the top 10 most impactful features, including both covariates and the delta features. Because all of our best models were either LGBMs or Random Forests, we determined feature importance by Gini importance, which is a measure of how much the feature decreases impurity on average over all of the trees in the model. For each of the 4 target features, we analyzed the model performance and the 10 most important features driving those predictions. This provided us with insight into the feasibility of modeling certain APOs as well as the potential drivers of these outcomes.
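As a small sketch of that extraction: fitted RandomForestClassifier and LGBMClassifier models both expose feature_importances_ (for LightGBM, passing importance_type="gain" at construction is the closer analogue to impurity decrease, since its default is split count); the helper below is ours:

```python
import pandas as pd

def top_importances(model, feature_names, k: int = 10) -> pd.Series:
    """Return the k features with the highest importance from a fitted model.

    Works for RandomForestClassifier (Gini/impurity-based importance) and
    LGBMClassifier (construct with importance_type="gain" for the closest
    analogue to impurity decrease).
    """
    imp = pd.Series(model.feature_importances_, index=feature_names)
    return imp.sort_values(ascending=False).head(k)
```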

a set of four charts showing feature importance. for chronic hypertension using LGBM, BMI is shown to be the most important feature. for postpartum depression using LGBM, a delta feature is shown to be the most important. for postpartum anxiety using LGBM, the same delta feature as is most important for the postpartum depression is shown to be the most important. for preeclampsia using Random Forests, BMI is shown to be the most important feature.

Insights and Future Applications: Avenues for Empirical Research into Causality of APOs

We believe that our results present promising signs for future research. For the four different APOs, we were able to build models with strong predictive power (with an average F-1 score of ~0.7125). With further iterations in parameter tuning, feature engineering, model refinement, and data acquisition, we believe that these models have the potential to unlock new predictive capabilities in the medical industry. There would be an immense benefit in being able to accurately predict an APO for a patient before it actually occurs, allowing time to take preventative measures.

Additionally, we gained insight into the features that were the most important drivers in our models’ predictions. While the feature importance results aren’t necessarily conclusive in terms of causality, they do point to possible drivers of these APOs. This provides us with starting areas for future research. If we find that there are certain controllable factors in a birthing parent’s health that may affect the likelihood of an APO, then this could inform the care that pregnant patients receive. We hope that these results lead to a more informed understanding of APOs and drive a more informed approach to care for pregnant individuals, thereby reducing the cost of APOs throughout the country.

Summary: APO Risk Profiling and Real-Time Proactive Treatment

This analysis was enlightening, allowing us to empower the medical field with insights that could lead to the reduction of adverse pregnancy outcomes, particularly in minority populations. However, we also believe that this work could be taken further to provide immediate value.

Because the models use data collected during standard doctor’s visits for pregnant patients, we could develop a real-time risk profile from this analysis. Medical professionals may be able to identify the risk of certain APOs by the second or third visit, presenting the opportunity for proactive treatment to address concerns before they develop.

This analysis has been truly fulfilling for the IBM Data Science and AI Elite Team. We are excited by the possibilities for this project in the future, and we hope to develop a real-time tool for medical professionals to make a more immediate impact as well.

GitHub link: https://github.com/gabgilling/dse-nichd
Keywords: data science, machine learning, pregnancy, adverse pregnancy outcomes, NICHD, NIH, nuMoM2b, IBM Data Science and AI Elite

--

Ainesh is the Founding Data Product Manager for Dataherald.