Classification algorithm model to determine a fetus’s health

Raffaela Loffredo
14 min read · Sep 19, 2023


Click here to read this article in Portuguese.

In this article, I present the step-by-step process I followed to build a model that classifies the health of a fetus, with the help of the Pandas Profiling and PyCaret libraries.

Summary

  1. Fetus’s Health
  2. General Objectives
    2.1. Specific Objectives
  3. Obtaining the Data
  4. Variable Dictionary
  5. Data and Library Importation
  6. Exploratory Data Analysis
    6.1. Why use Pandas Profiling?
    6.2. Generate Report
  7. Data Preprocessing
  8. Creating Machine Learning Models
    8.1. Why use PyCaret?
    8.2. Configuration
    8.3. Creation and algorithms comparison
    8.4. Model Adjustment
    8.5. Visualization
    8.6. Predictions
    8.7. Finalize Model
  9. Test Set Evaluation
  10. Conclusion

1. Fetus’s Health

To take care of the health of the fetus as well as the pregnant woman, there is a medical specialty known as perinatology. This branch of medicine aims to prevent complications to ensure that everything goes well from conception to the birth of the baby.

There are various procedures that can be performed, including prenatal exams, monitoring in cases of pre-existing health problems in the pregnant woman (such as high blood pressure and diabetes), checking the development of the fetus, and screening for congenital and genetic abnormalities, among others.

This monitoring is carried out to prevent pathologies in both the mother and the baby, as well as to avoid mortality. The topic is so important that the United Nations (UN) has included the reduction of maternal and neonatal mortality rates among its Sustainable Development Goals.

Among the various examinations that can be performed to assess the health of the fetus, cardiotocography (CTG) is considered simple, low-cost, and non-invasive. It involves the collection of the fetal heart rate as well as uterine contractions. For these reasons, this project will be built using the data collected from this examination.

2. General Objectives

Develop an algorithm to classify the health of the fetus based on the results of cardiotocography (CTG) exams.

2.1. Specific Objectives

  • Perform an exploratory data analysis to understand the dataset and check for the presence of abnormalities that require appropriate treatments.
  • Create machine learning models using different algorithms.
  • Evaluate models to find the one with the best performance.

3. Obtaining the Data

The data used in this project is available on Kaggle and is based on the study by Ayres de Campos et al. (2000). The file was also saved in the cloud and can be found at this link in case it becomes unavailable in the future.

4. Variable Dictionary

Understanding the dataset involves examining the variables present within it to enable a thorough analysis. Below is a summary of the identified attributes along with their respective meanings.

in alphabetical order

abnormal_short_term_variability: percentage of time with abnormal short-term variability
accelerations: number of accelerations per second
baseline value: baseline Fetal Heart Rate (FHR)
fetal_health: fetal health (1. normal; 2. suspicious; 3. pathological), target class
fetal_movement: number of fetal movements per second
histogram_max: maximum histogram value
histogram_mean: histogram mean value
histogram_median
histogram_min: minimum histogram value
histogram_mode: histogram most common value
histogram_number_of_peaks
histogram_number_of_zeroes
histogram_tendency
histogram_variance
histogram_width
light_decelerations: number of light decelerations per second
mean_value_of_long_term_variability: average long-term variability
mean_value_of_short_term_variability: average short-term variability
percentage_of_time_with_abnormal_long_term_variability
prolongued_decelerations: number of prolonged decelerations per second
severe_decelerations: number of severe decelerations per second
uterine_contractions: number of uterine contractions per second

Note: Some variables were considered self-explanatory, so their descriptions were left blank.

5. Data and Library Importation

When starting a project, we need to install packages, import the libraries whose functions will be used in the subsequent code, and configure the code output. We then import the dataset and save it in a variable for later use.

# install additional packages
!pip install pycaret -q # auto machine-learning
!pip install pandas_profiling -q # production of exploratory analysis report
# import libraries
import pandas as pd # data manipulation
from pandas_profiling import ProfileReport # production of exploratory analysis report
from pycaret.classification import * # create classification models
# import dataset and save it in the 'df_raw' variable
data_path = 'https://www.dropbox.com/scl/fi/ztok9mb5oni0j14xb48ek/fetal_health.csv?rlkey=8gu4bmlif43a3bf0hvr15dfmq&dl=1'
df_raw = pd.read_csv(data_path)
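Optionally, a quick sanity check (not in the original notebook) confirms the file loaded as expected; the shape should match the 2,126 records and 22 attributes described later:

# optional sanity check: confirm the dataset loaded correctly
print(df_raw.shape) # expected: (2126, 22)
df_raw.head()       # inspect the first records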

6. Exploratory Data Analysis

This is an essential step in data science projects where the aim is to gain a better understanding of the data by identifying patterns, outliers, potential relationships between variables, etc. In this study, information that is relevant to guiding the objectives stated earlier (see General Objective and Specific Objectives) has been explored.

To do this, I will generate a report summarizing the data with the ProfileReport function. From this report, a deeper analysis can be conducted if necessary; however, it already provides enough information to identify anomalies such as outliers and data imbalances.

6.1. Why use Pandas Profiling?

The Pandas Profiling library offers the ProfileReport function, which automatically creates a complete exploratory data analysis report. It includes the following information:

  • Overview
    With the number of records and attributes in the dataset, as well as missing and duplicate values.
  • Descriptive Statistics
    For each variable, it will calculate: mean, median, maximum and minimum values, standard deviation, quartiles, and other statistical data.
  • Distribution
    For numeric variables, it will create charts such as histograms, box plots, density plots, making it visually easy to understand the data distribution.
  • Frequency
    For numeric attributes, it will identify the main unique values and their respective counts.
  • Correlation
    A correlation matrix is generated to check the relationship between variables.

Therefore, the use of ProfileReport streamlines the routine of exploratory data analysis. However, it does not exempt the data scientist from conducting an in-depth study of the report to understand and interpret the data, as well as to identify any potential issues that require treatment.
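As a complement to the report’s correlation matrix, the strongest correlation pairs can also be inspected directly with pandas. A minimal sketch, not part of the original workflow:

# complement to the report: list the strongest correlation pairs
corr = df_raw.corr().abs()

# unstack the matrix, drop self-correlations, and sort by strength
pairs = corr.unstack()
pairs = pairs[pairs < 1.0].sort_values(ascending=False)
print(pairs.head(10)) # each pair appears twice because the matrix is symmetric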

6.2. Generate Report

# create report
report = ProfileReport(df_raw)

# view report
report.to_notebook_iframe()
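The report can also be exported as a standalone HTML file with the library’s to_file method, which is handy for sharing outside the notebook (the file name below is arbitrary):

# optionally export the report to a standalone HTML file
report.to_file('fetal_health_report.html')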

The report generated above indicates the following:

  • The dataset consists of 2,126 records and 22 attributes.
  • Upon examining the first and last entries of the dataset, it appears to be well-filled and without anomalies.
  • There are no missing data.
  • There are 11 duplicate records, which represent 0.5% of the dataset.
  • Several strong correlations were detected, which is common when data is somewhat repeated but expressed differently, such as the mean, mode, median, maximum, and minimum values of the histogram, which is the end result of this type of examination.
  • The attribute severe_decelerations is imbalanced, which is expected since such occurrences are less common.
  • Several variables have a large number of zero values, which is also not a significant issue as it is common in this type of examination where the primary concern is data that deviates from the norm.
  • All data are numeric and expressed as float, even though most values are whole numbers. In a large dataset this could significantly impact memory usage, but with this small dataset it is not necessary to change the data types to int.

Therefore, the only necessary treatment is the identification and removal of duplicate records.
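For reference only, since no conversion is needed here: if memory did matter, the float columns holding only whole numbers could be downcast along these lines (a sketch under that assumption):

# sketch: downcast float columns that hold only whole numbers
import numpy as np

df_int = df_raw.copy()
for col in df_int.columns:
    if (df_int[col] % 1 == 0).all(): # column contains only whole numbers
        df_int[col] = df_int[col].astype(np.int32)
print(df_int.dtypes)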

7. Data Preprocessing

In this section, I will proceed with 2 data processing steps:

  • Handling duplicate data
  • Data preparation for machine learning

As seen in the previous step, this dataset contains duplicate records. In this section, I will identify and remove them.

# identify and view duplicate records
duplicates = df_raw.duplicated()
print(df_raw[duplicates])

It can be noted that the records shown above are consistent with what was presented in the report. Let’s proceed with deleting them.

# remove duplicate records
df_clean = df_raw.drop_duplicates()

Finally, a crucial step in any machine learning project involves data preparation, which includes splitting the data into two distinct sets: a training set and a testing set. The training set is used to train the algorithm, while the testing set is used only at the end of this study to understand the model’s accuracy and errors.

This will be done at a ratio of 90% for training and 10% for testing. Additionally, I will reset the indexes of both datasets so that no original row identifiers link them, and I will display the sizes of the resulting sets.

# split data into test and training
## create variable 'test' with 10% of the original data frame
## with seed to generate reproducible results
test = df_clean.sample(frac=0.10, random_state=28)

## create variable 'train' with everything that is not in 'test'
train = df_clean.drop(test.index)

## reset index
test.reset_index(inplace=True, drop=True)
train.reset_index(inplace=True, drop=True)

# check dataset size
print(train.shape)
print(test.shape)
'''
(1902, 22)
(211, 22)
'''

8. Creating Machine Learning Models

The construction of a complete classification project using the PyCaret library involves six steps:

  1. configuration
  2. creation and algorithms comparison
  3. model adjustment
  4. visualization
  5. predictions
  6. finalize model

In the next stage, we will use the previously separated test set to evaluate the created model.

8.1. Why use PyCaret?

PyCaret is a user-friendly library that simplifies the process of building machine-learning models through automated functions for common model development steps. This makes it an extremely useful tool for quickly constructing machine learning models.

It’s worth noting that auto-machine learning tools like this do not replace the work of data scientists. They should be used to support and expedite the model-building process.

8.2. Configuration

The first step involves providing the necessary parameters to configure our classification problem.

To do this, we specify:

  • data: the dataset to be used, in this case train;
  • target: the attribute we want the algorithm to predict, fetal_health;
  • train_size: 75% of the data for the training set, leaving the remaining 25% for internal validation;
  • normalize: set to True so that the data is normalized (using the library’s default, z-score);
  • transformation: when set to True, applies a power transformation to make the data more Gaussian-like;
  • remove_multicollinearity: activates the threshold specified in multicollinearity_threshold, which determines the minimum absolute Pearson correlation for flagging highly correlated features;
  • session_id: an integer seed so that the results can be reproduced in any environment.

Multicollinearity refers to high correlation between independent variables. As we saw in the Exploratory Data Analysis, this is an issue in this dataset, and it can cause problems for the model to correctly identify the classes. By enabling the removal of multicollinearity, you exclude variables that are highly correlated with each other since this doesn’t add advantageous information to the model. This action improves the quality and relevance of independent attributes, allowing them to contribute significantly to class prediction.
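For intuition, a rough manual equivalent of what remove_multicollinearity does is sketched below: flag one feature from every pair whose absolute Pearson correlation exceeds the threshold. PyCaret’s internal implementation may differ in its details.

# sketch: manually flag features with correlation above the threshold
import numpy as np

features = train.drop(columns='fetal_health')
corr = features.corr().abs()

# keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop) # candidates for removal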

# 1. configure classification
clf = setup(data=train, # data set
            target='fetal_health', # target class
            train_size=0.75, # define size of the division of sets
            normalize=True, # normalize data to a single scale (z-score)
            transformation=True, # transform data toward a normal distribution
            remove_multicollinearity=True, # activate the 'multicollinearity_threshold' value
            multicollinearity_threshold=0.95, # minimum value for Pearson correlation
            session_id=73) # seed to generate reproducible results

8.3. Creation and algorithms comparison

Here, all the available models in the PyCaret library will be created and trained, and then they will be evaluated using stratified cross-validation (by default, 10 folds). Additionally, the results are ranked by default based on the Accuracy score (but this could be changed if needed). I will keep this configuration since Accuracy will also be the metric used in the next step.

We will see the following evaluation metrics below:

Accuracy: proportion of correct predictions out of all predictions made
AUC: (area under the ROC curve) indicates how well the model distinguishes between classes
F1: harmonic mean of precision and recall
Kappa: agreement between model predictions and actual classes, corrected for chance
MCC: measures the quality of predictions based on the confusion matrix
Prec.: proportion of true positives among all positive predictions
Recall: proportion of actual positives correctly identified by the model
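PyCaret computes all of these internally, but for reference the same metrics are available in scikit-learn; a minimal sketch with toy labels (illustrative only):

# sketch: computing the same metrics with scikit-learn (toy data)
from sklearn.metrics import (accuracy_score, f1_score,
                             cohen_kappa_score, matthews_corrcoef)

y_true = [1, 1, 2, 3, 2, 1] # illustrative labels
y_pred = [1, 1, 2, 2, 2, 1] # illustrative predictions

print(accuracy_score(y_true, y_pred))            # Accuracy
print(f1_score(y_true, y_pred, average='macro')) # F1 (macro average)
print(cohen_kappa_score(y_true, y_pred))         # Kappa
print(matthews_corrcoef(y_true, y_pred))         # MCC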

# 2. create and compare models
best = compare_models()

Above, 15 different models were created and tested. Among them, the ones that showed the best results were the Extreme Gradient Boosting and the Gradient Boosting Classifier.

We can confirm the best-performing algorithm with the following code:

# print the best model
print(best)

Therefore, let’s continue using the Extreme Gradient Boosting.

XGBoost, also known as Extreme Gradient Boosting, is a supervised learning algorithm based on ensembles of decision trees. It has gained widespread use among professionals in the field due to the high precision and accuracy of the models it produces.

Part of its success can be attributed to the multitude of hyperparameters that can be fine-tuned, enhancing the model’s performance. As a result, XGBoost can be applied to various types of problems, including classification, regression, and anomaly detection, across a wide range of industries.

# instantiate model
xgb = create_model('xgboost')

Note that the mean metrics shown above are the same as those reported for this model in the comparison list. This is because the model is created and evaluated in exactly the same way as before.

8.4. Model Adjustment

The model will be optimized for accuracy; since that is already the default of the tune_model function, there is no need to specify it.

Recall that accuracy is one of the simplest metrics for evaluating a model: the proportion of correct predictions out of the total number of predictions made.

Its value ranges from 0 to 1.0, and the closer it is to 1.0, the better the result.

Its formula is given by:
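Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN are, respectively, the true positives, true negatives, false positives, and false negatives.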

# parameter adjustment
tuned_xgb = tune_model(xgb)

With the adjustment of parameters, the accuracy did not improve: it went from 0.9488 to 0.9467. Therefore, we keep the original model, which had the best result.
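If further tuning were desired, tune_model also accepts parameters such as optimize and custom_grid to steer the search; the grid below is purely illustrative, not the one used in this study:

# sketch: a more targeted hyperparameter search (values are illustrative)
tuned_xgb_alt = tune_model(xgb,
                           optimize='Accuracy', # metric to optimize
                           custom_grid={'max_depth': [3, 5, 7],
                                        'learning_rate': [0.01, 0.1, 0.3],
                                        'n_estimators': [100, 200, 300]})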

8.5. Visualization

In this section, I will generate various graphs to visualize and assist in understanding the created model.

First, let’s look at the confusion matrix to evaluate the model’s performance in classifying the health of the fetus. With it, we can check the model’s errors and successes.

plot_model(tuned_xgb, plot='confusion_matrix')

It can be noted that the model’s greatest difficulty is in correctly distinguishing classes 0 and 1, which accounted for 18 incorrect predictions.

Let’s also check which attributes are most relevant to the model generated with the graph below.

plot_model(tuned_xgb, plot='feature')

In the Receiver Operating Characteristic (ROC) curve, we can assess how well the model can distinguish between the different classes of the problem. In this case, it’s distinguishing between a normal, suspected, and pathological fetus. The value given by the Area Under the Curve (AUC) varies between 0 and 1, with a value closer to 1 indicating a better-performing model.

plot_model(tuned_xgb, plot='auc')

In the image below we can check the precision, recall and f1 values for each of the classes in the study.

plot_model(tuned_xgb, plot='class_report')

Next, we can visually inspect the class transition areas (shown as blue, red, and green backgrounds) and the model’s predictions (shown as geometric shapes colored according to the predicted class). This way, we can see the successes and errors of the created algorithm.

plot_model(tuned_xgb, plot='boundary')

Once again, we can see above that the greatest difficulty of the model lies in distinguishing classes 0 and 1.

Finally, we have the learning curve chart, which shows the model’s performance as the training set size increases. In other words, you can understand how the model behaves with different amounts of data, which can prompt reflection on the need for more data or on the complexity of the model.

plot_model(tuned_xgb, plot='learning')

We can see above that the model was at the beginning of a possible convergence; it might be worth investing in acquiring more data for this problem, since the cross-validation score improved significantly as more data was provided.

It’s also a good idea to print an overall interactive evaluation of the model, available with the evaluate_model function, which provides various information about the constructed model. For example, the structure of the model's pipeline and the final hyperparameters used.

evaluate_model(tuned_xgb)

8.6. Predictions

Before finalizing the model, it’s good practice to make predictions on the hold-out set that the setup function separated in the initial configuration step. This way, we check that there are no discrepancies in the results.

# make predictions
predict_model(tuned_xgb);

Note that the model performed slightly better than the result obtained in step 2, going from 0.9488 to 0.9496, which is within the standard deviation of 0.0181.

8.7. Finalize Model

In the last step of the model creation process, we finalize it with the finalize_model function, which retrains the model on the complete dataset passed to the setup function, including the portion that had been held out for validation.

# finalize model
final_xgb = finalize_model(tuned_xgb)

We can print the final model to check the parameters used in the model.

# check parameters
print(final_xgb)
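Although not shown in the original workflow, PyCaret’s save_model and load_model functions can persist the finalized pipeline (preprocessing steps included) for later use; the file name below is arbitrary:

# sketch: persist the final pipeline for later use
save_model(final_xgb, 'fetal_health_xgb')

# ...and reload it in another session
loaded_xgb = load_model('fetal_health_xgb')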

9. Test Set Evaluation

To complete this study, we use the test set that was separated in the Data Preprocessing section, so we can evaluate the model on data it has never seen before.

# prediction on unseen data
prediction = predict_model(final_xgb, data=test)
prediction.head()

It’s observed that the model achieved an accuracy of 0.9479, within the range of values obtained during testing and within the expected standard deviation of 0.0181.

Additionally, two new attributes have been introduced: prediction_label, which displays the model's prediction for the fetal health (with 1 representing normal, 2 representing suspected, and 3 representing pathological), and prediction_score, which indicates the probability the model assigned to the predicted class.
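As a quick double-check, the reported accuracy can be recomputed directly from these columns (a sketch assuming the column names shown above):

# sketch: recompute the accuracy from the prediction DataFrame
from sklearn.metrics import accuracy_score

acc = accuracy_score(prediction['fetal_health'], prediction['prediction_label'])
print(f'Accuracy on unseen data: {acc:.4f}')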

10. Conclusion

The study aimed primarily to develop an algorithm capable of classifying fetal health as normal, suspected, or pathological based on the results obtained from a cardiotocography (CTG) examination.

With the use of the Pandas Profiling library, it was possible to obtain an overview of the dataset within a few seconds, generating a comprehensive report that assisted in exploratory analysis.

After proper data preprocessing, with the help of the PyCaret library, 15 classification models were generated and evaluated using various metrics. However, the decision was made to rank the results based on the Accuracy metric. As a result, the best model created by the Extreme Gradient Boosting algorithm achieved an accuracy of 0.9488.

Following that, hyperparameter tuning of the model was performed, but it was not possible to improve the model further. In the subsequent step of testing with PyCaret data, the model performed as expected, with a slight improvement in the evaluation metric, increasing to 0.9496. Finally, in the final test using new data, the accuracy resulted in 0.9479, falling within the expected standard deviation limits, demonstrating consistency in the created model.

Get to know more about this study

This study is available on Google Colab and on GitHub; just click the links below to be redirected.

[LoffredoDS] Classification algorithm model to determine a fetus’s health.ipynb
raffaloffredo/fetus_health_classification


Raffaela Loffredo

💡 Transforming data into impactful solutions | Co-founder @BellumGalaxy | Data Scientist | Blockchain Data Analyst