DEVELOPING A SEPSIS PREDICTION APPLICATION USING FASTAPI AND MACHINE LEARNING CLASSIFICATION

JustinJabo
10 min read · Aug 20, 2023


1.0 What is Sepsis?

Sepsis is a potentially fatal condition in which the body’s response to an infection damages its own tissues and organs. It is a leading cause of in-hospital deaths and a major global health concern. Sepsis can affect people of all ages, and effective treatment requires prompt medical attention.

Sepsis is difficult to manage clinically because it progresses quickly and its early signs are vague. Early identification and prediction of sepsis are essential for starting treatment on time and improving patient outcomes. Machine learning techniques offer one way to build predictive models that help identify patients who may develop sepsis, allowing medical personnel to intervene early and potentially save lives.

1.1 Goals and Objectives

The aim of this project is to develop a machine learning model that can reliably predict sepsis from clinical data. Using a dataset that includes relevant variables such as blood test results, blood pressure, BMI, and patient age, we will train a classification model that can accurately distinguish sepsis from non-sepsis cases. The resulting model will then be deployed as a FastAPI-based API inside a Docker container.

1.2 Data Overview

The data for this project is in CSV format. The columns present in the data are described below.

ID — Unique patient ID number

PRG — Plasma glucose level

PL — Blood work result 1 (mu U/ml)

PR — Blood pressure (mm Hg)

SK — Blood work result 2 (mm)

TS — Blood work result 3 (mu U/ml)

M11 — Body mass index (weight in kg/(height in m)²)

BD2 — Blood work result 4 (mu U/ml)

Age — Patient’s age (years)

Insurance — Whether the patient holds a valid insurance card

Sepsis — Target variable: Positive if a patient in ICU will develop sepsis, Negative otherwise

You can find the dataset in the GitHub repo below:

[GitHub Sepsis Repository]
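Before any analysis, the CSV can be loaded and inspected with pandas. The snippet below is a minimal sketch; the file name "train.csv" and its local path are assumptions, so adjust them to wherever your copy of the dataset lives.

```python
# A minimal sketch of loading and inspecting the data with pandas.
# Assumption: the training CSV has been saved locally as "train.csv".
import pandas as pd

df = pd.read_csv("train.csv")

print(df.shape)          # number of rows and columns
print(df.dtypes)         # data type of each column
print(df.head())         # first few records
print(df.isna().sum())   # missing values per column
```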

2.0 Ask Stage

Listed here are the questions we plan to answer by the end of the analysis. The following hypothesis and questions were formulated to guide the analysis.

2.1 Hypothesis

Null Hypothesis (H0): Age does not determine whether a patient will develop Sepsis.

Alternative Hypothesis (H1): Age determines whether a patient will develop Sepsis.

For our data, the p-value acts as a “strength meter”: it tells us how strong the evidence is against the null hypothesis. A small p-value indicates strong evidence against the null hypothesis, whereas a large p-value indicates weak evidence, meaning the null hypothesis cannot be rejected.

The t-statistic acts as a “difference detector” between two groups: it compares the difference in group means to the variation within each group. The larger the t-statistic, the greater the difference between the groups being compared.

To put it another way, the t-statistic measures the size of the difference between two groups, and the p-value helps us decide whether that difference is statistically significant.

The extremely small p-value of 3.45e-07 provides strong evidence of a significant difference in mean age between patients with and without sepsis.

The large t-statistic of about 5.16 indicates a substantial difference between the two groups being compared.

Null Hypothesis: Age does not determine whether a patient will develop Sepsis.

Null Hypothesis Rejected!
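For reference, a two-sample t-test like the one above could be computed with SciPy as follows. This is only a sketch: the DataFrame name `df`, the column names "Age" and "Sepssis", and the "Positive"/"Negative" labels are assumptions about how the data is loaded.

```python
# A sketch of the two-sample t-test behind the hypothesis above, using SciPy.
# Assumptions: `df` has an "Age" column and a "Sepssis" target column holding
# the strings "Positive" / "Negative".
from scipy import stats

age_positive = df.loc[df["Sepssis"] == "Positive", "Age"]
age_negative = df.loc[df["Sepssis"] == "Negative", "Age"]

t_stat, p_value = stats.ttest_ind(age_positive, age_negative)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.2e}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: mean age differs between the groups.")
else:
    print("Fail to reject the null hypothesis.")
```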

2.2 Questions

For our Exploratory Data Analysis (EDA), we must ask the right questions:

1. Is the train dataset complete?

2. What are the ages of the youngest and oldest patients?

3. What are the ages of the youngest and oldest patients with sepsis?

4. What is the average age?

5. What is the ratio of sepsis-positive patients to sepsis-negative patients?

6. What are the highest and lowest BMI values?

7. What is the average BMI?

8. Is there a correlation between sepsis status and the other attributes?

2.3 Univariate Analysis

Observations:

The majority of patients had blood work 1 and 3 performed.

Most patients’ blood pressure falls between 60 and 80 mm Hg.

Most patients have a plasma glucose level below five.

Most patients are under 40 years old.

2.4 Bivariate Analysis

It is evident that younger patients are more often sepsis-negative than older patients.

This suggests that our null hypothesis, which states that age has no influence on a patient’s likelihood of developing sepsis, is false.

Sepsis is less common in patients with lower Body Mass Index (BMI).

Correlation heatmap of all numerical variables.

We will later remove one column from each pair of columns that has a correlation of 0.5 or higher.
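To make the 0.5 threshold concrete, the correlation matrix can be computed and the strongly correlated pairs listed. The snippet below is a minimal sketch, assuming `df` is the training DataFrame from the earlier loading step.

```python
# A minimal sketch of locating highly correlated feature pairs.
# Assumption: `df` is the training DataFrame loaded earlier.
corr = df.select_dtypes("number").corr()

threshold = 0.5
for i, col_a in enumerate(corr.columns):
    for col_b in corr.columns[i + 1:]:
        value = corr.loc[col_a, col_b]
        if abs(value) >= threshold:
            print(f"{col_a} vs {col_b}: {value:.2f}")
```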

Let’s examine the insurance variable more closely.

3.0 Data Preparation and Processing

We now arrange the data such that it is ready for analysis. Here, the goals are data consistency and cleanliness.

3.1 Issues with the data

1. Many columns contain an implausibly large number of zeros.

2. The column names are not descriptive.

3. The classes in the target variable “Sepssis” may be imbalanced.

4. Several of the numerical columns have a large number of outliers.

5. Multicollinearity could result from correlations between some of the predictor variables.

3.2 Cleaning the Data

Since this article is an overview of my research and findings on sepsis prediction, this data cleaning section concentrates on the main tasks performed on the DataFrames; a short code sketch of the first two steps follows the list.

1. Replace the zeros in each column with the column median.

2. Rename the columns to make them more descriptive and easier to read.

3. Address the imbalanced classes in the target variable with techniques such as undersampling or oversampling.

4. Identify and handle outliers using visualizations such as box plots and scatter plots.

5. Use correlation analysis to find highly correlated variables, then consider transforming or removing them.
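The sketch below illustrates steps 1 and 2. The new column names are illustrative assumptions that simply follow the column descriptions given earlier, and `df` is the raw training DataFrame.

```python
# A minimal sketch of cleaning steps 1 and 2: replacing zeros with the column
# median and renaming the columns. The new names below are illustrative.
import numpy as np

rename_map = {
    "PRG": "Plasma_Glucose",
    "PL": "Blood_Work_R1",
    "PR": "Blood_Pressure",
    "SK": "Blood_Work_R2",
    "TS": "Blood_Work_R3",
    "M11": "BMI",
    "BD2": "Blood_Work_R4",
}
df = df.rename(columns=rename_map)

# A value of zero is implausible for these measurements, so treat zeros as
# missing values and fill them with the column median.
for col in rename_map.values():
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].median())
```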

4.0 Answering the Questions

Here, I use code and visualizations to merge the “Analyze” and “Share” stages of the data analysis process; a short code sketch showing how these figures can be computed follows the answers.

4.1. Is the train dataset complete?

The dataset has no missing values.

4.2. What are the ages of the youngest and oldest patients?

The youngest patient is 21.0 years old and the oldest is 64.0 years old.

4.3. What are the ages of the youngest and oldest patients with sepsis?

The youngest sepsis patient is 21.0 years old and the oldest is 64.0 years old.

4.4. What is the average age?

The average (mean) age is 33.32 years.

4.5. What is the ratio of sepsis-positive patients to sepsis-negative patients?

The ratio of sepsis-positive to sepsis-negative patients is 0.54.

4.6. What are the highest and lowest BMI values?

The highest BMI is 50.51 and the lowest is 18.20.

4.7. What is the average BMI?

The average BMI is 32.34.

4.8. Is there a correlation between sepsis status and the other attributes?

Sepsis status is moderately correlated with BMI and Blood_Work_R1.
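For reference, the figures above can be computed along the lines of the sketch below. It assumes `df` is the cleaned training DataFrame and that the column names "Age", "BMI" and "Sepssis" follow the renaming sketch shown earlier.

```python
# A minimal sketch of how the answers above can be computed.
# Assumption: `df` is the cleaned training DataFrame.
print("Youngest / oldest age:", df["Age"].min(), "/", df["Age"].max())
print("Average age:", round(df["Age"].mean(), 2))

counts = df["Sepssis"].value_counts()
print("Positive-to-negative ratio:", round(counts["Positive"] / counts["Negative"], 2))

print("Lowest / highest BMI:", df["BMI"].min(), "/", df["BMI"].max())
print("Average BMI:", round(df["BMI"].mean(), 2))
```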

5.0 Feature Processing & Engineering

This section is dedicated to dataset cleaning, processing, and feature creation.

5.1 Check and Drop Duplicates.

5.2 Drop Unnecessary Columns.

5.3 Dataset Splitting.

The training data was split into a training set and an evaluation (test) set.
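A minimal sketch of the split with scikit-learn's train_test_split is shown below. Treating "ID" and "Insurance" as non-predictive, the 80/20 ratio, and the random_state value are illustrative assumptions.

```python
# A minimal sketch of the train/evaluation split.
# Assumption: `df` is the cleaned training DataFrame from the earlier steps.
from sklearn.model_selection import train_test_split

X = df.drop(columns=["ID", "Insurance", "Sepssis"])  # features (ID/Insurance assumed non-predictive)
y = df["Sepssis"]                                     # target

X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_eval.shape)
```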

5.4 Data Imbalance Check.

Because the dataset is imbalanced, accuracy alone is not a reliable metric for model selection. To resolve this, we will oversample the minority class with RandomOverSampler.

Our outcome is as follows:

From the output above, we can confirm that the imbalance in the dataset has been corrected.
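A minimal sketch of this oversampling step with imbalanced-learn, assuming `X_train` and `y_train` come from the split shown earlier:

```python
# A minimal sketch of oversampling the minority class with imbalanced-learn.
# Assumption: X_train and y_train come from the earlier split.
from imblearn.over_sampling import RandomOverSampler

sampler = RandomOverSampler(random_state=42)
X_train_balanced, y_train_balanced = sampler.fit_resample(X_train, y_train)

print(y_train.value_counts())            # class counts before oversampling
print(y_train_balanced.value_counts())   # class counts after oversampling
```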

5.5 Feature Scaling
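The numeric features are then scaled. The sketch below uses StandardScaler as an illustration and assumes the balanced training set from the previous step.

```python
# A minimal sketch of feature scaling with StandardScaler. The scaler is fit
# on the (balanced) training data only and then applied to the evaluation
# data, which avoids leaking information from the evaluation set.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_balanced)
X_eval_scaled = scaler.transform(X_eval)
```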

6.0 Machine Learning Modeling

Choosing an algorithm is a crucial challenge, since no single algorithm works best for every machine learning project. In general, we must evaluate a set of candidate algorithms and select those that perform better for further evaluation.

Seven algorithms, all available through Scikit-Learn or Scikit-Learn-compatible libraries such as XGBoost, are compared and evaluated in this project; a comparison sketch follows the list.

1. Logistic Regression

2. RandomForest Classifier

3. XGBoost Classifier

4. K Nearest Neighbors

5. Support Vector Machines

6. DecisionTreeClassifier

7. Gradient Boosting Classifier Model
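As a first pass, each candidate can be trained on the balanced, scaled training set and scored on the evaluation set. The sketch below is illustrative: it keeps default hyperparameters, reuses the variables from the earlier preprocessing sketches, and encodes the Positive/Negative labels numerically because XGBoost expects numeric class labels.

```python
# A minimal sketch of comparing the candidate models with the F1 score.
# Assumptions: X_train_scaled, X_eval_scaled, y_train_balanced and y_eval come
# from the earlier preprocessing sketches; default hyperparameters are used.
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

# Encode "Negative"/"Positive" as 0/1 (XGBoost requires numeric labels).
encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train_balanced)
y_eval_enc = encoder.transform(y_eval)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(random_state=42),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train_scaled, y_train_enc)
    preds = model.predict(X_eval_scaled)
    print(f"{name}: F1 = {f1_score(y_eval_enc, preds):.3f}")
```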

6.1. Models Comparison

6.2. Evaluation of the chosen Model

What is k-Fold cross validation?

K-fold cross-validation is a technique for assessing and evaluating a machine learning model’s performance. Here’s a brief and straightforward overview:

  1. The dataset is divided into K equal-sized subsets, called folds.
  2. The model is trained K times; each time, K-1 folds are used for training and the remaining fold for testing.
  3. The performance on the held-out fold is recorded at every iteration, giving K performance estimates.
  4. The average of the K performance estimates represents the model’s overall performance.
  5. Compared with a single train-test split, k-fold cross-validation gives a more reliable estimate of model performance and is less sensitive to how the data happens to be split.
  6. It is frequently used for model selection, hyperparameter tuning, and algorithm comparison.

To sum up, K-fold cross-validation, which involves training and testing a model numerous times on various subsets of the dataset, aids in determining how well a model will perform on unseen data.
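As an example, 5-fold cross-validation for the chosen Gradient Boosting model might look like the sketch below. The scaled features and encoded labels from the earlier sketches are assumed, and k = 5 is an illustrative choice.

```python
# A minimal sketch of 5-fold cross-validation with scikit-learn.
# Assumptions: X_train_scaled and y_train_enc come from the earlier sketches.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

model = GradientBoostingClassifier(random_state=42)
scores = cross_val_score(model, X_train_scaled, y_train_enc, cv=5, scoring="f1")

print("Per-fold F1 scores:", [round(s, 3) for s in scores])
print(f"Mean F1: {scores.mean():.3f}")
```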

6.3. Hyperparameter tuning.

Two models were selected for hyperparameter tuning:

a. RandomForest Classifier

b. Gradient Boosting Classifier

Up to this point, we have split the data into two sets: a training set used to fit the model’s parameters and a test set used to evaluate its performance. Hyperparameter tuning is the next phase in the machine learning process: the model is evaluated with various combinations of hyperparameters, and the combination that yields the best result, according to a chosen metric and validation method, is selected.

a. RandomForest Classifier

We first define a grid of parameter values for the grid search (GridSearchCV) to loop over, which can produce thousands of candidate combinations.
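The exact grid used in the project is not reproduced here; the sketch below shows the general pattern with illustrative parameter values, reusing the scaled features and encoded labels from the earlier sketches.

```python
# A minimal sketch of a grid search for the RandomForest model.
# Assumptions: X_train_scaled and y_train_enc come from the earlier sketches;
# the parameter values below are illustrative, not the exact grid used.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",
    cv=5,
    n_jobs=-1,
)
grid_search.fit(X_train_scaled, y_train_enc)

print("Best parameters:", grid_search.best_params_)
print(f"Best cross-validated F1: {grid_search.best_score_:.3f}")
```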

The F1 score improved from 0.83 to 0.85, so the tuned model performs better.

b. Gradient Boosting Classifier (Best Model)

We repeat the same procedure for this model. This is the score we obtain.

Let’s now use pickle to export our scaler and our best model so we can use them to build a web application with FastAPI.

7.0 Exporting ML Components

Fantastic! Let’s now build our machine learning application. Before we begin, you need to understand what an API is used for.
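For completeness, here is a sketch of how the components could be exported and then served. The .pkl file names, the feature names, and the request schema are illustrative assumptions; the actual app in the repository may differ.

```python
# A minimal sketch of exporting the fitted scaler and the best model with pickle.
# Assumptions: `scaler` and `grid_search` come from the earlier sketches, and
# the .pkl file names are illustrative.
import pickle

with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

with open("best_model.pkl", "wb") as f:
    pickle.dump(grid_search.best_estimator_, f)
```

A FastAPI app can then load these files and expose a prediction endpoint, along these lines:

```python
# app.py - a minimal FastAPI sketch that loads the exported components.
# The feature names mirror the renamed columns used in the cleaning sketch and
# must match the order the model was trained on.
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Sepsis Prediction API")

with open("scaler.pkl", "rb") as f:
    scaler = pickle.load(f)
with open("best_model.pkl", "rb") as f:
    model = pickle.load(f)


class PatientData(BaseModel):
    Plasma_Glucose: float
    Blood_Work_R1: float
    Blood_Pressure: float
    Blood_Work_R2: float
    Blood_Work_R3: float
    BMI: float
    Blood_Work_R4: float
    Age: float


@app.post("/predict")
def predict(data: PatientData):
    features = [[
        data.Plasma_Glucose, data.Blood_Work_R1, data.Blood_Pressure,
        data.Blood_Work_R2, data.Blood_Work_R3, data.BMI,
        data.Blood_Work_R4, data.Age,
    ]]
    scaled = scaler.transform(features)
    prediction = model.predict(scaled)[0]
    return {"sepsis_prediction": "Positive" if prediction == 1 else "Negative"}
```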

To gain additional insight, and to view my code and repository, please visit my GitHub account via the link below: [GitHub Sepsis Prediction Repository]
