Uncovering Sepsis Occurrence Secrets through Exploratory Data Analysis

Jun 14, 2023

I. Introduction

In the realm of healthcare, understanding the complex dynamics behind the occurrence of life-threatening conditions is of paramount importance. Sepsis, a potentially fatal condition resulting from the body’s extreme response to an infection, remains a major challenge for healthcare providers worldwide. Unraveling the secrets of sepsis occurrence can lead to improved early detection, timely interventions, and ultimately, better patient outcomes.

This project aims to delve into the pool of patient data, harnessing the power of data analysis and machine learning, to explore patterns and predictors associated with sepsis occurrence. By leveraging advanced computational techniques and drawing insights from comprehensive patient records, this research endeavor seeks to uncover hidden correlations, risk factors, and potential early warning signs that can facilitate earlier diagnosis and intervention.

In this article, the initial step entails conducting a thorough exploration of the data to gain insights into its nature. The focus will be on performing univariate, bivariate, and multivariate analyses of the sepsis dataset. Additionally, the article aims to test robust hypotheses that have been formulated based on the data exploration and analysis.

II. Nature of the Data

The dataset consists of several columns representing patient attributes and a target. Here’s a summary of each column:

ID: This column contains a unique number assigned to each patient as an identifier.
PRG (Plasma glucose): This attribute represents the plasma glucose level in the patient’s blood. Plasma glucose levels are important for monitoring blood sugar levels.
PL (Blood Work Result-1): This attribute represents the result of a specific blood work test conducted on the patient. The exact details or purpose of this test are not specified.
PR (Blood Pressure): This attribute represents the blood pressure of the patient, measured in millimeters of mercury (mm Hg). Blood pressure is a crucial vital sign used to assess cardiovascular health.
SK (Blood Work Result-2): Similar to the PL column, this attribute represents the result of another blood work test. The specific details of this test are not provided.
TS (Blood Work Result-3): This attribute represents the result of yet another blood work test. The specific purpose or details of this test are not specified.
M11 (Body mass index): This attribute represents the body mass index (BMI) of the patient. BMI is a measure that assesses the relationship between a person’s weight and height and is commonly used to evaluate weight status and potential health risks.
BD2 (Blood Work Result-4): This attribute represents the result of another blood work test. The specific details or purpose of this test are not provided.
Age: This attribute represents the age of the patient in years. Age can be an important factor in understanding health conditions and risks.
Insurance: This column indicates whether the patient holds a valid insurance card. It is not specified how this information is represented (e.g., binary indicator or categorical values).
Sepssis (Target): This target column categorizes patients as either positive or negative for developing sepsis while in the Intensive Care Unit (ICU). Sepsis is a severe medical condition caused by the body’s response to an infection and can lead to organ dysfunction and life-threatening complications.

In short, the dataset contains patient attributes such as plasma glucose, blood work results, blood pressure, body mass index, age, and insurance status, plus the target variable indicating the development of sepsis in ICU patients. The specific blood work tests and the encoding of the insurance column are not documented.

III. Descriptive Statistics

A. Summary statistics of patient data variables

Key information

The ‘PRG’ column has a mean of approximately 3.82 and a standard deviation of around 3.36, indicating that the values are somewhat spread out around the mean.
The ‘PRG’ column ranges from 0 to 17, the ‘PL’ column ranges from 0 to 198, and the ‘TS’ column ranges from 0 to 846.
The ‘SK’ column has a median of 23, while the 25th and 75th percentiles are 0 and 32, respectively. Because the 25th percentile is 0, at least a quarter of the records have an SK value of 0, so the distribution has a heavy concentration at zero.
The summary statistics show that approximately 68.6% of the data has an insurance value of 1, while the remaining 31.4% has an insurance value of 0. This indicates an imbalanced distribution of the ‘Insurance’ variable.
The ‘Age’ column has a minimum value of 21 and a maximum value of 81, indicating that the dataset includes patients with ages ranging from 21 to 81 years.
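The figures above are the kind of output that pandas summary helpers produce. A minimal sketch, assuming a DataFrame with the article's column names; the `toy` rows below are made-up illustration values, not the real dataset:

```python
import pandas as pd

# Hypothetical toy rows using the article's column names; values are made up.
toy = pd.DataFrame({
    "PRG": [0, 1, 3, 6, 17],
    "SK": [0, 0, 23, 32, 40],
    "Age": [21, 25, 33, 50, 81],
    "Insurance": [1, 1, 0, 1, 0],
})

# count, mean, std, min, quartiles, and max for every numeric column
summary = toy.describe()
print(summary.loc[["mean", "std", "min", "50%", "max"]])

# Share of each Insurance value (the proportions sum to 1)
insurance_share = toy["Insurance"].value_counts(normalize=True)
print(insurance_share)
```

On the real data, the same two calls would be applied to `df` instead of `toy`.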

IV. Univariate Analysis

Univariate analysis refers to the analysis of a single variable at a time. For this dataset, it involves examining each variable individually to understand its distribution, central tendency, dispersion, and other key characteristics.
The first step is to generate a kernel density estimation (KDE) plot for each variable, annotate it with summary measures such as the mean, skewness, and kurtosis, and highlight potential outliers (values more than three standard deviations from the mean). KDE is a non-parametric way to estimate the probability density function of a random variable, and it is a popular method for visualizing the distribution of data.

import matplotlib.pyplot as plt
import seaborn as sns

# df is the sepsis DataFrame loaded earlier

# Select columns to plot
cols_to_plot = ['PRG', 'PL', 'PR', 'SK', 'TS', 'M11', 'BD2', 'Age']

# Plot KDEs(kernel density estimation) for all columns
fig, axes = plt.subplots(nrows=len(cols_to_plot), figsize=(8, 40))
for i, col in enumerate(cols_to_plot):
    sns.kdeplot(data=df, x=col, ax=axes[i], fill=True)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Density')
    
    # Calculate mean, skewness, and kurtosis
    mean_val = df[col].mean()
    skewness_val = df[col].skew()
    kurtosis_val = df[col].kurtosis()
    
    # Add mean, skewness, and kurtosis as text annotations
    axes[i].text(0.6, 0.9, f'Mean: {mean_val:.2f}', transform=axes[i].transAxes)
    axes[i].text(0.6, 0.8, f'Skewness: {skewness_val:.2f}', transform=axes[i].transAxes)
    axes[i].text(0.6, 0.7, f'Kurtosis: {kurtosis_val:.2f}', transform=axes[i].transAxes)
    
    # Add mean line
    axes[i].axvline(mean_val, color='red', linestyle='--', label='Mean')
    
    # Add red dots to indicate potential outliers
    outliers = df[(df[col] > mean_val + 3 * df[col].std()) | (df[col] < mean_val - 3 * df[col].std())]
    axes[i].plot(outliers[col], [0] * len(outliers), 'ro', label='Potential Outliers')
    
    # Add legend
    axes[i].legend()
    
plt.tight_layout()
plt.show()

Based on the KDE plot of the PRG variable, the distribution is positively skewed, suggesting the presence of some higher values. The distribution is also platykurtic, indicating a flatter peak and lighter tails compared to a normal distribution.

Based on the KDE plot analysis of the PL variable, it appears that the distribution is approximately symmetric, with a mean value of 120.15. The distribution is mesokurtic, suggesting a similar shape to a normal distribution.

The KDE plot suggests that the distribution of PR (blood pressure) is negatively skewed and has a more peaked shape, with possible outliers.

The KDE plot suggests that the distribution of TS (blood work result 3) is positively skewed, with a more peaked shape and heavier tails. The tail extends to the right, so lower values are more frequent than higher values, and the heavier tails point to more outliers or extreme values.

The KDE plot suggests that the distribution of M11 (body mass index) is slightly negatively skewed, with a more peaked shape and heavier tails. The tail extends to the left, so higher values are more frequent than lower values, and the heavier tails point to more outliers or extreme values.

The KDE plot indicates a positively skewed distribution for the blood work result, with a more peaked shape and heavier tails. The tail extends to the right, so lower values are more frequent than higher values, and the heavier tails indicate a higher frequency of extreme values or outliers.

The KDE plot indicates a positively skewed distribution of age: the tail extends to the right, so younger patients are more frequent than older ones. The spread of values is otherwise fairly even, without significant outliers or extreme values.
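The labels used in these descriptions (positively or negatively skewed, platykurtic, mesokurtic, leptokurtic) can be derived from the skewness and kurtosis values annotated on each plot. A minimal sketch; the helper name `describe_shape` and the 0.5 cut-offs are assumptions chosen for illustration, not thresholds used in the article:

```python
import pandas as pd


def describe_shape(values, skew_tol=0.5, kurt_tol=0.5):
    """Label skew and tail weight of a sample (thresholds are illustrative)."""
    s = pd.Series(values, dtype="float64")
    skew = s.skew()
    # pandas reports excess kurtosis, so a normal distribution scores about 0
    excess_kurt = s.kurtosis()

    if skew > skew_tol:
        skew_label = "positively skewed"
    elif skew < -skew_tol:
        skew_label = "negatively skewed"
    else:
        skew_label = "approximately symmetric"

    if excess_kurt > kurt_tol:
        kurt_label = "leptokurtic"   # sharper peak, heavier tails
    elif excess_kurt < -kurt_tol:
        kurt_label = "platykurtic"   # flatter peak, lighter tails
    else:
        kurt_label = "mesokurtic"    # close to a normal distribution

    return skew_label, kurt_label


# Made-up samples: one with a long right tail, one evenly spread
print(describe_shape([1, 1, 1, 2, 2, 3, 4, 20]))
print(describe_shape([1, 2, 3, 4, 5]))
```

On the real data this could be called as `describe_shape(df[col])` for each column in `cols_to_plot`.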

Boxplot for Outlier Analysis
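A boxplot marks points beyond its whiskers as outliers. The sketch below reproduces that rule numerically, assuming the conventional 1.5 × IQR whisker length (the default in matplotlib and seaborn). The helper `iqr_outliers` and the sample values are hypothetical; only the maximum of 846 echoes the TS range reported earlier.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] and the bounds used."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (series < lower) | (series > upper)
    return series[mask], (lower, upper)


# Made-up values for illustration
toy_ts = pd.Series([0, 0, 30, 45, 60, 80, 95, 120, 846], name="TS")
outliers, bounds = iqr_outliers(toy_ts)
print(bounds)
print(outliers.tolist())
```

This rule differs from the three-standard-deviation rule used in the KDE plots, so the two approaches can flag different points.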

V. Bivariate Analysis

Pairwise Correlation Analysis

The analysis of the correlations suggests that attributes such as PL, M11 (BMI), and age may have a moderate positive correlation with the likelihood of developing sepsis. However, the other variables have either weak or very weak correlations, indicating limited or no meaningful relationship with sepsis development.
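Correlations with the target can be computed once the label is encoded numerically. A minimal sketch, assuming the `Sepssis` column holds 'Positive'/'Negative' labels; the `sepsis_flag` column and the toy rows are hypothetical. With a 0/1 target, the Pearson coefficient computed here is the point-biserial correlation.

```python
import pandas as pd

# Made-up rows using the article's column names
toy = pd.DataFrame({
    "PL": [85, 90, 100, 110, 140, 150, 160, 170],
    "Age": [22, 25, 24, 30, 41, 38, 50, 45],
    "Insurance": [1, 0, 1, 0, 1, 0, 1, 0],
    "Sepssis": ["Negative"] * 4 + ["Positive"] * 4,
})

# Encode the target as 1 (Positive) / 0 (Negative)
toy["sepsis_flag"] = (toy["Sepssis"] == "Positive").astype(int)

# Pearson correlation of every numeric feature with the encoded target
corr_with_target = (
    toy[["PL", "Age", "Insurance", "sepsis_flag"]]
    .corr()["sepsis_flag"]
    .drop("sepsis_flag")
    .sort_values(ascending=False)
)
print(corr_with_target)
```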

Group-by Analysis

Based on the count by age range, the (20, 30] range has the highest count, with 323 occurrences. Taken alone, this might suggest that patients aged 20 to 30 are more susceptible to sepsis than other age groups in the dataset, but the count has to be read alongside the mean.

Although the (20, 30] range has the highest count (323), its mean value, which represents the proportion of sepsis cases within the range, is relatively low (0.23).

This discrepancy is explained by group size. The (20, 30] range contains more patients than any other range, which inflates its count, while the share of those patients who develop sepsis is lower than in ranges such as (30, 40], (40, 50], and (50, 60].
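The difference between a count and a proportion is easiest to see when both are computed side by side. A minimal sketch with made-up rows; `sepsis_flag` is a hypothetical 0/1 encoding of the target, and the bin edges are illustrative:

```python
import pandas as pd

# Made-up rows: most patients are young, but few of them are positive
toy = pd.DataFrame({
    "Age": [22, 24, 25, 27, 28, 29, 35, 38, 45, 48],
    "sepsis_flag": [0, 0, 1, 0, 0, 0, 1, 0, 1, 1],
})

# Bin ages into right-closed intervals such as (20, 30]
toy["age_range"] = pd.cut(toy["Age"], bins=[20, 30, 40, 50])

by_age = toy.groupby("age_range", observed=False)["sepsis_flag"].agg(
    patients="count",  # rows in the range, regardless of label
    cases="sum",       # positive patients in the range
    rate="mean",       # proportion of positive patients
)
print(by_age)
```

In this toy table the youngest range has the most patients but the lowest rate, which is the same pattern described above.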

VI. Multivariate Analysis

VII. Hypothesis Testing

Hypothesis 1:

Higher plasma glucose levels (PRG) are associated with an increased risk of developing sepsis.

Null Hypothesis: There is no association between higher plasma glucose levels (PRG) and the risk of developing sepsis.

Alternate Hypothesis: Higher plasma glucose levels (PRG) are associated with an increased risk of developing sepsis.

Justification: Elevated glucose levels have been linked to impaired immune function and increased susceptibility to infections, including sepsis.

Positive Group:
Mean PRG: 4.778846153846154
Median PRG: 4.0
Standard Deviation: 3.7557215116186895

Negative Group:
Mean PRG: 3.317135549872123
Median PRG: 2.0
Standard Deviation: 3.0181821629514967

T-Statistic: 5.172721512358376
P-Value: 3.154172341568826e-07

Mean PRG (Plasma Glucose) in the Positive Group (patients with sepsis) is 4.78, while in the Negative Group (patients without sepsis) it is 3.32. This suggests that, on average, patients with sepsis tend to have higher plasma glucose levels compared to those without sepsis.
The median PRG in the Positive Group is 4.0, whereas in the Negative Group it is 2.0. The median represents the middle value of a dataset, and it is less affected by extreme values. This further supports the observation that the central tendency of plasma glucose levels is higher in the Positive Group.
The standard deviation of PRG in the Positive Group is 3.76, and in the Negative Group, it is 3.02. The standard deviation measures the dispersion of data points around the mean. In this case, both groups have relatively high standard deviations, indicating considerable variability in plasma glucose levels within each group.
The t-statistic is 5.17, which indicates a significant difference between the means of the Positive and Negative Groups. A larger absolute t-statistic suggests stronger evidence of a difference between the groups.
The p-value is 3.15e-07, which is very small. This indicates strong evidence against the null hypothesis (no difference between the groups) and suggests that the difference in mean plasma glucose levels between the groups is statistically significant.
In other words, there is a significant association between higher plasma glucose levels and the risk of developing sepsis.
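The reported t-statistic can be checked directly from the summary statistics with SciPy. This is a sketch under one loud assumption: the article does not state the group sizes, so `n_pos = 208` and `n_neg = 391` below are inferred values, chosen because they reproduce the reported statistic with a pooled-variance (Student's) t-test.

```python
from scipy import stats

# Summary statistics for PRG as reported in the article
pos_mean, pos_std = 4.778846153846154, 3.7557215116186895
neg_mean, neg_std = 3.317135549872123, 3.0181821629514967

# ASSUMPTION: group sizes are not given in the article; these are inferred
n_pos, n_neg = 208, 391

# Two-sided Student's t-test (equal variances) from summary statistics;
# the standard deviations are sample values (ddof=1), as pandas computes them
result = stats.ttest_ind_from_stats(
    pos_mean, pos_std, n_pos,
    neg_mean, neg_std, n_neg,
    equal_var=True,
)
t_stat, p_value = result.statistic, result.pvalue
print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.3e}")
```

With the raw data available, `stats.ttest_ind` on the two PRG samples would be the more direct route.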

Hypothesis 2:

Abnormal blood work results, such as high values of PL, SK, and BD2, are indicative of a higher likelihood of sepsis.

- Null Hypothesis: There is no association between abnormal blood work results, such as high values of PL, SK, and BD2, and the likelihood of sepsis.

- Alternate Hypothesis: Abnormal blood work results, such as high values of PL, SK, and BD2, are indicative of a higher likelihood of sepsis.

Justification: Abnormal blood work results may indicate an ongoing infection or an inflammatory response, which are key factors in sepsis development.

Positive Group (PL):
Mean: 140.28846153846155
Median: 138.0
Standard Deviation: 32.80072259040371

Negative Group (PL):
Mean: 109.44245524296676
Median: 106.0
Standard Deviation: 27.120455674778167

T-Statistic: 12.302534453051374
P-Value: 3.678522495138333e-31

The positive group (sepsis) has a higher mean (140.29) compared to the negative group (109.44), indicating that patients with sepsis tend to have higher PL levels.

- The p-value (3.68e-31) is very small, indicating strong evidence to reject the null hypothesis that there is no difference in PL levels between the groups. This suggests that higher PL levels are associated with a higher likelihood of sepsis.

Positive Group (SK):
Mean: 22.221153846153847
Median: 27.0
Standard Deviation: 17.882578211575797

Negative Group (SK):
Mean: 19.680306905370845
Median: 21.0
Standard Deviation: 14.880122549396368

T-Statistic: 1.852114423904815
P-Value: 0.06450285034380407

The mean SK level is slightly higher in the positive group (22.22) compared to the negative group (19.68), but the difference is not as pronounced as in PL.

- The p-value (0.065) is above the conventional significance level of 0.05, so the null hypothesis cannot be rejected for SK. The difference in SK levels between the groups is not statistically significant at that level.

Positive Group (BD2):
Mean: 0.5651442307692308
Median: 0.499
Standard Deviation: 0.3828026047056809

Negative Group (BD2):
Mean: 0.43652429667519177
Median: 0.34
Standard Deviation: 0.301949365423994

T-Statistic: 4.511169534202187
P-Value: 7.765417586403595e-06

The positive group has a higher mean BD2 level (0.57) compared to the negative group (0.44), indicating a potential association between higher BD2 levels and sepsis.

- The p-value (7.77e-06) is very small, providing strong evidence to reject the null hypothesis and suggesting that higher BD2 levels are associated with a higher likelihood of sepsis.
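When several columns are tested the same way, a small helper keeps the comparisons consistent. The sketch below uses Welch's t-test (`equal_var=False`), which does not assume equal variances in the two groups; this is a swapped-in variant for illustration and not necessarily the test behind the numbers above. The helper `compare_groups` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: PL is shifted between groups, SK is not
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Sepssis": ["Positive"] * 100 + ["Negative"] * 200,
    "PL": np.concatenate([rng.normal(140, 33, 100), rng.normal(109, 27, 200)]),
    "SK": np.concatenate([rng.normal(20, 15, 100), rng.normal(20, 15, 200)]),
})


def compare_groups(df, column, target="Sepssis"):
    """Compare one column between Positive and Negative groups (Welch's t-test)."""
    pos = df.loc[df[target] == "Positive", column]
    neg = df.loc[df[target] == "Negative", column]
    t_stat, p_value = stats.ttest_ind(pos, neg, equal_var=False)
    return {
        "column": column,
        "mean_pos": pos.mean(),
        "mean_neg": neg.mean(),
        "t": t_stat,
        "p": p_value,
    }


results = pd.DataFrame([compare_groups(toy, c) for c in ["PL", "SK"]])
print(results)
```

Testing three columns under one hypothesis also means three chances of a false positive, so a stricter threshold per test may be worth considering.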

Hypothesis 3:

Older patients are more likely to develop sepsis compared to younger patients.

- Null Hypothesis: There is no difference in the likelihood of developing sepsis between older and younger patients.

- Alternate Hypothesis: Older patients are more likely to develop sepsis compared to younger patients.

Justification: Advanced age is a known risk factor for sepsis, as the immune system weakens with age and may be less able to mount an effective response to infections.

Positive Group:
Mean Age: 36.69711538461539
Median Age: 35.0
Standard Deviation: 10.904929140599739

Negative Group:
Mean Age: 31.47826086956522
Median Age: 27.0
Standard Deviation: 11.913530900036795

T-Statistic: 5.254202967191448
P-Value: 2.0718778891881855e-07

The results indicate a statistically significant difference in age between the positive (sepsis) and negative (non-sepsis) groups. The positive group has a higher mean and median age compared to the negative group. Additionally, the standard deviation in the positive group is slightly lower than the negative group, indicating less variability in age among patients with sepsis.

Therefore, based on this analysis, there is evidence to support the hypothesis that older patients are more likely to develop sepsis compared to younger patients. The advanced age of patients may be a risk factor for sepsis, potentially due to the weakening of the immune system with age. Therefore, the NULL hypothesis can be rejected.

Hypothesis 4:

Patients with higher body mass index (BMI) values (M11) have a lower risk of sepsis.

- Null Hypothesis: There is no association between body mass index (BMI) values (M11) and the risk of sepsis.

- Alternate Hypothesis: Patients with higher body mass index (BMI) values (M11) have a lower risk of sepsis.

Justification: Obesity has been associated with a dampened immune response, potentially leading to a decreased risk of developing sepsis.

Positive Group:
Mean BMI: 35.385576923076925
Median BMI: 34.3
Standard Deviation: 7.195898164245342

Negative Group:
Mean BMI: 30.076470588235292
Median BMI: 29.9
Standard Deviation: 7.812731806515761

T-Statistic: 8.134971813407034
P-Value: 2.3972519626645312e-15

The results indicate a statistically significant difference in BMI between the positive sepsis group and the negative sepsis group. The positive sepsis group has a higher mean BMI (35.3856) compared to the negative sepsis group (30.0765). The t-statistic of 8.13497 suggests a substantial difference between the two groups.

Furthermore, the very small p-value of 2.39725e-15 is strong evidence against the null hypothesis (no difference in BMI between the groups). However, the direction of the difference is the opposite of what the alternate hypothesis proposed: the sepsis-positive group has the higher mean BMI, so higher BMI values are associated with a higher, not lower, likelihood of sepsis. The data therefore do not support the hypothesis that patients with higher BMI values are less likely to develop sepsis.
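Because the alternate hypothesis here is directional, a one-sided test makes the direction explicit. A sketch from the reported summary statistics, again assuming group sizes of 208 and 391, which the article does not state; the `alternative` argument requires SciPy 1.6 or later.

```python
from scipy import stats

# Summary statistics for M11 (BMI) as reported in the article
pos_mean, pos_std = 35.385576923076925, 7.195898164245342
neg_mean, neg_std = 30.076470588235292, 7.812731806515761

# ASSUMPTION: group sizes are not given in the article; these are inferred
n_pos, n_neg = 208, 391

# Two-sided test: is there any difference in mean BMI?
two_sided = stats.ttest_ind_from_stats(
    pos_mean, pos_std, n_pos, neg_mean, neg_std, n_neg, equal_var=True
)

# One-sided test matching the alternate hypothesis: the positive group
# has a LOWER mean BMI than the negative group
one_sided = stats.ttest_ind_from_stats(
    pos_mean, pos_std, n_pos, neg_mean, neg_std, n_neg,
    equal_var=True, alternative="less",
)

print(f"two-sided: t={two_sided.statistic:.4f}, p={two_sided.pvalue:.3e}")
print(f"one-sided (positive < negative): p={one_sided.pvalue:.4f}")
```

A positive t-statistic with a one-sided p-value near 1 means the data offer no support for the "lower BMI in the sepsis group" direction.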

NB:

It’s important to note that correlation does not imply causation, and additional factors or confounding variables may be influencing this relationship. Therefore, further research and analysis are recommended to gain a deeper understanding of the underlying mechanisms and potential causal relationships.

I decided to analyze this further with a stratified analysis: I divided the dataset into subgroups based on BMI ranges and examined the sepsis incidence within each subgroup. This can help identify whether a specific BMI range exhibits a stronger association with sepsis risk.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Define the BMI ranges
bmi_ranges = [0, 18.5, 24.9, 29.9, 100]
bmi_labels = ['Underweight', 'Normal', 'Overweight', 'Obese']

# Create a new column to represent BMI ranges
df['BMI Range'] = pd.cut(df['M11'], bins=bmi_ranges, labels=bmi_labels, include_lowest=True)

# Flag sepsis-positive patients (assumes 'Positive' labels or a 1/0 encoding)
is_positive = df['Sepssis'].isin(['Positive', 1])

# Group by BMI range and calculate the sepsis incidence as the share of
# positive patients; .count() would only return the number of rows per range
grouped = is_positive.groupby(df['BMI Range'], observed=False).mean().reset_index()

# Plot the sepsis incidence by BMI range
plt.figure(figsize=(8, 6))
sns.barplot(data=grouped, x='BMI Range', y='Sepssis')
plt.xlabel('BMI Range')
plt.ylabel('Sepsis Incidence (proportion positive)')
plt.title('Sepsis Incidence by BMI Range')

# Add data labels
for p in plt.gca().patches:
    rate = p.get_height()
    plt.gca().annotate(f'{rate:.2f}', (p.get_x() + p.get_width() / 2, rate), ha='center', va='bottom')

plt.show()

The information here suggests that individuals classified as overweight or obese have a higher incidence of sepsis compared to those classified as underweight or with a normal BMI.

Hypothesis 5:

Patients without valid insurance cards are more likely to develop sepsis.

- Null Hypothesis: There is no association between the absence of valid insurance cards and the likelihood of developing sepsis.

- Alternate Hypothesis: Patients without valid insurance cards are more likely to develop sepsis.

Justification: Lack of access to healthcare, as indicated by the absence of valid insurance, may delay or hinder early detection and treatment of infections, potentially increasing the risk of sepsis.

Chi-Square Test of Independence:
Chi-Square: 2.0712782081677066
P-Value: 0.1500956791860619

Since the p-value is greater than 0.05, we do not have sufficient evidence to reject the null hypothesis. The null hypothesis states that there is no association between insurance status and the likelihood of developing sepsis. Therefore, based on the available data, we cannot conclude that patients without valid insurance cards are more likely to develop sepsis.
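A chi-square test of independence starts from a contingency table of the two categorical variables. A minimal sketch with made-up counts, assuming `Insurance` is coded 1/0 and `Sepssis` holds 'Positive'/'Negative' labels. Note that `chi2_contingency` applies Yates' continuity correction to 2×2 tables by default; whether the article's figure used it is not stated.

```python
import pandas as pd
from scipy import stats

# Made-up data: 60 insured and 40 uninsured patients
toy = pd.DataFrame({
    "Insurance": [1] * 60 + [0] * 40,
    "Sepssis": (
        ["Positive"] * 20 + ["Negative"] * 40    # insured
        + ["Positive"] * 16 + ["Negative"] * 24  # uninsured
    ),
})

# Rows: insurance status, columns: sepsis outcome
table = pd.crosstab(toy["Insurance"], toy["Sepssis"])
print(table)

# Test whether outcome frequencies depend on insurance status
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"Chi-Square: {chi2:.4f}")
print(f"P-Value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
```

Passing `correction=False` would disable the continuity correction and give a slightly larger statistic.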

VIII. Conclusion

One of the key findings of our analysis was a significant association between BMI and sepsis. The statistical test gave strong evidence against the null hypothesis of no difference, but in the opposite direction to the one hypothesized: patients who developed sepsis had a higher mean BMI (35.39) than those who did not (30.08). This observation suggests that, in this dataset, higher BMI is a potential risk factor rather than a protective factor. Further investigation and research are needed to understand the underlying mechanisms driving this association.

Stratified Analysis by BMI Range: To delve deeper into the relationship between BMI and sepsis, we performed a stratified analysis by dividing the dataset into subgroups based on BMI ranges. The analysis revealed that overweight and obese individuals had a higher incidence of sepsis compared to underweight and normal-weight individuals. This finding highlights the importance of considering specific BMI ranges when assessing sepsis risk. Identifying the BMI range that exhibits a stronger association with sepsis can aid in targeted prevention strategies and patient management.

Age as a Risk Factor: We also examined age as a potential risk factor for sepsis. The (20, 30] age range had the highest count (323), but the proportion of sepsis cases in that range (0.23) was lower than in older ranges such as (30, 40], (40, 50], and (50, 60], and the t-test showed that sepsis-positive patients were older on average (mean age 36.70 versus 31.48). Further investigation is required to assess the overall relationship between age and sepsis risk. Age-related factors such as immune system function and comorbidities may play a role in sepsis susceptibility, warranting additional research.

It is important to note that this analysis is based on a specific dataset, and the findings may not be generalizable to all populations. Further studies with larger and diverse datasets are necessary to validate and expand upon these observations. Nonetheless, this EDA serves as an important stepping stone in understanding the risk factors associated with sepsis, ultimately aiding in early detection, prevention, and improved patient outcomes.

Visit my GitHub for more details:

GitHub - aliduabubakari/Sepsis-Classification-with-FastAPI

This project is focused on the accurate and efficient classification of sepsis cases using the FastAPI framework…

github.com