Exploring the Hospital Readmission Dataset

For my midterm ML project, I’m using the Diabetic Hospital Readmission dataset, a large, real-world clinical dataset with over 100,000 patient encounters. This dataset showed me exactly what real healthcare data looks like. In this post, I’ll walk you through my EDA journey: handling missing values, examining distributions, and spotting the first patterns that began to emerge. If you'd like to review the code, you can find it on my GitHub repository.

Step 1: Load & Inspect the Raw Data

The dataset came with a LOT of columns:

patient demographics
hospital visit counts
diagnosis codes
23+ medication columns
readmission labels (“<30”, “>30”, “NO”)

Many missing values were encoded as “?” rather than NaN, so I replaced them with proper NaN values:

import numpy as np
import pandas as pd

df = pd.read_csv("diabetic_data.csv")
df = df.replace('?', np.nan)
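Before replacing the placeholder, it can help to count how often it appears in each column. Here is a minimal sketch on a tiny made-up CSV (the rows are invented for illustration; only the column names echo the dataset):

```python
import io

import numpy as np
import pandas as pd

# Toy CSV standing in for diabetic_data.csv (values are made up).
raw = io.StringIO(
    "race,gender,weight\n"
    "Caucasian,Female,?\n"
    "?,Male,?\n"
    "AfricanAmerican,Female,[75-100)\n"
)
df = pd.read_csv(raw)

# Count the "?" placeholders per column before touching them.
placeholder_counts = (df == "?").sum()
print(placeholder_counts)

# Replace the placeholder with a real missing value, as in the post.
df = df.replace("?", np.nan)
missing_counts = df.isna().sum()
print(missing_counts)
```

Another option is to pass `na_values="?"` to `pd.read_csv`, which converts the placeholder while the file is being read.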

Step 2: Remove Columns With Too Many Missing Values

Some columns were mostly empty, such as weight, max_glu_serum, A1Cresult, medical_specialty, and payer_code.

Each had >70% missing values, so I dropped them:

df = df.drop([
    'weight', 'max_glu_serum', 'A1Cresult',
    'medical_specialty', 'payer_code'
], axis=1)
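Rather than listing the columns by hand, the missing share can be computed per column and compared against a cutoff. This is only a sketch on a made-up toy frame, and `THRESHOLD` is a hypothetical name for the 70% cutoff mentioned above:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names echo the dataset.
df = pd.DataFrame({
    "weight": [np.nan, np.nan, np.nan, "[75-100)"],
    "race": ["Caucasian", np.nan, "Asian", "Caucasian"],
    "time_in_hospital": [3, 1, 4, 2],
})

# Fraction of missing values per column, highest first.
missing_frac = df.isna().mean().sort_values(ascending=False)
print(missing_frac)

# Hypothetical cutoff: drop every column with more than 70% missing values.
THRESHOLD = 0.7
to_drop = missing_frac[missing_frac > THRESHOLD].index.tolist()
df = df.drop(columns=to_drop)
print(to_drop)
```

Printing `missing_frac` first makes it easy to see which columns sit close to the cutoff before anything is dropped.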

This simplified the dataset massively.

Step 3: Fix Inconsistent Text & Fill Missing Values

The dataset had inconsistent formatting:

“Male”, “male”, “MALE”
medication values like “No”, “NO”, “no”
blank diagnosis codes

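I cleaned the text columns by lowercasing them and replacing spaces with underscores:

# cat_cols: list of the categorical (text) column names
for c in cat_cols:
    df[c] = df[c].str.lower().str.replace(' ', '_')

Then I filled the missing values:

# num_cols: list of the numeric column names
df[cat_cols] = df[cat_cols].fillna('NA')
df[num_cols] = df[num_cols].fillna(0.0)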
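The snippets above rely on `cat_cols` and `num_cols`, which are not defined in this post. One possible way to build them, shown here on a made-up toy frame, is to split the columns by dtype (this split is an assumption, not necessarily how the lists were built in the project):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values that mimic the inconsistent formatting.
df = pd.DataFrame({
    "gender": ["Male", "male", "MALE", np.nan],
    "insulin": ["No", "NO", "Steady", "no"],
    "number_inpatient": [0.0, 2.0, np.nan, 1.0],
})

# Assumed definitions: numeric dtypes go to num_cols, everything else to cat_cols.
num_cols = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
cat_cols = [c for c in df.columns if c not in num_cols]

# Normalize the text: lowercase and replace spaces with underscores.
for c in cat_cols:
    df[c] = df[c].str.lower().str.replace(" ", "_", regex=False)

# Fill the gaps: a sentinel string for text columns, zero for numeric ones.
df[cat_cols] = df[cat_cols].fillna("NA")
df[num_cols] = df[num_cols].fillna(0.0)
print(df)
```

After this step the three spellings of "male" collapse into a single category. Note that filling numeric gaps with 0 treats "missing" as "zero", which is a modelling choice worth keeping in mind.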

Step 4: Simple Visual EDA

Now my favourite part: visualizations. Nothing fancy.

Just clean plots answering simple questions:

1️⃣ Which gender gets readmitted most?

Men and women had very similar readmission rates. No strong signal here.
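A comparison like this boils down to a grouped mean of a 0/1 flag. The sketch below uses invented rows, so the numbers it prints say nothing about the real dataset; `readmit_30` is a hypothetical helper column:

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "gender": ["Female"] * 4 + ["Male"] * 4,
    "readmitted": ["<30", "NO", ">30", "NO", "NO", "<30", "NO", ">30"],
})

# Binary flag: 1 if the patient was readmitted within 30 days.
df["readmit_30"] = (df["readmitted"] == "<30").astype(int)

# The mean of a 0/1 flag per group is that group's readmission rate;
# the count shows how many encounters each rate is based on.
summary = df.groupby("gender")["readmit_30"].agg(rate="mean", n="count")
print(summary)
```

Keeping the group size next to the rate helps avoid reading too much into small groups.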

2️⃣ Does age influence readmission?

Older age groups showed slightly higher readmission rates, but the difference was not dramatic.

3️⃣ Previous inpatient visits — very strong predictor

Patients with more previous inpatient visits had a much higher chance of being readmitted.

This became one of my strongest features later.
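For a count feature like this, high values are rare, so one way to get stable rates is to pool them into a single top bucket first. A minimal sketch on made-up rows, assuming the column is called `number_inpatient` and using a hypothetical cap of 3:

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "number_inpatient": [0, 0, 0, 0, 1, 1, 2, 3, 5, 8],
    "readmitted": ["NO", "NO", ">30", "<30", "NO", "<30", "<30", "NO", "<30", "<30"],
})
df["readmit_30"] = (df["readmitted"] == "<30").astype(int)

# Cap rare high counts so that 3, 5, and 8 all land in one "3 or more" bucket.
df["inpatient_bucket"] = df["number_inpatient"].clip(upper=3)

# Readmission rate and group size per bucket.
rate = df.groupby("inpatient_bucket")["readmit_30"].agg(rate="mean", n="count")
print(rate)
```

The cap is a judgment call; looking at the `n` column shows whether the buckets are large enough to trust.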

4️⃣ Time in hospital

Most patients stayed between 1–4 days.
Extremely long stays were rare.
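The same picture can be checked numerically before plotting. This sketch uses a handful of invented stay lengths and assumes the column is called `time_in_hospital`:

```python
import pandas as pd

# Made-up stay lengths in days, for illustration only.
stays = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 6, 14], name="time_in_hospital")

# How many encounters there are for each stay length.
counts = stays.value_counts().sort_index()
print(counts)

# Share of stays between 1 and 4 days (both ends included).
share_1_to_4 = stays.between(1, 4).mean()
print(share_1_to_4)

# Upper quantiles give a quick sense of how rare the long stays are.
print(stays.quantile([0.5, 0.9, 0.99]))
```

If matplotlib is installed, `counts.plot(kind="bar")` turns the same counts into a bar chart.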

Step 5: Convert the Target Variable

The original labels were:

<30 → readmitted within 30 days
>30 → readmitted after more than 30 days
NO → not readmitted

I simplified to binary classification:

df['readmitted'] = df['readmitted'].apply(lambda x: 1 if x=='<30' else 0)
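After a conversion like this, it is worth checking how imbalanced the new target is. A small sketch on invented labels, using a vectorized comparison that gives the same result as the lambda above:

```python
import pandas as pd

# Made-up labels for illustration only.
labels = pd.Series(["<30", ">30", "NO", "NO", ">30", "<30", "NO", "NO"], name="readmitted")

# 1 for readmission within 30 days, 0 for both ">30" and "NO".
target = (labels == "<30").astype(int)

# Share of each class in the binary target.
balance = target.value_counts(normalize=True)
print(balance)
```

Both ">30" and "NO" end up in class 0, so the positive class is usually the smaller one, which matters later when choosing evaluation metrics.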

It’s not perfect, but it’s workable enough for a midterm project.

What I Learned This Week

Healthcare datasets are often messy and require careful cleaning, as real medical data frequently includes missing values, unusual codes, and inconsistent categories.
EDA reveals which features matter before modeling.
Cleaning + EDA = 70% of the project.

This publication is the beginning of something exciting for my career and for me personally. Let’s see where this journey takes us.

Also, thank you so much to Alexey Grigorev and the DataTalks.Club team.

Training Models, Raising Kids