Exploratory Data Analysis: Synthetic Healthcare Data

Oyinda Sangowawa
9 min read · Mar 15, 2024


Photo by National Cancer Institute on Unsplash

A. Overview

The aim of this project is to identify general patterns in a healthcare dataset. The dataset contains information about Patients & their medical records, as well as information about Hospitals, Doctors, Insurance Providers etc. The Exploratory Data Analysis (EDA) was carried out with Python in a Kaggle Jupyter notebook; you can find my notebook here. The data used in this project is synthetic and was sourced from Kaggle.

B. Meaning of Types & Features

This section provides a summary of the different features as well as a sample of what the data looks like.

Transposed sample of the data
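A transposed preview like this one can be produced roughly as follows. This is a minimal sketch: the two-row frame and its values are made up for illustration, and `sample_df` is a hypothetical stand-in for the DataFrame loaded from the Kaggle CSV.

```python
import pandas as pd

# Hypothetical two-row stand-in for the real dataset (values are made up)
sample_df = pd.DataFrame({
    "Name": ["Patient A", "Patient B"],
    "Age": [34, 67],
    "Gender": ["Female", "Male"],
    "Billing Amount": [1200.50, 30450.75],
})

# (rows, columns) of the frame
print(sample_df.shape)

# Transposing puts one feature per row, which is easier to read for wide tables
preview = sample_df.head().T
print(preview)
```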

The entire dataset has 10,000 rows with 15 columns.

  • Name: This column represents the name of the patient associated with the healthcare record.
  • Age: The age of the patient at the time of admission, expressed in years.
  • Gender: Indicates the gender of the patient, either “Male” or “Female.”
  • Blood Type: The patient’s blood type, which can be one of the common blood types (e.g., “A+”, “O-”, etc.).
  • Medical Condition: This column specifies the primary medical condition or diagnosis associated with the patient, such as “Diabetes,” “Hypertension,” “Asthma,” and more.
  • Date of Admission: The date on which the patient was admitted to the healthcare facility.
  • Doctor: The name of the doctor responsible for the patient’s care during their admission.
  • Hospital: Identifies the healthcare facility or hospital where the patient was admitted.
  • Insurance Provider: This column indicates the patient’s insurance provider, which can be one of several options, including “Aetna,” “Blue Cross,” “Cigna,” “UnitedHealthcare,” and “Medicare.”
  • Billing Amount: The amount of money billed for the patient’s healthcare services during their admission. This is expressed as a floating-point number.
  • Room Number: The room number where the patient was accommodated during their admission.
  • Admission Type: Specifies the type of admission, which can be “Emergency,” “Elective,” or “Urgent,” reflecting the circumstances of the admission.
  • Discharge Date: The date on which the patient was discharged from the healthcare facility, based on the admission date and a random number of days within a realistic range.
  • Medication: Identifies a medication prescribed or administered to the patient during their admission. Examples include “Aspirin,” “Ibuprofen,” “Penicillin,” “Paracetamol,” and “Lipitor.”
  • Test Results: Describes the results of a medical test conducted during the patient’s admission. Possible values include “Normal,” “Abnormal,” or “Inconclusive,” indicating the outcome of the test.
# count the distinct values in each column
for column in df.columns:
    num_unique_values = df[column].nunique()
    print(f'Number of unique values in {column}: {num_unique_values}')

The output above briefly describes the number of unique (distinct) values each feature has. For clarity, features are the column names and values are what the columns contain.

To reduce errors when referring to features, they have been renamed:

# Renaming features for easy readability
new_names = {
    'Name': 'name', 'Age': 'age', 'Gender': 'gender',
    'Blood Type': 'blood_type', 'Medical Condition': 'medical_condition',
    'Date of Admission': 'date_of_admission',
    'Doctor': 'doctor', 'Hospital': 'hospital',
    'Insurance Provider': 'insurance', 'Billing Amount': 'bill',
    'Room Number': 'room', 'Admission Type': 'admission_type',
    'Discharge Date': 'discharge_date', 'Medication': 'medication',
    'Test Results': 'test_results',
}
df.rename(columns=new_names, inplace=True)
df.info()

The dataset comprises various Categorical & Quantitative features with appropriate data types. Quantitative features are stored as integers & floats, while Categorical features are stored as objects (strings), and all features are correctly parsed.

C. Analysis of Distribution of Features

The purpose of this section is to shed more light on the distribution of the features.

Quantitative Features

Quantitative features are numerical and are stored as either integers (int) or floats. They can be discrete, like age in years, or continuous, like the billing amount.

# identify quantitative features (any numeric dtype, regardless of platform int size)
quantitative = df.select_dtypes(include='number').columns
quantitative
Quantitative Features

To get more insights on the quantitative features, it is important to find out the summary statistics of each of them.

  • Age: The average age is 51, and ages range from 18 to 85.
  • Bill: The average bill is 25,516.81, and bills range from 1,000 to 49,995.90.
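Summary statistics like these are typically read off `DataFrame.describe()`. The sketch below uses a made-up three-row frame (`toy` is hypothetical, and its values are not taken from the real dataset) just to show where the mean, minimum and maximum come from.

```python
import pandas as pd

# Hypothetical values, not taken from the real dataset
toy = pd.DataFrame({"age": [18, 40, 85], "bill": [1000.0, 25000.0, 49995.9]})

# describe() reports count, mean, std, min, quartiles and max per numeric column
summary = toy.describe()
print(summary.loc[["mean", "min", "max"]])
```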

Age

Billing Amount

Categorical Features

Categorical features are typically non-numeric and represent different levels or groups without a natural ordering between them. Their data types could either be objects or strings.

# identify categorical columns
categorical = df.select_dtypes(include=['object']).columns
categorical
Categorical Features

D. Data Pre-processing

Data preprocessing is a crucial step in the data analysis pipeline where raw data is cleaned, transformed, and prepared for analysis.

  1. Data Cleaning

Addressing missing values is crucial in EDA to ensure the accuracy, reliability, and validity of the analysis results.

# checking for missing values
df.isna().sum()

There are no missing values in any column of the data.

2. Feature Engineering

Feature engineering in this context can be described as creating new features from existing ones for the purpose of further analysis.

# create new feature: age_group
def age_group(age):
    """Map an age in years to an age group (ages in the data range from 18 to 85)."""
    # Bounds are inclusive so the boundary ages 18, 36 and 51 are not left unassigned
    if age <= 35:
        return 'Young'        # 18-35
    elif age <= 50:
        return 'Middle-aged'  # 36-50
    else:
        return 'Old'          # 51-85

# Create a new column 'age_group' based on the age using the defined function
df['age_group'] = df['age'].apply(age_group)

# Display a random sample of the new 'age_group' column
df[['age_group']].sample(5)

A new feature called Age group was created from the Age feature. From the summary statistics of the age feature, the ages range between 18–85. In this new feature, the different ages have been categorised into different groups.

  • 18–35 = Young patients
  • 36–50 = Middle-aged patients
  • 51–85 = Old patients
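The same grouping can also be expressed with `pd.cut`, which makes the bin edges explicit. This is an alternative sketch, not the approach used in the notebook; the `ages` series holds boundary ages chosen only for illustration.

```python
import pandas as pd

# Boundary ages, chosen for illustration
ages = pd.Series([18, 35, 36, 50, 51, 85])

# Bins are right-inclusive by default: (17, 35], (35, 50], (50, 85]
groups = pd.cut(ages, bins=[17, 35, 50, 85], labels=["Young", "Middle-aged", "Old"])
print(groups.tolist())
```

Checking the boundary ages this way is a quick test that every age between 18 and 85 lands in exactly one group.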
# create new feature: days_hospitalized
import pandas as pd

df['date_of_admission'] = pd.to_datetime(df['date_of_admission'])
df['discharge_date'] = pd.to_datetime(df['discharge_date'])
# Subtracting two datetime columns gives a timedelta; .dt.days extracts the whole days
df['days_hospitalized'] = (df['discharge_date'] - df['date_of_admission']).dt.days

df[['days_hospitalized']].sample(5)

A new feature called Days Hospitalized has also been created for further analysis by subtracting the Date of Admission from the Discharge Date.

3. Remove irrelevant columns

Removing specific columns from a dataset is recommended if they would not be useful to the analysis.

# removing irrelevant features (name, room & doctor)
df = df.drop(['name','room', 'doctor'], axis=1)

The Name, Room & Doctor features have been removed because they are too ambiguous and won’t be needed during the EDA.

E. Exploratory Data Analysis

This section provides a deep dive into each feature & visualizes significant relationships between two or more features.

  1. Gender

This pie chart shows that the patient records are fairly balanced between male and female patients, with female patients outnumbering male patients by 1.5%, which isn’t a significant difference.
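The shares behind a chart like this can be computed with `value_counts(normalize=True)`. A minimal sketch with a made-up `gender` series (the counts here are illustrative, not the dataset's):

```python
import pandas as pd

# Made-up values for illustration
gender = pd.Series(["Female", "Female", "Female", "Male", "Male"])

# normalize=True returns proportions instead of raw counts
share = gender.value_counts(normalize=True).mul(100).round(2)
print(share)
```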

2. Medical Condition

The medical condition counts indicate that Asthma is the most prevalent condition and Diabetes the least.

3. Admission Type

Urgent is the most common Admission Type and Elective is the least common.

4. Test Results

The difference between the Test Results categories isn’t significant. Abnormal Test Results are the most common and Normal Test Results are the least.

5. Age Group

A significant portion (53.02%) of the patient population consists of Old people. Young people make up 26.16% of the population, and Middle-aged people make up the rest, which is the smallest share.

6. Insurance Providers

Analysis of associations & Group differences

In this section, the unique relationships between the features are explored.

  1. Billing Amount & Age Group

The chart above implies that Old patients generate the most income for hospitals and Middle-aged patients generate the least. This is understandable, since Old patients make up the largest part of the population and Middle-aged patients the smallest.
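One way to separate the effect of group size from the size of individual bills is to look at the sum, mean and count together. The sketch below assumes the renamed `age_group` and `bill` columns; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical bills; "Old" has more patients but the same average bill as "Young"
toy = pd.DataFrame({
    "age_group": ["Old", "Old", "Young", "Middle-aged"],
    "bill": [100.0, 200.0, 150.0, 50.0],
})

# sum = total income per group, mean = average bill, count = number of records
stats = toy.groupby("age_group")["bill"].agg(["sum", "mean", "count"])
print(stats.sort_values("sum", ascending=False))
```

In this toy data the Old group has the largest total only because it has more records, which is the same reasoning applied to the chart.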

2. Hospitals & Billing Amount

The chart above shows the top 10 hospitals that generate the most income. There are over 8,000 different hospitals.
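A ranking like this can be obtained by summing bills per hospital and keeping the largest totals. A minimal sketch, assuming the renamed `hospital` and `bill` columns; the hospital names and amounts are made up, and `nlargest(2)` stands in for the top 10.

```python
import pandas as pd

# Made-up hospitals and bills for illustration
toy = pd.DataFrame({
    "hospital": ["H1", "H2", "H1", "H3"],
    "bill": [500.0, 900.0, 600.0, 100.0],
})

# Total billed per hospital, keeping only the highest totals
top_hospitals = toy.groupby("hospital")["bill"].sum().nlargest(2)
print(top_hospitals)
```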

3. Billing amount & Insurance

Cigna, Aetna & Blue Cross generate roughly the same amount of income, while UnitedHealthcare & Medicare generate slightly less. This is consistent with Cigna, Aetna & Blue Cross also being the most common insurance providers across the records.

4. Blood Type & Gender

Blood groups O+, B-, AB+, A+, AB-, A- and B+ are more common amongst female patients; the exception is O-, which is more common amongst male patients. AB- has the highest number of female patients.
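Counts for two categorical features at once can be tabulated with `pd.crosstab`. This is a sketch on a made-up five-row frame, assuming the renamed `blood_type` and `gender` columns.

```python
import pandas as pd

# Made-up records for illustration
toy = pd.DataFrame({
    "blood_type": ["O-", "O-", "AB-", "AB-", "AB-"],
    "gender": ["Male", "Male", "Female", "Female", "Male"],
})

# Rows are blood types, columns are genders, cells are record counts
counts = pd.crosstab(toy["blood_type"], toy["gender"])
print(counts)
```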

5. Medical Condition & Gender

More female than male patients have Cancer, Obesity, Asthma & Diabetes. Cancer has the highest number of female patients. Slightly more male patients have Hypertension & Arthritis.

6. Age Groups & Gender

Female patients outnumber male patients amongst Old & Middle-aged patients. Amongst Young patients the difference between genders is slight, but there are still more female patients.

7. Age Groups & Insurance

Old Patients: Aetna & Blue Cross are the most used Insurance providers, while the least used Insurance provider is Medicare.

Middle-aged Patients: Medicare & Cigna are the most used Insurance providers, while UnitedHealthcare & Blue Cross are the least used.

Young Patients: Cigna & Blue Cross are the most used Insurance providers, while Medicare is the least used.
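The most used provider per age group can be read from a cross-tabulation with `idxmax`. A minimal sketch, assuming the renamed `age_group` and `insurance` columns; the rows below are made up and do not reproduce the dataset's counts.

```python
import pandas as pd

# Made-up records for illustration
toy = pd.DataFrame({
    "age_group": ["Old", "Old", "Old", "Young", "Young", "Young"],
    "insurance": ["Aetna", "Aetna", "Medicare", "Cigna", "Cigna", "Medicare"],
})

# Rows are age groups, columns are providers, cells are record counts
table = pd.crosstab(toy["age_group"], toy["insurance"])

# For each age group, the provider with the highest count
most_used = table.idxmax(axis=1)
print(most_used)
```

When two providers tie, `idxmax` returns only the first one, so it is worth looking at `table` itself before naming several "most used" providers.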

8. Old patients & Medical Conditions

Cancer is the most common disease amongst Old patients & Hypertension is the least.

9. Middle-aged patients & Medical Conditions

Hypertension is the most common disease amongst Middle-aged patients, while Diabetes & Obesity are the least common.

10. Young patients & Medical conditions

Hypertension is the most common disease amongst Young people & Diabetes is the least.

F. Insights

This section highlights major insights obtained from the Exploratory Data Analysis.

  1. The dataset has 9,378 different patients, 9,416 Doctors, and 8,639 Hospitals.
  2. There are more Female Patients than Male Patients.
  3. Hypertension is the most prevalent medical condition in Young & Middle-aged people, while Cancer is the most prevalent in Old People.
  4. Old patients are by far the largest age group, while Middle-aged patients are the smallest.
  5. Most Age groups use Cigna & Blue Cross as Insurance Providers, which is consistent with Cigna & Blue Cross generating the most income.
  6. The Hospital that generates the most income is Smiths & Sons.
  7. A significant number of female patients have blood group AB- & a significant number of male patients have O-.
  8. Cancer is the most common condition amongst female patients & Hypertension is the most common amongst male patients.

G. Limitations

As mentioned earlier, the dataset is synthetic, and as a result a lot of the entries are not logically accurate. This meant that I couldn’t expand on a lot of the insights I discovered, but the dataset still served its purpose, as my intention was mainly to use it for Exploratory Data Analysis.

H. Conclusion

This is my first public project and working on it has been interesting and exciting. Despite the fact that it’s a synthetic dataset, the methodology of the analysis can also be applied to real-life scenarios. If you have any questions or would like to discuss this project further, please reach me on LinkedIn or by email.
