Healthcare Management Analysis Project with SQL

A study on how hospitals managed care for Diabetic patients in the United States (1999-2008)

Huong (Tris) Nguyen
Learning SQL
9 min readDec 22, 2022

--

Created by Canva

INTRODUCTION

Diabetes is one of the top 10 causes of death globally according to World Health Organization. The number of new patients and re-admitted patients is raising year by year significantly. Therefore, the pressure for hospital management is also increased to be able to serve patients in all conditions. Some of the concerns from the management are:

  • The hospital's performance in managing the hospital capacity for all patients.
  • Budgeting the procedures and specialists for each patient case.
  • How patients are treated regardless of their race and identification for better care.
  • The success rate for emergency cases.
  • Summarize patients’ history for the overall views.

This project used data from 130 medical clinics for Diabetes in the United States from 1999 to 2008. It is not a recent dataset, but it shows some good insights and can be used for future prediction. After the analysis, I explored:

  1. Most patients stayed less than 7 days and a maximum of 14 days in the hospital.
  2. There were several lab procedures and medications for each patient with different specialists which could be considered to reduce the cost if necessary.
  3. As an immigrant country, the United States has people from many ethnic backgrounds. The analysis found that there was fair treatment at healthcare facilities with similar procedures and lab work.
  4. The longer the patients stay at the hospital, the more lab procedures are required.
  5. Around 33% of patients needed emergency services but stayed shorter at the hospital which reflects the good management system.

THE DATA

The data set was suggested by Avery Smith from Data Analytics Accelerator Program. You can access the data set from the link below:

Prediction on hospital readmission

General information about the dataset is summarized in the table:

From https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#

There are 2 datasets for this project: patient_demographics and patient_health in SQL File and were imported to MySQL WorkBench for further analysis. I used several functions to analyze this project including CONCAT, CTE, UNION, JOIN, CASE WHEN HAVING, DISTINCT, and ROUND functions.

THE ANALYSIS

In this article, I will go through 7 challenges to explore the dataset by using multiple functions as I mentioned above.

Challenge 1: The health care management wants to know the distribution of time spent in the hospital in general. Based on this information, they can manage the hospital's capacity for emergency and general patients. Therefore, the first analysis is to investigate the total hospital time for all patients.

In this analysis, I want to see how many patients stay in the hospital for different lengths. To start, I selected the total_days of hospitalizes and count the number of patients based on their time in hospital by grouping them in the same total_days with the order of total_days sorted in ascending order. Finally, I visualized the count number by a simple ‘*’ histogram. SQL is not a tool for visualization but with this trick, it is way easier to show the data.

SELECT ROUND(time_in_hospital,1) as total_days, COUNT(*) as count, 
RPAD('', count(*)/100, '*') as bar
FROM patient.health
GROUP BY total_days
ORDER BY total_days;
Total days in the hospital and total patients

As the result, the majority of patients stay less than 7 days in the hospital (all counts above the red line), mostly around 3 days or less.

Challenge 2: Budget management- A brand-new hospital director wants a list of all specialties and the average total of the number of procedures currently practiced at the hospital.

I selected the data from the health file and focused on the column named medical_specialty and num_procedures as they contain the information we needed. I also filtered out N/A data like “?” by using the WHERE clause. I used GROUP BY and DISTINCT functions to extract specific data for specialties without repeating the same values. It also helps in counting the total specialties in the hospital. I ROUND the AVG function for the number of procedures with 1 decimal. And finally, I organized the data in descending order.

SELECT DISTINCT medical_specialty, COUNT(medical_specialty) as total, 
ROUND (AVG(num_proceders),1) as average_procedures
FROM patient.health
WHERE NOT medical_specialty = "?"
GROUP BY medocal_specialty
ORDER BY average_procedures DESC;

Despite my function of 1 decimal round-up, there were some results with no decimal which raised concern. However, there is only 1 patient for those specific specialties, I might think it was an input error or recording. To narrow down the list of specialties, the management then wanted the specialties with at least 50 patients and more than 2.5 procedures on average.

I used the same query as above but added the filter for total equal or more than 50 with average_procedures more than 2.5. Because the total column was calculated based on the count of medical specialties, the HAVING function was used for conditioning aggregated groups and AND operators for procedures number required.

SELECT medical_specialty, COUNT(medical_specialty) as total, 
ROUND(AVG(num_procedures), 1) as average_procedures
FROM patient.health
WHERE NOT medical_specialty = "?"
GROUP BY medical_specialty
HAVING total > 50 AND average_procedures > 2.5
ORDER BY average_procedures DESC;
The specialist met the requirements

The result showed that only 5 specialties met the requirements including:

  • Surgery-Thoracic
  • Surgery-Cardiovascular/Thoracic
  • Radiologist
  • Cardiology
  • Surgery-Vascular

Challenge 3: Integrity- The Chief of Nursing wants to know if the hospital seems to be treating patients of different races differently, specifically with the number of lab procedures done.

I needed the information about race in the demographics table and the num_lab_procedures in the health table for the analysis. Therefore, I had to combine the 2 tables into one to be able to extract information from them by using the INNER JOIN function on the patient_nbr column.

SELECT d.race, ROUND(AVG(h.num_lab_procedures),1) as average_num_lab_procedures
FROM patient.health h
JOIN patient.demographics d
ON h.patient_nbr=d.patient_nbr
GROUP BY d.race
ORDER BY average_num_lab_procedures DESC;
The result of average lab procedures for different ethnical backgrounds patients

And the result showed there are 6 listings for race and the average number of lab procedures were quite close to each other with the highest result being for African Americans. Overall, there is no difference in treatment based on race at the hospital. However, the “?” and another listing also showed 44 lab procedures which could affect the result if they are identified.

Challenge 4: Do people need more procedures if they stay longer in the hospital?

First, I wanted to see the general measurement for the number of procedures at the hospital.

SELECT MIN(num_lab_procedures) as minimum, ROUND(AVG(num_lab_procedures),0) as average, 
MAX(num_lab_procedures) as maximum
FROM patient.health;

Based on this information, I divided the number of procedures into 3 different categories: few (0–25 procedures), average (25–55 procedures), and many (>55 procedures). To add the condition and return the specific values, I used the CASE WHEN function. Then I investigated the correlation between the number of procedures versus the length of hospitalization in total by listing the average time_in_hospital and grouping them together.

SELECT  ROUND(AVG(time_in_hospital), 0) as days_stay, 
CASE HEN num_lab_procedures >= 0 AND num_lab_procedures < 25 THEN "few"
WHEN num_lab_procedures >= 25 AND num_lab_procedures < 55 THEN "average"
WHEN num_lab_procedures >= 55 THEN "many" END AS procedure_frequency

FROM patient.health
GROUP BY procedure_frequency
ORDER BY days_stay;

It is now clear to see that the longer the patient stayed at the hospital, the more lab procedures were required for them and in reverse which would higher expenses for both patients and hospitals.

Challenge 5: You just got an email from a co-worker in research. They want to do a medical test with anyone who is African American or had an “Up” for metformin. They need a list of patients' ids as fast as possible.

At the first glance, I needed the patient ID which can be extracted from both tables. However, the requirement was asking for the conditioning filter for only African Americans (which can be found in the race column in the demographics table) OR up for metformin which contains in the health table. Therefore, I needed to combine 2 tables together by UNION as below.

SELECT patient_nbr FROM patient.demographics WHERE race = "Africa American"
UNION
SELECT patient_nbr FROM patient.health WHERE metformin = "Up";

As the result, I had a list of patient IDs who are African American and have “up” for metformin. I went further to count the total amount of target patients by using CTE.

WITH total_patients AS (
SELECT patient_nbr FROM patient.demographics WHERE race = "Africa American"
UNION
SELECT patient_nbr FROM patient.health WHERE metformin = "Up"
)
SELECT COUNT(patient_nbr)
FROM total_patients;

And the total number of patients is 1037 which reached the requirements.

Challenge 6: The Hospital Administrator wants to highlight some of the biggest success stories of the hospital. They are looking for opportunities when patients came into the hospital with an emergency (admission_type_id of 1) but stayed less than the average time in the hospital.

For this challenge, I needed to filter again for emergency cases only but time in the hospital is less than average which needs to be calculated and I counted the total successful cases.

WITH average_time_hospital AS(
SELECT AVG(time_in_hospital) as average
FROM patient.health
)
SELECT COUNT(*) as successful_case
FROM patient.health
WHERE admission_type_id = 1
AND time_in_hospital < (SELECT* FROM averaqge_time_hospital);

The result was 33684 patients.

Then compare this number with the total of patients.

SELECT DISTINCT COUNT(*) as total_patients
FROM patient.health;

There was a total of 101766 patients reported. From this number, I can easily calculate the percentage of success rate was 33% which is reasonable for emergency and short-term stays only at the hospital.

Challenge 7: The requirement is to write a summary for the top 50 medication patients, and break any ties with the number of lab procedures (highest at the top) by following the hospital’s format.

In this challenge, I needed the information from the health table including patient_bnr, num_lab_procedures, num_medications, readmitted (or not), and from the demographics table (race). Therefore, I needed to combine the 2 tables again by using the INNER JOIN function and LIMIT with 50 results.

The printing format included the aggregated data and string so I used CONCAT to return the values and CASE WHEN statements for filtering the “readmitted” condition for each patient.

SELECT CONCAT('Patient ', h.patient_nbr,' was ',d.race,' and ',
CASE WHEN h.readmiited = "NO" THEN " was not readmitted. They had " ELSE
" was readmiited. They had " END,
h.num_medications, " medications and", h.num_lab_procedures, " lab procedures." as Summary
FROM patient.health h
INNER JOIN patient.demographics d
ON h.patient_nbr=d.patient_nbr
ORDER BY h.num_medications DESC, num_lab_procedures DESC
LIMIT 50;

And the result is ready to report to the management.

I had to fix this code for at least 3 times for the space between the characters so it will be readable. Do not forget to add extra space!

I enjoyed working on this project as I reviewed the SQL knowledge that I learned during the summer and practiced analyzing the insights from the data. However, I would like to investigate further into other features such as:

  • How the ages of patients affect their level of diabetes.
  • Re-admitted rate and how to improve it.
  • How aftercare service works and the follow-up schedule for each patient.
  • Cost analysis for each specialist and how to reduce them.
  • Waiting time for service: emergency, regular check-up, etc.

THANK YOU!

Thank you so much for spending time to go through this project with me. Even though this is a simple project, it is a milestone on my journey to be a data analyst in the future. I would be happy to discuss any details if you have any questions about this project.

Let’s connect on LinkedIn:

https://www.linkedin.com/in/huong-tris-nguyen-847067111/

Go from SELECT * to interview-worthy project. Get our free 5-page guide.

--

--

Huong (Tris) Nguyen
Learning SQL

My stories of switching my career from lab analyst to data analyst- Happy Sharing :) Let's connect: https://www.linkedin.com/in/huong-tris-nguyen-847067111/