Heart Disease: A Geospatial Analysis

Matt Damberg
8 min read · Oct 21, 2023


www.linkedin.com/in/mattdamberg

With a background as an ER nurse, I have always been fascinated by medicine, and specifically by pathophysiology (the study of disease processes). One of the most fascinating aspects of this field is the set of trends and patterns that appear when you view this type of data through different lenses, such as income and geographic location.

This data set is from Kaggle and is titled Heart Attack Risk Prediction Data Set. I will provide a link to the download file at the bottom of this article. I completed this project in multiple steps, beginning with data cleaning and formatting in SQL and then importing the result into Power BI for visualization.

Before we begin, if you find this article or project interesting or useful, please consider following me here on Medium and connecting with me on LinkedIn. I love to share my work with others, especially when I know that it has helped them in some way. I also really enjoy receiving feedback, so don’t hesitate to share your thoughts.

The Project

When starting any project, it is critical to have a question you are seeking to answer, because that gives the project purpose and direction. This means defining a problem that you can then use as a blueprint to build your project around. For example, when I set out to do this project I had one overarching question that kept popping into my head: Do geographic location and income influence your risk for heart disease?

Data Set

This data set has 8,763 rows and 26 columns. The columns consist of a Patient ID primary key plus various health and identifying statistics for each patient, such as vital signs, lab values, and geographic identifiers. Because Patient ID is the primary key, each row represents one patient, so there are 8,763 patients in total. Since the continent column shows that patients were drawn from every continent except Antarctica, this is a limited sample size for a global data set.

Data Cleaning

The data from this set came in a fairly clean and workable state in its raw form. Cleaning for this set mainly consisted of fixing formatting errors for the date column and rounding several of the columns with FLOAT types.
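The cleaning itself was done in SQL, and the exact queries are not shown here. As a rough illustration of the same two steps (normalizing a date column and rounding FLOAT columns), here is a minimal Python sketch. The column names ("date", "bmi") and the list of candidate date formats are assumptions made for the example, not details taken from the data set.

```python
from datetime import datetime

# Hypothetical input formats; the article does not say which ones appeared.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y")


def normalize_date(raw: str) -> str:
    """Return the date as ISO 8601 (YYYY-MM-DD), trying each candidate format."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognized date format: {raw!r}")


def clean_row(row: dict, float_columns: tuple, digits: int = 2) -> dict:
    """Copy a row, normalizing its 'date' field and rounding the float columns."""
    cleaned = dict(row)
    cleaned["date"] = normalize_date(row["date"])
    for col in float_columns:
        cleaned[col] = round(float(row[col]), digits)
    return cleaned


# Made-up example row, only to show the shape of the input and output.
example = clean_row({"date": "03/07/2023", "bmi": "31.251879"}, ("bmi",))
print(example)
```

Failing loudly on an unrecognized format is a deliberate choice in this sketch: a silently skipped date would be harder to spot later in the dashboard.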

Once this was completed, I brought the data into Power BI and began making some modifications using Power Query. I started by creating several date-related columns: a Year column, a Month Number column, and a Month Name column. Finally, I created a geographic hierarchy encompassing the hemisphere, continent, and country.
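These columns were built in Power Query, so the following is only a Python sketch of the equivalent logic, not the steps used in the report. The field names ("hemisphere", "continent", "country") and the sample row are hypothetical.

```python
from datetime import date

# Hard-coded English month names so the result does not depend on the locale.
MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def date_parts(d: date) -> dict:
    """Derive the three date-related columns from a single date value."""
    return {
        "Year": d.year,
        "Month Number": d.month,
        "Month Name": MONTH_NAMES[d.month - 1],
    }


def hierarchy_key(row: dict) -> tuple:
    """Order the geographic fields from coarsest to finest for drill-down."""
    return (row["hemisphere"], row["continent"], row["country"])


print(date_parts(date(2023, 10, 21)))
# Hypothetical row, for illustration only.
print(hierarchy_key({"country": "Canada", "continent": "North America",
                     "hemisphere": "Northern Hemisphere"}))
```

Keeping the month number next to the month name matters in practice, because the number is what lets the names sort in calendar order rather than alphabetically.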

Visuals

With a question to answer, I was able to begin creating my visuals accordingly. I decided to add some gauge charts across the top of the dashboard displaying the averages of important lab values across the patient population, alongside a KPI card displaying the total number of patients, the number of patients with high BMI or cholesterol, and the average income. These, along with the other visuals on this page, can be filtered by country. The target value, indicated above each gauge, is the maximum value before a reading is considered abnormal and places the person at higher risk for heart disease. A visual displaying the total number of patients in the data set is also provided below these gauge charts.

The bottom left corner of the dashboard displays a table showcasing pertinent columns, which include lab values, lifestyle characteristics, and income by Patient ID. This table allows for a detailed view of specific heart disease risk factors by both income and country and will help me answer my project question when I conduct my analysis.

To round out this page, a slicer was placed in the bottom right, allowing you to filter the page by country. This was implemented in the form of a drop-down menu. I also placed a button on both pages, connected to a bookmark, which clears all filters applied to the visuals and brings you back to the unfiltered view of both pages.

The second page displays a global map showing the total number of patients at risk for heart disease, based on a geographic hierarchy of hemisphere, continent, and country that allows for drilling down and up. Below this is a bar chart showing the number of exercise hours vs. sedentary hours by continent. Finally, a line chart was placed in the bottom right of the dashboard. The chart has two lines, one showing average income and the other showing average BMI (Body Mass Index), both compared by country.

These visuals will help me in the next section, where I conduct my analysis. Using them, I will identify trends and patterns in the data, turn those into insights related to patient health, and ultimately answer my project question.

Analysis

Looking at this cleaned and visualized data set, there are several things that I feel need to be addressed before we get into the analysis. First, the total number of patients in this data set is 8,763. There are roughly 8 billion people on this planet, and fewer than 9,000 seems like much too small a sample for a true representation of the world's population.

Secondly, the distribution of the patients selected from each continent is anything but proportional. For example, Africa had 873 people chosen for this study and North America had 860, which are nearly identical counts. According to an article from Statista (Statista Research, 2023), Africa makes up about 17.9% of the world's population while North America makes up around 4.7%. This large discrepancy between the sample distribution and the population distribution, together with the small size of the sample itself, will ultimately skew the insights pulled from the data.
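To put a number on that discrepancy, the short sketch below compares each continent's share of the sample with its share of the world population, using only the figures quoted above. The "representation ratio" is a name chosen for this example; a ratio of 1.0 would mean the continent is sampled in proportion to its population.

```python
TOTAL_PATIENTS = 8763

# Patient counts and population shares as quoted in the text above.
sample_counts = {"Africa": 873, "North America": 860}
population_share = {"Africa": 0.179, "North America": 0.047}


def representation_ratio(count: int, total: int, pop_share: float) -> float:
    """Share of the sample divided by share of the world population."""
    return (count / total) / pop_share


ratios = {
    continent: representation_ratio(count, TOTAL_PATIENTS, population_share[continent])
    for continent, count in sample_counts.items()
}

for continent, ratio in ratios.items():
    sample_pct = 100 * sample_counts[continent] / TOTAL_PATIENTS
    print(f"{continent}: {sample_pct:.1f}% of sample, ratio {ratio:.2f}")
```

With these figures, each continent supplies roughly 10% of the sample, so Africa is represented at a little over half of its population share while North America is represented at about twice its share.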

Risk Factors

Before I can answer my project question, I should first review what exactly puts a person at risk for heart disease. The list of risk factors is several miles long, so for the sake of brevity I will limit it to the ones listed in the data set.

  • High Blood Pressure (>140/90 mmHg)
  • High Heart Rate (>100 bpm)
  • Obesity
  • Elevated BMI (>24.9)
  • Sedentary Lifestyle
  • High Triglycerides (>200 mg/dL)
  • High Cholesterol (>150 mg/dL)
  • Family History
  • Smoking
  • Poor Diet
  • High Stress

The dashboard displays these values using conditional formatting as green if they are within normal range or as red if they are outside of the normal range.
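The report does this with Power BI conditional formatting rules. As a rough Python sketch of the same rule, using the cutoffs from the list above: the metric names are hypothetical, and treating blood pressure as two separate readings (systolic and diastolic) is an assumption of this example.

```python
# Upper limits of the normal range, taken from the risk factor list above.
THRESHOLDS = {
    "systolic": 140,
    "diastolic": 90,
    "heart_rate": 100,
    "bmi": 24.9,
    "triglycerides": 200,
    "cholesterol": 150,
}


def flag_color(metric: str, value: float) -> str:
    """Return 'red' when the value exceeds its threshold, otherwise 'green'."""
    return "red" if value > THRESHOLDS[metric] else "green"


def flag_patient(patient: dict) -> dict:
    """Color every metric of a patient record that has a known threshold."""
    return {m: flag_color(m, v) for m, v in patient.items() if m in THRESHOLDS}


# Made-up readings, for illustration only.
example_patient = {
    "systolic": 158, "diastolic": 88, "heart_rate": 72,
    "bmi": 24.9, "triglycerides": 286, "cholesterol": 208,
}
print(flag_patient(example_patient))
```

Note that the comparison is strict, so a value sitting exactly on the threshold (a BMI of 24.9 here) is still shown as green.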

As a reminder, my question to answer is “Do geographic location and income influence your risk for heart disease?” Beginning at the coarsest level of granularity, the continent, I calculated the percent of patients at risk for heart attack and found all of the values to be within 3 percentage points of one another. North America had the highest at 37 percent.

  • Africa: 36%
  • Asia: 35%
  • Australia: 36%
  • Europe: 34%
  • North America: 37%
  • South America: 36%
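A per-continent percentage like this is a simple group-and-count. The sketch below shows the calculation on a few made-up rows and then checks the spread of the percentages listed above. The record layout (a continent name plus a 0/1 risk flag) is an assumption for the example.

```python
from collections import defaultdict


def percent_at_risk(records):
    """Return {continent: whole-number percent of patients flagged at risk}."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for continent, at_risk in records:
        totals[continent] += 1
        flagged[continent] += 1 if at_risk else 0
    return {c: round(100 * flagged[c] / totals[c]) for c in totals}


# Made-up toy rows, not taken from the data set.
toy_rows = [
    ("Europe", 1), ("Europe", 0), ("Europe", 0), ("Europe", 0),
    ("Asia", 1), ("Asia", 1), ("Asia", 0), ("Asia", 0), ("Asia", 0),
]
toy_result = percent_at_risk(toy_rows)
print(toy_result)

# Percentages as listed above.
reported = {
    "Africa": 36, "Asia": 35, "Australia": 36,
    "Europe": 34, "North America": 37, "South America": 36,
}
spread = max(reported.values()) - min(reported.values())
print(f"Spread between highest and lowest continent: {spread} percentage points")
```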

Next, I calculated the percent of patients with elevated BMI (Body Mass Index) by continent. Here we have a slightly wider range of values, from 63% in Africa to 69% in North America and Australia. While there is some separation, it is not substantial, and in every continent roughly two thirds of patients have an elevated BMI, which points to excess weight, poor diet, and an increased risk for heart disease globally, at least based on this data set.

  • Africa: 63%
  • Asia: 67%
  • Australia: 69%
  • Europe: 68%
  • North America: 69%
  • South America: 66%

Finally, I calculated the percent of patients with elevated cholesterol by continent. Values are again tightly grouped and demonstrate little variation between continents.

  • Africa: 71%
  • Asia: 71%
  • Australia: 74%
  • Europe: 71%
  • North America: 73%
  • South America: 71%

These numbers also indicate that an exceedingly large percentage of the population on every continent is at risk for heart disease, but they do not suggest that your geographic location plays a role.

A sedentary lifestyle is another huge risk factor for heart disease, and one of the ways to mitigate that risk is exercise. On the second page of my report I added a stacked bar chart comparing the number of sedentary hours vs. the number of exercise hours by continent. This chart again showed a tight grouping of values, with average sedentary hours per continent between 5.9 and 6.2 hours per day. The grouping for average exercise hours is even tighter, ranging from 3.4 to 3.5.

Income is a factor that affects many aspects of health, such as the food we eat and our access to medical care. Income per patient in this data set ranges from about $20,000 to nearly $300,000 a year. The average incomes were substantially higher than I had anticipated and, when averaged by continent, varied far less than I thought they would.

  • Africa: $158,538
  • Asia: $158,470
  • Australia: $158,228
  • Europe: $157,903
  • North America: $157,908
  • South America: $158,994
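To see just how little these averages vary, the short calculation below takes the six figures listed above and measures the gap between the highest and lowest continent. Expressing that gap relative to the unweighted mean of the six averages is a choice made for this sketch.

```python
# Average income by continent, as listed above (US dollars).
average_income = {
    "Africa": 158_538,
    "Asia": 158_470,
    "Australia": 158_228,
    "Europe": 157_903,
    "North America": 157_908,
    "South America": 158_994,
}

highest = max(average_income, key=average_income.get)
lowest = min(average_income, key=average_income.get)
spread = average_income[highest] - average_income[lowest]

# Unweighted mean of the six continent averages.
mean_of_averages = sum(average_income.values()) / len(average_income)
relative_spread_pct = 100 * spread / mean_of_averages

print(f"Highest: {highest}, lowest: {lowest}")
print(f"Spread: ${spread:,} ({relative_spread_pct:.2f}% of the mean)")
```

The gap between South America and Europe is $1,091, which is well under 1% of the mean.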

To me, these average incomes seem incredibly inflated and scream selection bias. According to an article from The Human Capital Hub (Mushay), the average yearly salary in Africa is around 760 US dollars (about 10 times less than North America). This all but confirms either bad data or selection bias, where individuals were not selected at random and were instead vetted before being picked. Either way, it taints the integrity of the data set and severely limits its accuracy and, by extension, its usability.

With that said, based on this limited and biased data, neither income nor geographic location appears to be correlated with an increase or decrease in your risk for heart disease.

Recommendations

For this data set to be one of integrity and accuracy, several things need to change. First and foremost, the sample size is too small to accurately represent the world population, so a much larger sample is needed. On top of that, the sample should include patients from each country in proportion to that country's population, scaled down to the size of the data set. This would help ensure a more accurate geographic distribution of patients.
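As a minimal sketch of what proportional allocation could look like, the example below scales population shares down to a fixed sample size. It works at the continent level only because the Statista shares quoted earlier are the only population figures available here; the same function would apply to countries. The function name is hypothetical, and the current total of 8,763 is reused purely for comparison.

```python
def proportional_allocation(total_sample: int, population_shares: dict) -> dict:
    """Give each region a slice of the sample proportional to its population share."""
    return {
        region: round(total_sample * share)
        for region, share in population_shares.items()
    }


# Population shares quoted earlier in the article (fractions of world population).
shares = {"Africa": 0.179, "North America": 0.047}
actual_counts = {"Africa": 873, "North America": 860}

targets = proportional_allocation(8763, shares)
for region, target in targets.items():
    print(f"{region}: target {target}, actual {actual_counts[region]}")
```

Under these assumptions, a proportional sample of the same size would include about 1,569 patients from Africa rather than 873, and about 412 from North America rather than 860.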

Lastly, the selection bias present in this data set would need to be eliminated in order to provide accurate insights. Ensuring that patients from all socioeconomic statuses are included in the study could help prevent that bias and create a more well-rounded group that more accurately represents each country.

Conclusion

I hope you enjoyed my analysis of this data set. While it did not turn out the way I had anticipated, it still provided excellent real-world experience, as situations like this occur regularly. It is imperative to recognize the signs of an ineffective and skewed data set, as doing so will spare your business (and yourself) a lot of headaches and the potential monetary loss that comes from acting on inaccurate insights. If you found this article interesting or useful, please consider following me here on Medium as well as on LinkedIn. And as always, don’t hesitate to leave your thoughts!

Works Cited

Dataset Link

https://www.kaggle.com/datasets/iamsouravbanerjee/heart-attack-prediction-dataset
