How Relevant Is Your Heatmap in a Machine Learning Model?

Aditya Srivastva
Published in Analytics Vidhya
4 min read · Mar 21, 2021

Proving relevant features for your machine learning model

Photo by Clay Banks on Unsplash

Many of you will agree with me that the human mind understands graphical representations far better than numeric forms of data. That is where graphs come into the picture. Many ML developers use a heatmap when building a machine learning model. But do we actually understand what it means? And even if we do, does it actually justify your model?

Today I will dig deep into the Seaborn heatmap and justify it using an ML problem, so that it answers our questions.

The actual purpose of this article is to understand the meaning of the heatmap rather than to create the ML model. So we will have a little background setup, a little EDA (exploratory data analysis), and mostly heatmap understanding.

So close the door, grab a coffee ☕, and let's start.

Problem: We have a classification dataset that tells us whether a person tends to have heart disease, given readings for the following factors. (find the data here)

  1. age: age of the person in years
  2. sex: 1 for male, 0 for female
  3. cp: chest pain type (0, 1, 2, 3)
  4. trestbps: resting blood pressure
  5. chol: cholesterol
  6. fbs: fasting blood sugar
  7. restecg: resting electrocardiographic result
  8. thalach: maximum heart rate achieved
  9. exang: exercise-induced angina
  10. oldpeak: ST depression induced by exercise relative to rest
  11. slope: slope of the peak exercise ST segment
  12. ca: number of major vessels colored by fluoroscopy
  13. thal: thallium stress test result
  14. target: 1 means the person tends to have heart disease, 0 means the person does not

Alright, let's start with a little EDA.

What kind of data do we have here?

Head of the data: the top 5 rows
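A first look like this can be sketched in a few lines of pandas. This is only a minimal sketch: the rows below are made up for illustration, and the filename in the comment is an assumption about where you saved the data.

```python
import pandas as pd

# Toy stand-in for the heart disease file. These rows are made up for
# illustration; with the real data you would load the CSV instead, e.g.
# df = pd.read_csv("heart.csv")  # the filename is an assumption
df = pd.DataFrame({
    "age":     [63, 37, 41, 56, 57, 67, 62, 58],
    "sex":     [1, 1, 0, 1, 0, 1, 0, 1],
    "cp":      [3, 2, 1, 1, 0, 0, 0, 0],
    "thalach": [150, 187, 172, 178, 163, 108, 160, 111],
    "target":  [1, 1, 1, 1, 0, 0, 0, 0],
})

print(df.head())    # first five rows
print(df.dtypes)    # column types

# Share of each class in the dependent variable; values near 0.5 suggest balance
class_share = df["target"].value_counts(normalize=True)
print(class_share)
```

Checking the class share early is a cheap way to see whether the dataset is balanced before reading anything into the correlations.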

“target” will be our dependent variable and the rest will be independent variables. From the file we can say that it is a balanced dataset. Now let's look at the seaborn heatmap to find the correlation between target and the other columns.

Data correlation on the heatmap

The way to read the heatmap is shown by the scale on its right side: the greener a cell, the more positively correlated the two columns are; the redder, the more negatively correlated. So we can say that “cp” is highly positively correlated with “target”, and so is “thalach”. This is what the heatmap is indicating. Let's find out whether this is correct.
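For reference, here is a minimal sketch of how such a heatmap can be drawn. The toy rows are made up so the snippet runs on its own, and the `RdYlGn` colormap is my assumption for getting the red-to-green scale described above; the original plot may have used different settings.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up toy rows, only to make the sketch runnable; not the real dataset.
df = pd.DataFrame({
    "sex":     [0, 0, 1, 0, 1, 1, 1, 1, 0, 1],
    "cp":      [2, 1, 3, 2, 0, 0, 0, 1, 0, 0],
    "thalach": [172, 160, 150, 178, 165, 125, 132, 141, 120, 150],
    "target":  [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

# Pairwise Pearson correlation between all numeric columns
corr = df.corr()

# Red for negative, green for positive, with the scale pinned to [-1, 1]
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="RdYlGn", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap (toy data)")
fig.savefig("heatmap.png")  # output filename is arbitrary

# The "target" column of the matrix is the part we care about here
print(corr["target"].sort_values(ascending=False))
```

Pinning `vmin` and `vmax` to -1 and 1 keeps the colors comparable between plots, so a pale cell always means a weak correlation.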

1. Impact Analysis of “thalach” on “target” (positively correlated)

All the data we have for “thalach” looks like the screenshot.

We will segregate this data into two sections: one where 140 < data1 < 180, and the rest, where data2 < 140 or data2 > 180. The result is below.

The first section (SDS) has around 200 records and around a 63% chance of heart disease (target = 1). On the other hand, the second section (TDS) has around 100 records and only a 38% chance of heart disease (target = 1).

Conclusion: This indicates that a person with a higher maximum heart rate tends to have a higher chance of heart disease, which is the same thing the heatmap indicates.
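The segmentation above can be sketched roughly like this. The rows are made up for illustration and the segment labels are hypothetical names of my own; on the real data the counts and rates would be the ones quoted above.

```python
import pandas as pd

# Made-up toy rows for illustration; not the real dataset.
df = pd.DataFrame({
    "thalach": [172, 160, 150, 178, 165, 145, 125, 132, 190, 120],
    "target":  [1, 1, 1, 1, 0, 0, 0, 1, 0, 0],
})

# True where 140 < thalach < 180, False for everything else
in_band = (df["thalach"] > 140) & (df["thalach"] < 180)

# Hypothetical labels for the two segments
df["segment"] = in_band.map({True: "inside 140-180", False: "outside"})

# Row count and share of target = 1 in each segment
summary = df.groupby("segment")["target"].agg(["count", "mean"])
print(summary)
```

Because target is 0 or 1, the mean of each group is exactly the share of people with heart disease in that segment.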

2. Impact Analysis of “sex” on “target” (negatively correlated)

If we look closely at the data, women have an approx. 73% chance of heart disease, while men have an approx. 46% chance.

🤷‍♂️ Hey…! What do you get from this?

Okay, let me explain. The value of sex goes from 0 (female) to 1 (male). A positive coefficient would mean that as the value of “sex” increases, the value of “target” should increase too. But in the current scenario it's the other way around: as the value of sex increases, the value of target decreases. That explains the negative coefficient.
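Here is a small sketch of that reasoning, using made-up rows rather than the real dataset: when the group coded 0 has the higher disease rate, the correlation between the column and target comes out negative.

```python
import pandas as pd

# Made-up toy rows for illustration; 0 = female, 1 = male as in the article.
df = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Share of target = 1 within each sex
rate_by_sex = df.groupby("sex")["target"].mean()
print(rate_by_sex)

# Pearson correlation between the two columns
r = df["sex"].corr(df["target"])
print(f"correlation(sex, target) = {r:.2f}")
```

The sign only tells you the direction in which the rate moves as the code goes from 0 to 1. Swap the encoding (1 for female, 0 for male) and the same data would give a positive coefficient of the same size.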

Cool 😎 I heard you… this is negatively correlated, so let me drop this feature.

🚫 Hang on, that's exactly what you should not do.

I have heard many developers say that if a feature is negatively correlated, you should drop it. Actually, any feature that is negatively correlated is as important as one that is positively correlated. What matters is the magnitude: if the coefficient is very near 0.0 (either positive or negative), then we should consider dropping it. Otherwise it is important for our model regardless of whether it is positive or negative.
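One way to put this rule into practice is to rank the features by the absolute value of their correlation with target. This is only a sketch under assumptions: the rows are made up, and the 0.1 cut-off is a hypothetical threshold I picked for illustration, not a standard value.

```python
import pandas as pd

# Made-up toy rows for illustration; not the real dataset.
df = pd.DataFrame({
    "cp":     [2, 1, 3, 2, 0, 0, 1, 0],
    "sex":    [0, 0, 1, 0, 1, 1, 1, 1],
    "fbs":    [0, 1, 0, 1, 0, 1, 0, 1],
    "target": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Correlation of every feature with target (target itself removed)
corr_with_target = df.corr()["target"].drop("target")

# Rank by magnitude, ignoring the sign
strength = corr_with_target.abs().sort_values(ascending=False)
print(strength)

# Hypothetical cut-off: features this close to zero are candidates to drop
THRESHOLD = 0.1
weak_features = strength[strength < THRESHOLD].index.tolist()
print("Consider dropping:", weak_features)
```

In this toy data “sex” is strongly negative and still ranks near the top, while “fbs” sits at zero and is the only candidate for dropping.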

Importance of Data Investigation

One should always try to understand the type of data one is working with.

Let me explain with an example. If you take a look at the “cp” column, its values range from 0 to 3, which mean…

  • 0: Typical angina
  • 1: Atypical angina
  • 2: Non-anginal pain
  • 3: Asymptomatic

Fine, what does it mean? “cp-0” means this type of pain is very much related to the heart, and “cp-3” means the pain is not heart pain. This means that the lower the cp value, the more the person should tend to have heart disease. By this reasoning cp is supposed to have a negative coefficient, but it actually has a positive one.
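A quick way to investigate a surprise like this is to look at the disease rate for each cp value instead of trusting a single coefficient. The rows below are made up to mimic the pattern described above; they are not the real dataset.

```python
import pandas as pd

# Made-up toy rows for illustration; not the real dataset.
df = pd.DataFrame({
    "cp":     [0, 0, 0, 0, 1, 1, 2, 2, 2, 3],
    "target": [0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
})

# Row count and share of target = 1 for every chest pain type
by_cp = df.groupby("cp")["target"].agg(["count", "mean"])
print(by_cp)

# The single number the heatmap would show for this pair of columns
r = df["cp"].corr(df["target"])
print(f"correlation(cp, target) = {r:.2f}")
```

Keep in mind that cp is a category code, so a Pearson coefficient treats 0 to 3 as an ordered scale even though the labels are just types of pain. The per-value table is the safer thing to read.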

This means we need to understand the data we are working with. It is supposed to be the root of your model.
