Can We Really Trust in AI? A Heart Stroke Estimator Example.

Tolga Akiner
Published in The Startup
Nov 28, 2020 · 11 min read

A simple way to describe ‘decision making’ is to imagine the process as a function: it takes inputs and produces an output. This function view is not only a way of organizing your calculations, or the most foundational component of calculus; it is also a fundamental component of how the decision-making mechanism works. The phenomenon shows up in our lives in many ways: the food you want to have for dinner, the destination for your next vacation, the discipline you choose in college, the people you want to spend your life with, and numerous other examples. There is no need to recount the scientific literature on this topic, which spans fields such as psychology, neuroscience and philosophy. So I am not trying to sail through a well-established body of research; rather, I want to build this article on a newer concept that I strongly relate to the decision-making process: being data-driven, and the reliability of the models and the data…

The concept of ‘data-driven decision making’ has been shaking the world of business intelligence, as demonstrated by different sources like this one, where the idea is described as the combination of sophisticated technologies such as databases, automation, advanced visual analytics, dashboards and so on. The motivation behind this recent interest is the growing evidence behind the claim that being data-driven beats gut feeling, which has been argued through different use-cases such as the huge success of Amazon and Netflix in being strategically data-driven. So if we assume that this topic is interesting and hopefully worth writing a story about, let’s frame the question to change the point of view a little and see the broader picture: we need models that are more trustworthy than our inner voice, so how can we ensure the reliability of analytical models before we insert them into our logic flow with peace of mind?

This question brings us to the reliability, scalability, robustness and efficacy of the current state of the art in data science and the fields around it. Different business questions require different accuracy metrics and reporting structures; in some use-cases, decisions can be made on very surface-level results. However, we know this is not the case in application areas such as health care. As far as I know (and I would be glad to be corrected on this), critical decisions such as surgery or diagnosis are not yet being made by a classification model. Even if we assume that Artificial Intelligence (AI) achieves higher accuracy than a human counterpart (suppose accuracy here is nothing but the success of patient treatment after surgery), I would still strongly doubt there would be any consensus on letting an algorithm decide the future of a human being. A similar example arises with autonomous vehicles, where the question is who will be held accountable if a deadly accident occurs, or with the unreliability of facial recognition for criminal detection, where people have been debating the bias in the data and the resulting injustice.

One of my favorite WebMD images. How much can AI collaborate with us for decision making?

In this article, I’d like to present a model that I built last year and revisited very recently, to raise this question about the reliability of AI in health care specifically, a question whose urgency has only increased with the Covid pandemic. This amazing article clearly shows the current AI-based patient diagnosis efforts and how they outperform humans on classification tasks like heart disease, skin cancer and eye disease, yet patients are still reluctant to embrace AI for different reasons, as demonstrated by a survey. Besides being a very hot AI application field, health care is a great example for discussing these ideas because, as anyone could anticipate, there is no room for false negatives in this field.

So, assuming that Machine Learning (ML) models will always have some level of false positives, how are we going to find common ground from the utilization standpoint? Unfortunately, this is a question I am not able to answer in this article, but I strongly believe that, as ML practitioners, we should keep raising such questions to trigger further discussion and brainstorming that can lead us to potential solutions. For example, if we build ML models that are measurably more accurate than human inspection, would that change the perspective of society? I’m also wondering how people will feel about models like the one I’m walking through in this article, a heart stroke estimator. Is the common sentiment still deeply skeptical, or is there a shift toward an optimistic mindset that seeks future improvement and potential solutions? Would you be OK with an AI diagnosing your condition and deciding your treatment?

The model I’m presenting in this article is a heart stroke estimator, which has been interesting to me for a couple of reasons. First and foremost, I believe AI can make a real difference in future patient diagnosis efforts, and heart stroke classification can be considered part of this broad problem. Patient diagnosis is one of the most rigorous processes in the health care industry, and there are a bunch of problems associated with it, such as human error, non-standardization (different health care practitioners might diagnose differently) and the limited availability of treatment (only half of the world’s population has access to health services). Given the nature of this problem, it is a great candidate for an AI-based classification effort. We have problems with the current approach, increasingly sophisticated and capable models being developed every single day, a growing amount of data and, last but not least, a very large amount of money exchanged in the industry. What more can we ask for?

Well, the sky is the limit for futuristic AI philosophy and implications, so let me get to some hands-on work as well. The dataset is a Hackathon challenge containing patient-level data on a variety of health conditions along with a binary heart stroke label, providing a nice playground for some EDA and classification. The code and the data are available on my Github. I’m not inserting many code gists/snippets in this article, hoping that the notebook in the repo is more convenient for the reader.

Most of my data scientist friends and I like having a quick glance at the raw data, which in this case helped me identify lots of NaNs, so let’s look at the missing portion and then do some cleaning.
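For reference, a one-liner like the one below reproduces this breakdown, assuming the CSV has been read into a DataFrame called dataset (the file name here is hypothetical; see the repo for the actual one):

import pandas as pd

# Hypothetical file name; the actual CSV lives in the Github repo
dataset = pd.read_csv('healthcare_stroke_data.csv')
# Percentage of missing values per column, highest first
print((dataset.isnull().mean() * 100).sort_values(ascending=False))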

smoking_status       30.626728
bmi                   3.368664
stroke                0.000000
avg_glucose_level     0.000000
Residence_type        0.000000
work_type             0.000000
ever_married          0.000000
heart_disease         0.000000
hypertension          0.000000
age                   0.000000
gender                0.000000
id                    0.000000
dtype: float64
# Fill the missing BMI values with the column mean
dataset['bmi'] = dataset.bmi.fillna(dataset.bmi.mean())
# Drop the remaining incomplete rows (i.e., those missing smoking_status)
dataset.dropna(inplace=True)
# The patient id carries no predictive signal
dataset = dataset.drop('id', axis=1)

Now that we’ve gotten rid of the missing values, let’s dive right into the correlations and see how some health conditions or related parameters might affect each other. I personally believe that feature engineering/analysis can have a significant impact on health-care-related problems, as shown in this example (the two correlation figures discussed below are in the notebook):
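For completeness, here is a minimal sketch of how a Pearson correlation heatmap like that can be generated with seaborn, assuming the cleaned dataset from above:

import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation over the numerical columns
corr = dataset.select_dtypes(include='number').corr(method='pearson')
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Pearson correlation of the features')
plt.tight_layout()
plt.show()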

I often see correlation plots presented without any detailed explanation on data science blogs; however, there is usually a lot to unpack in these visuals. So let’s interpret the two figures above as much as we can, and see if you agree with me. I pulled the following numbers by filtering the data in an Excel spreadsheet (before cleaning, so there may be a slight shift between these numbers and the figures, but it really is slight). We don’t have to stick to pandas for everything; it’s just a matter of preference.

Starting with an almost trivial observation: average glucose level (AGL) and BMI are lower for younger people. No surprise… And only 3 people out of the 11546 who are younger than 25 had a heart stroke. I think this is also easy to digest.

There are some other correlations that are trickier to interpret, though. For example, there is only 1 heart stroke among the 34 people with an AGL higher than 265, and that person is a 68-year-old smoker with hypertension and a previous heart disease. Don’t let this slice of data fool you: 1/34 is almost 3%, which is higher than the overall heart stroke ratio of the entire dataset (1.8%), and AGL has only a mild correlation (~0.2) with heart stroke, as can be seen in the Pearson plot above. We just need to rely on the general rule of thumb of any data analysis effort: it all depends on where you are looking…

A similar example: there are 90 people with a BMI larger than 60 and not a single heart stroke among them, yet there is still a slight positive Pearson correlation (~0.1) between the two variables. This is aligned with this reference, which clearly states: “The association between excess weight and stroke risk has been controversial.”
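For the pandas-inclined, the same slices can be pulled roughly like this (column names as in the raw data; the counts quoted above were taken before cleaning, so results on the cleaned frame may shift slightly):

# 34 people with AGL > 265, only 1 stroke among them
high_agl = dataset[dataset['avg_glucose_level'] > 265]
print(len(high_agl), high_agl['stroke'].sum())

# ~90 people with BMI > 60, zero strokes
high_bmi = dataset[dataset['bmi'] > 60]
print(len(high_bmi), high_bmi['stroke'].sum())

# 3 strokes among the 11546 people younger than 25
young = dataset[dataset['age'] < 25]
print(len(young), young['stroke'].sum())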

One other thing I’d like to present here is a high-level clustering analysis, which would be even more useful if the data were unlabeled. Since we’ve just talked about BMI, AGL and age, and since these are the numerical columns, let’s look at only these three parameters to keep things simple. I’m going to start with a quick elbow analysis based on inertia:
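Here is a minimal sketch of that elbow loop. I’m leaving the three features unscaled, which may or may not match the notebook; it matters for the AGL question raised a bit further below:

from sklearn.cluster import KMeans

# The three numerical columns discussed above, unscaled
X = dataset[['age', 'avg_glucose_level', 'bmi']].values

# Within-cluster sum of squares (inertia) for k = 1..9
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
print(inertias)  # plot k vs. inertia and look for the elbow

# Note: without scaling, the wide avg_glucose_level range dominates the
# Euclidean distances; StandardScaler would be the usual remedy.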

And then choosing 2 clusters, I want to look at their distribution in a 3D environment:
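The 3D figure itself is in the notebook; here is a sketch of how such a view can be drawn with matplotlib, coloring each point by its cluster label:

from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (needed on older matplotlib)
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Two clusters, as suggested by the elbow analysis above
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels, cmap='coolwarm', s=5)
ax.set_xlabel('age')
ax.set_ylabel('avg_glucose_level')
ax.set_zlabel('bmi')
plt.show()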

One thing I’m wondering is whether KMeans divided the data up along AGL simply because of the larger range in that dimension, but I don’t want to make this article too long by wandering down rabbit holes. My point is that I’d call the blue points a ‘risk group’ if we did not have any labeled data, and I’d also do further analysis on age, because its Pearson correlation with heart stroke is larger than that of AGL. Considering that we don’t always have labeled data in health care, especially in enterprise applications, such an approach has the potential to reveal some insight.

Do you also feel like that’s enough EDA? Sure, let’s build and train a model and make some predictions. Since I have imbalanced data along with some binary/categorical variables, I did some one-hot encoding, random over-sampling and scaling before feeding the training data into a classifier. Unfortunately, I was not experienced enough to document this code well back then, so even though I remember testing several different classifiers, I no longer have their results. And since random forest gives some decent-looking numbers, I’m just going with that one. Here is the confusion matrix (the figure is in the notebook):
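For reference, a condensed sketch of such a pipeline, using imbalanced-learn’s RandomOverSampler, with illustrative hyperparameters that may differ from the notebook. One hedged note: I oversample only the training split here, because oversampling before the train/test split can leak duplicated minority rows into the test set and flatter the metrics:

import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# One-hot encode the categorical columns
X = pd.get_dummies(dataset.drop('stroke', axis=1))
y = dataset['stroke']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Balance the classes by randomly duplicating minority (stroke) rows
X_train, y_train = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

# Scale, then fit the random forest
scaler = StandardScaler().fit(X_train)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(scaler.transform(X_train), y_train)

y_pred = clf.predict(scaler.transform(X_test))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=4))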

And that moment when you see zero false negatives… Precision and recall values larger than 99.6%… Bells are tolling in my head, a little spark ignites and viciously leads all my nerves to one and only one conclusion: this is too good to be true… Well, it probably is too good, but I do not have an explanation for the skepticism I’m feeling at the moment, and I’d be more than happy to hear some feedback on this one. When I look at all 5831 predictions of 0 (no heart stroke), I see only 1258 active smokers, 314 heart-disease and 686 hypertension data points, so one could argue that the model might be prone to predicting ‘no heart stroke’ just because of the characteristics of the dataset, but this is not a complete explanation for not having even one false negative prediction.

But in the end, this is the whole model: details, data and code, and these are the results, with full transparency… I hope I have not made any major mistake that would deceive us with this partly fishy result, but a minimal false negative rate is exactly what health care professionals expect from an AI application. I’m not saying that false positives are not a big problem. Think of a quick hypothetical example where this model serves as a patient diagnosis platform: 24 people would get a treatment they don’t need, spending time and money, not to mention the sorrow and frustration they would feel. So this zero-false-negative result does not by itself justify production-level deployment, but if we, as humanity, want to use AI for applications more critical than spam classification, playing chess or a movie recommendation engine, and go above and beyond into patient diagnosis, a minimal false negative rate is definitely a strong foundation we can build further progress upon.

Here is my favorite paragraph: the takeaways. What can you get out of this article if you made it this far? Even though civilization has been changing some paradigms around information extraction and data-driven decision making by means of different AI applications, some challenges are still hanging in the air. Patient diagnosis is an AI-suited classification problem that has attracted a great deal of interest at both academic and enterprise levels over the last decade, and the heart stroke estimation model in this article is just a high-level example meant to demonstrate some aspects of such an application. As this model and its results suggest, false predictions can be minimized, potentially dropping well below human error rates, and in my opinion the health care industry should focus on how to leverage these recent advances instead of being stuck behind legal and accountability barriers.

I love Medium, and what I love even more is communicating over Medium: meeting new people, discussing new ideas and being exposed to new stories and perspectives. The open source concept has been enhancing AI efforts and our knowledge, and it is built upon communication between theoreticians, practitioners and end-users, as I discussed in my other article. So please feel free to connect and ping me for any kind of further discussion.
