By Mr. Data Science
Data Science has had a huge impact on the field of medical science. Some of the areas where it is making a difference include:
- Medical image analysis
- Genetics and Genomics research
- Drug discovery
- Predictive Analytics in Healthcare
- Data Analysis of healthcare data
Topics like the discovery of new drugs are a little beyond the scope of this article but we can still take a look at some examples of predictive and exploratory data analysis of healthcare data. In example 1 we’ll look at data on cancer and how we could approach medical data analysis. Examples 2 and 3 will look at predictive analysis and the metrics we can use to determine the effectiveness of different models.
The use of data to better understand topics in medical science is not new. Some of the earliest uses of data analysis in medicine were attempts in the early nineteenth century to understand cholera. At that time the causes of cholera outbreaks were not understood, so the disease was common in cities throughout Europe and North America. Between 1832 and 1866, three major waves of cholera spread across the world via trade links. In France, heat maps of Paris were created to show how badly the different districts of the city were affected, and in 1854 Dr. John Snow collected and plotted data showing a strong correlation between cases of cholera and certain water pumps in the city of London (at that time most people used public pumps or wells for drinking water). This is probably one of the earliest examples of what we now call data analysis. Dr. Snow had not discovered the cause of cholera, but with some data visualization he had shown a link between the spread of the disease and the water pumps people used.
In the early days of data science there were some people who saw the potential of data analysis in medicine. The article ‘Medical data mining: knowledge discovery in a clinical data warehouse’, written in 1997, describes an early example of using big data analysis to advance medical science. Unfortunately, it took some time for data science to be taken seriously in the medical field, but there are areas where it is already making a significant difference, for example medical imaging and diagnosis.
Over the last decade or so, machine learning and, more recently, neural networks have been used to assist clinicians in the diagnosis of disease. The article ‘3D Deep Learning on Medical Images: A Review’ briefly outlines the introduction of convolutional neural networks (CNNs), and in particular the use of CNNs with 3D medical imaging to diagnose disease.
Before we get started, let’s set up your environment:
To follow along with this article you will need to install the following Python libraries: pandas, matplotlib, seaborn, and scikit-learn.
Data Science and Medicine
Example 1 — Exploring cancer data by state
The data used in this example is in three csv files: Cancer.csv, Cancer_Occurrence.csv, and State.csv.
All three files are available on Kaggle.
Let’s start by importing the necessary libraries and reading the files:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df_c = pd.read_csv('medical/Cancer.csv')
df_co = pd.read_csv('medical/Cancer_Occurrence.csv')
df_s = pd.read_csv('medical/State.csv')
We can use pandas’ built-in functions to inspect the data and its format, as shown below.
The first dataset contains the different types of cancer covered in the dataset, such as lung cancer and breast cancer; the shape of df_c indicates there are 64 unique types of cancer. The second dataset contains information about cancer occurrences by state, and the last provides information about cancer occurrence by race.
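For reference, the usual inspection calls are shape, head() and info(); here is a minimal sketch on a tiny hypothetical frame (the column names and values are made up for illustration, not taken from the real Kaggle files):

```python
import pandas as pd

# A tiny stand-in for df_c -- hypothetical values, not the real Kaggle data
df_c = pd.DataFrame({"Cancer_id": [1, 2, 3],
                     "Cancer_name": ["Lung", "Breast", "Colon"]})

print(df_c.shape)    # number of (rows, columns) -> (3, 2)
print(df_c.head(2))  # the first two rows
df_c.info()          # column names, dtypes and non-null counts
```

The same three calls on the real frames give the row counts and column layouts described in the text.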
Using groupby to explore the cancer dataset
Counting the number of cancer occurrences by state can be done with the pandas groupby function. First we group by State_name, then sum the Count column.
s = df_co.groupby(['State_name'])['Count'].sum()
s.sort_values(ascending=False).head()

State_name
New York    111276
...
Name: Count, dtype: int64
The number of cancer occurrences by state is not surprising when you consider that larger states (by population) are more likely to have cancer cases.
If we look at the top five states by population:
You can see the top five states by population are the same states as above, but the order is different. By population we would expect Texas to have more cancer cases than New York, but the data we have suggests cancer is more common in NY than in TX.
From here, we could dig a little deeper:
- Is cancer really more common in NY?
- Could it have something to do with health insurance? A higher percentage of people in NY have health insurance; does this mean cases of cancer are going undetected in TX?
- Does cancer have something to do with geographic location?
There are many possible explanations for this observation; I’ll leave it to you to investigate further.
The main point here is that data science and computers allow us to research diseases and other medical issues unlike ever before. Such an analysis in the nineteenth century would have taken MUCH longer.
As a technical takeaway, we used groupby to group rows by state and then sum the ‘Count’ column.
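The same group-then-aggregate pattern works on any frame; here is a self-contained sketch with hypothetical counts (the values are made up, not the real occurrence data):

```python
import pandas as pd

# Hypothetical occurrence records -- illustrative only
df_demo = pd.DataFrame({
    "State_name": ["New York", "Texas", "New York", "Texas"],
    "Count": [100, 80, 50, 30],
})

# Group rows by state, then sum the 'Count' column per group
totals = df_demo.groupby(["State_name"])["Count"].sum()
print(totals.sort_values(ascending=False))
# New York    150
# Texas       110
```

Swapping "Count" for any other numeric column, or .sum() for .mean() or .count(), gives the other common aggregations.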
In the next example we will look at how data science allows us to do things that would have seemed like magic to a nineteenth century scientist. To motivate the following section, do you think we can use something like resting heartbeat and blood pressure to predict who is at risk of a heart attack?
Example 2 — Using data to predict heart disease
The data used in this binary classification example is available on Kaggle. First, let’s import the dataset.
df_1 = pd.read_csv('medical/heart.csv')
df_1.head(1)
Additional information about each column is provided below
- age — measured in years
- sex — male = 1; female = 0
- cp — Chest pain type: Typical Angina = 0, Atypical Angina = 1, Non-anginal Pain = 2, Asymptomatic = 3
- trtbps — resting blood pressure (in mm Hg on admission to the hospital)
- chol — serum cholesterol in mg/dl
- fbs — fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg — resting electrocardiographic results (0 = normal; 1 = having ST-T; 2 = hypertrophy)
- thalachh — maximum heart rate achieved
- exng — exercise induced angina (1 = yes; 0 = no)
- oldpeak — ST depression induced by exercise relative to rest
- slp — the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
- caa — number of major vessels (0–3) colored by fluoroscopy
- thall — 2 = normal; 1 = fixed defect; 3 = reversable defect
- output — the predicted attribute: diagnosis of heart disease (angiographic disease status); 0 = < 50% diameter narrowing, 1 = > 50% diameter narrowing
The goal of this analysis is to determine if a person is at risk of a heart attack given the available data. This is a binary classification problem where we will define class 0 = not at risk, class 1 = at risk.
First, let’s check for nulls:
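The check itself is a one-liner, df_1.isnull().sum(); here is the same call on a tiny hypothetical frame (column names borrowed from the heart data, values made up):

```python
import pandas as pd

# Tiny hypothetical stand-in for df_1 (the real data is read from heart.csv)
df_demo = pd.DataFrame({"age": [63, 37, 41],
                        "chol": [233, 250, 204]})

# isnull().sum() reports the number of missing values per column
print(df_demo.isnull().sum())  # age 0, chol 0 -> no nulls
```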
The results indicate there are no nulls, and the data in all columns is numerical so there is no need for any type conversions. One thing that we might want to do, however, is normalize the data. In its current state, there is a significant difference in the range of values for the different columns.
We can visualize correlations between the different variables using a Python visualization library called Seaborn. This is one of many Python libraries that can be used to create a correlation matrix.
The default plot size is a little small, so using matplotlib we can increase the figure size; this makes it easier to read the numerical values on the plot.
plt.rcParams['figure.figsize'] = [10, 10]
Using the annot=True parameter will display the correlation values on the plot; the correlation values come from the pandas function corr():
sns.heatmap(df_1.corr(),annot = True);
This correlation matrix shows, for example, that there is some negative correlation between age and maximum heart rate achieved (thalachh), and some positive correlation between the output and chest pain type (cp) columns. The output column is the value we are trying to predict, so from this plot we can get some idea of which features (columns) correlate more strongly with the output and therefore play a bigger role in the prediction.
There are other things we can do to change the visual appearance of this plot; for example, we can fix the color scale using vmin=-1, vmax=1, center=0:
sns.heatmap(df_1.corr().round(2),annot = True, vmin=-1, vmax=1, center= 0);
Aesthetic appearance is subjective, but the above change does bring out the higher correlation values a little more clearly, for both positive and negative correlations. We can now see immediately that the variables restecg, fbs, chol and trtbps have almost no correlation with the output.
We’ll try a support vector machine (SVM) to predict the output. We will scale the column values and use scikit-learn’s train_test_split to divide the data into training and test sets. SVM is a supervised algorithm, so it requires labeled training data.
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, roc_curve
Next, we will divide the data into X and y, where y is the variable we are trying to predict. We’ll also scale the data to remove any bias due to some columns having a higher numerical range than others.
X = df_1.drop(['output'],axis=1)
y = df_1['output']
scaler = RobustScaler()
# scaling the continuous features
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Finally, we can run the model using the scikit-learn functions shown below.
clf = SVC(kernel='linear', C=1, random_state=42).fit(X_train,y_train)
# predicting the values
y_pred = clf.predict(X_test)

# printing the test accuracy
print(accuracy_score(y_test, y_pred))
The model’s performance can be evaluated using a classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.86      0.86        29
           1       0.88      0.88      0.88        32

    accuracy                           0.87        61
   macro avg       0.87      0.87      0.87        61
weighted avg       0.87      0.87      0.87        61
To improve the accuracy of our model, we could experiment with different algorithms, different test-train ratios, and so on. I’ll leave it as an exercise for you to explore how these changes impact model results. Overall, this example demonstrated how you can use machine learning, specifically the support vector machine algorithm, to solve a binary classification problem.
In the next example we will try to predict heart disease using logistic regression.
Example 3 — Predicting heart disease using logistic regression
The data used in this example is available on Kaggle.
The Logistic Regression model predicts the probability of something given some data. In this example, we will solve the binary classification problem as we did in example 2, but using a different model type. Note that classification is a very common task in data science so you can apply what you learn here to a wide variety of problems.
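Under the hood, logistic regression passes a linear combination of the features through the logistic (sigmoid) function, which squashes any real number into a probability between 0 and 1; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    # logistic function: maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5 -- the decision boundary
print(sigmoid(4))    # close to 1: confident "class 1"
print(sigmoid(-4))   # close to 0: confident "class 0"
```

scikit-learn’s LogisticRegression learns the coefficients of that linear combination from the training data, so we never call the sigmoid directly.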
df_2 = pd.read_csv('medical/framingham.csv')
df_2.tail(5)
There are just over four thousand rows and sixteen columns in this dataset. The education and glucose columns have a significant number of nulls.
The quickest way to deal with nulls is to drop the rows that contain nulls but keep in mind that you are losing information.
df_2 = df_2.dropna()
df_2.isnull().sum()

male    0
...
Next, we need to separate the variable we are trying to predict (y) from all the other variables (X):
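A sketch of that separation, assuming the Framingham target column is named TenYearCHD (the toy frame below uses made-up values so the snippet runs on its own):

```python
import pandas as pd

# Hypothetical mini-frame standing in for df_2; 'TenYearCHD' is assumed
# to be the target column of the Framingham dataset
df_2 = pd.DataFrame({"age": [39, 46],
                     "sysBP": [106.0, 121.0],
                     "TenYearCHD": [0, 0]})

X = df_2.drop(["TenYearCHD"], axis=1)  # every predictor column
y = df_2["TenYearCHD"]                 # the variable to predict
```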
Now we must split X and y into training data and testing data so we can evaluate our model’s performance:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30,random_state=42)
We also need to transform (normalize) the numerical data, otherwise some columns would have much higher numerical values than other columns which could skew the predictions.
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)
recall = cm[1, 1] / (cm[1, 1] + cm[1, 0])  # TP / (TP + FN)
There are different metrics we can use to evaluate the success of a predictive model.
- Accuracy = what fraction of the test data was correctly classified
- Confusion Matrix = how many true positives, true negatives, false positives and false negatives
- Recall = TP/(TP+FN) where TP is true positive and FN is false negative
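All three metrics can be computed with scikit-learn; here is a self-contained sketch on made-up labels (not the model’s actual predictions):

```python
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score

# Toy true and predicted labels -- illustrative only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = true class, cols = predicted
tn, fp, fn, tp = cm.ravel()            # binary case unpacks in this order

print(accuracy_score(y_true, y_pred))  # (tp + tn) / total = 6/8 = 0.75
print(recall_score(y_true, y_pred))    # tp / (tp + fn) = 3/4 = 0.75
print(tp / (tp + fn))                  # same recall, computed by hand
```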
In other words, recall is the number of people correctly identified as being at risk of a heart attack divided by the sum of people correctly identified as at risk and people incorrectly identified as not at risk. From a medical point of view these false negatives matter: we want to reduce them as much as possible. The takeaway from this example is to not rely on accuracy alone; look at recall and the confusion matrix as well. A perfect model would reduce both false positives and false negatives to zero.
A quick review of what you’ve learned:
To wrap things up, let’s summarize what we have discussed. At this point, you should have a good understanding of:
- How to use groupby and sum data with a more complex structure
- How to use a SVM to solve a binary classification problem
- How to use metrics, other than just accuracy, to measure the effectiveness of a model.
Thank you for taking the time to read this article. If you have any feedback or suggestions for improving this article, we would love to hear from you.
- Prather, J., ‘Medical data mining: knowledge discovery in a clinical data warehouse’, retrieved 05.03.2021
- Singh, S., ‘3D Deep Learning on Medical Images: A Review’, retrieved 05.03.2021