Ensemble Machine Learning
Using a range of ensemble machine learning frameworks to predict the presence of heart disease.
In this practical tutorial, we will undertake exploratory analysis of a dataset that contains records of 303 individuals with varying degrees of symptoms that may be used to indicate the presence of heart disease, as determined by the patient’s angiographic disease status.
We will then apply ensemble methods including Adaptive Boosting and Gradient Boosting algorithms to build a model that accurately predicts a patient’s heart disease status based on the attributes contained within the dataset.
Common measurements were taken from each patient, including blood pressure, type of chest pain, electrocardiographic abnormalities, fasting blood sugar and so on. The original dataset can be found at https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/.
We use the processed Cleveland data, which contains 14 attributes of the original dataset. The attributes are:
- Age: age in years
- Sex: sex (1 = male; 0 = female)
- Cp: Chest pain type (Value 1 = typical angina, Value 2 = atypical angina, Value 3 = non-anginal pain, Value 4 = asymptomatic pain)
- Trestbps: resting blood pressure in mm Hg on admission to the hospital
- Chol: serum cholesterol in mg/dl
- Fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- Restecg: resting electrocardiographic results (Value 0 = normal, Value 1 = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), Value 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)
- Thalach: maximum heart rate achieved
- Exang: exercise induced angina (1 = yes; 0 = no)
- Oldpeak: ST depression induced by exercise relative to rest (refer Restecg)
- Slope: the slope of the peak exercise ST segment (refer Restecg) (Value 1 = upsloping, Value 2 = flat, Value 3 = downsloping)
- Ca: number of major vessels (0–3) colored by fluoroscopy
- Thal: thallium stress test result (Value 3 = normal, Value 6 = fixed defect, Value 7 = reversible defect)
- Num (the predicted attribute): diagnosis of heart disease* (angiographic disease status): Value 0 = < 50% diameter narrowing; Value greater than 0 = > 50% diameter narrowing
*For this tutorial, we will assume that 0 means no heart disease, and 1,2,3,4 means heart disease.
Data Management, Cleaning and Preprocessing
Before proceeding, we need to examine the data to gain more understanding about its underlying properties. This can be completed by heading to the Data Manager module in AutoStat®.
Let’s have a look at the data and undertake some basic descriptive statistics.
Changing Data Types
It also appears that some of our numerical variables are actually categorical. In the case of the Thal variable, we have three possible values. Although they present as numbers (3, 6 and 7), we know from the data description that these numbers simply represent three different categories. Let’s change the data type to categorical.
Descriptive Statistics and Missing Values
After we convert all the variables to the correct data type, we can still see from the Descriptive Statistics output that there are missing values in the variables Ca and Thal.
As we have explored in our other tutorials, there isn’t a hard-and-fast approach to dealing with missing values. Depending on the situation and underlying data, we may impute a certain value (such as the mean of the other values) in place of the missing values, treat them as zero, or remove those observations entirely. In this case, as we only have a few missing values, we will simply remove the observations that contain them.
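For readers who want to reproduce this step outside AutoStat®, here is a minimal Python sketch of the "remove those observations" option. It assumes missing values are encoded as `?`, as they are in the UCI `processed.cleveland.data` file. The column names, the helper names `load_rows` and `drop_missing`, and the three sample rows are made up for illustration and are not real patient records.

```python
import csv
import io

COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

# Three made-up rows in the same comma-separated layout (not real patients).
SAMPLE = """\
63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
52,1,3,138,223,0,0,169,0,0.0,1,?,3,0
58,0,4,130,197,0,0,131,0,0.6,2,0,?,1
"""

def load_rows(text):
    """Parse the text into dicts, turning '?' into None and numbers into floats."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        values = [None if v.strip() == "?" else float(v) for v in record]
        rows.append(dict(zip(COLUMNS, values)))
    return rows

def drop_missing(rows):
    """Keep only the rows in which every attribute is present."""
    return [r for r in rows if all(v is not None for v in r.values())]

rows = load_rows(SAMPLE)
complete = drop_missing(rows)
print(len(rows), "rows loaded,", len(complete), "kept")
```

Imputing a value instead would mean replacing each `None` rather than filtering the row out.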
In this dataset, the diagnosis of heart disease is not binary: it is recorded on a spectrum of outcomes, where 0 = no heart disease and outcomes 1–4 indicate increasing severity. For this tutorial, we are going to collapse these values into one of two possible outcomes: patients either present with heart disease, or they don’t.
To do so, we need to create a new variable that buckets the values into binary outcomes. The new variable is called Target, and is created by converting 0 values to “No Presence” (no heart disease presentation) and converting 1, 2, 3 and 4 values to “Presence” (heart disease presentation).
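The bucketing rule is simple enough to write down directly. The following is only a sketch of the rule described above; the function name `to_target` and the example codes are illustrative.

```python
def to_target(num):
    """Map the 0-4 diagnosis code to the two labels used in this tutorial."""
    if num not in (0, 1, 2, 3, 4):
        raise ValueError(f"unexpected diagnosis code: {num!r}")
    return "No Presence" if num == 0 else "Presence"

codes = [0, 2, 1, 0, 4, 3]  # made-up example codes
targets = [to_target(c) for c in codes]
print(targets)
```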
Splitting into train and test
When you are performing classification using machine learning or statistical models, you may wish to split the data into a training and a test data set. In this case, we want a split that is stratified by Target, so that both sets keep the same proportion of each outcome. We have chosen to use 80% (238) of the data to train the models and the remaining 20% (59) to test the predictive ability of the model.
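A stratified split can be sketched in a few lines of plain Python: shuffle the rows of each class separately and take the same fraction from each. The class counts below (160 and 137) are an assumption chosen to sum to the 297 rows used here, and `stratified_split` is a hypothetical helper, not an AutoStat® function. Libraries such as scikit-learn offer the same behaviour through `train_test_split(..., stratify=...)`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), taking test_fraction of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Assumed class counts that sum to 297 rows (illustrative only).
labels = ["No Presence"] * 160 + ["Presence"] * 137
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

With these assumed counts, rounding within each class gives the same 238/59 split quoted above.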
Now we have taken care of data housekeeping, let’s undertake some exploratory analysis.
Visualising the data
Before we undertake any modelling, we should visualise the data so that we can gain an understanding of the relationships between the variables, and check for obvious data errors.
In this analysis, we are interested in finding out which variables are related to Target. So we’ll plot the data using a range of charts that will help us understand these relationships.
Let’s explore a few chart types.
Bar chart of Target with other categorical variables, breaking the visualisation down into various subgroups:
Distribution of Age with subgroup Target:
In this chart, we can see the distribution of patients with and without heart disease by age. Interestingly, the age distribution of people who do not have heart disease looks roughly normal, whereas the age distribution of people who do have heart disease is left skewed. This may indicate that older people are more likely to have heart disease.
Pairplot of all numerical variables:
The pairplot shows the histogram and density of each numerical variable and also shows the correlation of each pair of variables. From the pairplot we can see that, among others:
- Chol and Oldpeak are right skewed, and Thalach is left skewed.
- Age has a strong positive relationship with Trestbps, and a strong negative relationship with Thalach.
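Both observations can also be checked numerically. The sketch below computes the skewness of a variable (positive for a long right tail, negative for a long left tail) and the Pearson correlation between two variables. The numbers are made up to show the signs only and are not taken from the Cleveland data.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def skewness(xs):
    """Population skewness: third central moment over the 1.5 power of the variance."""
    m = mean(xs)
    m2 = mean([(x - m) ** 2 for x in xs])
    m3 = mean([(x - m) ** 3 for x in xs])
    return m3 / m2 ** 1.5

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up values, chosen only to illustrate the direction of each statistic.
age = [35, 42, 50, 58, 63, 70]
thalach = [182, 175, 160, 150, 141, 125]
chol_like = [200, 210, 215, 220, 230, 400]  # one large value: long right tail

print("correlation(age, thalach):", round(pearson(age, thalach), 3))
print("skewness(chol_like):", round(skewness(chol_like), 3))
```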
Scatterplot of age vs. Thalach (maximum heart rate achieved):
The graph shows that Age and Thalach have a strong negative relationship and, perhaps unsurprisingly, that younger people with a high Thalach are less likely to present with heart disease.
Predictive modelling using ensemble methods
Adaptive Boosting (AdaBoost)
Adaptive Boosting (AdaBoost) is a supervised learning algorithm that is trained by sequentially forming an additive model of base learners (for example, small Classification and Regression Trees (CARTs)), each of which may predict only slightly better than a random guess.
A probability distribution is defined over the training data and updated with the estimation of each new base learner to allow the next learner to focus on training errors made by its predecessor.
Additionally, coefficients are estimated for each base learner which quantify its overall influence on future predictions following training.
In other words, AdaBoost combines multiple weak classifiers into a strong classifier. Each weak classifier is a stump: a simple decision tree with a single split. Each new classifier is trained with extra weight on the observations that the previous classifiers got wrong. For a deeper dive on the mathematics behind Adaptive Boosting, check out our documentation here.
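To make the description concrete, here is a minimal from-scratch sketch of two-class AdaBoost with stumps. It is not AutoStat®'s implementation: the function names are hypothetical, labels are coded as +1 and -1, and the ten-point dataset is a toy example that no single stump can classify perfectly.

```python
import math

def stump_predict(stump, x):
    """A stump is (feature index, threshold, sign): one split, two leaves."""
    feature, threshold, sign = stump
    return sign if x[feature] <= threshold else -sign

def best_stump(X, y, w):
    """Pick the stump with the lowest weighted training error."""
    best, best_err = None, float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for sign in (1, -1):
                stump = (feature, threshold, sign)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(stump, xi) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err

def adaboost_fit(X, y, n_estimators=10):
    n = len(X)
    w = [1.0 / n] * n  # probability distribution over the training rows
    model = []         # list of (coefficient, stump) pairs
    for _ in range(n_estimators):
        stump, err = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # influence of this learner
        model.append((alpha, stump))
        # Increase the weight of misclassified rows, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in model)
    return 1 if score >= 0 else -1

# Toy one-feature dataset (illustrative only).
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

model = adaboost_fit(X, y, n_estimators=3)
train_accuracy = sum(adaboost_predict(model, xi) == yi for xi, yi in zip(X, y)) / len(X)
print("training accuracy with 3 stumps:", train_accuracy)
```

On this toy data a single stump gets 7 of 10 rows right, while the weighted vote of three stumps classifies all ten correctly.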
The Model Builder module allows you to set up your train and test data, select the machine learning model of your choice, choose model-specific test options (training sets, cross-validation etc.), drag and drop your target and predictor variables, and of course, choose your model parameters.
To start with, let’s drag and drop our train/test datasets and pick AdaBoost.
Now we can see the list of variables in the middle pane. Here, we can pick our target and explanatory variables according to what we want to model. Keep in mind a couple of things:
- The # symbol indicates that the variable is specified as continuous
- The abc symbol indicates a variable is categorical
- Variables which are in black text indicate there are no missing values. If the variable name is in red text, missing values are present.
Now you need to define your model. To do this, drag your Forecast (outcome) variable into the appropriate space (see below). In our case, this variable is Target. Then drag the explanatory variables into the lower space. In our case, we want to predict Target against all the other variables in the dataset.
Now we can set the parameters of the AdaBoost algorithm for this model.
A simple explainer of some Adaptive Boosting parameters:
- Number of Estimators means how many stumps (simple decision trees) you want to place in the algorithm.
- Algorithm SAMME.R uses the probability estimates to update the additive model, while SAMME (Stagewise Additive Modeling using a Multi-class Exponential loss function) uses the classifications only.
We have chosen 10 for Number of Estimators and SAMME for Algorithm. We can now run the model using the Analyse button at the bottom middle of the screen.
Once you implement the analysis, you are taken to the Model Output module. The results page is divided into two tabs:
- Model Evaluation: Model Output, Confusion Matrix, Best Statistical Parameter, Classification report and AUC chart
- SSP: Graphs of variable Importance
As we can see, with Number of Estimators set to 10, we get a model accuracy of 88.1356%. The three variables with the highest importance, or contribution to the model performance, are Thal, Cp and Ca. Back to our data dictionary:
Thal: thallium stress test result (Value 3 = normal, Value 6 = fixed defect, Value 7 = reversible defect)
Cp: Chest pain type (Value 1 = typical angina, Value 2 = atypical angina, Value 3 = non-anginal pain, Value 4 = asymptomatic pain)
Ca: number of major vessels (0–3) colored by fluoroscopy.
Based on the result of this example model, we can see that the outcome of the patient’s thallium stress test, the type/severity of anginal pain, and the outcome of their blood vessel fluoroscopy have a notable impact on the predicted presence of heart disease.
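As a quick sanity check on the reported accuracy: with 59 test rows, 88.1356% corresponds to 52 correct predictions. The sketch below shows how accuracy is read off a confusion matrix. The four cell counts are hypothetical and were chosen only so that they sum to 59 with 52 on the diagonal; the actual cells are in the Model Output.

```python
# Hypothetical confusion-matrix counts, keyed by (actual, predicted).
# Only the totals (59 rows, 52 correct) follow from the reported accuracy.
confusion = {
    ("No Presence", "No Presence"): 29,  # true negatives (hypothetical)
    ("No Presence", "Presence"): 3,      # false positives (hypothetical)
    ("Presence", "No Presence"): 4,      # false negatives (hypothetical)
    ("Presence", "Presence"): 23,        # true positives (hypothetical)
}

total = sum(confusion.values())
correct = sum(n for (actual, predicted), n in confusion.items()
              if actual == predicted)
accuracy = correct / total
print(f"{correct}/{total} correct = {accuracy:.4%}")
```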
Feel free to try different values for Number of Estimators and examine the results:
It seems that the test error reaches a local minimum at around 23 estimators. With more than 23 estimators, the model starts to overfit (high variance); with fewer than 23, it underfits (low variance but high bias).
Gradient Boosting (GBM)
Gradient boosting is a supervised learning algorithm which sequentially trains a series of base learners (typically weak learners which may have a level of predictive power slightly better than a uniform random guess) to make predictions. Because their functional representation is additive, a large number of learners can be trained to develop a relatively sophisticated model.
Both Gradient Boosting and AdaBoost create weak learners in sequence. The main difference is that in Gradient Boosting the weak learners can be decision trees with more than one split, rather than stumps. Additionally, it is a more generic algorithm for finding approximate solutions to the additive modeling problem. For more information on Gradient Boosting, check out our documentation here.
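The additive idea can be sketched from scratch as well. The example below is a deliberately simplified version for a two-class problem: each round fits a small regression tree (here a stump, to keep the code short) to the negative gradient of the log loss, and adds it to the running score scaled by a learning rate. The function names, the toy data and the learning rate of 0.3 are assumptions for illustration. Production implementations add refinements such as deeper trees, re-estimated leaf values and subsampling.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_regression_stump(x, residuals):
    """One-split regression tree on a single feature, chosen by squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_value(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def gbm_fit(x, y, n_estimators=50, learning_rate=0.3):
    prior = sum(y) / len(y)
    f0 = math.log(prior / (1 - prior))  # start from the log-odds of the prior
    scores = [f0] * len(x)
    stumps = []
    for _ in range(n_estimators):
        # Negative gradient of the log loss with respect to the score: y - p.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_regression_stump(x, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_value(stump, xi)
                  for s, xi in zip(scores, x)]
    return f0, stumps, learning_rate

def gbm_predict_proba(model, xi):
    f0, stumps, learning_rate = model
    return sigmoid(f0 + learning_rate * sum(stump_value(s, xi) for s in stumps))

def log_loss(model, x, y):
    """Average log loss of the model on (x, y), with y coded as 0 or 1."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = gbm_predict_proba(model, xi)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(x)

# Toy one-feature dataset (illustrative only), labels coded as 0 and 1.
x = list(range(10))
y = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

loss_by_size = {n: log_loss(gbm_fit(x, y, n_estimators=n), x, y) for n in (0, 1, 50)}
print(loss_by_size)
```

Because each learner only nudges the score by a small amount, the training loss falls as more learners are added, which is why the number of estimators and the learning rate are usually tuned together.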
Drag and drop the datasets, select Gradient Boosting as the model of choice, and drag in the same variables into the appropriate sections as we did with AdaBoost. Now we can set the parameters of the algorithm.
A simple explainer on some Gradient Boosting parameters:
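- Gradient boosting algorithms can follow a number of different loss functions. Deviance, for example, is the logistic (binomial log-likelihood) loss used for classification, which allows the model to output class probabilities.
- You can select a number of different learning criteria, which measure the quality of a candidate split. Friedman Mean-Squared Error, for example, is the mean squared error combined with Friedman's improvement score.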
- Maximum Tree Depth, Minimum Split and Minimum Leaf Samples control the shape of each tree, while Subsample sets the fraction of the training rows used to fit each tree.
Varying the number of trees (estimators), tree depth, learning rate, number of splits and leaves will give you varying results. While more advanced parameter tuning practices are beyond the scope of this tutorial, AutoStat® will automatically search for the parameter values that give the lowest testing error. This is what we call Bayesian Hyperparameter Optimisation. For more information, check out this handy Medium article here.
After we have run the GBM model, we can analyse its results.
As you can see from the Best Statistical Parameter output, the best maximum tree depth is 5, the best minimum samples split is 3, the best minimum samples leaf is 3 and the best learning rate is 0.0743. The accuracy is 88.1356%, the same as the 10-estimator AdaBoost model above and slightly lower than AdaBoost at its best Number of Estimators.
Building a data science pipeline
A critical element of data science is the ability to turn your project into an end-to-end automated workflow that re-runs either on-demand, or as new data becomes available. Pipelines are automated workflows that allow the organisation, decision-maker or researcher to productionise their prototypes and integrate them with existing workflows.
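Outside a point-and-click tool, the core idea of a pipeline can be sketched as an ordered list of steps, each taking the output of the previous one. The step functions, field names and sample batch below are hypothetical and only illustrate re-running the same workflow whenever new data arrives.

```python
from functools import reduce

def drop_missing(rows):
    """Cleaning step: discard rows that contain a missing value."""
    return [r for r in rows if None not in r.values()]

def add_target(rows):
    """Preprocessing step: derive the binary Target from the 0-4 code."""
    return [{**r, "Target": "No Presence" if r["num"] == 0 else "Presence"}
            for r in rows]

def run_pipeline(rows, steps):
    """Apply each step to the output of the previous one."""
    return reduce(lambda data, step: step(data), steps, rows)

PIPELINE = [drop_missing, add_target]

# A made-up batch of new records (not real patients).
new_batch = [
    {"age": 54, "ca": 0, "thal": 3, "num": 0},
    {"age": 61, "ca": None, "thal": 7, "num": 2},
    {"age": 47, "ca": 1, "thal": 7, "num": 3},
]
result = run_pipeline(new_batch, PIPELINE)
print(result)
```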
While we might not need to see the Cleveland Heart Disease project updated on-demand or as new data becomes available (alas, the dataset was published 30 years ago!), this is how our project might look as a production-ready pipeline.
In the next tutorial, we will explore how to build and deploy data science pipelines so that any data project can be productionised from end-to-end, from accessing the data itself, through to cleaning and preprocessing; to visualisations, model specification and finally, dissemination of results via APIs (to be consumed elsewhere), user-friendly dashboards, and publishable reports, all without writing a line of code.
See for yourself
Want to start running your own machine learning projects?
Head to our website for a free trial here: