Heartbeat Audio Classification using Deep Learning

Detect abnormal heartbeats using state-of-the-art ML & DL models to flag early onset of heart diseases

Vishwak
12 min read · Dec 9, 2022

This project was built by Nittala Venkata Sai Aditya, Gandhi Disha, Saibhargav Tetali, Vishwak Venkatesh and Soumith Reddy Palreddy

Background

Let’s have a heart-to-heart conversation, shall we?

Photo from Harvard T.H. Chan School of Public Health

Cardiovascular diseases (CVDs) are the leading cause of death worldwide, accounting for ~29% of all global deaths. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely. Many of these premature deaths could be prevented by identifying those at the highest risk of CVD and ensuring they receive appropriate treatment.

Imagine tapping the power of machine learning to classify heartbeats as normal or abnormal. A method that can detect signs of heart disease at the first level of screening could have a significant impact on global health. This tremendous potential is what drew our team to this project.

The first level of screening has become easier with the widespread use of mobile phones: a person can use an app that records their heartbeat through the phone’s microphone. We aim to use these audio recordings of a person’s heartbeat to build a machine learning model that classifies their heartbeats as normal or abnormal. Such early warnings would help an individual get timely treatment before the situation spirals out of control.

Phones have made it much easier than before to get early warnings on heart diseases. Photo from ECHO

The Data

A challenge associated with this project was acquiring good-quality labeled data. Heartbeat recordings from hospitals alone would not suffice: since our project aims to provide a first level of screening for CVDs using heartbeat sounds captured on people’s phones, we wanted our data to reflect that setting as well. Finding heartbeat data recorded this way was our next challenge.

After spending some time online, we discovered Peter Bentley’s ‘Classifying Heart Sounds Challenge’, which provided data in exactly the form we needed. This data came from two sources:

  1. From a clinical trial in hospitals using a digital stethoscope
  2. From the public via the iStethoscope Pro iPhone app

Our dataset consisted of 585 labeled audio files and 247 unlabeled audio files. For this project, we used the labeled audio files to build the machine learning model.

There were five major heart sound classes in our dataset. These include:

  1. Normal — Strong rhythmic heartbeats with a distinct lub-dub auditory pattern
  2. Murmur — A “whooshing, roaring, rumbling or turbulent fluid” noise between the lub and the dub
  3. Extra Heart Sound — Heartbeats that sound as though there is a “galloping” noise
  4. Extrasystole — An out-of-rhythm heart sound, usually due to extra or skipped heartbeats
  5. Artifact — Not a heartbeat at all; characterized by a wide range of different sounds (feedback squeals and echoes, speech, music and noise)

Exploratory Data Analysis

The first part of the project was to understand the data and see whether we could draw any insights about the different heart sounds. Since our goal is to classify heart sounds as normal or abnormal, it was important to check that the classes were distinct enough for a model to separate new recordings into normal and abnormal.

We noticed some interesting patterns in the amplitude waveplots for the different classes, shown below:

  • The first graph, i.e., the normal heartbeat, shows a uniform distribution of amplitudes and consistent spacing between the lubs and dubs of the sound wave
  • The murmur heart sound, by comparison, looks less consistent and has many sound waves between the lubs and dubs. This could be the whooshing noise coming in between the lub and the dub
  • Extrasystole sounds have a much higher amplitude than the others, and there is a lot of irregularity between the sound waves, suggesting the skipped heartbeats that characterize extrasystole
  • The extra heart sound (extrahls) graph is also irregular compared to the normal class. It has a few high-amplitude sound waves, which could be the galloping sounds characteristic of this condition
  • The artifact graph simply shows the noisy data one may encounter from an incorrectly made recording. Since one of the data sources is a phone app, external sounds may have been mistakenly recorded here
Amplitude waveplots for each heart sound class

Now, let’s put all these waveplots together in one figure. We see that extrasystole heartbeat sounds have higher amplitudes than the others, and that all the abnormal classes have an irregular rhythm compared to the normal heart sounds.

Comparing the heart sound classes in one plot

Next, we decided to look at the class distribution. It was quite evident that the class frequencies are skewed: normal heartbeats are the most abundant, both in the number of recordings and in the amount of audio.

Distribution of classes

It became evident to us that we needed to do something to handle this class imbalance. But wait, what is class imbalance and why do we need to handle this?

In simple terms, class imbalance refers to the case where the target variable in a dataset has far more observations in one class than in the others. A model trained on such data is typically inaccurate: because it receives significantly more examples of one class (‘Normal’), it becomes biased towards that class. It does not learn what makes the other classes different and fails to capture the underlying patterns that would allow it to distinguish between classes.

We will discuss our approach to solving this class imbalance under the Modeling section.

Data Augmentation

With just 585 audio files to train our model, we decided to increase the size of our dataset by generating synthetic data. We followed the two most popular data augmentation methods used for audio data:

  • Adding noise
  • Changing pitch and speed

The pictures below show the differences between the original audio file and the synthetic audio files. We can see small differences in the waveforms, which will make our classifier more robust to unseen data, while the general structure of the waveforms is retained, which helps classification.

Waveplots comparing original vs augmented audio
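As a rough sketch of how these two augmentations can be applied with librosa and NumPy (the noise factor, pitch step and stretch rate below are illustrative values, not necessarily the ones we used):

```python
import numpy as np
import librosa

def add_noise(y, noise_factor=0.005):
    """Inject Gaussian noise into the waveform."""
    noise = np.random.randn(len(y))
    return (y + noise_factor * noise).astype(y.dtype)

def shift_pitch_and_speed(y, sr, n_steps=2, rate=1.1):
    """Shift the pitch by n_steps semitones, then stretch the tempo by `rate`."""
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    return librosa.effects.time_stretch(y_shifted, rate=rate)

# Example on one (hypothetical) recording: each original clip yields two synthetic clips
y, sr = librosa.load("heartbeat.wav")
augmented_clips = [add_noise(y), shift_pitch_and_speed(y, sr)]
```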

Feature Engineering & Extraction

Our machine learning models cannot understand the audio data if we feed it in as is. We first need to extract features from the audio files to convert them into a format the models can work with.

You may ask: how in the world can you extract features from a sound wave? Isn’t that too much science? Well, fortunately for us, there are Python libraries built specifically for this purpose. If you ever foray into music and audio analysis, the librosa package should be your go-to. In short, it gives us the features listed below, each capturing different information about the sound wave, and from these we can select the features that ultimately benefit our machine learning model.

You can learn more about librosa library here.

The extracted features include:

  • ZCR (Zero Crossing Rate) — The rate at which a signal transitions from positive to zero to negative, or from negative to zero to positive. It has been used extensively in both speech recognition and music information retrieval for classifying percussive sounds
  • Chroma — Chroma-based features, also referred to as “pitch class profiles”, are a powerful tool for analyzing music whose tuning approximates the equal-tempered scale
  • MFCC (Mel Frequency Cepstral Coefficients) — The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. MFCCs are the coefficients that collectively make up an MFC
  • RMS (Root Mean Square) — A measure of the audio’s loudness
  • Melspectrogram — A spectrogram with frequencies on the mel scale

Thus, our features capture audio information in both the time domain (ZCR, RMS) and the frequency domain (Chroma, MFCC, Melspectrogram).

Flowchart of Feature Extraction
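One plausible way to assemble such a feature vector with librosa is sketched below (the defaults shown are assumptions, not our exact code). Averaging each feature over time gives 1 (ZCR) + 12 (Chroma) + 20 (MFCC) + 1 (RMS) + 128 (Melspectrogram) = 162 values per clip, which matches the feature count reported in the next section.

```python
import numpy as np
import librosa

def extract_features(path):
    """Load one audio file and return a 162-dimensional feature vector."""
    y, sr = librosa.load(path)
    feature_blocks = [
        librosa.feature.zero_crossing_rate(y=y),     # 1 row
        librosa.feature.chroma_stft(y=y, sr=sr),     # 12 rows (pitch classes)
        librosa.feature.mfcc(y=y, sr=sr),            # 20 rows (default n_mfcc)
        librosa.feature.rms(y=y),                    # 1 row
        librosa.feature.melspectrogram(y=y, sr=sr),  # 128 rows (default n_mels)
    ]
    # Average each feature over time and concatenate into one flat vector
    return np.hstack([block.mean(axis=1) for block in feature_blocks])
```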

Data Modelling

After augmenting our dataset with synthetic data and extracting the relevant features, we obtain 1,755 rows and 162 features. This dataset was split into training and test sets at an 80–20 ratio.
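A sketch of this split with scikit-learn (stratifying by class and fixing the random seed are illustrative choices here, not necessarily what we did):

```python
from sklearn.model_selection import train_test_split

# X: (1755, 162) feature matrix, y: heart sound class label for each row
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```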

The following ML models were trained on the training set and their performance evaluated on the test set:

Random Forest Classifier

Random forest is an ensemble learning method that operates by constructing a multitude of decision trees. For a classification task such as ours, the output of the random forest is a majority vote, i.e., the class selected by most trees. We used this as our baseline model.
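A minimal baseline along these lines (default scikit-learn hyperparameters, shown only as an illustration rather than our tuned configuration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Multi-class accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```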

Light Gradient Boosting Machine

Boosting techniques usually outperform bagging techniques (like random forest) because boosting builds a strong classifier from an ensemble of weak classifiers. Hence, we decided to use LightGBM, which is much faster to train than an ordinary GBM.
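A comparable LightGBM baseline via its scikit-learn interface (the settings here are illustrative, not our tuned values):

```python
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(n_estimators=200, learning_rate=0.05, random_state=42)
lgbm.fit(X_train, y_train)
print("Multi-class accuracy:", lgbm.score(X_test, y_test))
```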

CatBoost

CatBoost is another boosting technique we decided to implement. It offers native support for categorical features and makes predictions very quickly.
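And a CatBoost baseline in the same style (all 162 features are numeric here, so no categorical feature indices are passed; the iteration count and learning rate are illustrative):

```python
from catboost import CatBoostClassifier

cat = CatBoostClassifier(iterations=500, learning_rate=0.05,
                         loss_function="MultiClass", verbose=False)
cat.fit(X_train, y_train)
print("Multi-class accuracy:", cat.score(X_test, y_test))
```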

Convolutional Neural Network 1-D

CNNs are a type of deep learning model that performs very well on audio and image data. We trained two CNNs. The first model consists of six ReLU-activated layers, with the summary shown below.

Structure of the first CNN model
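Since the model summary is shown only as an image, here is an illustrative Keras sketch in the same spirit: six ReLU-activated layers operating on the 162-dimensional feature vector treated as a 1-D sequence. The filter counts, kernel sizes and pooling choices are placeholders, not our exact architecture.

```python
from tensorflow.keras import layers, models

cnn1 = models.Sequential([
    layers.Input(shape=(162, 1)),                        # 162 features as a 1-D sequence
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, kernel_size=3, activation="relu"),
    layers.Conv1D(128, kernel_size=3, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(5, activation="softmax"),               # five heart sound classes
])
cnn1.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
             metrics=["accuracy"])
```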

In the second CNN model, we removed all the ReLU activation functions and instead added two ‘tanh’ activation layers, plus a single ReLU activation layer, as a different way of introducing nonlinearity into the model.

Structure of the second CNN model
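An analogous sketch of the second model’s activation scheme, with the ReLU activations replaced by two tanh-activated layers plus a single ReLU layer (layer sizes are again placeholders):

```python
from tensorflow.keras import layers, models

cnn2 = models.Sequential([
    layers.Input(shape=(162, 1)),
    layers.Conv1D(64, kernel_size=5, activation="tanh"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, kernel_size=3, activation="tanh"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(5, activation="softmax"),
])
cnn2.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
             metrics=["accuracy"])
```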

All five models were trained both with and without upsampling of the training dataset. We also explored converting the original multi-class classification problem into a binary classification problem. Let’s discuss these two approaches in more detail.

What is upsampling?

Our approach to handling the class imbalance in our dataset was upsampling. Upsampling is a procedure where synthetically generated data points (corresponding to the minority classes) are injected into the dataset. After upsampling, the counts of all classes are approximately equal.
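A common way to do this is random oversampling of the minority classes; the imbalanced-learn sketch below illustrates the idea rather than our exact procedure:

```python
from imblearn.over_sampling import RandomOverSampler

# Resample minority-class rows of the *training* set (with replacement)
# until every class has as many examples as the largest one
ros = RandomOverSampler(random_state=42)
X_train_up, y_train_up = ros.fit_resample(X_train, y_train)
```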

What’s the purpose of the multi-class to binary class conversion?

In the original dataset, we have five classes, all of which are considerably different from one another. So we wanted to explore how well our models would perform if we instructed them simply to classify heartbeats as normal or abnormal.

Conversion from multi-class to binary class

We make this conversion by taking the multi-class results from each model. All of the ML models we used actually calculate the probability of a heart sound belonging to each class and predict the class with the highest probability. For example, let’s assume that a datapoint has the following multi-class probabilities:

Under multi-class classification, this datapoint will be classified as Normal. When we view it in a binary classification context, we simply add the probabilities of all classes except the normal class and treat the resulting sum as the abnormal class probability. In the case of our example, that gives the following:

Now, this datapoint will be classified as Abnormal. Note that a datapoint originally classified as normal will remain normal only if its multi-class probability for the normal class is higher than 0.5.
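In code, the conversion amounts to summing the predicted probabilities of the abnormal classes. The sketch below assumes the scikit-learn predict_proba convention, and the probability values in it are hypothetical, since the original numbers appear only in the images above:

```python
import numpy as np

def to_binary(proba, normal_idx):
    """Collapse multi-class probabilities into 'normal' vs 'abnormal'."""
    p_normal = proba[:, normal_idx]
    p_abnormal = proba.sum(axis=1) - p_normal   # combined abnormal probability
    return np.where(p_normal > p_abnormal, "normal", "abnormal")

# Hypothetical example: normal has the single highest probability (0.40),
# but the abnormal classes together sum to 0.60, so the binary label flips
proba = np.array([[0.40, 0.25, 0.15, 0.10, 0.10]])   # normal class in column 0
print(to_binary(proba, normal_idx=0))                 # -> ['abnormal']
```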

Let’s briefly discuss some of the pros and cons to this approach.

Pros:

Consider a datapoint whose features resemble a normal heartbeat but which is actually abnormal. Under multi-class classification, its normal-class probability may be higher than that of any individual abnormal class, leading to a misclassification. But when we combine all the abnormal classes, the total abnormal probability can exceed the normal probability, and the datapoint is classified correctly.

Cons:

If the classifier is not confident about the normal class and assigns it only a marginally higher probability than the individual abnormal classes (as in the example above), a truly normal datapoint will be misclassified as abnormal.

Results

The following table summarizes each model’s multi-class accuracy score on the test set.

Multi-Class Accuracy of each model on test set

Here are some observations we noted while implementing these models:

  • Both the Random Forest and LightGBM models became better at separating the normal class from the abnormal classes after being fit on the upsampled training data
  • CatBoost’s performance degraded when trained on the upsampled training data. This might be due to overfitting caused by the enlarged abnormal-class data
  • CNN Model 2 performed much better than CNN Model 1, possibly because of the nonlinearity introduced via the tanh activation layers

Below are the results following the multi-class to binary class conversion.

Binary Class Accuracy of each model on test set
Binary Class F1 scores of each model on test set

Here, we see that:

  • CatBoost classifies a high number of datapoints as abnormal, which increases recall but at a high cost in precision and accuracy
  • CNN Model 1 simply classifies all test datapoints as abnormal, which leads to 100% recall but very low precision, F1 score and binary accuracy. Hence, this model would not be useful if the cost of misclassifying a normal heartbeat as abnormal is high
  • CNN Model 2 achieved the highest F1 score of all the models

As our CNN Model 2 gave the best performance, let’s take a closer look at its confusion matrices.

Multi-class and binary confusion matrices before upsampling
Multi-class and binary confusion matrices after upsampling

For the sake of completeness, we also plotted the ROC-AUC & the Precision-Recall curves for all the trained models.

ROC-AUC curve — before (left) & after upsampling (right)

Here, we can notice that CNN Model 2’s curve is closer to the ideal point (0,1) compared to other curves.

Precision-Recall curve — before (left) & after upsampling (right)

On the upsampled training data, we see that CNN Model 2’s curve maintains a precision of 1 until recall reaches a value slightly above 0.5, and then stays above the other models’ curves.
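For reference, curves like these can be generated per model with scikit-learn, using each model’s predicted abnormal-class probability on the test set (the toy arrays below are placeholders for those predictions):

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

# Placeholder inputs: ground truth (1 = abnormal, 0 = normal) and a model's
# predicted abnormal-class probabilities for the test set
y_binary = np.array([0, 1, 1, 0, 1])
p_abnormal = np.array([0.20, 0.80, 0.60, 0.40, 0.90])

fpr, tpr, _ = roc_curve(y_binary, p_abnormal)
precision, recall, _ = precision_recall_curve(y_binary, p_abnormal)
print("ROC AUC:", auc(fpr, tpr))
```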

Conclusion

Let’s recap what we’ve done so far, shall we?

First, we started this project hoping to harness the power of machine learning to detect the early onset of heart disease. We then discovered a dataset of audio recordings of normal and abnormal heartbeats, into which we injected synthetic data. Next, we extracted relevant features from this augmented dataset to make the data understandable to our ML models.

We trained and implemented five ML models — on multi-class as well as binary classification — before and after upsampling to understand which model would perform best.

LightGBM and CNN 1-D Model 2 performed the best after upsampling the training data, but LightGBM did not perform well before upsampling. Although other models had similar accuracy scores, CNN Model 2 outperformed them on the other measures as well: F1 score, ROC-AUC and the Precision-Recall curves.

Hence, the CNN 1-D Model 2 with the tanh activation functions performed best for this problem.

Next Steps

In the future, we would like to improve the CNN model further and explore other methods of treating class imbalance. We have already seen how adding nonlinearity and upsampling affected model performance, and there is scope to improve on this.

Another avenue worth exploring would be obtaining better data. The dataset used for this project involved heartbeats recorded physically (through either a stethoscope or a mobile phone). As a result, data quality may be affected by factors like improper positioning of the device or noise caused by the device brushing against clothes. We would like to explore this project further with data obtained through non-invasive technologies (like lasers).

If you’d like to take a look at the data and the code, feel free to check it out here.

Let us know your thoughts on this!
