The Prediction of Parkinson’s Disease Using Machine Learning

Ahmed Sayed
9 min read · Jul 3, 2019
Two men suffering from Parkinson’s disease (image from Wikipedia)

Parkinson’s disease is one of the diseases that currently have no cure, so the best strategy is to prevent its progression and to find out early whether a person is likely to be affected, which can be done through experiments on that person’s voice recordings. We used a dataset of 188 patients in which Mel Frequency Cepstral Coefficients (MFCCs), Wavelet Transform based features, Vocal Fold features, and TQWT (tunable Q-factor wavelet transform) features have been extracted from the speech recordings of Parkinson’s disease (PD) patients to obtain clinically useful information for PD assessment [1]. Approximately 90% of people with Parkinson’s exhibit changes in their voice or in the energy they use to produce speech sounds [1]. So, in recent years there has been growing interest in detecting Parkinson’s disease in patients using their speech samples. In this project we use the dataset from http://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification#, which contains central tendency and dispersion metrics for all features, in order to summarize the data and to reduce the disparities between the different voice samples of a subject. Popular machine learning classification algorithms such as Logistic Regression, Decision Tree, Random Forest, Neural Network, and K-Nearest Neighbors are applied to the processed dataset (and to the dataset created by the authors) to classify each patient as Parkinson positive or Parkinson negative. We then measure the success of each model by its ability to correctly classify the cases into one of these two groups. We observed that Random Forest performs best on the dataset.

Parkinson’s is a disease that causes a partial or complete loss of motor reflexes, speech, behavior, mental processing, and other vital functions [2]. It affects seven to ten million people worldwide, most of them over the age of 60 (Parkinson’s Disease Foundation). People with Parkinson’s suffer from speech impairments such as dysphonia (defective use of the voice), hypophonia (reduced volume), monotone (reduced pitch range), and dysarthria (difficulty articulating sounds or syllables). Since speech impairment appears in 70–90% of patients after the onset of the illness, we use speech in this project to determine whether a person has Parkinson’s or not. In addition, it may be one of the earliest signs of the disease, and 29% of patients consider it one of their greatest obstacles. The principal reason for investigating PD through speech impairments is that telediagnosis and telemonitoring systems based on speech signals are low in cost and easy to self-use. Such systems reduce the burden and cost of patients’ physical visits to clinics and support early diagnosis of the condition. Although medication and therapeutic intervention can slow the progression of the disease and mitigate some of the symptoms, there is no known cure. Thus, early diagnosis is critical in order to enhance and maintain the patient’s quality of life. So, we decided to help these patients through early prediction of whether they have this harmful disease. We used predictive analysis to achieve this, combining supervised machine learning techniques to make the prediction.

Now, we will get to the most interesting part: coding and building the model that will classify the cases we have.

First, we import the needed libraries
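The original code appears only as screenshots, so here is a minimal sketch of the imports assumed throughout this post (a standard pandas, seaborn, and scikit-learn stack):

```python
# Assumed imports for the analysis below (pandas / seaborn / scikit-learn stack)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, recall_score
```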

Code for reading the data
A snapshot of the dataset
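Since the reading code is shown only as an image, here is a hedged sketch of loading the UCI file; the file name `pd_speech_features.csv` is an assumption about the author's setup:

```python
# Read the UCI Parkinson's speech dataset (file name is an assumption)
df = pd.read_csv("pd_speech_features.csv")

# Quick look at the data, as in the screenshot above
print(df.shape)   # the UCI version has 756 recordings and ~750 feature columns
print(df.head())
```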

Next, we checked for missing values in the data, and as our hypothesis states, the data is free of missing values, as we can see.
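A check like the following (a sketch, not necessarily the author's exact code) confirms this:

```python
# Count missing values per column and report the total
missing_per_column = df.isnull().sum()
print("Total missing values:", missing_per_column.sum())   # expected: 0
```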

Here we want to drop the highly correlated features to avoid overfitting the model.

Here we can see the correlation matrix showing the correlation between a subset of the features. We can see a high negative correlation between the patient’s gender and the numPulses of the voice, and a high negative correlation between gender and the two features f1 and f2. We can also notice a high negative correlation between the pair DFA and f1, and between RPDE and numPulses.

Correlation Matrix
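One common way to do this, and a reasonable guess at what the screenshot shows, is to compute the absolute correlation matrix and drop one feature from every pair whose correlation exceeds a threshold; the 0.95 cutoff and the column names (taken from the article and the UCI file) are assumptions:

```python
# Correlation matrix of the features (excluding id and the target column "class")
features = df.drop(columns=["id", "class"], errors="ignore")
corr = features.corr().abs()

# Heatmap of a subset of features, similar to the figure above
subset = ["gender", "numPulses", "DFA", "RPDE", "PPE", "f1", "f2"]
sns.heatmap(df[subset].corr(), annot=True, cmap="coolwarm")
plt.show()

# Drop one feature from each highly correlated pair (threshold is an assumption)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)
print(f"Dropped {len(to_drop)} highly correlated features")
```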

The target variable is imbalanced, which means the number of people labeled as patients is larger than the number of people labeled as non-patients. This will surely cause problems in the modeling phase, and the model will be biased towards the majority class, which is 1 (the patients). In the figure below we can see how imbalanced the classes of the data are.

Class distribution

So, we now have three options. The first option is to oversample the minority class, which means copying instances from the minority class and adding them again. The second is to undersample the majority class, which means keeping only a sample of the majority class equal in size to the minority class. The final option is to give a weight to each class. We chose the oversampling method to increase the number of data points we have and solve the imbalance problem; one way to do this is sketched after the figure below, which shows the class distribution after applying oversampling.

The distribution of the classes after oversampling the minor class
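The author's exact oversampling code is not shown; a simple sketch using scikit-learn's `resample` (sampling with replacement from the minority class until the classes are balanced) could look like this:

```python
from sklearn.utils import resample

majority = df[df["class"] == 1]   # patients (majority class)
minority = df[df["class"] == 0]   # non-patients (minority class)

# Duplicate minority-class rows until both classes have the same size
minority_upsampled = resample(
    minority,
    replace=True,                  # sample with replacement (copies instances)
    n_samples=len(majority),       # match the majority class size
    random_state=42,
)

df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced["class"].value_counts())
```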

Then, we will look at the distributions of some of the numerical variables. As we can see, the three features DFA, RPDE, and numPulses are normally distributed, which means the values of these features tend to lie close to their mean. The other three features, PPE, stdDevPeriodPulses, and minIntensity, are skewed: PPE and minIntensity are skewed to the left, and stdDevPeriodPulses is skewed to the left as well. This supports our hypothesis that most of the features follow a normal distribution and that there is no need for any kind of normalization or transformation of the data.

The distribution of some of the numerical features
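Histograms like those in the figure can be produced directly with pandas; the feature list follows the text above, and the column spellings (from the UCI file) may need adjusting:

```python
# Plot the distributions of a few numerical features
cols = ["DFA", "RPDE", "numPulses", "PPE", "stdDevPeriodPulses", "minIntensity"]
df_balanced[cols].hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()
```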

To understand the data better, we need to investigate feature importance. So, we used the Random Forest algorithm to get the most important features in the data. We did this after removing all the features that were highly correlated with each other, which cause redundancy in the data. We can now check the best features in the figure below. As we can see, tqwt_energy_dec_27, tqwt_medianValue_dec_16, and minIntensity are the most important features in the data.

The most important features in the data
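A sketch of how the importances can be extracted with a Random Forest (the hyperparameters are assumptions):

```python
X = df_balanced.drop(columns=["id", "class"], errors="ignore")
y = df_balanced["class"]

# Fit a Random Forest purely to rank the features
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Plot the 15 most important features
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.nlargest(15).plot(kind="barh")
plt.xlabel("Importance")
plt.show()
```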

Having identified the most important features in the data, we now want to examine the relation between these features and our target variable, and we will also try to discover the outliers in these features. We start with the relation between the target variable and tqwt_energy_dec_27; our hypothesis is that there is a clear correlation between the target variable and all of the important variables. As we can see in the figure below, the patients have higher tqwt_energy_dec_27 values than the non-patients.

The Relation between the target variable and tqwt_energy_dec_27

In the figure below we can also notice that there are some outliers in both class 1 (patient) and class 0 (non-patient). This figure shows the relation between the target variable and the f2 feature; as we can see, the non-patients tend to have higher f2 values, and again there are outliers in both classes.

The Relation between the target variable and f2

In the same way, we can find the relation between the classes of the target variable and each numerical vocal feature we have, and check whether our hypothesis holds for all the features. This can be checked in the code, for example with a boxplot like the one sketched below.
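A boxplot of one of the important features against the class shows both the class-wise difference and the outliers; this is a sketch, and any of the numerical features can be substituted for the ones shown:

```python
# Boxplot of an important feature against the target class (0 = non-patient, 1 = patient)
sns.boxplot(x="class", y="tqwt_energy_dec_27", data=df_balanced)
plt.show()

# The same call works for f2 or any other numerical feature
sns.boxplot(x="class", y="f2", data=df_balanced)
plt.show()
```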

Now, we will get to the other part of the results, which concerns the models and algorithms used to predict Parkinson’s disease. We split the data three times, each time with different training and testing sizes. In the first split, 80% of the data is used for training and 20% for testing; in the second, 60% for training and 40% for testing; and in the third, 50% for training and 50% for testing.
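Each split can be produced with `train_test_split`; for example, the 80/20 split might look like this (stratifying on the class is an assumption, used here to keep the class ratio consistent):

```python
# 80% training / 20% testing split; change test_size to 0.4 or 0.5 for the other two splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```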

1. 80% Training and 20% Testing

Here are the results we got for each algorithm using cross-validation with 10 folds, reported as the mean f1-score with the standard deviation in parentheses; a sketch of the evaluation loop follows the scores below.

Neural Network: 0.530753 (0.067665)

Naive Bayes: 0.661720 (0.080249)

K-Nearest Neighbor: 0.578817 (0.092935)

Decision Tree: 0.706989 (0.081376)

Random Forest: 0.755484 (0.071165)
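A sketch of how such a comparison is typically run with 10-fold cross-validation on the training set; the model hyperparameters and the f1 scoring choice are assumptions consistent with the text:

```python
models = {
    "Neural Network": MLPClassifier(max_iter=1000, random_state=42),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbor": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# 10-fold stratified cross-validation, scored with the f1-score
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring="f1")
    print(f"{name}: {scores.mean():.6f} ({scores.std():.6f})")
```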

As we can see from the figure below and the results above, Random Forest has the highest mean and maximum CV score, which means it is the best performing model with the highest f1-score, while the lowest performing model is the Neural Network, with the lowest f1-score.

Model Comparison

As we could see in the above results, we tried 5 algorithms with three different training and testing sizes. Each time we found that the Random Forest classifier performs best, with an average f1-score of 0.75 across the three splits. This can be considered a good classifier given that the number of data points is somewhat small and there are many features, although many of those features were highly correlated with each other, and thus redundant, so we dropped them. Overall, the Random Forest and Decision Tree classifiers can be considered reasonably good classifiers that are able to predict whether a person has Parkinson’s disease or not. In contrast, the Neural Network is the lowest performing classifier, with an average f1-score of 0.63 across the three splits. A likely reason for the Neural Network’s low performance is that the model is too complex for the amount of data we have, so a simpler model handles this problem better. Also, in the confusion matrix of the Random Forest classifier, almost 23 patients were mistakenly classified as non-patients, which is a serious hazard because our target is the early detection and prediction of Parkinson’s disease; this means we care about the recall value more than the precision value. So, we can say that our model is good but needs more refinement to increase the recall value and avoid misclassifying patients.
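A sketch of how the confusion matrix and recall can be checked on the held-out test set for the Random Forest, assuming the split and variables defined above:

```python
# Fit the best model and inspect the confusion matrix and recall on the test set
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print(confusion_matrix(y_test, y_pred))          # rows: true class, columns: predicted class
print("Recall:", recall_score(y_test, y_pred))   # fraction of true patients correctly detected
```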

References

[1] Sakar, C.O., Serbes, G., Gunduz, A., Tunc, H.C., Nizam, H., Sakar, B.E., Tutuncu, M., Aydin, T., Isenkul, M.E. and Apaydin, H., 2018. A comparative analysis of speech signal processing algorithms for Parkinson’s disease classification and the use of the tunable Q-factor wavelet transform. Applied Soft Computing.

[2] B. Harel, M. Cannizzaro, and P. J. Snyder, “Variability in fundamental frequency during speech in prodromal and incipient Parkinson’s disease: A longitudinal case study,” Brain and Cognition, vol. 56, no. 1, pp. 24–29, 2004.
