Detecting Parkinson’s Disease with Machine Learning

Keep On Moving · Published in Analytics Vidhya · Sep 2, 2020

In this article, I will build a model that can accurately detect the presence of Parkinson’s disease in humans. Before starting, we will walk through the data science life cycle to understand what should happen.

Steps for solving a business problem with the Data Science Life Cycle:
1. Business Understanding
2. Data Collection
3. Data Preparation
4. Exploratory Data Analysis
5. Modelling
6. Model Evaluation
7. Model Deployment

#1 ~ Business Understanding

Any project, regardless of its size, requires an understanding of the business, which is the basis for effectively solving business problems.

What problem am I trying to solve?

At this stage, I need to define the problem, the project goals, and the solution from a business perspective. This is the first step in solving a problem with data science methods. In this article, the goal is to accurately detect the presence of Parkinson’s disease from a data set of voice measurements.

What is Parkinson’s Disease?

Parkinson’s disease is an age-related neurodegenerative disease of the central nervous system. It is the second most common neurodegenerative disease after Alzheimer’s disease, and it affects the neurons in the brain that produce dopamine. It impairs movement and causes tremor and stiffness. An estimated 7–10 million people worldwide suffer from Parkinson’s disease.

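Because each voice recording is labeled as coming from a person with or without Parkinson’s disease, this is a supervised learning problem, and I will use a classification method.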

#2 ~ Data Collection

Data is the most critical part of the entire machine learning process, so I need to consider the following:

Do I have the data?
Where does the data come from?
Do I trust the data source?
Do I have the domain knowledge?

The data comes from the UCI Machine Learning Repository, hosted by the UCI Donald Bren School of Information & Computer Sciences, where it can be downloaded. The dataset has 24 columns and 195 records.

#3 ~ Data Preparation & #4 ~ Exploratory Data Analysis

Now we will make the necessary imports and load the Parkinson’s disease dataset into a Jupyter notebook.
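The loading step might look like the sketch below. It assumes the download is a comma-separated file with a header row; the file name `parkinsons.data`, the helper `load_dataset`, and the three sample rows (made-up values, only a few of the 24 columns) are illustrative assumptions so that the example runs without the real file.

```python
import io

import pandas as pd

# Stand-in for the real file: three rows with made-up values and only a few
# of the 24 columns, so the example runs without downloading anything.
SAMPLE_CSV = """name,MDVP:Fo(Hz),MDVP:Fhi(Hz),status,PPE
subject_a,120.0,150.0,1,0.28
subject_b,200.0,210.0,0,0.09
subject_c,115.0,140.0,1,0.33
"""


def load_dataset(source):
    """Read comma-separated voice measurements into a DataFrame."""
    return pd.read_csv(source)


# With the downloaded file this would be load_dataset("parkinsons.data");
# that file name is an assumption about how the download was saved.
df = load_dataset(io.StringIO(SAMPLE_CSV))
print(df.head())
```

Passing a path instead of the in-memory buffer is the only change needed to read the real data.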

Now, let’s take a look at the dataset. It contains 195 rows and 24 columns. The dependent variable is status, and the columns from MDVP:Fo(Hz) to PPE are the independent variables. A check for null values shows that there are no missing values in the dataset.
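A minimal sketch of this inspection with pandas follows. The DataFrame here is a synthetic stand-in (random values, only three of the feature columns), so only the workflow carries over, not the numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: 195 rows, an identifier column, three numeric
# features, and the binary target column "status".
df = pd.DataFrame({
    "name": [f"subject_{i}" for i in range(195)],
    "MDVP:Fo(Hz)": rng.normal(150, 40, 195),
    "MDVP:Jitter(%)": rng.normal(0.006, 0.002, 195),
    "PPE": rng.normal(0.2, 0.09, 195),
    "status": rng.integers(0, 2, 195),
})

print(df.shape)                      # (rows, columns)
missing_per_column = df.isnull().sum()
print(missing_per_column)            # count of missing values per column

# Separate the independent variables from the dependent variable.
X = df.drop(columns=["name", "status"])
y = df["status"]
```

Dropping `name` reflects that it is an identifier rather than a measurement.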

The summary statistics list every numeric variable along with its count, mean, standard deviation, minimum, and maximum, and its 25%, 50%, and 75% percentiles.
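Statistics like these are what pandas' `DataFrame.describe()` returns. A short sketch on synthetic stand-in data (random values, two columns only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for two of the numeric columns.
df = pd.DataFrame({
    "MDVP:Fo(Hz)": rng.normal(150, 40, 195),
    "PPE": rng.normal(0.2, 0.09, 195),
})

# One column per variable; rows are count, mean, std, min, 25%, 50%, 75%, max.
summary = df.describe()
print(summary)
```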

#5 ~ Modelling

It is important to standardize the training data and test data because many machine learning models converge much faster when the features are on the same scale. Standardization rescales each feature to a mean of 0 and a standard deviation of 1 using z = (x − μ) / σ, where μ is the feature’s mean and σ is its standard deviation. It changes the scale of the data, not the shape of its distribution. sklearn’s StandardScaler calculates the mean and standard deviation of each feature and applies this formula to every value.
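As a sketch of what the scaler does, the example below standardizes two made-up features on very different scales and compares the result with the formula applied by hand. The data is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two made-up features on very different scales.
X = np.column_stack([
    rng.normal(150, 40, 195),
    rng.normal(0.2, 0.09, 195),
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The same result by hand: z = (x - mean) / std, using the population
# standard deviation (ddof=0), which is what StandardScaler uses.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0).round(6))  # approximately 0 for each feature
print(X_scaled.std(axis=0).round(6))   # approximately 1 for each feature
```

A common refinement is to fit the scaler on the training portion only and reuse it to transform the test portion, so that no test-set statistics influence training.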

The last thing to do before training our models is to split the dataset. I split the data randomly into a training set (80%) and a testing set (20%). We need to do this so that we can estimate the predictive performance of our model on data it has not seen during training.
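An 80/20 random split can be sketched with sklearn's `train_test_split`. The feature matrix below is synthetic (195 rows, 22 features, matching the 24 columns minus the name and status columns), and `random_state=42` is an arbitrary choice made only for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the 195 x 22 feature matrix and the binary target.
X = rng.normal(size=(195, 22))
y = rng.integers(0, 2, size=195)

# Hold out 20% of the rows as unseen test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```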

LogisticRegression

DecisionTreeClassifier

SVC
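The three classifiers named above could be trained as in the following sketch. The data comes from `make_classification` as a stand-in for the scaled voice features, and the hyperparameters (`max_iter=1000`, the `random_state` values, default SVC settings) are assumptions, not settings taken from the original project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the article's feature matrix.
X, y = make_classification(n_samples=195, n_features=22, random_state=0)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters are illustrative assumptions.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "SVC": SVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "trained on", X_train.shape[0], "samples")
```

Keeping the models in a dictionary makes it easy to loop over them again when scoring.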

#6 ~ Model Evaluation

In this step, we will evaluate the performance of the machine learning models. On the test set, the accuracy is 0.79 for Logistic Regression, 0.87 for Decision Tree, and 0.92 for SVC.
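Accuracy is the fraction of test examples that a model labels correctly. The toy example below uses made-up labels to show how it can be computed with sklearn; in the project, `y_pred` would come from `model.predict(X_test)`. The confusion matrix is an optional extra that shows which class the errors fall in.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for ten test examples.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 1])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 1, 1, 1])

# Fraction of predictions that match the true labels: 8 of 10 here.
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)  # 0.8

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[2 1]
           #  [1 6]]
```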

#7 ~ Model Deployment

This is a relatively simple project. In practice, we usually also need several iterations, feature selection, or a comparison with other algorithms. After a model is deployed, you also need to collect feedback on its performance and on its impact on the environment it runs in. By analyzing this information, data scientists can refine the model and improve its accuracy, and thereby its usefulness. Once a satisfactory model has been developed, it is deployed to the production environment.

Thanks for reading! If you enjoyed the post, please show your support by applauding via the clap (👏🏼) button below or by sharing this article so others can find it.

I hope you now have a basic understanding of the Data Science Life Cycle and of how to think at each stage, which can help guide you through a successful data science project. I also hope you have learned how to apply classification techniques such as logistic regression. You can also find the full project on the GitHub repository.
