Predicting Parkinsonian Symptom Severity from Voice Recordings

Bobby Wilt
10 min read · Dec 22, 2021


Links

Dataset source — https://www.nia.nih.gov/health/parkinsons-disease

Project Github & Code — https://github.com/BobbyWilt/PD_Voice_UPDRS

Research Paper — https://www.researchgate.net/publication/40026354_Accurate_Telemonitoring_of_Parkinson%27s_Disease_Progression_by_Noninvasive_Speech_Tests

Introduction

Parkinson’s disease is the second most common neurodegenerative disorder, with an estimated 1% of the global population over 60 years old diagnosed with the condition. It can be caused by genetic or environmental factors that result in the death of key neurons located deep in the brain within a structure called the basal ganglia. Parkinson’s symptoms typically include tremor, slowness of movement, muscle rigidity, and problems with balance. Motor symptoms can also affect the facial muscles, which can reduce the clarity and volume of patients’ speech. Symptom severity also tends to increase as the disease progresses.

To characterize the severity of Parkinsonian symptoms, the Unified Parkinson’s Disease Rating Scale (UPDRS) was created. The test itself consists of 5 segments that test motor and cognitive symptoms of Parkinson’s disease. During the motor portion of the test, patients are instructed to perform a variety of muscle movements while a qualified movement disorder clinician rates the impairment of movement for each task. A total score for the test is obtained by summing the values from each individual task.

In 2009, Dr. Athanasios Tsanas conducted a study investigating how well speech features correlate with motor UPDRS scores. In his study, Dr. Tsanas collected over 5000 speech recordings from 42 Parkinson’s patients over the course of 6 months. Patients were administered UPDRS assessments at 3 points throughout the study (start, middle, end), and motor UPDRS scores were linearly interpolated between these assessments. Dr. Tsanas also implemented auditory processing algorithms to extract the amplitude and consistency of speech from these recordings, with the idea that these speech features might be able to estimate a patient’s motor UPDRS score.
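As a concrete illustration of that interpolation step, here is a small sketch with made-up numbers (not values from the dataset): a recording collected between two UPDRS visits gets a score read off the straight line connecting the visit scores.

import numpy as np

# Hypothetical visit days and motor UPDRS scores for one patient (illustrative only)
visit_days = [0, 90, 180]
visit_scores = [20.0, 23.5, 27.0]

# A recording made on day 40 receives a linearly interpolated score
interpolated_score = np.interp(40, visit_days, visit_scores)
print(round(interpolated_score, 2))  # ~21.56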

Since the UPDRS has to be conducted in-person, this can be problematic for patients who live in remote areas and also for COVID-related safety concerns. If auditory features can consistently and accurately predict motor UPDRS scores, then this type of speech processing could help more Parkinson’s patients obtain a diagnosis or obtain an up-to-date assessment of their condition more easily.

My goal in this project is to develop a regression model that predicts motor UPDRS scores more accurately than Dr. Tsanas’s models. I will use the same dataset but will apply additional machine learning techniques to develop my model.

Data Dictionary:

Pre-Processing:

The dataset can be accessed via the links above, either from my GitHub project page or from the original dataset source. The raw dataset consists of around 5000 observations with 22 features. The “Motor UPDRS” column will be the target for the modeling process. As mentioned in the introduction, the large majority of the motor UPDRS scores are actually calculated: a straight line is fit between the actual UPDRS assessments, and the motor UPDRS score for each row is simply placed on that line. To help prevent the model from tracking this predictable linear trend, patient identifiers such as the “subject#” (participant ID) and “age” columns were dropped. Additionally, the “total UPDRS” column was dropped since it contains the motor UPDRS score as one of its components.

The shimmer and jitter columns containing auditory recording features had strongly right-skewed distributions. These columns were log-transformed to give the model approximately normal distributions.

The test time column stores the number of days as a floating-point value. The decimal portion of each value was used to calculate the time of day at which the voice recording was collected, and this was stored in a new column titled “hour_of_day”.

# Imports used by this snippet
import numpy as np
import pandas as pd

# Create wrangle function to run preprocessing on the raw dataset
def wrangle(url):
    df = pd.read_csv(url)

    # Drop data-leakage columns: total_UPDRS contains the motor UPDRS score,
    # and subject# / age tie each row to a specific patient whose motor_UPDRS
    # values were linearly interpolated, so keeping them would leak the target.
    df.drop(columns=['total_UPDRS', 'subject#', 'age'], inplace=True)

    # Log-transform the right-skewed jitter/shimmer/NHR columns toward a normal distribution
    skew_cols = ['Jitter(%)', 'Jitter(Abs)', 'Jitter:RAP', 'Jitter:PPQ5', 'Jitter:DDP',
                 'Shimmer', 'Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
                 'Shimmer:APQ11', 'Shimmer:DDA', 'NHR']
    df[skew_cols] = np.log(df[skew_cols])

    # Take the absolute value of test_time since there are some negative values
    df.test_time = df.test_time.abs()

    # Create an hour-of-day column from the fractional part of test_time
    hr_list = []
    for item in df.test_time.astype(str):
        hr = float('.' + item.split('.')[-1]) * 24
        hr_list.append(hr)
    df['hour_of_day'] = hr_list

    return df

# Raw dataset hosted on the project GitHub repository
url = 'https://raw.githubusercontent.com/BobbyWilt/PD_Voice_UPDRS/main/parkinsons_updrs.csv'
df = wrangle(url)
df.head()

The wrangle function uses pandas to read the raw CSV into a dataframe and applies the preprocessing steps above. Here is a sample of the resulting dataframe.

Data Splitting:

To split the data, I first specified the target column and separated the dataset into X and y variables. The data was then split into training, validation, and testing sets: 80% for training, 10% for validation, and 10% for testing.

from sklearn.model_selection import train_test_split

# Define X and y
target = ['motor_UPDRS']
X = df.drop(columns=target)
y = df[target]

# 80/10/10 train/validation/test split
X_train, X_mod, y_train, y_mod = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_mod, y_mod, test_size=0.5, random_state=42)

Initial Modeling:

In Dr. Tsanas’s paper, he implemented linear regression, least squares regression, lasso, and CART regression tree models. For my models, I will implement a linear regression, a lasso, and a decision tree regressor to compare against Tsanas’s models’ performances. Additionally, I employed a random forest, sklearn’s gradient boosting regressor, and the XGBoost regressor. These models were first instantiated with their default parameters to help identify which one might be best suited for this problem.

Here is the code I used to initiate the default versions of the models:

# Imports assumed for the modeling pipelines
from category_encoders import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

# Linear Regression Model
model_lr = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    StandardScaler(),
    LinearRegression()
)
model_lr.fit(X_train, np.ravel(y_train))

# Lasso Model
model_l = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    StandardScaler(),
    Lasso()
)
model_l.fit(X_train, np.ravel(y_train))

# Decision Tree Regressor Model
model_t = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    StandardScaler(),
    DecisionTreeRegressor()
)
model_t.fit(X_train, np.ravel(y_train))

# Random Forest Regressor Model
model_rf = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    StandardScaler(),
    RandomForestRegressor()
)
model_rf.fit(X_train, np.ravel(y_train))

# Gradient Boosting Regressor Model
model_br = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    StandardScaler(),
    GradientBoostingRegressor(random_state=42, n_estimators=75)
)
model_br.fit(X_train, np.ravel(y_train))

# XGB Regressor Model
model_xgb = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    XGBRegressor(random_state=42, n_estimators=75, n_jobs=-1)
)
model_xgb.fit(X_train, np.ravel(y_train));

To compare all of the model performances, I created a function that calculates the mean absolute error, R-squared, and root mean squared error for each model and places these scores in a dataframe. We will focus on the mean absolute error (MAE) metric so that our results can be compared directly with Dr. Tsanas’s reported metrics. For the baseline comparison, the mean motor UPDRS value was calculated from the training targets (y_train) and used as a constant prediction.
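The evaluation helper itself isn’t reproduced in this post; below is a minimal sketch of what such a function could look like. The function name, the dictionary of models, and the dataframe layout are my own illustrative choices rather than the original code.

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative evaluation helper (not the original code): score each fitted model
def evaluate_models(models, X, y):
    rows = []
    for name, model in models.items():
        pred = model.predict(X)
        rows.append({'model': name,
                     'MAE': mean_absolute_error(y, pred),
                     'R2': r2_score(y, pred),
                     'RMSE': np.sqrt(mean_squared_error(y, pred))})
    return pd.DataFrame(rows).set_index('model')

# Baseline: predict the mean training motor UPDRS for every validation observation
baseline_pred = np.full(len(y_val), y_train.values.mean())
print('Baseline MAE:', mean_absolute_error(y_val, baseline_pred))

# Score the default models on the validation split
scores_val = evaluate_models({'linear': model_lr, 'lasso': model_l, 'tree': model_t,
                              'forest': model_rf, 'gboost': model_br, 'xgb': model_xgb},
                             X_val, y_val)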

Here is the dataframe output for the models. Our linear regression and lasso models have validation MAE values similar to Dr. Tsanas’s. Our decision tree regressor, however, has a much smaller MAE than the CART tree model used in Dr. Tsanas’s study.

To visualize the initial scores, a seaborn barplot was used to plot all models and metrics. At this point, the random forest model produces the smallest MAE on the validation data. The linear regression and lasso models both have MAE values of around 6, while the baseline MAE is closer to 7.
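The plotting code isn’t shown in the post; a minimal sketch of this kind of seaborn barplot, assuming the scores_val dataframe from the evaluation sketch above, could look like this:

import matplotlib.pyplot as plt
import seaborn as sns

# Reshape the metrics dataframe to long format so seaborn can group bars by metric
plot_df = scores_val.reset_index().melt(id_vars='model', var_name='metric', value_name='score')

# One bar per model, grouped by metric
sns.barplot(data=plot_df, x='metric', y='score', hue='model')
plt.title('Initial model performance on the validation set')
plt.show()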

Model Tuning

Now that we have tried the model defaults, let’s tune some hyperparameters for each of these models. For the sake of time, sklearn’s RandomizedSearchCV was used for the more computationally demanding models, while GridSearchCV was used for the simpler ones.

Lasso Regression Tuning

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error

param_grid = {'lasso__alpha': [0.1, 0.3, 0.5, 0.7, 0.9, 1, 1.5, 2, 2.5]}

# Lasso Model
model_l_t = GridSearchCV(
    model_l,
    param_grid=param_grid,
    n_jobs=-1,
    cv=5,
    verbose=0
)
model_l_t.fit(X_train, np.ravel(y_train))
model_l_t_params = model_l_t.best_params_
print('MAE Score:', mean_absolute_error(y_val, model_l_t.predict(X_val)))

MAE Score: 6.377

Decision Tree Tuning

param_grid = {'decisiontreeregressor__max_depth': [1, 3, 5, 10, 15, 20, 25, None],
              'decisiontreeregressor__max_features': [1, 2, 4, 5, 10, None],
              'decisiontreeregressor__min_samples_leaf': [1, 2, 4, 5, 10],
              }

# Decision Tree Model
model_t_t = GridSearchCV(
    model_t,
    param_grid=param_grid,
    n_jobs=-1,
    cv=5,
    verbose=0
)
model_t_t.fit(X_train, np.ravel(y_train))
model_t_t_params = model_t_t.best_params_
print('MAE Score:', mean_absolute_error(y_val, model_t_t.predict(X_val)))

MAE Score: 4.307

Random Forest Regression Tuning

param_grid = {
    'randomforestregressor__max_depth': [5, 10, 15, 20, 25],
    'randomforestregressor__n_estimators': [50, 100, 200],
    'randomforestregressor__max_samples': [0.1, 0.3, 0.5, 0.7, 0.9]
}

# Random Forest Model
model_rf_t = RandomizedSearchCV(
    model_rf,
    param_distributions=param_grid,
    n_jobs=-1,
    cv=5,
    verbose=0,
    n_iter=30
)
model_rf_t.fit(X_train, np.ravel(y_train))
model_rf_t_params = model_rf_t.best_params_
print('MAE Score:', mean_absolute_error(y_val, model_rf_t.predict(X_val)))

MAE Score: 3.56

Gradient Boosting Regressor

param_grid = {
    'gradientboostingregressor__max_depth': [5, 10, 15, 20, 25],
    'gradientboostingregressor__n_estimators': [50, 100, 200],
    'gradientboostingregressor__learning_rate': [0.1, 0.3, 0.5, 0.7, 1, 1.5, 2]
}

# Gradient Boosting Model
model_br_t = RandomizedSearchCV(
    model_br,
    param_distributions=param_grid,
    n_jobs=-1,
    cv=5,
    verbose=0,
    n_iter=30
)
model_br_t.fit(X_train, np.ravel(y_train))
model_br_t_params = model_br_t.best_params_
print('MAE Score:', mean_absolute_error(y_val, model_br_t.predict(X_val)))

MAE Score: 3.121

XGB Regressor

param_grid = {
    'xgbregressor__max_depth': [5, 10, 15, 20, 25],
    'xgbregressor__n_estimators': [50, 100, 200],
    'xgbregressor__learning_rate': [0.1, 0.3, 0.5, 0.7, 1, 1.5, 2]
}

# XGB Model
model_xgb_t = RandomizedSearchCV(
    model_xgb,
    param_distributions=param_grid,
    n_jobs=-1,
    cv=5,
    verbose=0,
    n_iter=30
)
model_xgb_t.fit(X_train, np.ravel(y_train))
model_xgb_t_params = model_xgb_t.best_params_
print('MAE Score:', mean_absolute_error(y_val, model_xgb_t.predict(X_val)))

MAE Score: 3.602

Below are the evaluation dataframes for the initial versus the tuned models. The baseline and linear regression models did not change since they have nothing to tune. The lasso model showed a solid improvement in MAE, from 6.9 to 6.3. Interestingly, the decision tree and XGBoost regressor models actually had higher MAE values after tuning, which could be due to the random selection of parameters in RandomizedSearchCV and/or a need for additional parameter tuning. The gradient boosting regressor did improve and achieved the lowest overall validation MAE, with a score of 3.12.

Plotting each metric for the tuned models makes it clear that the gradient boosting regressor slightly outperforms the other models on the MAE and R-squared metrics.

To further tune our gradient boosting regressor model, we’ll add in a few additional hyperparameters and increase the number of estimators to further lower our MAE score.

# Define our final model pipeline using the model parameters from above
model_final = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    GradientBoostingRegressor(random_state=42,
                              subsample=1.0,
                              n_estimators=1000,
                              min_samples_split=15,
                              min_samples_leaf=10,
                              max_depth=8,
                              learning_rate=0.1,
                              max_features=None,
                              )
)
model_final.fit(X_train, np.ravel(y_train));
print('MAE Score:', mean_absolute_error(y_val, model_final.predict(X_val)))

Model Performance:

Now that we have fully tuned our best performing model, it would be interesting to see which features had the largest impact on its predictions.
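The feature-importance code isn’t included in the post; a minimal sketch of one way to pull the importances out of the fitted pipeline could look like the following. It assumes the model_final pipeline from the previous block, with step names generated by make_pipeline.

import matplotlib.pyplot as plt

# Grab the encoder and regressor steps out of the fitted pipeline
encoder = model_final.named_steps['onehotencoder']
regressor = model_final.named_steps['gradientboostingregressor']

# Column names after encoding (the dataset is numeric, so these match the input columns)
feature_names = encoder.transform(X_train).columns

# Sort and plot the impurity-based importances as a horizontal bar chart
importances = pd.Series(regressor.feature_importances_, index=feature_names).sort_values()
importances.plot(kind='barh', figsize=(8, 10), title='Gradient boosting feature importances')
plt.show()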

Looking through the most important features, our engineered “hour_of_day” feature actually had the largest impact on the model output. This is interesting because Parkinson’s patients do have daily fluctuations in their symptoms depending on the time of day, yet the interpolated motor UPDRS scores do not actually capture the time of day of each recording. Among the audio features, DFA (detrended fluctuation analysis), which describes the consistency of speech, was the most important. HNR (harmonics-to-noise ratio), a signal-to-noise feature reflecting the quality of the recording, came second among the audio features.

Now that we have seen the feature importances, let’s use sklearn’s plot_partial_dependence function to visualize how each feature independently impacts the predicted motor UPDRS score.

# plot_partial_dependence lives in sklearn.inspection in the scikit-learn version used here
# (newer releases replace it with PartialDependenceDisplay.from_estimator)
from sklearn.inspection import plot_partial_dependence
import matplotlib.pyplot as plt

# Transform X_train data outside of the pipeline for the partial dependence chart
ohe = OneHotEncoder(use_cat_names=True).fit(X_train)
ohe_dat = ohe.transform(X_train)
sc = StandardScaler().fit(ohe_dat)
sc_dat = sc.transform(ohe_dat)
X_trans = pd.DataFrame(sc_dat, columns=ohe_dat.columns)

# Define our final model outside of a pipeline for the partial dependence plot
model_final_st = GradientBoostingRegressor(random_state=42,
                                           subsample=1.0,
                                           n_estimators=1000,
                                           min_samples_split=15,
                                           min_samples_leaf=10,
                                           max_depth=8,
                                           learning_rate=0.1,
                                           max_features=None,
                                           )
model_final_st.fit(X_trans, np.ravel(y_train))

# Add a dummy attribute to facilitate the partial dependence chart
model_final_st.dummy_ = "dummy"

# Run the partial dependence plot
features = ohe_dat.columns
plot_partial_dependence(model_final_st, X=X_trans, features=features,
                        n_jobs=-1, grid_resolution=20)

# Adjust figure properties
fig = plt.gcf()
fig.set_size_inches(20, 20)
fig.subplots_adjust(hspace=0.3)
fig.suptitle('Gradient Boosting Regressor Feature Partial Dependence on Motor UPDRS Score', fontsize=20);
fig.tight_layout()
fig.subplots_adjust(top=0.95)

Please note that the x-axis values have been normalized by the standard scaler. Connecting back to the feature importances, the hour_of_day feature has a highly variable impact on the target depending on what time of day the recording was captured. Also, as the HNR feature increases, the predicted motor UPDRS score decreases, which seems a bit odd if higher-quality recordings are driving lower target values. It does make sense that the motor UPDRS score decreases as the DFA feature increases, since Parkinson’s can impair a person’s ability to speak normally.

Finally, let’s run the model on the test dataset to see how it performs. Below is a dataframe of each model evaluation metric along with a bar chart visualization of the results.
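The test-set scoring code isn’t shown in the post; a minimal sketch using the final pipeline could look like this (it reuses model_final and the standard sklearn metrics):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Score the final model on the held-out test split
test_pred = model_final.predict(X_test)
print('Test MAE :', mean_absolute_error(y_test, test_pred))
print('Test R2  :', r2_score(y_test, test_pred))
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, test_pred)))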

Our model performs reasonably well on the test dataset, but its performance is slightly worse than on the validation dataset across all three metrics. It is also quite clear that the model is heavily overfit to the training dataset, so some additional model bias (regularization) could be introduced to reduce the training overfit and potentially improve the validation and test scores.

Conclusion

In this project we applied several different types of regression models to predict the motor severity scores of Parkinson’s patients using time and speech traits as features. Initial modeling without hyperparameter optimization produced fairly low MAE values for the tree-based models, and these models alone were an improvement upon the models presented in Dr. Tsanas’s study. Hyperparameter tuning led to further improvements in the lasso, decision tree, and sklearn gradient boosting regressor models. Upon further examination of the model features, we learned that the time of testing, voice consistency, and audio recording quality played a large role in the model’s predictions. Running the model on the test dataset produced MAE, R2, and RMSE values similar to those from the validation data, indicating that the model generalizes reasonably well to new data. Implementing and tuning these machine learning techniques resulted in an MAE of roughly 3, compared with an MAE of 5.8 for Dr. Tsanas’s best performing model, CART.
