Model selection and parameter tuning

Matthew E. Parker
Published in Analytics Vidhya · Aug 25, 2019

When undertaking classification (or regression) tasks, one of the most important steps in the data science workflow is the selection of the best model algorithm for your data set. Assuming that the data set has been sufficiently cleaned, multicollinearity has been reduced (or avoided via Principal Component Analysis), and all other exploratory data analysis (EDA) tasks have been completed, one can begin the process of modeling.

There is no magic trick to picking the right model; the choice depends wholly on the data set itself. After settling on a scoring metric (or metrics) to rely upon, a good first step to save time and avoid aimless wandering is to fit numerous classifiers in a for loop using their default parameters. For example:

# for acquiring and managing datasets
import pandas as pd
import numpy as np
np.random.seed(42)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

# for modeling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import metrics

# classifier modeling methods
import xgboost
from xgboost.sklearn import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# split the cleaned data into training and hold-out testing sets
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=42)

# fit each classifier with default parameters and record its test score
classifiers = [KNeighborsClassifier, SVC, DecisionTreeClassifier,
               RandomForestClassifier, AdaBoostClassifier,
               XGBClassifier, LogisticRegression, GaussianNB]
classy_scores = []
for classifier in classifiers:
    clf = classifier()
    clf.fit(X_train, y_train.values.flatten())
    score = clf.score(X_test, y_test.values.flatten())
    classy_scores.append((classifier.__name__, score))

# outputs all classifiers, ranked by the testing metric
ranked_classifiers = sorted(classy_scores, key=lambda x: x[1], reverse=True)
ranked_classifiers

Though not precise, such an approach will at least let you see the relative efficacy of each algorithm, allowing you to abandon those with noticeably lower scores. (The classifiers shown above are only examples and are by no means an exhaustive list of possibilities.)

Once a model (or two, or three) has been selected, it is time to begin the process of parameter tuning. This can be a lengthy endeavor, especially in terms of run time, so one must be smart about how the process is approached. One of the best ways to tune hyperparameters is with sklearn's GridSearchCV. With grid search, one simply supplies a dictionary mapping parameter names to lists of candidate values, and the machine runs through every possible combination, cross-validating each one.

NOTE: this is where it is easy to lose hours. Rather than testing a large number of values for every possible parameter all at once, it is much more time-efficient to run the search in smaller batches over multiple steps. I recommend testing at most five candidate values for two or three parameters at a time, preferably parameters that are closely related to one another. After the first round, narrow the spread of values for those same parameters ([1, 2, 3] becomes [1.5, 2.0, 2.5] the second time through). Or, if the best result sits at the upper or lower end of the tested list, extend the list in that direction (if [1, 2, 3] returns '3', the next iteration tests [3, 4, 5]); see the sketch below.
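As a concrete, purely illustrative sketch of that narrowing process, here are two rounds over the same pair of XGBoost parameters. It reuses the imports and the X_train / y_train split from the first code block; the parameter names and values are only examples, not a recommendation:

# round 1: a coarse grid over two related parameters
gs_round1 = GridSearchCV(XGBClassifier(seed=42),
                         param_grid={'max_depth': [3, 5, 7],
                                     'gamma': [0.0, 0.1, 0.2]},
                         scoring='accuracy', cv=5)
gs_round1.fit(X_train, y_train.values.flatten())
print(gs_round1.best_params_)  # inspect the winners before building round 2

# round 2: narrow the spread around whichever values round 1 preferred,
# e.g. around max_depth=5 and gamma=0.1
gs_round2 = GridSearchCV(XGBClassifier(seed=42),
                         param_grid={'max_depth': [4, 5, 6],
                                     'gamma': [0.05, 0.1, 0.15]},
                         scoring='accuracy', cv=5)
gs_round2.fit(X_train, y_train.values.flatten())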

Once you have your optimal values (don't get carried away narrowing them to numerous significant digits), proceed to the next batch of (hopefully related) parameters to test. This can take a while and may be a bit tedious. My best advice is to read the documentation for the algorithm carefully to determine which parameters make the most sense to test together and which values make the most sense to test. For instance, if you are testing sklearn's SVC algorithm and wish to compare different kernels (e.g. 'linear', 'rbf', 'poly'…), you cannot simultaneously test other parameters that are kernel-specific (e.g. 'degree' only applies to the 'poly' kernel) without supplying separate parameter dictionaries; this is possible (see the sketch below) but will dramatically increase run time. Instead, just compare the kernels themselves first, then further tune the one that seems to work best.
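If you do decide to search kernel-specific parameters, one way to express it (rather than one giant grid) is to pass GridSearchCV a list of parameter dictionaries, each of which is searched independently. A minimal sketch, again reusing the earlier imports and train/test split; the C, degree, and kernel values are illustrative only:

# each dictionary in the list is searched on its own, so 'degree'
# is only ever combined with the 'poly' kernel
svc_param_grid = [
    {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]},
    {'kernel': ['poly'], 'degree': [2, 3, 4], 'C': [0.1, 1, 10]}
]
svc_search = GridSearchCV(SVC(), param_grid=svc_param_grid,
                          scoring='accuracy', cv=5)
svc_search.fit(X_train, y_train.values.flatten())
print(svc_search.best_params_)

Even split up this way, the 'poly' dictionary alone adds nine combinations, which is why comparing the kernels first and tuning only the winner is usually the cheaper route.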

As an example, I recently ran a classification task that used accuracy as the desired scoring metric. To save time (both run time and time spent writing duplicated code), I wrote several functions to automate the process of grid-searching parameter options, finding the combination that produced the best test accuracy (balanced against over-fitting), and then carrying those parameters forward into the next round of tuning (of different parameters). It was important that the process use the hold-out testing data for validation and that it also balance against over-fitting (something which commonly arises when tuning hyperparameters).

To accomplish this balance, I created a metric that is the weighted harmonic mean of the testing accuracy and one minus the spread between training and testing accuracy (my measure of model over-fitting): harmonic = 3 / (2/test_accuracy + 1/(1 - |train_accuracy - test_accuracy|)), so the testing accuracy is weighted twice as heavily as the over-fitting penalty. Using this harmonic mean instead of the testing accuracy alone tends to produce slightly lower test accuracy scores, but comes with the benefit of significantly reducing model over-fit. To make each iteration easier to visualize, I also included an annotated confusion-matrix plot of model performance on the hold-out testing data. Here's the code:

# define a function to generate an annotated confusion matrix
def confu_matrix(y_pred, x_tst, y_tst):
    import warnings
    warnings.filterwarnings('ignore')
    y_pred = np.array(y_pred).flatten()
    y_tst = np.array(y_tst).flatten()
    cm = confusion_matrix(y_tst, y_pred)
    # overlay raw counts (top of each cell) and percentages (bottom)
    sns.heatmap(cm, annot=True, fmt='0g',
                annot_kws={'size': 14, 'ha': 'center', 'va': 'top'})
    sns.heatmap(cm / np.sum(cm), annot=True, fmt='0.01%',
                annot_kws={'size': 14, 'ha': 'center', 'va': 'bottom'})
    plt.title('Confusion Matrix', fontsize=14)
    plt.show()

def convert_params(best_params):
    """Wrap each best value in a list so it can be fed back into a param_grid."""
    params = {}
    for key, val in best_params.items():
        params[key] = [val]
    return params

def get_best_params(cv_results):
    """
    input: model.cv_results_
    returns: dictionary of parameters with the highest harmonic
    mean balancing mean_test_score and (1 - test_train_diff).
    This reduces overfitting while maximizing test score.
    """
    dfp = pd.DataFrame(cv_results)
    dfp['test_train_diff'] = np.abs(dfp['mean_train_score'] - dfp['mean_test_score'])
    # weighted harmonic mean: test score counts twice, (1 - overfit gap) once
    dfp['harmonic'] = 3 / ((2 / dfp['mean_test_score']) + (1 / (1 - dfp['test_train_diff'])))
    dfp.sort_values(by='harmonic', ascending=False, inplace=True)
    dfp.reset_index(drop=True, inplace=True)
    return convert_params(dfp.iloc[0]['params'])

def gridsearch_params(estimator, params_test, old_params=None,
                      update_params=True, scoring='accuracy'):
    """
    Inputs an instantiated estimator and a dictionary of parameters
    for tuning (optionally an old dictionary of established parameters).
    Returns the new best parameters and the refit GridSearchCV object.
    Requires X_train, X_test, y_train, y_test to exist as global variables.
    """
    import warnings
    warnings.filterwarnings('ignore')
    # fold previously established parameters into this round's grid
    if update_params and old_params is not None:
        old_params.update(params_test)
        params_test = old_params
    # search the full grid; return_train_score=True is needed so cv_results_
    # includes mean_train_score for the overfit measure
    # (note: the iid argument has been removed from newer scikit-learn releases)
    gsearch1 = GridSearchCV(estimator=estimator, refit=True,
                            param_grid=params_test, scoring=scoring,
                            return_train_score=True,
                            n_jobs=4, iid=False, cv=5)
    gsearch1.fit(X_train, y_train.values.flatten())
    best_params = get_best_params(gsearch1.cv_results_)
    # refit on the single best (overfit-balanced) parameter combination
    gsearch1a = GridSearchCV(estimator=estimator, refit=True,
                             param_grid=best_params, scoring=scoring,
                             n_jobs=4, iid=False, cv=5)
    gsearch1a.fit(X_train, y_train.values.flatten())
    confu_matrix(gsearch1a.predict(X_test), X_test, y_test)
    tr_acc = round(accuracy_score(y_train.values.flatten(),
                                  gsearch1a.predict(X_train)), 4) * 100
    tst_acc = round(accuracy_score(y_test.values.flatten(),
                                   gsearch1a.predict(X_test)), 4) * 100
    print(f"Train accuracy: {tr_acc}%\nTest accuracy: {tst_acc}%\n{best_params}")
    return best_params, gsearch1a

# First set of parameters
param_test1 = {'max_depth': range(3, 8),
               'min_child_weight': range(1, 6)}
xgb1a = XGBClassifier(learning_rate=0.1, n_estimators=1000,
                      objective='binary:logistic', nthread=4, seed=42)
best_params, xgb_gs1 = gridsearch_params(xgb1a, param_test1,
                                         update_params=False)

# second set of parameters, including best params from the first set
param_test2 = {'gamma': np.linspace(0.0, 0.2, 5),
               'subsample': np.linspace(0.8, 1.0, 5),
               'colsample_bytree': np.linspace(0.8, 1.0, 5)}
xgb1a = XGBClassifier(learning_rate=0.1, n_estimators=1000,
                      objective='binary:logistic', nthread=4, seed=42)
best_params, xgb_gs2 = gridsearch_params(xgb1a, param_test2, best_params,
                                         update_params=True)
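
From here the pattern simply repeats: each round hands its accumulated best_params to the next call and tests a fresh batch of related parameters. A hypothetical third round might tune the regularization terms (reg_alpha and reg_lambda are real XGBoost parameters, but these values are only an illustration, not taken from my original run):

# third (hypothetical) set of parameters, reusing the accumulated best params
param_test3 = {'reg_alpha': [0, 0.01, 0.1],
               'reg_lambda': [0.5, 1.0, 1.5]}
xgb1a = XGBClassifier(learning_rate=0.1, n_estimators=1000,
                      objective='binary:logistic', nthread=4, seed=42)
best_params, xgb_gs3 = gridsearch_params(xgb1a, param_test3, best_params,
                                         update_params=True)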
