Python Model Tuning Methods Using Cross Validation and Grid Search

Sebastian Norena
May 18, 2018 · 6 min read


Image from Karl Rosaen Log http://karlrosaen.com/ml/learning-log/2016-06-20/

When doing machine learning, it is usually difficult to know in advance which parameters and hyper-parameters will enable a specific model to make better predictions; even deciding which model will predict best is difficult. Another challenge arises with training and testing data, because picking different proportions of data for training and testing affects the quality of the resulting model. These are challenges that every data scientist needs to solve in order to get better predictions.

Fortunately, there are tools that make this process easier by making predictions with many different combinations of parameters, hyper-parameters, models, training data, and testing data, and returning the combination that gives the best predictions. This saves a lot of time and effort and leads to better results, so below I will describe some of these tools.

Sample Data

For a better illustration I will use the Titanic dataset; it has become popular for learning data science, and I used it when I was learning myself. In this post I will use it to train a model that, based on all the features, predicts whether a particular person survives or not.

You can see the complete code for this post on this colaboratory file: https://drive.google.com/file/d/18OHAYPtXaX7p1sJutWByqgVyYbA2ArA5/view?usp=sharing

You can import and prepare the Titanic data like this:

import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
titanic = sns.load_dataset('titanic')
titanic = pd.DataFrame(titanic.drop_duplicates())
titanic = titanic.drop(['alive', 'adult_male', 'who', 'class', 'embark_town'], axis=1)
titanic['embarked'] = titanic['embarked'].fillna(method='ffill')
titanic = titanic.drop(['deck'], axis=1)
titanic['age'] = titanic['age'].fillna(method='ffill')
# Convert binomials and categoricals to encoded labels
for label in ['embarked', 'sex', 'alone']:
    titanic[label] = LabelEncoder().fit_transform(titanic[label])
# Color labels for plotting (not used in the rest of this post)
colors = np.array(['red', 'blue'])[titanic['survived']]
survived = titanic['survived']
titanic = titanic.drop(['survived'], axis=1)

Split Training and Testing Data

sklearn.model_selection.train_test_split

This function splits the dataset into train and test subsets; the size of each subset is specified with the test_size and train_size parameters.

How it works

x_train, x_test, y_train, y_test = train_test_split(titanic, survived, test_size = 0.2, random_state=42)

Because test_size=0.2, the function assigns 20% of the data to the test subsets and the remaining 80% to the train subsets.

x_train: the 80% of the feature data (in this case "titanic") used to train the model.

y_train: the corresponding 80% of the labels (in this case "survived") used, together with x_train, to train the model.

x_test: the remaining 20% of the feature data, used as input for the trained model.

y_test: the remaining 20% of the labels, used to validate the model's accuracy by comparing them against the predictions the trained model makes from x_test.
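
As a quick sanity check, you can print the shapes of the four resulting objects to confirm the 80/20 proportions (a small sketch using the variables from the split above):

# Sanity check on the split proportions
print(x_train.shape, y_train.shape)  # about 80% of the rows and their labels
print(x_test.shape, y_test.shape)    # the remaining 20% and their labels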

Cross Validation

sklearn.model_selection.cross_validate()

To avoid manually trying different percentages of training and testing data, you can hand this task to the cross_validate function (or, as in the example below, cross_val_score), which divides the training data into k folds and then evaluates every combination in which one fold serves as the test set and the remaining k-1 folds serve as the train set. Here k is the desired number of folds; in my experience it usually works well with k=5.

When using cross validation, a test set (x_test and y_test) should still be held out to make a final evaluation of the model on new data, but a separate validation set is not needed: by the end of all the iterations, cross validation has used every fold for training and every fold for validation.

How it Works

X = titanic
Y = survived
num_instances = len(X)
seed = 7
# shuffle=True is needed for random_state to have an effect in KFold
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=seed)
model = LogisticRegression(max_iter=1000)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: Final mean: %.3f%%, Final standard deviation: %.3f%%" % (results.mean()*100.0, results.std()*100.0))
print('Accuracies from each of the 5 folds using kfold:', results)
print("Variance of kfold accuracies:", results.var())

Use Grid Search to Explore Hyper-parameters

Before using Grid Search lets define Parameters and Hyper-parameters:

Model parameters are values that are determined during training and are derived directly from the data. They can be seen as configuration variables internal to the model whose values are estimated from the dataset; they are typically not set manually.

Some examples of parameters are:

  • The coefficients of a logistic regression or linear regression model
  • The weights in a neural network

Model hyper-parameters are values that are defined before training and cannot be learned directly from the data. They are set for the model at a higher level, so that the model can be trained on the dataset according to them and then determine the model parameters.

Hyper-parameters set the particular way in which a model will adapt to the dataset.

Some examples of hyper-parameters are:

  • The learning rate of the model
  • The number of folds on a k-fold cross validation
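
To make the distinction concrete, here is a minimal sketch using the training split prepared earlier: C is a hyper-parameter chosen before fitting, while the coefficients and intercept are parameters learned from the data.

# C is a hyper-parameter: set before training
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(x_train, y_train)
# coef_ and intercept_ are parameters: learned from the training data
print('Learned coefficients:', clf.coef_)
print('Learned intercept:', clf.intercept_)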

Grid Search is useful to Explore Hyper-parameters

Every model has many hyper-parameters, so a good way to find the best set of hyper-parameters is to try different combinations and compare the results.

How it Works

Grid Search evaluates all the combinations from a list of desired hyper-parameters and reports which combination has the best accuracy.

from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
# Create logistic regression object
# (the liblinear solver supports both the 'l1' and 'l2' penalties)
logistic = linear_model.LogisticRegression(solver='liblinear')
# Create a list of all of the different penalty values that you want to test and save them to a variable called 'penalty'
penalty = ['l1', 'l2']
# Create a list of all of the different C values that you want to test and save them to a variable called 'C'
C = [0.0001, 0.001, 0.01, 1, 100]
# Now that you have two lists, each holding the different values that you want to test,
# use the dict() function to combine them into a dictionary.
# Save your new dictionary to the variable 'hyperparameters'
hyperparameters = dict(C=C, penalty=penalty)
# Fit your model using grid search
clf = GridSearchCV(logistic, hyperparameters, cv=5, verbose=0)
best_model = clf.fit(X, Y)
# Print all the parameters that gave the best results:
print('Best Parameters:', clf.best_params_)
# You can also print the best penalty and C value individually from best_model.best_estimator_.get_params()
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])
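
Note that the grid search above is fit on the full dataset (X, Y). To get an honest estimate of how the selected hyper-parameters generalize, a sketch like the following refits the search on the training split only and scores it on the held-out test split (clf_holdout is just an illustrative name):

# Refit the same grid search on the training split only,
# so the held-out test split stays unseen during the search
clf_holdout = GridSearchCV(logistic, hyperparameters, cv=5, verbose=0)
clf_holdout.fit(x_train, y_train)
print('Held-out accuracy:', clf_holdout.score(x_test, y_test))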

Calculate AUC for a Set of Results

In the previous example the grid search was run without specifying how the best result should be chosen, so it defaulted to the estimator's standard score (accuracy for classifiers). The AUC (Area Under the ROC Curve) is another useful measure of how effective a model is. To calculate the AUC and choose the set of hyper-parameters with the best AUC, simply set the parameter scoring='roc_auc'.

How it Works

Here the model used for the comparisons is the Random Forest classifier, with scoring='roc_auc' so that the choice of best parameters is based on the best AUC:

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
param_grid = {
    'n_estimators': [70, 100, 180],
    'criterion': ['gini', 'entropy'],
    'verbose': [0, 4, 10],
    'warm_start': [False, True],  # booleans, not the strings 'False'/'True'
    'random_state': [42, 72, 100, 200],
}
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='roc_auc', cv=5)
CV_rfc.fit(X, Y)
print('BEST PARAMETERS:\n', CV_rfc.best_params_)
print('BEST SCORE:\n', CV_rfc.best_score_)
#>>> BEST PARAMETERS: {'criterion': 'entropy', 'n_estimators': 180, 'random_state': 72, 'verbose': 0, 'warm_start': False}
#>>> BEST SCORE: 0.8326181651784338

The final lines show the results of the grid search: the best parameters and the best score. This means the best combination of hyper-parameters for training a random forest classifier on this dataset is {'criterion': 'entropy', 'n_estimators': 180, 'random_state': 72, 'verbose': 0, 'warm_start': False}, and with this combination the AUC (Area Under the Curve) is 0.8326181651784338.
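
If you also want to confirm the AUC on held-out data, a small sketch along these lines works; roc_auc_score needs the predicted probability of the positive class, and x_train/x_test come from the earlier split:

from sklearn.metrics import roc_auc_score
# Refit the best estimator found by the grid search on the training split only,
# then measure AUC on the untouched test split
best_rfc = CV_rfc.best_estimator_
best_rfc.fit(x_train, y_train)
probs = best_rfc.predict_proba(x_test)[:, 1]  # probability of the positive class (survived = 1)
print('Held-out AUC:', roc_auc_score(y_test, probs))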

Conclusion

Cross validation and grid search are essential tools for using a dataset as effectively as possible and for training a model with the best combination of hyper-parameters. They save a considerable amount of time, because building all the different combinations of train and test sets and then training a model with every combination of hyper-parameters by hand is very time consuming.

Except where otherwise noted, this content is licensed under a Creative Commons License.
