Do I need to tune logistic regression hyperparameters?

Aren’t we over-committed to optimizing the data science work we do? We are often trying to find the best combination of x, y, z variables to reach that perfect solution. The optimization tasks are fun, yet they can easily carry us away, and we end up spending precious extra hours or even days.

Photo: https://pixabay.com/photos/code-programming-hacking-html-web-820275/

The optimization universe is wide and deep. We won’t answer every question here; this article focuses on the simplest, yet most popular, algorithm — logistic regression.

According to a Kaggle survey, the most popular machine learning model is logistic regression, which is also the most popular classification technique — https://blog.exploratory.io/exploratory-weekly-update-12-3-d4b1d0f620b9

Hyperparameter Tuning

Hyperparameter tuning is an optimization technique and is an essential aspect of the machine learning process. A good choice of hyperparameters may make your model meet your desired metric. Yet, the plethora of hyperparameters, algorithms, and optimization objectives can lead to an unending cycle of continuous optimization effort.

Logistic Regression Hyperparameters

The main hyperparameters we may tune in logistic regression are: solver, penalty, and regularization strength (sklearn documentation).

Solver is the algorithm to use in the optimization problem. The choices are {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’.

  1. lbfgs performs relatively well compared to the other methods and saves a lot of memory; however, it can sometimes have issues with convergence.
  2. sag is faster than the other solvers on large datasets, when both the number of samples and the number of features are large.
  3. saga is the solver of choice for sparse multinomial logistic regression, and it is also suitable for very large datasets.
  4. newton-cg is computationally expensive because it computes the Hessian matrix.
  5. liblinear is recommended for high-dimensional datasets, solving large-scale classification problems.

Penalty (or regularization) is intended to reduce the model’s generalization error. The technique discourages learning an overly complex model, so as to avoid the risk of overfitting. The choices are: {‘l1’, ‘l2’, ‘elasticnet’, ‘none’}, default=’l2’. However, not every penalty works with every solver; the following summary lists the penalties supported by each solver:

  ‘newton-cg’: ‘l2’, ‘none’
  ‘lbfgs’: ‘l2’, ‘none’
  ‘liblinear’: ‘l1’, ‘l2’
  ‘sag’: ‘l2’, ‘none’
  ‘saga’: ‘elasticnet’, ‘l1’, ‘l2’, ‘none’

(per https://scikit-learn.org/dev/modules/linear_model.html#logistic-regression)
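As the list shows, the ‘elasticnet’ penalty is only supported by the ‘saga’ solver and needs an additional l1_ratio parameter. A minimal sketch of how it could be configured (illustrative only; this penalty is not used elsewhere in this article):

from sklearn.linear_model import LogisticRegression

# Elastic-net mixes l1 and l2 regularization: l1_ratio=0 is pure l2,
# l1_ratio=1 is pure l1. Only the 'saga' solver supports it.
logreg_en = LogisticRegression(penalty='elasticnet', solver='saga',
                               l1_ratio=0.5, C=1.0, max_iter=1000)
# logreg_en.fit(X_train, y_train)  # fit on a scaled training split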

C (the inverse of regularization strength) must be a positive float. It works together with the penalty to regulate overfitting: smaller values specify stronger regularization, while a high value tells the model to give more weight to the training data.

Logistic regression offers other parameters such as: class_weight, dual (the dual formulation, implemented only for the l2 penalty with the liblinear solver; dual=False is preferred when n_samples > n_features), and max_iter (a higher iteration count may improve convergence), among others. However, these have less impact.
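For imbalanced datasets like the ones used below, class_weight is often the most useful of these secondary parameters. A minimal sketch, assuming a scaled train split like the one prepared later in this article:

from sklearn.linear_model import LogisticRegression

# class_weight='balanced' reweights classes inversely to their frequencies,
# and a higher max_iter gives slower solvers more room to converge.
logreg_cw = LogisticRegression(class_weight='balanced', max_iter=2000)
# logreg_cw.fit(X_train, y_train)  # X_train/y_train as prepared in the baseline below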

First, we optimize logistic regression hyperparameters for a fintech dataset. It is a binary classification task, with the objective to predict if a given loan applicant is likely to pay the loan back.
Kaggle notebook reference is here.

The Baseline

We establish a baseline by fitting the classifier with the default parameters before performing the hyperparameter tuning. Note that the fast convergence of ‘sag’ and ‘saga’ is only guaranteed on features with approximately the same scale, so we preprocess the data with a scaler from sklearn.preprocessing.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, auc, roc_curve
from sklearn.preprocessing import MinMaxScaler

loans = pd.read_csv('../input/prepared-lending-club-dataset/mycsvfile.csv')
loans = loans[["loan_amnt", "term", "sub_grade", "emp_length", "annual_inc", "loan_status", "dti", "mths_since_recent_inq", "revol_util", "num_op_rev_tl"]]

X = loans.drop('loan_status', axis=1)
y = loans[['loan_status']]
y = y.values.ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

scaler = MinMaxScaler() # the saga solver requires features to be scaled for model convergence

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print('Accuracy of logistic regression classifier on train set: {:.2f}'.format(logreg.score(X_train, y_train)))
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
print('Precision of logistic regression classifier on test set: {:.2f}'.format(precision_score(y_test, y_pred)))

We will use these metrics as a baseline comparison for any improvements we yield during the optimization work.

The best solver

The model performs well on the train and test sets, both yielding a similar accuracy of 0.79, and at a glance there is no overfitting. Therefore, we start by selecting the best solver, excluding regularization at this stage: {‘newton-cg’, ‘lbfgs’, ‘sag’, ‘saga’}.

liblinear is excluded for the time being, as it does not support the ‘none’ penalty and raises an error if we try it: ValueError: penalty='none' is not supported for the liblinear solver.

clf = [
    LogisticRegression(solver='newton-cg', penalty='none', max_iter=1000),
    LogisticRegression(solver='lbfgs', penalty='none', max_iter=1000),
    LogisticRegression(solver='sag', penalty='none', max_iter=1000),
    LogisticRegression(solver='saga', penalty='none', max_iter=1000)
]

clf_columns = []
clf_compare = pd.DataFrame(columns=clf_columns)

row_index = 0
for alg in clf:
    # fit each solver and collect its metrics on the test set
    predicted = alg.fit(X_train, y_train).predict(X_test)
    fp, tp, th = roc_curve(y_test, predicted)
    clf_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(X_train, y_train), 5)
    clf_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(X_test, y_test), 5)
    clf_compare.loc[row_index, 'Precision'] = round(precision_score(y_test, predicted), 5)
    clf_compare.loc[row_index, 'Recall'] = round(recall_score(y_test, predicted), 5)
    clf_compare.loc[row_index, 'AUC'] = round(auc(fp, tp), 5)
    row_index += 1

clf_compare.sort_values(by=['Test Accuracy'], ascending=False, inplace=True)
clf_compare

What we observe here is that, regardless of the solver we choose, the metric improvement over the baseline is less than 0.001%.

Comparing Solvers with Penalties

Next, we add an l2 regularization layer to all the solvers, including liblinear:

clf = [
    LogisticRegression(solver='newton-cg', penalty='l2', max_iter=1000),
    LogisticRegression(solver='lbfgs', penalty='l2', max_iter=1000),
    LogisticRegression(solver='liblinear', penalty='l2', max_iter=1000),
    LogisticRegression(solver='sag', penalty='l2', max_iter=1000),
    LogisticRegression(solver='saga', penalty='l2', max_iter=1000)
]

clf_columns = []
clf_compare = pd.DataFrame(columns=clf_columns)

row_index = 0
for alg in clf:
    # fit each solver with l2 regularization and collect its metrics
    predicted = alg.fit(X_train, y_train).predict(X_test)
    fp, tp, th = roc_curve(y_test, predicted)
    clf_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(X_train, y_train), 5)
    clf_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(X_test, y_test), 5)
    clf_compare.loc[row_index, 'Precision'] = round(precision_score(y_test, predicted), 5)
    clf_compare.loc[row_index, 'Recall'] = round(recall_score(y_test, predicted), 5)
    clf_compare.loc[row_index, 'AUC'] = round(auc(fp, tp), 5)
    row_index += 1

clf_compare.sort_values(by=['Test Accuracy'], ascending=False, inplace=True)
clf_compare

We observe more variance in the results, yet, as you can see, it is insignificant.

Comparing the C parameter

Finally, we introduce C (default is 1), the inverse of regularization strength, which works with the penalty to regulate overfitting. We will specify a smaller value in order to get stronger regularization.

clf = [
    LogisticRegression(solver='newton-cg', penalty='l2', C=0.001, max_iter=1000),
    LogisticRegression(solver='lbfgs', penalty='l2', C=0.001, max_iter=1000),
    LogisticRegression(solver='sag', penalty='l2', C=0.001, max_iter=1000),
    LogisticRegression(solver='saga', penalty='l2', C=0.001, max_iter=1000)
]

clf_columns = []
clf_compare = pd.DataFrame(columns=clf_columns)

row_index = 0
for alg in clf:
    # fit each solver with strong regularization (small C) and collect its metrics
    predicted = alg.fit(X_train, y_train).predict(X_test)
    fp, tp, th = roc_curve(y_test, predicted)
    clf_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(X_train, y_train), 5)
    clf_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(X_test, y_test), 5)
    clf_compare.loc[row_index, 'Precision'] = round(precision_score(y_test, predicted), 5)
    clf_compare.loc[row_index, 'Recall'] = round(recall_score(y_test, predicted), 5)
    clf_compare.loc[row_index, 'AUC'] = round(auc(fp, tp), 5)
    row_index += 1

clf_compare.sort_values(by=['Test Accuracy'], ascending=False, inplace=True)
clf_compare

A minor reshuffle happens in this case between precision and recall, with a slight improvement in accuracy, yet at the cost of AUC. Since the AUC isn’t higher, we can assume without further analysis that this classifier isn’t better than the previous one.

From this three-step experiment, we can conclude that we didn’t gain any substantial benefit from tuning the hyperparameters; the classifier with default parameters was strong enough by itself.

Now, we try the same approach with a heart disease dataset. Again, this is a binary classification task with the objective to predict if a given person's health condition is likely to cause heart disease.

The Baseline

We start by building the baseline:

hd = pd.read_csv('../input/personal-key-indicators-of-heart-disease/heart_2020_cleaned.csv')
hd = hd[hd.columns].replace({'Yes': 1, 'No': 0, 'Male': 1, 'Female': 0,
                             'No, borderline diabetes': '0', 'Yes (during pregnancy)': '1'})
hd['Diabetic'] = hd['Diabetic'].astype(int)

cleaner_app_type = {"AgeCategory": {"18-24": 1.0, "25-29": 2.0, "30-34": 3.0, "35-39": 4.0, "40-44": 5.0,
                                    "45-49": 6.0, "50-54": 7.0, "55-59": 8.0, "60-64": 9.0, "65-69": 10.0,
                                    "70-74": 11.0, "75-79": 12.0, "80 or older": 13.0}}
hd = hd.replace(cleaner_app_type)
hd = hd.drop(columns=['Race', 'GenHealth'], axis=1)
X = hd.drop('HeartDisease', axis=1)
y = hd[['HeartDisease']]
y = y.values.ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

scaler = MinMaxScaler() # the saga solver requires features to be scaled for model convergence

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print('Accuracy of logistic regression classifier on train set: {:.2f}'.format(logreg.score(X_train, y_train)))
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
print('Precision of logistic regression classifier on test set: {:.2f}'.format(precision_score(y_test, y_pred)))

The best solver

The model performs well on the train and test sets, both yielding a similar accuracy of 0.91. Let’s see if any solver can show significantly better performance:

clf = [
    LogisticRegression(solver='newton-cg', penalty='none', max_iter=1000),
    LogisticRegression(solver='lbfgs', penalty='none', max_iter=1000),
    LogisticRegression(solver='sag', penalty='none', max_iter=1000),
    LogisticRegression(solver='saga', penalty='none', max_iter=1000)
]

clf_columns = []
clf_compare = pd.DataFrame(columns=clf_columns)

row_index = 0
for alg in clf:
    # fit each solver and collect its metrics on the test set
    predicted = alg.fit(X_train, y_train).predict(X_test)
    fp, tp, th = roc_curve(y_test, predicted)
    clf_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(X_train, y_train), 5)
    clf_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(X_test, y_test), 5)
    clf_compare.loc[row_index, 'Precision'] = round(precision_score(y_test, predicted), 5)
    clf_compare.loc[row_index, 'Recall'] = round(recall_score(y_test, predicted), 5)
    clf_compare.loc[row_index, 'AUC'] = round(auc(fp, tp), 5)
    row_index += 1

clf_compare.sort_values(by=['Test Accuracy'], ascending=False, inplace=True)
clf_compare

We learn that there is no substantial difference:

Comparing Solvers with Penalties

Now, let’s add the penalty l2 layer:

clf = [
    LogisticRegression(solver='newton-cg', penalty='l2', max_iter=1000),
    LogisticRegression(solver='lbfgs', penalty='l2', max_iter=1000),
    LogisticRegression(solver='sag', penalty='l2', max_iter=1000),
    LogisticRegression(solver='saga', penalty='l2', max_iter=1000)
]

clf_columns = []
clf_compare = pd.DataFrame(columns=clf_columns)

row_index = 0
for alg in clf:
    # fit each solver with l2 regularization and collect its metrics
    predicted = alg.fit(X_train, y_train).predict(X_test)
    fp, tp, th = roc_curve(y_test, predicted)
    clf_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(X_train, y_train), 5)
    clf_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(X_test, y_test), 5)
    clf_compare.loc[row_index, 'Precision'] = round(precision_score(y_test, predicted), 5)
    clf_compare.loc[row_index, 'Recall'] = round(recall_score(y_test, predicted), 5)
    clf_compare.loc[row_index, 'AUC'] = round(auc(fp, tp), 5)
    row_index += 1

clf_compare.sort_values(by=['Test Accuracy'], ascending=False, inplace=True)
clf_compare

And we see that we don’t gain any substantial difference here either.

Comparing C

The last step is to adjust C:

clf = [
    LogisticRegression(solver='newton-cg', penalty='l2', C=0.001, max_iter=1000),
    LogisticRegression(solver='lbfgs', penalty='l2', C=0.001, max_iter=1000),
    LogisticRegression(solver='sag', penalty='l2', C=0.001, max_iter=1000),
    LogisticRegression(solver='saga', penalty='l2', C=0.001, max_iter=1000)
]

clf_columns = []
clf_compare = pd.DataFrame(columns=clf_columns)

row_index = 0
for alg in clf:
    # fit each solver with strong regularization (small C) and collect its metrics
    predicted = alg.fit(X_train, y_train).predict(X_test)
    fp, tp, th = roc_curve(y_test, predicted)
    clf_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(X_train, y_train), 5)
    clf_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(X_test, y_test), 5)
    clf_compare.loc[row_index, 'Precision'] = round(precision_score(y_test, predicted), 5)
    clf_compare.loc[row_index, 'Recall'] = round(recall_score(y_test, predicted), 5)
    clf_compare.loc[row_index, 'AUC'] = round(auc(fp, tp), 5)
    row_index += 1

clf_compare.sort_values(by=['Test Accuracy'], ascending=False, inplace=True)
clf_compare

Similarly to the fintech dataset, we see a stronger reshuffle between the precision and recall metrics, which could be a useful gain if you are optimizing for a specific metric. For example, in the case of heart disease, you may want to focus on better prediction of people with heart disease. Yet, even with the reshuffle, the AUC score decreased, meaning the overall classifier performance decreased.

There are many algorithms available out there, and it takes a good amount of time to study the meaning, impact, and optimization purpose of each hyperparameter. In addition, tuning is often a compute-demanding activity: whether you apply a random search, a grid search, or an even more complex methodology, you burn compute resources (a sketch of a grid search is shown below). Yet, sometimes the incremental impact is very small, and in our particular case the increment is not even meaningful, considering the datasets are imbalanced.
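For completeness, a grid search over the same hyperparameters we tried by hand could look like the sketch below. The grid values are illustrative, and the search assumes the scaled X_train/y_train splits prepared earlier in this article:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Illustrative grid over solver and C with the l2 penalty; every extra value
# multiplies the number of fits, which is where the compute cost comes from.
param_grid = {
    'solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
    'penalty': ['l2'],
    'C': [0.001, 0.01, 0.1, 1, 10],
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring='roc_auc', cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)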

Wouldn’t it be better to spend your time on something more important? For example, linking the model results with business metrics, or optimizing decision thresholds, and so on?
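As an example of the last point, shifting the decision threshold of the already fitted baseline classifier is nearly free and directly trades precision for recall. A minimal sketch, assuming the fitted logreg and the test split from the baseline above:

from sklearn.metrics import precision_score, recall_score

# Class probabilities from the fitted baseline model.
proba = logreg.predict_proba(X_test)[:, 1]

# Sweep a few thresholds instead of the default 0.5 and compare the trade-off.
for threshold in [0.3, 0.5, 0.7]:
    y_pred_t = (proba >= threshold).astype(int)
    print(threshold,
          round(precision_score(y_test, y_pred_t, zero_division=0), 3),
          round(recall_score(y_test, y_pred_t), 3))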
