THE NAIVE BAYES GUIDE

How to Improve Naive Bayes?

Section 3: Tuning the Model in Python

Kopal Jain
Analytics Vidhya

--

Before continuing, refer to How to Implement Naive Bayes? Section 2: Building the Model in Python, which builds the baseline model tuned here.

[10] Define Grid Search Parameters

import numpy as np

param_grid_nb = {
    'var_smoothing': np.logspace(0, -9, num=100)
}
  • var_smoothing adds a portion of the largest feature variance to all variances for calculation stability, which widens (or smooths) the curve and therefore accounts for samples that lie further from the distribution mean. In this case, np.logspace returns numbers spaced evenly on a log scale, starting at 10^0 = 1, ending at 10^-9, and generating 100 candidate values.

Why this step: To define the parameter grid that the search will use to find the optimal combination. The sklearn.naive_bayes.GaussianNB documentation provides a complete list of parameters, with descriptions, that can be used in grid search functionalities.
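As a quick sanity check, here is a minimal sketch (assuming numpy is imported as above) of what this grid contains:

grid = np.logspace(0, -9, num=100)
print(grid[0])    # 1.0, i.e. 10^0, the largest smoothing value
print(grid[-1])   # 1e-09, i.e. 10^-9, scikit-learn's default var_smoothing
print(len(grid))  # 100 candidate values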

[11] Hyperparameter Tune using Training Data

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
nbModel_grid = GridSearchCV(estimator=GaussianNB(), param_grid=param_grid_nb, verbose=1, cv=10, n_jobs=-1)
nbModel_grid.fit(X_train, y_train)
print(nbModel_grid.best_estimator_)
...
Fitting 10 folds for each of 100 candidates, totalling 1000 fits
GaussianNB(priors=None, var_smoothing=1.0)

Note: The total number of fits is 1000 since cv is defined as 10 and there are 100 candidates (var_smoothing has 100 defined values). Therefore, the total number of fits is 10 x 100 = 1000.

  • estimator is the machine learning model of interest, provided the model has a scoring function; in this case, the model assigned is GaussianNB().
  • param_grid is a dictionary with parameter names (strings) as keys and lists of parameter settings to try as values; this enables searching over any sequence of parameter settings.
  • verbose is the verbosity: the higher, the more messages; in this case, it is set to 1.
  • cv determines the cross-validation splitting strategy (an integer, a cross-validation generator, or an iterable); in this case, 10-fold cross-validation is used.
  • n_jobs is the maximum number of concurrently running workers; in this case, it is set to -1 which implies that all CPUs are used.

Why this step: To find the combination of hyperparameters that minimizes a predefined loss function and therefore gives better results.
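Once fitting completes, the GridSearchCV object also exposes the winning configuration and its score directly. A minimal sketch using standard GridSearchCV attributes:

print(nbModel_grid.best_params_)  # {'var_smoothing': 1.0}, per the output above
print(nbModel_grid.best_score_)   # mean cross-validated accuracy of the best candidate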

[12] Predict on Testing Data

y_pred = nbModel_grid.predict(X_test)
print(y_pred)
...
[0 0 0 1 0 0 0 0 1 1 1 0 0 1 0 1 1 1 1 1 1 0 0 0 1 1 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 0 1 0 0 1 0 1 1 0 1 0 1 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1 0 0 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 0 0 1 1 0 1 1 1 1 0 1 1 0 1 0 0 0 0 0 1 1 1 1 1 0 1 0 1 1 1 1 0 1 0 0 1 0 1 0 1 1 1 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 0 0 0 0 0 1 1 1 0 0 0 1 1 1 1 1 0 0 1 0]

Why this step: To obtain the model's predictions on the testing data so that its accuracy and efficiency can be evaluated.
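Because Gaussian Naive Bayes is a probabilistic model, you can also inspect the class probabilities behind each prediction. A minimal sketch (predict_proba is delegated to the best estimator found by the search):

y_proba = nbModel_grid.predict_proba(X_test)
print(y_proba[:5])  # per-class probabilities for the first five test samples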

[13] Numeric Analysis

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

print(confusion_matrix(y_test, y_pred), ": is the confusion matrix")
print(accuracy_score(y_test, y_pred), ": is the accuracy score")
print(precision_score(y_test, y_pred), ": is the precision score")
print(recall_score(y_test, y_pred), ": is the recall score")
print(f1_score(y_test, y_pred), ": is the f1 score")
...
[[81 27]
 [19 81]] : is the confusion matrix
0.7788461538461539 : is the accuracy score
0.75 : is the precision score
0.81 : is the recall score
0.7788461538461539 : is the f1 score

Note: scikit-learn lays the confusion matrix out as [[TN, FP], [FN, TP]], so the True Negative, False Positive, False Negative, and True Positive values can be read off directly, which aids in the calculation of the accuracy score, precision score, recall score, and f1 score:

  • True Negative = 81
  • False Positive = 27
  • False Negative = 19
  • True Positive = 81
Equations for Accuracy, Precision, Recall, and F1:

  • Accuracy = (TP + TN) / (TP + TN + FP + FN) = (81 + 81) / 208 ≈ 0.7788
  • Precision = TP / (TP + FP) = 81 / 108 = 0.75
  • Recall = TP / (TP + FN) = 81 / 100 = 0.81
  • F1 = 2 × (Precision × Recall) / (Precision + Recall) ≈ 0.7788

Why this step: To evaluate the performance of the tuned classification model. As you can see, the accuracy, precision, recall, and F1 scores have all improved over the basic Gaussian Naive Bayes model built in Section 2.
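To verify these scores by hand, the four cells can be unpacked with ravel(), which flattens the 2x2 matrix in scikit-learn's tn, fp, fn, tp order. A minimal sketch:

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)  # matches the sklearn.metrics values above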
