Enhancing Performance with a Voting Classifier in ML
Suppose you are part of a decision panel and every member of the panel comes to a decision about something. The panel then votes and reaches a final decision. How do they get there? Simple: they take the mode of all the members' votes.
You can do the same thing with machine learning classification problems. Suppose you have trained a few classifiers such as a Logistic Regression classifier, an SVC, a Decision Tree classifier, a Random Forest classifier, and perhaps a few more, each achieving an accuracy of about 85%.
Similarly, in machine learning classification we can use the same panel-voting method. In other words, a very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard-voting classifier.
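For instance, if three classifiers predict classes 1, 1, and 2 for the same instance, the hard-voting ensemble predicts class 1. Here is a tiny, purely illustrative sketch of that counting logic (not Scikit-Learn code, just the idea):

from collections import Counter

# Hypothetical predictions from three classifiers for a single instance
predictions = [1, 1, 2]

# Hard voting: pick the most frequently predicted class (the mode)
majority_class = Counter(predictions).most_common(1)[0][0]
print(majority_class)  # prints 1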
Somewhat surprisingly, this voting classifier often achieves higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy), provided there are a sufficient number of weak learners and they are sufficiently diverse. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble’s accuracy.
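The code below assumes a training set (X_train, y_train) and a test set (X_test, y_test) are already available. The original example does not show how they were prepared, so here is a minimal, illustrative setup using a synthetic two-class moons dataset (the dataset and its parameters are assumptions, not part of the original):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Assumed setup: a synthetic binary classification dataset split into
# training and test sets; any binary dataset would work here.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)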
The following code creates and trains a voting classifier in Scikit-Learn.
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)
voting_clf.fit(X_train, y_train)
Let’s look at each classifier’s accuracy on the test set:
from sklearn.metrics import accuracy_score
# Train each classifier and evaluate it on the test set
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.896
There you have it! The voting classifier slightly outperforms all the individual classifiers.

If all classifiers are able to estimate class probabilities (i.e., they have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. All you need to do is replace voting="hard" with voting="soft" and ensure that all classifiers can estimate class probabilities. This is not the case for the SVC class by default, so you need to set its probability hyperparameter to True (this will make the SVC class use cross-validation to estimate class probabilities, slowing down training, and it will add a predict_proba() method). If you modify the preceding code to use soft voting, you will find that the voting classifier achieves over 91% accuracy!
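Concretely, the change from the preceding hard-voting code might look like the sketch below (it reuses the same log_clf, rnd_clf, and training data defined earlier; this is an illustrative rewrite, not code from the original):

from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC

# probability=True makes SVC estimate class probabilities via cross-validation
# (this slows down training but is required for soft voting)
svm_clf = SVC(probability=True)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft'  # average predicted probabilities instead of counting votes
)
voting_clf.fit(X_train, y_train)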