Voting Classifiers
Voting classifiers are ensembles of many classifiers. A voting classifier aggregates the predictions of each classifier in the ensemble and predicts the class that gets the most votes. This majority-vote classifier is called a hard voting classifier.
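As a minimal illustration of hard voting (the class labels here are hypothetical, just for the example), the majority vote over several predictions can be computed with a simple counter:

```python
from collections import Counter

# Hypothetical predictions from three classifiers for a single sample
votes = ['malignant', 'benign', 'malignant']

# Hard voting: pick the class that received the most votes
majority_class = Counter(votes).most_common(1)[0][0]
print(majority_class)  # → 'malignant'
```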
A voting classifier often achieves higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy), provided there are a sufficient number of weak learners and they are sufficiently diverse.
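This claim can be sketched with a quick probability calculation. Assuming (unrealistically) that the classifiers are fully independent, the chance that a majority of them is correct follows a binomial distribution:

```python
from math import comb

def majority_vote_accuracy(n_classifiers, p):
    """Probability that a strict majority of n independent classifiers,
    each individually correct with probability p, votes for the right class."""
    return sum(
        comb(n_classifiers, k) * p**k * (1 - p) ** (n_classifiers - k)
        for k in range(n_classifiers // 2 + 1, n_classifiers + 1)
    )

# 1,000 weak learners, each barely better than a coin flip (51% accurate)
print(majority_vote_accuracy(1000, 0.51))  # roughly 0.73 — a much stronger ensemble
```

In practice classifiers trained on the same data are far from independent, so the real gain is smaller, which is why diversity among the ensemble members matters.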
How to train a Voting Classifier in Scikit-Learn?
Let’s read a dataset to work with.
import pandas as pd
import numpy as np

url = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/breast_cancer.csv'
df = pd.read_csv(url)
df.head()
Next, split the data into training and test sets.
from sklearn.model_selection import train_test_split

X = df.drop('diagnosis', axis=1)
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
Now, we will train a voting classifier in scikit-learn.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)
voting_clf.fit(X_train, y_train)
Now let's look at each classifier's accuracy on the test set.
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
LogisticRegression 0.9649122807017544
RandomForestClassifier 0.9649122807017544
SVC 0.9473684210526315
VotingClassifier 0.9649122807017544
If all classifiers are able to estimate class probabilities (i.e., they all have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. All you need to do is replace voting='hard' with voting='soft' and ensure that all classifiers can estimate class probabilities. This is not the case for the SVC class by default, so you need to set its probability hyperparameter to True (this makes the SVC class use cross-validation to estimate class probabilities, which slows down training, and adds a predict_proba() method).
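The change described above can be sketched as follows. To keep the example self-contained, it uses Scikit-Learn's built-in load_breast_cancer dataset as a stand-in for the CSV loaded earlier (the raw feature arrays, not the DataFrame from above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Built-in dataset as a stand-in for the CSV used earlier in this post
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=10000)),
        ('rf', RandomForestClassifier(random_state=42)),
        # probability=True enables predict_proba(), which soft voting requires
        ('svc', SVC(probability=True, random_state=42)),
    ],
    voting='soft',
)
soft_voting_clf.fit(X_train, y_train)
print(soft_voting_clf.score(X_test, y_test))
```

Note the accuracy printed here will differ somewhat from the hard-voting numbers above, since the dataset loading and preprocessing are not identical.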