Voting Classifiers in Machine Learning


Voting Classifiers

A voting classifier is an ensemble of many classifiers. It aggregates the predictions of each classifier and predicts the class that gets the most votes. This majority-vote classifier is called a hard voting classifier.

A voting classifier often achieves higher accuracy than the best individual classifier in the ensemble. In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy), provided there are enough weak learners and they are sufficiently diverse.
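You can see why with a toy simulation (an illustrative sketch, not part of the tutorial's dataset): imagine 1,000 independent classifiers that are each correct only 51% of the time, combined by majority vote.

import numpy as np

rng = np.random.default_rng(42)

n_classifiers = 1000   # number of weak learners in the ensemble
n_samples = 10_000     # number of predictions to simulate
p_correct = 0.51       # each learner is right just 51% of the time

# True means a given classifier voted for the correct class on a given sample
votes = rng.random((n_classifiers, n_samples)) < p_correct

# the majority vote is correct whenever more than half the learners are right
majority_correct = votes.sum(axis=0) > n_classifiers / 2
print(majority_correct.mean())  # roughly 0.75, well above 0.51

The catch is the independence assumption: real classifiers trained on the same data make correlated errors, which is why the ensemble members also need to be diverse.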

How to train a Voting Classifier in scikit-learn?

Let’s read a dataset to work with.

import pandas as pd
import numpy as np

url = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/breast_cancer.csv'
df = pd.read_csv(url)
df.head()

Next, split the data into training and test sets.

from sklearn.model_selection import train_test_split

X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Now, we will train a voting classifier in scikit-learn.

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

# combine the three classifiers with a majority (hard) vote
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)

voting_clf.fit(X_train, y_train)

Now let’s look at each classifier’s accuracy on the test set.

from sklearn.metrics import accuracy_score

# train each classifier on the training set and compare test-set accuracy
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

Output:

LogisticRegression 0.9649122807017544
RandomForestClassifier 0.9649122807017544
SVC 0.9473684210526315
VotingClassifier 0.9649122807017544

If all classifiers are able to estimate class probabilities (i.e., they all have a predict_proba() method), then you can tell scikit-learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. All you need to do is replace voting='hard' with voting='soft' and ensure that all classifiers can estimate class probabilities. This is not the case for the SVC class by default, so you need to set its probability hyperparameter to True (this makes the SVC class use cross-validation to estimate class probabilities, which slows down training, and adds a predict_proba() method).
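For example, here is the same ensemble switched to soft voting, reusing the X_train/X_test split from above:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)  # probability=True adds predict_proba(), required for soft voting

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft'  # average predicted class probabilities instead of counting votes
)

voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))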
