Bagging and Pasting in Machine Learning

Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training algorithms, as we did with Voting classifiers. Another approach is to use the same training algorithm for every predictor and train them on different random subsets of the training set.

When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.

In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor. The difference in sampling is sketched below.
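
Here is a minimal NumPy sketch of the two sampling schemes (the index array and subset size are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
indices = np.arange(10)  # stand-in for the row indices of a training set

# bagging: sample with replacement, so the same instance can appear
# more than once in a single predictor's training subset
bagging_sample = rng.choice(indices, size=8, replace=True)

# pasting: sample without replacement, so each instance appears at most once
pasting_sample = rng.choice(indices, size=8, replace=False)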

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the statistical mode (i.e., the most frequent prediction, just like a hard voting classifier) for classification, or the average for regression. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.
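
For intuition, here is a tiny sketch of the aggregation step on made-up predictions, using the mode for classification and the mean for regression:

import numpy as np
from collections import Counter

# hypothetical predictions from five predictors for one instance
class_preds = ['B', 'M', 'B', 'B', 'M']
ensemble_class = Counter(class_preds).most_common(1)[0][0]  # statistical mode -> 'B'

reg_preds = [2.1, 1.9, 2.4, 2.0, 2.2]
ensemble_value = np.mean(reg_preds)  # average -> 2.12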

How to Train a Bagging and Pasting Model in Scikit-learn?

Scikit-learn offers a simple API for both bagging and pasting with the BaggingClassifier for classification and BaggingRegressor for regression.
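
For regression the usage mirrors the classifier. A quick sketch on synthetic data (the make_regression dataset and all parameter values below are illustrative only, not part of this walkthrough):

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

# made-up regression data for demonstration
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

bag_reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, n_jobs=-1)
bag_reg.fit(X_reg, y_reg)
bag_reg.predict(X_reg[:3])  # each prediction is the average over the 100 trees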

Let’s read a dataset to work with.

import pandas as pd
import numpy as np

# load the breast cancer dataset from GitHub
url = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/breast_cancer.csv'
df = pd.read_csv(url)
df.head()

Next, split the data into a training set and a test set.

from sklearn.model_selection import train_test_split

# separate the features from the target column
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

# hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Next, we will create an ensemble of 500 decision tree classifiers, each trained on 50 training instances randomly sampled from the training set with replacement (this is an example of bagging; if you want to use pasting instead, just set bootstrap=False). The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions (-1 tells Scikit-Learn to use all available cores).

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# bagging: 500 trees, each trained on 50 instances sampled with replacement
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=50, bootstrap=True, n_jobs=-1
)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)
# output
0.956140350877193
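
And to use pasting instead, the only change is bootstrap=False, so each predictor's 50 instances are sampled without replacement. A sketch reusing the same setup (the accuracy you get will vary from run to run):

paste_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=50, bootstrap=False, n_jobs=-1
)
paste_clf.fit(X_train, y_train)
y_pred = paste_clf.predict(X_test)
accuracy_score(y_test, y_pred)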
