
In the real world, having an imbalanced dataset is a common situation, and knowing how to handle it is an important skill for any data scientist. In this post, you will learn 5 ways to handle an imbalanced dataset for machine learning.
Handling an Imbalanced Dataset –
Method 1 – Get more data –
The first strategy for handling an imbalanced dataset is to get more training data, especially for the minority class. Doing this will reduce the class imbalance in your data.
Method 2 – Use a different Metric for evaluation –
A second strategy for handling class imbalance is to use an evaluation metric that is better suited for it. Using accuracy to evaluate a model trained on imbalanced data is a bad idea, so we can use other metrics instead, such as the confusion matrix, precision, recall, F1 score, and ROC curves.
Here you can find details of all these metrics; a short scikit-learn sketch also follows the list.
1. Confusion Matrix
2. Precision and Recall
3. F1 Score
4. ROC Curve
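The snippet below is not from the original post; it is a minimal sketch of how these metrics can be computed with scikit-learn, using toy labels and scores purely for illustration.
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)
# toy labels, hard predictions and predicted probabilities, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1]
y_proba = [0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.4, 0.2, 0.1, 0.7]
print(confusion_matrix(y_true, y_pred))  # rows = actual class, columns = predicted class
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_proba))    # area under the ROC curve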
Method 3 – Use an algorithm that can handle class imbalance –
There are many algorithms in scikit-learn that can handle class imbalance out of the box, such as Random Forest. Random Forest has a parameter called class_weight which can help you reduce the effect of class imbalance. Let’s see how to do that.
First, let’s read in a dataset to work with. We will use the credit card fraud detection dataset from Kaggle.
import pandas as pd
import numpy as np
pd.options.display.max_columns = 999  # show all columns when displaying the dataframe
# read the credit card fraud detection dataset
df = pd.read_csv('creditcard.csv')
df.head()
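The original post does not show the class distribution, but you can check how imbalanced the target is with a quick value_counts call (a small sketch; in this dataset Class 1 marks the fraudulent transactions):
# check how imbalanced the target column is
print(df['Class'].value_counts())
print(df['Class'].value_counts(normalize=True))  # as fractions of the dataset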

Now, let’s train a Random Forest model without any modification so that we can compare the results later.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
# separate features and target
X = df.drop('Class', axis=1)
y = df['Class']
# split the data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
# train a random forest model
rf = make_pipeline(SimpleImputer(),RandomForestClassifier(random_state=42))
rf.fit(X_train, y_train)
# get out-of-fold probability predictions with 3-fold cross-validation
y_scores = cross_val_predict(rf, X_train, y_train, cv=3, method='predict_proba')
score = roc_auc_score(y_train, y_scores[:, 1])
print(np.round(score, 2))
output -
0.94
Now, let’s use class_weight='balanced' in the Random Forest and measure the ROC AUC score.
# train a random forest model
rf = make_pipeline(SimpleImputer(),
                   RandomForestClassifier(class_weight='balanced', random_state=42))
rf.fit(X_train, y_train)
# get out-of-fold probability predictions with 3-fold cross-validation
y_scores = cross_val_predict(rf, X_train, y_train, cv=3, method='predict_proba')
score = roc_auc_score(y_train, y_scores[:, 1])
print(np.round(score, 2))
output -
0.94
Both models have a similar ROC AUC score. Although handling the class imbalance did not improve the result on this dataset, you get the idea of how to use it.
Method 4 – Downsampling –
Another way to handle class imbalance is downsampling. In downsampling, we randomly sample without replacement from the majority class to create a new subset of observations equal in size to the minority class. For example, if the minority class has 100 observations and the majority class has 900 observations, we randomly sample 100 observations from the majority class and combine both samples to create a dataset with 200 observations. We then train a model on this new dataset.
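The post does not show how df_down is built; below is a minimal sketch of that step, using the variable names the later code expects (df_class0, df_class1, class0_count, df_down) and assuming Class 0 is the majority class and Class 1 the minority class, as in the credit card dataset. The df_class0, df_class1 and class0_count variables are also reused in the upsampling section further down.
# separate the majority (Class = 0) and minority (Class = 1) classes
df_class0 = df[df['Class'] == 0]
df_class1 = df[df['Class'] == 1]
class0_count = len(df_class0)
class1_count = len(df_class1)
# sample from the majority class without replacement, down to the minority class size
df_class0_down = df_class0.sample(class1_count, replace=False)
# combine the downsampled majority class with the minority class
df_down = pd.concat([df_class0_down, df_class1], axis=0)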
X = df_down.drop("Class", axis=1)
y = df_down['Class']
# split the data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
# train a random forest model
rf = make_pipeline(SimpleImputer(),RandomForestClassifier(random_state=42))
rf.fit(X_train, y_train)
# get out-of-fold probability predictions with 3-fold cross-validation
y_scores = cross_val_predict(rf, X_train, y_train, cv=3, method='predict_proba')
score = roc_auc_score(y_train, y_scores[:, 1])
print(np.round(score, 2))
output -
0.98
We can see that after downsampling, the ROC AUC score has increased from 0.94 to 0.98.
Method 5 – Upsampling –
In upsampling, we randomly sample observations from the minority class with replacement until it has as many observations as the majority class. The end result is the same number of observations from the minority and majority classes.
# perform upsampling
df_class1_up = df_class1.sample(class0_count, replace=True)
# combine both dataframes
df_up = pd.concat([df_class0, df_class1_up], axis=0)
# train a model on this new dataset
X = df_up.drop("Class", axis=1)
y = df_up['Class']
# split the data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
# train a random forest model
rf = make_pipeline(SimpleImputer(),RandomForestClassifier(random_state=42))
rf.fit(X_train, y_train)
# get out-of-fold probability predictions with 3-fold cross-validation
y_scores = cross_val_predict(rf, X_train, y_train, cv=3, method='predict_proba')
score = roc_auc_score(y_train, y_scores[:, 1])
print(np.round(score, 2))
output -
1.0
Upsampling worked even better than downsampling. This model has a ROC AUC score close to 1; you are seeing exactly 1 because I rounded the scores to 2 decimal places. This will not always be the case, so please try all the methods described here and choose the one that works best for your dataset.
I hope you liked this post. If you did, please share it with others and subscribe to our blog below for more articles related to Machine Learning.