How to Use SMOTE for Imbalanced Data in R

Handling imbalanced data is a common problem in machine learning, particularly in classification tasks where the class distribution is skewed. One popular technique for addressing it is the Synthetic Minority Over-sampling Technique (SMOTE). This guide walks through using SMOTE in R to rebalance a dataset and then compares a model trained before and after resampling.

Understanding Imbalanced Data

What is Imbalanced Data?

In a typical classification problem, imbalanced data occurs when one class significantly outnumbers the other. For example, in a dataset with two classes A and B, if 90% of the instances belong to class A and only 10% to class B, then the dataset is imbalanced.
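As a quick illustration, the 90/10 split described above can be inspected in base R. This is a minimal sketch on made-up labels; the vector `labels` and the ratio computed at the end are illustrative and not part of any package.

```r
# Hypothetical labels: 90 cases of class A and 10 of class B
labels <- factor(c(rep("A", 90), rep("B", 10)))

# Absolute counts per class
class_counts <- table(labels)
print(class_counts)

# Share of each class in the dataset
class_share <- prop.table(class_counts)
print(class_share)

# Imbalance ratio: majority count divided by minority count
imbalance_ratio <- max(class_counts) / min(class_counts)
print(imbalance_ratio)
```

A ratio of 9 here means there are nine majority cases for every minority case.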

Why is it a Problem?

The imbalance in the dataset can lead to a bias in the model, where it might predict the majority class most of the time. This undermines the predictive power of the model, particularly for the minority class, which is often the more interesting or important class in real-world problems.
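To see why accuracy alone can be misleading, consider a hypothetical classifier that always predicts the majority class. The sketch below uses made-up labels (90 of class A, 10 of class B) rather than a trained model.

```r
# Hypothetical ground truth: 90 majority (A) and 10 minority (B) cases
actual <- factor(c(rep("A", 90), rep("B", 10)), levels = c("A", "B"))

# A "model" that ignores its input and always predicts the majority class
predicted <- factor(rep("A", 100), levels = c("A", "B"))

# Overall accuracy looks good
accuracy <- mean(predicted == actual)

# Recall for the minority class: share of actual B cases predicted as B
recall_B <- sum(predicted == "B" & actual == "B") / sum(actual == "B")

print(c(accuracy = accuracy, recall_B = recall_B))
```

The accuracy is 0.9 even though not a single minority case is detected, which is why per-class metrics such as recall matter on skewed data.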

Overview of SMOTE

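SMOTE stands for Synthetic Minority Over-sampling Technique. It rebalances the class distribution by generating new minority-class instances: each synthetic instance is placed at a random point on the line segment between an existing minority instance and one of its k nearest minority-class neighbors. Because the new instances are interpolated rather than duplicated, the classifier sees a broader minority region, which can improve its performance on that class.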
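The interpolation idea can be sketched in a few lines of base R. This is a simplified illustration on made-up points, not the implementation used by any package; the function name `make_synthetic` and the matrix `minority` are hypothetical.

```r
set.seed(42)

# Hypothetical minority-class points with two numeric features
minority <- matrix(c(1.0, 2.0,
                     1.5, 1.8,
                     2.0, 2.2,
                     1.2, 2.5,
                     1.8, 1.6,
                     2.3, 2.0),
                   ncol = 2, byrow = TRUE)

k <- 3  # number of nearest neighbours to consider

# Create one synthetic point from minority point i
make_synthetic <- function(x, i, k) {
  # Euclidean distance from point i to every minority point
  centre <- matrix(x[i, ], nrow(x), ncol(x), byrow = TRUE)
  d <- sqrt(rowSums((x - centre)^2))
  d[i] <- Inf                          # exclude the point itself
  neighbours <- order(d)[seq_len(k)]   # indices of the k nearest neighbours
  j <- neighbours[sample.int(k, 1)]    # pick one neighbour at random
  gap <- runif(1)                      # random position along the segment
  x[i, ] + gap * (x[j, ] - x[i, ])
}

# One synthetic point per original minority point
synthetic <- t(sapply(seq_len(nrow(minority)),
                      function(i) make_synthetic(minority, i, k)))
print(synthetic)
```

Every synthetic point lies between two existing minority points, so it stays inside the region the minority class already occupies.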

Installing Necessary Packages

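First, install the required packages. DMwR has been archived on CRAN, so install.packages("DMwR") may fail on a current R installation. In that case it can be installed from the CRAN archive with the remotes package (0.4.1 is assumed here to be the last released version):

install.packages("remotes")
remotes::install_version("DMwR", version = "0.4.1")
install.packages("randomForest")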

Using SMOTE on a Real Dataset

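We’ll use the iris dataset to demonstrate SMOTE. This dataset contains 150 samples of iris flowers, 50 from each of three species. We’ll keep only the first 85 rows, which leaves 50 setosa and 35 versicolor samples, so that the two remaining classes are unequal in size.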

Loading the Package and Modifying the Dataset

library(DMwR)
data(iris)

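# Make the dataset imbalanced: 50 setosa and 35 versicolor rows.
# droplevels() removes the now-empty "virginica" level, which would
# otherwise be treated as a class with zero cases.
imbalanced_data <- droplevels(iris[1:85, ])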

Check Class Distribution

table(imbalanced_data$Species)
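One detail worth checking at this step is that row subsetting keeps every factor level, including levels with no remaining rows. An empty class can cause problems for functions that look for the least frequent class or that reject empty classes. The base R sketch below shows the effect on the same 85 rows; the names `subset_raw` and `subset_clean` are illustrative.

```r
data(iris)

# Taking the first 85 rows removes all virginica cases,
# but the factor still carries the "virginica" level
subset_raw <- iris[1:85, ]
print(table(subset_raw$Species))

# droplevels() discards levels that no longer occur in the data
subset_clean <- droplevels(subset_raw)
print(table(subset_clean$Species))
```

The first table lists virginica with a count of zero, while the second contains only setosa and versicolor.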

Apply SMOTE to Balance Data

set.seed(123)
balanced_data <- SMOTE(Species ~ ., data = imbalanced_data, perc.over = 600, k = 5)

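Here, Species ~ . indicates that the Species variable is the class label and that all other variables are used as features. perc.over = 600 generates six synthetic cases for every original minority case, and k = 5 draws each synthetic case from the five nearest minority-class neighbors. Because perc.under is left at its default of 200, SMOTE also resamples the majority class (two majority cases per synthetic case), so the result is not necessarily an exact 50/50 split.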
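The expected class sizes can be worked out by hand. The arithmetic below is a sketch that assumes the behaviour described in the DMwR documentation: perc.over/100 synthetic cases per minority case, and perc.under/100 majority cases sampled per synthetic case. The variable names are illustrative.

```r
n_minority <- 35    # versicolor rows in iris[1:85, ]
n_majority <- 50    # setosa rows in iris[1:85, ]
perc_over  <- 600   # value passed as perc.over
perc_under <- 200   # assumed default of perc.under

# Synthetic minority cases created by oversampling
n_synthetic <- n_minority * perc_over / 100

# Minority class after SMOTE: originals plus synthetic cases
minority_total <- n_minority + n_synthetic

# Majority cases sampled: perc_under percent of the synthetic count
majority_total <- n_synthetic * perc_under / 100

print(c(minority = minority_total, majority = majority_total))
```

Under these assumptions the result has 245 minority and 420 majority cases. Since 420 exceeds the 50 available setosa rows, the majority cases would have to be sampled with replacement. A combination such as perc.over = 100 with perc.under = 200 would give equal class sizes under the same formula (70 and 70).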

Check New Class Distribution

table(balanced_data$Species)

Comparing Models Before and After SMOTE

Here, we will use the random forest algorithm for demonstration.

Load the randomForest Package

library(randomForest)

Train a Model on Imbalanced Data

set.seed(123)
model1 <- randomForest(Species ~ ., data = imbalanced_data)
print(model1)

Train a Model on Balanced Data

set.seed(123)
model2 <- randomForest(Species ~ ., data = balanced_data)
print(model2)

Compare Model Metrics

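The printed output of each model includes the out-of-bag (OOB) error estimate and a confusion matrix, from which you can derive accuracy, precision, and recall before and after using SMOTE. Keep in mind that the OOB estimate for the second model is computed on data that include synthetic and resampled cases, so a fairer comparison evaluates both models on a held-out test set that was not resampled.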
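The metrics can be computed from a confusion matrix with a small helper. The sketch below uses a made-up 2x2 matrix, not output from the models above, and the function name `classification_metrics` is hypothetical.

```r
# Hypothetical confusion matrix: rows = actual class, columns = predicted class
cm <- matrix(c(48,  2,
                5, 30),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("setosa", "versicolor"),
                             predicted = c("setosa", "versicolor")))

# Accuracy, plus precision and recall for the class named in `positive`
classification_metrics <- function(cm, positive) {
  tp <- cm[positive, positive]          # true positives
  fp <- sum(cm[, positive]) - tp        # predicted positive, actually negative
  fn <- sum(cm[positive, ]) - tp        # actually positive, predicted negative
  c(accuracy  = sum(diag(cm)) / sum(cm),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

metrics <- classification_metrics(cm, positive = "versicolor")
print(round(metrics, 3))
```

For a fitted randomForest classifier, the confusion element should hold a matrix of this shape with an extra class.error column, so something like model1$confusion[, 1:2] could be passed to the helper for a two-class problem.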

Conclusion

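As the iris example shows, SMOTE can rebalance a skewed dataset, which can in turn improve a model’s ability to detect the minority class. On an easy problem such as setosa versus versicolor, both models may already perform almost perfectly, so any benefit is more likely to show on harder, more heavily skewed data. Always compare model performance before and after applying such techniques, ideally on held-out data that was not resampled, to judge whether they actually help.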
