How to Use createDataPartition() Function in R


The createDataPartition() function, part of the caret package in R, is a versatile tool for creating stratified data partitions. It preserves the proportional representation of classes (i.e., outcome variable levels) in both the training and test sets, a crucial step in building reliable and unbiased predictive models.

1. Basics of createDataPartition():

The createDataPartition() function creates balanced (stratified) splits of the data based on the outcome variable, making it a straightforward way to partition a data frame into training and test sets for model building. Stratification is particularly important when dealing with imbalanced datasets, where one class significantly outnumbers the others.

Syntax:

# Loading the caret package
library(caret)

# Syntax
train_indices <- createDataPartition(y, times = 1, p = 0.5, list = TRUE, groups)
  • y: The outcome variable used to split the data.
  • times: The number of partitions (splits) to create; defaults to 1.
  • p: The proportion of the data to include in the training set; defaults to 0.5.
  • list: Logical. If TRUE, the indices are returned as a list; if FALSE, as a matrix with times columns.
  • groups: For a numeric outcome, the number of quantile groups used to bin y before stratified sampling; it is ignored for factor outcomes (see the sketch after this list).
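
Because groups only matters for a numeric outcome, here is a minimal sketch of that case, stratifying on the numeric mpg column of the built-in mtcars dataset (the column and the number of groups are chosen purely for illustration):

# For a numeric outcome, createDataPartition() bins y into quantile groups
# (here 4) and samples within each group, so the training set spans the full range of mpg
set.seed(123)
train_idx <- createDataPartition(mtcars$mpg, p = 0.7, list = FALSE, groups = 4)

train_data <- mtcars[train_idx, ]
test_data  <- mtcars[-train_idx, ]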

2. Simple Application:

Here’s a basic application of the createDataPartition() function, using the iris dataset.

# Loading the caret package
library(caret)

# Set seed for reproducibility
set.seed(123)

# Creating indices for stratified sampling
train_indices <- createDataPartition(iris$Species, p = 0.7, list = FALSE)

# Creating training and test sets
train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]
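
As a quick sanity check, with p = 0.7 on the 150-row iris data the training set should contain roughly 105 rows and the test set roughly 45:

# Checking the split sizes
nrow(train_data)  # roughly 105 rows (about 70%)
nrow(test_data)   # roughly 45 rows (about 30%)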

3. Verifying Stratification:

It’s essential to verify whether the stratification worked as intended by comparing the class distribution in the original dataset, the training set, and the test set.

# Comparing class distributions
prop.table(table(iris$Species))          # Original dataset
prop.table(table(train_data$Species))   # Training set
prop.table(table(test_data$Species))    # Test set
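
For iris, each species makes up exactly one third of the data, so all three tables should show proportions close to 0.33 for setosa, versicolor, and virginica. If the training or test proportions drift noticeably from the original, the stratification did not work as intended.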

4. Dealing with Imbalanced Data:

For highly imbalanced datasets, using createDataPartition() is particularly important: it ensures that the minority class appears in both the training and test sets in the same proportion as in the full data, whereas a plain random split could leave one of them with few or no minority-class observations.

# Assume 'data' is your imbalanced dataset and 'Class' is your target variable
set.seed(123)
train_indices <- createDataPartition(data$Class, p = 0.7, list = FALSE)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
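
Since no concrete imbalanced dataset is shown above, here is a self-contained sketch that builds a hypothetical 90/10 two-class data frame (the column names and class ratio are made up for illustration) and verifies that the split preserves the imbalance:

# Hypothetical imbalanced data: 900 "negative" vs 100 "positive" cases
set.seed(123)
data <- data.frame(
  x1    = rnorm(1000),
  Class = factor(c(rep("negative", 900), rep("positive", 100)))
)

train_indices <- createDataPartition(data$Class, p = 0.7, list = FALSE)
train_data <- data[train_indices, ]
test_data  <- data[-train_indices, ]

# Both sets should keep roughly the 90/10 class ratio
prop.table(table(train_data$Class))
prop.table(table(test_data$Class))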

5. Using with Cross-Validation:

The createDataPartition() function can also produce several stratified splits at once via its times argument, each preserving the class distribution. Note that these splits are drawn independently (repeated hold-out), so their test sets may overlap; for true non-overlapping k-fold cross-validation, caret provides createFolds(). The workflow below uses createDataPartition() with times = k.

Step 1: Load the necessary library and dataset

# Load the necessary library
library(caret)

# Load the iris dataset
data(iris)

Step 2: Set up k-fold cross-validation

# Set seed for reproducibility
set.seed(123)

# Define the number of folds
k <- 5  # For 5-fold cross-validation

# Create k stratified splits (one per "fold"); each training set holds roughly (k-1)/k of the data
fold_indices <- createDataPartition(iris$Species, p = 1 - 1/k, list = TRUE, times = k)
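
If you need true non-overlapping folds (each observation appearing in exactly one test set), caret's createFolds() is the more direct tool; a minimal sketch:

# Alternative: non-overlapping stratified folds with createFolds()
set.seed(123)
cv_folds <- createFolds(iris$Species, k = 5, list = TRUE, returnTrain = FALSE)

# cv_folds[[i]] holds the test-set row indices for the i-th fold;
# the remaining rows form the corresponding training set
str(cv_folds)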

Step 3: Create training and test sets using the folds

# Initialize a list to hold the training and test sets for each fold
folds_list <- list()

for(i in 1:k) {
  train_indices <- fold_indices[[i]]
  train_data <- iris[train_indices, ]  # Training data for the i-th fold
  test_data <- iris[-train_indices, ]  # Test data for the i-th fold
  folds_list[[i]] <- list("train_data" = train_data, "test_data" = test_data)
}

Step 4: Access the partitions and implement models

# Example of modeling using a simple linear discriminant analysis model as a placeholder
library(MASS)  # Load MASS package for lda function

for(i in 1:k){
  model <- lda(Species ~ ., data = folds_list[[i]]$train_data)
  predictions <- predict(model, folds_list[[i]]$test_data)$class
  
  # Evaluating the model's predictions
  print(confusionMatrix(predictions, folds_list[[i]]$test_data$Species))
}
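
To summarize performance across the k splits, one option is to collect the accuracy reported by each confusion matrix and average it; a small sketch:

# Collecting and averaging accuracy across all k splits
accuracies <- numeric(k)

for(i in 1:k){
  model <- lda(Species ~ ., data = folds_list[[i]]$train_data)
  predictions <- predict(model, folds_list[[i]]$test_data)$class
  cm <- confusionMatrix(predictions, folds_list[[i]]$test_data$Species)
  accuracies[i] <- cm$overall["Accuracy"]
}

mean(accuracies)  # average accuracy over the k splits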

6. Conclusion:

The createDataPartition() function in the caret package is a pivotal tool for splitting datasets into balanced training and test sets, especially when the outcome classes are imbalanced. Whether employed as a standalone function or integrated into more complex custom functions and workflows, it serves as a building block for creating robust, reliable, and generalizable predictive models. By applying stratified partitions, verifying the resulting class distributions, and making any further necessary adjustments, data scientists can substantially improve how reliably they estimate model performance.
