Statistical analysis often relies on generating random samples from a given population for experiments, simulations, or tests. One of the most commonly used functions for this purpose in R is the sample function. This versatile function is part of R’s base package and generates random samples either from a vector of one or more elements or directly from a range of integers. In this guide, we will explore the sample function, its syntax, and its main use cases.
Table of Contents
- Introduction to the sample Function
- Syntax and Parameters
- Basic Usage
- Advanced Sampling Techniques
- Use Cases
- Working with Data Frames and Matrices
- Caveats and Pitfalls
- Practical Examples
- Conclusion
1. Introduction to the sample Function
The sample function is a basic yet very useful function for generating random samples in R. It can sample single or multiple elements, with or without replacement, and optionally with a probability weight for each element.
2. Syntax and Parameters
The basic syntax of the sample function is as follows:
sample(x, size, replace = FALSE, prob = NULL)
- x: A vector of one or more elements to sample from, or a positive number to sample from 1:x.
- size: The number of items to return.
- replace: Should sampling be with replacement? Default is FALSE.
- prob: A vector of probability weights for each element in x.
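As a rough illustration of how the four arguments interact, the sketch below calls sample with each of them in turn. The seed value and the colors vector are arbitrary choices made for this example, not part of the function itself.

```r
# Illustrative only: the seed and the `colors` vector are arbitrary choices.
set.seed(42)

colors <- c("red", "blue", "green", "yellow")

# x only: size defaults to length(x), so this returns a permutation of x
perm <- sample(colors)

# x and size: two distinct elements, because replace defaults to FALSE
two <- sample(colors, size = 2)

# replace = TRUE allows repeats, so size may exceed length(x)
six <- sample(colors, size = 6, replace = TRUE)

# prob: one weight per element of x ("red" is favoured here)
weighted <- sample(colors, size = 6, replace = TRUE,
                   prob = c(0.7, 0.1, 0.1, 0.1))

# A single positive number n is treated as the range 1:n
from_range <- sample(5, size = 3)
```

Because the draws are random, the exact values depend on the seed; only the lengths and the set of possible values are fixed by the arguments.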
3. Basic Usage
Sampling from a Vector
sample(c("red", "blue", "green"), 2)
This randomly selects 2 distinct colors from the given vector, because sampling is without replacement by default.
Sampling from a Range
sample(10, 5)
This randomly selects 5 distinct integers from 1 to 10. Passing a single positive number n is shorthand for sampling from 1:n.
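The single-number shorthand can be checked directly, and it also hides a surprise when a vector happens to have length one. The sketch below is one way to see both; the seed and the values vector are assumptions made for illustration.

```r
# With the same seed, the shorthand and the explicit range give the same draw
set.seed(1)
a <- sample(10, 5)
set.seed(1)
b <- sample(1:10, 5)
same_draw <- identical(a, b)

# Pitfall: a numeric vector of length one is also treated as a range.
values <- c(7)   # hypothetical vector that happens to hold a single number
# sample(values, 1) would draw from 1:7 rather than return 7.
# Sampling a position and indexing avoids that:
pick <- values[sample.int(length(values), 1)]
```

Here same_draw is TRUE and pick is always 7, whereas sample(values, 1) could return any integer from 1 to 7.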
4. Advanced Sampling Techniques
Sampling with Replacement
sample(1:10, 5, replace = TRUE)
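One practical consequence of replace = TRUE is that the sample may be larger than the population. The following sketch, using an arbitrary seed, contrasts the two settings for a population of three values.

```r
set.seed(123)

# Without replacement, asking for 5 items from 3 values raises an error;
# tryCatch turns that error into a marker string for this demonstration
result <- tryCatch(sample(1:3, 5), error = function(e) "error")

# With replacement, the same request succeeds and values must repeat
draws <- sample(1:3, 5, replace = TRUE)
has_repeats <- anyDuplicated(draws) > 0
```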
Weighted Sampling
sample(1:3, 5, replace = TRUE, prob = c(0.1, 0.3, 0.6))
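To see what the weights do, it can help to draw a large sample and compare the observed proportions with prob. This is only a sanity check under an arbitrary seed and sample size, not a formal test.

```r
set.seed(2024)

# Draw many values so the observed proportions settle near the weights
draws <- sample(1:3, 10000, replace = TRUE, prob = c(0.1, 0.3, 0.6))

# factor() with explicit levels keeps all three categories in the table
freq <- prop.table(table(factor(draws, levels = 1:3)))
print(round(freq, 3))
```

The printed proportions should land close to 0.1, 0.3, and 0.6, with small deviations due to sampling noise.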
5. Use Cases
Bootstrapping
boot_sample <- function(data, n) {
  # Draw n values from data with replacement
  sample(data, n, replace = TRUE)
}
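A helper like boot_sample is typically called many times to approximate the sampling distribution of a statistic. The sketch below assumes simulated data and 2000 replicates, both chosen arbitrarily, and estimates a bootstrap standard error and percentile interval for the mean.

```r
set.seed(10)

# Hypothetical data: 50 simulated measurements
data <- rnorm(50, mean = 100, sd = 15)

# Resample with replacement
boot_sample <- function(data, n) {
  sample(data, n, replace = TRUE)
}

# 2000 bootstrap replicates of the sample mean, each the size of the data
boot_means <- replicate(2000, mean(boot_sample(data, length(data))))

# Bootstrap standard error and a 95% percentile interval
boot_se <- sd(boot_means)
boot_ci <- quantile(boot_means, c(0.025, 0.975))
```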
Shuffle a Vector
sample(1:10)
Randomly Splitting Data
# df is any data frame; floor() keeps the training size a whole number
indices <- sample(seq_len(nrow(df)), floor(0.7 * nrow(df)))
train_set <- df[indices, ]
test_set <- df[-indices, ]
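For a version that runs on its own, the sketch below builds a small hypothetical data frame and checks that the split is a true partition. The id and value columns and the 70/30 ratio are assumptions for illustration.

```r
set.seed(99)

# Hypothetical data frame, used only for this sketch
df <- data.frame(id = 1:20, value = rnorm(20))

# Training size as a whole number of rows
n_train <- floor(0.7 * nrow(df))

# Sample row positions without replacement
indices <- sample(seq_len(nrow(df)), n_train)

train_set <- df[indices, ]
test_set  <- df[-indices, ]

# Every row ends up in exactly one of the two sets
no_overlap <- length(intersect(train_set$id, test_set$id)) == 0
```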
6. Working with Data Frames and Matrices
Random Row Sampling
# Install dplyr if you haven't
# install.packages("dplyr")
# Load dplyr
library(dplyr)
# Create a sample data frame
df <- data.frame(x = 1:10, y = 11:20)
# Sample 5 rows (in dplyr 1.0.0 and later, slice_sample(df, n = 5) supersedes sample_n)
sampled_df <- sample_n(df, 5)
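The same row sampling can be done with sample alone, which avoids the dplyr dependency. This is a minimal base R sketch with an arbitrary seed.

```r
set.seed(7)

df <- data.frame(x = 1:10, y = 11:20)

# Sample 5 row positions without replacement, then index the data frame
rows <- sample(nrow(df), 5)
sampled_df <- df[rows, ]
```

Each sampled row keeps its x and y values together, because whole rows are selected by position.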
Random Column Sampling
df[sample(ncol(df), 2)]
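The same indexing idea carries over to matrices. The sketch below uses a small hypothetical matrix and an arbitrary seed; drop = FALSE is included so the result stays a matrix even when a single row or column is selected.

```r
set.seed(3)

# Hypothetical 5 x 4 matrix
m <- matrix(1:20, nrow = 5, ncol = 4)

# Sample 2 rows and, separately, 3 columns
row_sample <- m[sample(nrow(m), 2), , drop = FALSE]
col_sample <- m[, sample(ncol(m), 3), drop = FALSE]

# Shuffle every entry while keeping the original dimensions
shuffled <- matrix(sample(m), nrow = nrow(m))
```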
7. Caveats and Pitfalls
- Randomness: The sample function relies on R's pseudo-random number generator, so call set.seed() before sampling whenever results need to be reproducible.
- Performance: For very large samples, consider the efficiency of your sampling strategy.
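Both caveats can be illustrated briefly. In the sketch below the seed value is arbitrary, and sample.int is shown as one base R option for drawing from 1:n by passing only n; whether it matters for performance depends on the workload.

```r
# Reproducibility: the same seed gives the same draw
set.seed(2023)
first <- sample(1:100, 5)
set.seed(2023)
second <- sample(1:100, 5)
reproducible <- identical(first, second)

# Without resetting the seed, the next call continues the random stream
third <- sample(1:100, 5)

# Large ranges: sample.int(n, size) takes n directly instead of a vector
big_draw <- sample.int(1e6, 5)
```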
8. Practical Examples
Monte Carlo Simulation
mean_estimates <- replicate(1000, mean(sample(1:6, 10, replace = TRUE)))
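The simulated means can be compared with what theory predicts for a fair die: an expected value of 3.5 and, for the mean of 10 rolls, a standard deviation of sqrt((35/12)/10), roughly 0.54. The sketch below repeats the simulation with an arbitrary seed and summarises the result.

```r
set.seed(321)

# 1000 simulated experiments, each the mean of 10 fair die rolls
mean_estimates <- replicate(1000, mean(sample(1:6, 10, replace = TRUE)))

# Summaries to compare with the theoretical values
overall_mean <- mean(mean_estimates)
spread <- sd(mean_estimates)
theoretical_sd <- sqrt((35 / 12) / 10)
```

With 1000 replicates, overall_mean and spread should be close to 3.5 and theoretical_sd, though not exactly equal.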
Stratified Sampling
# Assumes df has a grouping column named category with at least 5 rows per group
stratified_sample <- df %>% group_by(category) %>% sample_n(5)
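Stratified sampling can also be sketched in base R by sampling row positions within each group. The data frame below, including its category and value columns, is hypothetical and exists only to make the example runnable.

```r
set.seed(5)

# Hypothetical data: three strata with 10 rows each
df <- data.frame(category = rep(c("a", "b", "c"), each = 10),
                 value = 1:30)

# Split row positions by stratum
idx_by_group <- split(seq_len(nrow(df)), df$category)

# Sample 5 positions within each stratum; indexing by sampled position
# stays safe even if a stratum held a single row
picked <- unlist(lapply(idx_by_group, function(idx) {
  idx[sample.int(length(idx), 5)]
}))

stratified_sample <- df[picked, ]
```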
9. Conclusion
The sample function in R is a versatile and powerful function for generating random samples. Whether you are performing basic random draws or complicated simulations, sample provides the flexibility and functionality to meet your needs. Its syntax is simple, but its applications are many, ranging from data splitting for machine learning to sophisticated statistical simulations. By mastering the sample function, you’ll gain a fundamental tool for statistical programming in R.