How to Perform Stratified Sampling in Python

Introduction

Stratified sampling is a statistical technique for drawing a sample that represents each of the groups within a larger population. The population is partitioned into non-overlapping groups, or strata, and a sample is then drawn from each stratum separately, most often by simple random sampling.

In the Python ecosystem, we have powerful libraries such as numpy, pandas, and scikit-learn that help us to easily implement stratified sampling. This article will guide you through how to do stratified sampling in Python.

Prerequisites

To follow along with this article, you’ll need to have Python installed on your machine. You’ll also need to have the following Python libraries installed:

  • numpy
  • pandas
  • scikit-learn

If you don’t have these installed, you can install them using pip:

pip install numpy pandas scikit-learn

The Concept of Stratified Sampling

Before we dive into the code, it’s important to understand the concept of stratified sampling. Suppose you’re carrying out a survey of households in a city. The city has several neighborhoods, each with different average incomes, populations, and other characteristics. If you choose households at random across the city, you might end up with a sample that’s skewed towards one or more neighborhoods.

Stratified sampling addresses this issue by dividing the population into separate groups, or strata, based on one or more characteristics. You then sample from each group separately to ensure your sample is representative of the overall population. In the city survey example, you might treat each neighborhood as a stratum and sample from each one in proportion to its number of households.
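To make the idea of proportional sampling concrete, here is a minimal sketch of how a total sample size could be divided among neighborhoods. The neighborhood names and household counts are hypothetical, and the helper proportional_allocation is an illustrative function written for this article, not part of any library. It assigns leftover units to the largest fractional remainders, which is one common way to handle rounding.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to their sizes."""
    population = sum(stratum_sizes.values())
    # Exact (possibly fractional) share for each stratum
    exact = {k: total_sample * n / population for k, n in stratum_sizes.items()}
    # Start from the integer part of each share
    alloc = {k: int(v) for k, v in exact.items()}
    # Give any leftover units to the strata with the largest fractional remainders
    leftover = total_sample - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


# Hypothetical household counts per neighborhood (illustrative only)
households = {"North": 1200, "Center": 3000, "South": 1800}
allocation = proportional_allocation(households, 100)
print(allocation)  # {'North': 20, 'Center': 50, 'South': 30}
```

Each neighborhood receives a share of the 100 interviews equal to its share of the 6,000 households.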

Simple Stratified Sampling with NumPy

Let’s start with a simple example using numpy. Suppose we have a population divided into three groups, with 100 individuals in group A, 200 in group B, and 300 in group C. We want to sample 60 individuals in total, with the samples distributed proportionally among the groups.

import numpy as np

# Create a population array
group_a = np.full(100, 'A')
group_b = np.full(200, 'B')
group_c = np.full(300, 'C')
population = np.concatenate([group_a, group_b, group_c])

# Determine the size of each stratum
size = 60  # total sample size
size_a = round(size * len(group_a) / len(population))
size_b = round(size * len(group_b) / len(population))
size_c = size - size_a - size_b  # ensure the total size is correct

# Sample from each stratum
sample_a = np.random.choice(group_a, size_a, replace=False)
sample_b = np.random.choice(group_b, size_b, replace=False)
sample_c = np.random.choice(group_c, size_c, replace=False)

# Combine the samples
sample = np.concatenate([sample_a, sample_b, sample_c])

This code first creates a population array made up of three groups. It then calculates how many individuals to draw from each group, based on that group’s share of the population (10 from A, 20 from B, and 30 from C). Finally, it draws a random sample from each group without replacement, so no individual is selected more than once, and combines the three samples.
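The same steps can be wrapped in a reusable function. The sketch below is one possible generalization, not a standard NumPy routine: stratified_indices is a hypothetical helper name, and it uses NumPy’s newer Generator API (np.random.default_rng) instead of np.random.choice. It returns row positions rather than labels, so the result can be used to index any array aligned with the labels.

```python
import numpy as np


def stratified_indices(labels, frac, seed=None):
    """Return positions of a proportional stratified sample.

    labels: 1-D array of stratum labels, one per individual.
    frac:   fraction of each stratum to draw (0 < frac <= 1).
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    chosen = []
    for stratum in np.unique(labels):
        members = np.flatnonzero(labels == stratum)  # positions in this stratum
        n = int(round(frac * len(members)))          # per-stratum sample size
        chosen.append(rng.choice(members, size=n, replace=False))
    return np.concatenate(chosen)


# Same population as above: 100 A, 200 B, 300 C
labels = np.repeat(["A", "B", "C"], [100, 200, 300])
idx = stratified_indices(labels, frac=0.1, seed=0)
values, counts = np.unique(labels[idx], return_counts=True)
print(values, counts)
```

Because each stratum is rounded independently, the total may differ by a unit or two from frac times the population size when the stratum sizes do not divide evenly.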

Stratified Sampling with Pandas and Scikit-Learn

Let’s now turn to a more realistic example where our data is a pandas DataFrame and we want to sample based on a categorical column. We’ll use the train_test_split function from scikit-learn, which is normally used to split data into training and test sets but can also be used to perform stratified sampling.

First, let’s create a DataFrame:

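import numpy as np
import pandas as pd

# 600 rows: 100 in category A, 200 in B, 300 in C
data = {
    'Category': ['A'] * 100 + ['B'] * 200 + ['C'] * 300,
    'Data': np.random.randn(600)  # random values standing in for real measurements
}
df = pd.DataFrame(data)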

This DataFrame has two columns, ‘Category’ and ‘Data’. ‘Category’ indicates the group to which each row belongs, while ‘Data’ is some random data.

To perform stratified sampling, we’ll use train_test_split with the stratify parameter:

from sklearn.model_selection import train_test_split

# We want 20% of the data
sample_size = 0.2

# Stratified sampling: train_test_split returns (train, test), and
# test_size sets the size of the second part, so that is the one we keep
_, df_sample = train_test_split(df, test_size=sample_size, stratify=df['Category'], random_state=42)

The train_test_split function splits the DataFrame into two sets and returns them in order: the training split first, then the test split. The test_size parameter specifies the proportion of the data to include in the test split, so the second DataFrame is our 20% sample and we discard the first. The stratify parameter takes the labels to stratify on (here the ‘Category’ column), so that each category keeps the same proportion in both splits. The random_state parameter makes the operation reproducible.
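If scikit-learn is not otherwise needed, pandas can draw the same kind of sample on its own. The following is a sketch of that alternative, assuming pandas 1.1 or later, where groupby objects have a sample method. The names demo_df and grouped_sample are illustrative, and the DataFrame is rebuilt here so the snippet runs by itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "Category": ["A"] * 100 + ["B"] * 200 + ["C"] * 300,
    "Data": rng.standard_normal(600),
})

# Draw 20% of the rows within each category (requires pandas >= 1.1)
grouped_sample = demo_df.groupby("Category").sample(frac=0.2, random_state=42)

print(grouped_sample["Category"].value_counts().sort_index())
```

With frac, every category contributes the same fraction of its rows, which gives proportional allocation. Passing n instead of frac would draw the same number of rows from every category, which is a different design (equal allocation).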

Verifying the Sampling

After performing stratified sampling, it’s a good idea to verify that the sampling was done correctly. We can do this by comparing the distribution of categories in the sample with the distribution in the original data:

print(df['Category'].value_counts(normalize=True))
print(df_sample['Category'].value_counts(normalize=True))

The value_counts(normalize=True) method counts the occurrences of each category and divides by the total number of rows, so it returns proportions instead of raw counts. If the sampling was done correctly, the proportions in the sample should be roughly the same as in the original data (here about 0.17 for A, 0.33 for B, and 0.50 for C).
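Instead of comparing the two printouts by eye, the check can be automated. The sketch below places both sets of proportions side by side and asserts that they differ by no more than a tolerance. The 0.01 threshold is an arbitrary assumption chosen for illustration, and the data is rebuilt so that the snippet is self-contained.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"Category": ["A"] * 100 + ["B"] * 200 + ["C"] * 300})

# Keep the test split (second return value) as the 20% sample
_, sample = train_test_split(
    df, test_size=0.2, stratify=df["Category"], random_state=42
)

pop = df["Category"].value_counts(normalize=True)
smp = sample["Category"].value_counts(normalize=True)

# Align the two Series on category and measure the gap
comparison = pd.DataFrame({"population": pop, "sample": smp})
comparison["abs_diff"] = (comparison["population"] - comparison["sample"]).abs()
print(comparison)

tolerance = 0.01  # assumed threshold; adjust to your needs
assert comparison["abs_diff"].max() <= tolerance
```

A check like this is most useful on real data, where category sizes rarely divide evenly and small rounding differences are expected.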

Conclusion

Stratified sampling is a powerful technique for generating representative samples from a population. Python, with libraries such as numpy, pandas, and scikit-learn, provides easy-to-use tools to perform it. Whether you’re conducting a survey or building a machine learning model, stratified sampling helps ensure that every group appears in your sample in proportion to its share of the population.
