### Introduction

Stratified sampling is a statistical technique used to generate a sample population that’s representative of the groups within a larger population. In stratified sampling, the population is partitioned into non-overlapping groups, or strata, and a sample is selected by some design within each stratum.

In the Python ecosystem, we have powerful libraries such as `numpy`, `pandas`, and `scikit-learn` that help us to easily implement stratified sampling. This article will guide you through how to do stratified sampling in Python.

### Prerequisites

To follow along with this article, you’ll need to have Python installed on your machine. You’ll also need to have the following Python libraries installed:

`numpy`

`pandas`

`scikit-learn`

If you don’t have these installed, you can install them using `pip`:

`pip install numpy pandas scikit-learn`

### The Concept of Stratified Sampling

Before we dive into the code, it’s important to understand the concept of stratified sampling. Suppose you’re carrying out a survey of households in a city. The city has several neighborhoods, each with different average incomes, populations, and other characteristics. If you choose households at random across the city, you might end up with a sample that’s skewed towards one or more neighborhoods.

Stratified sampling addresses this issue by dividing the population into separate groups, or strata, based on one or more characteristics. You then sample from each group separately to ensure your sample is representative of the overall population. In the city survey example, you might divide the city into neighborhoods and sample from each one, typically in proportion to its number of households.
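To make the allocation step concrete, here is a minimal sketch of two common ways to split a total sample across strata: proportional allocation and equal allocation. The neighborhood names and household counts are hypothetical, chosen so the numbers divide evenly; real data usually needs a rounding rule.

```python
# Hypothetical household counts per neighborhood (illustrative values only)
households = {"North": 5000, "Central": 3000, "South": 2000}
total_sample = 100

total_households = sum(households.values())

# Proportional allocation: a stratum's share of the sample matches
# its share of the population
proportional = {
    name: total_sample * count // total_households
    for name, count in households.items()
}

# Equal allocation: the same number of households from every stratum
# (floor division, so the counts may not add up to total_sample)
equal = {name: total_sample // len(households) for name in households}

print(proportional)
print(equal)
```

Proportional allocation keeps the sample’s composition close to the population’s, while equal allocation gives small strata more weight and generally calls for reweighting when estimating city-wide figures.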

### Simple Stratified Sampling with NumPy

Let’s start with a simple example using `numpy`. Suppose we have a population divided into three groups, with 100 individuals in group A, 200 in group B, and 300 in group C. We want to sample 60 individuals in total, with the samples distributed proportionally among the groups.

```
import numpy as np
# Create a population array
group_a = np.full(100, 'A')
group_b = np.full(200, 'B')
group_c = np.full(300, 'C')
population = np.concatenate([group_a, group_b, group_c])
# Determine the size of each stratum
size = 60 # total sample size
size_a = round(size * len(group_a) / len(population))
size_b = round(size * len(group_b) / len(population))
size_c = size - size_a - size_b # ensure the total size is correct
# Sample from each stratum
sample_a = np.random.choice(group_a, size_a, replace=False)
sample_b = np.random.choice(group_b, size_b, replace=False)
sample_c = np.random.choice(group_c, size_c, replace=False)
# Combine the samples
sample = np.concatenate([sample_a, sample_b, sample_c])
```

This code first creates a population array with three groups. It then calculates each group’s share of the sample from its share of the population, rounding the first two and assigning the remainder to the third so the total is exactly 60. Finally, it draws a random sample from each group without replacement, so no individual is selected more than once, and combines the samples.
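The same idea can be generalized to any number of strata. The sketch below is one possible approach, not the only one: the helper name `stratified_sample_indices` is hypothetical, it uses NumPy’s `default_rng` generator instead of the legacy `np.random.choice`, and it assigns any leftover units to the strata with the largest fractional remainders.

```python
import numpy as np

def stratified_sample_indices(labels, total_size, seed=None):
    """Return indices of a proportional stratified sample (hypothetical helper)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    strata, counts = np.unique(labels, return_counts=True)

    # Ideal fractional allocation, rounded down
    ideal = total_size * counts / counts.sum()
    sizes = np.floor(ideal).astype(int)

    # Hand out any remaining units to the largest fractional remainders
    leftover = total_size - sizes.sum()
    order = np.argsort(-(ideal - sizes), kind="stable")
    sizes[order[:leftover]] += 1

    # Draw without replacement inside each stratum
    chosen = [
        rng.choice(np.flatnonzero(labels == stratum), size=n, replace=False)
        for stratum, n in zip(strata, sizes)
    ]
    return np.concatenate(chosen)

labels = np.repeat(["A", "B", "C"], [100, 200, 300])
idx = stratified_sample_indices(labels, total_size=60, seed=0)

values, sample_counts = np.unique(labels[idx], return_counts=True)
print({str(v): int(c) for v, c in zip(values, sample_counts)})
```

Returning indices rather than values makes it possible to select the matching rows from any parallel array or DataFrame.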

### Stratified Sampling with Pandas and Scikit-Learn

Let’s now turn to a more realistic example where our data is a pandas DataFrame and we want to sample based on a categorical column. We’ll use the `train_test_split` function from `scikit-learn`, which is normally used to split data into training and test sets but can also be used to perform stratified sampling.

First, let’s create a DataFrame:

```
import numpy as np
import pandas as pd
data = {
    'Category': ['A'] * 100 + ['B'] * 200 + ['C'] * 300,
    'Data': np.random.randn(600)
}
df = pd.DataFrame(data)
```

This DataFrame has two columns, ‘Category’ and ‘Data’. ‘Category’ indicates the group to which each row belongs, while ‘Data’ is some random data.

To perform stratified sampling, we’ll use `train_test_split` with the `stratify` parameter:

```
from sklearn.model_selection import train_test_split
# We want 20% of the data
sample_size = 0.2
# Stratified sampling: the second return value is the test split, which is our sample
_, df_sample = train_test_split(df, test_size=sample_size, stratify=df['Category'], random_state=42)
```

The `train_test_split` function splits the DataFrame into two sets. The `test_size` parameter specifies the proportion of the data to include in the test split, which will be our sample. The `stratify` parameter takes the labels to stratify on, here the ‘Category’ column. The function returns two DataFrames, the training split followed by the test split, so we keep the second one, which is our 20% sample, and discard the first. The `random_state` parameter ensures that the operation is reproducible.
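As an alternative that stays within pandas, a grouped sample can produce the same kind of proportional draw. This is a sketch assuming pandas 1.1 or newer, where `DataFrameGroupBy.sample` is available; the variable name `df_sample_groupby` is only illustrative, and the DataFrame is rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Category": ["A"] * 100 + ["B"] * 200 + ["C"] * 300,
    "Data": rng.standard_normal(600),
})

# Draw 20% of the rows from each category, without replacement
df_sample_groupby = df.groupby("Category").sample(frac=0.2, random_state=42)

print(df_sample_groupby["Category"].value_counts().sort_index())
```

Passing `n=` instead of `frac=` would draw a fixed number of rows per category, which corresponds to equal rather than proportional allocation.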

### Verifying the Sampling

After performing stratified sampling, it’s a good idea to verify that the sampling was done correctly. We can do this by comparing the distribution of categories in the sample with the distribution in the original data:

```
print(df['Category'].value_counts(normalize=True))
print(df_sample['Category'].value_counts(normalize=True))
```

The `value_counts(normalize=True)` method counts the number of occurrences of each category and normalizes the results to get proportions. If the sampling was done correctly, the proportions in the sample should be roughly the same as in the original data.
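Comparing printed proportions by eye works for a few categories, but the check can also be automated. The sketch below reduces the comparison to a single number; the helper `max_proportion_gap` and the 0.01 tolerance are assumptions made for illustration, not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def max_proportion_gap(population, sample, column):
    """Largest absolute difference in category proportions (hypothetical helper)."""
    pop_share = population[column].value_counts(normalize=True)
    sample_share = sample[column].value_counts(normalize=True)
    # A category missing from the sample counts as a share of 0
    sample_share = sample_share.reindex(pop_share.index, fill_value=0.0)
    return float((pop_share - sample_share).abs().max())

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Category": ["A"] * 100 + ["B"] * 200 + ["C"] * 300,
    "Data": rng.standard_normal(600),
})

# The test split (second return value) is the 20% stratified sample
_, df_sample = train_test_split(
    df, test_size=0.2, stratify=df["Category"], random_state=42
)

gap = max_proportion_gap(df, df_sample, "Category")
print(f"Largest proportion gap: {gap:.4f}")
assert gap < 0.01, "sample proportions drift from the population"
```

A check like this can be dropped into a test suite so that a change to the sampling code that breaks stratification fails loudly.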

### Conclusion

Stratified sampling is a powerful technique for generating representative samples from a population. Python, with libraries such as `numpy`, `pandas`, and `scikit-learn`, provides easy-to-use tools to perform stratified sampling. Whether you’re conducting a survey or building a machine learning model, stratified sampling can help you ensure your results are valid and reliable.