
Introduction
In statistics and data science, there are several methods for selecting a subset of individuals from a population in order to estimate characteristics of the whole population. These methods are known as sampling techniques. One such technique is cluster sampling, a probability sampling technique in which the entire population is divided into groups, or clusters, and a random sample of these clusters is selected.
In this article, we’ll delve into the concept of cluster sampling, why it is used, and how to implement it in Python. We’ll use popular Python libraries, such as numpy, pandas, and sklearn, for our implementations.
Understanding Cluster Sampling
Cluster sampling involves dividing the population into separate groups, known as clusters. A random sample of clusters is then selected, and all individuals within the chosen clusters are included in the sample.
For example, consider a scenario where you need to survey school students across a country. In this case, it might be logistically simpler to randomly select a few schools (clusters) and then survey all students (elements) in these schools, rather than surveying individual students from all schools.
Cluster sampling is different from stratified sampling. Both divide the population into groups, but in stratified sampling a random sample of individuals is drawn from every group (or stratum), whereas in cluster sampling only some groups are selected and all members of the selected clusters are included.
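The difference between the two schemes can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the DataFrame, the column names group and value, the sample sizes, and the seed are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical population: 5 groups of 100 rows each
df = pd.DataFrame({
    "group": np.repeat(list("ABCDE"), 100),
    "value": rng.normal(size=500),
})

# Stratified sampling: draw 20 rows at random from EVERY group
stratified = df.groupby("group").sample(n=20, random_state=0)

# Cluster sampling: draw 2 groups at random and keep ALL of their rows
groups = df["group"].unique().tolist()
chosen = rng.choice(groups, size=2, replace=False)
cluster = df[df["group"].isin(chosen)]

print(stratified["group"].value_counts().sort_index())
print(cluster["group"].value_counts().sort_index())
```

The stratified sample contains a few rows from all five groups, while the cluster sample contains every row from just two of them.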
Simple Cluster Sampling with NumPy
Let’s start with a simple example of cluster sampling in Python using the numpy library.
Suppose we have a population divided into five clusters, each containing 100 individuals. We want to select a sample from this population using cluster sampling.
First, we create the population and the clusters:
import numpy as np
# Create a population divided into 5 clusters
population = np.arange(500)
clusters = np.split(population, 5) # Divide into 5 clusters
Now, let’s randomly select two clusters and take all the individuals in these clusters as our sample:
# Choose 2 clusters randomly
chosen_clusters = np.random.choice(len(clusters), 2, replace=False)
# Get the individuals in the chosen clusters
sample = np.concatenate([clusters[i] for i in chosen_clusters])
In the above code, np.random.choice(len(clusters), 2, replace=False) randomly selects the indices of two distinct clusters. np.concatenate([clusters[i] for i in chosen_clusters]) then combines all the individuals from these chosen clusters to form our sample.
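If the selection needs to be reproducible, one option is NumPy's Generator interface with a fixed seed. The sketch below repeats the same steps under that assumption; the seed value 42 is arbitrary and not part of the original example.

```python
import numpy as np

# Seeded generator so the same clusters are drawn on every run
rng = np.random.default_rng(seed=42)

# Same setup as above: 500 individuals split into 5 equal clusters
population = np.arange(500)
clusters = np.split(population, 5)

# Draw 2 distinct cluster indices, then keep every individual in them
chosen_clusters = rng.choice(len(clusters), size=2, replace=False)
sample = np.concatenate([clusters[i] for i in chosen_clusters])

print("Chosen cluster indices:", chosen_clusters)
print("Sample size:", sample.size)  # 2 clusters x 100 individuals = 200
```

Because each cluster holds 100 individuals, the sample always has 200 elements regardless of which two clusters are drawn.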
Cluster Sampling with Pandas and Scikit-Learn
Let’s move on to a more realistic example where our data is a pandas DataFrame, and we have a column that indicates the cluster of each row.
For this purpose, we will use the train_test_split function from sklearn. This function is usually used to split data into training and test sets, but when it is applied to the list of unique cluster labels rather than to the rows themselves, it can also be used for cluster sampling: it randomly selects a fraction of the clusters.
First, let’s create a DataFrame:
import pandas as pd
data = {
    'Cluster': ['A'] * 100 + ['B'] * 100 + ['C'] * 100 + ['D'] * 100 + ['E'] * 100,
    'Data': np.random.randn(500)
}
df = pd.DataFrame(data)
Here, ‘Cluster’ represents the cluster each row belongs to, and ‘Data’ is some random data.
To perform cluster sampling, we first find the unique clusters and then use train_test_split to randomly select a fraction of these clusters:
from sklearn.model_selection import train_test_split
# Find unique clusters
unique_clusters = df['Cluster'].unique()
# Randomly select 60% of the clusters
chosen_clusters, _ = train_test_split(unique_clusters, test_size=0.4, random_state=42)
# Select all rows that belong to the chosen clusters
sample = df[df['Cluster'].isin(chosen_clusters)]
Here, train_test_split(unique_clusters, test_size=0.4, random_state=42) randomly selects 60% of the clusters: since test_size=0.4, the complement, 0.6, is kept as the first return value, which with five clusters means three clusters. df[df['Cluster'].isin(chosen_clusters)] then selects all rows that belong to these chosen clusters.
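The same selection can also be written without scikit-learn, which makes it possible to ask for an exact number of clusters instead of a fraction. The helper below is a hypothetical sketch: the function name cluster_sample and its parameters are assumptions introduced for illustration, not part of any library.

```python
import numpy as np
import pandas as pd


def cluster_sample(df, cluster_col, n_clusters, seed=None):
    """Return all rows of n_clusters randomly chosen clusters (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    labels = df[cluster_col].unique().tolist()
    if n_clusters > len(labels):
        raise ValueError("n_clusters exceeds the number of available clusters")
    # Pick positions rather than labels so any label type is handled
    picked = rng.choice(len(labels), size=n_clusters, replace=False)
    chosen = [labels[i] for i in picked]
    return df[df[cluster_col].isin(chosen)]


# Usage on a DataFrame shaped like the one above
df = pd.DataFrame({
    "Cluster": ["A"] * 100 + ["B"] * 100 + ["C"] * 100 + ["D"] * 100 + ["E"] * 100,
    "Data": np.random.default_rng(0).normal(size=500),
})
sample = cluster_sample(df, "Cluster", n_clusters=3, seed=42)
print(sample["Cluster"].value_counts())
```

Selecting by count avoids having to reason about how a fraction is rounded when the number of clusters is small.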
Verifying the Sampling
After performing cluster sampling, you can verify the process by inspecting the sample. For instance, you can compare the chosen clusters with the cluster labels that actually appear in the sample; the two should contain the same labels:
print("Chosen clusters: ", chosen_clusters)
print("Unique clusters in the sample: ", sample['Cluster'].unique())
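Printing the labels confirms which clusters are present, but not that every row from those clusters made it into the sample. A stricter check can be sketched with assertions, as below. The DataFrame is rebuilt so the block runs on its own, and the chosen clusters B and D are hard-coded as an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild a DataFrame shaped like the one above
df = pd.DataFrame({
    "Cluster": ["A"] * 100 + ["B"] * 100 + ["C"] * 100 + ["D"] * 100 + ["E"] * 100,
    "Data": np.random.default_rng(0).normal(size=500),
})
chosen_clusters = ["B", "D"]  # assumed selection, for illustration only
sample = df[df["Cluster"].isin(chosen_clusters)]

# Check 1: the sample contains exactly the chosen clusters, nothing else
assert set(sample["Cluster"]) == set(chosen_clusters)

# Check 2: every chosen cluster has as many rows in the sample
# as it has in the full population
population_counts = df["Cluster"].value_counts().loc[chosen_clusters]
sample_counts = sample["Cluster"].value_counts().loc[chosen_clusters]
assert (population_counts == sample_counts).all()

print("Sample passes both checks")
```

If either assertion fails, the sample either leaks rows from unselected clusters or is missing rows from a selected one.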
Conclusion
Cluster sampling is a valuable technique when it’s costly or impractical to perform simple random sampling or stratified sampling. It’s especially useful in large-scale surveys where the population is geographically widespread. With numpy, pandas, and sklearn, cluster sampling takes only a few lines of Python, which makes it easy to apply in statistical studies and machine learning problems.