How to Perform Systematic Sampling in Python

Introduction

Sampling techniques play a significant role in data collection for statistical analysis. Among them, systematic sampling is one of the simplest and most widely used. In systematic sampling, we pick a random starting point and then select every nth member of the population to form a sample. The method is straightforward, and it can be more precise than simple random sampling when the population is ordered by a variable related to the quantity being measured, because the sample is then spread evenly across that ordering.

In this article, we’ll walk through implementing systematic sampling in Python using the numpy and pandas libraries.

Understanding Systematic Sampling

Systematic sampling involves selecting every nth member of a population. Here’s how it works: first, we decide on a sampling interval (n), usually calculated as the population size divided by the desired sample size (rounded down if the division is not exact). We then select a random starting point between 1 and n, inclusive, and from there select every nth member.

For instance, suppose we have a population of 1000 individuals, and we want to sample 100 of them. Our sampling interval would be 1000 / 100 = 10, and if we randomly chose 5 as our starting point, we would select individuals 5, 15, 25, and so on, until we had our complete sample.
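
The arithmetic in this example can be checked with a few lines of plain Python. This is only a minimal sketch: it uses 1-based labels to match the example, the variable names are illustrative, and the starting point is fixed at 5 instead of being drawn at random.

```python
population_size = 1000
sample_size = 100

# Sampling interval: population size divided by the desired sample size
interval = population_size // sample_size

# Fixed at 5 to mirror the example; normally drawn at random from 1..interval
start = 5

# 1-based labels of the selected individuals: 5, 15, 25, ...
selected = list(range(start, population_size + 1, interval))

print(interval)       # 10
print(selected[:3])   # [5, 15, 25]
print(selected[-1])   # 995
print(len(selected))  # 100
```

The last selected individual is 995, and the sample contains exactly 100 members because 1000 is an exact multiple of 100.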

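One key assumption of systematic sampling is that the order of the population contains no hidden pattern, in particular no periodic pattern, that lines up with the sampling interval. If the characteristic we’re interested in rises and falls in a cycle that matches the interval, every selected member falls at the same point in the cycle, and the sample can be badly unrepresentative.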
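
To see how an unlucky ordering can distort a systematic sample, consider a deliberately extreme, hypothetical population whose values repeat with a period of 10, sampled with an interval of 10. The sketch below is an illustration under that assumption, not a description of any real dataset.

```python
import numpy as np

# Hypothetical population: the values 0..9 repeated 100 times (period of 10)
population = np.tile(np.arange(10), 100)

interval = 10
true_mean = float(population.mean())

# Mean of every possible systematic sample for this interval
sample_means = [float(population[start::interval].mean())
                for start in range(interval)]

print(true_mean)      # 4.5
print(sample_means)   # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```

Because the interval matches the period, each sample contains a single repeated value, so its mean equals the starting offset and none of the ten possible samples reproduces the population mean of 4.5.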

Systematic Sampling with NumPy

Let’s begin with a simple example of systematic sampling using the numpy library. Assume we have a population of 1000 individuals and want to select a sample of 100.

First, let’s create our population:

import numpy as np

# Create a population
population_size = 1000
population = np.arange(population_size)

Here, we’ve created an array of 1000 elements to represent our population. Now, let’s perform systematic sampling:

# Decide on a sampling interval
sample_size = 100
interval = population_size // sample_size

# Choose a random starting point
start = np.random.randint(0, interval)

# Select every nth member of the population
sample = population[start::interval]

Here, np.random.randint(0, interval) chooses a random integer from 0 up to, but not including, the sampling interval (0 to 9 in this example), which matches Python’s zero-based indexing. population[start::interval] then selects every nth member of the population, beginning at that starting point.
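
The steps above can be wrapped in a small reusable function. The following is a minimal sketch, not a definitive implementation: the name systematic_sample and its seed parameter are assumptions introduced here for illustration, and it uses NumPy’s Generator API (np.random.default_rng) in place of np.random.randint so that a run can be reproduced. It also shows that when the population size is not an exact multiple of the sample size, the floor-division interval can return more elements than requested.

```python
import numpy as np

def systematic_sample(population, sample_size, seed=None):
    """Return a systematic sample of a 1-D array (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    interval = len(population) // sample_size
    if interval < 1:
        raise ValueError("sample_size cannot exceed the population size")
    # Random start with 0 <= start < interval (upper bound is exclusive)
    start = int(rng.integers(0, interval))
    return population[start::interval]

# 1005 is not a multiple of 100, so the interval is rounded down to 10
population = np.arange(1005)
sample = systematic_sample(population, 100, seed=0)

print(len(sample))  # 100 or 101, depending on the random start
```

If an exact sample size matters, one simple option is to truncate the result, for example sample[:100], at the cost of never selecting the last few members of the population.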

Systematic Sampling with Pandas

Systematic sampling can also be performed when dealing with pandas DataFrames. Assume we have a DataFrame with 1000 rows, and we want to select a systematic sample of 100 rows.

First, let’s create our DataFrame:

import pandas as pd

# Create a DataFrame
data = pd.DataFrame({
    'Data': np.random.randn(1000)
})

Our DataFrame consists of a single column, ‘Data’, filled with 1000 random numbers drawn from a standard normal distribution.

Now, let’s perform systematic sampling on this DataFrame:

# Decide on a sampling interval
sample_size = 100
interval = len(data) // sample_size

# Choose a random starting point
start = np.random.randint(0, interval)

# Select every nth member of the DataFrame
sample = data.iloc[start::interval]

Just like with the numpy array, we choose a random starting point and select every nth row of the DataFrame. However, instead of slicing the object directly as with the numpy array, we use the iloc indexer, which selects rows by integer position regardless of the DataFrame’s index labels.
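
Because iloc works on positions, the same approach applies to a DataFrame whose index is not 0, 1, 2, and so on, or whose rows have been reordered first. The sketch below assumes a hypothetical helper named systematic_sample_df with an optional seed parameter, and made-up string labels of the form id_0, id_1, and so on; none of these come from pandas itself.

```python
import numpy as np
import pandas as pd

def systematic_sample_df(df, sample_size, seed=None):
    """Positional systematic sample of a DataFrame (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    interval = len(df) // sample_size
    if interval < 1:
        raise ValueError("sample_size cannot exceed the number of rows")
    start = int(rng.integers(0, interval))  # 0 <= start < interval
    return df.iloc[start::interval]

# DataFrame with string labels instead of the default integer index
df = pd.DataFrame(
    {"Data": np.random.default_rng(1).standard_normal(1000)},
    index=[f"id_{i}" for i in range(1000)],
)

# Optionally order the rows before sampling, here by the 'Data' column
ordered = df.sort_values("Data")
sample = systematic_sample_df(ordered, 100, seed=42)

print(len(sample))  # 100
```

Sorting before sampling spreads the selected rows evenly across the range of the sorting variable. Whether that is desirable depends on the analysis, so treat it as one option and not as a required step.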

Verifying the Sampling

After performing systematic sampling, it can be useful to verify the sample. You can check the size of the sample and plot the selected elements to confirm that the sampling process worked as expected:

import matplotlib.pyplot as plt

# Check the size of the sample
print("Sample size:", len(sample))

# Plot every row position, then overlay the sampled rows in red
plt.scatter(np.arange(len(data)), [1] * len(data), label='population')
plt.scatter(sample.index, [1] * len(sample), color='r', label='sample')
plt.legend()
plt.show()

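This code first prints the size of the sample. It then creates a scatter plot where all elements of the population are shown in blue (matplotlib’s default color) and the selected elements are shown in red. Note that sample.index gives the original row positions only because the DataFrame uses the default integer index.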
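
A plot is a quick visual check, but the same properties can also be verified numerically. The following sketch rebuilds a small DataFrame like the one above, fixes the starting point at 3 (an arbitrary choice made here so the check is reproducible), and confirms that the gaps between selected positions all equal the interval.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"Data": np.random.default_rng(0).standard_normal(1000)})

interval = len(data) // 100
start = 3  # fixed for reproducibility; normally chosen at random
sample = data.iloc[start::interval]

# With the default integer index, labels equal row positions
positions = sample.index.to_numpy()
gaps = np.diff(positions)

print(len(sample))      # 100
print(positions[:3])    # [ 3 13 23]
print(np.unique(gaps))  # [10]
```

A single unique gap equal to the interval indicates that the rows were selected at regular spacing.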

Conclusion

Systematic sampling is a valuable technique in statistics when you need a simple yet efficient way to create a representative sample from a larger population. Python’s powerful libraries, like numpy and pandas, provide straightforward and intuitive ways to implement systematic sampling. As with any sampling technique, it’s crucial to be aware of potential biases and the assumptions involved to ensure the validity of your results.
