
The Kolmogorov-Smirnov (K-S) test is a non-parametric statistical test that lets you compare a sample with a reference probability distribution (one-sample K-S test) or compare two samples with each other (two-sample K-S test). It’s often used to check goodness-of-fit or to compare empirical distributions. This article provides a comprehensive guide on how to perform both versions of the K-S test in Python, using libraries like NumPy and SciPy.
Importing the Libraries
First, import the required libraries:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
Generating Sample Datasets
Let’s generate two random datasets for our examples:
# Set the seed for reproducibility
np.random.seed(0)
# Generate two sample datasets
data1 = np.random.normal(loc=0, scale=1, size=100)
data2 = np.random.normal(loc=0.5, scale=1.5, size=100)
Here, we’ve created two datasets of size 100 drawn from normal distributions: data1 with a mean (loc) of 0 and a standard deviation (scale) of 1, and data2 with a mean of 0.5 and a standard deviation of 1.5.
Visualizing the Datasets
Before conducting the K-S test, it’s beneficial to visualize the data. We’ll use histograms and cumulative distribution function (CDF) plots for this:
# Plot histograms
plt.figure(figsize=(10, 5))
sns.histplot(data1, kde=True, color='blue', label='Data1')
sns.histplot(data2, kde=True, color='red', label='Data2')
plt.title('Histograms')
plt.legend()
plt.show()
# Plot CDFs
plt.figure(figsize=(10, 5))
sns.ecdfplot(data1, color='blue', label='Data1')
sns.ecdfplot(data2, color='red', label='Data2')
plt.title('CDFs')
plt.legend()
plt.show()
Histograms give a graphical view of how the data is distributed, while a CDF plot shows, for each value, the probability that a random variable is less than or equal to that value.
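To make the CDF idea concrete, here is a minimal sketch that computes an empirical CDF by hand with NumPy; sns.ecdfplot draws the same curve. The variable names ecdf_x and ecdf_y are just illustrative.
# Sketch: compute the empirical CDF of data1 directly
ecdf_x = np.sort(data1)
ecdf_y = np.arange(1, len(data1) + 1) / len(data1)  # fraction of points <= each value
plt.figure(figsize=(10, 5))
plt.step(ecdf_x, ecdf_y, where='post', color='blue', label='Manual ECDF of Data1')
plt.title('Manual ECDF')
plt.legend()
plt.show()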
Performing the One-Sample K-S Test
The one-sample K-S test compares a sample with a reference probability distribution. In this example, we’ll test whether data1 follows a standard normal distribution.
# Perform the one-sample K-S test
D, p_value = stats.kstest(data1, 'norm')
# Print the test statistic and p-value
print(f'Test Statistic (D): {D}, P-Value: {p_value}')
The scipy.stats.kstest function performs the K-S test. The first argument is the dataset, and the second identifies the reference cumulative distribution function, either as a callable or, as here, by name: ‘norm’ denotes the standard normal distribution. The function returns the K-S test statistic D and the p-value.
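Because ‘norm’ means a standard normal with mean 0 and standard deviation 1, testing against a normal with other parameters requires passing them via the args keyword. The sketch below tests data2 against a normal whose parameters are estimated from data2 itself; be aware that estimating parameters from the same sample makes the standard K-S p-value only approximate (the Lilliefors correction addresses this).
# Sketch: test against a normal with estimated parameters
# (caveat: parameters estimated from the same sample make the p-value approximate)
D2, p_value2 = stats.kstest(data2, 'norm', args=(data2.mean(), data2.std(ddof=1)))
print(f'Test Statistic (D): {D2}, P-Value: {p_value2}')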
Performing the Two-Sample K-S Test
The two-sample K-S test compares two empirical distributions. We’ll compare data1 and data2.
# Perform the two-sample K-S test
D, p_value = stats.ks_2samp(data1, data2)
# Print the test statistic and p-value
print(f'Test Statistic (D): {D}, P-Value: {p_value}')
The scipy.stats.ks_2samp function carries out the two-sample K-S test. It takes the two datasets as arguments and returns the K-S test statistic D and the p-value.
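To build intuition for what D measures, here is a minimal sketch that recomputes the two-sample statistic by hand: D is the largest vertical gap between the two empirical CDFs, which is always attained at one of the pooled sample points.
# Sketch: recompute the two-sample K-S statistic as the max gap between ECDFs
pooled = np.sort(np.concatenate([data1, data2]))
cdf1 = np.searchsorted(np.sort(data1), pooled, side='right') / len(data1)
cdf2 = np.searchsorted(np.sort(data2), pooled, side='right') / len(data2)
D_manual = np.max(np.abs(cdf1 - cdf2))
print(f'Manual D: {D_manual}')  # should match the statistic from stats.ks_2samp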
Interpreting the Results
When interpreting the results, you’ll typically choose a significance level (alpha, α) before performing the test. A common choice is 0.05:
- If the p-value is less than α, you reject the null hypothesis and conclude that the data does not follow the reference distribution (one-sample) or that the two datasets have different distributions (two-sample).
- If the p-value is greater than α, you do not reject the null hypothesis and conclude that the data could come from the reference distribution (one-sample) or that the two datasets could have the same distribution (two-sample).
alpha = 0.05
if p_value < alpha:
    print('The null hypothesis can be rejected.')
else:
    print('The null hypothesis cannot be rejected.')
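If you run several K-S tests, it can help to wrap this decision rule in a small function. The helper below is a hypothetical convenience, not part of SciPy; it relies on the fact that both kstest and ks_2samp return a result with a pvalue attribute.
# Hypothetical helper (not part of SciPy): apply the alpha decision rule
def interpret_ks(p_value, alpha=0.05):
    if p_value < alpha:
        return 'The null hypothesis can be rejected.'
    return 'The null hypothesis cannot be rejected.'

print(interpret_ks(stats.kstest(data1, 'norm').pvalue))
print(interpret_ks(stats.ks_2samp(data1, data2).pvalue))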
Conclusion
The Kolmogorov-Smirnov test is a powerful tool for comparing distributions and checking goodness-of-fit. It’s a non-parametric test, meaning it makes no assumptions about the specific shape of the population distribution. However, like any statistical test, it has limitations. For example, it is sensitive to sample size, and it may not perform as well when the data contains many repeated values (ties), a characteristic of discrete distributions. As such, it’s best used alongside other exploratory and inferential statistical methods when analyzing your data.
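As a quick illustration of the sample-size sensitivity, the sketch below compares two normal distributions whose means differ by only 0.02. With half a million points per sample, the test will typically flag even this practically negligible difference as statistically significant.
# Sketch: with very large samples, a tiny difference becomes 'significant'
rng = np.random.default_rng(0)
big1 = rng.normal(loc=0.0, scale=1.0, size=500_000)
big2 = rng.normal(loc=0.02, scale=1.0, size=500_000)
D_big, p_big = stats.ks_2samp(big1, big2)
print(f'Test Statistic (D): {D_big}, P-Value: {p_big}')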