How to Perform a Kolmogorov-Smirnov Test in Python

The Kolmogorov-Smirnov (K-S) test is a non-parametric statistical test that compares a sample with a reference probability distribution (one-sample K-S test) or compares two samples with each other (two-sample K-S test). It’s often used to check goodness of fit or to test whether two samples come from the same distribution. This article shows how to perform both versions of the K-S test in Python using NumPy and SciPy, with Matplotlib and seaborn for plotting.

Importing the Libraries

First import the required libraries.

import numpy as np                 # random sample generation
from scipy import stats            # K-S test functions
import matplotlib.pyplot as plt    # figures
import seaborn as sns              # histogram and ECDF plots

Generating Sample Datasets

Let’s generate two random datasets for our examples:

# Set the seed for reproducibility
np.random.seed(0)

# Generate two sample datasets
data1 = np.random.normal(loc=0, scale=1, size=100)
data2 = np.random.normal(loc=0.5, scale=1.5, size=100)

Here, we’ve created two datasets of 100 values each, both drawn from normal distributions: data1 has a mean (loc) of 0 and a standard deviation (scale) of 1, while data2 has a mean of 0.5 and a standard deviation of 1.5.

Visualizing the Datasets

Before conducting the K-S test, it’s beneficial to visualize the data. We’ll use histograms and empirical cumulative distribution function (ECDF) plots for this:

# Plot histograms
plt.figure(figsize=(10, 5))
sns.histplot(data1, kde=True, color='blue', label='Data1')
sns.histplot(data2, kde=True, color='red', label='Data2')
plt.title('Histograms')
plt.legend()
plt.show()

# Plot CDFs
plt.figure(figsize=(10, 5))
sns.ecdfplot(data1, color='blue', label='Data1')
sns.ecdfplot(data2, color='red', label='Data2')
plt.title('CDFs')
plt.legend()
plt.show()

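Histograms show how the values in each dataset are distributed, while the ECDF plots show, for each value on the x-axis, the proportion of observations that are less than or equal to it. The K-S statistic is based on the largest vertical gap between such curves, so the ECDF plot is the most direct picture of what the test measures.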
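To make the curve drawn by sns.ecdfplot concrete, here is a minimal sketch that computes an empirical CDF by hand. The helper name ecdf is hypothetical (it is not part of NumPy or seaborn), and the data is regenerated with the same seed as above so the block runs on its own.

```python
import numpy as np

def ecdf(sample):
    """Hypothetical helper: sorted values and the ECDF evaluated at each one."""
    x = np.sort(np.asarray(sample))
    # With no tied values, the i-th sorted value (1-based) has exactly
    # i observations less than or equal to it.
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Same seed and parameters as data1 in the article
np.random.seed(0)
data1 = np.random.normal(loc=0, scale=1, size=100)

x, y = ecdf(data1)

# Proportion of observations at or below 0, read directly from the data
print(np.mean(data1 <= 0))
```

Plotting y against x as a step function should reproduce the curve that seaborn draws for data1.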

Performing the One-Sample K-S Test

The one-sample K-S test compares a sample with a reference probability distribution. In this example, we’ll test if data1 follows a standard normal distribution.

# Perform the one-sample K-S test
D, p_value = stats.kstest(data1, 'norm')

# Print the test statistic and p-value
print(f'Test Statistic (D): {D}, P-Value: {p_value}')

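The scipy.stats.kstest function performs the K-S test. The first argument is the dataset, and the second argument identifies the reference distribution, either as the name of a distribution in scipy.stats (in this case, ‘norm’, which defaults to the standard normal distribution with mean 0 and standard deviation 1) or as a callable that evaluates the cumulative distribution function. The function returns the K-S test statistic D, which is the largest absolute difference between the sample’s empirical CDF and the reference CDF, and the p-value.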
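As an illustration of what D measures, the following sketch recomputes the one-sample statistic by hand and compares it with the value from kstest. It also shows one way to test against a normal distribution that is not the standard one, by passing loc and scale through the args parameter. The variable names are illustrative, and the data is regenerated with the same seed so the block is self-contained.

```python
import numpy as np
from scipy import stats

# Same seed and parameters as data1 in the article
np.random.seed(0)
data1 = np.random.normal(loc=0, scale=1, size=100)

x = np.sort(data1)
n = x.size
cdf = stats.norm.cdf(x)  # reference CDF at each sorted observation

# The ECDF jumps at every observation, so the largest gap can occur
# just after a jump (d_plus) or just before one (d_minus).
d_plus = np.max(np.arange(1, n + 1) / n - cdf)
d_minus = np.max(cdf - np.arange(0, n) / n)
d_manual = max(d_plus, d_minus)

result = stats.kstest(data1, 'norm')
print(f'Manual D: {d_manual}, kstest D: {result.statistic}')

# Testing against N(0.5, 1.5) instead: args supplies (loc, scale) to 'norm'
result_shifted = stats.kstest(data1, 'norm', args=(0.5, 1.5))
print(f'D vs N(0.5, 1.5): {result_shifted.statistic}, P-Value: {result_shifted.pvalue}')
```

The two D values should agree up to floating-point error, which is a quick sanity check that the statistic is nothing more than the maximum vertical gap between the two curves.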

Performing the Two-Sample K-S Test

The two-sample K-S test compares two empirical distributions. We’ll compare data1 and data2.

# Perform the two-sample K-S test
D, p_value = stats.ks_2samp(data1, data2)

# Print the test statistic and p-value
print(f'Test Statistic (D): {D}, P-Value: {p_value}')

The scipy.stats.ks_2samp function carries out the two-sample K-S test. It takes as arguments the two datasets and returns the K-S test statistic D and the p-value.
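For the two-sample case, D is the largest vertical gap between the two empirical CDFs. The sketch below computes it by hand by evaluating both ECDFs at every pooled observation and then compares the result with ks_2samp. Variable names are illustrative, and the data is regenerated with the same seed so the block runs independently.

```python
import numpy as np
from scipy import stats

# Same seed, order, and parameters as in the article
np.random.seed(0)
data1 = np.random.normal(loc=0, scale=1, size=100)
data2 = np.random.normal(loc=0.5, scale=1.5, size=100)

# Evaluate both ECDFs at every observed value from either sample
pooled = np.sort(np.concatenate([data1, data2]))
ecdf1 = np.searchsorted(np.sort(data1), pooled, side='right') / data1.size
ecdf2 = np.searchsorted(np.sort(data2), pooled, side='right') / data2.size

d_manual = np.max(np.abs(ecdf1 - ecdf2))

result = stats.ks_2samp(data1, data2)
print(f'Manual D: {d_manual}, ks_2samp D: {result.statistic}')
```

Using side='right' counts the observations that are less than or equal to each pooled value, which matches the definition of the ECDF.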

Interpreting the Results

When interpreting the results, you’ll typically choose a significance level (alpha, α) before performing the test. A common choice is 0.05:

• If the p-value is less than α, you reject the null hypothesis and conclude that the data does not follow the reference distribution (one-sample) or that the two datasets come from different distributions (two-sample).
• If the p-value is greater than or equal to α, you fail to reject the null hypothesis: the data is consistent with the reference distribution (one-sample) or with both datasets sharing the same distribution (two-sample). This does not prove that the null hypothesis is true; it only means the test found no significant evidence against it.

# Compare the p-value from the most recent test with the significance level
alpha = 0.05
if p_value < alpha:
    print('The null hypothesis can be rejected.')
else:
    print('The null hypothesis cannot be rejected.')
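Because both tests return their results in the same form, the decision rule can be wrapped in a small function and applied to each of them. The sketch below is one way to do that; ks_decision is a hypothetical helper written for this example, not a SciPy function.

```python
import numpy as np
from scipy import stats

def ks_decision(p_value, alpha=0.05):
    """Hypothetical helper: turn a p-value into a plain-language decision."""
    if p_value < alpha:
        return 'reject the null hypothesis'
    return 'fail to reject the null hypothesis'

# Same seed, order, and parameters as in the article
np.random.seed(0)
data1 = np.random.normal(loc=0, scale=1, size=100)
data2 = np.random.normal(loc=0.5, scale=1.5, size=100)

one_sample = stats.kstest(data1, 'norm')
two_sample = stats.ks_2samp(data1, data2)

print(f'One-sample: p = {one_sample.pvalue:.4f} -> {ks_decision(one_sample.pvalue)}')
print(f'Two-sample: p = {two_sample.pvalue:.4f} -> {ks_decision(two_sample.pvalue)}')
```

Keeping alpha as a parameter makes it explicit that the significance level is chosen by the analyst before the test is run, not derived from the data.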

Conclusion

The Kolmogorov-Smirnov test is a powerful tool for comparing distributions and checking goodness of fit. It’s a non-parametric test, meaning it doesn’t assume that the data follows any particular family of distributions, although its standard p-values do assume that the underlying distribution is continuous. Like any statistical test, it has limitations. It is sensitive to sample size: with very large samples, even tiny and practically irrelevant differences can produce a small p-value, while with small samples the test may fail to detect real differences. Its p-values also become unreliable when the data contains many repeated values (a characteristic of discrete distributions). As such, it’s best to use it alongside other exploratory and inferential statistical methods when analyzing your data.
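The sample-size effect mentioned above can be seen in a short simulation. This is a sketch under assumed settings: the shift of 0.1, the two sample sizes, and the helper name shifted_pvalue are all choices made for this illustration, not values from the article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def shifted_pvalue(n, shift=0.1):
    """Hypothetical helper: two-sample K-S p-value for N(0, 1) vs N(shift, 1)."""
    a = rng.normal(loc=0, scale=1, size=n)
    b = rng.normal(loc=shift, scale=1, size=n)
    return stats.ks_2samp(a, b).pvalue

# The same small shift in the mean, tested at two very different sample sizes
p_small = shifted_pvalue(50)
p_large = shifted_pvalue(50_000)

print(f'n = 50:     p = {p_small}')
print(f'n = 50,000: p = {p_large}')
```

With 50,000 observations per sample, a shift this small is detected with a very low p-value, whereas with only 50 observations it will often go unnoticed. The difference between the two distributions is identical in both runs; only the amount of data changes.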