How to Perform a Kruskal-Wallis Test in Python

Introduction

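The Kruskal-Wallis H-test is a non-parametric method for testing whether two or more independent groups come from the same distribution. When the groups' distributions have a similar shape, it can be read as a comparison of their medians. It is named after its developers, William Kruskal and W. Allen Wallis, and extends the Mann-Whitney U test to more than two groups. The test does not assume that the data are normally distributed, although it does assume that the observations are independent and measured on at least an ordinal scale.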

The Kruskal-Wallis test is often used when the assumptions of one-way ANOVA are not met, for example when the data within each group are clearly not normally distributed. Because it works on the ranks of the observations rather than on their raw values, it provides a robust alternative to the parametric test.
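
One way to check the normality assumption before choosing between ANOVA and Kruskal-Wallis is the Shapiro-Wilk test, available as scipy.stats.shapiro. The sketch below is only an illustration: the skewed sample is a made-up exponential data set, not the data used later in this article, and the 0.05 cutoff is an assumed significance level.

```python
import numpy as np
from scipy.stats import shapiro

# Hypothetical right-skewed sample, used only to illustrate the check
rng = np.random.default_rng(0)
skewed_sample = rng.exponential(scale=10, size=200)

# Shapiro-Wilk tests the null hypothesis that the sample comes from a normal distribution
w_stat, p_normal = shapiro(skewed_sample)
print(f'W={w_stat:.3f}, p={p_normal:.3g}')

# A small p-value is evidence against normality, which favors a rank-based test
normality_rejected = p_normal < 0.05
print('Normality rejected:', normality_rejected)
```

With real data, the same check would be run on each group separately.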

In this article, we will go through the steps involved in performing a Kruskal-Wallis test using Python, a popular programming language widely used in data science. We will use the SciPy library, which provides a function for carrying out this test.

Data Preparation

We will generate some synthetic data for our demonstration, but these steps could be easily adapted for any dataset that you might be working with. For this, we will use the numpy and pandas libraries:

import numpy as np
import pandas as pd

# Generate 3 different samples
np.random.seed(0)
group1 = np.random.normal(50, 10, 100)
group2 = np.random.normal(55, 10, 100)
group3 = np.random.normal(60, 10, 100)

# Combine them into a DataFrame
df = pd.DataFrame({
    'Value': np.concatenate([group1, group2, group3]),
    'Group': np.repeat(['Group 1', 'Group 2', 'Group 3'], repeats=100),
})

In the code above, we created three groups of 100 random values each. Every group is drawn from a normal distribution with a standard deviation of 10 but a different mean (50, 55, and 60). The data is then combined into a pandas DataFrame with two columns: ‘Value’ and ‘Group’.
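
A quick per-group summary is a cheap way to confirm that the DataFrame was assembled as intended. The sketch below rebuilds the same synthetic data so that it runs on its own. The variable name summary is just an illustrative choice.

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic data from the previous listing
np.random.seed(0)
df = pd.DataFrame({
    'Value': np.concatenate([
        np.random.normal(50, 10, 100),
        np.random.normal(55, 10, 100),
        np.random.normal(60, 10, 100),
    ]),
    'Group': np.repeat(['Group 1', 'Group 2', 'Group 3'], repeats=100),
})

# Count, mean, and median of 'Value' within each group
summary = df.groupby('Group')['Value'].agg(['count', 'mean', 'median'])
print(summary)
```

Each group should report a count of 100, with means and medians close to the values used to generate it.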

Data Visualization

Before running the Kruskal-Wallis test, it is a good idea to visualize the data. We can do this using a box plot, which shows the median, quartiles, and potential outliers for each group. For this, we will use the seaborn library:

import seaborn as sns
import matplotlib.pyplot as plt

# Create a box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='Group', y='Value', data=df)
plt.title('Box plot of Groups')
plt.show()

Performing the Kruskal-Wallis Test

Now, we can perform the Kruskal-Wallis test using the kruskal function from the scipy.stats module. Here’s how you can do it:

from scipy.stats import kruskal

# Extract the individual groups from the DataFrame
group1 = df[df['Group'] == 'Group 1']['Value']
group2 = df[df['Group'] == 'Group 2']['Value']
group3 = df[df['Group'] == 'Group 3']['Value']

# Perform the Kruskal-Wallis test
stat, p = kruskal(group1, group2, group3)
print(f'Statistic={stat:.3f}, p={p:.3g}')

The function returns two values: the test statistic (H) and the p-value. If the p-value is below the chosen significance level, such as 0.05, the result suggests that at least one of the groups differs from the others.
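
When the number of groups is not known in advance, extracting each group by name becomes tedious. A possible alternative, sketched below under the assumption that the data sits in a long-format DataFrame like the one in this article, is to let groupby produce the samples and unpack them into kruskal.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

# Rebuild the synthetic data so the example runs on its own
np.random.seed(0)
df = pd.DataFrame({
    'Value': np.concatenate([
        np.random.normal(50, 10, 100),
        np.random.normal(55, 10, 100),
        np.random.normal(60, 10, 100),
    ]),
    'Group': np.repeat(['Group 1', 'Group 2', 'Group 3'], repeats=100),
})

# One array of values per group, however many groups there are
samples = [values.to_numpy() for _, values in df.groupby('Group')['Value']]

# kruskal accepts any number of samples as positional arguments
stat, p = kruskal(*samples)
print(f'Statistic={stat:.3f}, p={p:.3g}')
```

The result is the same as passing the three groups explicitly. Only the way the samples are collected changes.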

Interpreting the Results

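The test statistic H measures how far each group's mean rank deviates from the mean rank of the pooled data, so larger values indicate stronger evidence against the null hypothesis. Under the null hypothesis, H approximately follows a chi-squared distribution with k - 1 degrees of freedom, where k is the number of groups, and the p-value is computed from that distribution. The null hypothesis for the Kruskal-Wallis test is that the medians of all groups are equal.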
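
To make the statistic concrete, the sketch below computes H by hand from the pooled ranks, using H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), where N is the total number of observations, n_i is the size of group i, and R_i is the sum of its ranks. It then compares the result with scipy.stats.kruskal. It assumes there are no tied values, which holds for continuous synthetic data. With ties, SciPy applies an additional correction that is not reproduced here. The data is freshly generated for this example.

```python
import numpy as np
from scipy.stats import chi2, kruskal, rankdata

# Fresh synthetic samples, generated only for this illustration
rng = np.random.default_rng(0)
samples = [rng.normal(loc, 10, 100) for loc in (50, 55, 60)]

# Rank all observations together, then split the ranks back into groups
pooled = np.concatenate(samples)
ranks = rankdata(pooled)
n_total = len(pooled)
split_points = np.cumsum([len(s) for s in samples])[:-1]
group_ranks = np.split(ranks, split_points)

# H statistic without tie correction
rank_term = sum(r.sum() ** 2 / len(r) for r in group_ranks)
h_manual = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

# p-value from the chi-squared approximation with k - 1 degrees of freedom
p_manual = chi2.sf(h_manual, df=len(samples) - 1)

h_scipy, p_scipy = kruskal(*samples)
print(f'manual: H={h_manual:.4f}, p={p_manual:.3g}')
print(f'scipy:  H={h_scipy:.4f}, p={p_scipy:.3g}')
```

The two lines of output should agree, which shows that the statistic depends only on the ranks and not on the raw values.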

  • If the p-value is less than 0.05, you can reject the null hypothesis and conclude that at least one of the groups has a different median. The test does not tell you which group that is.
  • If the p-value is 0.05 or greater, you fail to reject the null hypothesis and cannot conclude that there is a significant difference between the groups.
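
The decision rule above can be written as a small helper. The function name interpret_kruskal and the default alpha of 0.05 are illustrative assumptions, not part of SciPy.

```python
def interpret_kruskal(p_value, alpha=0.05):
    """Return a short verdict for a Kruskal-Wallis p-value (hypothetical helper)."""
    if p_value < alpha:
        return 'reject H0: at least one group differs'
    return 'fail to reject H0: no significant difference detected'


# Example p-values chosen only to show both branches
print(interpret_kruskal(0.003))
print(interpret_kruskal(0.27))
```

Keeping alpha as a parameter makes it easy to apply a stricter threshold, such as 0.01, without changing the logic.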

Post-hoc Analysis

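If the Kruskal-Wallis result is significant, a post-hoc analysis can determine which groups differ. One simple approach is to run a Mann-Whitney U test for each pair of groups. Because this involves several tests, the resulting p-values should be adjusted for multiple comparisons (for example, with a Bonferroni correction) before they are compared with the significance level.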

from scipy.stats import mannwhitneyu

# List of groups
groups = [group1, group2, group3]

# Perform Mann-Whitney U tests for each pair
for i in range(len(groups)):
    for j in range(i+1, len(groups)):
        stat, p = mannwhitneyu(groups[i], groups[j], alternative='two-sided')
        print(f'Groups {i+1} and {j+1}: Statistics={stat}, p={p}')

This performs the Mann-Whitney U test on each pair of groups and reports the test statistic and the unadjusted p-value for each pair.
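
Running three pairwise tests raises the chance of a false positive, so the p-values are usually adjusted before they are interpreted. The sketch below applies a Bonferroni correction, one of several possible adjustments, by multiplying each p-value by the number of tests and capping it at 1. The data is regenerated here so the example is self-contained, and the dictionary layout is an illustrative choice.

```python
from itertools import combinations

import numpy as np
from scipy.stats import mannwhitneyu

# Fresh synthetic samples, generated only for this illustration
rng = np.random.default_rng(0)
groups = {
    'Group 1': rng.normal(50, 10, 100),
    'Group 2': rng.normal(55, 10, 100),
    'Group 3': rng.normal(60, 10, 100),
}

# Every unordered pair of group names
pairs = list(combinations(groups, 2))
n_tests = len(pairs)

adjusted = {}
for name_a, name_b in pairs:
    stat, p = mannwhitneyu(groups[name_a], groups[name_b], alternative='two-sided')
    # Bonferroni: scale the p-value by the number of tests, capped at 1
    adjusted[(name_a, name_b)] = min(p * n_tests, 1.0)
    print(f'{name_a} vs {name_b}: U={stat:.1f}, '
          f'p={p:.3g}, adjusted p={adjusted[(name_a, name_b)]:.3g}')
```

The adjusted p-values are then compared with the usual significance level. Bonferroni is conservative, so a less strict method may be preferable when many pairs are tested.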

Conclusion

In this article, we walked through the steps of performing a Kruskal-Wallis test in Python, using the SciPy library. This non-parametric test is useful for comparing the medians of two or more independent groups, especially when the assumptions of ANOVA are not met. We also looked at how to visualize the data and how to perform post-hoc analysis to find out which specific groups differ.
