Statistical hypothesis testing is an essential part of data analysis that supports decisions based on data. One non-parametric method for testing hypotheses about independent samples is the Kruskal-Wallis test. This article is a guide to running a Kruskal-Wallis test in R.
Understanding the Kruskal-Wallis Test
The Kruskal-Wallis test, named after William Kruskal and W. Allen Wallis, is a rank-based non-parametric method for testing whether samples originate from the same distribution. It can be used to determine whether two or more independent groups differ significantly on a continuous or ordinal dependent variable.
It’s an extension of the Mann-Whitney U test to more than two groups. Unlike one-way ANOVA, the Kruskal-Wallis test doesn’t assume normally distributed residuals. When the result is interpreted as a comparison of medians, it instead assumes that all groups have similarly shaped distributions.
The Kruskal-Wallis test works by ranking the data from all groups together, then comparing the sum of ranks for each group. If the sums of ranks between groups are significantly different, then it suggests that the groups are different.
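The ranking procedure described above can be sketched by hand in base R. This is a minimal illustration, not a replacement for the built-in function: it uses the same three small samples that appear in the example later in this article, and the variable names (`rank_sums`, `h_raw`, `tie_correction`, and so on) are chosen here for illustration. Because the data contain tied values, the sketch applies the usual tie correction to the statistic.

```r
# Three small samples (the same values used in the article's example)
data1 <- c(2.9, 3.0, 2.5, 2.6, 3.2)
data2 <- c(3.1, 3.3, 3.4, 2.8, 3.5)
data3 <- c(3.6, 3.8, 3.4, 3.7, 3.6)

values <- c(data1, data2, data3)
group  <- factor(rep(c("Group1", "Group2", "Group3"), each = 5))

# Rank all observations together; tied values receive their average rank
ranks   <- rank(values)
n_total <- length(values)

# Sum of ranks and sample size for each group
rank_sums   <- tapply(ranks, group, sum)
group_sizes <- tapply(ranks, group, length)

# Kruskal-Wallis H statistic before the tie correction
h_raw <- 12 / (n_total * (n_total + 1)) * sum(rank_sums^2 / group_sizes) -
  3 * (n_total + 1)

# Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N),
# where t is the number of observations sharing each distinct value
tie_counts     <- table(values)
tie_correction <- 1 - sum(tie_counts^3 - tie_counts) / (n_total^3 - n_total)
h_stat         <- as.numeric(h_raw / tie_correction)

# Approximate p-value from a chi-squared distribution with k - 1 degrees of freedom
df_h    <- nlevels(group) - 1
p_value <- pchisq(h_stat, df = df_h, lower.tail = FALSE)

print(rank_sums)
cat("H =", h_stat, " df =", df_h, " p-value =", p_value, "\n")
```

The larger the spread between the group rank sums, the larger H becomes, and the smaller the resulting p-value.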
Kruskal-Wallis Test in R
R provides the built-in kruskal.test() function to perform the Kruskal-Wallis test.
The basic syntax for performing a Kruskal-Wallis test in R is as follows:
kruskal.test(formula, data)
Here, formula is a formula object with the response on the left of the ~ operator and the grouping variable on the right, and data is a data frame containing the variables specified in the formula.
Let’s illustrate this with an example:
# Defining the data
data1 <- c(2.9, 3.0, 2.5, 2.6, 3.2) # group 1 data
data2 <- c(3.1, 3.3, 3.4, 2.8, 3.5) # group 2 data
data3 <- c(3.6, 3.8, 3.4, 3.7, 3.6) # group 3 data
# Combine all the data into a data frame
df <- data.frame(
  values = c(data1, data2, data3),
  group = factor(rep(c("Group1", "Group2", "Group3"), each = 5))
)
# Perform the Kruskal-Wallis test
result <- kruskal.test(values ~ group, data = df)
# Print the result
print(result)
In this code, values is the response variable and group is the grouping variable. The ~ symbol indicates that values is modeled as a function of group. The data frame df contains the variables specified in the formula.
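Besides the formula interface, kruskal.test() also accepts a numeric vector together with a grouping factor, or a list of numeric vectors with one element per group. The sketch below rebuilds the example data and checks that the three call forms agree; the names `by_formula`, `by_vectors`, and `by_list` are chosen here for illustration.

```r
# Rebuild the example data so this block runs on its own
data1 <- c(2.9, 3.0, 2.5, 2.6, 3.2)
data2 <- c(3.1, 3.3, 3.4, 2.8, 3.5)
data3 <- c(3.6, 3.8, 3.4, 3.7, 3.6)
df <- data.frame(
  values = c(data1, data2, data3),
  group = factor(rep(c("Group1", "Group2", "Group3"), each = 5))
)

# Formula interface: response ~ grouping variable
by_formula <- kruskal.test(values ~ group, data = df)

# Default interface: a numeric vector plus a grouping factor
by_vectors <- kruskal.test(df$values, df$group)

# Default interface: a list with one numeric vector per group
by_list <- kruskal.test(list(data1, data2, data3))

# All three calls should report the same statistic and p-value
same_statistic <- isTRUE(all.equal(unname(by_formula$statistic),
                                   unname(by_vectors$statistic))) &&
  isTRUE(all.equal(unname(by_formula$statistic),
                   unname(by_list$statistic)))
print(same_statistic)
```

The formula form is usually the most convenient when the data already live in a data frame, while the list form avoids building one for a quick check.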
The kruskal.test() function calculates the Kruskal-Wallis rank sum statistic, the degrees of freedom, and the p-value of the test.
The result of the kruskal.test() function is an object of class “htest” that contains the following components:
statistic: the value of the Kruskal-Wallis rank sum statistic.
parameter: the degrees of freedom of the approximate chi-squared distribution of the test statistic.
p.value: the p-value of the test.
method: a character string indicating the name of the test.
data.name: a character string giving the names of the data.
The p-value is the probability of getting a test statistic as extreme as, or more extreme than, the observed statistic if the null hypothesis is true. Here the null hypothesis is that all groups come from the same distribution. If the p-value is less than the chosen significance level (usually 0.05), you reject the null hypothesis.
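As a short sketch of how these components can be used, the code below pulls the statistic, degrees of freedom, and p-value out of the result object and applies the decision rule. The significance level `alpha` of 0.05 is an assumption made for this example, and the variable names are illustrative.

```r
# Rebuild the example data so this block runs on its own
df <- data.frame(
  values = c(2.9, 3.0, 2.5, 2.6, 3.2,
             3.1, 3.3, 3.4, 2.8, 3.5,
             3.6, 3.8, 3.4, 3.7, 3.6),
  group = factor(rep(c("Group1", "Group2", "Group3"), each = 5))
)
result <- kruskal.test(values ~ group, data = df)

alpha <- 0.05  # assumed significance level for this example

# Extract the individual components of the "htest" object
h_stat <- unname(result$statistic)  # Kruskal-Wallis rank sum statistic
df_h   <- unname(result$parameter)  # degrees of freedom (number of groups - 1)
p_val  <- result$p.value

# The p-value is the upper tail of the chi-squared distribution at the statistic
p_check <- pchisq(h_stat, df = df_h, lower.tail = FALSE)

# Decision rule: reject the null hypothesis when p < alpha
reject_null <- p_val < alpha

cat("H =", h_stat, " df =", df_h, " p-value =", p_val, "\n")
cat("Reject the null hypothesis at alpha =", alpha, ":", reject_null, "\n")
```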
Post-Hoc Analysis
If you find a significant result with the Kruskal-Wallis test, you might want to explore further and find out which groups differ. You can do this using post-hoc tests.
A common choice for post-hoc analysis after the Kruskal-Wallis test is Dunn’s test. For each pair of groups, it compares the observed difference in mean ranks to the difference expected under the null hypothesis.
The dunn.test() function in R performs the Dunn post-hoc test. It is part of the dunn.test package, which you need to install and load before using the function.
# Install the package
install.packages("dunn.test")
# Load the package
library(dunn.test)
# Perform the Dunn post-hoc test
result_posthoc <- dunn.test(df$values, df$group, method = "bonferroni")
# Print the result
print(result_posthoc)
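To show what the pairwise comparison involves, the following base R sketch computes a z statistic for each pair of groups from the difference in mean ranks, using the commonly cited standard error with a tie correction, and then applies a Bonferroni adjustment with p.adjust(). It is an illustration under stated assumptions rather than a reimplementation of the package: the p-values here are two-sided, the variable names are chosen for this example, and dunn.test() may report p-values under a different convention, so its numbers need not match these exactly.

```r
# Rebuild the example data so this block runs on its own
values <- c(2.9, 3.0, 2.5, 2.6, 3.2,
            3.1, 3.3, 3.4, 2.8, 3.5,
            3.6, 3.8, 3.4, 3.7, 3.6)
group <- factor(rep(c("Group1", "Group2", "Group3"), each = 5))

# Rank all observations together, then summarise the ranks per group
ranks       <- rank(values)
n_total     <- length(values)
mean_ranks  <- sapply(split(ranks, group), mean)
group_sizes <- lengths(split(ranks, group))

# Tie adjustment used in the variance of the mean-rank difference
tie_counts <- table(values)
tie_term   <- sum(tie_counts^3 - tie_counts) / (12 * (n_total - 1))

# One z statistic per pair of groups
pairs <- combn(levels(group), 2)
z <- apply(pairs, 2, function(p) {
  se <- sqrt((n_total * (n_total + 1) / 12 - tie_term) *
               (1 / group_sizes[[p[1]]] + 1 / group_sizes[[p[2]]]))
  (mean_ranks[[p[1]]] - mean_ranks[[p[2]]]) / se
})
names(z) <- apply(pairs, 2, paste, collapse = " - ")

# Two-sided p-values (an assumption of this sketch), Bonferroni-adjusted
p_raw  <- 2 * pnorm(abs(z), lower.tail = FALSE)
p_bonf <- p.adjust(p_raw, method = "bonferroni")

print(data.frame(z = round(z, 3), p_bonferroni = round(p_bonf, 4)))
```

The Bonferroni adjustment multiplies each raw p-value by the number of comparisons (capped at 1), which keeps the overall chance of a false positive across all pairs at or below the chosen significance level.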
Conclusion
The Kruskal-Wallis test is a non-parametric method for determining whether two or more independent groups differ significantly on a continuous or ordinal outcome. Its primary advantage over one-way ANOVA is that it doesn’t require normally distributed data, although it still assumes similarly shaped distributions across groups when the result is read as a comparison of medians.
R provides the kruskal.test() function to perform the Kruskal-Wallis test, which is simple and easy to use. Always remember to interpret the results appropriately and consider performing post-hoc analysis if needed to further analyze your data.