The Chi-Square Goodness of Fit Test compares observed frequencies with the frequencies we would expect under a specified theoretical distribution. In R, the test is a single function call, so researchers and statisticians can quickly evaluate how well their data fit a hypothesized distribution. This guide walks through the procedure in R.
1. Fundamentals of the Chi-Square Goodness of Fit Test
The test lets us determine whether our data are consistent with a particular distribution. For instance, one might want to know whether a die is fair by comparing the observed counts of each face to the expected counts (which would be equal for a fair die).
2. Prerequisites and Assumptions
Before diving into the application, we must understand the assumptions:
- Categorical Data: The data should be counts of observations falling into categories, not continuous measurements or percentages.
- Independence: Observations must be independent of each other.
- Sample Size: Ideally, expected frequencies for each category should be at least 5.
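The sample-size guideline can be checked before running the test. The following is a minimal sketch, assuming a hypothetical total of 60 observations spread over six equally likely categories; `n_obs`, `hypothesized_p`, and `expected_counts` are illustrative names, not part of any package.

```r
# Hypothetical setup: 60 observations, six equally likely categories
n_obs <- 60
hypothesized_p <- rep(1 / 6, 6)

# Expected count per category under the hypothesized distribution
expected_counts <- n_obs * hypothesized_p

# TRUE when every expected count meets the usual "at least 5" guideline
all_large_enough <- all(expected_counts >= 5)
print(expected_counts)
print(all_large_enough)
```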
3. Applying the Test in R
Let’s say we’ve rolled a die 60 times, and we want to know if it’s fair.
3.1 Data Preparation
Firstly, record the observed frequencies for each face:
observed_freq <- c(8, 9, 11, 10, 12, 10)
For a fair die, the expected frequency for each face after 60 rolls would be 10.
expected_freq <- rep(10, 6)
3.2 Running the Test
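With the data prepared, you can now run the test. The p argument of chisq.test() expects probabilities rather than counts, so the expected frequencies are divided by their total: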
chi_sq_gof <- chisq.test(observed_freq, p=expected_freq/sum(expected_freq))
print(chi_sq_gof)
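Because a fair die assigns equal probability to every face, the same test can be written in other ways. The sketch below reuses the die counts from this section and relies on two documented behaviours of `chisq.test()`: `p` defaults to equal probabilities, and `rescale.p = TRUE` rescales `p` so that it sums to 1. The names `explicit_p`, `default_p`, and `rescaled_p` are illustrative.

```r
observed_freq <- c(8, 9, 11, 10, 12, 10)
expected_freq <- rep(10, 6)

# Original call: expected counts converted to probabilities by hand
explicit_p <- chisq.test(observed_freq, p = expected_freq / sum(expected_freq))

# p defaults to equal probabilities for all categories
default_p <- chisq.test(observed_freq)

# rescale.p = TRUE lets chisq.test() normalise the expected counts itself
rescaled_p <- chisq.test(observed_freq, p = expected_freq, rescale.p = TRUE)

# All three calls should report the same p-value
print(c(explicit_p$p.value, default_p$p.value, rescaled_p$p.value))
```

The explicit form is the one to keep when the hypothesized probabilities are not all equal.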
4. Decoding the Results
Two primary results need your attention:
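- Chi-Square Value: The sum, over all categories, of (observed - expected)^2 / expected. Larger values indicate a greater deviation of observed frequencies from expected frequencies. It is reported together with the degrees of freedom (the number of categories minus one).
- P-value: If this is less than the chosen significance level (e.g., 0.05), you’d reject the null hypothesis that the data follow the hypothesized distribution, suggesting that the observed and expected frequencies are significantly different. A larger p-value means the data give no evidence against the hypothesized distribution; it does not prove that the distribution is correct.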
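To make these two numbers concrete, the statistic and p-value can be reproduced by hand. This is a sketch using the die counts from Section 3; `manual_stat`, `deg_free`, and `manual_p` are illustrative names.

```r
observed_freq <- c(8, 9, 11, 10, 12, 10)
expected_freq <- rep(10, 6)

# Chi-square statistic: sum of (observed - expected)^2 / expected
manual_stat <- sum((observed_freq - expected_freq)^2 / expected_freq)

# Degrees of freedom: number of categories minus one
deg_free <- length(observed_freq) - 1

# Upper-tail probability of the chi-square distribution
manual_p <- pchisq(manual_stat, df = deg_free, lower.tail = FALSE)

# Compare with the values stored in the chisq.test() result
chi_sq_gof <- chisq.test(observed_freq, p = expected_freq / sum(expected_freq))
print(c(manual = manual_stat, from_test = unname(chi_sq_gof$statistic)))
print(c(manual = manual_p, from_test = chi_sq_gof$p.value))
```

For these rolls the statistic is 1 on 5 degrees of freedom and the p-value is well above 0.05, so the data give no evidence that the die is unfair.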
5. Visual Representations
Visualizing observed vs. expected frequencies can provide clarity:
barplot(rbind(observed_freq, expected_freq), beside = TRUE,
        names.arg = 1:6,  # label each pair of bars with its die face
        col = c("red", "blue"),
        legend.text = c("Observed", "Expected"),
        main = "Observed vs Expected Frequencies",
        xlab = "Face",
        ylab = "Frequency")

6. Use-Cases and Examples
While the die is a simple example, the test’s applications include:
- Election Polling: Checking if observed voting patterns match predictions.
- Genetic Research: Determining if observed genotype frequencies deviate from expected under Hardy-Weinberg equilibrium.
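The Hardy-Weinberg case differs from the die example in one respect: the allele frequency is usually estimated from the same sample, which costs an extra degree of freedom. The sketch below uses made-up genotype counts purely for illustration; `genotype_counts`, `p_A`, `hw_expected`, and the other names are hypothetical.

```r
# Hypothetical genotype counts (illustrative only, not real data)
genotype_counts <- c(AA = 50, Aa = 40, aa = 10)
n_individuals <- sum(genotype_counts)

# Estimate the frequency of allele A from the sample
p_A <- unname((2 * genotype_counts["AA"] + genotype_counts["Aa"]) /
                (2 * n_individuals))
q_a <- 1 - p_A

# Expected genotype counts under Hardy-Weinberg equilibrium
hw_expected <- n_individuals * c(AA = p_A^2, Aa = 2 * p_A * q_a, aa = q_a^2)

# Chi-square statistic computed by hand
hw_stat <- sum((genotype_counts - hw_expected)^2 / hw_expected)

# One parameter (p_A) was estimated from the data, so
# df = 3 categories - 1 - 1 estimated parameter = 1
hw_p_value <- pchisq(hw_stat, df = 1, lower.tail = FALSE)

print(hw_expected)
print(c(statistic = hw_stat, p_value = hw_p_value))
```

Calling `chisq.test()` directly on these counts would use 2 degrees of freedom, which is why the p-value is computed with `pchisq()` here.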
7. Limitations and Potential Issues
- Sample Size: Small samples can lead to expected frequencies below 5, making the test less reliable.
- Over-reliance: A significant result only indicates a difference from the hypothesized distribution; it doesn’t identify which categories contribute most to the discrepancy, so it should be followed by a look at the individual categories.
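Both limitations can be probed directly in R. The sketch below, again using the die counts from Section 3, reads the per-category Pearson residuals from the `chisq.test()` result and requests a Monte Carlo p-value through the `simulate.p.value` and `B` arguments. The seed and the number of replicates are arbitrary choices made for this illustration.

```r
observed_freq <- c(8, 9, 11, 10, 12, 10)
names(observed_freq) <- paste("Face", 1:6)

chi_sq_gof <- chisq.test(observed_freq)

# Pearson residuals: (observed - expected) / sqrt(expected)
# Categories with large absolute residuals contribute most to the statistic
print(round(chi_sq_gof$residuals, 2))

# Squared residuals add up to the chi-square statistic
contribution <- chi_sq_gof$residuals^2
print(sum(contribution))

# Monte Carlo p-value, an option when expected counts are small
set.seed(42)  # arbitrary seed so the simulation is reproducible
sim_result <- chisq.test(observed_freq, simulate.p.value = TRUE, B = 5000)
print(sim_result$p.value)
```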
8. Conclusion
The Chi-Square Goodness of Fit Test in R is a practical tool for checking whether your observed data fit a specific theoretical distribution. A proper understanding of its application, assumptions, and limitations ensures that it serves as a reliable part of your data analysis.