How to Perform a Chi-Square Test of Independence in Python

Introduction

The Chi-Square Test of Independence is a non-parametric statistical test used to determine if there is a significant association between two categorical variables in a contingency table. It is widely used in research fields like biology, marketing, sociology, and medicine. This article provides a detailed guide on how to perform a Chi-Square Test of Independence in Python.

Background and Significance of Chi-Square Test of Independence

In research, we often want to examine whether two categorical variables are related. For instance, one might want to know whether voting preference differs by gender. The Chi-Square Test of Independence checks whether the observed counts are consistent with the two variables being independent.

Understanding Chi-Square Test of Independence

a. Hypotheses

The null and alternative hypotheses for the Chi-Square Test of Independence are as follows:

  • Null Hypothesis (H0): The two categorical variables are independent.
  • Alternative Hypothesis (H1): The two categorical variables are dependent.

b. Test Statistic

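The test statistic is a single number that summarizes how far the observed counts deviate from the counts you would expect if there were no association between the variables. It sums (observed - expected)² / expected over every cell of the contingency table, so larger values indicate stronger evidence against independence.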
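
The calculation can be sketched by hand. Under independence, the expected count of a cell is (row total × column total) / grand total. The snippet below is a minimal illustration with made-up counts (the table and variable names are assumptions for demonstration, not data from this article) and cross-checks the manual result against scipy.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative counts only (not real data): rows and columns are the
# categories of the two variables.
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
grand_total = observed.sum()

# Expected counts if the two variables were independent
expected = row_totals * col_totals / grand_total

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells
chi_stat_manual = ((observed - expected) ** 2 / expected).sum()

# Cross-check against scipy. correction=False turns off the Yates continuity
# correction that scipy applies to 2x2 tables by default.
chi_stat_scipy, _, _, expected_scipy = chi2_contingency(observed, correction=False)

print(f"Manual statistic: {chi_stat_manual:.4f}")
print(f"scipy statistic:  {chi_stat_scipy:.4f}")
```

Both values should agree, since scipy performs the same computation once the continuity correction is disabled.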

c. Assumptions

  • The samples are random and independent.
  • The variables under study are categorical.
  • The expected frequency count is at least 5 in at least 80% of the cells of the contingency table, and no cell has an expected count below 1.
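
The expected-frequency assumption can be checked before running the test. The following is a rough sketch that uses `expected_freq` from `scipy.stats.contingency`; the counts are made up for illustration, and the 80% threshold simply mirrors the rule of thumb listed above.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Illustrative counts only (not real data)
observed = np.array([[12, 3],
                     [9, 2],
                     [20, 14]])

# Expected counts under independence, same shape as the observed table
expected = expected_freq(observed)

# Fraction of cells whose expected count is at least 5
share_at_least_5 = (expected >= 5).mean()
min_expected = expected.min()

print(f"Share of cells with expected count >= 5: {share_at_least_5:.0%}")
print(f"Smallest expected count: {min_expected:.2f}")

if share_at_least_5 < 0.8 or min_expected < 1:
    print("Expected counts are too small for the chi-square approximation to be reliable.")
```

When the check fails, a common remedy is to merge sparse categories before testing.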

d. Applications

  • Testing the relationship between diseases and exposures.
  • Market research for understanding consumer preferences.
  • Social sciences to test relationships between demographics and attitudes.

Loading and Preparing Data

Before you can perform the Chi-Square Test of Independence, you need to have your data ready. Load your data from a CSV file, an Excel workbook, a SQL database, or any other source. The pandas library is useful for loading and managing data.

Example:

import pandas as pd

# Load data from a CSV file
data = pd.read_csv('your-data-file.csv')
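
Raw data usually has one row per observation, while the test needs a table of counts. A minimal sketch of that step, assuming hypothetical column names `smoker` and `exercise` and a small made-up DataFrame standing in for the CSV file, uses `pd.crosstab` to build the contingency table:

```python
import pandas as pd

# Hypothetical raw data: one row per person. The column names and values
# are made up for illustration and stand in for a loaded CSV file.
df = pd.DataFrame({
    "smoker":   ["yes", "no", "no", "yes", "no", "no", "yes", "no"],
    "exercise": ["low", "high", "high", "low", "low", "high", "high", "low"],
})

# Count how often each combination of categories occurs
table = pd.crosstab(df["smoker"], df["exercise"])

print(table)
```

The resulting table of counts is what the test takes as input. A sample this small is only meant to show the reshaping step and would not satisfy the expected-frequency assumption.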

Performing Chi-Square Test of Independence in Python

a. Using scipy.stats

The scipy library provides the chi2_contingency function for performing the Chi-Square Test of Independence.

from scipy.stats import chi2_contingency

# Contingency table
# [[A1, B1],
#  [A2, B2]]

table = [[10, 20], [20, 40]]

# Perform Chi-Square Test of Independence
chi_stat, p_value, dof, expected = chi2_contingency(table)

# Output the results
print(f"Chi-Square Statistic: {chi_stat}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Frequencies: {expected}")
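
One detail worth knowing: for 2x2 tables (one degree of freedom), `chi2_contingency` applies Yates' continuity correction by default through its `correction=True` parameter. The sketch below, using made-up counts, compares the corrected and uncorrected results so the effect of that default is visible.

```python
from scipy.stats import chi2_contingency

# Illustrative 2x2 counts only (not real data)
table = [[12, 5], [7, 9]]

# Default behaviour: Yates' continuity correction is applied when dof == 1
stat_corrected, p_corrected, dof, _ = chi2_contingency(table)

# Plain Pearson chi-square statistic without the correction
stat_plain, p_plain, _, _ = chi2_contingency(table, correction=False)

print(f"Degrees of freedom: {dof}")
print(f"With correction:    statistic={stat_corrected:.4f}, p={p_corrected:.4f}")
print(f"Without correction: statistic={stat_plain:.4f}, p={p_plain:.4f}")
```

The corrected statistic is smaller and its p-value larger, which makes the default the more conservative choice for small 2x2 tables.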

b. Interpreting the Results

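The p-value is the probability of seeing a test statistic at least as large as the one observed if the two variables were truly independent. If the p-value is below a chosen significance level, usually 0.05, you reject the null hypothesis and conclude that there is evidence of an association. Otherwise you fail to reject the null hypothesis, which means the data do not show a significant association, not that independence has been proven.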
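
To make the link between the statistic and the p-value concrete, the sketch below recomputes the p-value as the upper tail of the chi-square distribution and compares the statistic with the critical value at the chosen significance level. The counts are made up for illustration.

```python
from scipy.stats import chi2, chi2_contingency

# Illustrative 2x3 counts only (not real data)
table = [[20, 15, 5],
         [10, 20, 30]]

chi_stat, p_value, dof, _ = chi2_contingency(table)

# The p-value is the upper-tail probability of the chi-square distribution
# with `dof` degrees of freedom, evaluated at the test statistic.
p_from_tail = chi2.sf(chi_stat, dof)

# Equivalent decision rule: compare the statistic with the critical value
alpha = 0.05
critical_value = chi2.ppf(1 - alpha, dof)

print(f"p-value from chi2_contingency: {p_value:.6f}")
print(f"p-value from chi2.sf:          {p_from_tail:.6f}")
print(f"Statistic {chi_stat:.4f} vs critical value {critical_value:.4f}")
print(f"Reject H0: {chi_stat > critical_value}")
```

Comparing the p-value with alpha and comparing the statistic with the critical value always lead to the same decision.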

Practical Example

Let’s consider a practical example where you have data on smoking habits and exercise levels of individuals. You want to test if smoking habits are independent of exercise levels.

from scipy.stats import chi2_contingency

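# Contingency table of observed counts.
# chi2_contingency treats each inner list as one row, so the table below
# has 4 rows and 2 columns, which gives (4 - 1) * (2 - 1) = 3 degrees of freedom.
# Layout assumed here: each row is one exercise level and the two columns
# are the two smoking categories.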

data = [[15, 150], [5, 100], [25, 70], [35, 30]]

# Perform Chi-Square Test of Independence
chi_stat, p_value, dof, expected = chi2_contingency(data)

# Output the results
print(f"Chi-Square Statistic: {chi_stat}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Frequencies: {expected}")

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis - Smoking habits and exercise levels are not independent.")
else:
    print("Fail to reject the null hypothesis - No significant evidence that smoking habits and exercise levels are associated.")

Conclusion

The Chi-Square Test of Independence is an essential statistical tool for understanding the relationships between categorical variables. With Python’s scipy library, it is easy to perform this test efficiently. As with any statistical test, it is crucial to understand the assumptions and context of the data to interpret the results accurately. This test is widely used across different fields whenever there is a need to evaluate the independence of two categorical variables.
