How to Perform a Two-Way ANOVA in Python

Introduction

Two-Way Analysis of Variance (ANOVA) is a powerful statistical test used to analyze the effect of two categorical independent variables on a continuous dependent variable. Unlike One-Way ANOVA, which considers one independent variable, Two-Way ANOVA considers how two factors impact a response variable and whether there is an interaction between these factors.

In this article, we walk step by step through a Two-Way ANOVA in Python, using pandas for data handling, statsmodels for the model and the ANOVA table, and seaborn with matplotlib for plotting.

Step 1: Import Necessary Libraries

First, let’s import the libraries we will need.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

Step 2: Loading the Dataset

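For this tutorial, let’s consider a small illustrative dataset where we analyze the effects of different diets and workout regimes on weight loss. The independent variables (factors) are diet type (A, B, C) and workout regime (Low, Medium, High), and the dependent variable is weight loss. Rather than loading a file, we build the data directly as a pandas DataFrame.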

data = {
    'Diet': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Workout': ['Low', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low', 'Medium', 'High'],
    'WeightLoss': [3, 4, 5, 3.2, 5, 6, 5.2, 6, 5.5, 4, 5.5, 6.2]
}
df = pd.DataFrame(data)

Step 3: Exploratory Data Analysis

Before performing Two-Way ANOVA, it’s beneficial to perform some Exploratory Data Analysis (EDA) to understand the dataset.

Descriptive Statistics:

print(df.describe())
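Since df.describe() only summarizes the numeric WeightLoss column, it can help to also look at the data cell by cell. The following is a minimal sketch, using the same example data, that counts the observations and averages the weight loss for every Diet and Workout combination. The names cell_counts and cell_means are only illustrative.

```python
import pandas as pd

# Same example data as above, written compactly
data = {
    'Diet': ['A'] * 4 + ['B'] * 4 + ['C'] * 4,
    'Workout': ['Low', 'Medium', 'High'] * 4,
    'WeightLoss': [3, 4, 5, 3.2, 5, 6, 5.2, 6, 5.5, 4, 5.5, 6.2],
}
df = pd.DataFrame(data)

# Number of observations in each Diet x Workout cell
cell_counts = pd.crosstab(df['Diet'], df['Workout'])
print(cell_counts)

# Mean weight loss in each Diet x Workout cell
cell_means = df.pivot_table(values='WeightLoss', index='Diet',
                            columns='Workout', aggfunc='mean')
print(cell_means)
```

The counts show that this example is unbalanced: each cell holds only one or two observations, so any conclusions drawn from it are purely illustrative.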

Data Visualization:

sns.boxplot(x='Diet', y='WeightLoss', data=df, hue='Workout')
plt.xlabel('Diet')
plt.ylabel('Weight Loss')
plt.title('Weight Loss by Diet and Workout Regime')
plt.show()

Step 4: Check Assumptions of Two-Way ANOVA

Like One-Way ANOVA, Two-Way ANOVA also has assumptions that need to be met:

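  1. Normality: The residuals of the model (equivalently, the observations within each Diet and Workout group) should be approximately normally distributed.
  2. Homogeneity of variance: All groups should have approximately the same variance.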
  3. Independence: Observations should be independent of each other.

Ensure that your data meets these assumptions before proceeding.
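One possible way to check the first two assumptions is sketched below. It assumes scipy is installed (it is not imported elsewhere in this tutorial) and uses the Shapiro-Wilk test on the residuals of the interaction model, plus Levene's test across the three diet groups. These particular tests are a choice made for illustration, not the only option, and with only 12 observations they have very little power. Independence cannot be tested this way; it depends on how the data were collected.

```python
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols

# Same example data as above, written compactly
data = {
    'Diet': ['A'] * 4 + ['B'] * 4 + ['C'] * 4,
    'Workout': ['Low', 'Medium', 'High'] * 4,
    'WeightLoss': [3, 4, 5, 3.2, 5, 6, 5.2, 6, 5.5, 4, 5.5, 6.2],
}
df = pd.DataFrame(data)

# Fit the two-way model with interaction to obtain residuals
model = ols('WeightLoss ~ Diet + Workout + Diet:Workout', data=df).fit()

# Normality: Shapiro-Wilk test on the model residuals
shapiro_stat, shapiro_p = stats.shapiro(model.resid)
print(f'Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}')

# Homogeneity of variance: Levene's test across the diet groups
# (assumption: grouping by Diet only, since several Diet x Workout
# cells contain a single observation)
diet_groups = [g['WeightLoss'].to_numpy() for _, g in df.groupby('Diet')]
levene_stat, levene_p = stats.levene(*diet_groups)
print(f'Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}')
```

In both tests, a p-value above the chosen significance level means there is no evidence against the assumption, which is weaker than confirming it.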

Step 5: Fit the Model

We need to fit a linear model before performing the Two-Way ANOVA test. Use the ols function from the statsmodels library to fit the model.

model = ols('WeightLoss ~ Diet + Workout + Diet:Workout', data=df).fit()

The formula 'WeightLoss ~ Diet + Workout + Diet:Workout' specifies weight loss as the response, with the main effects of Diet and Workout plus their interaction (the Diet:Workout term) as predictors.
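The same model can be written more compactly. In the formula syntax used by statsmodels, a * b expands to a + b + a:b, and wrapping a column in C() marks it explicitly as categorical (string columns are treated as categorical anyway). The sketch below, with the illustrative names model_long and model_short, fits both spellings and compares their fitted values.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Same example data as above, written compactly
data = {
    'Diet': ['A'] * 4 + ['B'] * 4 + ['C'] * 4,
    'Workout': ['Low', 'Medium', 'High'] * 4,
    'WeightLoss': [3, 4, 5, 3.2, 5, 6, 5.2, 6, 5.5, 4, 5.5, 6.2],
}
df = pd.DataFrame(data)

# Explicit form: two main effects plus the interaction term
model_long = ols('WeightLoss ~ Diet + Workout + Diet:Workout', data=df).fit()

# Shorthand form: '*' expands to main effects plus interaction
model_short = ols('WeightLoss ~ C(Diet) * C(Workout)', data=df).fit()

# Both spellings describe the same model, so the fitted values agree
same_fit = np.allclose(model_long.fittedvalues, model_short.fittedvalues)
print('Same fitted values:', same_fit)
print('Residual degrees of freedom:', model_long.df_resid)
```

The C() form is mainly useful when a factor is stored as numbers (for example, diet coded 1, 2, 3), since it would otherwise be treated as a continuous predictor.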

Step 6: Perform the Two-Way ANOVA Test

Now that we have our model fitted, we can perform the Two-Way ANOVA test using the anova_lm function.

anova_results = anova_lm(model, typ=2)
print(anova_results)

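This will output a table with one row per source of variation (Diet, Workout, the Diet:Workout interaction, and Residual) and columns for the sum of squares, degrees of freedom, F-statistic, and p-value. The typ=2 argument requests Type II sums of squares.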
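To see how the columns of this table relate to each other, the sketch below recomputes the F-statistics and p-values by hand. Each F is the mean square of an effect (sum_sq divided by df) divided by the residual mean square, and each p-value is the upper tail of the F distribution. It assumes scipy is installed, and the names f_manual and p_manual are only illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Same example data as above, written compactly
data = {
    'Diet': ['A'] * 4 + ['B'] * 4 + ['C'] * 4,
    'Workout': ['Low', 'Medium', 'High'] * 4,
    'WeightLoss': [3, 4, 5, 3.2, 5, 6, 5.2, 6, 5.5, 4, 5.5, 6.2],
}
df = pd.DataFrame(data)

model = ols('WeightLoss ~ Diet + Workout + Diet:Workout', data=df).fit()
anova_results = anova_lm(model, typ=2)

# Mean square = sum of squares / degrees of freedom, for every row
mean_sq = (anova_results['sum_sq'] / anova_results['df']).astype(float)
ms_resid = mean_sq['Residual']
df_resid = float(anova_results.loc['Residual', 'df'])

# F = effect mean square / residual mean square
effects = anova_results.drop('Residual')
f_manual = mean_sq.drop('Residual') / ms_resid

# p-value = upper tail of the F distribution
p_manual = stats.f.sf(f_manual.to_numpy(),
                      effects['df'].to_numpy(dtype=float), df_resid)

print(pd.DataFrame({'F_manual': f_manual, 'p_manual': p_manual}))
```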

Step 7: Interpret the Results

There are three key columns to look at: sum_sq (the sum of squares attributed to each source), F (the F-statistic), and PR(>F) (the p-value).

  • If the p-value for Diet is less than 0.05, it suggests that the different diet types have a significant effect on weight loss.
  • If the p-value for Workout is less than 0.05, it suggests that the different workout regimes have a significant effect on weight loss.
  • If the p-value for the interaction (Diet:Workout) is less than 0.05, it suggests that the effect of diet on weight loss depends on the workout regime (and vice versa). In that case, interpret the main effects of Diet and Workout with caution.
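The three checks above can also be read off the table programmatically. The following is a minimal sketch that compares each p-value with a significance level of 0.05. The names alpha and decisions are illustrative, and the 0.05 threshold is a convention rather than a requirement.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Same example data as above, written compactly
data = {
    'Diet': ['A'] * 4 + ['B'] * 4 + ['C'] * 4,
    'Workout': ['Low', 'Medium', 'High'] * 4,
    'WeightLoss': [3, 4, 5, 3.2, 5, 6, 5.2, 6, 5.5, 4, 5.5, 6.2],
}
df = pd.DataFrame(data)

model = ols('WeightLoss ~ Diet + Workout + Diet:Workout', data=df).fit()
anova_results = anova_lm(model, typ=2)

alpha = 0.05  # conventional significance level
decisions = {}
for term in ['Diet', 'Workout', 'Diet:Workout']:
    p_value = float(anova_results.loc[term, 'PR(>F)'])
    # True means the effect is statistically significant at level alpha
    decisions[term] = p_value < alpha
    verdict = 'significant' if decisions[term] else 'not significant'
    print(f'{term}: p = {p_value:.4f} ({verdict} at alpha = {alpha})')
```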

Conclusion

In this tutorial, we learned how to perform a Two-Way ANOVA in Python using the statsmodels library. Two-Way ANOVA is a powerful technique that allows us to analyze how two factors impact a dependent variable. It’s crucial to ensure that the assumptions for the Two-Way ANOVA test are met and to carefully interpret the results within the context of your study. Keep in mind that a low p-value suggests a statistically significant effect, but it does not indicate the size or importance of this effect. Therefore, it is also beneficial to perform additional analysis and use domain knowledge to interpret the results.
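As one example of such additional analysis, an effect size can be computed directly from the ANOVA table. The sketch below uses partial eta squared, defined as the effect's sum of squares divided by the sum of the effect and residual sums of squares. This measure is an assumption chosen for illustration; other effect sizes exist, and the name partial_eta_sq is illustrative.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Same example data as above, written compactly
data = {
    'Diet': ['A'] * 4 + ['B'] * 4 + ['C'] * 4,
    'Workout': ['Low', 'Medium', 'High'] * 4,
    'WeightLoss': [3, 4, 5, 3.2, 5, 6, 5.2, 6, 5.5, 4, 5.5, 6.2],
}
df = pd.DataFrame(data)

model = ols('WeightLoss ~ Diet + Workout + Diet:Workout', data=df).fit()
anova_results = anova_lm(model, typ=2)

# Partial eta squared = SS_effect / (SS_effect + SS_residual)
ss = anova_results['sum_sq'].astype(float)
ss_resid = ss['Residual']
ss_effects = ss.drop('Residual')
partial_eta_sq = ss_effects / (ss_effects + ss_resid)

print(partial_eta_sq)
```

Values closer to 1 mean the effect accounts for a larger share of the variance not explained by the other terms.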
