How to Perform Multivariate Normality Tests in Python


Testing for multivariate normality is an important step when working with statistical methods that assume jointly normal data or errors. Examples include multivariate analysis of variance (MANOVA), linear regression (where the assumption applies to the residuals rather than to the raw variables), and some machine learning algorithms such as linear discriminant analysis.

In Python, several libraries, including SciPy and statsmodels, offer functions to perform univariate normality tests. However, the multivariate case, where we have multiple related variables, requires additional consideration. This guide will introduce several methods to perform multivariate normality tests in Python, including visual methods and statistical tests.

Required Libraries

Ensure you have the following libraries installed:

  1. NumPy: A library for numerical operations.
  2. pandas: A library for data manipulation and analysis.
  3. SciPy: A library for scientific and technical computing.
  4. statsmodels: A library for estimating statistical models.
  5. Matplotlib and Seaborn: Libraries for data visualization.
  6. Pingouin: A statistical package for Python that is based on pandas and NumPy.

You can install these libraries using pip:

pip install numpy pandas scipy statsmodels matplotlib seaborn pingouin

Importing the Libraries

After installing, import these libraries:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import pingouin as pg

Generating a Sample Dataset

For this guide, we’ll generate a dataset from a multivariate normal distribution:

# Set the seed for reproducibility
np.random.seed(0)

# Mean values of three variables
mean = [0, 0, 0]

# Covariance matrix
cov = [[1, 0.5, 0.3],
       [0.5, 1, 0.4],
       [0.3, 0.4, 1]]

# Generate the random multivariate dataset
data = np.random.multivariate_normal(mean, cov, size=1000)

# Convert the data into a pandas DataFrame
df = pd.DataFrame(data, columns=['Var1', 'Var2', 'Var3'])

This generates 1000 data points from a multivariate normal distribution with the specified mean vector and covariance matrix.
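As a quick sanity check, the sample mean and sample covariance of the generated data can be compared with the values passed to the generator. The following is a minimal sketch that rebuilds the same dataset; the variable names `max_mean_error` and `max_cov_error` are illustrative, and how close is "close enough" is a judgment call rather than a formal criterion.

```python
import numpy as np

mean = np.array([0.0, 0.0, 0.0])
cov = np.array([[1.0, 0.5, 0.3],
                [0.5, 1.0, 0.4],
                [0.3, 0.4, 1.0]])

# A valid covariance matrix is symmetric with no negative eigenvalues
is_symmetric = np.allclose(cov, cov.T)
min_eig = np.linalg.eigvalsh(cov).min()

# Rebuild the dataset with the same seed and legacy NumPy API as above
np.random.seed(0)
data = np.random.multivariate_normal(mean, cov, size=1000)

# Sample estimates (rows are observations, columns are variables)
sample_mean = data.mean(axis=0)
sample_cov = np.cov(data, rowvar=False)

# Largest absolute gap between the estimates and the specified values
max_mean_error = np.abs(sample_mean - mean).max()
max_cov_error = np.abs(sample_cov - cov).max()

print(f"Symmetric: {is_symmetric}, smallest eigenvalue: {min_eig:.3f}")
print(f"Largest mean error: {max_mean_error:.3f}")
print(f"Largest covariance error: {max_cov_error:.3f}")
```

With 1000 observations, both gaps should be small relative to the unit variances, since sampling error shrinks roughly with the square root of the sample size.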

Visual Methods for Checking Multivariate Normality

Pair Plot

A pair plot shows both the distribution of each variable and the relationship between every pair of variables:

sns.pairplot(df)
plt.show()

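In a pair plot of multivariate normally distributed data, you’d expect to see a roughly bell-shaped histogram for each variable on the diagonal and a roughly elliptical point cloud in each scatter plot. Pairwise views alone cannot confirm joint normality, so treat the plot as a first screen.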
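A pair plot only shows one or two variables at a time. One commonly used complement, which is not part of the walkthrough above, is a chi-square Q-Q plot of squared Mahalanobis distances: for multivariate normal data with p variables, these distances approximately follow a chi-square distribution with p degrees of freedom. The sketch below is one way to compute the plotted points; the helper names `chi2_qq_points` and `plot_chi2_qq` are hypothetical, and the plotting positions (i - 0.5) / n are one convention among several.

```python
import numpy as np
from scipy import stats


def chi2_qq_points(X):
    """Return (chi-square quantiles, sorted squared Mahalanobis distances)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    centered = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # Squared Mahalanobis distance of each row from the sample mean
    d2 = np.einsum('ij,jk,ik->i', centered, cov_inv, centered)
    # Plotting positions mapped through the chi-square(p) quantile function
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = stats.chi2.ppf(probs, df=p)
    return theoretical, np.sort(d2)


def plot_chi2_qq(X):
    """Scatter observed distances against chi-square quantiles."""
    # Imported here so the numeric helper works without a plotting backend
    import matplotlib.pyplot as plt
    theo, obs = chi2_qq_points(X)
    upper = max(theo.max(), obs.max())
    plt.scatter(theo, obs, s=8)
    plt.plot([0, upper], [0, upper], color='red')  # reference line y = x
    plt.xlabel('Chi-square quantiles')
    plt.ylabel('Squared Mahalanobis distance')
    plt.show()


# Same simulated data as in this guide
np.random.seed(0)
cov = [[1, 0.5, 0.3], [0.5, 1, 0.4], [0.3, 0.4, 1]]
X = np.random.multivariate_normal([0, 0, 0], cov, size=1000)

theo, obs = chi2_qq_points(X)
print(f"Q-Q correlation: {np.corrcoef(theo, obs)[0, 1]:.4f}")
```

Calling `plot_chi2_qq(X)` draws the plot. Points that stay close to the reference line are consistent with multivariate normality, while systematic curvature, especially in the upper tail, suggests heavy tails or outliers.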

Statistical Methods for Checking Multivariate Normality

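Combined Univariate Shapiro-Wilk Tests

A simple first check is to apply the univariate Shapiro-Wilk test to each variable in the dataset and combine the p-values using Fisher’s method. This examines the marginal distributions only: marginal normality is necessary for multivariate normality but does not guarantee it, so this is a screening heuristic rather than a true multivariate test: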

from scipy.stats import combine_pvalues

# Perform the Shapiro-Wilk test on each variable and collect the p-values
p_values = [stats.shapiro(df[col]).pvalue for col in df.columns]
_, combined_p_value = combine_pvalues(p_values)
print(f'Combined p-value from Shapiro-Wilk tests: {combined_p_value}')

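If the combined p-value is less than the chosen alpha level (typically 0.05), there is evidence that at least one variable is not normally distributed, which rules out multivariate normality. Because Fisher’s method assumes independent p-values and the variables here are correlated, treat the combined p-value as approximate.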
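To make the combination step concrete, Fisher’s method sums the logs of the k individual p-values into the statistic -2 * sum(log(p_i)), which follows a chi-square distribution with 2k degrees of freedom when the p-values are independent and each null hypothesis holds. The sketch below reproduces the calculation by hand and compares it with `combine_pvalues`; the three p-values are made-up numbers for illustration, not results from the dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical p-values, one per variable (illustration only)
p_values = np.array([0.20, 0.50, 0.80])
k = len(p_values)

# Fisher's statistic: -2 times the sum of the log p-values
fisher_stat = -2.0 * np.sum(np.log(p_values))

# Upper-tail probability of a chi-square distribution with 2k degrees of freedom
combined_manual = stats.chi2.sf(fisher_stat, df=2 * k)

# The same calculation through SciPy
stat_scipy, combined_scipy = stats.combine_pvalues(p_values, method='fisher')

print(f"Manual: statistic={fisher_stat:.4f}, p-value={combined_manual:.4f}")
print(f"SciPy:  statistic={stat_scipy:.4f}, p-value={combined_scipy:.4f}")
```

The two lines of output should agree, which shows that the combined p-value depends only on the product of the individual p-values and on how many tests were combined.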

Henze-Zirkler’s Multivariate Normality Test

Pingouin implements Henze-Zirkler’s test for multivariate normality, a popular choice due to its power:

hz_result = pg.multivariate_normality(df, alpha=0.05)
print(f"Henze-Zirkler's Test: p-value={hz_result.pval}, normal={hz_result.normal}")

The null hypothesis of Henze-Zirkler’s test is that the data follow a multivariate normal distribution. If the p-value is less than alpha (typically 0.05), the null hypothesis is rejected and the `normal` field of the result is False.

Conclusion

In this article, we discussed how to perform a multivariate normality test in Python using both visual and statistical methods. Keep in mind that no single check settles the question, and that formal tests become very sensitive at large sample sizes, so they can reject normality for deviations too small to matter. Most real-world data are not exactly normally distributed, but many statistical techniques are robust to violations of normality. Therefore, minor deviations from normality are generally not a concern.
