How to Test for Normality in Python

Testing for normality is a crucial step in many statistical analyses because several statistical techniques assume that data is normally distributed. In Python, several libraries, including SciPy and statsmodels, offer functions to test for normality. In this guide, we will discuss various methods to perform a normality test in Python, including visual methods (like Q-Q plots) and statistical tests (like the Shapiro-Wilk test, the Anderson-Darling test, and the Kolmogorov-Smirnov test).

Required Libraries

Ensure you have the following libraries installed:

  1. NumPy: A library for numerical operations.
  2. pandas: A library for data manipulation and analysis.
  3. SciPy: A library for scientific and technical computing.
  4. statsmodels: A library for estimating statistical models.
  5. Matplotlib and Seaborn: Libraries for data visualization.

You can install these libraries using pip:

pip install numpy pandas scipy statsmodels matplotlib seaborn

Importing the Libraries

After installing, import these libraries:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

Generating a Sample Dataset

For this guide, we’ll generate a dataset from a normal distribution:

# Set the seed for reproducibility
np.random.seed(0)

# Generate a dataset from a normal distribution
data = np.random.normal(loc=0, scale=1, size=1000)

This generates 1000 data points from a normal distribution with a mean (loc) of 0 and a standard deviation (scale) of 1.

Visual Methods for Checking Normality

Histogram

A histogram shows the shape of the data's distribution:

sns.histplot(data, kde=True)
plt.title('Histogram')
plt.show()

In a histogram of normally distributed data, you’d expect to see a bell curve shape.
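A histogram can be complemented with two summary numbers that describe the same shape: skewness and excess kurtosis. The snippet below is a minimal sketch, not part of the original walkthrough; the `sample` variable is regenerated here (with NumPy's `default_rng`, an assumption) so the code runs on its own.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, regenerated so this snippet is self-contained
rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=1000)

# For a normal distribution both quantities are 0 in theory
skewness = stats.skew(sample)
excess_kurtosis = stats.kurtosis(sample)  # Fisher definition (normal -> 0)

print(f"skewness={skewness:.3f}, excess kurtosis={excess_kurtosis:.3f}")
```

Values near zero are consistent with a symmetric, bell-shaped histogram; clearly positive or negative skewness points to a longer right or left tail.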

Q-Q Plot

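A Q-Q (quantile-quantile) plot is a graphical tool for assessing whether a dataset plausibly comes from a given distribution, here the normal distribution. It plots the sample quantiles against the theoretical quantiles:

sm.qqplot(data, line='45', fit=True)
plt.title('Q-Q Plot')
plt.show()

With fit=True, the sample is standardized using the fitted mean and standard deviation, so normally distributed data falls close to the 45-degree line regardless of its original location and scale. Without it, the 45-degree line is a valid reference only for data that is already standard normal, as in this example. Systematic curvature, or points drifting away from the line at the ends, indicates skewness or heavy tails.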
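The straightness of a Q-Q plot can also be summarized numerically. The sketch below uses `scipy.stats.probplot`, which returns the Q-Q points together with a least-squares line and its correlation coefficient. The two samples (`normal_sample`, `skewed_sample`) are illustrative assumptions, not data from this guide.

```python
import numpy as np
from scipy import stats

# Hypothetical samples: one normal (mean 5, std 2), one clearly skewed
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=5, scale=2, size=1000)
skewed_sample = rng.exponential(scale=1.0, size=1000)

# probplot returns ((theoretical, ordered), (slope, intercept, r))
_, (slope, intercept, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

# For normal data, slope and intercept approximate the std and the mean
print(f"normal sample: slope={slope:.2f}, intercept={intercept:.2f}, r={r_normal:.4f}")
print(f"skewed sample: r={r_skewed:.4f}")
```

A correlation close to 1 means the points lie near a straight line; the skewed sample should give a visibly lower value.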

Statistical Methods for Checking Normality

Shapiro-Wilk Test

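The Shapiro-Wilk test is one of the most powerful normality tests, especially for small and moderate sample sizes. According to the SciPy documentation, for samples larger than 5000 the W statistic is accurate but the p-value may not be. The test is run as follows: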

W, p_value = stats.shapiro(data)
print(f'Shapiro-Wilk Test: W={W}, p-value={p_value}')

The null hypothesis of the Shapiro-Wilk test is that the data is drawn from a normal distribution. If the p-value is less than the chosen alpha level (typically 0.05), the null hypothesis is rejected and there is evidence that the data does not come from a normally distributed population. A p-value above alpha means only that the test found no significant evidence against normality.
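To see the decision rule in action, it helps to run the test on one sample that is normal and one that is not. The following is a minimal sketch; `shapiro_decision` is a hypothetical helper introduced only for illustration, and both samples are assumptions.

```python
import numpy as np
from scipy import stats

def shapiro_decision(sample, alpha=0.05):
    """Hypothetical helper: return (W, p_value, reject) for Shapiro-Wilk."""
    w_stat, p_value = stats.shapiro(sample)
    return w_stat, p_value, bool(p_value < alpha)

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=200)       # drawn from a normal distribution
skewed_sample = rng.exponential(size=200)  # strongly right-skewed

w_normal, p_normal, reject_normal = shapiro_decision(normal_sample)
w_skewed, p_skewed, reject_skewed = shapiro_decision(skewed_sample)

print(f"normal: W={w_normal:.4f}, p={p_normal:.4g}, reject={reject_normal}")
print(f"skewed: W={w_skewed:.4f}, p={p_skewed:.4g}, reject={reject_skewed}")
```

The exponential sample should be rejected decisively, while the normal sample will usually, though not always, produce a p-value above alpha.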

Anderson-Darling Test

The Anderson-Darling test is another powerful statistical test for checking normality:

result = stats.anderson(data)
print(f'Anderson-Darling Test: statistic={result.statistic}, critical_values={result.critical_values}, significance_level={result.significance_level}')

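By default, stats.anderson tests against the normal distribution and returns the test statistic together with a list of critical values, one for each significance level (given in percent). If the statistic is larger than the critical value for a given significance level, the null hypothesis that the data come from a normal distribution can be rejected at that level.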
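Because the result holds several critical values rather than a single p-value, the comparison is easiest to read in a loop. This is a sketch under the assumption that the result exposes `statistic`, `critical_values`, and `significance_level` as in the call above; the skewed sample is illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical non-normal sample so the rejections are visible
rng = np.random.default_rng(0)
skewed_sample = rng.exponential(size=500)

result = stats.anderson(skewed_sample, dist="norm")

decisions = []
for crit, sig in zip(result.critical_values, result.significance_level):
    # Reject normality at this level if the statistic exceeds the critical value
    reject = bool(result.statistic > crit)
    decisions.append((sig, crit, reject))
    verdict = "reject" if reject else "fail to reject"
    print(f"{sig:>4}% level: critical={crit:.3f}, statistic={result.statistic:.3f} -> {verdict}")
```

Each row answers the question for one significance level, so a statistic that sits between two critical values is rejected at the looser level but not at the stricter one.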

Kolmogorov-Smirnov Test

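The Kolmogorov-Smirnov test compares the empirical distribution of a sample with a fully specified continuous distribution. Called with 'norm' and no further arguments, kstest compares the sample with the standard normal distribution (mean 0, standard deviation 1), which matches the dataset generated above: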

D, p_value = stats.kstest(data, 'norm')
print(f'Kolmogorov-Smirnov Test: D={D}, p-value={p_value}')

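The null hypothesis of the K-S test is that the data follow the specified distribution, in this case the standard normal. If the p-value is less than alpha (typically 0.05), the null hypothesis is rejected. For data with a different mean or standard deviation, pass those parameters with args=(mean, std); otherwise normal data will be rejected simply because it is not standard normal. If the parameters are estimated from the same sample, the standard K-S p-value is no longer valid, and the Lilliefors variant of the test is more appropriate.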
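The reference distribution matters more than it may appear. The sketch below applies the test to normal data that is not standard normal, in three ways: against the default standard normal, against the known true parameters, and with the Lilliefors test from statsmodels, which is designed for the case where the mean and standard deviation are estimated from the sample. The sample and variable names are assumptions for illustration.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

# Hypothetical sample: normal, but with mean 5 and standard deviation 2
rng = np.random.default_rng(0)
shifted_sample = rng.normal(loc=5, scale=2, size=500)

# 1) Default reference is N(0, 1): rejects although the data is normal
d_naive, p_naive = stats.kstest(shifted_sample, "norm")

# 2) Reference with the known true parameters
d_known, p_known = stats.kstest(shifted_sample, "norm", args=(5, 2))

# 3) Lilliefors: parameters estimated from the sample itself
d_lillie, p_lillie = lilliefors(shifted_sample, dist="norm")

print(f"K-S vs N(0, 1):   D={d_naive:.3f}, p={p_naive:.3g}")
print(f"K-S vs N(5, 2):   D={d_known:.3f}, p={p_known:.3g}")
print(f"Lilliefors test:  D={d_lillie:.3f}, p={p_lillie:.3g}")
```

The first call rejects only because of the mismatch in location and scale, which is why a bare `kstest(data, 'norm')` is safe only for data already on the standard normal scale.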

Conclusion

In this article, we discussed how to test for normality in Python using both visual and statistical methods. No statistical test is fully decisive, so tests should be used in conjunction with graphical methods and, ideally, domain knowledge. When a test fails to reject the null hypothesis, this does not prove that the data is normal; it only means the test found no significant evidence against normality, which is common with small samples where the tests have low power. Conversely, these tests are sensitive to sample size: with large datasets they can detect practically negligible deviations from normality, leading to rejection of the null hypothesis even when the data is close enough to normal for practical purposes.
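The effect of sample size can be checked with a small simulation. The sketch below repeatedly draws from a Student's t distribution with 10 degrees of freedom (a mildly heavy-tailed stand-in for "almost normal" data, chosen here as an assumption) and records how often the Shapiro-Wilk test rejects at two sample sizes. The helper `rejection_rate` is hypothetical.

```python
import numpy as np
from scipy import stats

def rejection_rate(n, trials=200, alpha=0.05, seed=0):
    """Hypothetical helper: share of trials in which Shapiro-Wilk rejects."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(trials):
        # Mildly heavy-tailed data: Student's t with 10 degrees of freedom
        sample = rng.standard_t(df=10, size=n)
        _, p_value = stats.shapiro(sample)
        rejections += int(p_value < alpha)
    return rejections / trials

rate_small = rejection_rate(n=30)
rate_large = rejection_rate(n=3000)

print(f"n=30:   rejected in {rate_small:.0%} of trials")
print(f"n=3000: rejected in {rate_large:.0%} of trials")
```

The same underlying distribution is rejected far more often at the larger sample size, which is why a rejection on a large dataset should be weighed against how much the deviation matters in practice.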
