# How to Perform a Shapiro-Wilk Test in Python

The Shapiro-Wilk test is a popular statistical test for checking whether a sample is consistent with a normal distribution. It’s often used in data analysis to check the assumptions of other statistical methods before choosing among them. This article walks through performing a Shapiro-Wilk test in Python with NumPy and SciPy, using Matplotlib and seaborn for visualization.

## Importing the Libraries

First, import the required libraries.

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

## Generating a Sample Dataset

For demonstration purposes, let’s generate a random dataset with a normal distribution using NumPy:

# Set the seed for reproducibility
np.random.seed(0)

# Generate a sample dataset
data = np.random.normal(loc=0, scale=1, size=100)

In the above code, we’ve generated a dataset of size 100 from a normal distribution with a mean (loc) of 0 and a standard deviation (scale) of 1. Calling np.random.seed(0) fixes the state of the random number generator, so the same values are produced on every run.
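
As a quick sanity check, the sample’s summary statistics can be compared with the parameters passed to the generator. The sketch below regenerates the same dataset so that it runs on its own; the names sample_mean and sample_std are illustrative only.

```python
import numpy as np

# Recreate the dataset from above so this snippet is self-contained
np.random.seed(0)
data = np.random.normal(loc=0, scale=1, size=100)

# The sample mean and standard deviation should be near loc=0 and scale=1,
# but not exactly equal because of sampling variation
sample_mean = data.mean()
sample_std = data.std(ddof=1)  # ddof=1 uses the n - 1 denominator

print(f'Shape: {data.shape}')
print(f'Sample mean: {sample_mean:.3f}, sample std: {sample_std:.3f}')
```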

## Visualizing the Dataset

Before performing the Shapiro-Wilk test, it’s often helpful to visualize the data. We’ll use a histogram and a Q-Q plot for this:

# Plot a histogram
plt.figure(figsize=(10, 5))
sns.histplot(data, kde=True)
plt.title('Histogram')
plt.show()

# Plot a Q-Q plot
plt.figure(figsize=(10, 5))
stats.probplot(data, plot=plt)
plt.title('Q-Q Plot')
plt.show()

The histogram shows the shape of the data distribution. The Q-Q plot (quantile-quantile plot) compares the sample’s ordered values with the quantiles of a theoretical distribution; stats.probplot uses the normal distribution by default. If the points fall close to the straight reference line, the data is consistent with a normal distribution. Real samples rarely sit on the line perfectly, and some scatter at the tails is expected.
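
The Q-Q comparison can also be summarized numerically. When no plot argument is given, stats.probplot returns the Q-Q coordinates together with a least-squares line fit, and the correlation coefficient r of that fit gives a rough indication of how straight the plot is. The following is a minimal sketch on a freshly generated sample; the variable names are illustrative, and r is only an informal summary, not a replacement for the test.

```python
import numpy as np
from scipy import stats

# Illustrative sample; any one-dimensional numeric array works here
rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=100)

# Without plot=..., probplot returns the Q-Q points and a fitted line
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
    sample, dist='norm'
)

# r close to 1 means the Q-Q points lie close to a straight line
print(f'slope={slope:.3f}, intercept={intercept:.3f}, r={r:.4f}')
```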

## Performing the Shapiro-Wilk Test

Now, we’ll perform the Shapiro-Wilk test using the scipy.stats.shapiro function:

# Perform the Shapiro-Wilk test
stat, p_value = stats.shapiro(data)

# Print the test statistic and p-value
print(f'Statistic: {stat}, P-Value: {p_value}')

The scipy.stats.shapiro function returns two values:

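1. The test statistic (stat), often written W, which lies between 0 and 1 and is close to 1 for a sample from a normal distribution.
2. The p-value (p_value), which is the probability of obtaining a test statistic at least as extreme as the observed one if the data really came from a normal distribution. It is not the probability that the data is normal.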
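
In reasonably recent SciPy releases, the returned object can be unpacked as a tuple or read through its statistic and pvalue attributes. The sketch below assumes such a version is installed and uses an illustrative sample rather than the dataset above.

```python
import numpy as np
from scipy import stats

# Illustrative sample drawn from a standard normal distribution
rng = np.random.default_rng(0)
sample = rng.normal(size=100)

result = stats.shapiro(sample)

# Tuple unpacking and attribute access refer to the same two values
w_stat, p_val = result
print(f'W via attribute: {result.statistic:.4f}, via unpacking: {w_stat:.4f}')
print(f'p via attribute: {result.pvalue:.4f}, via unpacking: {p_val:.4f}')
```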

## Interpreting the Results

To interpret the results, choose a significance level beforehand, usually denoted alpha (α). A common choice is 0.05. The null hypothesis of the test is that the data was drawn from a normal distribution:

• If the p-value is less than α, you reject the null hypothesis and conclude that the data is unlikely to have come from a normal distribution.
• If the p-value is greater than or equal to α, you fail to reject the null hypothesis. This means the test found no significant evidence against normality; it does not prove that the data is normal.

alpha = 0.05
if p_value < alpha:
    print('The null hypothesis can be rejected. The data does not appear to come from a normal distribution.')
else:
    print('The null hypothesis cannot be rejected. There is no significant evidence against normality.')
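
To see both outcomes side by side, the decision rule can be wrapped in a small helper and applied to one normal and one clearly non-normal sample. This is a sketch only: shapiro_decision is a hypothetical helper name, not part of SciPy, and the exponential sample is chosen simply because it is strongly skewed.

```python
import numpy as np
from scipy import stats


def shapiro_decision(sample, alpha=0.05):
    """Run the Shapiro-Wilk test and report whether normality is rejected."""
    stat, p_value = stats.shapiro(sample)
    return stat, p_value, bool(p_value < alpha)


rng = np.random.default_rng(0)
samples = {
    'normal': rng.normal(loc=0, scale=1, size=200),
    'exponential': rng.exponential(scale=1.0, size=200),
}

decisions = {}
for name, sample in samples.items():
    stat, p_value, reject = shapiro_decision(sample)
    decisions[name] = reject
    verdict = 'reject normality' if reject else 'no evidence against normality'
    print(f'{name}: W={stat:.4f}, p={p_value:.4g} -> {verdict}')
```

The exponential sample should be rejected decisively. The normal sample will usually not be rejected, although about 5% of truly normal samples are rejected at α = 0.05 by chance.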

## Conclusion

The Shapiro-Wilk test is a valuable tool for checking the normality of a dataset, but no statistical test is definitive. Like other normality tests, it gains power as the sample size grows: with a large sample, it can reject the null hypothesis because of small departures from normality that have little practical importance. With a small sample the opposite holds, and the test may fail to detect substantial non-normality.
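
The effect of sample size can be illustrated by testing samples of different sizes drawn from the same mildly skewed distribution. The sketch below uses a gamma distribution with shape 20 as an assumed stand-in for "reasonably normal" data; the distribution, the seed, and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

p_values = {}
for n in (30, 300, 5000):
    # Gamma with shape 20 is only mildly right-skewed (skewness about 0.45)
    sample = rng.gamma(shape=20.0, scale=1.0, size=n)
    stat, p_value = stats.shapiro(sample)
    p_values[n] = p_value
    print(f'n={n}: W={stat:.4f}, p={p_value:.4g}')
```

The largest sample should be rejected decisively, while the smallest often is not, even though all three come from the same distribution.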

Always consider the results of the Shapiro-Wilk test in the context of your data and alongside other methods, such as visualizations and domain knowledge.