
The Shapiro-Wilk test is a popular statistical test for checking the normality of a dataset. It’s often used in data analysis to verify assumptions and select appropriate statistical methods. This article will provide a detailed guide on how to perform a Shapiro-Wilk test in Python, using popular libraries like NumPy and SciPy.
Importing the Libraries
First, import the required libraries.
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
Generating a Sample Dataset
For demonstration purposes, let’s generate a random dataset with a normal distribution using NumPy:
# Set the seed for reproducibility
np.random.seed(0)
# Generate a sample dataset
data = np.random.normal(loc=0, scale=1, size=100)
In the above code, we’ve generated a dataset of size 100 from a normal distribution with a mean (loc) of 0 and a standard deviation (scale) of 1. The call to np.random.seed(0) fixes the random seed so that the same values are generated on every run.
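As a quick sanity check on a generated sample, the sketch below compares the sample mean and standard deviation with the requested loc and scale. It uses NumPy’s newer np.random.default_rng generator in place of np.random.seed; that substitution and the variable names are our own assumptions, not part of the walkthrough above.

```python
import numpy as np

# Assumption: a seeded Generator stands in for np.random.seed(0)
rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=100)

sample_mean = sample.mean()
sample_std = sample.std(ddof=1)  # ddof=1 gives the sample standard deviation

print(f'Mean: {sample_mean:.3f}, Std: {sample_std:.3f}')
```

With only 100 points, the estimates will be near 0 and 1 but not exactly equal to them.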
Visualizing the Dataset
Before performing the Shapiro-Wilk test, it’s often helpful to visualize the data. We’ll use a histogram and a Q-Q plot for this:
# Plot a histogram
plt.figure(figsize=(10, 5))
sns.histplot(data, kde=True)
plt.title('Histogram')
plt.show()
# Plot a Q-Q plot
plt.figure(figsize=(10, 5))
stats.probplot(data, plot=plt)
plt.title('Q-Q Plot')
plt.show()
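The histogram provides a graphical representation of the data distribution. The Q-Q plot (quantile-quantile plot) is another graphical tool to help us assess if a dataset follows a particular theoretical distribution. Here it plots the sorted sample values against the corresponding quantiles of a normal distribution. If the points fall close to the straight reference line, the data is consistent with a normal distribution; systematic curvature, or points that stray from the line at the ends, suggests skewness or heavy tails.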
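The visual impression from a Q-Q plot can also be summarized numerically. When called without a plot argument, stats.probplot returns the Q-Q points together with a least-squares fit, including the correlation coefficient r of the points with the fitted line. The sketch below compares r for a normal sample and a deliberately skewed (exponential) sample; both samples and all variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0, scale=1, size=100)
skewed_sample = rng.exponential(scale=1, size=100)  # assumed non-normal example

# probplot returns ((theoretical, ordered), (slope, intercept, r))
_, (slope_n, intercept_n, r_normal) = stats.probplot(normal_sample, dist='norm')
_, (slope_s, intercept_s, r_skewed) = stats.probplot(skewed_sample, dist='norm')

print(f'Normal sample r: {r_normal:.4f}')
print(f'Skewed sample r: {r_skewed:.4f}')
```

An r value very close to 1 means the points lie nearly on a straight line; the skewed sample should typically give a noticeably lower value.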
Performing the Shapiro-Wilk Test
Now, we’ll perform the Shapiro-Wilk test using the scipy.stats.shapiro function:
# Perform the Shapiro-Wilk test
stat, p_value = stats.shapiro(data)
# Print the test statistic and p-value
print(f'Statistic: {stat}, P-Value: {p_value}')
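The scipy.stats.shapiro function returns two values:
- The test statistic (stat), usually written W, which lies between 0 and 1 and should be close to 1 for a sample from a normal distribution.
- The p-value (p_value), which is the probability of obtaining a test statistic at least as extreme as the one observed if the data really did come from a normal distribution. It is not the probability that the data is normal.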
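In recent SciPy versions, the value returned by stats.shapiro is a named result object, so the two values can be read by attribute as well as unpacked. The following is a minimal sketch under that assumption; the sample and variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=100)

result = stats.shapiro(sample)

# The result can be unpacked like a tuple or read by attribute name
w_stat, p_val = result
print(f'W: {result.statistic:.4f}, p-value: {result.pvalue:.4f}')
```

Attribute access tends to read more clearly when the result is passed around or stored for later reporting.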
Interpreting the Results
To interpret the results, you typically choose a significance level beforehand, which is often denoted as alpha (α). A common choice is 0.05. The null hypothesis of the test is that the data was drawn from a normal distribution:
- If the p-value is less than α, you reject the null hypothesis and conclude that the data is unlikely to have come from a normal distribution.
- If the p-value is greater than or equal to α, you fail to reject the null hypothesis. This means the test found no significant evidence against normality; it does not prove that the data is normally distributed.
alpha = 0.05
# Compare the p-value with the chosen significance level
if p_value < alpha:
    print('The null hypothesis can be rejected. The data does not appear to come from a normal distribution.')
else:
    print('The null hypothesis cannot be rejected. The data is consistent with a normal distribution.')
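To see both branches of the decision rule in action, the sketch below wraps it in a small helper and applies it to a normal sample and a clearly non-normal one. The helper check_normality, the exponential sample, and the sample sizes are hypothetical choices made for illustration.

```python
import numpy as np
from scipy import stats

def check_normality(values, alpha=0.05):
    """Hypothetical helper: run Shapiro-Wilk and return (W, p, reject_null)."""
    stat, p_value = stats.shapiro(values)
    return stat, p_value, bool(p_value < alpha)

rng = np.random.default_rng(0)
samples = {
    'normal': rng.normal(loc=0, scale=1, size=200),
    'exponential': rng.exponential(scale=1, size=200),  # strongly skewed
}

results = {name: check_normality(values) for name, values in samples.items()}
for name, (stat, p_value, reject) in results.items():
    verdict = 'reject normality' if reject else 'no evidence against normality'
    print(f'{name}: W={stat:.4f}, p={p_value:.4g} -> {verdict}')
```

The exponential sample should be rejected decisively. For the normal sample, remember that a test at α = 0.05 will still reject a truly normal sample about 5% of the time.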
Conclusion
The Shapiro-Wilk test is a valuable tool for checking the normality of a dataset. However, it’s important to note that no statistical test is definitive. The Shapiro-Wilk test, like other normality tests, gains power as the sample size grows: with a large sample it can reject the null hypothesis for deviations from normality that are too small to matter in practice, while with a small sample it may fail to detect substantial deviations.
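The effect of sample size can be illustrated with data that is close to normal but not exactly so. The sketch below assumes a Student’s t distribution with 5 degrees of freedom as the "nearly normal" source and runs the test on progressively larger subsets; the distribution, the subset sizes, and the variable names are all illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Assumption: Student's t with 5 degrees of freedom as a bell-shaped
# distribution with heavier tails than a true normal
population = rng.standard_t(df=5, size=4000)

p_values = {}
for n in (30, 300, 4000):
    subset = population[:n]
    stat, p_value = stats.shapiro(subset)
    p_values[n] = p_value
    print(f'n={n:>4}: W={stat:.4f}, p={p_value:.4g}')
```

The p-value typically shrinks as n grows, and at the largest size the test should reject normality even though a histogram of the data would still look bell-shaped. The exact values depend on the seed.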
Always consider the results of the Shapiro-Wilk test in the context of your data and alongside other methods, such as visualizations and domain knowledge.