
Introduction
Measuring the variability and dispersion within a dataset is a common and crucial task for data scientists and statisticians. The standard deviation and variance are popular measures, but both are sensitive to outliers. The Median Absolute Deviation (MAD) is a more robust measure of dispersion. In this guide, we explore the concept of MAD and learn, step by step, how to calculate it using Python.
Understanding Median Absolute Deviation
MAD is a measure of statistical dispersion, defined as the median of the absolute deviations from the median of a dataset. In simpler terms, it measures how far the values in a dataset typically lie from the median. The formula for MAD is:
MAD = median(|Xi - median(X)|)
Where:
- Xi represents each value in the dataset
- median(X) is the median of the dataset
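To make the formula concrete, here is a minimal sketch that applies each step to a small hypothetical sample containing one outlier. The sample values and variable names are illustrative assumptions, not part of the dataset used later in this guide.

```python
import numpy as np

# Hypothetical sample: four small values and one outlier
x = np.array([1, 2, 3, 4, 100])

center = np.median(x)           # median(X) -> 3.0
abs_dev = np.abs(x - center)    # |Xi - median(X)| -> [2, 1, 0, 1, 97]
mad = np.median(abs_dev)        # median of the absolute deviations -> 1.0
std = np.std(x)                 # population standard deviation, for comparison

print(f"median = {center}, MAD = {mad}, std = {std:.2f}")
```

The outlier barely affects MAD here, while the standard deviation is dominated by it.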
Data Preparation
To compute MAD, you need a dataset to work with. You can use real-world data or create synthetic data. In this example, we will create a small synthetic dataset and store it in a pandas DataFrame:
import pandas as pd
# Create a DataFrame with sample data
data = {'Values': [3, 4, 5, 5, 2, 3, 4.5, 5.2, 7, 2.8, 4.9]}
df = pd.DataFrame(data)
Implementing MAD Calculation
Now, let’s create a function that calculates MAD using the formula above.
import numpy as np
def calculate_mad(data):
    """
    Calculate the Median Absolute Deviation (MAD)
    :param data: list of values
    :return: MAD
    """
    # Calculate the median of the data
    median = np.median(data)
    # Calculate the absolute deviations from the median
    absolute_deviations = [np.abs(x - median) for x in data]
    # Calculate MAD
    mad = np.median(absolute_deviations)
    return mad
We can then apply the function to the sample data:
values = df['Values'].tolist()
mad = calculate_mad(values)
print(f'MAD: {mad}')
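The list comprehension can also be replaced by array arithmetic, which tends to be faster on large inputs. The following is a minimal sketch of that idea. The name calculate_mad_vectorized and the empty-input check are assumptions added for illustration.

```python
import numpy as np

def calculate_mad_vectorized(data):
    """Hypothetical vectorized variant of calculate_mad."""
    arr = np.asarray(data, dtype=float)
    if arr.size == 0:
        # Assumption: treat empty input as an error instead of returning NaN
        raise ValueError("data must contain at least one value")
    # Subtract the median from every element, then take the median of the absolute values
    return float(np.median(np.abs(arr - np.median(arr))))

sample = [3, 4, 5, 5, 2, 3, 4.5, 5.2, 7, 2.8, 4.9]
print(f"MAD: {calculate_mad_vectorized(sample):.2f}")
```

For the sample values, the median is 4.5 and the MAD is approximately 0.7, which should match the result of the loop-based function.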
Leveraging Scikit-learn
scikit-learn doesn’t have a built-in function for MAD, but its robust_scale function can handle the centering step. By default, robust_scale subtracts the median and divides by the interquartile range (IQR), not by MAD. With scaling turned off, it only subtracts the median, so the median of the absolute values of its output is the MAD.
from sklearn.preprocessing import robust_scale
# Note: with_scaling=False makes robust_scale subtract the median without dividing by the IQR
centered = robust_scale(values, with_scaling=False)
mad = np.median(np.abs(centered))
print(f'MAD (using scikit-learn): {mad}')
Using Pandas
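Pandas does not provide a built-in method for the median absolute deviation. Older versions had a mad() method on Series and DataFrame, but it computed the mean absolute deviation (the mean of the absolute deviations from the mean). That method was deprecated in pandas 1.5 and removed in pandas 2.0. For data stored in a DataFrame, MAD can be computed by chaining the median() and abs() methods:
mad = (df['Values'] - df['Values'].median()).abs().median()
print(f'MAD (using pandas): {mad}')
Note: This uses the median at both steps, so it gives the same result as the function we created earlier.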
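The gap between the mean-based and the median-based statistic is easiest to see when an outlier is present. The following is a minimal sketch that computes both with basic Series operations. The helper names mean_abs_dev and median_abs_dev and the outlier value 100 are illustrative assumptions.

```python
import pandas as pd

def mean_abs_dev(s: pd.Series) -> float:
    """Mean absolute deviation: mean of |x - mean(x)|."""
    return float((s - s.mean()).abs().mean())

def median_abs_dev(s: pd.Series) -> float:
    """Median absolute deviation: median of |x - median(x)|."""
    return float((s - s.median()).abs().median())

values = pd.Series([3, 4, 5, 5, 2, 3, 4.5, 5.2, 7, 2.8, 4.9])
# Hypothetical outlier appended to the sample values
with_outlier = pd.concat([values, pd.Series([100.0])], ignore_index=True)

for name, s in [("original", values), ("with outlier", with_outlier)]:
    print(f"{name}: mean abs dev = {mean_abs_dev(s):.2f}, "
          f"median abs dev = {median_abs_dev(s):.2f}")
```

On this sample, the single outlier moves the mean-based figure far more than the median-based one.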
Conclusion
In this guide, we covered the concept of Median Absolute Deviation (MAD), its importance as a robust measure of dispersion, and several ways to calculate it in Python: a custom function, scikit-learn’s robust_scale, and pandas. MAD is especially useful for datasets with outliers, which can distort other dispersion metrics such as the standard deviation.