How to Calculate Median Absolute Deviation in Python

Introduction

Measuring the variability and dispersion within a dataset is a common and crucial task for data scientists and statisticians. The standard deviation and variance are popular measures, but they are sensitive to outliers. The Median Absolute Deviation (MAD) is a more robust measure of dispersion. In this guide, we explore the concept of MAD and learn, step by step, how to calculate it using Python.

Understanding Median Absolute Deviation

MAD is a measure of statistical dispersion, representing the median of the absolute deviations from the median of a dataset. In simpler terms, it measures how spread out the values in a dataset are from the median. The formula for MAD is:

MAD = median(|Xi - median(X)|)

Where:

  • Xi represents each value in the dataset
  • median(X) is the median of the dataset
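
To make the formula concrete, here is a small worked sketch using only Python's standard library. The five-value dataset is made up for illustration (it is not the sample data used later in this article), and it includes one extreme value to show how MAD and the standard deviation react differently.

```python
from statistics import median, pstdev

# Hypothetical data: four small values and one extreme outlier
values = [1, 2, 3, 4, 100]

# Step 1: median of the data
center = median(values)                          # 3

# Step 2: absolute deviation of each value from the median
deviations = [abs(x - center) for x in values]   # [2, 1, 0, 1, 97]

# Step 3: the median of those deviations is the MAD
mad = median(deviations)                         # 1

# Population standard deviation, for comparison
std = pstdev(values)

print(f"median={center}, MAD={mad}, std={std:.2f}")
```

The outlier 100 leaves the MAD at 1, while the standard deviation is pulled far above the spread of the other four values.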

Data Preparation

To compute MAD, you need a dataset. You can use real-world data or create synthetic data. In this example, we build a small synthetic dataset as a pandas DataFrame:

import pandas as pd

# Create a DataFrame with sample data
data = {'Values': [3, 4, 5, 5, 2, 3, 4.5, 5.2, 7, 2.8, 4.9]}
df = pd.DataFrame(data)

Implementing MAD Calculation

Now, let’s create a function to calculate MAD using the formula mentioned.

import numpy as np

def calculate_mad(data):
    """
    Calculate the Median Absolute Deviation (MAD)
    
    :param data: list of values
    :return: MAD
    """
    # Calculate the median of the data
    median = np.median(data)
    
    # Calculate the absolute deviations from the median
    absolute_deviations = [np.abs(x - median) for x in data]
    
    # Calculate MAD
    mad = np.median(absolute_deviations)
    return mad

Now we can apply the function to the sample data:

values = df['Values'].tolist()

mad = calculate_mad(values)
print(f'MAD: {mad}')
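
The same calculation can be written without the Python-level loop. The following is a minimal sketch, assuming NumPy is installed; the name mad_vectorized is illustrative, and the sample values are copied from the Data Preparation section.

```python
import numpy as np

def mad_vectorized(data):
    """Median absolute deviation using NumPy array operations."""
    arr = np.asarray(data, dtype=float)
    # Subtracting the scalar median broadcasts across the whole array
    return np.median(np.abs(arr - np.median(arr)))

sample = [3, 4, 5, 5, 2, 3, 4.5, 5.2, 7, 2.8, 4.9]
result = mad_vectorized(sample)
print(f"MAD (vectorized): {result:.1f}")
```

For this sample the median is 4.5 and the MAD is about 0.7. The unformatted value can show a small floating-point error, which is why the print statement rounds it.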

Leveraging Scikit-learn

Scikit-learn doesn’t have a built-in function for MAD, and its robust_scale function does not use MAD either: it centers the data on the median and scales by the interquartile range (IQR). We can still use its median centering to get MAD by switching off the IQR scaling and taking the median of the absolute centered values.

from sklearn.preprocessing import robust_scale

# Note: with_scaling=False skips the IQR division, so robust_scale only subtracts the median
centered = robust_scale(values, with_centering=True, with_scaling=False)
mad = np.median(np.abs(centered))
print(f'MAD (using scikit-learn): {mad}')

Using Pandas

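Pandas has no built-in method for the median absolute deviation. Older versions offered Series.mad(), but it computed the mean absolute deviation (around the mean), and it was deprecated in pandas 1.5 and removed in 2.0. For data stored in a DataFrame, the median-based MAD can be computed by chaining the median() and abs() methods:

values_col = df['Values']
mad = (values_col - values_col.median()).abs().median()
print(f'MAD (using pandas): {mad}')

Note: This gives the same result as the calculate_mad function we created earlier.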
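
When a DataFrame has several numeric columns, a median-based MAD can be computed for all of them at once. This is a sketch, assuming pandas is installed; the column names and numbers below are invented for illustration.

```python
import pandas as pd

# Hypothetical two-column frame; column 'b' contains an outlier
frame = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 12.0, 11.0, 13.0, 90.0],
})

# Subtract each column's median, take absolute values, then take the median again
mad_per_column = (frame - frame.median()).abs().median()

# Mean absolute deviation (around the mean), for contrast
mean_abs_dev = (frame - frame.mean()).abs().mean()

print(mad_per_column)
print(mean_abs_dev)
```

Both columns have a MAD of 1.0, even though column b contains the value 90. The mean absolute deviation of b is far larger, because the mean is pulled toward the outlier.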

Conclusion

In this guide, we covered the concept of Median Absolute Deviation (MAD), its value as a robust measure of dispersion, and several ways to calculate it in Python: a custom NumPy function, scikit-learn’s preprocessing utilities, and pandas. MAD is especially useful for datasets with outliers, which can distort other dispersion measures such as the standard deviation.