How to Calculate Sample & Population Variance in Python

Introduction

Variance is a statistical measure of how spread out a set of data points is. More precisely, it is the average of the squared distances between each value and the mean, so a larger variance means the values lie farther from the mean and from one another. Variance is usually denoted σ² (for population variance) or s² (for sample variance).

There are two types of variance: population variance and sample variance. Population variance is the variance of an entire population, computed from every member. Sample variance is computed from a subset of the population and is normally used as an estimate of the population variance when the full population cannot be observed.

In this article, we will learn how to calculate both sample and population variance in Python. We will use plain Python with built-in functions, the statistics module from the standard library, and the third-party libraries numpy and pandas. We will also discuss Bessel’s correction and why it matters when calculating sample variance.

Calculating Variance

Before diving into Python, let’s quickly discuss how variance is calculated.

The formula for population variance is:

σ² = Σ ( xi - μ )² / N

And for sample variance, it’s:

s² = Σ ( xi - x̄ )² / (n - 1)

Where:

  • xi represents each value from the dataset,
  • μ is the population mean,
  • x̄ is the sample mean,
  • N is the size of the population,
  • n is the size of the sample.

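The key difference between the two formulas is the denominator. For population variance, we divide by the size of the population (N), whereas for sample variance, we divide by the size of the sample minus one (n - 1). This adjustment is known as Bessel’s correction. It is needed because the deviations are measured from the sample mean x̄ rather than the true mean μ, which makes them slightly too small on average; dividing by n would therefore systematically underestimate the population variance, and dividing by n - 1 removes that bias.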
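The effect of Bessel’s correction can be checked exactly on a small case. The sketch below is an illustration under stated assumptions, not a proof: it treats a hypothetical five-value population as fully known, enumerates every ordered sample of size 2 drawn with replacement, and averages the two candidate estimators over all of those samples. Fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product
from statistics import mean, pvariance, variance

# Hypothetical tiny population, stored as Fractions so the arithmetic is exact.
population = [Fraction(x) for x in (2, 4, 6, 8, 10)]
true_variance = pvariance(population)

# Every ordered sample of size 2, drawn with replacement (25 samples in total).
samples = list(product(population, repeat=2))

# Average each estimator over all possible samples.
# pvariance divides by n; variance divides by n - 1.
avg_divide_by_n = mean(pvariance(s) for s in samples)
avg_divide_by_n_minus_1 = mean(variance(s) for s in samples)

print("True population variance:", true_variance)
print("Average estimate dividing by n:", avg_divide_by_n)
print("Average estimate dividing by n - 1:", avg_divide_by_n_minus_1)
```

For this population the n - 1 estimator averages to exactly the population variance (8), while the n estimator averages to only half of it (4), which is the (n - 1) / n shrinkage expected for samples of size 2.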

Calculating Variance in Python

Python, along with its libraries, provides several ways to calculate variance.

Using Built-in Python Functions

We can compute variance using plain Python code and built-in functions. Below is an example of how to do this.

# Population variance
def population_variance(data):
    # Number of observations
    n = len(data)
    # Mean of the data
    mean = sum(data) / n
    # Square deviations
    deviations = [(x - mean) ** 2 for x in data]
    # Variance
    variance = sum(deviations) / n
    return variance

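# Sample variance
def sample_variance(data):
    # Number of observations
    n = len(data)
    # Sample variance is undefined for fewer than two observations
    if n < 2:
        raise ValueError("sample variance requires at least two data points")
    # Mean of the data
    mean = sum(data) / n
    # Square deviations
    deviations = [(x - mean) ** 2 for x in data]
    # Variance (Bessel's correction: divide by n - 1)
    variance = sum(deviations) / (n - 1)
    return variance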

data = [2, 4, 6, 8, 10]

print("Population Variance:", population_variance(data))
print("Sample Variance:", sample_variance(data))
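To see where the two results come from, the intermediate quantities in the formulas can be printed for the same five values. This is only a step-by-step restatement of the arithmetic; the variable names are chosen here for illustration.

```python
data = [2, 4, 6, 8, 10]

n = len(data)
mean = sum(data) / n                                   # 6.0
squared_deviations = [(x - mean) ** 2 for x in data]   # [16.0, 4.0, 0.0, 4.0, 16.0]
total = sum(squared_deviations)                        # 40.0

print("Mean:", mean)
print("Squared deviations:", squared_deviations)
print("Sum of squared deviations:", total)
print("Population variance:", total / n)        # 40.0 / 5 = 8.0
print("Sample variance:", total / (n - 1))      # 40.0 / 4 = 10.0
```

The only difference between the last two lines is the denominator, which is why the sample variance (10.0) is larger than the population variance (8.0).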

Using the statistics Library

Python’s statistics library, introduced in Python 3.4, provides functions to calculate mathematical statistics of numeric data. The functions pvariance and variance can be used to calculate population variance and sample variance, respectively.

import statistics as stats

data = [2, 4, 6, 8, 10]

print("Population Variance:", stats.pvariance(data))
print("Sample Variance:", stats.variance(data))
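Two details of these functions are worth knowing. Both accept an optional, already computed mean (mu for pvariance, xbar for variance), which avoids recomputing it, and variance raises StatisticsError when given fewer than two values. The following is a small sketch of both behaviours; the single-value list is a made-up input used only to trigger the error.

```python
import statistics as stats

data = [2, 4, 6, 8, 10]

# Reuse a mean that has already been computed.
mean = stats.mean(data)
pop_var = stats.pvariance(data, mu=mean)
sample_var = stats.variance(data, xbar=mean)

print("Population Variance:", pop_var)
print("Sample Variance:", sample_var)

# Sample variance is undefined for a single observation,
# so the library raises an error instead of dividing by zero.
error_message = None
try:
    stats.variance([5])
except stats.StatisticsError as exc:
    error_message = str(exc)
    print("Error:", error_message)
```

Note that the value passed as mu or xbar is trusted as given, so passing something other than the actual mean of the data yields a wrong result.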

Using numpy

numpy is a fundamental package for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with arrays. The var function can be used to compute variance. By default, this function calculates the population variance. To calculate sample variance, we need to set the ddof (Delta Degrees of Freedom) parameter to 1.

import numpy as np

data = np.array([2, 4, 6, 8, 10])

print("Population Variance:", np.var(data))
print("Sample Variance:", np.var(data, ddof=1))

Using pandas

pandas is a data manipulation library for Python. It provides data structures and functions for working with structured data. The var method of a pandas Series computes variance. Note that, unlike numpy's var, this method calculates the sample variance by default (ddof=1). To compute the population variance, set ddof to 0.

import pandas as pd

data = pd.Series([2, 4, 6, 8, 10])

print("Population Variance:", data.var(ddof=0))
print("Sample Variance:", data.var())
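The same method exists on a DataFrame, where it returns one variance per column, and by default it skips missing values. The sketch below uses a hypothetical two-column frame with one NaN to contrast this with numpy, where np.var propagates the NaN and np.nanvar ignores it.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; column "b" has one missing value.
df = pd.DataFrame({
    "a": [2, 4, 6, 8, 10],
    "b": [1.0, 3.0, np.nan, 7.0, 9.0],
})

# Sample variance per column; NaN entries are skipped by default.
col_var = df.var()

# numpy on the same column: np.var returns nan, np.nanvar skips the NaN.
b_values = df["b"].to_numpy()
b_numpy = np.var(b_values, ddof=1)
b_nanvar = np.nanvar(b_values, ddof=1)

print("pandas, per column:")
print(col_var)
print("np.var on column b:", b_numpy)
print("np.nanvar on column b:", b_nanvar)
```

When results from pandas and numpy disagree, the two usual causes are the different ddof defaults and this difference in how missing values are handled.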

Conclusion

In this article, we have learned how to calculate both sample and population variance in Python using plain Python code, the statistics module, numpy, and pandas. The main point to remember is that the defaults differ: statistics offers separate pvariance and variance functions, numpy's var returns the population variance unless ddof=1 is set, and pandas' var returns the sample variance unless ddof=0 is set. Understanding variance is important for data analysis and machine learning tasks, as it describes the dispersion of the data.
