
Introduction
Variance is a statistical concept that quantifies the amount of dispersion in a set of data points. In other words, it measures how far the values in a dataset lie from their mean, computed as the average of the squared deviations from that mean. Variance is usually denoted by σ² (for population variance) or s² (for sample variance).
There are two types of variance: population variance and sample variance. Population variance refers to the variance within an entire population. Sample variance, on the other hand, refers to the variance within a sample of the population.
In this article, we will learn how to calculate both sample and population variance in Python. We will use Python’s built-in functions, as well as the powerful libraries numpy, pandas, and statistics, for our computations. We will also discuss the concept of Bessel’s correction and its importance in the calculation of sample variance.
Calculating Variance
Before diving into Python, let’s quickly discuss how variance is calculated.
The formula for population variance is:
σ² = Σ ( xi - μ )² / N
And for sample variance, it’s:
s² = Σ ( xi - x̄ )² / (n - 1)
Where:
- xi represents each value from the dataset,
- μ is the population mean,
- x̄ is the sample mean,
- N is the size of the population,
- n is the size of the sample.
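The key difference between the two formulas is the denominator. For population variance, we divide by the size of the population (N), whereas for sample variance, we divide by the size of the sample minus one (n - 1). This adjustment is known as Bessel’s correction. Because the deviations are measured from the sample mean rather than from the unknown population mean, their squared sum tends to be too small; dividing by n - 1 instead of n compensates for that bias when estimating the population variance.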
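The effect of Bessel’s correction can be checked with a small simulation. The following is a minimal sketch rather than a proof: it assumes a standard normal population (true variance 1), a sample size of 5, and 20,000 repeated samples, and the helper name average_variance_estimates is hypothetical.

```python
import random

def average_variance_estimates(sample_size=5, trials=20_000, seed=0):
    """Average the divide-by-n and divide-by-(n - 1) estimates over many samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_biased = 0.0
    total_unbiased = 0.0
    for _ in range(trials):
        # Assumption: the population is standard normal, so its variance is 1
        sample = [rng.gauss(0, 1) for _ in range(sample_size)]
        mean = sum(sample) / sample_size
        squared_deviations = sum((x - mean) ** 2 for x in sample)
        total_biased += squared_deviations / sample_size
        total_unbiased += squared_deviations / (sample_size - 1)
    return total_biased / trials, total_unbiased / trials

biased, unbiased = average_variance_estimates()
print("Average when dividing by n:    ", biased)
print("Average when dividing by n - 1:", unbiased)
```

Under these assumptions, the first average should settle near (n - 1) / n = 0.8 and the second near the true variance of 1.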
Calculating Variance in Python
Python, along with its libraries, provides several ways to calculate variance.
Using Built-in Python Functions
We can compute variance using plain Python code and built-in functions. Below is an example of how to do this.
# Population variance
def population_variance(data):
    # Number of observations
    n = len(data)
    # Mean of the data
    mean = sum(data) / n
    # Squared deviations from the mean
    deviations = [(x - mean) ** 2 for x in data]
    # Variance: divide by n
    variance = sum(deviations) / n
    return variance

# Sample variance
def sample_variance(data):
    # Number of observations
    n = len(data)
    # Mean of the data
    mean = sum(data) / n
    # Squared deviations from the mean
    deviations = [(x - mean) ** 2 for x in data]
    # Variance: divide by n - 1 (Bessel's correction)
    variance = sum(deviations) / (n - 1)
    return variance

data = [2, 4, 6, 8, 10]
print("Population Variance:", population_variance(data))
print("Sample Variance:", sample_variance(data))
Using the statistics Library
Python’s statistics library, introduced in Python 3.4 as part of the standard library, provides functions to calculate mathematical statistics of numeric data. The functions pvariance and variance can be used to calculate population variance and sample variance, respectively.
import statistics as stats
data = [2, 4, 6, 8, 10]
print("Population Variance:", stats.pvariance(data))
print("Sample Variance:", stats.variance(data))
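Two details of the statistics module are worth keeping in mind: variance needs at least two data points and raises StatisticsError otherwise, while pvariance accepts a single value, and both functions also work with Fraction data. The sketch below illustrates this; the helper name safe_sample_variance and the input values are made up for the example.

```python
import statistics as stats
from fractions import Fraction

def safe_sample_variance(data):
    """Return the sample variance, or None when there are fewer than two values."""
    try:
        return stats.variance(data)
    except stats.StatisticsError:
        # variance() requires at least two data points
        return None

print("Five values:", safe_sample_variance([2, 4, 6, 8, 10]))
print("One value:", safe_sample_variance([5]))
# A single value is allowed when it is treated as the whole population
print("Population variance of one value:", stats.pvariance([5]))

# Fraction inputs are handled with exact rational arithmetic
exact = stats.variance([Fraction(1, 2), Fraction(3, 2), Fraction(5, 2)])
print("Exact sample variance:", exact)
```

Returning None for too-short input is just one possible choice; letting the exception propagate may be more appropriate in other code.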
Using numpy
numpy is a fundamental package for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with arrays. The var function can be used to compute variance. By default, this function calculates the population variance. To calculate sample variance, we need to set the ddof (Delta Degrees of Freedom) parameter to 1.
import numpy as np
data = np.array([2, 4, 6, 8, 10])
print("Population Variance:", np.var(data))
print("Sample Variance:", np.var(data, ddof=1))
Using pandas
pandas is a powerful data manipulation library in Python. It provides data structures and functions needed to manipulate structured data. The var function of a pandas Series computes variance. Note that this function calculates the sample variance by default. To compute the population variance, set ddof to 0.
import pandas as pd
data = pd.Series([2, 4, 6, 8, 10])
print("Population Variance:", data.var(ddof=0))
print("Sample Variance:", data.var())
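In both numpy and pandas, the ddof parameter expresses the same idea: the sum of squared deviations is divided by n - ddof. The plain-Python sketch below mirrors that rule for a flat list. The function name variance_with_ddof is hypothetical, and the code is only an illustration of the divisor, not how either library is implemented internally.

```python
def variance_with_ddof(data, ddof=0):
    """Variance of a flat sequence, dividing by len(data) - ddof."""
    n = len(data)
    if n - ddof <= 0:
        # Design choice of this sketch: fail loudly when the divisor is not positive
        raise ValueError("need more than ddof data points")
    mean = sum(data) / n
    squared_deviations = sum((x - mean) ** 2 for x in data)
    return squared_deviations / (n - ddof)

data = [2, 4, 6, 8, 10]
print("ddof=0 (population):", variance_with_ddof(data, ddof=0))
print("ddof=1 (sample):    ", variance_with_ddof(data, ddof=1))
```

Seen this way, the two libraries differ only in their default: numpy starts from ddof=0 and pandas from ddof=1, so passing ddof explicitly avoids surprises when switching between them.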
Conclusion
In this article, we have learned how to calculate both sample and population variance in Python using built-in functions as well as the numpy, pandas, and statistics libraries. Understanding variance is crucial for data analysis and machine learning tasks, as it provides insights into the data’s dispersion. The different libraries in Python provide flexible and efficient ways to calculate variance, making Python a powerful tool for statistical computing.