Variance is a statistical measure of how spread out a set of data points is. More precisely, it is the average of the squared deviations from the mean, so it grows as the values lie farther from the mean and from one another. Population variance is usually denoted by σ² and sample variance by s².
There are two types of variance: population variance and sample variance. Population variance refers to the variance within an entire population. Sample variance, on the other hand, refers to the variance within a sample of the population.
In this article, we will learn how to calculate both sample and population variance in Python. We will use plain Python with built-in functions, as well as the statistics, numpy, and pandas libraries, for our computations. We will also discuss Bessel’s correction and why it matters when calculating sample variance.
Before diving into Python, let’s quickly discuss how variance is calculated.
The formula for population variance is:
σ² = Σ (xᵢ - μ)² / N
And for sample variance, it’s:
s² = Σ (xᵢ - x̄)² / (n - 1)
where:
- xᵢ represents each value from the dataset,
- μ is the population mean,
- x̄ is the sample mean,
- N is the size of the population,
- n is the size of the sample.
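The key difference between the two formulas is the denominator. For population variance, we divide by the size of the population (N), whereas for sample variance, we divide by the size of the sample minus one (n - 1). This adjustment is known as Bessel’s correction. Deviations in a sample are measured from the sample mean x̄, which is computed from the same data and therefore sits closer to the sample values than the true mean μ does. Dividing by n would therefore underestimate the population variance on average; dividing by n - 1 removes that bias.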
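The effect of Bessel’s correction can be checked with a small simulation. The sketch below is only an illustration under assumed settings (a normal population with variance 4, samples of size 5, 20,000 trials, a fixed seed); none of these numbers come from the article. In theory, dividing by n should average out near 4 × (n - 1) / n = 3.2, while dividing by n - 1 should average out near 4.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

TRUE_VARIANCE = 4.0  # random.gauss(0, 2) has variance 2 ** 2 = 4
SAMPLE_SIZE = 5      # assumed sample size, chosen small to make the bias visible
TRIALS = 20_000      # assumed number of repeated samples

biased_total = 0.0
unbiased_total = 0.0
for _ in range(TRIALS):
    sample = [random.gauss(0, 2) for _ in range(SAMPLE_SIZE)]
    mean = sum(sample) / SAMPLE_SIZE
    squared_deviations = sum((x - mean) ** 2 for x in sample)
    biased_total += squared_deviations / SAMPLE_SIZE          # divide by n
    unbiased_total += squared_deviations / (SAMPLE_SIZE - 1)  # divide by n - 1

avg_biased = biased_total / TRIALS
avg_unbiased = unbiased_total / TRIALS
print("True variance:", TRUE_VARIANCE)
print("Average estimate, dividing by n:", avg_biased)
print("Average estimate, dividing by n - 1:", avg_unbiased)
```

The exact figures depend on the seed, but the estimate that divides by n should land noticeably below the true variance.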
Calculating Variance in Python
Python, along with its libraries, provides several ways to calculate variance.
Using Built-in Python Functions
We can compute variance using plain Python code and built-in functions. Below is an example of how to do this.
    # Population variance
    def population_variance(data):
        # Number of observations
        n = len(data)
        # Mean of the data
        mean = sum(data) / n
        # Square deviations
        deviations = [(x - mean) ** 2 for x in data]
        # Variance
        variance = sum(deviations) / n
        return variance

    # Sample variance
    def sample_variance(data):
        # Number of observations
        n = len(data)
        # Mean of the data
        mean = sum(data) / n
        # Square deviations
        deviations = [(x - mean) ** 2 for x in data]
        # Variance
        variance = sum(deviations) / (n - 1)
        return variance

    data = [2, 4, 6, 8, 10]
    print("Population Variance:", population_variance(data))
    print("Sample Variance:", sample_variance(data))
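The two functions above differ only in the denominator, so one possible refinement is to fold them into a single helper. The sketch below is not part of the original code: `variance_with_ddof` is a hypothetical name, and its `ddof` argument simply mirrors the parameter of the same name that numpy and pandas use. It also raises a clear error when there are too few data points, rather than failing with a ZeroDivisionError.

```python
def variance_with_ddof(data, ddof=0):
    """Variance of `data`, dividing by len(data) - ddof.

    ddof=0 gives the population variance, ddof=1 the sample variance.
    (Hypothetical helper, written for illustration only.)
    """
    values = list(data)  # accept any iterable, including generators
    n = len(values)
    if n - ddof <= 0:
        raise ValueError(f"need more than {ddof} data points, got {n}")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - ddof)


data = [2, 4, 6, 8, 10]
print("Population Variance:", variance_with_ddof(data))      # 8.0
print("Sample Variance:", variance_with_ddof(data, ddof=1))  # 10.0
```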
Using the statistics Library
The statistics module, part of the standard library since Python 3.4, provides functions for calculating mathematical statistics of numeric data. Its pvariance and variance functions calculate population variance and sample variance, respectively.
    import statistics as stats

    data = [2, 4, 6, 8, 10]
    print("Population Variance:", stats.pvariance(data))
    print("Sample Variance:", stats.variance(data))
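Beyond the basic calls, a few other behaviors of the statistics module can be useful. The sketch below shows three of them: passing an already computed mean through the optional `mu` and `xbar` arguments, getting an exact result from Fraction inputs, and the StatisticsError raised when a sample variance is requested for fewer than two values. The Fraction values are arbitrary illustration data.

```python
import statistics
from fractions import Fraction

data = [2, 4, 6, 8, 10]

# Reuse a mean that has already been computed instead of recomputing it.
mean = statistics.mean(data)
print("Population Variance:", statistics.pvariance(data, mu=mean))
print("Sample Variance:", statistics.variance(data, xbar=mean))

# Fraction inputs are kept exact rather than converted to float.
exact = statistics.variance([Fraction(1, 6), Fraction(1, 2), Fraction(5, 3)])
print("Exact sample variance:", exact)  # 67/108

# A sample variance needs at least two data points.
try:
    statistics.variance([42])
except statistics.StatisticsError as err:
    print("Error:", err)
```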
Using numpy
numpy is a fundamental package for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with arrays. Its var function computes variance. By default, this function calculates the population variance. To calculate the sample variance, set the ddof (Delta Degrees of Freedom) parameter to 1, which makes the divisor n - 1 instead of n.
    import numpy as np

    data = np.array([2, 4, 6, 8, 10])
    print("Population Variance:", np.var(data))
    print("Sample Variance:", np.var(data, ddof=1))
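For multidimensional arrays, np.var also accepts an `axis` argument, which can be combined with `ddof`. The sketch below uses a small made-up table in which rows are treated as observations and columns as variables; that layout is an assumption for the example, not something numpy requires.

```python
import numpy as np

# Made-up data: 3 observations (rows) of 2 variables (columns).
table = np.array([[1.0, 10.0],
                  [3.0, 20.0],
                  [5.0, 60.0]])

# Sample variance of each column (axis=0 collapses the rows).
column_sample_var = np.var(table, axis=0, ddof=1)
# Population variance of each row (axis=1 collapses the columns).
row_population_var = np.var(table, axis=1)
# With no axis given, all six values are treated as one flat dataset.
overall_population_var = np.var(table)

print("Per-column sample variance:", column_sample_var)    # [  4. 700.]
print("Per-row population variance:", row_population_var)  # [ 20.25  72.25 756.25]
print("Overall population variance:", overall_population_var)
```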
Using pandas
pandas is a powerful data manipulation library in Python. It provides the data structures and functions needed to work with structured data. The var method of a pandas Series computes variance. Note that, unlike numpy, it calculates the sample variance by default. To compute the population variance, set ddof to 0.
    import pandas as pd

    data = pd.Series([2, 4, 6, 8, 10])
    print("Population Variance:", data.var(ddof=0))
    print("Sample Variance:", data.var())
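Because numpy and pandas use different defaults, the same data can yield two different numbers if `ddof` is left implicit. The sketch below puts the two defaults side by side and also shows var applied to a DataFrame, where it returns one variance per column. The column names `a` and `b` are made up for the example.

```python
import numpy as np
import pandas as pd

values = [2, 4, 6, 8, 10]
series = pd.Series(values)

numpy_default = np.var(np.array(values))  # ddof=0: population variance
pandas_default = series.var()             # ddof=1: sample variance
print("numpy default:", numpy_default)    # 8.0
print("pandas default:", pandas_default)  # 10.0

# Passing ddof explicitly makes the two libraries agree.
print("pandas with ddof=0:", series.var(ddof=0))

# On a DataFrame, var() returns a Series with one entry per column.
df = pd.DataFrame({"a": values, "b": [1, 1, 1, 1, 1]})
column_var = df.var()
print(column_var)
```

Stating `ddof` explicitly in every call is one way to avoid mixing the two conventions by accident.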
Conclusion
In this article, we have learned how to calculate both sample and population variance in Python using plain Python as well as the statistics, numpy, and pandas libraries. Understanding variance is important for data analysis and machine learning tasks, because it describes how dispersed the data is. When switching between libraries, keep their defaults in mind: numpy’s var computes the population variance (ddof=0) by default, while pandas’ var computes the sample variance (ddof=1).