
Introduction
The standard deviation is a measure of the amount of variation or dispersion in a set of values. A low standard deviation indicates that the values tend to be close to the mean (average) of the set, while a high standard deviation indicates that the values are spread out over a broader range.
In statistics, two types of standard deviation are commonly used: the population standard deviation and the sample standard deviation. The population standard deviation is used when data for the entire population is available, and the sample standard deviation is used when only a sample of the population is available.
This article explains how to calculate the standard deviation in Python, first with built-in functions and then with the statistics, numpy, and pandas libraries.
Standard Deviation Formula
The formula for calculating the population standard deviation is:
σ = sqrt[ Σ ( xi - μ )² / N ]
And for the sample standard deviation:
s = sqrt[ Σ ( xi - x̄ )² / (n - 1) ]
Where:
- xi represents each value in the dataset,
- μ is the population mean,
- x̄ is the sample mean,
- N is the size of the population,
- n is the size of the sample,
- Σ denotes summation over all values in the dataset.
The square root is used to bring the units of variance, which are squared, back to the original units of measurement.
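As a worked check of both formulas, the following sketch applies them step by step to a small dataset of five values. The variable names are illustrative only, and the intermediate values are shown in the comments.

```python
import math

data = [4, 2, 5, 8, 6]
n = len(data)

# Mean: (4 + 2 + 5 + 8 + 6) / 5 = 5.0
mean = sum(data) / n

# Squared deviations from the mean: 1, 9, 0, 9, 1 (sum = 20)
squared_deviations = [(x - mean) ** 2 for x in data]
sum_of_squares = sum(squared_deviations)

# The population formula divides by N; the sample formula divides by n - 1
population_sd = math.sqrt(sum_of_squares / n)        # sqrt(20 / 5) = 2.0
sample_sd = math.sqrt(sum_of_squares / (n - 1))      # sqrt(20 / 4), about 2.236

print(population_sd, sample_sd)
```

The sample value is larger because dividing by n - 1 instead of n increases the result; the gap shrinks as the dataset grows.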
Calculating Standard Deviation in Python
Using Built-in Python Functions
Standard deviation can be calculated using pure Python by following the standard deviation formula:
import math
# Sample data
data = [4, 2, 5, 8, 6]
# Calculate mean
mean = sum(data) / len(data)
# Calculate population variance (average of squared differences from the mean)
variance = sum((xi - mean) ** 2 for xi in data) / len(data)
# Calculate population standard deviation (square root of variance)
std_dev = math.sqrt(variance)
print("Population Standard Deviation:", std_dev)
This method works, but it is verbose. Note that dividing by len(data) gives the population standard deviation; for the sample standard deviation, divide by len(data) - 1 instead.
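One way to cover both the population and the sample case in pure Python is a small helper function. The sketch below is illustrative: the name std_dev and its sample flag are assumptions made for this example, not part of any library.

```python
import math

def std_dev(values, sample=False):
    """Return the population (default) or sample standard deviation."""
    n = len(values)
    # The sample formula divides by n - 1, so it needs at least two values
    minimum = 2 if sample else 1
    if n < minimum:
        raise ValueError(f"need at least {minimum} value(s), got {n}")
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    divisor = n - 1 if sample else n
    return math.sqrt(sum_of_squares / divisor)

data = [4, 2, 5, 8, 6]
print("Population Standard Deviation:", std_dev(data))
print("Sample Standard Deviation:", std_dev(data, sample=True))
```

The explicit length check avoids a division by zero when the sample formula is applied to a single value.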
Using the statistics Library
Python’s statistics library, which was introduced in Python 3.4, provides functions to calculate mathematical statistics of numeric data. It offers the pstdev function to calculate the population standard deviation, and the stdev function to calculate the sample standard deviation.
import statistics as stats
# Sample data
data = [4, 2, 5, 8, 6]
print("Population Standard Deviation:", stats.pstdev(data))
print("Sample Standard Deviation:", stats.stdev(data))
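Two details of the statistics library are worth checking in practice. The sketch below shows that each standard deviation function is the square root of the matching variance function (pvariance and variance, from the same module), and that stdev raises StatisticsError when given fewer than two values. The raised flag is only there for illustration.

```python
import math
import statistics as stats

data = [4, 2, 5, 8, 6]

# Each standard deviation is the square root of the matching variance
population_matches = math.isclose(stats.pstdev(data), math.sqrt(stats.pvariance(data)))
sample_matches = math.isclose(stats.stdev(data), math.sqrt(stats.variance(data)))
print(population_matches, sample_matches)

# stdev divides by n - 1, so a single value is not enough
raised = False
try:
    stats.stdev([4])
except stats.StatisticsError as exc:
    raised = True
    print("Cannot compute sample standard deviation:", exc)
```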
Using numpy
numpy is a powerful library in Python for mathematical and scientific computing. It provides the std function to calculate the standard deviation. By default, std calculates the population standard deviation. For the sample standard deviation, we need to set the ddof (Delta Degrees of Freedom) parameter to 1.
import numpy as np
# Sample data
data = np.array([4, 2, 5, 8, 6])
print("Population Standard Deviation:", np.std(data))
print("Sample Standard Deviation:", np.std(data, ddof=1))
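To make the effect of ddof concrete, the sketch below recomputes both results by hand, using N - ddof as the divisor, and compares them with np.std. It also shows the axis parameter on a small 2-D array; the table values are made up for illustration.

```python
import numpy as np

data = np.array([4, 2, 5, 8, 6])

# ddof changes the divisor from N to N - ddof
sum_of_squares = np.sum((data - data.mean()) ** 2)
manual_population = np.sqrt(sum_of_squares / (data.size - 0))
manual_sample = np.sqrt(sum_of_squares / (data.size - 1))

print(np.isclose(np.std(data), manual_population))
print(np.isclose(np.std(data, ddof=1), manual_sample))

# For 2-D arrays, axis selects the direction of the calculation
table = np.array([[4, 2, 5],
                  [8, 6, 7]])
print(np.std(table, axis=0))  # one value per column
print(np.std(table, axis=1))  # one value per row
```

Without an axis argument, np.std flattens the array and returns a single value for all elements.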
Using pandas
pandas is a data manipulation and analysis library in Python. It provides data structures and functions needed to manipulate structured data. The std function of a pandas Series or DataFrame computes the standard deviation. By default, this function computes the sample standard deviation. To compute the population standard deviation, we need to set ddof to 0.
import pandas as pd
# Sample data
data = pd.Series([4, 2, 5, 8, 6])
print("Population Standard Deviation:", data.std(ddof=0))
print("Sample Standard Deviation:", data.std())
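The same method works on a DataFrame, where it returns one value per column. The sketch below uses a hypothetical two-column table with one missing entry; by default pandas skips missing values (NaN) when computing the result.

```python
import pandas as pd

# Hypothetical measurements; None becomes a missing value (NaN)
df = pd.DataFrame({
    "a": [4, 2, 5, 8, 6],
    "b": [1.0, 3.0, None, 5.0, 7.0],
})

# DataFrame.std returns one sample standard deviation per column
sample_sd = df.std()
print(sample_sd)

# Pass ddof=0 for the population standard deviation
population_sd = df.std(ddof=0)
print(population_sd)
```

Column "b" is computed from its four present values only, so its divisor differs from that of column "a".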
Conclusion
In this tutorial, we calculated the standard deviation in Python using pure Python, the statistics library, numpy, and pandas. The standard deviation is a key statistical measure that shows the amount of variation in a dataset, and knowing how to calculate it is a core skill for anyone working in data analysis or statistics. The main point to remember is that the defaults differ between libraries: numpy's std computes the population standard deviation, while pandas' std computes the sample standard deviation, so set ddof explicitly whenever the distinction matters.