
In PySpark, the stddev_pop function computes the population standard deviation of a column, and the stddev_samp function computes the sample standard deviation.
Read a Dataset –
Let’s read a dataset to illustrate both functions. We will use the clothing store sales data.
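from pyspark.sql import SparkSession
# Create a Spark session, or reuse one that is already running
spark = SparkSession.builder.getOrCreate()
# Read the CSV file, treating the first row as a header and inferring column types
df = spark.read.format('csv') \
    .options(header='true', inferSchema='true') \
    .load('../data/clothing_store_sales.csv')
# Display the first 5 rows
df.show(5)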

Population Standard Deviation –
Let’s use the stddev_pop function to compute the population standard deviation of the Age column.
from pyspark.sql.functions import stddev_pop
df.select(stddev_pop('Age')).show()

Sample Standard Deviation –
To compute the sample standard deviation, we will use the stddev_samp function.
from pyspark.sql.functions import stddev_samp
df.select(stddev_samp('Age')).show()
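
The two results differ only in the divisor: the population standard deviation divides the sum of squared deviations by n, while the sample standard deviation divides it by n - 1. The following is a minimal sketch in plain Python (no Spark needed) that computes both by hand and cross-checks them against the standard library's statistics module. The ages list is made up for illustration and is not taken from the clothing store data.

```python
import math
import statistics

# Hypothetical ages, made up for illustration (not from the clothing store data)
ages = [22, 25, 31, 38, 44]

n = len(ages)
mean = sum(ages) / n
# Sum of squared deviations from the mean
squared_deviations = sum((x - mean) ** 2 for x in ages)

# Population standard deviation: divide by n
population_std = math.sqrt(squared_deviations / n)

# Sample standard deviation: divide by n - 1
sample_std = math.sqrt(squared_deviations / (n - 1))

# Cross-check the hand-computed values against the standard library
assert math.isclose(population_std, statistics.pstdev(ages))
assert math.isclose(sample_std, statistics.stdev(ages))

print("population:", population_std)
print("sample:", sample_std)
```

Because n - 1 is smaller than n, the sample value is always slightly larger than the population value for the same data, and the gap shrinks as the number of rows grows.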
