describe() method – Compute Summary Statistics in PySpark


To compute the summary statistics of a column in PySpark, we can use the describe() method. This method takes numeric columns and calculates the count, mean, standard deviation, minimum, and maximum for each of them.

Read a Dataset –

Let’s read a dataset to illustrate it. We will use the clothing store sales data.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.format('csv') \
    .options(header='true', inferSchema='true') \
    .load('../data/clothing_store_sales.csv')
df.show(5)

Compute Summary Statistics with describe() method –

To calculate the summary statistics, we can call the describe() method on the dataframe like this –

df.describe().show()

Let’s only select the numeric columns and then compute the summary statistics.

# select only numeric columns
df_numeric = df.select('Customer','Items','Net Sales','Age')
df_numeric.describe().show()

Related Posts –

  1. Count Number of Rows in a Column or DataFrame in PySpark
  2. How to Compute the Mean of a Column in PySpark?
  3. How to Compute Standard Deviation in PySpark?
  4. Compute Minimum and Maximum value of a Column in PySpark
