Count Number of Rows in a Column or DataFrame in PySpark


To count the number of rows in a DataFrame, or the number of non-null values in a column, PySpark provides the DataFrame count() method and the count() function from pyspark.sql.functions.

Read a Dataset –

Let’s read a dataset to illustrate this. We will use the clothing store sales data.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# read the CSV with a header row, letting Spark infer the column types
df = spark.read.format('csv') \
    .options(header='true', inferSchema='true') \
    .load('../data/clothing_store_sales.csv')
df.show(5)

Count the number of Rows in a DataFrame in PySpark –

To count the number of rows in a DataFrame, call its count() method. It is an action that returns the total row count as an integer.

df.count()
#output
100

Count the number of Non-Null Values in a Column –

Let’s read a dataset that contains some null values in its columns. We will use the fruit prices data.

df_new = spark.read.format('csv') \
    .options(header='true', inferSchema='true') \
    .load('../data/fruit_prices.csv')
df_new.show(5)

Let’s count the number of non-null values in the apple column.

from pyspark.sql.functions import count
df_new.select(count('apple')).show()
#output
+------------+
|count(apple)|
+------------+
|           7|
+------------+

Now, count the number of non-null values in the Orange column.

df_new.select(count('Orange')).show()
#output
+-------------+
|count(Orange)|
+-------------+
|            8|
+-------------+

