In this post you will learn how to get distinct values of a column in PySpark.
distinct() method –
The distinct() method in PySpark lets you find the unique values in a dataframe. It returns a new dataframe with duplicate rows removed, so you can apply it to a single column, to several columns, or to every column at once.
Let’s read a dataset to illustrate it. We will work with a clothing store sales file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format('csv').option('header', 'true').load('../data/clothing_store_sales.csv')
df.show(5)
Get Distinct or Unique Values from a single column in PySpark –
Let’s say you want to know which Method of Payment values appear in this dataframe. Select that column and call distinct() on the result.
df.select("Method of Payment").distinct().show()
Get Distinct or Unique Values from multiple columns in PySpark –
To find the unique combinations of values across multiple columns, first select those columns with the select function, then call the distinct method.
Let’s say we want to find the unique combinations of the Method of Payment and Gender columns.
df.select("Method of Payment", "Gender").distinct().show()
Get Distinct or Unique Values from all the columns in PySpark –
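You can also get the unique rows across all the columns in PySpark by calling distinct() on the dataframe directly, without a select. A row is dropped only when every one of its columns matches another row.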